How to Extract Images from a Web Page
Extracting images from HTML is useful for applications such as web scraping and content analysis. Aspose.HTML for Python via .NET is a library that simplifies this task by giving developers tools to navigate HTML documents and gather information from them. It suits anyone who needs to collect images for analysis, archiving, or content creation without manual work. Let’s explore how to download images from web pages.
Extract Images Using Python
With Aspose.HTML for Python via .NET, you can build your own application on top of an API for parsing HTML documents and extracting information from them. If you want to add HTML parsing to your product or extract data from HTML programmatically, see the code example below.
Python code to download images from a web page
import os
import aspose.html as ah
import aspose.html.net as ahnet

# Prepare output directory
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Open HTML document from URL
with ah.HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-color/") as doc:
    # Collect all <img> elements
    images = doc.get_elements_by_tag_name("img")

    # Get distinct relative image URLs
    urls = set(img.get_attribute("src") for img in images)

    # Create absolute image URLs
    abs_urls = [ah.Url(url, doc.base_uri) for url in urls]

    for url in abs_urls:
        # Create a network request
        request = ahnet.RequestMessage(url.href)

        # Send request
        response = doc.context.network.send(request)

        # Check if successful
        if response.is_success:
            # Extract file name
            file_name = os.path.basename(url.pathname)

            # Save image locally
            with open(os.path.join(output_dir, file_name), "wb") as f:
                f.write(response.content.read_as_byte_array())
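The listing passes every src value straight to the Url constructor. Pages often contain <img> elements with no src, an empty one, or an inline data: URI, none of which can be requested over the network. The following is a minimal sketch of a pre-filter that uses only the standard library. The name fetchable_sources is hypothetical and not part of Aspose.HTML, and the sketch assumes that a missing attribute arrives as None or an empty string.

```python
def fetchable_sources(sources):
    """Return the distinct src values that look downloadable, in sorted order."""
    cleaned = set()
    for src in sources:
        # Assumption: a missing attribute arrives as None or an empty string
        if not src:
            continue
        src = src.strip()
        # Inline data: URIs already carry the image bytes; there is nothing to request
        if not src or src.lower().startswith("data:"):
            continue
        cleaned.add(src)
    return sorted(cleaned)


if __name__ == "__main__":
    sample = ["/img/a.png", None, "", "data:image/png;base64,AAAA", "/img/a.png", " logo.svg "]
    print(fetchable_sources(sample))
```

In the listing, the filtered result could stand in for the urls set before the absolute URLs are built.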
Steps to Extract Images from a Web Page
- Open the target HTML document, a web page, using the HTMLDocument class. This document is the source from which images will be extracted.
- Call the get_elements_by_tag_name("img") method of the HTMLDocument object to collect all <img> elements within the HTML document.
- Extract unique image URLs by iterating over the collection of <img> elements and accessing each element’s src attribute using the get_attribute("src") method. Store these URLs in a set to ensure there are no duplicates.
- Create absolute image URLs by passing each relative or incomplete URL along with the document’s base_uri to the Url constructor. This ensures each URL is complete and valid for network access.
- For each absolute image URL, create a RequestMessage object to represent the HTTP request needed to fetch the image data.
- Use the doc.context.network.send(request) method to send the request and receive a response. Check if the response is successful by evaluating the is_success property.
- Pass the pathname of the absolute image URL to os.path.basename() to extract the file name, then save the image content to the output directory by writing the binary data from the response to a file.
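To make the URL-resolution and file-naming steps concrete, here is a small sketch that performs the equivalent operations with Python's standard urllib.parse module. It is not how the Url class works internally. The helper names resolve_image_url and file_name_from_url, the example.com addresses, and the "image" fallback name are all illustrative assumptions.

```python
import posixpath
from urllib.parse import urljoin, urlparse, unquote


def resolve_image_url(base_uri, src):
    """Turn a relative or incomplete src into an absolute URL."""
    return urljoin(base_uri, src)


def file_name_from_url(url, fallback="image"):
    """Take the last path segment of a URL, ignoring query string and fragment."""
    path = urlparse(url).path
    name = posixpath.basename(unquote(path))
    # A URL that ends in "/" has no last segment, so use the fallback name
    return name or fallback


if __name__ == "__main__":
    base = "https://example.com/docs/page/"  # hypothetical page address
    for src in ["images/a.png", "/static/b.jpg?v=2", "//cdn.example.com/c.svg", "../d%20e.gif"]:
        absolute = resolve_image_url(base, src)
        print(absolute, "->", file_name_from_url(absolute))
```

Taking the file name from the path component alone keeps query strings such as ?v=2 out of the saved name. Two different URLs can still end in the same file name, so a real application may want to detect collisions before writing.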
To learn more about how to programmatically extract various types of images from a website using Python, refer to the documentation article Extracting Images from a Website in Python.
Note: Always respect copyright laws. Make sure you have the appropriate rights, permissions, or licenses before using the extracted images for commercial purposes. We do not endorse or support the unauthorized use of copyrighted content.
Get Started with Python API
If you want to parse, manipulate, and manage HTML documents, install our flexible, high-speed Aspose.HTML for Python via .NET API. pip is the easiest way to download and install Aspose.HTML for Python via .NET. To do this, run the following command:
pip install aspose-html-net
For more details about Python library installation and system requirements, please refer to Aspose.HTML Documentation.
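After installation, one quick way to confirm that the package is visible to the current interpreter is to look up its import name. The sketch below uses only the standard library. The helper name is_installed is hypothetical, and aspose.html is the import name used in the code example on this page.

```python
import importlib.util


def is_installed(module_name):
    """Report whether a module can be located (parent packages are imported to check)."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised for dotted names when a parent package is missing or fails to import
        return False


if __name__ == "__main__":
    # "aspose.html" is the import name used in the code example on this page
    print("aspose.html available:", is_installed("aspose.html"))
```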
Other Supported Features
Use the Aspose.HTML for Python via .NET library to parse and manipulate HTML-based documents. Clear, safe and simple!