How to Extract Images from HTML
Extracting images from HTML is a common task in applications such as web scraping and content analysis. Aspose.HTML for .NET simplifies this process with a set of tools for navigating HTML documents and gathering information from them. Let’s explore how to extract images from HTML documents.
First, make sure Aspose.HTML for .NET is installed in your project. Open the NuGet package manager, search for Aspose.HTML, and install it. Alternatively, run the following command in the Package Manager Console:
Install Aspose.HTML for .NET
Install-Package Aspose.HTML
Extract Images from HTML using C#
With the Aspose.HTML for .NET class library, you can build your own application on top of our API, which provides tools to analyze HTML documents and collect information from them. Whether you’re building a web scraper or a content analyzer, or just need to extract data from HTML programmatically, the code example below shows how to download all images from an HTML document with a few lines of C# code:
C# code to extract images from HTML
using Aspose.Html;
using Aspose.Html.Net;
using System.Linq;
using System.IO;
...
// Prepare a path to a source HTML file
string documentPath = Path.Combine(DataDir, "images-from-html.html");

// Create an instance of an HTML document
using (var document = new HTMLDocument(documentPath))
{
    // Collect all <img> elements
    var images = document.GetElementsByTagName("img");

    // Create a distinct collection of relative image URLs, skipping <img> elements without a src attribute
    var urls = images.Select(element => element.GetAttribute("src"))
                     .Where(src => !string.IsNullOrEmpty(src))
                     .Distinct();

    // Create absolute image URLs
    var absUrls = urls.Select(src => new Url(src, document.BaseURI));

    // Counter used to give each embedded (data URL) image its own file name
    int dataImageIndex = 0;

    foreach (var url in absUrls)
    {
        // Create an image request message
        using var request = new RequestMessage(url);

        // Download image
        using var response = document.Context.Network.Send(request);

        // By default, use the last segment of the URL path as the file name
        var imgName = url.Pathname.Split('/').Last();

        // Check whether the image is embedded as a data URL
        if (url.Protocol == "data:" && response.Headers.ContentType.MediaType.Type == "image")
        {
            // Use the media subtype as the file extension
            imgName = $"img{++dataImageIndex}." + response.Headers.ContentType.MediaType.SubType;
        }

        // Check whether a response is successful
        if (response.IsSuccess)
        {
            // Save image to a local file system
            File.WriteAllBytes(Path.Combine(OutputDir, imgName), response.Content.ReadAsByteArray());
        }
    }
}
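The file-naming step of the loop can also be isolated into a small helper that is easy to check without any network access. The following is only a sketch: ImageFileNames and GetFileName are hypothetical names that are not part of Aspose.HTML, the helper works on plain strings and System.Uri rather than the Aspose Url class, and it assumes the caller passes an absolute URL together with the media subtype taken from the response.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Aspose.HTML: picks a local file name for an image URL.
public static class ImageFileNames
{
    // url:          absolute image URL or data URL (assumed to be already resolved)
    // mediaSubType: subtype reported by the response, for example "png"
    // index:        counter supplied by the caller so that generated names stay unique
    public static string GetFileName(string url, string mediaSubType, int index)
    {
        // A data URL has no path, so build the name from the counter and the subtype
        if (url.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
        {
            return $"img{index}.{mediaSubType}";
        }

        // For an ordinary URL, take the last path segment (the query string is ignored)
        string name = Path.GetFileName(new Uri(url).AbsolutePath);

        // Fall back to a generated name when the path ends with '/'
        return string.IsNullOrEmpty(name) ? $"img{index}.{mediaSubType}" : name;
    }
}
```

In the loop, such a helper would be called once per URL with a counter that the caller increments, so that two embedded images never end up with the same file name.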
Steps to Extract Images from HTML
- Use the HTMLDocument() constructor to initialize an HTML document.
- Use the GetElementsByTagName("img") method to collect all <img> elements. The method returns a list of the HTML document’s <img> elements.
- Use the GetAttribute("src") method inside Select() to extract the src attribute of each <img> element, then call Distinct() to get a collection of relative image URLs without duplicates.
- Create absolute image URLs using the Url class and the BaseURI property of the HTMLDocument class.
- For each absolute URL, create a request using the RequestMessage(url) constructor.
- Use the document’s Context.Network.Send(request) method to send the request, then check that the response was successful.
- If the response was successful, use the File.WriteAllBytes() method to save each image to a local file.
- In the code snippet, we also check whether the image is embedded in the document as a data URL (typically Base64-encoded) by examining the protocol of the URL and, if so, set the image name and extension from the response’s media type.
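To illustrate the URL steps in isolation, the sketch below resolves a set of src values against a base address. It deliberately swaps in the standard System.Uri class for the Aspose Url class so that it runs without the library; ImageUrlResolver and Resolve are hypothetical names, and the behavior of the Aspose Url class may differ in details.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Aspose.HTML: turns src values into absolute URLs.
public static class ImageUrlResolver
{
    public static List<Uri> Resolve(string baseUri, IEnumerable<string> srcValues)
    {
        var root = new Uri(baseUri);

        return srcValues
            .Where(src => !string.IsNullOrWhiteSpace(src)) // skip missing or empty src values
            .Distinct()                                    // drop repeated src values
            .Select(src => new Uri(root, src))             // resolve against the base; absolute values pass through
            .ToList();
    }
}
```

For example, with the base address https://example.com/docs/page.html, the value img/a.png resolves to https://example.com/docs/img/a.png, while a value that is already absolute is returned unchanged.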
Aspose.HTML for .NET is an advanced HTML parsing library. You can use it to create and edit documents, navigate through nodes, extract data, and merge and convert HTML, XHTML, MD, EPUB, and MHTML files to PDF, DOCX, images, and other popular formats. It also handles CSS, HTML Canvas, SVG, XPath, and JavaScript out of the box. For more details about C# library installation and system requirements, please refer to the Aspose.HTML Documentation.
Other Supported C# Library Features
Use the Aspose.HTML for .NET library to parse and manipulate HTML-based documents. Clear, safe and simple!