How to Extract Images from HTML

Extracting images from HTML is a common task in applications such as web scraping and content analysis. Aspose.HTML for .NET simplifies this task by giving developers a set of tools for navigating HTML documents and gathering information from them. This article shows how to extract images from an HTML document.

First, make sure Aspose.HTML for .NET is installed in your project. Open the NuGet Package Manager, search for Aspose.HTML, and install it. Alternatively, run the following command in the Package Manager Console:


Install Aspose.HTML for .NET

Install-Package Aspose.HTML



Extract Images from HTML using C#

The Aspose.HTML for .NET class library provides an API for analyzing HTML documents and collecting information from them, so you can build your own web scraper or content analyzer on top of it. If you want to extract data from HTML programmatically, see the code example below. It downloads all images referenced in an HTML document with a few lines of C# code:


C# code to extract images from HTML

using Aspose.Html;
using Aspose.Html.Net;
using System.Linq;
using System.IO;
...

    // Prepare a path to a source HTML file
    string documentPath = Path.Combine(DataDir, "images-from-html.html");

    // Create an instance of an HTML document
    using (var document = new HTMLDocument(documentPath))
    {
        // Collect all <img> elements
        var images = document.GetElementsByTagName("img");

        // Create a distinct collection of relative image URLs, skipping <img> elements without a src attribute
        var urls = images.Select(element => element.GetAttribute("src")).Where(src => !string.IsNullOrEmpty(src)).Distinct();

        // Create absolute image URLs
        var absUrls = urls.Select(src => new Url(src, document.BaseURI));

        // Counter that gives each embedded (data: URL) image a unique file name
        var dataImageIndex = 0;

        foreach (var url in absUrls)
        {
            // Create an image request message
            using var request = new RequestMessage(url);

            // Download image
            using var response = document.Context.Network.Send(request);

            // Skip the image if the request was not successful
            if (!response.IsSuccess)
            {
                continue;
            }

            // By default, use the last segment of the URL path as the file name
            var imgName = url.Pathname.Split('/').Last();

            // Check whether the image is embedded in Base64 encoding (data: URL)
            if (url.Protocol == "data:" && response.Headers.ContentType.MediaType.Type == "image")
            {
                // Build a unique name and use the media subtype as the extension
                dataImageIndex++;
                imgName = $"img{dataImageIndex}." + response.Headers.ContentType.MediaType.SubType;
            }

            // Save image to a local file system
            File.WriteAllBytes(Path.Combine(OutputDir, imgName), response.Content.ReadAsByteArray());
        }
    }
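
The file-naming logic in the snippet can be tried in isolation. The sketch below is a simplified illustration that uses only the .NET base class library, not Aspose.HTML: the helper GetImageFileName and its index parameter are hypothetical names introduced here, and the media type is read from the data: URL string itself rather than from response headers.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of Aspose.HTML): choose a file name for an image URL.
static string GetImageFileName(string url, int index)
{
    const string dataPrefix = "data:";
    if (url.StartsWith(dataPrefix, StringComparison.OrdinalIgnoreCase))
    {
        // A data: URL looks like "data:image/png;base64,<payload>";
        // the media type sits between "data:" and the first ';' or ','.
        int end = url.IndexOfAny(new[] { ';', ',' });
        string mediaType = end > dataPrefix.Length
            ? url.Substring(dataPrefix.Length, end - dataPrefix.Length)
            : "";

        // Use the subtype (the part after '/') as the extension; fall back to "bin"
        string subType = mediaType.IndexOf('/') >= 0 ? mediaType.Split('/')[1] : "bin";
        return $"img{index}.{subType}";
    }

    // For ordinary URLs, take the last segment of the path (the query string is ignored)
    string path = new Uri(url).AbsolutePath;
    return path.Split('/').Last();
}

Console.WriteLine(GetImageFileName("https://example.com/assets/logo.png?v=2", 1)); // logo.png
Console.WriteLine(GetImageFileName("data:image/gif;base64,R0lGODlhAQABAAAAACw=", 2)); // img2.gif
```

Passing a different index for each embedded image keeps two Base64 images of the same type from being written to the same file.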



Steps to Extract Images from HTML

  1. Use the HTMLDocument() constructor to load the source HTML document.
  2. Use the GetElementsByTagName("img") method to collect all <img> elements. The method returns a collection of the HTML document’s <img> elements.
  3. Use the Select() method with GetAttribute("src") to read the src attribute of each <img> element, then the Distinct() method to create a distinct collection of relative image URLs.
  4. Create absolute image URLs using the Url class and the BaseURI property of the HTMLDocument class.
  5. For each absolute URL, create a request using the RequestMessage(url) constructor.
  6. Use the document’s Context.Network.Send(request) method to send the request, and check the IsSuccess property of the response to confirm that the image was retrieved.
  7. Take the file name from the last segment of the URL path. If the URL uses the data: protocol, the image is embedded in Base64 encoding and has no file name, so the name is built from the media subtype of the response instead.
  8. If the response was successful, use the File.WriteAllBytes() method to save the image to a local file.
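
Steps 3 and 4 (collecting distinct src values and turning them into absolute URLs) can be illustrated without the library. The following is a minimal sketch that swaps in System.Uri from the .NET base class library for the Aspose.HTML Url class; the helper ResolveImageUrls and the sample addresses are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: resolve distinct src values against the document's base URI.
static List<Uri> ResolveImageUrls(IEnumerable<string> sources, string baseUri)
{
    var baseAddress = new Uri(baseUri);
    return sources
        .Where(src => !string.IsNullOrWhiteSpace(src)) // skip missing or empty src values
        .Distinct()                                    // request each image only once
        .Select(src => new Uri(baseAddress, src))      // relative -> absolute
        .ToList();
}

var srcValues = new[]
{
    "images/cat.png",                    // relative to the document
    "/static/dog.jpg",                   // relative to the site root
    "images/cat.png",                    // duplicate, removed by Distinct()
    "https://cdn.example.org/bird.webp"  // already absolute, kept as-is
};

foreach (var imageUrl in ResolveImageUrls(srcValues, "https://example.com/gallery/index.html"))
{
    Console.WriteLine(imageUrl.AbsoluteUri);
}
// https://example.com/gallery/images/cat.png
// https://example.com/static/dog.jpg
// https://cdn.example.org/bird.webp
```

Removing duplicates before resolving the URLs means an image that appears several times in the page is downloaded only once.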

Aspose.HTML for .NET is an advanced HTML parsing library. It lets you create and edit documents, navigate through nodes, extract data, and merge and convert HTML, XHTML, MD, EPUB, and MHTML files to PDF, DOCX, images, and other popular formats. It also handles CSS, HTML Canvas, SVG, XPath, and JavaScript out of the box. For more details about C# library installation and system requirements, please refer to the Aspose.HTML Documentation.