Programmatically Extract SVG from Websites

Extracting images from HTML is a common task in web scraping and content analysis. Aspose.HTML for Java simplifies it by providing tools to navigate HTML documents and gather information from them. Let’s explore how to extract external SVG images, that is, SVG files referenced through the src attribute of <img> elements, from a website.


Extract SVGs from HTML Using Java

With the Aspose.HTML library for Java, you can quickly build your own application using a robust set of tools for parsing and extracting data from HTML documents. The example below shows how to extract all external SVGs from an HTML document with just a few lines of Java code.


Java code to extract SVG from website

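// Standard-library imports used below (Aspose.HTML imports are omitted here)
import java.util.ArrayList;
import java.util.HashSet;
import java.util.stream.Collectors;

// Open a document you want to download external SVGs from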
final HTMLDocument document = new HTMLDocument("https://products.aspose.com/html/net/");

// Collect all <img> elements
HTMLCollection images = document.getElementsByTagName("img");

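// Create a distinct collection of image URLs taken from the src attributes
java.util.Set<String> urls = new HashSet<>();
for (Element element : images) {
    String src = element.getAttribute("src");
    // Skip <img> elements that have no src attribute
    if (src != null) {
        urls.add(src);
    }
}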

// Filter out non-SVG images
java.util.List<String> svgUrls = new ArrayList<>();
for (String url : urls) {
    if (url.endsWith(".svg")) {
        svgUrls.add(url);
    }
}

// Create absolute SVG image URLs
java.util.List<Url> absUrls = svgUrls.stream()
    .map(src -> new Url(src, document.getBaseURI()))
    .collect(Collectors.toList());

for (Url url : absUrls) {
    // Create a downloading request
    final RequestMessage request = new RequestMessage(url);

    // Download SVG image
    final ResponseMessage response = document.getContext().getNetwork().send(request);

    // Check whether response is successful
    if (response.isSuccess()) {
        String[] split = url.getPathname().split("/");
        String path = split[split.length - 1];

        // Save file to a local file system
        FileHelper.writeAllBytes(path, response.getContent().readAsByteArray());
    }
}
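
The example above names each saved file after the last segment of the URL path, so two SVGs called logo.svg in different folders would overwrite each other. The following is a minimal sketch, using only JDK classes, of one way to derive non-colliding names. The class LocalFileNames, the method uniqueName, the fallback name image.svg, and the example.com URLs are all hypothetical and are not part of Aspose.HTML.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of Aspose.HTML): hands out local file names
// based on the last URL path segment and never returns the same name twice.
class LocalFileNames {
    private final Set<String> used = new HashSet<>();

    public String uniqueName(URI url) {
        String path = url.getPath() == null ? "" : url.getPath();
        // Last path segment, e.g. "/a/logo.svg" gives "logo.svg"
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            name = "image.svg"; // assumed fallback when the path has no file name
        }
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        String candidate = name;
        int counter = 1;
        // Append -1, -2, ... until the name has not been used before
        while (!used.add(candidate)) {
            candidate = stem + "-" + counter++ + ext;
        }
        return candidate;
    }

    public static void main(String[] args) {
        LocalFileNames names = new LocalFileNames();
        System.out.println(names.uniqueName(URI.create("https://example.com/a/logo.svg")));
        System.out.println(names.uniqueName(URI.create("https://example.com/b/logo.svg")));
    }
}
```

In the download loop, the name returned by such a helper could be used in place of the last path segment when saving the file.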



Steps to Extract SVGs from HTML

  1. Create an instance of the HTMLDocument class, passing its constructor the URL of the website from which you want to extract external SVG images (the example passes the URL as a string).
  2. Use the getElementsByTagName("img") method to collect all <img> elements.
  3. Extract the src attribute from each image element using the getAttribute("src") method and create a distinct collection of relative image URLs.
  4. Filter only SVG image URLs by checking if each URL ends with .svg, and add those to a new list.
  5. Create absolute image URLs using the Url class and the getBaseURI() method of the HTMLDocument class.
  6. For each absolute URL, create a request using the RequestMessage(url) constructor. Send each request and check the response for success.
  7. If the response was successful, use the FileHelper.writeAllBytes() method to save the SVG content to the local file system.
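
Steps 3 to 5 can also be tried out without the library. The following is a minimal sketch that uses only JDK classes; the class SvgUrlFilter, the method toAbsoluteSvgUrls, and the example.com URLs are hypothetical and are not part of Aspose.HTML. Unlike the example above, it resolves each URL first and then checks the URL path, ignoring case, so that a source such as icons/a.svg?v=2 is still treated as SVG. That is an assumption about what you may want, not a description of the library's behavior.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Aspose.HTML) illustrating steps 3 to 5.
class SvgUrlFilter {

    public static List<URI> toAbsoluteSvgUrls(List<String> srcValues, String baseUri) {
        URI base = URI.create(baseUri);

        // Step 3: distinct, non-empty src values, in their original order
        Set<String> distinct = new LinkedHashSet<>();
        for (String src : srcValues) {
            if (src != null && !src.trim().isEmpty()) {
                distinct.add(src.trim());
            }
        }

        List<URI> result = new ArrayList<>();
        for (String src : distinct) {
            URI absolute;
            try {
                // Step 5: resolve relative references against the base URI
                absolute = base.resolve(src);
            } catch (IllegalArgumentException e) {
                continue; // skip src values that are not valid URI references
            }
            // Step 4: keep only URLs whose path ends with ".svg"
            String path = absolute.getPath();
            if (path != null && path.toLowerCase(Locale.ROOT).endsWith(".svg")) {
                result.add(absolute);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> srcs = Arrays.asList("/img/logo.svg", "icons/a.svg?v=2", "photo.png");
        for (URI uri : toAbsoluteSvgUrls(srcs, "https://example.com/html/net/")) {
            System.out.println(uri);
        }
    }
}
```

Checking the path rather than the whole string is a design choice: it tolerates query strings and fragments, at the cost of still missing SVGs that are served from URLs without a .svg extension.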

With Aspose.HTML for Java, you can create a tool that parses a web page, identifies SVG image sources, and downloads the SVGs. This is useful for anyone who needs to collect SVGs for analysis, archiving, or content creation without doing it manually. To learn more about how to programmatically extract both inline and external SVGs from a website using Java, refer to the documentation article Extract SVG From Website in Java.

Note: It is important to respect copyright laws and obtain the proper permissions or licenses before using saved images for commercial purposes. We do not support the extraction and use of other people’s files for commercial purposes without their consent.




Get Started with Aspose.HTML for Java Library

Aspose.HTML for Java is an advanced web scraping and HTML parsing library. It lets you create and edit documents, navigate through their nodes, extract data, and convert HTML, XHTML, and MHTML files to PDF, images, and other formats. It also handles CSS, HTML Canvas, SVG, XPath, and JavaScript out of the box to extend manipulation tasks. It’s a standalone API and does not require any software installation.
You can download its latest version directly from the Aspose Maven Repository and install it in your Maven-based project by adding the following configuration to pom.xml.


Repository

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://repository.aspose.com/repo/</url>
</repository>

Dependency

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-html</artifactId>
    <version>version of aspose-html API</version>
    <classifier>jdk17</classifier>
</dependency>

Other Supported Features

Use the Aspose.HTML for Java library to parse and manipulate HTML-based documents. Clear, safe and simple!