How to Extract Tables from a Web Page

Extracting HTML tables from web pages is a common task in web scraping, data analysis, and content processing. Using Aspose.HTML for Python via .NET, developers can automate finding, downloading, and saving <table> elements from a web page. This approach suits anyone who needs to work with structured data from articles, reports, or other web pages.


Extract Tables Using Python

The following Python code demonstrates how to download an HTML document from a website, find all the table elements in it, and export each table to its own self-contained HTML file for later use:


Python code to download tables from a web page

import os
import aspose.html as ah

# Define output directory
output_dir = "output/"
os.makedirs(output_dir, exist_ok=True)

# Open an HTML document from which you want to extract tables
with ah.HTMLDocument("https://docs.aspose.com/html/net/edit-html-document/") as doc:
    # Get all <table> elements
    tables = doc.get_elements_by_tag_name("table")

    if tables.length > 0:
        for i, table in enumerate(tables):
            # Construct output file path
            file_name = f"table{i}.htm"
            file_path = os.path.join(output_dir, file_name)

            # Create a new HTML document from the table's outer HTML
            new_doc = ah.HTMLDocument(table.outer_html, file_path)

            # Save the new document
            new_doc.save(file_path)
    else:
        # Handle case where no tables are found
        print("No tables found in the document.")


Steps to Extract Tables from a Web Page

  1. Use the HTMLDocument(url) constructor to open the HTML document from the specified URL. This document is the source from which <table> elements will be extracted.
  2. Call the get_elements_by_tag_name("table") method to collect all <table> elements from the HTML document.
  3. Check whether any tables were found. If so, loop over the table elements and, for each one:
    • Build a unique file name from the table's index.
    • Create a new HTMLDocument, passing the table element's outer_html property and the output path.
    • Save the new HTML document, which contains only that table, using the save() method.
  4. If no <table> elements are found, print a message stating that the document contains no tables.
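
To see the same flow without the library, the steps above can be sketched with only the Python standard library. This is an illustrative stand-in and not the Aspose.HTML API: the names TableCollector, extract_tables, wrap_as_document, and save_tables are hypothetical, the input is an in-memory string rather than a downloaded page, comments inside tables are dropped, and a nested table stays inside its outermost parent.

```python
import os
import tempfile
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects the outer HTML of every top-level <table> element."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished tables, as outer-HTML strings
        self._parts = []   # pieces of the table currently being read
        self._depth = 0    # greater than 0 while inside a <table>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(html):
    """Returns the outer HTML of each top-level table in the given markup."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables


def wrap_as_document(table_html):
    """Wraps one table in a minimal self-contained HTML page."""
    return f"<!DOCTYPE html>\n<html><body>\n{table_html}\n</body></html>\n"


def save_tables(tables, output_dir):
    """Writes each table to table<i>.htm in output_dir and returns the paths."""
    paths = []
    for i, table_html in enumerate(tables):
        path = os.path.join(output_dir, f"table{i}.htm")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(wrap_as_document(table_html))
        paths.append(path)
    return paths


sample = (
    "<html><body><p>Intro</p>"
    '<table id="a"><tr><td>1</td><td>2</td></tr></table>'
    "<p>Between</p>"
    "<table><tr><td>x &amp; y</td></tr></table>"
    "</body></html>"
)

tables = extract_tables(sample)
if tables:
    # A temporary directory stands in for the "output/" folder used earlier
    with tempfile.TemporaryDirectory() as output_dir:
        for path in save_tables(tables, output_dir):
            print("Saved", os.path.basename(path))
else:
    print("No tables found in the document.")
```

The sketch mirrors the four steps: collect the tables, check that any were found, write one file per table, and report when none exist. Real pages with malformed markup are better handled by a full HTML engine such as the one the library provides.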

To learn more about how to programmatically extract different types of data from the web or any HTML documents using Python, refer to the Data Extraction in Python chapter of the documentation. This chapter offers practical guidance on how to automatically inspect, capture, and extract valuable data from HTML using the Aspose.HTML for Python via .NET API. It covers essential topics such as navigating HTML documents with CSS selectors and XPath, as well as downloading and saving remote resources like images, SVG graphics, and other files.
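
As a rough illustration of what path-based navigation over a table can yield, the sketch below reads cell text into Python lists using the standard xml.etree.ElementTree module and its limited XPath support. It does not use Aspose.HTML, it assumes the table markup is well-formed XML (which real pages often are not), and table_to_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def table_to_rows(table_markup):
    """Returns the table's cell text as a list of rows (lists of strings)."""
    table = ET.fromstring(table_markup)
    rows = []
    # ".//tr" selects every <tr> descendant of the root <table> element
    for tr in table.findall(".//tr"):
        # Keep header and data cells in document order
        cells = [cell for cell in tr if cell.tag in ("th", "td")]
        rows.append(["".join(cell.itertext()).strip() for cell in cells])
    return rows


markup = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apples</td><td>3</td></tr>"
    "<tr><td><b>Pears</b></td><td>5</td></tr>"
    "</table>"
)

for row in table_to_rows(markup):
    print(row)
```

Using itertext() gathers text from nested inline elements such as <b>, so formatting inside a cell does not hide its value.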



Get Started with Python API

If you want to parse, manipulate, and manage HTML documents, install the Aspose.HTML for Python via .NET API. The easiest way to download and install it is with pip. To do this, run the following command:

pip install aspose-html-net

For more details about installing the Python library and its system requirements, refer to the Aspose.HTML documentation.

Other Supported Features

Use the Aspose.HTML for Python via .NET library to parse and manipulate HTML-based documents. Clear, safe and simple!