How to Extract Table Data from HTML

HTML tables are widely used on the web to display information, and extracting their data is a common task in web scraping, data analysis, and automation. When building a parser, you often need to pull the data out of an HTML table and convert it into a structured format such as JSON, CSV, or Excel. Let’s explore how to extract table data from HTML.

First, make sure you have Aspose.HTML for .NET installed in your project. Installation is simple: open the NuGet package manager, search for Aspose.HTML, and install it. You may also use the following command from the Package Manager Console:


Install Aspose.HTML for .NET

Install-Package Aspose.HTML



Extract Table Data from HTML using C#

Aspose.HTML for .NET is a robust library that provides a powerful set of tools for parsing and gathering information from HTML documents. The following example shows how to find all <table> elements in an HTML document, extract the table data, and output it in JSON format. Let’s say that a table in HTML contains a list of tests, where each test has an ID, a name, a comment, and a hyperlink to the contents of the test. In the sample document, the ID is stored in the row’s id attribute, the hyperlink is in the first cell, and the comment is in the fourth cell. This is the information we want to extract from the table in the following example:


C# code to extract data from HTML table

using Aspose.Html;
using System;
using System.IO;
using System.Linq;
using System.Text.Json;
using System.Collections.Generic;
...

    // Open the document from which you want to extract table data
    // (DataDir is assumed to be defined elsewhere and to point to the folder with the source file)
    using (var document = new HTMLDocument(Path.Combine(DataDir, "chapter-9.htm")))
    {
        // Collect all <table> elements in the document
        var tables = document.GetElementsByTagName("table");

        if (tables.Any())
        {
            var result = new List<Dictionary<string, string>>();
            foreach (var table in tables)
            {
                // Get the table bodies that hold the data rows
                var tbodies = table.GetElementsByTagName("tbody");

                foreach (var tbody in tbodies)
                {
                    if (tbody.Children.Length > 1)
                    {
                        foreach (var row in tbody.Children)
                        {
                            // Only rows with an id attribute describe a test
                            if (row.HasAttribute("id"))
                            {
                                var data = new Dictionary<string, string>();

                                data["Id"] = row.GetAttribute("id");
                                if (row.Children.Length > 0)
                                {
                                    var td = row.Children[0];
                                    if (td.Children.Length > 0)
                                    {
                                        // The link may be wrapped in a <strong> element
                                        var element = td.Children[0].TagName == "STRONG"
                                            ? td.Children[0].Children[0]
                                            : td.Children[0];
                                        // Skip the link fields if the cell does not hold an <a> element
                                        if (element is HTMLAnchorElement anchor)
                                        {
                                            data["Href"] = anchor.Href;
                                            data["TestName"] = Path.GetFileNameWithoutExtension(anchor.Href);
                                        }
                                    }
                                }

                                // The comment is in the fourth cell; collapse runs of whitespace into single spaces
                                if (row.Children.Length > 3)
                                {
                                    data["TestComment"] = string.Join(" ",
                                        row.Children[3].TextContent
                                            .Split(new char[0], StringSplitOptions.RemoveEmptyEntries));
                                }
                                result.Add(data);
                            }
                        }
                    }
                }
            }

            // Serialize once, after all tables have been processed, so no row is printed twice
            var json = JsonSerializer.Serialize(result);
            Console.WriteLine(json);
        }
        else
        {
            // Handle the case where no tables are found
            Console.WriteLine("No tables found in the document.");
        }
    }
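
The TestComment value in the example is cleaned up by splitting the cell text on whitespace and joining the pieces with single spaces. The sketch below isolates that idiom using only the .NET standard library. The CellText class, the Normalize helper, and the sample string are hypothetical names and data introduced for illustration; they are not part of Aspose.HTML.

```csharp
using System;

public static class CellText
{
    // Hypothetical helper: collapse every run of whitespace into one space.
    public static string Normalize(string text)
    {
        // A null separator makes Split treat any whitespace character as a delimiter.
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        // Made-up cell text with a line break, indentation, and a trailing tab
        Console.WriteLine(Normalize("  Checks \n   table   layout\t")); // prints "Checks table layout"
    }
}
```

Because empty entries are removed during the split, leading and trailing whitespace disappears as well, so no separate Trim() call is needed.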



Steps to Extract Table Data from HTML

By following these steps, you can extract table data from HTML, such as hyperlinks and text content, for various purposes, including data analysis or reporting.

  1. Use the HTMLDocument() constructor to initialize an HTML document. Pass the path of the source HTML file as a parameter to the constructor.
  2. Use the GetElementsByTagName("table") method to collect all <table> elements. The method returns a collection of the HTML document’s <table> elements. Store this collection in the tables variable.
  3. Use the LINQ Any() method to check if there are any <table> elements in the HTML document. This ensures that there are tables to extract data from.
  4. Iterate through each table found in the document using a foreach loop:
    • Use the GetElementsByTagName("tbody") method to retrieve all <tbody> elements (table bodies).
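    • Iterate through each <tbody> element with another foreach loop and access its rows through the Children property.
    • Extract the relevant data from each row. In the example, only rows with an id attribute are processed: the id is read with GetAttribute("id"), the hyperlink is taken from the first cell, and the comment text is taken from the fourth cell.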
    • After extracting data from all rows, use the JsonSerializer.Serialize() method to serialize the list of dictionaries containing the extracted data to JSON format.
  5. Use the Console.WriteLine() method to output serialized JSON for display in the console.
  6. If the document does not contain tables, print a message to the console indicating that no tables were found.
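
The steps above end with JSON, but the same list of dictionaries can be written in another structured format. The following is a minimal sketch of a CSV export that uses only the .NET standard library. The CsvExport class, its Escape and ToCsv helpers, the column order, and the sample record are assumptions made for illustration and do not come from the Aspose.HTML API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvExport
{
    // Quote a field when it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Hypothetical helper: one header line, then one line per extracted record.
    public static string ToCsv(IEnumerable<Dictionary<string, string>> rows, string[] columns)
    {
        var lines = new List<string> { string.Join(",", columns.Select(Escape)) };
        foreach (var row in rows)
        {
            // A record may lack a key (for example, a row without a link); write an empty field then.
            lines.Add(string.Join(",", columns.Select(c =>
                Escape(row.TryGetValue(c, out var value) && value != null ? value : ""))));
        }
        return string.Join(Environment.NewLine, lines);
    }

    public static void Main()
    {
        // Made-up record in the shape produced by the extraction example
        var rows = new List<Dictionary<string, string>>
        {
            new Dictionary<string, string>
            {
                ["Id"] = "t1",
                ["Href"] = "tests/sample.htm",
                ["TestName"] = "sample",
                ["TestComment"] = "Checks borders, spacing"
            }
        };
        Console.WriteLine(ToCsv(rows, new[] { "Id", "TestName", "Href", "TestComment" }));
    }
}
```

Passing the column names explicitly keeps the column order stable even when some records are missing a key. Fields that contain commas, such as the sample comment, are wrapped in double quotes so they stay in one column.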

To learn more about the Aspose.HTML API, please visit our documentation guide. Aspose.HTML for .NET is an advanced HTML parsing library that allows you to create, edit, and convert HTML, XHTML, MD, EPUB, and MHTML files. The Data Extraction documentation section describes how to automatically inspect, collect, and extract data from web pages using Aspose.HTML for .NET. In the articles in this section, you’ll learn how to navigate an HTML document and perform detailed inspection of its elements, save a website or file from a URL, extract different types of images from websites, and more.



HTML Table Generator – Online App

Aspose.HTML offers the HTML Table Generator, an online application for creating tables with customizable options. It’s free and easy to use: just fill in the required options and get the result. The HTML Table Generator automatically creates the HTML table code, so you can get the table you need and put it online as quickly as possible.