Extract Tables from PDF via Python

Extract tables from PDF documents. Use Aspose.PDF for Python via .NET to process PDF files programmatically.

How to Extract Tables from a PDF Document Using the Python via .NET Library

To extract tables, we'll use Aspose.PDF for Python via .NET, a feature-rich and easy-to-use document manipulation API for Python. The library is distributed as the aspose-pdf package, so no NuGet step is involved; install it with pip using the following command.

Command Line (pip)

pip install aspose-pdf
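
To confirm that the install succeeded before running any PDF code, a minimal sketch like the following can help. It uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of Aspose.PDF.

```python
from importlib import metadata


def installed_version(package_name="aspose-pdf"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment
        return None


# Prints a version string when aspose-pdf is installed, otherwise None
print(installed_version())
```

If this prints None, the package was most likely installed into a different Python environment than the one running the script.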

Extract Tables from PDF via Python

You need Aspose.PDF for Python via .NET to try the code in your environment.

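Load the PDF with an instance of Document.
For each page, create a TableAbsorber object and call visit with the page to find its tables.
Iterate over the tables in the absorber's table_list.
Iterate over the rows (row_list) and cells (cell_list) of each table.
Read the text fragments of each cell and join their segments to get the text.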

Extract Tables from PDF - Python

    import aspose.pdf as ap

    # Folder that contains the input PDF (placeholder; adjust for your environment)
    DIR_INPUT_TABLE = "./"
    input_file = DIR_INPUT_TABLE + "Table_input.pdf"

    # Load source PDF document
    pdf_document = ap.Document(input_file)
    for page in pdf_document.pages:
        # Find all tables on the current page
        absorber = ap.text.TableAbsorber()
        absorber.visit(page)
        for table in absorber.table_list:
            for row in table.row_list:
                for cell in row.cell_list:
                    # A cell may hold several text fragments; print each one
                    for fragment in cell.text_fragments:
                        # Join the fragment's segments into a single string
                        txt = "".join(seg.text for seg in fragment.segments)
                        print(txt)
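
Printing each fragment loses the row and column structure of the table. As a rough sketch, the same traversal can collect the text into rows and serialize them as CSV. The helpers table_to_rows and rows_to_csv are hypothetical names, not part of Aspose.PDF; they rely only on the attributes used in the example above (row_list, cell_list, text_fragments, segments, text) and assume that the fragments of one cell can be concatenated without a separator.

```python
import csv
import io


def table_to_rows(table):
    """Return the table's cell text as a list of rows (lists of strings)."""
    rows = []
    for row in table.row_list:
        cells = []
        for cell in row.cell_list:
            # Assumption: concatenate every segment of every fragment in the cell
            cells.append("".join(
                seg.text
                for fragment in cell.text_fragments
                for seg in fragment.segments
            ))
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows of strings to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Inside the loop over absorber.table_list, calling rows_to_csv(table_to_rows(table)) would then give one CSV string per table, which can be written to a file or passed on for further processing.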