Extract PDF Content via Python

Parse text and images from PDF documents. Use Aspose.PDF for Python via .NET to parse PDF files programmatically.

Most popular parsing actions in Python

How to parse PDF with the Aspose.PDF for Python via .NET Library

Do you need to extract data from a PDF? Programmatic processing of PDF documents is an essential part of modern digital workflows. With a Python library such as Aspose.PDF, developers can extract text or images from PDF files. The library is a stand-alone solution that doesn’t rely on other software and is ready for commercial use. It covers the common extraction tasks listed here:

  • Extract PDF data: text, images, forms, fields, etc.
  • Extract text from PDF
  • Extract images from PDF
  • Extract fonts from PDF
  • Extract data from forms
  • Extract text from stamps
  • Extract data from tables

To extract data from a PDF file, we’ll use the Aspose.PDF for Python via .NET API, a feature-rich and easy-to-use document manipulation API for Python. The library is distributed as the aspose-pdf package on PyPI and is installed with pip, using the following command.

Command Line

pip install aspose-pdf
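
To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch that uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of Aspose.PDF.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it is not part of Aspose.PDF.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("aspose-pdf")
    if found is None:
        print("aspose-pdf is not installed; run: pip install aspose-pdf")
    else:
        print("aspose-pdf version:", found)
```

The check reads the package metadata without importing the library itself, so it works the same way whether or not the package is present.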

Parse PDF via Python


To try the code in your environment, you need Aspose.PDF for Python via .NET installed.

  1. Load the PDF with an instance of Document.
  2. Create a TextAbsorber object to extract text.
  3. Accept the absorber for all the pages.
  4. Get the extracted text.
  5. Open an output file and write the extracted text to it.

Extract Text from PDF - Python

This sample code shows how to extract text from a PDF document

    import aspose.pdf as ap

    # Folder holding the input PDF (assumed here to be the current directory)
    data_dir = "./"

    # Open document
    document = ap.Document(data_dir + "ExtractTextAll.pdf")

    # Create TextAbsorber object to extract text
    text_absorber = ap.text.TextAbsorber()
    # Accept the absorber for all the pages
    document.pages.accept(text_absorber)
    # Get the extracted text
    extracted_text = text_absorber.text
    # Open the output file and write the extracted text to it;
    # the with block closes the file automatically
    with open(data_dir + "extracted-text.txt", "w", encoding="utf-8") as tw:
        tw.write(extracted_text)
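
The sample above gathers the text of every page into a single string. When the text is needed page by page, one possible variation is sketched below. It assumes that document.pages can be iterated and that each page accepts a TextAbsorber, as in the library's page-level extraction examples. The helper names extract_text_by_page and write_pages, and the "--- Page N ---" header format, are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import List


def extract_text_by_page(pdf_path: str) -> List[str]:
    """Return one string of extracted text per page (hypothetical helper)."""
    # Imported inside the function so the module loads even where
    # aspose-pdf is not installed.
    import aspose.pdf as ap

    document = ap.Document(pdf_path)
    page_texts = []
    for page in document.pages:
        # A fresh absorber per page keeps each page's text separate.
        absorber = ap.text.TextAbsorber()
        page.accept(absorber)
        page_texts.append(absorber.text)
    return page_texts


def write_pages(page_texts: List[str], out_path: str) -> None:
    """Write the page texts to one file, each under a 'Page N' header."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"--- Page {number} ---")
        lines.append(text)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

With the library installed, a call such as write_pages(extract_text_by_page("ExtractTextAll.pdf"), "extracted-pages.txt") would produce one text file with a header before each page's content.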

About Aspose.PDF for Python via .NET API

Aspose.PDF for Python via .NET API supports most established PDF standards and PDF specifications. It allows developers to insert tables, graphs, images, hyperlinks, custom fonts, and more into PDF documents. It can also compress PDF documents, and it provides security features for creating protected PDF documents. Some of the key features of Aspose.PDF for Python via .NET API include:

  • Ability to read & export PDF in multiple image formats including BMP, GIF, JPEG & PNG.
  • Set basic information (e.g. author, creator) of the PDF document.
  • Conversion features: convert PDF to Word, Excel, and PowerPoint; convert PDF to image formats; convert PDF to HTML and vice versa; convert PDF to EPUB, Text, XPS, etc.

You can find more information about the Aspose.PDF for Python via .NET API in our documentation.