Extract PDF using Java

How to extract text and images from a PDF using the Aspose.PDF for Java library


The most popular parser actions

Extract Text

Extract Images

Extract Fonts

How to parse PDF with Aspose.PDF for Java Library

Do you need to extract data from a PDF? Programmatic processing of PDF documents is an essential part of modern digital workflows. With a Java library such as Aspose.PDF, developers can extract text or images from PDF files. The library is a stand-alone solution that does not rely on other software and is ready for commercial use. It covers the extraction tasks Java developers most often need: text, images, fonts, form data, stamps, and tables.

Extract PDF data: text, images, forms, fields, etc.
Extract Text from PDF
Extract Images from PDF
Extract Fonts from PDF
Extract Data from the Form
Extract Text from Stamps
Extract Data from Table

To extract data from a PDF file, we'll use the Aspose.PDF for Java API, a feature-rich and easy-to-use PDF processing API for the Java platform. You can get its latest version from the Aspose Maven repository and add it to your Maven-based project by placing the following configuration in pom.xml.

Repository

<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>https://releases.aspose.com/java/repo/</url>
</repository>

Dependency

<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-pdf</artifactId>
    <version>version of aspose-pdf API</version>
</dependency>
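For orientation, here is a minimal sketch of where the two fragments above would sit in a complete pom.xml. The project coordinates (com.example, pdf-extract-demo, 1.0-SNAPSHOT) are hypothetical placeholders, and the dependency version is deliberately left as the placeholder from the snippet above; substitute a released aspose-pdf version.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>

    <!-- Hypothetical coordinates, for illustration only -->
    <groupId>com.example</groupId>
    <artifactId>pdf-extract-demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <!-- Repository fragments go inside <repositories> -->
    <repositories>
        <repository>
            <id>AsposeJavaAPI</id>
            <name>Aspose Java API</name>
            <url>https://releases.aspose.com/java/repo/</url>
        </repository>
    </repositories>

    <!-- Dependency fragments go inside <dependencies> -->
    <dependencies>
        <dependency>
            <groupId>com.aspose</groupId>
            <artifactId>aspose-pdf</artifactId>
            <!-- Placeholder: replace with a released aspose-pdf version -->
            <version>version of aspose-pdf API</version>
        </dependency>
    </dependencies>
</project>
```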

Parse PDF using Java

You need Aspose.PDF for Java to try the code in your environment.

Load the PDF with an instance of Document.
Create a TextAbsorber object to extract text.
Accept the absorber for all the pages.
Get the extracted text.
Create a writer, open the output file, and write the extracted text to it.

Extract PDF Files - Java

This sample code shows how to extract text from a PDF document.

// Open document
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("sample.pdf");

// Create TextAbsorber object to extract text
com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

// Accept the absorber for all the pages
pdfDocument.getPages().accept(textAbsorber);

// Get the extracted text
String extractedText = textAbsorber.getText();

// Open the output file in append mode; try-with-resources closes the writer
// even if the write fails
try (java.io.FileWriter writer = new java.io.FileWriter("extracted-text.txt", true)) {
    // Write the extracted text to the file
    writer.write(extractedText);
} catch (java.io.IOException e) {
    System.out.println(e.getMessage());
}
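The sample above writes through java.io.FileWriter in append mode. On Java versions before 18, that constructor uses the platform default charset, so non-ASCII text can come out differently from machine to machine. The following is a minimal sketch of the same final step using only the JDK (Java 11 or later) with an explicit UTF-8 encoding. The class name ExtractedTextWriter and the method saveText are hypothetical helpers, not part of Aspose.PDF, and the string in main merely stands in for the value returned by textAbsorber.getText().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Aspose.PDF: saves extracted text as UTF-8.
class ExtractedTextWriter {

    // Writes text to target. With append == false the file is created or
    // overwritten; with append == true the text is added to the end.
    static void saveText(Path target, String text, boolean append) throws IOException {
        if (append) {
            Files.writeString(target, text, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } else {
            // Without explicit options, writeString creates or truncates the file
            Files.writeString(target, text, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stands in for textAbsorber.getText() from the sample above
        String extractedText = "Text extracted from sample.pdf";
        saveText(Path.of("extracted-text.txt"), extractedText, false);
    }
}
```

Overwriting is often the safer default for repeated runs: appending, as the original sample does, accumulates the text of every earlier run in the same file.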

About Aspose.PDF for Java API

Aspose.PDF for Java API is a library that enables developers to add PDF processing capabilities to their applications. It can be used in 32-bit and 64-bit applications to generate, read, convert, and manipulate PDF files without Adobe Acrobat. Aspose.PDF for Java allows developers to insert tables, graphs, images, hyperlinks, custom fonts, and more into PDF documents. It can also compress PDFs, and it provides security features for producing protected PDF files.

You can find more information about Aspose.PDF for Java API, along with examples of how to use it, in the documentation. Key features of Aspose.PDF for Java API include support for various file formats (HTML, XFA, TXT, PCL, XML, XPS, and image formats), support for different PDF versions, and extensive hyperlink functionality.