How to extract text from PDF

Learn how to easily extract high-quality text from PDF documents using a .NET PDF library

How to extract text from PDF with C#

Extracting text from PDF documents is a common data-processing task. In this article, we look at how to extract text from PDF files in C# using the Aspose.PDF for .NET library.

Extracting text from PDF files can make working with PDF documents far more efficient. PDF documents often contain important data such as reports, research papers, financial statements, or survey responses. Once the text is extracted, you can search it for specific information and pass it on for further processing, analysis, or integration into other systems.

Extracting text from PDF also lets you reuse the content for different purposes. You can save the extracted text in other formats, such as Word documents or plain text files, which are easy to edit later. Translation is another important use: text extracted from a PDF can be translated into other languages, which is particularly useful for providing multilingual content.

Extracting text from PDF improves data usability, content management, and automation. It unlocks the information in PDF files, enabling efficient data analysis, content repurposing, and automated workflows across many fields and industries.

Consult the Aspose.PDF for .NET documentation pages and explore the text search and extraction options that fit your specific requirements.

Extract text from PDF

.NET Library to extract text from PDF documents

Before you start working with your PDF, install the Aspose.PDF library using the following command from the Package Manager Console:

PM > Install-Package Aspose.PDF

Alternatively, open the NuGet package manager, search for Aspose.PDF, and install it. See the Parsing PDF files landing page for more details.

How to extract text from PDF documents

  • Initialize a new Document
  • Create TextAbsorber object to extract text
  • Accept the absorber for all the pages
  • Get the extracted text
  • Create a writer and open the file
  • Write a line of text to the file
  • Close the stream

Use the following code snippet:

// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.PDF-for-.NET
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// The path to the documents directory (RunExamples is a helper class from the examples project linked above).
string dataDir = RunExamples.GetDataDir_AsposePdf_Text();

// Open document; the using statement releases it once extraction is finished
using (Document pdfDocument = new Document(dataDir + "ExtractTextAll.pdf"))
{
    // Create TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();
    // Accept the absorber for all the pages
    pdfDocument.Pages.Accept(textAbsorber);
    // Get the extracted text
    string extractedText = textAbsorber.Text;

    // Create a writer and open the file
    using (TextWriter tw = new StreamWriter(dataDir + "extracted-text.txt"))
    {
        // Write a line of text to the file
        tw.WriteLine(extractedText);
    } // Close the stream: leaving the using block closes it, even if an exception is thrown
}
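
If you need the text of each page separately rather than one combined string, the same TextAbsorber approach can be applied page by page. The following is a minimal sketch, not part of the official example: the class name and the input file location are assumptions made for illustration, and it relies on page numbering in Aspose.PDF starting at 1.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class PerPageExtraction
{
    static void Main()
    {
        // Assumed input path for illustration; replace it with your own file.
        using (Document pdfDocument = new Document("ExtractTextAll.pdf"))
        {
            // Page indexes in Aspose.PDF start at 1, not 0.
            for (int pageNumber = 1; pageNumber <= pdfDocument.Pages.Count; pageNumber++)
            {
                // A fresh absorber per page keeps each page's text separate.
                TextAbsorber pageAbsorber = new TextAbsorber();
                pdfDocument.Pages[pageNumber].Accept(pageAbsorber);

                Console.WriteLine("--- Page {0} ---", pageNumber);
                Console.WriteLine(pageAbsorber.Text);
            }
        }
    }
}
```

Creating a new absorber for every page is a deliberate choice in this sketch: it avoids text from earlier pages being carried over into later output.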

Try to extract text from PDF files online

Aspose.PDF for .NET offers a free online app, Aspose.PDF Parser. It is a web application that lets you try out how the text extraction functionality works.

Documentation Aspose.PDF for .NET Library

See other features of the Aspose.PDF for .NET library on the Documentation pages.