Convert PDF to searchable PDF in C#
Recognize a scanned PDF and convert it into a searchable PDF using the Aspose.OCR for .NET library.
How to convert a scanned PDF into a searchable PDF using C#
To convert recognition results into a searchable and indexable PDF document, use the SaveMultipageDocument method of the Aspose.OCR.AsposeOcr class. This is useful for recognizing books, contracts, articles, and other printouts consisting of multiple pages, as well as for batch recognition. Pass Aspose.OCR.SaveFormat.Pdf as the saveFormat parameter.
In addition to the recognized text, the resulting PDF can contain the original images in the background, with a transparent text overlay that can be searched, selected, and copied. The type of the PDF document is controlled by the selected save format, as described in the Format table below.
The RecognizeAndSaveSearchablePdf example processes a scanned PDF and creates searchable PDF documents containing the recognized text (with images or with text only). To run the examples, you just need to download the Aspose.OCR tools from the following link:
Download Command Line Tools
or run the example project in an IDE: RecognizeAndSaveSearchablePdf project
Run the program in Command Prompt:
RecognizeAndSaveSearchablePdf
or, to use your own PDF document, start page, and page count, run the program in Command Prompt as follows:
RecognizeAndSaveSearchablePdf folder/image.pdf 0 2
The resulting PDF file is stored in the out folder.
Format | Description |
---|---|
Aspose.OCR.SaveFormat.Pdf | The original images are placed in the background; the recognized text is placed as an invisible but searchable and selectable overlay on top of the images. Can be useful if you need to keep all notes, images, marks and other data along with the text. |
Aspose.OCR.SaveFormat.PdfNoImg | A PDF document containing only the recognized text. The original images are not saved along with the recognition results. This can be useful when digitizing large amounts of high-quality text (such as books), because the resulting file takes up much less space than one saved with Aspose.OCR.SaveFormat.Pdf. |
This sample code shows how to recognize a scanned PDF and save the result as a searchable PDF:
using System;
using System.Collections.Generic;
using Aspose.OCR;
// Set the license file
//License lic = new License();
//lic.SetLicense("Aspose.Total.lic");
// Create an AsposeOcr instance.
// You can use the overloaded constructor to restrict the allowed characters.
AsposeOcr api = new AsposeOcr();
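// Create an OcrInput object that holds the PDF pages to recognize
// (add preprocessing filters as needed)
OcrInput input = new OcrInput(InputType.PDF);
// Add the file together with the start page and the number of pages;
// fileName, pageStart and pageCount are supplied by the caller
input.Add(fileName, pageStart, pageCount);
// Recognize the pages; uncomment any of the settings below as needed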
List<RecognitionResult> res = api.Recognize(input, new RecognitionSettings
{
// Available options
// AllowedCharacters = CharactersAllowedType.LATIN_ALPHABET, // ignore non-Latin symbols
// AutoSkew = true, // switch off if your image is not rotated
// DetectAreasMode = DetectAreasMode.DOCUMENT, // depends on the structure of your image
// IgnoredCharacters = "*-!@#$%^&", // symbols to exclude from the recognition result
// Language = Language.Eng, // we support 26 languages
// LinesFiltration = false, // slow, so enable it only if your picture has lines that are badly detected in TABLE or DOCUMENT DetectAreasMode
// ThreadsCount = 1, // by default the API uses all your threads; set this to run in a single thread
// ThresholdValue = 150 // set your own binarization threshold here (from 1 to 255)
});
Console.WriteLine("RESULT");
Console.ResetColor();
for (int i = 0; i < res.Count; i++)
{
Console.WriteLine($"PAGE {i+1}\n------------------------------------------------------------");
Console.WriteLine(res[i].RecognitionText);
// you can print additional information here and spell-check the result
// you can also save each page result in your preferred file format
// res[i].Save(...);
// or convert your result to json or xml string
// res[i].GetJson();
// res[i].GetXml();
}
// Save all recognized pages as one multipage searchable PDF
AsposeOcr.SaveMultipageDocument("result.pdf", SaveFormat.Pdf, res);
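The Format table lists two PDF variants, while the sample saves only the image-backed one. As a minimal sketch, using only the calls that appear in the sample and assuming a hypothetical input file scan.pdf with at least two pages, the same recognition results could be written in both formats for comparison. The input and output file names are placeholders, not part of the original example project.

```csharp
using System.Collections.Generic;
using Aspose.OCR;

class SaveBothPdfVariants
{
    static void Main()
    {
        AsposeOcr api = new AsposeOcr();

        // "scan.pdf" is a placeholder path; pages 0 and 1 are recognized
        OcrInput input = new OcrInput(InputType.PDF);
        input.Add("scan.pdf", 0, 2);

        // Recognize once with default settings and reuse the results
        List<RecognitionResult> res = api.Recognize(input, new RecognitionSettings());

        // Original page images in the background with an invisible text overlay
        AsposeOcr.SaveMultipageDocument("result-with-images.pdf", SaveFormat.Pdf, res);

        // Recognized text only, without the original page images
        AsposeOcr.SaveMultipageDocument("result-text-only.pdf", SaveFormat.PdfNoImg, res);
    }
}
```

Recognition runs only once here, so comparing the two output files isolates the effect of the save format on file size and appearance.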
Other Supported Tools
Using C#, one can easily run our examples.