In order to extract Attachments in PDF file, we’ll use Aspose.PDF for .NET API which is a feature-rich, powerful and easy to use document manipulation API for python-net platform. Open NuGet package manager, search for Aspose.PDF and install. You may also use the following command from the Package Manager Console.
Extract Attachments from PDF Python
You need Aspose.PDF for Python via .NET to try the code in your environment.
- Get embedded files collection.
- Get count of the embedded files.
- Loop through the collection to get all the attachments.
- Check if parameter object contains the parameters.
- Get the Attachment and write to file or stream.
Extract Attachment from PDF document
def attachment_extract(self, infile):
path_infile = self.dataDir + infile
# Open document
pdfDocument = Document(path_infile)
# Get embedded files collection
embeddedFiles = pdfDocument.EmbeddedFiles
# Get count of the embedded files
print ( "Total files : %d " % (embeddedFiles.Count))
count = 1
# Loop through the collection to get all the attachments
for fileSpecification in embeddedFiles:
print("Name: " + fileSpecification.Name)
print("Description: " + fileSpecification.Description)
print("Mime Type: " + fileSpecification.MIMEType)
# Check if parameter object contains the parameters
if (fileSpecification.Params != None):
print("CheckSum: " + fileSpecification.Params.CheckSum)
print("Creation Date: " + fileSpecification.Params.CreationDate)
print("Modification Date " + fileSpecification.Params.ModDate)
print("Size: " + fileSpecification.Params.Size)
# Get the attachment and write to file or stream
File.WriteAllBytes(self.dataDir + count + "_out" + ".txt", fileSpecification.Contents)
count+=1