Posted on Feb 25, 2024 • Edited on Feb 28, 2024

Working with PDF files using PyMuPDF

#python #programming

Introduction

PyMuPDF is a Python library for reading and modifying PDF documents programmatically. It can extract text and images, add annotations, merge documents, and more. The sections below demonstrate these capabilities through practical examples.

Topics

Installation and setup of PyMuPDF
Text extraction from PDF documents
Image extraction from PDF documents
PDF manipulation and modification

Installation and Setup of PyMuPDF

To use PyMuPDF, first install the library with pip:

    pip install PyMuPDF

Once installed, you can import the library into your Python scripts. Note that the module is imported under the name fitz, not PyMuPDF:

    import fitz
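If the import fails, it can help to check whether the package is actually installed in the active environment. The sketch below uses only the standard library and looks up the distribution name PyMuPDF from the pip command above. The helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "PyMuPDF") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report what is installed in the interpreter running this script.
pymupdf_version = installed_version()
if pymupdf_version is None:
    print("PyMuPDF is not installed in this environment.")
else:
    print("PyMuPDF version:", pymupdf_version)
```

Running this with the same interpreter that runs your scripts shows whether pip installed the package where you expected.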

Text Extraction from PDF Documents

PyMuPDF can extract the text of a PDF page by page. The example below opens a document, calls get_text() on every page, and concatenates the results into one string:

PDF file:

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.

    import fitz


    def extract_text_from_pdf(filename: str) -> str:
        doc = fitz.open(filename=filename)
        text = ""
        for page in doc:
            text += page.get_text()
        return text


    extracted_text = extract_text_from_pdf(filename="example.pdf")
    print(extracted_text)

Output:

Test document PDF
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
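The function above joins every page into a single string. When page boundaries matter, a per-page variant is a small change. The following is a minimal sketch: the helper names `text_by_page` and `extract_pages` are invented for this example, and it relies on the document object working as a context manager, so the file is closed when the `with` block ends.

```python
from typing import Iterable, List


def text_by_page(pages: Iterable) -> List[str]:
    """Collect the plain text of each page, keeping page boundaries."""
    # Each item only needs a get_text() method, as PyMuPDF pages have.
    return [page.get_text() for page in pages]


def extract_pages(filename: str) -> List[str]:
    """Open a PDF and return one string per page."""
    import fitz  # imported here so text_by_page works without PyMuPDF

    with fitz.open(filename) as doc:
        return text_by_page(doc)


# Usage (assumes PyMuPDF is installed and example.pdf exists):
# for number, text in enumerate(extract_pages("example.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```

Returning a list keeps the page number available as the list index, which is handy when you later need to report where a piece of text was found.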

Image Extraction from PDF Documents

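In addition to text, PyMuPDF can extract the images embedded in a PDF. page.get_images() lists the images referenced by a page, and doc.extract_image() returns the data of one image, identified by its cross-reference number (xref):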

    import fitz


    def extract_images_from_pdf(filename: str) -> list:
        doc = fitz.open(filename=filename)
        images = []
        for page in doc:
            for img in page.get_images():
                xref = img[0]
                base_image = doc.extract_image(xref=xref)
                image_bytes = base_image["image"]
                images.append(image_bytes)
        return images


    extracted_images = extract_images_from_pdf(filename="example.pdf")
    print("Number of images extracted:", len(extracted_images))

Output:

Number of images extracted: 1
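The example above keeps only the raw bytes in memory. The dictionary returned by extract_image() also carries a file extension under the "ext" key, which makes it straightforward to write the images to disk. The sketch below is one possible way to do that: the function name `save_images` and the file naming pattern are assumptions made for this example. It skips xrefs it has already seen, since the same image can be referenced from several pages.

```python
from pathlib import Path
from typing import List


def save_images(doc, out_dir: str) -> List[Path]:
    """Write every distinct embedded image of an open document to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    seen = set()  # xrefs that were already written
    written = []
    for page_index, page in enumerate(doc):
        for img in page.get_images():
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)
            # Name the file after the first page the image appears on.
            path = target / f"page{page_index + 1}_xref{xref}.{info['ext']}"
            path.write_bytes(info["image"])
            written.append(path)
    return written


# Usage (assumes PyMuPDF is installed and example.pdf exists):
# import fitz
# with fitz.open("example.pdf") as doc:
#     for path in save_images(doc, "images"):
#         print("saved", path)
```

The function takes an already opened document rather than a file name, so the caller decides when the file is opened and closed.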

PDF Manipulation and Modification

PyMuPDF facilitates various manipulations and modifications of PDF documents, such as adding annotations, merging documents, and more:

    import fitz


    def add_annotation_to_pdf(in_filename: str, annotation: str, out_filename: str) -> None:
        doc = fitz.open(filename=in_filename)
        page = doc[0]
        # Add annotation to the first page
        annot = page.add_text_annot(point=(100, 100), text=annotation)
        # Set annotation color to red. The colors argument expects a dict,
        # so the RGB tuple is passed as the stroke color instead.
        annot.set_colors(stroke=(1, 0, 0))
        annot.update()  # apply the color change to the annotation
        doc.save(filename=out_filename)


    in_filename = "example.pdf"
    annotation = "This is an annotation added using PyMuPDF."
    out_filename = "example2.pdf"
    add_annotation_to_pdf(in_filename=in_filename, annotation=annotation, out_filename=out_filename)

This code adds the note "This is an annotation added using PyMuPDF." to the output PDF.
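The introduction to this section also mentions merging documents, which the annotation example does not cover. A minimal sketch of a merge could look like the following, using insert_pdf() to append the pages of one document to another. The helper names `append_all` and `merge_pdfs` are made up for this example.

```python
from typing import Iterable, List


def append_all(target, sources: Iterable) -> None:
    """Append the pages of each source document to target, in order."""
    for src in sources:
        target.insert_pdf(src)


def merge_pdfs(in_filenames: List[str], out_filename: str) -> None:
    """Concatenate several PDF files into a new file."""
    import fitz  # imported here so append_all can be used without PyMuPDF

    merged = fitz.open()  # no argument: a new, empty PDF
    sources = [fitz.open(name) for name in in_filenames]
    try:
        append_all(merged, sources)
        merged.save(out_filename)
    finally:
        for src in sources:
            src.close()
        merged.close()


# Usage (assumes PyMuPDF is installed and both input files exist):
# merge_pdfs(["example.pdf", "example2.pdf"], "merged.pdf")
```

Pass at least one input file that contains pages, because a document without pages cannot be saved.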

Conclusion

PyMuPDF covers the PDF tasks a Python developer is most likely to meet: extracting text and images, adding annotations, and modifying or merging documents, all from a single library.