Paulo Portela


Working with PDF files using PyMuPDF

Introduction

PyMuPDF is a Python library for reading and modifying PDF documents. It can extract text and images, add annotations, and merge or otherwise restructure files programmatically. In this chapter, we explore these capabilities through short, practical examples.

Topics

  • Installation and setup of PyMuPDF
  • Text extraction from PDF documents
  • Image extraction from PDF documents
  • PDF manipulation and modification

Installation and Setup of PyMuPDF

To use PyMuPDF, you first need to install the library. You can install it via pip:

    pip install PyMuPDF

Once installed, you can import the library into your Python scripts. Note that the import name is fitz, not PyMuPDF:

    import fitz

Text Extraction from PDF Documents

To extract text, open the document and call get_text() on each page. Here's a simple example:

PDF file:

    Test document PDF
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
    import fitz  # PyMuPDF


    def extract_text_from_pdf(filename: str) -> str:
        """Return the text of every page, concatenated in page order."""
        text = ""
        # The context manager closes the document when the block exits.
        with fitz.open(filename) as doc:
            for page in doc:
                text += page.get_text()
        return text


    extracted_text = extract_text_from_pdf(filename="example.pdf")
    print(extracted_text)

Output:

    Test document PDF
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero tempor. Donec quis augue quis magna condimentum lobortis.
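The function above returns one long string, so page boundaries are lost. If you need them, a minimal sketch along the following lines keeps one entry per page. The names collect_page_texts and extract_text_by_page are hypothetical helpers, not part of PyMuPDF; the only library calls assumed are fitz.open and page.get_text(), both already used in this article.

```python
from typing import Iterable, List


def collect_page_texts(pages: Iterable) -> List[str]:
    """Return one stripped text string per page.

    `pages` is any iterable of objects that expose a get_text() method,
    for example an open fitz.Document.
    """
    return [page.get_text().strip() for page in pages]


def extract_text_by_page(filename: str) -> List[str]:
    """Open a PDF with PyMuPDF and return its text page by page."""
    # PyMuPDF is imported here so that collect_page_texts has no dependency.
    import fitz

    with fitz.open(filename) as doc:
        return collect_page_texts(doc)
```

Calling extract_text_by_page("example.pdf") should return a list in which index 0 holds the text of the first page, index 1 the second, and so on.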

Image Extraction from PDF Documents

In addition to text, PyMuPDF enables you to extract images from PDF documents:

    import fitz  # PyMuPDF


    def extract_images_from_pdf(filename: str) -> list:
        """Return the raw bytes of every image referenced on every page."""
        images = []
        with fitz.open(filename) as doc:
            for page in doc:
                for img in page.get_images():
                    xref = img[0]  # cross-reference number of the image object
                    base_image = doc.extract_image(xref)
                    images.append(base_image["image"])
        return images


    extracted_images = extract_images_from_pdf(filename="example.pdf")
    print("Number of images extracted:", len(extracted_images))

Output:

    Number of images extracted: 1
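The extraction function only keeps the image bytes in memory. As a rough sketch of a next step, the code below writes each image to disk, using the "ext" key that extract_image returns alongside "image" to pick a file extension. It also skips xrefs it has already seen, since an image reused on several pages would otherwise be saved more than once. The names save_image and save_images_from_pdf, and the image_<xref> file naming, are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def save_image(base_image: dict, out_dir: str, stem: str) -> Path:
    """Write one extracted image to out_dir and return its path.

    base_image is the dict returned by doc.extract_image(xref); this sketch
    relies on its "image" (raw bytes) and "ext" (file extension) keys.
    """
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{stem}.{base_image['ext']}"
    path.write_bytes(base_image["image"])
    return path


def save_images_from_pdf(filename: str, out_dir: str) -> List[Path]:
    """Save each distinct image in the PDF once, named after its xref."""
    # PyMuPDF is imported here so that save_image stays dependency-free.
    import fitz

    saved = []
    seen_xrefs = set()
    with fitz.open(filename) as doc:
        for page in doc:
            for img in page.get_images():
                xref = img[0]
                if xref in seen_xrefs:  # same image reused on another page
                    continue
                seen_xrefs.add(xref)
                base_image = doc.extract_image(xref)
                saved.append(save_image(base_image, out_dir, f"image_{xref}"))
    return saved
```

For example, save_images_from_pdf("example.pdf", "images") would create the images directory if needed and return the paths of the files it wrote.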

PDF Manipulation and Modification

PyMuPDF can also modify PDF documents, for example by adding annotations or merging files. The example below adds a text annotation to the first page of a document:

    import fitz  # PyMuPDF


    def add_annotation_to_pdf(in_filename: str, annotation: str, out_filename: str) -> None:
        """Add a text (sticky note) annotation to the first page and save a copy."""
        with fitz.open(in_filename) as doc:
            page = doc[0]  # first page
            annot = page.add_text_annot((100, 100), annotation)
            # set_colors() takes the color via the stroke keyword (or a dict).
            annot.set_colors(stroke=(1, 0, 0))  # red
            annot.update()  # apply the color change to the annotation
            doc.save(out_filename)


    in_filename = "example.pdf"
    annotation = "This is an annotation added using PyMuPDF."
    out_filename = "example2.pdf"
    add_annotation_to_pdf(in_filename=in_filename, annotation=annotation, out_filename=out_filename)

This code adds the note "This is an annotation added using PyMuPDF." as a text annotation on the first page and saves the result as example2.pdf.
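Merging is mentioned above but not shown. One way to sketch it is with Document.insert_pdf, which by default appends all pages of a source document to the end of the target. The helper names append_documents and merge_pdfs are hypothetical, and the check for an empty input list is an assumption of this sketch, added because a PDF with no pages is not a useful result.

```python
from typing import Iterable, List


def append_documents(target, sources: Iterable) -> int:
    """Append every page of each source document to target.

    target and the sources are expected to be fitz.Document objects.
    Returns the number of source documents appended.
    """
    count = 0
    for src in sources:
        target.insert_pdf(src)  # default arguments: copy all pages to the end
        count += 1
    return count


def merge_pdfs(in_filenames: List[str], out_filename: str) -> None:
    """Merge the given PDF files, in order, into out_filename."""
    if not in_filenames:
        raise ValueError("need at least one input file")

    import fitz  # PyMuPDF, imported only when there is work to do

    sources = [fitz.open(name) for name in in_filenames]
    merged = fitz.open()  # no arguments: a new, empty PDF
    try:
        append_documents(merged, sources)
        merged.save(out_filename)
    finally:
        merged.close()
        for src in sources:
            src.close()
```

With this in place, merge_pdfs(["example.pdf", "example2.pdf"], "merged.pdf") would write a single file containing the pages of both inputs in the order given.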

Conclusion

PyMuPDF gives Python developers a single toolkit for common PDF tasks: extracting text and images, adding annotations, and modifying or combining files programmatically.
