
Abdeladim Fadheli · 5 min read · Updated Jun 2023 · PDF File Handling


In this tutorial, we will demonstrate how to extract images from PDF files and save them to the local disk using Python, along with the PyMuPDF and Pillow libraries.

PyMuPDF is a versatile library that allows you to access PDF, XPS, OpenXPS, EPUB, and various other document formats, while Pillow is an open-source Python imaging library that adds image processing capabilities to your Python interpreter.

To follow along with this tutorial, you will need:

Python 3.x installed on your system
A PDF file containing images you want to extract


Installing PyMuPDF and Pillow Libraries

First, we need to install the PyMuPDF and Pillow libraries. Open your terminal or command prompt and run the following command:

pip3 install PyMuPDF Pillow

Importing the Libraries and Setting Up Options

Create a new Python file named pdf_image_extractor.py and import the necessary libraries. Also, define the output directory, output image format, and minimum dimensions for the extracted images:

import os
import fitz  # PyMuPDF
import io
from PIL import Image

# Output directory for the extracted images
output_dir = "extracted_images"
# Desired output image format
output_format = "png"
# Minimum width and height for extracted images
min_width = 100
min_height = 100
# Create the output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
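If you prefer to keep these settings together, one possible arrangement is a small dataclass. This is only a sketch: the `ExtractionOptions` class and its method names are hypothetical and belong to neither PyMuPDF nor Pillow. It also uses `os.makedirs(..., exist_ok=True)`, which has the same effect as the existence check above.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class ExtractionOptions:
    """Hypothetical container for the tutorial's settings."""
    output_dir: str = "extracted_images"
    output_format: str = "png"
    min_width: int = 100
    min_height: int = 100

    def ensure_output_dir(self) -> str:
        # exist_ok=True makes this safe to call when the directory already exists
        os.makedirs(self.output_dir, exist_ok=True)
        return self.output_dir

    def is_large_enough(self, width: int, height: int) -> bool:
        # Same rule as the tutorial: both dimensions must reach the minimum
        return width >= self.min_width and height >= self.min_height


# Demonstrate inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    options = ExtractionOptions(output_dir=os.path.join(tmp, "extracted_images"))
    created = os.path.isdir(options.ensure_output_dir())
    print(created, options.is_large_enough(640, 480), options.is_large_enough(640, 50))
```

Grouping the values this way mainly helps if you later turn the script into a function or a command-line tool; for a one-off script, the plain variables are just as good.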


Loading the PDF File

I'm going to test this with this PDF file, but you're free to use any PDF file; just put it in your current working directory. Let's load it with the library:

# file path you want to extract images from
file = "1710.05006.pdf"
# open the file
pdf_file = fitz.open(file)

Iterating Over Pages and Extracting Images

Since we want to extract images from all pages, we need to iterate over every page and get all image objects on each one. The following code does that:

# Iterate over PDF pages
for page_index in range(len(pdf_file)):
    # Get the page itself
    page = pdf_file[page_index]
    # Get image list
    image_list = page.get_images(full=True)
    # Print the number of images found on this page
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print(f"[!] No images found on page {page_index}")
    # Iterate over the images on the page
    for image_index, img in enumerate(image_list, start=1):
        # Get the XREF of the image
        xref = img[0]
        # Extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # Get the image extension
        image_ext = base_image["ext"]
        # Load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # Check if the image meets the minimum dimensions and save it
        if image.width >= min_width and image.height >= min_height:
            image.save(
                open(os.path.join(output_dir, f"image{page_index + 1}_{image_index}.{output_format}"), "wb"),
                format=output_format.upper())
        else:
            print(f"[-] Skipping image {image_index} on page {page_index} due to its small size.")

In this code snippet, we use the get_images(full=True) method to list all available image objects on a particular page. Then, we loop through the images, check if they meet the minimum dimensions, and save them using the specified output format in the output directory.
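Each entry returned by get_images() is a tuple whose first item is the image's xref. As far as I understand the PyMuPDF documentation, the stored width and height sit at positions 2 and 3, so small images can be filtered, and xrefs that appear on several pages skipped, before any bytes are decoded. The sketch below runs on made-up tuples rather than a real PDF, and `select_xrefs` is a hypothetical helper; check the tuple layout against your installed PyMuPDF version.

```python
# Made-up sample entries shaped like page.get_images(full=True) results.
# Assumed layout: (xref, smask, width, height, bpc, colorspace,
#                  alt_colorspace, name, filter, referencer)
sample_image_list = [
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode", 0),
    (15, 0, 32, 32, 8, "DeviceGray", "", "Im2", "FlateDecode", 0),
    (12, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode", 0),  # same xref again
]


def select_xrefs(image_list, min_width=100, min_height=100, seen=None):
    """Return xrefs that are large enough and have not been seen before."""
    seen = set() if seen is None else seen
    selected = []
    for entry in image_list:
        xref, width, height = entry[0], entry[2], entry[3]
        if xref in seen:
            continue  # this image object was already handled
        seen.add(xref)
        if width >= min_width and height >= min_height:
            selected.append(xref)
    return selected


print(select_xrefs(sample_image_list))  # [12]
```

Passing the same `seen` set for every page would keep a logo that repeats throughout the document from being saved once per page.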

We use the extract_image() method, which returns the image in bytes along with additional information, such as the image extension.
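Because the returned dictionary carries both the raw bytes ("image") and their extension ("ext"), the image can also be written in its embedded format without re-encoding. The following is a minimal sketch of that idea: `save_raw_image` is a hypothetical helper, and the dictionary is a stand-in with placeholder bytes, not real output from extract_image().

```python
import os
import tempfile


def save_raw_image(base_image, output_dir, stem):
    """Write the embedded bytes unchanged, using the extension reported for them."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{stem}.{base_image['ext']}")
    with open(path, "wb") as fh:
        fh.write(base_image["image"])
    return path


# Stand-in for the dictionary returned by extract_image(); the bytes are placeholders
fake_base_image = {"ext": "jpeg", "image": b"\xff\xd8placeholder\xff\xd9"}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw_image(fake_base_image, tmp, "image1_1")
    print(os.path.basename(saved))  # image1_1.jpeg
```

Writing the bytes as-is avoids a lossy second encoding for JPEG images, at the cost of ending up with a mix of file types in the output folder.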

We then convert the image bytes to a PIL image instance and save it to the local disk using the save() method, which accepts a file pointer as an argument. We simply name the images with their corresponding page and image indices.
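One caveat when converting to PNG: Pillow cannot write CMYK images in that format, and CMYK JPEGs are fairly common in print-oriented PDFs. A hedged workaround is to convert such images to RGB first. The sketch below builds a CMYK JPEG in memory so that it runs without a PDF; `to_png_bytes` is a hypothetical helper, not part of the tutorial's script.

```python
import io

from PIL import Image


def to_png_bytes(image_bytes):
    """Decode image bytes with Pillow and re-encode them as PNG."""
    image = Image.open(io.BytesIO(image_bytes))
    if image.mode == "CMYK":
        # PNG has no CMYK mode, so convert before saving
        image = image.convert("RGB")
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


# Build a small CMYK JPEG in memory to stand in for bytes extracted from a PDF
source = io.BytesIO()
Image.new("CMYK", (120, 120), (0, 128, 128, 0)).save(source, format="JPEG")

png_bytes = to_png_bytes(source.getvalue())
print(Image.open(io.BytesIO(png_bytes)).mode)  # RGB
```

The conversion does no color management, so the resulting colors are only approximate.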

Running the Script and Verifying the Output

Now, save the script and run it using the following command:

$ python pdf_image_extractor.py

I got the following output:

[!] No images found on page 0
[+] Found a total of 3 images in page 1
[+] Found a total of 3 images in page 2
[!] No images found on page 3
[!] No images found on page 4

The extracted images that meet the minimum dimensions will be saved in the specified output directory with their corresponding page and image indices, using the desired output format.
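To check the result programmatically, the file names can be parsed back into page numbers. This sketch assumes the `image{page}_{index}.{format}` naming used by the script; the `images_per_page` helper and the sample names are made up for illustration. In practice you would pass it `os.listdir("extracted_images")`.

```python
import re
from collections import Counter

# Matches names such as image2_1.png and captures the page and image numbers
NAME_PATTERN = re.compile(r"^image(\d+)_(\d+)\.(\w+)$")


def images_per_page(filenames):
    """Count saved images per page number, ignoring unrelated files."""
    counts = Counter()
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(sorted(counts.items()))


sample_names = ["image2_1.png", "image2_2.png", "image3_1.png", "notes.txt"]
print(images_per_page(sample_names))  # {2: 2, 3: 1}
```

Note that the page numbers in the file names start at 1, while the script's log messages use the zero-based page index.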

The images are saved in the extracted_images folder, as specified.

Conclusion

In this tutorial, we have successfully demonstrated how to extract images from PDF files using Python, PyMuPDF, and Pillow libraries. This technique can be extended to work with various file formats and customized to fit your specific requirements.

For more information on how the libraries work, refer to the official PyMuPDF and Pillow documentation.

I have used the argparse module to create a quick CLI script; you can get both versions of the code here.
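For reference, a command-line interface for this script could look roughly like the sketch below. It is not the author's actual CLI: the argument names and flags are assumptions chosen to mirror the variables defined earlier in this tutorial.

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the tutorial's settings."""
    parser = argparse.ArgumentParser(description="Extract images from a PDF file.")
    parser.add_argument("file", help="Path to the PDF file")
    parser.add_argument("-o", "--output-dir", default="extracted_images",
                        help="Directory for the extracted images")
    parser.add_argument("-f", "--format", default="png",
                        help="Output image format")
    parser.add_argument("--min-width", type=int, default=100,
                        help="Minimum image width in pixels")
    parser.add_argument("--min-height", type=int, default=100,
                        help="Minimum image height in pixels")
    return parser


# Parse an explicit argument list so the example runs without a real command line
args = build_parser().parse_args(["1710.05006.pdf", "--min-width", "200"])
print(args.file, args.output_dir, args.format, args.min_width, args.min_height)
# 1710.05006.pdf extracted_images png 200 100
```

In a real script you would call `parse_args()` with no arguments so that it reads `sys.argv`, then pass the parsed values to the extraction loop.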


Finally, for more PDF handling guides in Python, you can check our Practical Python PDF Processing EBook, where we dive deeper into PDF document manipulation with Python. Make sure to check it out if you're interested!

Happy coding ♥
