Abdeladim Fadheli · 4 min read · Updated Jun 2023 · PDF File Handling

PDF metadata is descriptive information about the document: its title, author, creation date, last modification date, subject, and more. Some PDF files carry more metadata than others. In this tutorial, you will learn how to extract PDF metadata in Python.

There are a lot of libraries and utilities in Python to accomplish the same thing, but I like using pikepdf, as it's an actively maintained library. Let's install it:

$ pip install pikepdf

Pikepdf is a Pythonic wrapper around the C++ QPDF library. Let's import it in our script:

import pikepdf
import sys

We'll also use the sys module to get the filename from the command-line arguments:

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
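
Note that sys.argv[1] raises an IndexError when the script is run without an argument. If you prefer a usage message instead, here is a minimal sketch using the standard argparse module. The parse_args helper and its description text are my own illustration and are not part of the tutorial's script:

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Print the metadata of a PDF file.")
    parser.add_argument("pdf_filename", help="path to the PDF file to inspect")
    return parser.parse_args(argv)


# Passing an explicit list is handy for trying the parser without a shell.
args = parse_args(["bert-paper.pdf"])
print(args.pdf_filename)
```

With this in place, running the script without a filename prints a short usage message and exits instead of showing a traceback.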

Let's load the PDF file using the library, and get the metadata:

# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    print(key, ":", value)

The docinfo attribute contains a dictionary of the document's metadata. Here is an example execution:

$ python extract_pdf_metadata_simple.py bert-paper.pdf

Output:

/Author :
/CreationDate : D:20190528000751Z
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : D:20190528000751Z
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False
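
If you want to store these values instead of printing them, one option is to turn them into JSON. The sketch below is only an illustration: docinfo_to_json is a hypothetical helper, and it assumes the object you pass behaves like a mapping with an items() method, as docinfo does in the loop above. A plain dict stands in for pdf.docinfo so the snippet runs without a PDF file:

```python
import json


def docinfo_to_json(docinfo):
    """Serialize a docinfo-like mapping to JSON, dropping the leading "/" of each key."""
    plain = {str(key).lstrip("/"): str(value) for key, value in docinfo.items()}
    return json.dumps(plain, indent=2, ensure_ascii=False)


# Stand-in for pdf.docinfo, built from two of the values printed above.
sample = {"/Creator": "LaTeX with hyperref package", "/Producer": "pdfTeX-1.40.17"}
print(docinfo_to_json(sample))
```

Only the keys are stripped, so a value such as /False keeps its slash.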

Here is another PDF file:

$ python extract_pdf_metadata_simple.py python_cheat_sheet.pdf

Output:

/CreationDate : D:20201002181301Z
/Creator : wkhtmltopdf 0.12.5
/Producer : Qt 4.8.7
/Title : Markdown To PDF

As you can see, not all documents have the same fields; some contain much less information.
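
Because fields can be absent (or present but empty, like /Author in the first example), code that reads a specific field should not assume it exists. Here is a small sketch of one way to handle that. The pick_fields helper and its default field list are hypothetical, and it assumes the object supports a dict-style get() method; a plain dict stands in for pdf.docinfo here:

```python
def pick_fields(docinfo, fields=("/Title", "/Author", "/Subject", "/Keywords")):
    """Return {field: text or None}, treating absent and blank entries alike."""
    picked = {}
    for field in fields:
        value = docinfo.get(field)  # None when the key is absent
        text = str(value).strip() if value is not None else ""
        picked[field] = text or None  # blank strings also become None
    return picked


# Stand-in for pdf.docinfo, using values from the second example above.
sample = {"/Title": "Markdown To PDF", "/Creator": "wkhtmltopdf 0.12.5"}
print(pick_fields(sample))
```

This way the rest of your code only has to check for None, whichever file it is given.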

Notice that /ModDate and /CreationDate are the last modification date and creation date, respectively, in the PDF datetime format. To convert this format into a Python datetime object, I have copied this code from StackOverflow and edited it a little to run on Python 3:

import pikepdf
import datetime
import re
from dateutil.tz import tzutc, tzoffset
import sys

pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[+\-zZ])?",  # "-" is escaped so it is not read as a range
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"
]))

def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object
    """
    match = re.match(pdf_date_pattern, date_str)
    if match:
        date_info = match.groupdict()
        for k, v in date_info.items():  # transform values
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)
        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            # a missing hour or minute part of the offset counts as 0
            tz_seconds = 3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)
            date_info['tzinfo'] = tzoffset(None, multiplier * tz_seconds)
        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]
        return datetime.datetime(**date_info)

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    if str(value).startswith("D:"):
        # pdf datetime format, convert the current value to python datetime
        value = transform_date(str(value))
    print(key, ":", value)

Here is the same output as before, but with the dates converted to Python datetime objects:

/Author :
/CreationDate : 2019-05-28 00:07:51+00:00
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : 2019-05-28 00:07:51+00:00
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Much better. I hope this quick tutorial helped you to get the metadata of PDF documents with Python.
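
As a side note, if you would rather not install dateutil just for the time zone objects, the same conversion can be sketched with the standard library only. This is an alternative illustration, not the code used above: parse_pdf_date and PDF_DATE are names I made up, the pattern expects the full D:YYYYMMDDHHmmSS form, and, like the version above, it treats a missing offset as UTC:

```python
import re
from datetime import datetime, timedelta, timezone

# Full PDF date: optional "D:", date and time, then an optional offset.
PDF_DATE = re.compile(
    r"(?:D:)?(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?P<sign>[+\-Zz])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?"
)


def parse_pdf_date(text):
    """Return an aware datetime for a full PDF date string, or None if it does not match."""
    match = PDF_DATE.match(text)
    if match is None:
        return None
    parts = match.groupdict()
    sign = parts.pop("sign")
    tz_hour = int(parts.pop("tz_hour") or 0)
    tz_minute = int(parts.pop("tz_minute") or 0)
    if sign in ("+", "-"):
        offset = timedelta(hours=tz_hour, minutes=tz_minute)
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = timezone.utc  # "Z" or no offset at all is treated as UTC
    return datetime(**{k: int(v) for k, v in parts.items()}, tzinfo=tz)


print(parse_pdf_date("D:20190528000751Z"))  # 2019-05-28 00:07:51+00:00
```

Returning None for strings that do not match lets the caller decide whether to keep the raw value or report an error.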

Check the complete code here.

Learn also: How to Extract Image Metadata in Python

Happy coding ♥
