Batch-convert pdf to text, extract data from pdf in python


shine-jayakumar/Extract-Data-From-PDF-In-Python


MIT License

In this project, we are going to batch-convert PDF files to text and extract data from them without using PyPDF2/4.

We're going to achieve that by:

  • Using the pdftotext converter from Xpdf to convert PDF files to text files
  • Using regular expressions to extract data
  • Performing data cleaning using pandas
  • Exporting to Excel file
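The regular-expression step above can be sketched roughly as follows. This is only an illustration: the field labels ("Pay Date", "Gross Pay", "Net Pay"), the `PATTERNS` table, and the helper names `extract_fields` and `to_number` are assumptions made up for this example, not the layout of the real payslips or the contents of parse_payslips.py.

```python
import re

# Hypothetical payslip labels; adjust the patterns to match your own text files.
PATTERNS = {
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "gross_pay": re.compile(r"Gross Pay:\s*([\d,]+\.\d{2})"),
    "net_pay": re.compile(r"Net Pay:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return one record (dict) per payslip; missing fields become None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def to_number(value):
    """Strip thousands separators and convert to float; keep None as None."""
    return float(value.replace(",", "")) if value is not None else None


# Made-up sample standing in for the text produced by pdftotext.
sample = """Pay Date: 31/01/2020
Gross Pay: 3,250.00
Net Pay: 2,610.45
"""

record = extract_fields(sample)
```

A list of such records can then be handed to pandas for cleaning and export, for example with `pandas.DataFrame(records)` followed by `DataFrame.to_excel("payslips.xlsx", index=False)`. Writing .xlsx files requires an Excel engine such as openpyxl to be installed.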

Why Not Use PyPDF2/4

Short Answer: I got this error:

TypeError: object of type 'IndirectObject' has no len()

Long Answer: If PyPDF4 had worked, I would never have had a chance to explore other ways. I looked on Stack Overflow but couldn't find a solution for this error. Obviously, someone else must have had the same problem, but no solution had been posted.

I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are for?

Table of Contents

Packages

Converting PDF To Text

Converting PDF to text using Xpdf's pdftotext is really simple.

Using this command-line tool we can batch-convert PDFs to text files.

pdftotext source.pdf dest.txt
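To batch-convert a whole folder, the same command can be run once per file from Python. The following is a minimal sketch, assuming pdftotext is installed and on the PATH; the function names `build_command` and `batch_convert` and the `run` parameter are hypothetical and not taken from parse_payslips.py.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the 'pdftotext source.pdf dest.txt' command for one file."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def batch_convert(src_dir, out_dir, run=subprocess.run):
    """Convert every *.pdf in src_dir; return the expected .txt paths.

    'run' defaults to subprocess.run and is a parameter only so the loop
    can be exercised without pdftotext installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        cmd = build_command(pdf, out_dir)
        run(cmd, check=True)  # check=True raises if pdftotext exits with an error
        converted.append(Path(cmd[2]))
    return converted
```

Calling `batch_convert("payslips", "payslips_txt")` would then convert each PDF in a folder named payslips (a made-up folder name) into a matching text file in payslips_txt.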

Script Link

Script Link: parse_payslips.py

