Batch-convert pdf to text, extract data from pdf in python
shine-jayakumar/Extract-Data-From-PDF-In-Python
In this project, we are going to batch-convert PDF files to text and extract data from them without using PyPDF2 or PyPDF4.
We're going to achieve that by:
- Using the pdftotext converter from Xpdf to convert PDF files to text files
- Using regular expressions to extract data
- Performing data cleaning using pandas
- Exporting the result to an Excel file
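As a rough illustration of the last three steps, here is a minimal sketch. The field labels (`Pay Date`, `Net Pay`), their formats, and the function names are assumptions made up for this example; they are not taken from the actual payslips or from parse_payslips.py, so the patterns would need adjusting to the real text layout.

```python
import re

# Hypothetical payslip layout: these labels and formats are assumptions
# for illustration only, not the fields of the original payslips.
PATTERNS = {
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "net_pay": re.compile(r"Net Pay:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict with one entry per pattern; missing fields are None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def export_to_excel(records, path):
    """Clean a list of extracted records with pandas and write them to Excel."""
    # Imported here so extract_fields can be used without pandas installed.
    import pandas as pd

    df = pd.DataFrame(records)
    # Cleaning: drop thousands separators, then convert to numbers and dates.
    df["net_pay"] = pd.to_numeric(df["net_pay"].str.replace(",", "", regex=False))
    df["pay_date"] = pd.to_datetime(df["pay_date"], format="%d/%m/%Y")
    # Writing .xlsx requires an Excel engine such as openpyxl.
    df.to_excel(path, index=False)
    return df
```

With something like this, each converted text file could be read, passed through `extract_fields`, and the collected records handed to `export_to_excel(records, "payslips.xlsx")`.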
Short Answer: I got this error:
TypeError: object of type 'IndirectObject' has no len()
Long Answer: If PyPDF4 had worked, I would never have had a chance to explore other ways. I looked on Stack Overflow but couldn't find a solution for this error. Obviously, someone else must have had the same problem, but no solution was posted.
I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are for?
Pandas
Check out the requirements.txt
Converting PDF to text using Xpdf's pdftotext is really simple.
Using this command-line tool we can batch-convert PDFs to text files.
pdftotext source.pdf dest.txt
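To run that command over a whole folder from Python, something along these lines could work. This is only a sketch: it assumes `pdftotext` is on the `PATH`, and the function names and folder arguments are hypothetical, not taken from the repository's script.

```python
import subprocess
from pathlib import Path


def build_commands(src_dir, dest_dir):
    """Build one 'pdftotext source.pdf dest.txt' command per PDF in src_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    return [
        ["pdftotext", str(pdf), str(dest / (pdf.stem + ".txt"))]
        for pdf in sorted(src.glob("*.pdf"))
    ]


def batch_convert(src_dir, dest_dir):
    """Convert every PDF in src_dir to a text file in dest_dir."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(src_dir, dest_dir):
        # check=True raises CalledProcessError if pdftotext exits with an error.
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect what would be executed before converting anything, for example with `print(build_commands("payslips", "text"))`.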
Script Link: parse_payslips.py