shine-jayakumar/Extract-Data-From-PDF-In-Python

Batch-convert pdf to text, extract data from pdf in python

License

MIT license



Extract Data From PDF In Python

In this project, we are going to batch-convert PDF files to text and extract data without using PyPDF2/4.

We're going to achieve that by:

Using the pdftotext converter from Xpdf to convert PDF files to text files
Using regular expressions to extract data
Performing data cleaning using pandas
Exporting to Excel file
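
The extraction, cleaning, and export steps above can be sketched roughly as follows. This is a minimal illustration, not the code from parse_payslips.py: the sample text, the field labels ("Employee Name:", "Pay Date:", "Net Pay:"), the date format, and the function names are all assumptions made up for the example, and real payslips will need their own patterns.

```python
import re

# Hypothetical payslip text, as pdftotext might produce it. The labels and
# values are assumptions for illustration only.
SAMPLE_TEXT = """\
Employee Name: Jane Doe
Pay Date: 31/01/2021
Net Pay: 1,234.50
"""

# One pattern per field; each captures the value that follows the label.
FIELD_PATTERNS = {
    "employee": re.compile(r"Employee Name:\s*(.+)"),
    "pay_date": re.compile(r"Pay Date:\s*(\d{2}/\d{2}/\d{4})"),
    "net_pay": re.compile(r"Net Pay:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> matched string (None if absent)."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


def clean_and_export(records, xlsx_path):
    """Clean the extracted strings with pandas and write an Excel file."""
    import pandas as pd  # local import: the extraction above runs without pandas

    df = pd.DataFrame(records)
    # Strip thousands separators, then convert amounts to numbers.
    df["net_pay"] = pd.to_numeric(df["net_pay"].str.replace(",", "", regex=False))
    # Parse dates, assuming a day/month/year layout.
    df["pay_date"] = pd.to_datetime(df["pay_date"], format="%d/%m/%Y")
    df.to_excel(xlsx_path, index=False)  # needs an Excel writer such as openpyxl
    return df


record = extract_fields(SAMPLE_TEXT)
print(record)
```

With one record per converted text file, something like `clean_and_export(records, "payslips.xlsx")` would produce the final spreadsheet; the output filename here is only a placeholder.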

Why Not Use PyPDF2/4

Short Answer: I got this error:

TypeError: object of type 'IndirectObject' has no len()

Long Answer: If PyPDF4 had worked, I would never have had a chance to explore other ways. I looked on Stack Overflow but couldn't find a solution for this error. Surely someone else has run into the same problem, but I found no solution.

I was not willing to manually copy and paste the information from 52 of my payslips. Isn't that what programs are for?

Packages

Pandas
Check out the requirements.txt

Converting PDF To Text

Converting PDF to text using Xpdf's pdftotext is really simple.

Using this command-line tool we can batch-convert PDFs to text files.

pdftotext source.pdf dest.txt
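
To run that command over a whole folder from Python, one option is to call it through subprocess. The sketch below assumes pdftotext is installed and on the PATH; the function names and the `runner` parameter are hypothetical helpers for this example, not taken from parse_payslips.py.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdftotext argument list for one PDF."""
    txt_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def convert_all(src_dir, out_dir, runner=subprocess.run):
    """Run pdftotext on every *.pdf in src_dir; return the .txt paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        command = build_command(pdf_path, out_dir)
        # check=True makes subprocess.run raise CalledProcessError
        # if pdftotext exits with a non-zero status.
        runner(command, check=True)
        written.append(Path(command[-1]))
    return written
```

A call such as `convert_all("payslips", "payslips_txt")` (both directory names are placeholders) would then leave one text file per PDF in the output folder. Passing the command as a list avoids shell quoting problems with file names that contain spaces.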

Script Link

Script Link: parse_payslips.py

