NanoNets/ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

Python OCR

This Python package is an OCR library that reads all text and tables from image and PDF files using an OCR engine, and provides post-processing options to save the OCR results in the format you want.

Installation

The package requires Python 3 to run.

You can use pip to install:

pip install ocr-nanonets-wrapper

Authentication

This software is perpetually free :)

You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys.

    from nanonets import NANONETSOCR
    model = NANONETSOCR()
    model.set_token('REPLACE_API_KEY')
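
To keep the key out of source control, the token can be read from an environment variable before it is passed to set_token(). This is a minimal sketch, not part of the package: the variable name NANONETS_API_KEY and the helpers load_api_key() and build_model() are assumptions made for illustration. Only NANONETSOCR and set_token() come from the authentication snippet above.

```python
import os


def load_api_key(env_var="NANONETS_API_KEY"):
    """Return the API key stored in env_var (a hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Nanonets API key"
        )
    return key


def build_model(api_key):
    """Create an authenticated model, mirroring the authentication snippet."""
    # Imported lazily so load_api_key() works without the package installed.
    from nanonets import NANONETSOCR

    model = NANONETSOCR()
    model.set_token(api_key)
    return model


# Usage (requires the installed package and a real key in the environment):
# model = build_model(load_api_key())
```

Failing early with a clear message when the variable is missing avoids sending requests with the placeholder key.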

Usage

You can refer to the code shared below or directly use code from here.

    # Initialise
    from nanonets import NANONETSOCR
    model = NANONETSOCR()

    # Authenticate
    # This software is perpetually free :)
    # You can get your free API key (with unlimited requests) by creating a free account on https://app.nanonets.com/#/keys?utm_source=wrapper.
    model.set_token('REPLACE_API_KEY')

    # PDF / Image to Raw OCR Engine Output
    import json
    pred_json = model.convert_to_prediction('INPUT_FILE')
    print(json.dumps(pred_json, indent=2))

    # PDF / Image to String
    string = model.convert_to_string('INPUT_FILE')
    print(string)

    # PDF / Image to TXT File
    model.convert_to_txt('INPUT_FILE', output_file_name='OUTPUTNAME.txt')

    # PDF / Image to Boxes
    # each element contains predicted word and bounding box information
    # bounding box information denotes the spatial position of each word in the file
    boxes = model.convert_to_boxes('test.png')
    for box in boxes:
        print(box)

    # PDF / Image to CSV
    # This method extracts tables from your file and prints them in a .csv file.
    # NOTE : This particular function is a trial offering 1000 pages of use.
    # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
    model.convert_to_csv('INPUT_FILE', output_file_name='OUTPUTNAME.csv')

    # PDF / Image to Tables
    # This method extracts tables from your file and returns a json object.
    # NOTE : This particular function is a trial offering 1000 pages of use.
    # To use this at scale, please create your own model at app.nanonets.com --> New Model --> Tables.
    import json
    tables_json = model.convert_to_tables('INPUT_FILE')
    print(json.dumps(tables_json, indent=2))

    # PDF / Image to Searchable PDF
    model.convert_to_searchable_pdf('INPUT_FILE', output_file_name='OUTPUTNAME.pdf')
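
When several files need converting, the single-file calls above can be wrapped in a loop. The following is a rough sketch rather than a feature of the package: the helper convert_folder_to_txt() and the SUPPORTED_SUFFIXES set are hypothetical, and the list of extensions is an assumption, since the README only says "image & PDF files". The only library call used is convert_to_txt() with its output_file_name parameter, as shown above.

```python
from pathlib import Path

# Assumed set of input extensions; adjust to the file types you actually use.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def convert_folder_to_txt(model, input_dir, output_dir):
    """Run model.convert_to_txt() on every supported file in input_dir.

    `model` is an authenticated NANONETSOCR instance. Returns the list of
    output paths that were requested, in sorted input order.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(input_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        # Note: inputs that share a stem (a.pdf, a.png) map to the same a.txt.
        out_path = output_dir / (path.stem + ".txt")
        model.convert_to_txt(str(path), output_file_name=str(out_path))
        written.append(out_path)
    return written


# Usage, assuming `model` was created and authenticated as shown above:
# convert_folder_to_txt(model, "scans", "scans_txt")
```

Passing the model in as an argument keeps the helper independent of how the token is supplied.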

Testing

To make getting started easier for you, there is a bunch of sample code along with sample input files.

Clone or download the repo and open the /tests folder.
all_tests.ipynb is a Python notebook containing tests for all methods in the package.
convert_to_{METHOD}.py files are Python files, one for each individual method in the package.

Note

convert_to_string() and convert_to_txt() methods have two optional parameters -

formatting =

`lines and spaces` : default, all formatting enabled
`none` : space separated text with formatting removed
`lines` : space separated text with lines separated with newline character
`pages` : list of page wise space separated text

line_threshold =

`low` : default
`high` : You can add line_threshold='high' as a parameter when calling the method, which in a few cases can improve the reading of flowcharts and diagrams.
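
The two options described in this note can be combined in one call. The sketch below wraps convert_to_string() in a hypothetical helper, extract_text(), that rejects values not listed above before calling the method. It assumes formatting is passed as a keyword argument in the same way as line_threshold='high'; the helper and the two constant tuples are illustrative and not part of the package.

```python
# Values listed in the note above.
FORMATTING_OPTIONS = ("lines and spaces", "none", "lines", "pages")
LINE_THRESHOLD_OPTIONS = ("low", "high")


def extract_text(model, input_file,
                 formatting="lines and spaces", line_threshold="low"):
    """Call model.convert_to_string() after checking the option values.

    `model` is an authenticated NANONETSOCR instance. Per the note above,
    formatting='pages' yields a list of page-wise text.
    """
    if formatting not in FORMATTING_OPTIONS:
        raise ValueError(f"formatting must be one of {FORMATTING_OPTIONS}")
    if line_threshold not in LINE_THRESHOLD_OPTIONS:
        raise ValueError(f"line_threshold must be one of {LINE_THRESHOLD_OPTIONS}")
    return model.convert_to_string(
        input_file, formatting=formatting, line_threshold=line_threshold
    )


# Usage, assuming `model` was created and authenticated beforehand:
# pages = extract_text(model, "INPUT_FILE", formatting="pages", line_threshold="high")
```

Checking the values locally turns a typo such as formatting='page' into an immediate error instead of an unexpected result.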

Advanced Functions

If your use case is extracting flat fields, tables and line items from PDFs and images, I strongly advise you to create your own model by signing up on app.nanonets.com and using our advanced API. This will significantly improve functionality, accuracy and response times. Once you have created your account and model, you can use the API documentation present here to extract flat fields, tables and line items from any PDF or image.

Nanonets

We help businesses automate manual data entry using AI, reducing turnaround times and the manual effort required. More than 1000 enterprises use Nanonets for Intelligent Document Processing. We have generated incredible ROIs for our clients.

We provide OCR and IDP solutions customised for various use cases - invoice automation, Receipt OCR, purchase order automation, accounts payable automation, ID Card OCR and many more.

Visit nanonets.com for enterprise OCR and IDP solutions.
Sign up on app.nanonets.com/#/signup to start a free trial.

License

MIT

This software is perpetually free :)
