An open-source tool that extracts transaction information using OCR.

hungtooc/transaction_ocr

Extract transaction information from scanned PDFs. I used the bank statement of Thuy Tien as an example. The source code does not include an OCR model; instead, I used Google OCR services to get the best performance. Feel free to become a contributor!

Getting Started

Dependency

git clone https://github.com/hungtooc/transaction_ocr.git
pip install -r requirements.txt

1. Prepare input data

1.1 Download raw data

1.2 Convert PDF files to images

PDF password: Vcbsaoke@2021

python tools/pdf-to-images.py --pdf-password Vcbsaoke@2021
usage: pdf-to-images.py [-h] [--pdf-dir PDF_DIR] [--output-dir OUTPUT_DIR] [--pdf-password PDF_PASSWORD] [--from-page-no FROM_PAGE_NO] [--to-page-no TO_PAGE_NO] [--fix-page-number FIX_PAGE_NUMBER]

optional arguments:
  -h, --help            show this help message and exit
  --pdf-dir PDF_DIR     dir to pdf files
  --output-dir OUTPUT_DIR
                        dir to save images
  --pdf-password PDF_PASSWORD
                        pdf password
  --from-page-no FROM_PAGE_NO
                        extra image from page
  --to-page-no TO_PAGE_NO
                        extra image to page
  --fix-page-number FIX_PAGE_NUMBER
                        fix page number (page_no += fix_page_number)
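
To make the page options concrete, here is a minimal sketch of what such a conversion could look like. It is not the code in tools/pdf-to-images.py: it assumes the third-party pdf2image package (with poppler installed), and the function names and the page_N.png naming scheme are hypothetical. The offset follows the documented rule page_no += fix_page_number, which is presumably useful when a statement is split across several PDF files.

```python
from pathlib import Path


def output_page_numbers(from_page_no, to_page_no, fix_page_number=0):
    """Return the page numbers used to name the output images.

    Follows the documented rule: page_no += fix_page_number.
    """
    return [page_no + fix_page_number for page_no in range(from_page_no, to_page_no + 1)]


def pdf_to_images(pdf_path, output_dir, password, from_page_no, to_page_no, fix_page_number=0):
    """Render the requested pages to PNG files (assumes pdf2image and poppler)."""
    from pdf2image import convert_from_path  # imported lazily: third-party dependency

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # userpw is the password needed to open a protected PDF.
    pages = convert_from_path(
        str(pdf_path), userpw=password, first_page=from_page_no, last_page=to_page_no
    )
    saved = []
    numbers = output_page_numbers(from_page_no, to_page_no, fix_page_number)
    for page_no, image in zip(numbers, pages):
        target = output_dir / f"page_{page_no}.png"  # hypothetical naming scheme
        image.save(target)
        saved.append(target)
    return saved
```

With fix_page_number=1000, pages 1 to 3 of a second PDF file would be saved as page_1001.png to page_1003.png.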

2. Extract transaction information

The source performs the basic steps needed to extract transaction information. You may want to add further processing to improve the code at the lines marked #todo.

python run.py
usage: run.py [-h] [--image-dir IMAGE_DIR] [--output-respone-dir OUTPUT_RESPONE_DIR] [--output-content-dir OUTPUT_CONTENT_DIR] [--processed-log-file PROCESSED_LOG_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --image-dir IMAGE_DIR
                        dir to images
  --output-respone-dir OUTPUT_RESPONE_DIR
                        dir to save api respone
  --output-content-dir OUTPUT_CONTENT_DIR
                        dir to save transaction content
  --processed-log-file PROCESSED_LOG_FILE
                        path to log file
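
The --processed-log-file option suggests that a run can be resumed without sending the same image to the OCR API twice. The README does not say how the log is structured, so the sketch below is only one plausible reading: it assumes a plain text file with one processed image path per line, and all function names are hypothetical.

```python
from pathlib import Path


def load_processed(log_file):
    """Return the set of image paths already recorded in the log file."""
    log_path = Path(log_file)
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_processed(log_file, image_path):
    """Append one processed image path to the log file."""
    with open(log_file, "a", encoding="utf-8") as handle:
        handle.write(f"{image_path}\n")


def pending_images(image_paths, log_file):
    """Keep only the images that are not in the log, preserving input order."""
    done = load_processed(log_file)
    return [path for path in image_paths if str(path) not in done]
```

Calling mark_processed only after the page's content has been saved keeps a crashed run from marking an unfinished page as done.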

The file run.py performs 7 main stages:

  • Step 1. Find the header and footer.
  • Step 2. Re-rotate the image based on the header corner.
  • Step 3. Clean the image.
  • Step 4. Call the Google OCR API (includes text detection and text recognition).
  • Step 5. Detect transaction lines.
    [Figure: processing-step-boder]

  • Step 6. Classify the transaction content of each line by content type.
    [Figure: read-transactions-border]

  • Step 7. Save the transaction content to CSV.
| TNX Date | Doc No | Debit | Credit | Balance | Transaction in detail | (note) |
|---|---|---|---|---|---|---|
| 13/10/2020 | 5091.55821 | | 100.000 | | 586062.131020.075756.Ung ho mien trung FT20287151644070 | page_1 |
| 13/10/2020 | 5091.56080 | | 1.000.000 | | 586279.131020.075829.Ung ho dong bao mien Trung FT20287592192480 | page_1 |
| 13/10/2020 | 5091.56138 | | 200.000 | | 219987.131020.075839.Trinh Thi Thu Thuy chuyen tien ung ho mien Trung | page_1 |
| 13/10/2020 | 5091.56155 | | 100.000 | | 586295.131020.075826.UH mien trung FT20287432289640 | page_1 |
| 13/10/2020 | 5078.68388 | | 500.000 | | MBVCB.807033343.PHAM THUY TRANG chuyen tien ung ho tu thien.CT tu 0561000606153 PHAM THUY TRANG toi 0181003469746 TRAN THI THUY TIEN | page_1 |
| 13/10/2020 | 5091.56261 | | 1.000.000 | | 184997.131020.075853.Em gui giup do ba con vung lu | page_1 |
| 13/10/2020 | 5078.68496 | | 200.000 | | MBVCB.807033583.Ung ho mien trung.CT tu 0051000531310 HUYNH THI NHU Y toi 0181003469746 TRAN THI THUY TIEN | page_1 |
| 13/10/2020 | 5078.68526 | | 100.000 | | MBVCB.807033514.ung ho mien trung.CT tu 0481000903279 NGUYEN THI HUONG AN toi 0181003469746 TRAN THI THUY TIEN | page_1 |
| 13/10/2020 | 5091.56381 | | 100.000 | | 479592.131020.075909.ho tro mien trung | page_1 |
| 13/10/2020 | 5078.68537 | | 500.000 | | MBVCB.807034561.Ung ho Mien trung.CT tu 0721000588146 LE THI HONG DIEM toi 0181003469746 TRAN THI THUY TIEN | page_1 |
| 13/10/2020 | 5091.56405 | | 200.000 | | 292363.131020.075845.Ngan hang TMCP Ngoai Thuong Viet Nam 0181003469746 LUC NGHIEM LE chuyen khoan ung ho mien trung | page_1 |
| 13/10/2020 | 5091.56410 | | 500.000 | | 479627.131020.075913.Ung ho mien trung | page_1 |
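
Steps 5 and 6 can be illustrated with a simplified sketch. It is not the logic in run.py: it assumes each OCR word has already been reduced to its text and box centre, groups words into lines by vertical position, then assigns each word to a column by horizontal position. The Word class, the function names, the tolerance and the pixel ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre of the OCR word box, in pixels
    y: float  # vertical centre of the OCR word box, in pixels


def group_into_lines(words, y_tolerance=10.0):
    """Group words whose vertical centre is within y_tolerance of the line's first word."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read the words from left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def classify_line(line, column_bounds):
    """Assign each word to the column whose [left, right) range contains its x centre."""
    row = {name: [] for name in column_bounds}
    for word in line:
        for name, (left, right) in column_bounds.items():
            if left <= word.x < right:
                row[name].append(word.text)
                break
    return {name: " ".join(parts) for name, parts in row.items()}


# Hypothetical pixel ranges; real values depend on the scan resolution.
COLUMN_BOUNDS = {"TNX Date": (0, 150), "Doc No": (150, 400), "Credit": (550, 700)}
```

A transaction whose detail text wraps over several printed lines would come out as several lines here, so a real implementation would still need to merge continuation lines, for example by starting a new transaction only when the date column is non-empty.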

3. Export Excel

Export each CSV directory to an Excel file. Example:

python tools/export-excel.py --csv-dir "data/content/TÀI KHOẢN XXX746 (Pass_ Vcbsaoke@2021)/TỪ 13.10.20 ĐẾN 23.11.20/1. TRANG 1 -1000.pdf"
usage: export-excel.py [-h] --csv-dir CSV_DIR [--output-dir OUTPUT_DIR] [--transaction-template TRANSACTION_TEMPLATE] [--filename FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  --csv-dir CSV_DIR     csv dir
  --output-dir OUTPUT_DIR
                        output dir
  --transaction-template TRANSACTION_TEMPLATE
                        dir to save transaction content
  --filename FILENAME   output filename, leave blank to set default
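
When merging per-page CSV files, ordering is easy to get wrong: sorted by name, page_10 comes before page_2. The sketch below shows one way to merge in numeric page order. It is not the code in tools/export-excel.py; it assumes one CSV per page with the page number in the file name, uses the third-party openpyxl package for the Excel step, and ignores the --transaction-template option. All function names are hypothetical.

```python
import csv
import re
from pathlib import Path


def page_number(path):
    """Extract the first integer in the file name, e.g. page_12.csv -> 12."""
    match = re.search(r"\d+", Path(path).stem)
    return int(match.group()) if match else 0


def read_rows_in_page_order(csv_dir):
    """Concatenate the rows of every CSV in csv_dir, ordered by page number."""
    rows = []
    for csv_path in sorted(Path(csv_dir).glob("*.csv"), key=page_number):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.reader(handle))
    return rows


def write_excel(rows, output_path):
    """Write all rows to a single worksheet (assumes openpyxl is installed)."""
    from openpyxl import Workbook  # imported lazily: third-party dependency

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(output_path)
```

If every page CSV starts with a header row, the header would need to be dropped from all but the first file before writing.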

4. Extract dataset

From the API response data, you can extract a dataset to train a text-recognition model:

python tools/export-dataset.py
usage: extract-dataset.py [-h] [--respone-dir RESPONE_DIR] [-a OUTPUT_ANNOTATION] [-i OUTPUT_IMAGE_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --respone-dir RESPONE_DIR
                        dir to api respone
  -a OUTPUT_ANNOTATION, --output-annotation OUTPUT_ANNOTATION
                        path to save annotation file
  -i OUTPUT_IMAGE_DIR, --output-image-dir OUTPUT_IMAGE_DIR
                        path to save annotation file
  • Dataset of the first 1000 pages, labeled by google-ocr (~336k): Google Drive
  • Tip: you may want to balance the data by text type before extracting.
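
The balancing tip can be sketched as follows. Statement pages contain far more short dates and amounts than free text, so one simple option is to cap the number of samples per text type. The three categories, the (image_path, label) sample shape and the function names are assumptions for illustration only; the README does not define what a text type is.

```python
import random
import re
from collections import defaultdict


def text_type(label):
    """Bucket a label into a coarse type: 'date', 'number' or 'text' (hypothetical categories)."""
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", label):
        return "date"
    if re.fullmatch(r"[\d.,]+", label):
        return "number"
    return "text"


def balance(samples, max_per_type, seed=0):
    """Keep at most max_per_type randomly chosen (image_path, label) samples per text type."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[text_type(sample[1])].append(sample)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for items in buckets.values():
        rng.shuffle(items)
        balanced.extend(items[:max_per_type])
    return balanced
```

Capping is the simplest option; oversampling the rare types is an alternative when discarding data is undesirable.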

5. Result

18107 transaction statement pages have been extracted from the PDF files: Google Drive. Accuracy >99%.
