AdemBoukhris457/Documents-Parsing-Lab


Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)


📝 Documents Parsing Lab

A curated collection of Jupyter notebooks for experimenting with state-of-the-art OCR, document parsing, table extraction, and chart understanding techniques. This repository enables easy benchmarking and practical usage of the latest open-source and cloud-based solutions for document image processing.

🚀 Doctra Quick Start

This section is a quick start guide to Doctra, a powerful tool for structured document parsing without Vision Language Models (VLMs). It shows how to parse PDF documents, extract structured content (text, tables, charts, figures), and generate multiple output formats.

Notebook	Description
01_doctra_quick_start.ipynb	Quick start guide for Doctra structured document parsing

📚 Notebooks Overview

Notebook	Description
bytedance-dolphin-image-parsing.ipynb	Document page parsing with Dolphin by ByteDance
Llama-3.1-Nemotron-Nano-VL-8B-V1_parsing_documents.ipynb	Testing the performance of document parsing with Llama-3.1-Nemotron-Nano-VL-8B-V1
docling-documents-parsing-and-tables-extraction.ipynb	Parsing and table extraction with Docling
typhoon-ocr-7b-docs-pages-parser.ipynb	Evaluating Typhoon_ocr_7b document parsing capabilities across various use cases
florence-2-large-ocr-documents-pages.ipynb	OCR of document pages using Florence 2 Large
florence-2-large-ocr-images-real-life-scenarios.ipynb	Real-life scenario OCR with Florence 2 Large
got-ocr2-0-docs-parsing.ipynb	Document page parsing with GOT-OCR2.0 and Gemini 2.5 Flash
marker-docs-parsing.ipynb	Marker-based document parsing experiments
mistralocr-docs-parsing.ipynb	Document parsing using MistralOCR
monkeyocr-docs-pages-parsing.ipynb	Document parsing with MonkeyOCR
Nanonets-OCR-s_docs_parsing.ipynb	Advanced document parsing using Nanonets-OCR-s
ollama-llama3-2-vision-usage.ipynb	Using Llama3-2 Vision for document parsing
paddleocr-3-0-docs-parsing.ipynb	Parsing with PaddleOCR 3.0 PP-StructureV3
pix2text-docs-pages-parsing.ipynb	Document parsing using Pix2Text
smoldocling-documents-understanding.ipynb	Document understanding with SmolDocling
zerox-pdf-parsing.ipynb	PDF parsing experiments with Zerox
qwen2-vl-2b-docs-parsing.ipynb	Document page parsing with Qwen2-VL-2B
OCRFlux_3B_Docs_Parsing.ipynb	Document parsing with OCRFlux-3B on Lightning AI
granite-docling-258m-document-parsing-review.ipynb	Evaluating IBM Granite DocLing 258M for document parsing and layout understanding

📑📊 Tables and Charts Recognition

This section includes notebooks focused on table and chart detection, structure recognition, and extraction from documents. It covers various open-source approaches and benchmarks for understanding table and chart layouts and content.

Notebook	Description
unitable-testing-for-table-structure-recognition.ipynb	Testing table detection and structure recognition with UniTable
deepdoctection-tables-recognition.ipynb	Evaluating Deepdoctection for table extraction across varied structures
gemini-2-5-pro-on-chart-and-table-extraction.ipynb	Chart/table extraction using Gemini 2.5 Pro
deplot-plots-to-tables-converter.ipynb	Converting charts into tables with DePlot
cohere-command-a-vision-charts-understanding.ipynb	Cohere Command A Vision for Charts Understanding
cohere-command-a-vision-tables-recognition.ipynb	Cohere Command A Vision for Tables Recognition
moondream2-charts-tables-interpretation.ipynb	Moondream2 for Charts and Tables understanding
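
Many table and chart extraction models return their result as a pipe-delimited Markdown table. The following is a minimal sketch of turning such output into Python records for further comparison. It is not taken from any notebook in this repository; the function name `parse_markdown_table` and the sample table are made up for illustration, and real model output may need extra cleanup.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    # zip() silently drops cells beyond the header width
    return [dict(zip(header, row)) for row in body]


# Illustrative sample only; these values are not results from the notebooks.
sample = """
| Quarter | Revenue |
|---------|---------|
| Q1      | 10      |
| Q2      | 12      |
"""
table = parse_markdown_table(sample)
print(table)
```

Keeping every cell as a string avoids guessing at number formats; convert the columns you need once the table structure has been checked.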

📑🔍 Structured Data Extraction

This section covers the structured data extraction phase, detailing methods to extract specific data from documents or images. It includes steps like OCR preprocessing, table extraction, named entity recognition (NER), and conversion to structured formats.

Notebook	Description
NuExtract-2-8b-structured-data-extraction.ipynb	NuExtract-2.0-8B for Structured Data Extraction
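
To make the "conversion to structured formats" step concrete, here is a rough sketch that maps OCR text to a JSON record using regular expressions. This is a deliberately simple rule-based stand-in, not the model-based approach used in the NuExtract notebook; the field names, patterns, and sample text are all hypothetical.

```python
import json
import re

# Hypothetical field patterns for an invoice-like document (assumption).
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)",
    "date": r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})",
}


def extract_fields(ocr_text: str, patterns: dict[str, str] = FIELD_PATTERNS) -> dict:
    """Return one value per field, or None when a pattern does not match."""
    record = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE)
        record[name] = match.group(1) if match else None
    return record


# Made-up OCR output used only to exercise the function.
ocr_text = "Invoice No: INV-001\nDate: 2024-01-15\nTotal: $1,250.00"
record = extract_fields(ocr_text)
print(json.dumps(record, indent=2))
```

Returning `None` for missing fields keeps the output schema fixed, which makes records from different documents easier to compare.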

📖 Project Goals

Benchmark different OCR/document parsing models on real documents.
Demonstrate table, chart, and text extraction workflows.
Compare open-source and commercial solutions.
Provide ready-to-use code snippets for rapid prototyping.
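
The notebooks mostly compare model outputs by inspection. If you want a number to go with the benchmarking goal above, one common option is the character error rate (CER): the edit distance between a reference transcription and the model output, divided by the reference length. The sketch below is an assumption about how you might score outputs, not a metric defined by this repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (lower is better)."""
    if not reference:
        # Convention chosen here: empty vs. empty is perfect, otherwise fully wrong.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

CER can exceed 1.0 when the model output is much longer than the reference, so treat it as an error ratio rather than a percentage.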

🛠️ Usage

Clone the repository:

git clone https://github.com/AdemBoukhris457/Docs_Parsing_Techniques.git

Install dependencies as needed for each notebook (see the first cells of each .ipynb file for requirements).
Launch Jupyter Notebook or JupyterLab and open any notebook of interest.
Run the cells and adapt the code for your documents.
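
Since each notebook declares its own requirements in its first cells, a small helper can list the install commands before you open anything. The sketch below relies only on the standard notebook JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`); the function name `install_commands` is hypothetical, and it assumes the notebooks install packages with `pip install` lines.

```python
import json
from pathlib import Path


def install_commands(notebook: dict) -> list[str]:
    """Return the pip install lines found in a notebook's code cells."""
    commands = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a string or a list of strings.
        if isinstance(source, list):
            source = "".join(source)
        for line in source.splitlines():
            stripped = line.strip()
            if stripped.startswith(("!pip install", "%pip install", "pip install")):
                commands.append(stripped)
    return commands


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    for path in sorted(Path(".").glob("*.ipynb")):
        notebook = json.loads(path.read_text(encoding="utf-8"))
        print(path.name, install_commands(notebook))
```

Review the printed commands before running them, since different notebooks may pin conflicting package versions; a separate virtual environment per notebook is the safer default.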

📌 Notes

Some notebooks require model weights or API keys; check the comments in each notebook for details.
Results, insights, and sample outputs are provided inline.
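
For the notebooks that need an API key, one way to avoid pasting the key into a cell is to read it from an environment variable. This is a minimal sketch, not code from the notebooks; the helper name `get_api_key` and the variable name in the usage comment are hypothetical, so use whatever name the relevant notebook or provider expects.

```python
import os


def get_api_key(env_var: str) -> str:
    """Read an API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Example usage (hypothetical variable name):
# api_key = get_api_key("MY_OCR_API_KEY")
```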

🔗 Related Resources

📂 You can find more notebooks, experiments, and datasets related to document parsing and OCR on my Kaggle profile: 👉 https://www.kaggle.com/ademboukhris/code


