renswickd/document-parser-collection
A comprehensive toolkit for extracting and processing content from PDF documents using various parsing technologies.
This project implements five different document parsing approaches:
- Unstructured.io API
- Llama Parse
- Mistral OCR
- Azure Document Intelligence
- Amazon Textract
Unstructured.io API
Strengths:
- Excellent at handling complex document layouts
- Advanced table extraction capabilities
- Maintains document structure and formatting
- Supports multiple document formats
Limitations:
- API rate limits on free tier
- Higher latency due to cloud processing
- Cost increases with document volume
Best For:
- Complex documents with mixed content
- Documents with tables and structured data
- Batch processing requirements
Llama Parse
Strengths:
- Strong text extraction capabilities
- Good handling of simple layouts
- Local processing option available
- Efficient for text-heavy documents
Limitations:
- Limited table extraction capabilities
- May struggle with complex layouts
- Requires more computational resources locally
Best For:
- Text-heavy documents
- Simple document layouts
- Local processing requirements
Mistral OCR
Strengths:
- Excellent OCR accuracy
- Good language support
- Handles handwritten text well
- Real-time processing capabilities
Limitations:
- Limited formatting preservation
- May struggle with complex tables
- Higher cost for high-volume processing
Best For:
- Documents with handwritten content
- Multi-language documents
- Real-time OCR requirements
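As a rough sketch of the Mistral OCR path: this assumes the `mistralai` Python SDK, its `ocr.process` call, and the `mistral-ocr-latest` model name. The helper names `pages_to_markdown` and `ocr_pdf_url` are illustrative and are not taken from this repository.

```python
import os


def pages_to_markdown(ocr_response) -> str:
    """Join the per-page markdown of an OCR response into one string."""
    return "\n\n".join(page.markdown for page in ocr_response.pages)


def ocr_pdf_url(document_url: str) -> str:
    """Run OCR on a reachable PDF URL and return the combined markdown."""
    # Imported lazily so pages_to_markdown works without the SDK installed.
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.ocr.process(
        model="mistral-ocr-latest",  # assumed model name
        document={"type": "document_url", "document_url": document_url},
    )
    return pages_to_markdown(response)
```

Keeping the page-joining step separate from the API call makes it possible to test the post-processing without network access or an API key.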
Azure Document Intelligence
Strengths:
- Advanced AI-powered extraction
- Excellent form field recognition
- Strong table extraction
- Built-in pretraining for common documents
Limitations:
- Azure platform lock-in
- Higher cost for large-scale processing
- Requires Azure subscription
Best For:
- Forms and structured documents
- Enterprise-scale deployments
- Integration with Azure services
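To illustrate the table extraction mentioned above, here is a minimal sketch that assumes the `azure-ai-formrecognizer` package and its `prebuilt-layout` model. The function names `table_to_rows` and `extract_tables` are hypothetical; the repository's own code may be organised differently.

```python
import os


def table_to_rows(table) -> list[list[str]]:
    """Turn a layout table (cells with row/column indices) into a 2D list."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def extract_tables(pdf_path: str) -> list[list[list[str]]]:
    """Analyze a PDF with the prebuilt layout model and return its tables."""
    # Imported lazily so table_to_rows works without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=os.environ["AZURE_ENDPOINT"],
        credential=AzureKeyCredential(os.environ["AZURE_API_KEY"]),
    )
    with open(pdf_path, "rb") as f:
        result = client.begin_analyze_document("prebuilt-layout", document=f).result()
    return [table_to_rows(t) for t in result.tables]
```

Cells that the service does not report stay as empty strings, so every row has the same length and can be passed directly to a CSV writer or a DataFrame constructor.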
Amazon Textract
Strengths:
- Excellent table extraction
- Good form field recognition
- Scales well for large volumes
- Strong integration with AWS
Limitations:
- AWS platform lock-in
- Cost can be high for large volumes
- Limited customization options
Best For:
- AWS ecosystem integration
- Large-scale document processing
- Forms and table extraction
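A minimal sketch of plain text detection with Textract, assuming `boto3` and its `detect_document_text` operation. The helper names are illustrative. Note that `boto3` reads the access key and secret from the environment on its own, while the region is passed explicitly here because the variable is named `AWS_REGION` in this README.

```python
import os


def lines_from_response(response: dict) -> list[str]:
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_lines(document_bytes: bytes) -> list[str]:
    """Send document bytes to Textract and return the detected lines."""
    # Imported lazily so lines_from_response works without boto3 installed.
    import boto3

    client = boto3.client("textract", region_name=os.environ["AWS_REGION"])
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return lines_from_response(response)
```

For forms and tables, the `analyze_document` operation with `FeatureTypes=["TABLES", "FORMS"]` returns additional block types; the same filtering pattern applies.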
- Sign up at Unstructured.io
- Obtain API key from dashboard
- Sign up at Llama Cloud
- Generate API key from dashboard
- Visit Mistral AI
- Create account and generate API key
- Create resource in Azure Portal
- Get endpoint URL and API key from resource settings
- Set up AWS account
- Create IAM user with Textract permissions
- Get AWS access key and secret
- Install Dependencies
    python -m venv venv
    source venv/bin/activate  # For Mac/Linux
    pip install -r requirements.txt
- Configure Environment Variables
Create a .env file:
    # Unstructured.io
    UNSTRUCTURED_API_KEY=your_key
    # Llama Parse
    LLAMA_API_KEY=your_key
    # Mistral
    MISTRAL_API_KEY=your_key
    # Azure
    AZURE_ENDPOINT=your_endpoint
    AZURE_API_KEY=your_key
    # AWS
    AWS_ACCESS_KEY_ID=your_key
    AWS_SECRET_ACCESS_KEY=your_secret
    AWS_REGION=your_region
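Before running a parser it can help to check that its variables are actually set. The sketch below uses only the standard library and the variable names listed above; the `REQUIRED_VARS` mapping and the `missing_vars` helper are illustrative and not part of this repository. It assumes the `.env` file has already been loaded into the process environment (for example with a package such as `python-dotenv`).

```python
import os
from collections.abc import Mapping

# Variables each parser needs, as listed in the .env example.
REQUIRED_VARS = {
    "Unstructured.io": ["UNSTRUCTURED_API_KEY"],
    "Llama Parse": ["LLAMA_API_KEY"],
    "Mistral OCR": ["MISTRAL_API_KEY"],
    "Azure Document Intelligence": ["AZURE_ENDPOINT", "AZURE_API_KEY"],
    "Amazon Textract": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
}


def missing_vars(env: Mapping[str, str] = os.environ) -> dict[str, list[str]]:
    """Map each parser to the variables that are unset or empty."""
    report = {}
    for parser, names in REQUIRED_VARS.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            report[parser] = missing
    return report


if __name__ == "__main__":
    for parser, names in missing_vars().items():
        print(f"{parser}: missing {', '.join(names)}")
```

Parsers that do not appear in the report are fully configured, so only a subset of the five services needs credentials at any one time.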
Simple Text Documents
- Recommended: Llama Parse or Unstructured.io
- Alternative: Mistral OCR
Forms and Structured Documents
- Recommended: Azure Document Intelligence or Amazon Textract
- Alternative: Unstructured.io
Complex Tables
- Recommended: Amazon Textract or Azure Document Intelligence
- Alternative: Unstructured.io
Handwritten Content
- Recommended: Mistral OCR
- Alternative: Azure Document Intelligence
Multi-Language Documents
- Recommended: Mistral OCR or Azure Document Intelligence
- Alternative: Amazon Textract
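The recommendations above can be expressed as a small lookup table, which is convenient when routing documents to a parser in code. This is only a sketch: the `RECOMMENDATIONS` mapping restates the list above, and the keys and the `recommend` function are hypothetical names.

```python
# Document type -> (recommended parsers, alternative parsers).
RECOMMENDATIONS = {
    "simple text": (["Llama Parse", "Unstructured.io"], ["Mistral OCR"]),
    "forms": (["Azure Document Intelligence", "Amazon Textract"], ["Unstructured.io"]),
    "complex tables": (["Amazon Textract", "Azure Document Intelligence"], ["Unstructured.io"]),
    "handwritten": (["Mistral OCR"], ["Azure Document Intelligence"]),
    "multi-language": (["Mistral OCR", "Azure Document Intelligence"], ["Amazon Textract"]),
}


def recommend(document_type: str) -> tuple[list[str], list[str]]:
    """Return (recommended, alternative) parsers for a document type."""
    key = document_type.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown document type {document_type!r}; expected one of: {known}")
    return RECOMMENDATIONS[key]


if __name__ == "__main__":
    recommended, alternative = recommend("Handwritten")
    print("Recommended:", recommended)
    print("Alternative:", alternative)
```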
Contributions are welcome! Please read our contributing guidelines and submit pull requests to our repository.
About
A collection of document parsers, with hands-on examples, for building structured data for your RAG applications.