Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.
NanoNets/llm-data-converter
https://github.com/NanoNets/docstrange
https://docstrange.nanonets.com/
Try Cloud Mode for Free!
Convert documents instantly with our cloud API - no setup required.
For unlimited processing, get your free API key.
Transform any document, image, or URL into LLM-ready formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.
- Cloud Processing (Default): Instant conversion with Nanonets API - no local setup needed
- Local Processing: CPU/GPU options for complete privacy and control
- Universal Input: PDFs, Word docs, Excel, PowerPoint, images, URLs, and raw text
- Smart Output: Markdown, JSON, CSV, HTML, and plain text formats
- LLM-Optimized: Clean, structured output perfect for AI processing
- Intelligent Extraction: Extract specific fields or structured data using AI
- Advanced OCR: Multiple OCR engines with automatic fallback
- Table Processing: Accurate table extraction and formatting
- Image Handling: Extract text from images and visual content
- URL Processing: Direct conversion from web pages
```bash
pip install llm-data-converter
```
```python
from llm_converter import FileConverter

# Default cloud mode - no setup required
converter = FileConverter()

# Convert any document
result = converter.convert("document.pdf")

# Get different output formats
markdown = result.to_markdown()
json_data = result.to_json()
html = result.to_html()
csv_tables = result.to_csv()

# Extract specific fields
extracted_fields = result.to_json(specified_fields=["title", "author", "date", "summary", "key_points"])

# Extract using JSON schema
schema = {
    "title": "string",
    "author": "string",
    "date": "string",
    "summary": "string",
    "key_points": ["string"],
    "metadata": {
        "page_count": "number",
        "language": "string"
    }
}
structured_data = result.to_json(json_schema=schema)
```
```python
# Get your free API key from https://app.nanonets.com/#/keys
converter = FileConverter(api_key="your_api_key_here")
result = converter.convert("document.pdf")
```
```python
# Force local CPU processing
converter = FileConverter(cpu_preference=True)

# Force local GPU processing (requires CUDA)
converter = FileConverter(gpu_preference=True)
```
- Markdown: Clean, LLM-friendly format with preserved structure
- JSON: Structured data with metadata and intelligent parsing
- HTML: Formatted output with styling and layout
- CSV: Extract tables and data in spreadsheet format
- Text: Plain text with smart formatting
```python
from llm_converter import FileConverter

converter = FileConverter()

# PDF document
pdf_result = converter.convert("report.pdf")
print(pdf_result.to_markdown())

# Word document
docx_result = converter.convert("document.docx")
print(docx_result.to_json())

# Excel spreadsheet
excel_result = converter.convert("data.xlsx")
print(excel_result.to_csv())

# PowerPoint presentation
pptx_result = converter.convert("slides.pptx")
print(pptx_result.to_html())

# Image with text
image_result = converter.convert("screenshot.png")
print(image_result.to_text())

# Web page
url_result = converter.convert("https://example.com")
print(url_result.to_markdown())
```
```python
# Extract all tables from a document
result = converter.convert("financial_report.pdf")
csv_data = result.to_csv(include_all_tables=True)
print(csv_data)
```
The library now uses intelligent document understanding for JSON conversion:
```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf")

# Enhanced JSON with Ollama (when available)
json_data = result.to_json()
print(json_data["format"])  # "ollama_structured_json" or "structured_json"

# The enhanced conversion provides:
# - Better document structure understanding
# - Intelligent table parsing
# - Automatic metadata extraction
# - Key information identification
# - Proper data type handling
```
Requirements for enhanced JSON (if using `cpu_preference=True`):
- Install: `pip install 'llm-data-converter[local-llm]'`
- Install Ollama and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
If Ollama is not available, the library automatically falls back to the standard JSON parser.
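The fallback is visible in the `format` field of the returned JSON, using the two values shown above. A minimal sketch of how a caller might branch on it; `used_enhanced_parser` is an illustrative helper, not part of the library:

```python
# Illustrative helper (not part of the library): inspect the "format" field
# that to_json() sets to see whether the Ollama-backed parser was used.
def used_enhanced_parser(json_result: dict) -> bool:
    """Return True if the Ollama-enhanced parser produced this result."""
    return json_result.get("format") == "ollama_structured_json"

# With Ollama running:
print(used_enhanced_parser({"format": "ollama_structured_json"}))  # True
# After automatic fallback to the standard parser:
print(used_enhanced_parser({"format": "structured_json"}))         # False
```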
```python
# Extract specific fields from any document
result = converter.convert("invoice.pdf")

# Method 1: Extract specific fields
extracted = result.to_json(specified_fields=["invoice_number", "total_amount", "vendor_name", "due_date"])

# Method 2: Extract using JSON schema
schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "line_items": [{
        "description": "string",
        "amount": "number"
    }]
}
structured = result.to_json(json_schema=schema)
```
How it works:
- Automatically uses cloud API when available
- Falls back to local Ollama for privacy-focused processing
- Same interface works for both cloud and local modes
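Because both modes expose the same `convert()`/`to_json()` interface, downstream code can stay backend-agnostic. A sketch of what that looks like in practice; the `extract_fields` wrapper is illustrative, not part of the library:

```python
# Illustrative wrapper (not part of the library): the same code path works
# whether `converter` was built for cloud or local processing.
def extract_fields(converter, path, fields):
    result = converter.convert(path)
    return result.to_json(specified_fields=fields)

# Works unchanged with either backend:
#   extract_fields(FileConverter(), "invoice.pdf", ["total_amount"])                     # cloud
#   extract_fields(FileConverter(cpu_preference=True), "invoice.pdf", ["total_amount"])  # local
```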
Cloud Mode Usage Examples:
```python
from llm_converter import FileConverter

# Default cloud mode (rate-limited without API key)
converter = FileConverter()

# With API key for unlimited access
converter = FileConverter(api_key="your_api_key_here")

# Extract specific fields from invoice
result = converter.convert("invoice.pdf")

# Extract key invoice information
invoice_fields = result.to_json(specified_fields=["invoice_number", "total_amount", "vendor_name", "due_date", "items_count"])
print("Extracted Invoice Fields:")
print(invoice_fields)
# Output: {"extracted_fields": {"invoice_number": "INV-001", ...}, "format": "specified_fields"}

# Extract structured data using schema
invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "billing_address": {
        "street": "string",
        "city": "string",
        "zip_code": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "number",
        "unit_price": "number",
        "total": "number"
    }],
    "taxes": {
        "tax_rate": "number",
        "tax_amount": "number"
    }
}
structured_invoice = result.to_json(json_schema=invoice_schema)
print("Structured Invoice Data:")
print(structured_invoice)
# Output: {"structured_data": {...}, "schema": {...}, "format": "structured_json"}

# Extract from different document types
receipt = converter.convert("receipt.jpg")
receipt_data = receipt.to_json(specified_fields=["merchant_name", "total_amount", "date", "payment_method"])

contract = converter.convert("contract.pdf")
contract_schema = {
    "parties": [{
        "name": "string",
        "role": "string"
    }],
    "contract_value": "number",
    "start_date": "string",
    "end_date": "string",
    "key_terms": ["string"]
}
contract_data = contract.to_json(json_schema=contract_schema)
```
Local extraction requirements (if using `cpu_preference=True`):
- Install ollama package: `pip install 'llm-data-converter[local-llm]'`
- Install Ollama and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
```python
# Perfect for LLM workflows
document_text = converter.convert("research_paper.pdf").to_markdown()

# Use with any LLM
response = your_llm_client.chat(messages=[
    {"role": "user", "content": f"Summarize this research paper:\n\n{document_text}"}
])
```
```bash
# Basic conversion (cloud mode default)
llm-converter document.pdf

# With API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY

# Local processing modes
llm-converter document.pdf --cpu-mode
llm-converter document.pdf --gpu-mode

# Different output formats
llm-converter document.pdf --output json
llm-converter document.pdf --output html
llm-converter document.pdf --output csv

# Extract specific fields
llm-converter invoice.pdf --output json --extract-fields invoice_number total_amount

# Extract with JSON schema
llm-converter document.pdf --output json --json-schema schema.json

# Multiple files
llm-converter *.pdf --output markdown

# Save to file
llm-converter document.pdf --output-file result.md

# Comprehensive field extraction examples
llm-converter invoice.pdf --output json --extract-fields invoice_number vendor_name total_amount due_date line_items

# Extract from different document types with specific fields
llm-converter receipt.jpg --output json --extract-fields merchant_name total_amount date payment_method
llm-converter contract.pdf --output json --extract-fields parties contract_value start_date end_date

# Using JSON schema files for structured extraction
llm-converter invoice.pdf --output json --json-schema invoice_schema.json
llm-converter contract.pdf --output json --json-schema contract_schema.json

# Combine with API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary

# Force local processing with field extraction (requires Ollama)
llm-converter document.pdf --cpu-mode --output json --extract-fields key_points conclusions recommendations
```
Example schema.json file:
```json
{
  "invoice_number": "string",
  "total_amount": "number",
  "vendor_name": "string",
  "billing_address": {
    "street": "string",
    "city": "string",
    "zip_code": "string"
  },
  "line_items": [{
    "description": "string",
    "quantity": "number",
    "unit_price": "number"
  }]
}
```

`FileConverter` configuration options:

```python
FileConverter(
    preserve_layout: bool = True,   # Preserve document structure
    include_images: bool = True,    # Include image content
    ocr_enabled: bool = True,       # Enable OCR processing
    api_key: str = None,            # API key for unlimited cloud access
    model: str = None,              # Model for cloud processing ("gemini", "openapi")
    cpu_preference: bool = False,   # Force local CPU processing
    gpu_preference: bool = False    # Force local GPU processing
)
```
```python
result.to_markdown() -> str   # Clean markdown output
result.to_json(               # Structured JSON
    specified_fields: List[str] = None,  # Extract specific fields
    json_schema: Dict = None             # Extract with schema
) -> Dict
result.to_html() -> str       # Formatted HTML
result.to_csv() -> str        # CSV format for tables
result.to_text() -> str       # Plain text
```
```python
converter = FileConverter(
    cpu_preference=True,   # Use local processing
    ocr_enabled=True,      # Enable OCR
    preserve_layout=True,  # Maintain structure
    include_images=True    # Process images
)
```
```bash
export NANONETS_API_KEY="your_api_key"
# Now all conversions use your API key automatically
```
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Email: support@nanonets.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Star this repo if you find it helpful! Your support helps us improve the library.