(eBook,PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.

CBIhalsen/PolyglotPDF

English |简体中文 |繁體中文 |日本語 |한국어

PolyglotPDF

Demo

Speed comparison

LLMs are now the preferred translation API; Doubao, Qwen, DeepSeek V3, and gpt-4o-mini are recommended. The color space error can be resolved by filling the white areas in PDF files. The old text-to-text translation API has been removed.

In addition, an arXiv search function and rendering of arXiv papers after LaTeX translation are under consideration.

Page previews

Chinese LLM API Application

Doubao & Deepseek

Apply through Volcengine platform:

Tongyi Qwen

Apply through Alibaba Cloud platform:

Overview

PolyglotPDF (EbookTranslation) is a PDF processing tool that uses specialized techniques for fast text, table, and formula recognition in PDF documents, typically completing recognition within 1 second. It features OCR capabilities and layout-preserving translation, with full document translations usually completed within 10 seconds (speed may vary depending on the translation API provider).

Features

  • Ultra-Fast Recognition: Processes text, tables, and formulas in PDFs within ~1 second
  • Layout-Preserving Translation: Maintains original document formatting while translating content
  • OCR Support: Handles scanned documents efficiently
  • Text-based PDF: No GPU required
  • Quick Translation: Complete PDF translation in approximately 10 seconds
  • Flexible API Integration: Compatible with various translation service providers
  • Web-based Comparison Interface: Side-by-side comparison of original and translated documents
  • Enhanced OCR Capabilities: Improved accuracy in text recognition and processing
  • Offline Translation: Uses a smaller translation model

Installation and Setup

There are several ways to use it. One is to install the library:

pip install EbookTranslator

Basic usage:

EbookTranslator your_file.pdf

Usage with parameters:

EbookTranslator your_file.pdf -o en -t zh -b 1 -e 10 -c /path/to/config.json -d 300

Using in Python Code

from EbookTranslator import main_function

translator = main_function(
    pdf_path="your_file.pdf",
    original_language="en",
    target_language="zh",
    bn=1,
    en=10,
    config_path="/path/to/config.json",
    DPI=300,
)
translator.main()

Parameter Description

| Parameter | Command Line Option | Description | Default Value |
|---|---|---|---|
| pdf_path | Positional argument | PDF file path | Required |
| original_language | -o, --original | Source language | auto |
| target_language | -t, --target | Target language | zh |
| bn | -b, --begin | Starting page number | 1 |
| en | -e, --end | Ending page number | Last page of the document |
| config_path | -c, --config | Configuration file path | config.json in the current working directory |
| DPI | -d, --dpi | DPI for OCR mode | 72 |
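
As an illustration of how the options in the table fit together, the sketch below declares them with Python's standard argparse module. It is not taken from the EbookTranslator source; the function name build_parser and the use of None to mean "last page" are assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the tool's real source)."""
    parser = argparse.ArgumentParser(prog="EbookTranslator")
    parser.add_argument("pdf_path", help="PDF file path")
    parser.add_argument("-o", "--original", default="auto", help="Source language")
    parser.add_argument("-t", "--target", default="zh", help="Target language")
    parser.add_argument("-b", "--begin", type=int, default=1, help="Starting page number")
    # Assumption: None stands for "last page of the document".
    parser.add_argument("-e", "--end", type=int, default=None, help="Ending page number")
    parser.add_argument("-c", "--config", default="config.json", help="Configuration file path")
    parser.add_argument("-d", "--dpi", type=int, default=72, help="DPI for OCR mode")
    return parser


# Parse the same options as the "Usage with parameters" command line.
args = build_parser().parse_args(
    ["your_file.pdf", "-o", "en", "-t", "zh", "-b", "1", "-e", "10", "-d", "300"]
)
print(args.original, args.target, args.begin, args.end, args.dpi)
```

Any option left out on the command line falls back to the default listed in the table.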

Configuration File

The configuration file is a JSON file, by default located at config.json in the current working directory. If it doesn't exist, the program will use built-in default settings.

Configuration File Example

{
  "count": 4,
  "PPC": 20,
  "translation_services": {
    "Doubao": {
      "auth_key": "",
      "model_name": ""
    },
    "Qwen": {
      "auth_key": "",
      "model_name": "qwen-plus"
    },
    "deepl": {
      "auth_key": ""
    },
    "deepseek": {
      "auth_key": "",
      "model_name": "ep-20250218224909-gps4n"
    },
    "openai": {
      "auth_key": "",
      "model_name": "gpt-4o-mini"
    },
    "youdao": {
      "app_key": "",
      "app_secret": ""
    }
  },
  "ocr_services": {
    "tesseract": {
      "path": "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
    }
  },
  "default_services": {
    "ocr_model": false,
    "line_model": false,
    "Enable_translation": true,
    "Translation_api": "openai"
  }
}
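
To show how a file with this layout might be consumed, here is a minimal sketch that loads config.json, falls back to defaults when the file is missing, and looks up the service named in Translation_api. The helper names load_config and active_service and the contents of FALLBACK_CONFIG are assumptions for illustration; the tool's actual built-in defaults are not documented here.

```python
import json
from pathlib import Path

# Hypothetical fallback values; the tool's real built-in defaults may differ.
FALLBACK_CONFIG = {
    "count": 4,
    "PPC": 20,
    "translation_services": {},
    "default_services": {"Enable_translation": True, "Translation_api": "openai"},
}


def load_config(path="config.json"):
    """Return the parsed config file, or the fallback when the file is missing."""
    config_file = Path(path)
    if not config_file.is_file():
        return FALLBACK_CONFIG
    with config_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def active_service(config):
    """Return (service name, its settings) for the configured translation API."""
    name = config["default_services"]["Translation_api"]
    settings = config.get("translation_services", {}).get(name, {})
    return name, settings


config = load_config("config.json")
name, settings = active_service(config)
print(name, "configured" if settings.get("auth_key") else "missing auth_key")
```

Checking for an empty auth_key before starting a translation run gives a clearer error than a failed API call partway through a document.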

Output

Translated PDF files will be saved in the directory specified by output_dir (default is the target folder in the current working directory).

Using the web UI

  1. Clone the repository:
git clone https://github.com/CBIhalsen/PolyglotPDF.git
cd PolyglotPDF
  2. Install required packages:
pip install -r requirements.txt
  3. Configure your API key in config.json. The alicloud translation API is not recommended.
  4. Run the application:
python app.py
  5. Access the web interface: open your browser and navigate to http://127.0.0.1:8000

Requirements

  • Python 3.8+
  • deepl==1.17.0
  • Flask==2.0.1
  • Flask-Cors==5.0.0
  • langdetect==1.0.9
  • Pillow==10.2.0
  • PyMuPDF==1.24.0
  • pytesseract==0.3.10
  • requests==2.31.0
  • tiktoken==0.6.0
  • Werkzeug==2.0.1

Acknowledgments

This project leverages PyMuPDF's capabilities for efficient PDF processing and layout preservation.

Upcoming Improvements

  • PDF chat functionality
  • Academic PDF search integration
  • Optimization for even faster processing speeds

Known Issues

  • Issue Description: Error during text re-editing: code=4: only Gray, RGB, and CMYK colorspaces supported
  • Symptom: Unsupported color space encountered during text block editing
  • Current Workaround: Skip text blocks with unsupported color spaces
  • Proposed Solution: Switch to OCR mode for entire pages containing unsupported color spaces
  • Example: View PDF sample with unsupported color spaces
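
The workaround and the proposed solution can be sketched in plain Python as follows. This is not the project's code: edit_block is a hypothetical callable standing in for the real text re-editing step, and because the exception type is not stated above, the sketch matches on the error message instead.

```python
UNSUPPORTED_COLORSPACE = "only Gray, RGB, and CMYK colorspaces supported"


def edit_page_blocks(blocks, edit_block):
    """Apply edit_block to each block, skipping blocks that hit the colorspace error.

    Returns (edited, skipped). edit_block is a hypothetical stand-in for the
    project's text re-editing step.
    """
    edited, skipped = [], []
    for block in blocks:
        try:
            edited.append(edit_block(block))
        except Exception as error:
            if UNSUPPORTED_COLORSPACE not in str(error):
                raise  # unrelated failure: do not hide it
            skipped.append(block)  # current workaround: skip this block
    return edited, skipped


def needs_ocr_fallback(skipped):
    """Proposed solution: send the whole page to OCR mode if any block was skipped."""
    return len(skipped) > 0
```

Re-raising unrelated errors keeps the workaround narrow, so only the known colorspace failure is silently skipped.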

Font Optimization

Current font configuration in the start function of main.py:

# Current configuration
css = f"* {{font-family:{get_font_by_language(self.target_language)};font-size:auto;color: #111111 ;font-weight:normal;}}"

You can optimize font display through the following methods:

  1. Modify Default Font Configuration
# Custom font styles (comments inside the string must use CSS syntax)
css = f"""* {{
    font-family: {get_font_by_language(self.target_language)};
    font-size: auto;
    color: #111111;
    font-weight: normal;
    letter-spacing: 0.5px;  /* Adjust letter spacing */
    line-height: 1.5;       /* Adjust line height */
}}"""
  2. Embed Custom Fonts
You can embed custom fonts by following these steps:
  • Place font files (.ttf, .otf) in the project's fonts directory
  • Use @font-face to declare custom fonts in CSS
css = f"""@font-face {{
    font-family: 'CustomFont';
    src: url('fonts/your-font.ttf') format('truetype');
}}
* {{
    font-family: 'CustomFont', {get_font_by_language(self.target_language)};
    font-size: auto;
    font-weight: normal;
}}"""
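
Both methods can be combined in one helper that assembles the CSS string, which avoids hand-escaping braces in f-strings. The sketch below is an assumption-laden illustration: build_css and its parameters are hypothetical names, base_family stands in for the value returned by get_font_by_language, and format('truetype') assumes a .ttf file.

```python
def build_css(base_family, custom_font_file=None, letter_spacing=None, line_height=None):
    """Build the CSS string; base_family stands in for get_font_by_language(...)."""
    parts = []
    family = base_family
    if custom_font_file:
        # Assumption: the custom font is a TrueType (.ttf) file.
        parts.append(
            "@font-face { font-family: 'CustomFont'; "
            f"src: url('{custom_font_file}') format('truetype'); }}"
        )
        family = f"'CustomFont', {base_family}"
    declarations = [
        f"font-family: {family}",
        "font-size: auto",
        "color: #111111",
        "font-weight: normal",
    ]
    if letter_spacing:
        declarations.append(f"letter-spacing: {letter_spacing}")
    if line_height:
        declarations.append(f"line-height: {line_height}")
    parts.append("* { " + "; ".join(declarations) + "; }")
    return "\n".join(parts)


print(build_css("sans-serif", custom_font_file="fonts/your-font.ttf", line_height="1.5"))
```

Listing the custom font first and the language-specific family second keeps a fallback available when the custom font lacks a glyph.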

Basic Principles

This project follows the same basic principles as PDF editing in Adobe Acrobat DC, using PyMuPDF for text block recognition and manipulation:

  • Core Process:
import fitz  # PyMuPDF

doc = fitz.open("your_file.pdf")  # placeholder file name
page = doc[0]

# Get text blocks from the page
blocks = page.get_text("dict")["blocks"]

# Process each text block
for block in blocks:
    if block.get("type") == 0:  # text block
        bbox = block["bbox"]  # text block boundary
        text = ""
        font_info = None
        # Collect text and font information
        for line in block["lines"]:
            for span in line["spans"]:
                text += span["text"] + " "

This approach directly processes PDF text blocks, maintaining the original layout while achieving efficient text extraction and modification.

  • Technical Choices:

    • Utilizes PyMuPDF for PDF parsing and editing
    • Focuses on text processing
    • Avoids complex operations like AI formula recognition, table processing, or page restructuring
  • Why Avoid Complex Processing:

    • AI recognition of formulas, tables, and PDF restructuring faces severe performance bottlenecks
    • Complex AI processing leads to high computational costs
    • Significantly increased processing time (potentially tens of seconds or more)
    • Difficult to deploy at scale with low costs in production environments
    • Not suitable for online services requiring quick response times
  • Project Scope:

    • This project only serves to demonstrate the correct approach for layout-preserved PDF translation and AI-assisted PDF reading. Converting PDF files to markdown format for large language models to read, in my opinion, is not a wise approach.
    • Aims for optimal performance-to-cost ratio
  • Performance:

    • PolyglotPDF API response time: ~1 second per page
    • Low computational resource requirements, suitable for scale deployment
    • High cost-effectiveness for commercial applications
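
The text-block loop from the core process above can also be written as a small function over the dictionary returned by page.get_text("dict"), which makes it testable without opening a PDF. This is a sketch, not the project's implementation: collect_text_blocks is a hypothetical name, and using the first span's font as the block's representative font is an assumption.

```python
def collect_text_blocks(page_dict):
    """Return one record per text block from a page.get_text("dict") result."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        spans = [span for line in block["lines"] for span in line["spans"]]
        if not spans:
            continue
        records.append(
            {
                "bbox": tuple(block["bbox"]),
                "text": " ".join(span["text"] for span in spans).strip(),
                # Assumption: the first span's font represents the whole block.
                "font": spans[0].get("font"),
                "size": spans[0].get("size"),
            }
        )
    return records


# Hand-built sample in the same shape as PyMuPDF's "dict" output.
sample = {
    "blocks": [
        {
            "type": 0,
            "bbox": (72, 72, 300, 90),
            "lines": [
                {
                    "spans": [
                        {"text": "Hello", "font": "Helvetica", "size": 11.0},
                        {"text": "world", "font": "Helvetica", "size": 11.0},
                    ]
                }
            ],
        },
        {"type": 1, "bbox": (0, 0, 10, 10)},
    ]
}
print(collect_text_blocks(sample))
```

Each record keeps the bounding box next to the text, which is what allows a translation to be written back into the original position.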

Related questions are answered and discussed in the QQ group: 1031477425
