feat(pdf): Add smart PDF routing with local text extraction #2811

Draft
abimaelmartell wants to merge 1 commit into main from abi/inspect-pdf-before-processing
abimaelmartell (Member) commented Feb 7, 2026 (edited by cubic-dev-ai bot)

Add intelligent PDF processing that detects whether a PDF is text-based or scanned before choosing an extraction method. PDFs classified as text-based with ≥80% confidence are now extracted locally using the Rust pdf-inspector library. The flow also defers base64 encoding until MinerU is actually needed, and adds structured logging for monitoring which extraction method was used.


Summary by cubic

Adds smart PDF routing that inspects each file and picks the fastest extractor. Text PDFs (≥80% confidence) are extracted locally in Rust; scanned/image PDFs fall back to MinerU (RunPod) or pdf-parse.
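
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `PdfInspection` shape, `planExtraction`, `runPlan`, and the `onFailure` callback are hypothetical names, and the extractors are synchronous only to keep the sketch short. The 0.8 threshold and the local → MinerU → pdf-parse order come from the description.

```typescript
// Hypothetical shapes: the PR summary names detectPdfType but not its return type.
type ExtractionMethod = "local" | "mineru" | "pdf-parse";

interface PdfInspection {
  kind: "text" | "scanned";
  confidence: number; // assumed to be in the range 0..1
}

// Threshold taken from the PR description (text PDFs at >= 80% confidence).
const TEXT_CONFIDENCE_THRESHOLD = 0.8;

// Returns the extractors to try, in order of preference.
function planExtraction(inspection: PdfInspection, mineruEnabled: boolean): ExtractionMethod[] {
  const plan: ExtractionMethod[] = [];
  if (inspection.kind === "text" && inspection.confidence >= TEXT_CONFIDENCE_THRESHOLD) {
    plan.push("local"); // fast path: no GPU/OCR needed
  }
  if (mineruEnabled) {
    plan.push("mineru");
  }
  plan.push("pdf-parse"); // last-resort fallback
  return plan;
}

// Synchronous for brevity; real extractors would likely be async.
type Extractor = () => string;

// Tries each planned method in turn, reporting failures and returning the first success.
function runPlan(
  plan: ExtractionMethod[],
  extractors: Record<ExtractionMethod, Extractor>,
  onFailure: (method: ExtractionMethod, error: unknown) => void,
): { method: ExtractionMethod; markdown: string } {
  for (const method of plan) {
    try {
      return { method, markdown: extractors[method]() };
    } catch (error) {
      onFailure(method, error); // e.g. forward to an error tracker
    }
  }
  throw new Error("all PDF extraction methods failed");
}
```

Returning the chosen `method` alongside the markdown makes it straightforward to emit a single summary log entry for the method that finally succeeded.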

  • New Features

    • Added detectPdfType and extractPdfToMarkdown via Rust pdf-inspector with N-API bindings and structured logging.
    • Routed extraction in the scraper: local (text, ≥0.8) → RunPod MinerU (MU) → pdf-parse, with Sentry capturing failures and a final summary log of the chosen method.
  • Performance

    • Skips GPU/OCR for text PDFs with fast local markdown extraction.
    • Defers base64 encoding until MU is needed and checks size before MU to avoid oversized uploads; added per-stage timing for monitoring.
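
The deferred encoding and pre-upload size check could look something like the sketch below. All names here (`lazy`, `prepareMineruPayload`, `Base64Encoder`) and the `maxBytes` parameter are assumptions for illustration; the PR does not state the actual size limit or helper names. The encoder is passed in so the sketch does not depend on a particular runtime.

```typescript
// Hypothetical helper names; the encoder is injected so the sketch stays runtime-agnostic.
type Base64Encoder = (bytes: Uint8Array) => string;

// Wraps a computation so it runs at most once, on first use.
function lazy<T>(compute: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: compute() };
    }
    return cached.value;
  };
}

// Returns null when the PDF is too large to upload; otherwise returns a thunk
// that base64-encodes the PDF only when (and if) it is actually called.
function prepareMineruPayload(
  pdf: Uint8Array,
  maxBytes: number,
  encode: Base64Encoder,
): (() => string) | null {
  if (pdf.byteLength > maxBytes) {
    return null; // size check happens before any encoding cost is paid
  }
  return lazy(() => encode(pdf));
}
```

In Node.js the encoder could be `(bytes) => Buffer.from(bytes).toString("base64")`. If local extraction succeeds, the thunk is never invoked and no encoding work is done.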

Written for commit 7ad32b3. Summary will update on new commits.

abimaelmartell force-pushed the abi/inspect-pdf-before-processing branch from f738d07 to 7ad32b3 on February 7, 2026 06:49