We've been evaluating document parsing systems internally for a while now, and today we're releasing SCORE-Bench to the community: 224 real-world documents with expert annotations, plus our complete evaluation methodology and code.

The dataset includes documents that challenge parsers in production: scanned forms with visual degradation, financial reports with deeply nested tables, multi-column layouts, mixed handwriting and printed text. Every document has been manually annotated by domain experts rather than algorithmically labeled.

We built SCORE, our evaluation framework, to handle the tricky parts of comparing generative systems fairly, such as recognizing when different structural representations are semantically equivalent. We've already open-sourced the framework and methodology. With SCORE-Bench, you can now benchmark your own systems using the same approach, reproduce results, and track progress as document parsing evolves.
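To make "semantically equivalent structural representations" concrete, here is a minimal sketch, not SCORE's actual implementation: it assumes one parser emits a table as HTML and another as a Markdown pipe table, and treats them as equivalent when they normalize to the same grid of cell text. The helper names (`grid_from_html`, `grid_from_markdown`) and the normalization rule (collapse whitespace, ignore case) are illustrative assumptions.

```python
from html.parser import HTMLParser


class _TableGrid(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def _norm(cell):
    # Assumption: whitespace and letter case are formatting noise, not content.
    return " ".join(cell.split()).lower()


def grid_from_html(html):
    parser = _TableGrid()
    parser.feed(html)
    parser.close()
    return [[_norm(c) for c in row] for row in parser.rows if row]


def grid_from_markdown(md):
    grid = []
    for line in md.strip().splitlines():
        cells = line.strip().strip("|").split("|")
        # Skip the header separator row, e.g. |------|:-----:|
        if all(c.strip() and set(c.strip()) <= set("-:") for c in cells):
            continue
        grid.append([_norm(c) for c in cells])
    return grid


html_table = (
    "<table><tr><th>Item</th><th>Total</th></tr>"
    "<tr><td>Fees</td><td> 1,200 </td></tr></table>"
)
md_table = """
| Item | Total |
|------|-------|
| Fees | 1,200 |
"""

# Different markup, same underlying grid of cells.
same = grid_from_html(html_table) == grid_from_markdown(md_table)
print(same)
```

A plain string diff would score these two outputs as almost entirely different even though they carry the same table, which is the kind of unfairness a format-aware comparison is meant to avoid.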

What we're releasing:

We built this to solve our own evaluation needs, but figured the community might find it useful too. We'd love to hear which documents usually break your systems, what other evaluation scenarios you use, and, most importantly, how we could make this more useful for your use case. Feel free to reach out to us!

New Notebook Alert: RAG Over Evolving Documents

u/Ajay_Unstructured

Tutorial

Connectors are soooo underrated. We take so many of their features for granted as we use them.
In this notebook, I cover how change detection works and how you can keep your databases up to date with the latest documents, all by setting a single boolean input :)

Check it out!

Upcoming Webinar: RAG Over Evolving Enterprise Knowledge

u/Ajay_Unstructured

Event

Your enterprise knowledge base is constantly changing: new document versions, updated policies, revised procedures. Most RAG pipelines handle this by reprocessing everything on every sync, which is expensive and slow. You need a way to track what has actually changed and process only those deltas without breaking your pipeline.

What we're covering:

Understand incremental processing: when to use delta-aware syncs vs. full reprocessing, and why it matters
See how connectors detect changes across different storage systems and track evolving files
Follow a practical example with a live walkthrough of evolving documents

The core question: How do you keep your RAG system current with evolving knowledge without wasting compute and time reprocessing unchanged content?
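As a rough illustration of the delta-aware idea, here is a minimal sketch that assumes change detection is done by comparing content hashes against fingerprints saved at the previous sync. This is not how any particular connector works (real ones may rely on modification timestamps, version IDs, or ETags instead); `plan_sync` and the in-memory `source`/`index` dictionaries are hypothetical stand-ins for a storage system and a sync-state store.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to decide whether a file changed since the last sync."""
    return hashlib.sha256(content).hexdigest()


def plan_sync(source: dict[str, bytes], index: dict[str, str]):
    """Compare current source files with the fingerprints stored at the last sync.

    Returns (to_process, to_delete, new_index):
      to_process: new or modified paths that need (re)processing
      to_delete:  paths that disappeared from the source
      new_index:  fingerprints to persist for the next sync
    """
    new_index = {path: fingerprint(data) for path, data in source.items()}
    to_process = sorted(p for p, h in new_index.items() if index.get(p) != h)
    to_delete = sorted(p for p in index if p not in new_index)
    return to_process, to_delete, new_index


# First sync: nothing is indexed yet, so every file is processed.
v1 = {"policy.pdf": b"policy v1", "faq.md": b"hello"}
first_process, first_delete, index = plan_sync(v1, {})

# Second sync: one file was revised and one was added; faq.md is untouched.
v2 = {"policy.pdf": b"policy v2", "faq.md": b"hello", "new.docx": b"draft"}
second_process, second_delete, index = plan_sync(v2, index)

# Third sync: faq.md was removed from the source.
v3 = {"policy.pdf": b"policy v2", "new.docx": b"draft"}
third_process, third_delete, index = plan_sync(v3, index)

print(first_process, second_process, third_delete)
```

Only the paths in `to_process` go back through parsing, chunking, and embedding, and the paths in `to_delete` are removed from the destination, so unchanged files cost one hash comparison rather than a full reprocessing run.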

📺 Dec 4 at 10am PT / 1pm ET

Stay for a live demo and Q&A session with the experts.

ETL+ for GenAI Data

r/UnstructuredIO Rules

Rule 1: Be respectful

Rule 2: Stay Relevant

Rule 3: No Spam

Rule 4: Share to Help

Rule 5: Keep It Safe

r/UnstructuredIO