dotnet/extensions

[API Proposal]: DataIngestion: Document representation #6893

Closed

Labels

api-approved (API was approved in API review, it can be implemented), area-data-ingestion, blocking (API Review Board to prioritise)

Description

adamsitnik opened on Oct 6, 2025

Background and motivation

DataIngestion is an ETL process: a `DocumentReader` parses a given document and represents it with the `Document` type, which is then processed by 0-n processors, split into `Chunk`s, and persisted in a vector store to enable vector search and RAG.

`Document` is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It is composed of `DocumentSection` objects, which can contain nested elements (including subsections). Each `DocumentElement` has to provide its content in Markdown format.

Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

API Proposal

```csharp
namespace Microsoft.Extensions.DataIngestion;

public abstract class DocumentElement
{
    protected internal DocumentElement(string markdown);
    protected internal DocumentElement(); // ctor used by Section where providing Markdown up-front is not mandatory

    public virtual string Markdown { get; }
    public string? Text { get; set; }
    public int? PageNumber { get; set; }
    public Dictionary<string, object?> Metadata { get; }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection(string markdown) : base(markdown);
    public DocumentSection() : base();

    public List<DocumentElement> Elements { get; }
    public override string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown);
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown);

    public int? Level { get; set; }
}

public sealed class DocumentFooter : DocumentElement
{
    public DocumentFooter(string markdown) : base(markdown);
}

public sealed class DocumentTable : DocumentElement
{
    public DocumentTable(string markdown, string[,] cells) : base(markdown);

    // This information is useful when chunking large tables that exceed token count limit.
    public string[,] Cells { get; }
}

public sealed class DocumentImage : DocumentElement
{
    public DocumentImage(string markdown) : base(markdown);

    public ReadOnlyMemory<byte>? Content { get; set; }
    public string? MediaType { get; set; }
    public string? AlternativeText { get; set; }
}

public sealed class Document : IEnumerable<DocumentElement>
{
    public Document(string identifier);

    public string Identifier { get; }
    public List<DocumentSection> Sections { get; }
    public string Markdown { get; set; }

    /// <summary>
    /// Iterate over all elements in the document, including those in nested sections.
    /// </summary>
    /// <remarks>
    /// Sections themselves are not included.
    /// </remarks>
    public IEnumerator<DocumentElement> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
}
```
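The comment on `DocumentTable.Cells` says the raw cells help when a large table exceeds the token limit. The following is a minimal sketch of one way a chunker might use them, not something the proposal defines. It assumes that row 0 of the array is the header row and uses a row budget instead of a real token count; `SplitTable`, `AppendRow`, and `maxRowsPerChunk` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sample data: row 0 is assumed to be the header row.
string[,] cells =
{
    { "Name", "Qty" },
    { "Apples", "3" },
    { "Pears", "5" },
    { "Plums", "7" }
};

List<string> chunks = SplitTable(cells, maxRowsPerChunk: 2);
foreach (string chunk in chunks)
{
    Console.WriteLine(chunk);
}

// Hypothetical helper, not part of the proposal: renders the table as several
// Markdown tables, each repeating the header row and holding at most
// maxRowsPerChunk data rows. A real chunker would budget tokens, not rows.
static List<string> SplitTable(string[,] cells, int maxRowsPerChunk)
{
    if (maxRowsPerChunk < 1)
    {
        throw new ArgumentOutOfRangeException(nameof(maxRowsPerChunk));
    }

    int rows = cells.GetLength(0);
    int columns = cells.GetLength(1);
    var result = new List<string>();

    for (int start = 1; start < rows; start += maxRowsPerChunk)
    {
        var sb = new StringBuilder();
        AppendRow(sb, cells, 0, columns); // header row, repeated in every chunk
        sb.Append('|');
        for (int c = 0; c < columns; c++)
        {
            sb.Append(" --- |"); // Markdown separator row
        }
        sb.Append('\n');

        int end = Math.Min(start + maxRowsPerChunk, rows);
        for (int r = start; r < end; r++)
        {
            AppendRow(sb, cells, r, columns);
        }
        result.Add(sb.ToString());
    }

    return result;
}

// Appends one table row in Markdown pipe syntax.
static void AppendRow(StringBuilder sb, string[,] cells, int row, int columns)
{
    sb.Append('|');
    for (int c = 0; c < columns; c++)
    {
        sb.Append(' ').Append(cells[row, c]).Append(" |");
    }
    sb.Append('\n');
}
```

Repeating the header in every chunk keeps each piece interpretable on its own, which the single `Markdown` string of the whole table cannot offer once it has to be cut.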

API Usage

The following example uses the `Document` API to build a simple structured document:

```csharp
Document doc = new("doc")
{
    Sections =
    {
        new DocumentSection()
        {
            Elements =
            {
                new DocumentHeader("# Section title"),
                new DocumentParagraph("This is a paragraph in section 1."),
                new DocumentParagraph("This is another paragraph in section 1."),
                new DocumentSection
                {
                    Elements =
                    {
                        new DocumentHeader("## Subsection title"),
                        new DocumentParagraph("This is a paragraph in subsection 1.1."),
                        new DocumentParagraph("This is another paragraph in subsection 1.1.")
                    }
                }
            }
        }
    }
};
```

The next example iterates over all elements and extracts their semantic content, to be used for generating embeddings:

```csharp
foreach (DocumentElement element in documents)
{
    string? semanticContent = element is DocumentImage img
        ? img.AlternativeText ?? img.Text
        : element.Markdown;

    if (!string.IsNullOrEmpty(semanticContent))
    {
        yield return (element, semanticContent);
    }
}
```

Alternative Designs

Naming: `Document` may be a bit too generic. My current best idea for a different name is `DocumentContent`.

The fact that `Document` implements `IEnumerable<DocumentElement>` may be hard for end users to discover. Because of that, it may be easier to just add an explicit `Flatten` method:

```csharp
public sealed class Document
{
    public IEnumerator<DocumentElement> Flatten();
}
```
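To make the intended traversal concrete, here is a minimal sketch of the flattening semantics documented on `GetEnumerator` (nested elements are included, sections themselves are not). It is not the proposed implementation. `ElementStandIn`, `LeafStandIn`, `SectionStandIn`, and `Flattening` are hypothetical stand-ins for the proposed types, and the sketch returns `IEnumerable<T>` rather than the `IEnumerator<T>` shown above so that it composes with `foreach` and LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var root = new SectionStandIn
{
    Elements =
    {
        new LeafStandIn("# Section title"),
        new LeafStandIn("This is a paragraph in section 1."),
        new SectionStandIn
        {
            Elements =
            {
                new LeafStandIn("## Subsection title"),
                new LeafStandIn("This is a paragraph in subsection 1.1.")
            }
        }
    }
};

List<string> flattened = Flattening.Flatten(new[] { root })
    .Select(element => element.Markdown)
    .ToList();

foreach (string markdown in flattened)
{
    Console.WriteLine(markdown);
}

// Stand-in types (assumptions), reduced to what the traversal needs.
abstract class ElementStandIn
{
    protected ElementStandIn(string markdown) => Markdown = markdown;
    public string Markdown { get; }
}

sealed class LeafStandIn : ElementStandIn
{
    public LeafStandIn(string markdown) : base(markdown) { }
}

sealed class SectionStandIn : ElementStandIn
{
    public SectionStandIn() : base(string.Empty) { }
    public List<ElementStandIn> Elements { get; } = new();
}

static class Flattening
{
    // Depth-first traversal in document order that yields leaf elements only;
    // sections are descended into but never yielded themselves.
    public static IEnumerable<ElementStandIn> Flatten(IEnumerable<ElementStandIn> elements)
    {
        foreach (ElementStandIn element in elements)
        {
            if (element is SectionStandIn section)
            {
                foreach (ElementStandIn nested in Flatten(section.Elements))
                {
                    yield return nested;
                }
            }
            else
            {
                yield return element;
            }
        }
    }
}
```

With the sample tree, the four leaf elements come back in document order and neither section appears in the result.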

Risks

Exposing a public `Dictionary<string, object>` can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to allow storing any information provided by the `DocumentReader` that is specific to a given implementation. Examples:

`ConfidenceScore`: a double
`BoundingRegions`: a list of `BoundingRegion` structs (X, Y, Width, Height)

`DocumentElement.Metadata` is not consumed by any of the chunkers we provide as of now (we focus on RAG), but users may rely on it to implement more advanced scenarios, such as recreating a document while preserving its structure.
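One concrete form the serialization concern can take is shown in the sketch below. It assumes `System.Text.Json` as the serializer, which the proposal does not specify, and uses a plain dictionary in place of `DocumentElement.Metadata`. A `double` stored under the `ConfidenceScore` key does not come back as a `double` after a round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Metadata as a reader might populate it; "ConfidenceScore" is the example key from the text.
var metadata = new Dictionary<string, object?> { ["ConfidenceScore"] = 0.93 };

string json = JsonSerializer.Serialize(metadata);
Dictionary<string, object?> roundTripped =
    JsonSerializer.Deserialize<Dictionary<string, object?>>(json)!;

object? value = roundTripped["ConfidenceScore"];

// The static type object carries no hint about the original runtime type,
// so System.Text.Json materializes the value as a JsonElement.
Console.WriteLine(value is double);      // False
Console.WriteLine(value is JsonElement); // True

// The consumer has to know the expected type and convert explicitly.
double score = ((JsonElement)value!).GetDouble();
Console.WriteLine(score);
```

Custom structs such as `BoundingRegion` would face the same loss of type information, so a durable pipeline would likely need either a type discriminator or a restricted set of supported value types.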

Metadata

Assignees

adamsitnik
