[API Proposal]: DataIngestion: Document representation #6893

Closed
Assignees: adamsitnik
Labels: api-approved (API was approved in API review, it can be implemented), area-data-ingestion, blocking (API Review Board to prioritise)
@adamsitnik

Background and motivation

DataIngestion is an ETL process: a DocumentReader parses a given document and represents it as a Document, which is then processed by zero or more processors, split into Chunks, and persisted in a vector store to enable vector search and RAG.

Document is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It is composed of DocumentSection objects, which can contain nested elements (including subsections). Each DocumentElement has to provide its content in Markdown format.

Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.

API Proposal

    namespace Microsoft.Extensions.DataIngestion;

    public abstract class DocumentElement
    {
        protected internal DocumentElement(string markdown);
        // ctor used by Section where providing Markdown up-front is not mandatory
        protected internal DocumentElement();

        public virtual string Markdown { get; }
        public string? Text { get; set; }
        public int? PageNumber { get; set; }
        public Dictionary<string, object?> Metadata { get; }
    }

    public sealed class DocumentSection : DocumentElement
    {
        public DocumentSection(string markdown) : base(markdown);
        public DocumentSection() : base();

        public List<DocumentElement> Elements { get; }
        public override string Markdown { get; }
    }

    public sealed class DocumentParagraph : DocumentElement
    {
        public DocumentParagraph(string markdown) : base(markdown);
    }

    public sealed class DocumentHeader : DocumentElement
    {
        public DocumentHeader(string markdown) : base(markdown);

        public int? Level { get; set; }
    }

    public sealed class DocumentFooter : DocumentElement
    {
        public DocumentFooter(string markdown) : base(markdown);
    }

    public sealed class DocumentTable : DocumentElement
    {
        public DocumentTable(string markdown, string[,] cells) : base(markdown);

        // This information is useful when chunking large tables that exceed token count limit.
        public string[,] Cells { get; }
    }

    public sealed class DocumentImage : DocumentElement
    {
        public DocumentImage(string markdown) : base(markdown);

        public ReadOnlyMemory<byte>? Content { get; set; }
        public string? MediaType { get; set; }
        public string? AlternativeText { get; set; }
    }

    public sealed class Document : IEnumerable<DocumentElement>
    {
        public Document(string identifier);

        public string Identifier { get; }
        public List<DocumentSection> Sections { get; }
        public string Markdown { get; set; }

        /// <summary>
        /// Iterate over all elements in the document, including those in nested sections.
        /// </summary>
        /// <remarks>
        /// Sections themselves are not included.
        /// </remarks>
        public IEnumerator<DocumentElement> GetEnumerator();
        IEnumerator IEnumerable.GetEnumerator();
    }
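The comment on DocumentTable.Cells says the cell grid helps when chunking large tables. A minimal sketch of what that could look like follows. It is not part of the proposal: the helper name SplitByRows is hypothetical, a row-count limit stands in for a real token-count limit, and treating row 0 as the header row is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TableChunkingSketch
{
    // Hypothetical helper: splits a table into Markdown fragments holding at most
    // maxRowsPerChunk data rows each. Row 0 is assumed to be the header row and is
    // repeated at the top of every fragment so each chunk stays self-describing.
    public static IEnumerable<string> SplitByRows(string[,] cells, int maxRowsPerChunk)
    {
        if (maxRowsPerChunk < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxRowsPerChunk));
        }

        int rows = cells.GetLength(0);
        int columns = cells.GetLength(1);

        for (int start = 1; start < rows; start += maxRowsPerChunk)
        {
            StringBuilder sb = new StringBuilder();

            // Header row followed by the Markdown separator row.
            AppendRow(sb, cells, 0, columns);
            sb.Append('|');
            for (int c = 0; c < columns; c++)
            {
                sb.Append(" --- |");
            }
            sb.Append('\n');

            // Data rows that belong to this fragment.
            int end = Math.Min(start + maxRowsPerChunk, rows);
            for (int r = start; r < end; r++)
            {
                AppendRow(sb, cells, r, columns);
            }

            yield return sb.ToString();
        }
    }

    private static void AppendRow(StringBuilder sb, string[,] cells, int row, int columns)
    {
        sb.Append('|');
        for (int c = 0; c < columns; c++)
        {
            sb.Append(' ').Append(cells[row, c]).Append(" |");
        }
        sb.Append('\n');
    }
}
```

A chunker that only had the table's Markdown string would need to re-parse it to find row boundaries; with the cell grid it can cut between rows directly.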

API Usage

The following example uses the Document API to build a simple structured document:

    Document doc = new("doc")
    {
        Sections =
        {
            new DocumentSection()
            {
                Elements =
                {
                    new DocumentHeader("# Section title"),
                    new DocumentParagraph("This is a paragraph in section 1."),
                    new DocumentParagraph("This is another paragraph in section 1."),
                    new DocumentSection
                    {
                        Elements =
                        {
                            new DocumentHeader("## Subsection title"),
                            new DocumentParagraph("This is a paragraph in subsection 1.1."),
                            new DocumentParagraph("This is another paragraph in subsection 1.1.")
                        }
                    }
                }
            }
        }
    };

Another example iterates over all elements and extracts their semantic content, to be used for generating embeddings:

    foreach (DocumentElement element in documents)
    {
        string? semanticContent = element is DocumentImage img
            ? img.AlternativeText ?? img.Text
            : element.Markdown;

        if (!string.IsNullOrEmpty(semanticContent))
        {
            yield return (element, semanticContent);
        }
    }

Alternative Designs

Naming: Document may be a bit too generic. My current best idea for a different name is DocumentContent.

The fact that Document implements IEnumerable<DocumentElement> may be hard for end users to discover. Because of that, it may be easier to just add an explicit Flatten method:

    public sealed class Document
    {
        public IEnumerator<DocumentElement> Flatten();
    }
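For illustration, the traversal behind such a method could look like the sketch below. It is a guess at the behavior described in the GetEnumerator remarks (all nested elements, sections themselves excluded), not the actual implementation. The types are minimal stand-ins for the proposed ones, the DocumentTraversal helper is hypothetical, and it returns IEnumerable<DocumentElement> rather than IEnumerator<DocumentElement> purely so that it can be used with foreach and LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins for the proposed types, reduced to what the traversal needs.
public abstract class DocumentElement
{
    protected DocumentElement(string markdown) => Markdown = markdown;
    protected DocumentElement() => Markdown = string.Empty;

    public virtual string Markdown { get; }
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown) { }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown) { }
}

public sealed class DocumentSection : DocumentElement
{
    public List<DocumentElement> Elements { get; } = new List<DocumentElement>();
}

public static class DocumentTraversal
{
    // Hypothetical helper: depth-first walk in document order. Sections are
    // descended into but never yielded; every other element is yielded as-is.
    public static IEnumerable<DocumentElement> Flatten(IEnumerable<DocumentElement> elements)
    {
        foreach (DocumentElement element in elements)
        {
            if (element is DocumentSection section)
            {
                foreach (DocumentElement nested in Flatten(section.Elements))
                {
                    yield return nested;
                }
            }
            else
            {
                yield return element;
            }
        }
    }
}
```

With a method like this, a caller can write `foreach (DocumentElement element in DocumentTraversal.Flatten(doc.Sections))` without knowing that Document itself is enumerable.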

Risks

Exposing a public Dictionary<string, object> can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to allow storing any information provided by the DocumentReader that is specific to a given implementation. Examples:

  • ConfidenceScore: a double
  • BoundingRegions: a list of BoundingRegion structs (X, Y, Width, Height)
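The two examples above could be stored and read back roughly as follows. This is a sketch only: BoundingRegion is a hypothetical reader-specific type (the proposal names just its fields), the key names are taken from the list above, and a plain dictionary stands in for DocumentElement.Metadata.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reader-specific type; the proposal only lists its fields.
public readonly struct BoundingRegion
{
    public BoundingRegion(double x, double y, double width, double height)
    {
        X = x;
        Y = y;
        Width = width;
        Height = height;
    }

    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }
}

public static class MetadataSketch
{
    // Builds a dictionary of the same type as the proposed Metadata property.
    public static Dictionary<string, object?> CreateMetadata()
    {
        return new Dictionary<string, object?>
        {
            ["ConfidenceScore"] = 0.93,
            ["BoundingRegions"] = new List<BoundingRegion> { new BoundingRegion(10, 20, 200, 50) },
        };
    }

    // Values are stored as object?, so consumers have to type-check when reading.
    public static double? GetConfidence(Dictionary<string, object?> metadata)
    {
        return metadata.TryGetValue("ConfidenceScore", out object? value) && value is double score
            ? score
            : (double?)null;
    }
}
```

The type check on read is the flip side of the flexibility: a serializer has no static knowledge of what each value is, which is where the serialization concern comes from.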

DocumentElement.Metadata is not consumed by any of the chunkers we provide as of now (we focus on RAG), but users may rely on it to implement more advanced scenarios, such as recreating a document while preserving its structure.
