Description
Background and motivation
DataIngestion is an ETL process in which a DocumentReader parses a given document and represents it as a Document. The Document is then processed by 0-n processors, split into Chunks, and persisted in a Vector Store to allow for Vector Search and RAG.
Document is a format-agnostic container that normalizes diverse input formats into a structured hierarchy. It's composed of DocumentSection objects, which can contain nested elements (including subsections). Each DocumentElement has to provide its content in Markdown format.
Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI's GPT-4o, natively "speak" Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.
API Proposal
namespace Microsoft.Extensions.DataIngestion;

public abstract class DocumentElement
{
    protected internal DocumentElement(string markdown);
    protected internal DocumentElement(); // ctor used by Section where providing Markdown up-front is not mandatory

    public virtual string Markdown { get; }
    public string? Text { get; set; }
    public int? PageNumber { get; set; }
    public Dictionary<string, object?> Metadata { get; }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection(string markdown) : base(markdown);
    public DocumentSection() : base();

    public List<DocumentElement> Elements { get; }
    public override string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown);
}

public sealed class DocumentHeader : DocumentElement
{
    public DocumentHeader(string markdown) : base(markdown);

    public int? Level { get; set; }
}

public sealed class DocumentFooter : DocumentElement
{
    public DocumentFooter(string markdown) : base(markdown);
}

public sealed class DocumentTable : DocumentElement
{
    public DocumentTable(string markdown, string[,] cells) : base(markdown);

    // This information is useful when chunking large tables that exceed token count limit.
    public string[,] Cells { get; }
}

public sealed class DocumentImage : DocumentElement
{
    public DocumentImage(string markdown) : base(markdown);

    public ReadOnlyMemory<byte>? Content { get; set; }
    public string? MediaType { get; set; }
    public string? AlternativeText { get; set; }
}

public sealed class Document : IEnumerable<DocumentElement>
{
    public Document(string identifier);

    public string Identifier { get; }
    public List<DocumentSection> Sections { get; }
    public string Markdown { get; set; }

    /// <summary>
    /// Iterate over all elements in the document, including those in nested sections.
    /// </summary>
    /// <remarks>
    /// Sections themselves are not included.
    /// </remarks>
    public IEnumerator<DocumentElement> GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator();
}
API Usage
The following example uses the Document API to build a simple structured document:
Document doc = new("doc")
{
    Sections =
    {
        new DocumentSection()
        {
            Elements =
            {
                new DocumentHeader("# Section title"),
                new DocumentParagraph("This is a paragraph in section 1."),
                new DocumentParagraph("This is another paragraph in section 1."),
                new DocumentSection
                {
                    Elements =
                    {
                        new DocumentHeader("## Subsection title"),
                        new DocumentParagraph("This is a paragraph in subsection 1.1."),
                        new DocumentParagraph("This is another paragraph in subsection 1.1.")
                    }
                }
            }
        }
    }
};
Another example iterates over all elements and gets their semantic content, to be used for generating embeddings:
foreach (DocumentElement element in documents)
{
    string? semanticContent = element is DocumentImage img
        ? img.AlternativeText ?? img.Text
        : element.Markdown;

    if (!string.IsNullOrEmpty(semanticContent))
    {
        yield return (element, semanticContent);
    }
}
Alternative Designs
Naming: Document may be a bit too generic. My current best idea for a different name is DocumentContent.
The fact that Document implements IEnumerable<DocumentElement> may be hard for end users to discover. Because of that, it may be easier to just add an explicit Flatten method:
public sealed class Document
{
    public IEnumerator<DocumentElement> Flatten();
}
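As a rough illustration of what such a flattening traversal could look like, here is a minimal sketch. The types below are reduced, hypothetical stand-ins for the proposed ones, `DocumentTraversal` is an illustrative helper name, and the method returns `IEnumerable<DocumentElement>` (rather than the `IEnumerator` shown above) purely so it can be used directly in a `foreach`. It is not the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the proposed types, reduced to what the traversal needs.
public abstract class DocumentElement
{
    protected DocumentElement(string markdown) => Markdown = markdown;
    public virtual string Markdown { get; }
}

public sealed class DocumentParagraph : DocumentElement
{
    public DocumentParagraph(string markdown) : base(markdown) { }
}

public sealed class DocumentSection : DocumentElement
{
    public DocumentSection() : base(string.Empty) { }
    public List<DocumentElement> Elements { get; } = new();
}

// Illustrative helper, not part of the proposal.
public static class DocumentTraversal
{
    // Depth-first walk: yields every non-section element in document order
    // and descends into sections without yielding the sections themselves.
    public static IEnumerable<DocumentElement> Flatten(IEnumerable<DocumentElement> elements)
    {
        foreach (DocumentElement element in elements)
        {
            if (element is DocumentSection section)
            {
                foreach (DocumentElement inner in Flatten(section.Elements))
                {
                    yield return inner;
                }
            }
            else
            {
                yield return element;
            }
        }
    }
}
```

With this shape, a caller would pass the top-level sections (for example `DocumentTraversal.Flatten(doc.Sections)`, assuming `Sections` is a `List<DocumentSection>`) and receive headers, paragraphs, and other leaf elements in reading order.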
Risks
Exposing a public Dictionary<string, object> can cause serialization headaches in the future (durable document pipelines are on our radar). The main goal of the metadata is to store any information provided by the DocumentReader that is specific to a given implementation. Examples:
- ConfidenceScore: a double
- BoundingRegions: a list of BoundingRegion structs (X, Y, Width, Height)
DocumentElement.Metadata is not consumed by any of the chunkers we provide as of now (we focus on RAG), but users may use it to implement more advanced scenarios, like recreating a document and preserving its structure.
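To make the serialization concern concrete, the sketch below round-trips a metadata dictionary through System.Text.Json. The `BoundingRegion` struct and the `MetadataRoundTrip` helper are hypothetical, modeled only on the examples above, and System.Text.Json is just one serializer a durable pipeline might use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical struct matching the (X, Y, Width, Height) shape mentioned above.
public readonly record struct BoundingRegion(double X, double Y, double Width, double Height);

// Illustrative helper, not part of the proposal.
public static class MetadataRoundTrip
{
    // Serializes the metadata to JSON and reads it back as Dictionary<string, object?>.
    // Because the declared value type is object, the deserializer cannot know the
    // original runtime types and materializes every non-null value as a JsonElement.
    public static Dictionary<string, object?> RoundTrip(Dictionary<string, object?> metadata)
    {
        string json = JsonSerializer.Serialize(metadata);
        return JsonSerializer.Deserialize<Dictionary<string, object?>>(json)!;
    }
}
```

If a reader stores `["ConfidenceScore"] = 0.97` and a `List<BoundingRegion>`, the values that come back from `RoundTrip` are `JsonElement` instances rather than a `double` and a list of structs, so consumers would need to know the expected type of each key to recover them.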