Docling

Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation including document layout, tables etc., making them ready for generative AI workflows like RAG.

This integration provides Docling's capabilities via the DoclingLoader document loader.

Overview

The presented DoclingLoader component enables you to:

use various document types in your LLM applications with ease and speed, and
leverage Docling's rich format for advanced, document-native grounding.

DoclingLoader supports two different export modes:

ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and then capture each individual chunk as a separate LangChain Document downstream, or
ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain Document

The example allows exploring both modes via the parameter EXPORT_TYPE; depending on the value set, the example pipeline is then set up accordingly.

Setup

%pip install -qU langchain-docling

Note: you may need to restart the kernel to use updated packages.

For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.

Initialization

Basic initialization looks as follows:

from langchain_docling import DoclingLoader

FILE_PATH = "https://arxiv.org/pdf/2408.09869"

loader = DoclingLoader(file_path=FILE_PATH)

For advanced usage, DoclingLoader has the following parameters:

file_path: source as single str (URL or local file) or iterable thereof
converter (optional): any specific Docling converter instance to use
convert_kwargs (optional): any specific kwargs for conversion execution
export_type (optional): export mode to use: ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN
md_export_kwargs (optional): any specific Markdown export kwargs (for Markdown mode)
chunker (optional): any specific Docling chunker instance to use (for doc-chunk mode)
meta_extractor (optional): any specific metadata extractor to use
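As an illustration of the parameters above, the following is a minimal sketch of a loader configured for Markdown mode. The helper name make_markdown_loader is hypothetical and only wraps the constructor; the imports and the file_path and export_type parameters are the ones used elsewhere on this page. The imports sit inside the function only so that the helper can be defined before the packages are installed; in real code they would normally go at the top of the module.

```python
def make_markdown_loader(sources):
    """Build a DoclingLoader that captures each input document as one Markdown Document.

    `sources` may be a single str (URL or local path) or an iterable of such strings.
    """
    # Imported lazily (an illustrative choice, not a requirement of the library).
    from langchain_docling import DoclingLoader
    from langchain_docling.loader import ExportType

    return DoclingLoader(
        file_path=sources,
        export_type=ExportType.MARKDOWN,  # the default would be ExportType.DOC_CHUNKS
    )
```

Calling make_markdown_loader(["https://arxiv.org/pdf/2408.09869"]).load() should then yield one Document per input file rather than one per chunk.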

Load

docs = loader.load()

Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors

Note: a message saying "Token indices sequence length is longer than the specified maximum sequence length..." can be ignored in this case (more details here).

Inspecting some sample docs:

for d in docs[:3]:
    print(f"- {d.page_content=}")

- d.page_content='arXiv:2408.09869v5  [cs.CL]  9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'

Lazy Load

Documents can also be loaded in a lazy fashion:

doc_iter = loader.lazy_load()
for doc in doc_iter:
    pass  # you can operate on `doc` here
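One reason to prefer lazy loading is that you can stop early without materializing every document. The sketch below shows this with itertools.islice; it assumes lazy_load() yields documents incrementally, as LangChain loaders generally do. The take helper and the fake_lazy_load generator are hypothetical stand-ins so the snippet runs without Docling installed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(items: Iterable[T], n: int) -> List[T]:
    """Return at most the first n items, consuming the iterable only as far as needed."""
    return list(islice(items, n))


def fake_lazy_load() -> Iterator[str]:
    # Hypothetical stand-in for loader.lazy_load(); yields items one at a time.
    for i in range(1000):
        yield f"chunk {i}"


first_three = take(fake_lazy_load(), 3)
print(first_three)  # ['chunk 0', 'chunk 1', 'chunk 2']
```

With a real loader, the equivalent call would be take(loader.lazy_load(), 3).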

End-to-end Example

import os

# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The following example pipeline uses HuggingFace's Inference API; for increased LLM quota, a token can be provided via the env var HF_TOKEN.
Dependencies for this pipeline can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python env; feel free to remove it for stricter usage):

%pip install -q --progress-bar off --no-warn-conflicts langchain-core langchain-huggingface langchain_milvus langchain python-dotenv

Note: you may need to restart the kernel to use updated packages.

Defining the pipeline parameters:

from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType


def _get_env_from_colab_or_os(key):
    # Prefer a Colab secret when running on Colab; otherwise fall back to the OS env.
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"]  # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")

API Reference: PromptTemplate

Now we can instantiate our loader and load documents:

from docling.chunking import HybridChunker
from langchain_docling import DoclingLoader

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

docs = loader.load()

Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors

Determining the splits:

if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    # Documents are already chunked by Docling; use them as-is.
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")

API Reference: MarkdownHeaderTextSplitter

Inspecting some sample splits:

for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")

- d.page_content='arXiv:2408.09869v5  [cs.CL]  9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
...

Ingestion

import json
from pathlib import Path
from tempfile import mkdtemp

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": milvus_uri},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)

API Reference: HuggingFaceEmbeddings

RAG

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint

retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})
llm = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    huggingfacehub_api_token=HF_TOKEN,
    task="text-generation",
)

API Reference: create_retrieval_chain | create_stuff_documents_chain | HuggingFaceEndpoint

def clip_text(text, threshold=100):
    # Truncate long strings for display, marking the cut with an ellipsis.
    return f"{text[:threshold]}..." if len(text) > threshold else text


question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({"input": QUESTION})

clipped_answer = clip_text(resp_dict["answer"], threshold=350)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict["context"]):
    print()
    print(f"Source {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
    for key in doc.metadata:
        if key != "pk":
            val = doc.metadata.get(key)
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")

Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model.

Source 1:
  text: "3.2 AI models\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 2:
  text: "3 Processing pipeline\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 3:
  text: "6 Future work and contributions\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Notice that the sources contain rich grounding information, including the passage headings (i.e. section), page, and precise bounding box.
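To make the grounding information concrete, here is a minimal sketch that pulls headings, page numbers, and bounding boxes out of a dl_meta dictionary. The summarize_grounding helper is hypothetical, and the key layout is assumed from the output printed above (abbreviated from Source 1); other Docling versions may structure the metadata differently.

```python
from typing import Any, Dict


def summarize_grounding(dl_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Collect headings, pages, and bounding boxes from a dl_meta-style dict."""
    pages = set()
    bboxes = []
    for item in dl_meta.get("doc_items", []):
        for prov in item.get("prov", []):
            pages.add(prov["page_no"])
            box = prov["bbox"]
            # Store each box as (left, top, right, bottom).
            bboxes.append((box["l"], box["t"], box["r"], box["b"]))
    return {
        "headings": dl_meta.get("headings", []),
        "pages": sorted(pages),
        "bboxes": bboxes,
        "filename": dl_meta.get("origin", {}).get("filename"),
    }


# Sample metadata, abbreviated from the "Source 1" output above.
sample_meta = {
    "doc_items": [
        {
            "self_ref": "#/texts/50",
            "label": "text",
            "prov": [
                {
                    "page_no": 3,
                    "bbox": {
                        "l": 108.0,
                        "t": 405.1419982910156,
                        "r": 504.00299072265625,
                        "b": 330.7799987792969,
                        "coord_origin": "BOTTOMLEFT",
                    },
                    "charspan": [0, 608],
                }
            ],
        }
    ],
    "headings": ["3.2 AI models"],
    "origin": {"mimetype": "application/pdf", "filename": "2408.09869v5.pdf"},
}

info = summarize_grounding(sample_meta)
print(info["headings"], info["pages"], info["filename"])
```

In the pipeline above, the same helper could be applied to doc.metadata["dl_meta"] for each retrieved source.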

API reference

Document loader conceptual guide
Document loader how-to guides
