Dedoc

This sample demonstrates the use of Dedoc in combination with LangChain as a DocumentLoader.

Overview

Dedoc is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles and list items) from files of various formats.

Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more. The full list of supported formats can be found in the Dedoc documentation.

Integration details

Class  Package  Local  Serializable  JS support
DedocFileLoader  langchain_community  beta
DedocPDFLoader  langchain_community  beta
DedocAPIFileLoader  langchain_community  beta

Loader features

Methods for lazy loading and async loading are available, but document loading itself is executed synchronously.

Source  Document Lazy Loading  Async Support
DedocFileLoader
DedocPDFLoader
DedocAPIFileLoader
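
Since loading runs synchronously, a long parse can stall an asyncio event loop. One possible workaround, sketched here under assumptions, is to push the blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). `BlockingLoader` is a hypothetical stand-in so the snippet runs without Dedoc installed; with a real loader you would pass its `load` method instead.

```python
import asyncio
import time


class BlockingLoader:
    """Hypothetical stand-in for a loader whose load() blocks the caller."""

    def load(self):
        time.sleep(0.05)  # simulate slow, synchronous parsing
        return ["parsed document"]


async def load_in_thread(loader):
    # Run the blocking load() in a worker thread so the event loop stays free.
    return await asyncio.to_thread(loader.load)


async def main():
    # The two loads overlap because each one runs in its own thread.
    return await asyncio.gather(
        load_in_thread(BlockingLoader()),
        load_in_thread(BlockingLoader()),
    )


results = asyncio.run(main())
print(results)
```

This keeps other coroutines responsive while a file is parsed; it does not make the parsing itself any faster.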

Setup

  • To access the DedocFileLoader and DedocPDFLoader document loaders, you'll need to install the dedoc integration package.
  • To access DedocAPIFileLoader, you'll need to run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Dedoc installation instructions are given in the Dedoc documentation.

# Install package
%pip install --quiet "dedoc[torch]"
Note: you may need to restart the kernel to use updated packages.

Instantiation

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")
API Reference: DedocFileLoader

Load

docs = loader.load()
docs[0].page_content[:100]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'

Lazy Load

docs = loader.lazy_load()

for doc in docs:
    print(doc.page_content[:100])
    break

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
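
When only the first few items of a lazily produced sequence are needed, `itertools.islice` is a common alternative to a manual `break`. The sketch below uses `fake_lazy_load`, a hypothetical generator standing in for `loader.lazy_load()`, so it runs without Dedoc installed. Given that loading is described as synchronous, this presumably limits how many objects you consume rather than how much parsing is done.

```python
from itertools import islice


def first_n(documents, n):
    """Return at most n items from any iterable, consuming no more than n."""
    return list(islice(documents, n))


def fake_lazy_load():
    # Hypothetical stand-in for loader.lazy_load(); yields items one at a time.
    for i in range(1000):
        yield f"document {i}"


preview = first_n(fake_lazy_load(), 3)
print(preview)
```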

API reference

For detailed information on configuring and calling Dedoc loaders, please see the API references for DedocFileLoader, DedocPDFLoader, and DedocAPIFileLoader.

Loading any file

DedocFileLoader is useful for automatic handling of any file in a supported format. The loader detects the file type automatically, provided the file has a correct extension.

The file parsing process can be configured through dedoc_kwargs when initializing the DedocFileLoader class. Basic examples of some options are given here; please see the documentation of DedocFileLoader and the dedoc documentation for more details about configuration parameters.

Basic example

from langchain_community.document_loaders import DedocFileLoader

loader = DedocFileLoader("./example_data/state_of_the_union.txt")

docs = loader.load()

docs[0].page_content[:400]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '

Modes of split

DedocFileLoader supports several ways of splitting a document into parts (each part is returned separately). The split parameter controls this, with the following options:

  • document (default value): document text is returned as a single langchain Document object (no splitting);
  • page: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP);
  • node: split document text into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
  • line: split document text into textual lines.
loader = DedocFileLoader(
    "./example_data/layout-parser-paper.pdf",
    split="page",
    pages=":2",
)

docs = loader.load()

len(docs)
2
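
The `pages` values used in this document (`":2"` here, which yields two pages, and `"2:2"` in the PDF section, which yields page 2) suggest a 1-based `start:end` range with both ends inclusive. The helper below is only an illustrative reading of that convention, not Dedoc's own parser; the name `parse_pages` and its handling of edge cases are assumptions.

```python
def parse_pages(pages, total_pages):
    """Expand a 'start:end' string into 1-based page numbers (both ends inclusive)."""
    start_text, sep, end_text = pages.partition(":")
    if not sep:
        raise ValueError("expected a 'start:end' range, e.g. ':2' or '2:2'")
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else total_pages
    # Clamp to the document so an oversized range does not produce phantom pages.
    start, end = max(start, 1), min(end, total_pages)
    return list(range(start, end + 1))


print(parse_pages(":2", total_pages=10))   # [1, 2]
print(parse_pages("2:2", total_pages=10))  # [2]
print(parse_pages("9:", total_pages=10))   # [9, 10]
```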

Handling tables

DedocFileLoader handles tables when the with_tables parameter is set to True during loader initialization (with_tables=True by default).

Tables are not split: each table corresponds to one langchain Document object. For tables, the Document object has the additional metadata fields type="table" and text_as_html, which holds the table's HTML representation.

loader = DedocFileLoader("./example_data/mlb_teams_2012.csv")

docs = loader.load()

docs[1].metadata["type"], docs[1].metadata["text_as_html"][:200]
('table',
'<table border="1" style="border-collapse: collapse; width: 100%;">\n<tbody>\n<tr>\n<td colspan="1" rowspan="1">Team</td>\n<td colspan="1" rowspan="1"> &quot;Payroll (millions)&quot;</td>\n<td colspan="1" r')
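
To work with the cells rather than the raw markup, the `text_as_html` string can be turned back into rows. The following is a minimal sketch using only the standard library's `html.parser`; it assumes a flat table of `<tr>`/`<td>` elements like the one above and does not handle nested tables or merged cells. The `sample` string, including its second row, is made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Made-up sample shaped like the text_as_html value shown above.
sample = (
    '<table border="1"><tbody>'
    '<tr><td colspan="1" rowspan="1">Team</td>'
    '<td colspan="1" rowspan="1"> &quot;Payroll (millions)&quot;</td></tr>'
    '<tr><td colspan="1" rowspan="1">Example Team</td>'
    '<td colspan="1" rowspan="1">1.0</td></tr>'
    "</tbody></table>"
)
rows = html_table_to_rows(sample)
print(rows)
```

With a real loader result, the input would be `docs[1].metadata["text_as_html"]` in place of `sample`.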

Handling attached files

DedocFileLoader handles attached files when with_attachments is set to True during loader initialization (with_attachments=False by default).

Attachments are split according to the split parameter. For attachments, the langchain Document object has an additional metadata field type="attachment".

loader = DedocFileLoader(
    "./example_data/fake-email-attachment.eml",
    with_attachments=True,
)

docs = loader.load()

docs[1].metadata["type"], docs[1].page_content
('attachment',
'\nContent-Type\nmultipart/mixed; boundary="0000000000005d654405f082adb7"\nDate\nFri, 23 Dec 2022 12:08:48 -0600\nFrom\nMallori Harrell <mallori@unstructured.io>\nMIME-Version\n1.0\nMessage-ID\n<CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>\nSubject\nFake email with attachment\nTo\nMallori Harrell <mallori@unstructured.io>')
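
Because tables and attachments are marked through `metadata["type"]`, a loaded list can be sorted into groups by that field. The sketch below assumes that documents for the main text either lack a `type` key or carry some other value; the `Doc` dataclass is a minimal stand-in for a langchain Document, and the `"text"` fallback label is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a langchain Document (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_type(docs, default="text"):
    """Group documents by metadata['type']; documents without it go under `default`."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", default)].append(doc)
    return dict(groups)


sample_docs = [
    Doc("message body"),
    Doc("attached file text", {"type": "attachment"}),
    Doc("cell text", {"type": "table"}),
]
groups = group_by_type(sample_docs)
print({kind: len(items) for kind, items in groups.items()})
```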

Loading PDF file

If you only need to handle PDF documents, you can use DedocPDFLoader, which supports PDF only. The loader supports the same parameters for document splitting and for table and attachment extraction.

Dedoc can extract text from PDFs with or without a textual layer, and can automatically detect whether the layer is present and correct. Several PDF handlers are available; use the pdf_with_text_layer parameter to choose one of them. Please see the parameters description for more details.

For PDFs without a textual layer, Tesseract OCR and its language packages must be installed. In this case, the installation instruction can be useful.

from langchain_community.document_loaders import DedocPDFLoader

loader = DedocPDFLoader(
    "./example_data/layout-parser-paper.pdf", pdf_with_text_layer="true", pages="2:2"
)

docs = loader.load()

docs[0].page_content[:400]
API Reference: DedocPDFLoader
'\n2\n\nZ. Shen et al.\n\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\n\nA generalized learning-based framework dramatically reduces the need for the\n\nmanual specification of complicated rules, which is the status quo with traditional\n\nmethods. DL has the potential to transform DIA pipelines and benefit a broad\n\nspectrum of large-scale document digitization projects.\n'

Dedoc API

If you want to get up and running with less setup, you can use Dedoc as a service. DedocAPIFileLoader can be used without installing the dedoc library. The loader supports the same parameters as DedocFileLoader and also detects input file types automatically.

To use DedocAPIFileLoader, you should run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):

docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc

Please do not use our demo URL https://dedoc-readme.hf.space in your code.

from langchain_community.document_loaders import DedocAPIFileLoader

loader = DedocAPIFileLoader(
    "./example_data/state_of_the_union.txt",
    url="https://dedoc-readme.hf.space",
)

docs = loader.load()

docs[0].page_content[:400]
API Reference: DedocAPIFileLoader
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
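
In your own code, the `url` argument would point at a service you run yourself. One way to keep the demo address out of a project is to resolve the URL from configuration, as in this sketch. The `DEDOC_URL` environment variable and the `resolve_dedoc_url` helper are hypothetical names, and `http://localhost:1231` assumes the container started with the docker command above is reachable over plain HTTP on the published port.

```python
import os

DEMO_URL = "https://dedoc-readme.hf.space"  # demo instance named above; not for real use
LOCAL_URL = "http://localhost:1231"  # assumed address of a locally run container


def resolve_dedoc_url(env=None):
    """Pick the service URL from DEDOC_URL (hypothetical variable), else the local default."""
    env = os.environ if env is None else env
    url = env.get("DEDOC_URL", LOCAL_URL).rstrip("/")
    if url == DEMO_URL:
        raise ValueError("point DEDOC_URL at your own Dedoc service, not the demo")
    return url


print(resolve_dedoc_url({}))
```

The returned string could then be passed as the `url` argument of DedocAPIFileLoader.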
