Unstructured

The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.

Installation and Setup

If you are using a loader that runs locally, use the following steps to get unstructured and its dependencies running.

  • For the smallest installation footprint and to take advantage of features not available in the open-source unstructured package, install the Python SDK with pip install unstructured-client along with pip install langchain-unstructured to use the UnstructuredLoader and partition remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the langchain-community repo, and you will need an api_key, which you can generate for free here.

  • To run everything locally, install the open-source Python package with pip install unstructured along with pip install langchain-community and use the same UnstructuredLoader as mentioned above.

    • You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]". Learn more about extras here.
    • To install the dependencies for all document types, use pip install "unstructured[all-docs]".
  • Install the following system dependencies if they are not already available on your system, e.g. with brew install on Mac. Depending on what document types you're parsing, you may not need all of these.

    • libmagic-dev (filetype detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs)
    • qpdf (PDFs)
    • libreoffice (MS Office docs)
    • pandoc (EPUBs)
  • When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
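Before parsing locally, it can help to see which of the system dependencies listed above are already on your PATH. The following is a minimal sketch, not part of Unstructured or LangChain. The mapping from package name to executable name is an assumption and may differ by platform or package manager. libmagic is left out because it is a shared library rather than a command-line tool.

```python
import shutil

# Assumed mapping from the system packages listed above to an executable
# each one typically installs. These executable names are assumptions and
# can differ by platform or package manager.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_system_dependencies(executables=ASSUMED_EXECUTABLES):
    """Return {package name: executable path or None} based on PATH lookup."""
    return {pkg: shutil.which(exe) for pkg, exe in executables.items()}


if __name__ == "__main__":
    for pkg, path in find_system_dependencies().items():
        print(f"{pkg}: {path or 'not found on PATH'}")
```

A missing entry only matters if you parse the document types that need it.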

The Unstructured API requires API keys to make requests. You can request an API key here and start using it today! Check out the README here to get started making API calls. We'd love to hear your feedback; let us know how it goes in our community Slack. And stay tuned for improvements to both quality and performance! Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.

Data Loaders

The primary usage of Unstructured is in data loaders.

UnstructuredLoader

See a usage example for how to use this loader to partition both locally and remotely with the serverless Unstructured API.

from langchain_unstructured import UnstructuredLoader
API Reference: UnstructuredLoader
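As a rough illustration of the local versus remote choice, the sketch below builds the loader arguments depending on whether an API key is available. It assumes the loader accepts file_path, api_key, and partition_via_api keyword arguments; check the API reference for the version you have installed. The helper names build_loader_kwargs and load_documents, the UNSTRUCTURED_API_KEY environment variable name, and the example.pdf path are hypothetical and used only for illustration.

```python
import os


def build_loader_kwargs(file_path, api_key=None):
    """Build keyword arguments for UnstructuredLoader (hypothetical helper).

    With an API key, request remote partitioning; without one, leave the
    loader on its local default.
    """
    kwargs = {"file_path": file_path}
    if api_key:
        kwargs["api_key"] = api_key
        kwargs["partition_via_api"] = True  # assumed parameter name
    return kwargs


def load_documents(file_path, api_key=None):
    # Imported lazily so the helper above can be used and tested without
    # langchain-unstructured installed.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**build_loader_kwargs(file_path, api_key))
    return loader.load()


if __name__ == "__main__":
    # UNSTRUCTURED_API_KEY is the environment variable name assumed here.
    key = os.environ.get("UNSTRUCTURED_API_KEY")
    print(build_loader_kwargs("example.pdf", api_key=key))
```

Keeping the argument construction separate from the loader call makes the local and remote paths easy to compare without touching the network.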

UnstructuredCHMLoader

CHM means Microsoft Compiled HTML Help.

from langchain_community.document_loaders import UnstructuredCHMLoader
API Reference: UnstructuredCHMLoader

UnstructuredCSVLoader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

See a usage example.

from langchain_community.document_loaders import UnstructuredCSVLoader
API Reference: UnstructuredCSVLoader
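To make the record and field structure described above concrete, here is a small sketch that uses only Python's standard csv module rather than the loader itself. The sample data and the read_records helper are made up for illustration.

```python
import csv
import io

# Made-up sample data: a header line followed by two records.
SAMPLE_CSV = "name,role\nAda,engineer\nGrace,admiral\n"


def read_records(text, delimiter=","):
    """Split delimited text into records (lines) and fields (values)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


records = read_records(SAMPLE_CSV)
print(records)
```

Each inner list is one record, and each string inside it is one field.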

UnstructuredEmailLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredEmailLoader

UnstructuredEPubLoader

EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.

See a usage example.

from langchain_community.document_loaders import UnstructuredEPubLoader

UnstructuredExcelLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredExcelLoader

UnstructuredFileIOLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredFileIOLoader

UnstructuredHTMLLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredHTMLLoader

UnstructuredImageLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredImageLoader

UnstructuredMarkdownLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredMarkdownLoader

UnstructuredODTLoader

The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.

See a usage example.

from langchain_community.document_loaders import UnstructuredODTLoader
API Reference: UnstructuredODTLoader

UnstructuredOrgModeLoader

Org Mode is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.

See a usage example.

from langchain_community.document_loaders import UnstructuredOrgModeLoader

UnstructuredPDFLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredPDFLoader
API Reference: UnstructuredPDFLoader

UnstructuredPowerPointLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredPowerPointLoader

UnstructuredRSTLoader

reStructuredText (RST) is a file format for textual data used primarily in the Python programming language community for technical documentation.

See a usage example.

from langchain_community.document_loaders import UnstructuredRSTLoader
API Reference: UnstructuredRSTLoader

UnstructuredRTFLoader

See a usage example in the API documentation.

from langchain_community.document_loaders import UnstructuredRTFLoader
API Reference: UnstructuredRTFLoader

UnstructuredTSVLoader

A tab-separated values (TSV) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.

See a usage example.

from langchain_community.document_loaders import UnstructuredTSVLoader
API Reference: UnstructuredTSVLoader
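The newline and tab layout described above can be sketched with the standard library alone, independent of the loader. This example assumes the first line holds column names, which the TSV format itself does not require. The sample data and the read_tsv_rows helper are made up for illustration.

```python
import csv
import io

# Made-up sample data: a header line, then one record per line,
# with values separated by tab characters.
SAMPLE_TSV = "item\tquantity\napple\t3\npear\t5\n"


def read_tsv_rows(text):
    """Map each tab-separated record to a dict keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


rows = read_tsv_rows(SAMPLE_TSV)
print(rows)
```

All values come back as strings; converting quantity to a number is left to the caller.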

UnstructuredURLLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredURLLoader
API Reference: UnstructuredURLLoader

UnstructuredWordDocumentLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

UnstructuredXMLLoader

See a usage example.

from langchain_community.document_loaders import UnstructuredXMLLoader
API Reference: UnstructuredXMLLoader
