Unstructured
The `unstructured` package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the `unstructured` ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get `unstructured` and its dependencies running.
For the smallest installation footprint and to take advantage of features not available in the open-source `unstructured` package, install the Python SDK with `pip install unstructured-client` along with `pip install langchain-unstructured` to use the `UnstructuredLoader` and partition remotely against the Unstructured API. This loader lives in a LangChain partner repo instead of the `langchain-community` repo, and you will need an `api_key`, which you can generate for free here.
- Unstructured's documentation for the SDK can be found here: https://docs.unstructured.io/api-reference/api-services/sdk
To run everything locally, install the open-source Python package with `pip install unstructured` along with `pip install langchain-community` and use the same `UnstructuredLoader` as mentioned above.
- You can install document-specific dependencies with extras, e.g. `pip install "unstructured[docx]"`. Learn more about extras here.
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
Install the following system dependencies if they are not already available on your system, e.g. with `brew install` on Mac. Depending on what document types you're parsing, you may not need all of these.
- `libmagic-dev` (filetype detection)
- `poppler-utils` (images and PDFs)
- `tesseract-ocr` (images and PDFs)
- `qpdf` (PDFs)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs)
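Before parsing documents locally, it can help to check which of these tools are already on your `PATH`. The following is a minimal sketch using only the Python standard library. The executable names in `ASSUMED_EXECUTABLES` are assumptions (package names and the commands they install differ by platform), and `find_missing_tools` is a hypothetical helper, not part of `unstructured`.

```python
import shutil

# Assumed executable names for the system packages listed above; they can
# differ by platform or distribution. libmagic is a shared library rather
# than a command, so it is not covered by this check.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing_tools(executables=ASSUMED_EXECUTABLES):
    """Return the package names whose executable is not found on PATH."""
    return [
        package
        for package, command in executables.items()
        if shutil.which(command) is None
    ]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing:", ", ".join(missing) if missing else "none")
```

A tool reported as missing only matters if you parse the document types it supports.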
When running locally, Unstructured also recommends using Docker by following this guide to ensure all system dependencies are installed correctly.
The Unstructured API requires API keys to make requests. You can request an API key here and start using it today! Check out the README here to get started making API calls. We'd love to hear your feedback; let us know how it goes in our community Slack. And stay tuned for improvements to both quality and performance! Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
Data Loaders
The primary usage of `Unstructured` is in data loaders.
UnstructuredLoader
See a usage example showing how to use this loader for partitioning both locally and remotely with the serverless Unstructured API.
from langchain_unstructured import UnstructuredLoader
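As a rough illustration of the local and remote modes, here is a minimal sketch. It assumes the `file_path`, `partition_via_api`, and `api_key` parameters of `UnstructuredLoader` as found in recent `langchain-unstructured` releases, so verify them against your installed version. `build_loader_kwargs` and `load_documents` are hypothetical helpers, and reading the key from an `UNSTRUCTURED_API_KEY` environment variable is this sketch's own choice.

```python
import os


def build_loader_kwargs(file_path, use_api=False):
    """Assemble keyword arguments for UnstructuredLoader (hypothetical helper)."""
    kwargs = {"file_path": file_path}
    if use_api:
        # Remote partitioning needs an API key; the environment variable
        # name used here is an assumption of this sketch.
        kwargs["partition_via_api"] = True
        kwargs["api_key"] = os.environ["UNSTRUCTURED_API_KEY"]
    return kwargs


def load_documents(file_path, use_api=False):
    """Partition one file locally (default) or via the Unstructured API."""
    # Imported lazily so the helper above works without the package installed.
    from langchain_unstructured import UnstructuredLoader

    loader = UnstructuredLoader(**build_loader_kwargs(file_path, use_api))
    return loader.load()
```

Calling `load_documents("example.pdf")` would partition locally and therefore needs the open-source `unstructured` package and its system dependencies, while `use_api=True` sends the file to the Unstructured API instead.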
UnstructuredCHMLoader
`CHM` means Microsoft Compiled HTML Help.
from langchain_community.document_loaders import UnstructuredCHMLoader
UnstructuredCSVLoader
A comma-separated values (`CSV`) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
See a usage example.
from langchain_community.document_loaders import UnstructuredCSVLoader
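To make the record and field terminology concrete, here is a small sketch that uses Python's standard `csv` module rather than `UnstructuredCSVLoader` itself. The sample data is invented for illustration.

```python
import csv
import io

# A tiny in-memory CSV document: one header line, then two records.
sample = "name,role\nAda,engineer\nGrace,admiral\n"

# csv.reader yields one list of fields per record (line).
records = list(csv.reader(io.StringIO(sample)))

# The first record holds the column names; the rest are data records.
header, rows = records[0], records[1:]
print(header)
print(rows)
```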
UnstructuredEmailLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredEmailLoader
UnstructuredEPubLoader
EPUB is an e-book file format that uses the “.epub” file extension. The term is short for electronic publication and is sometimes styled ePub. EPUB is supported by many e-readers, and compatible software is available for most smartphones, tablets, and computers.
See a usage example.
from langchain_community.document_loaders import UnstructuredEPubLoader
UnstructuredExcelLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredExcelLoader
UnstructuredFileIOLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredFileIOLoader
UnstructuredHTMLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredHTMLLoader
UnstructuredImageLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredImageLoader
UnstructuredMarkdownLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredMarkdownLoader
UnstructuredODTLoader
The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an open file format for word processing documents, spreadsheets, presentations, and graphics that uses ZIP-compressed XML files. It was developed with the aim of providing an open, XML-based file format specification for office applications.
See a usage example.
from langchain_community.document_loaders import UnstructuredODTLoader
UnstructuredOrgModeLoader
Org Mode is a document editing, formatting, and organizing mode designed for notes, planning, and authoring within the free software text editor Emacs.
See a usage example.
from langchain_community.document_loaders import UnstructuredOrgModeLoader
UnstructuredPDFLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredPDFLoader
UnstructuredPowerPointLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredPowerPointLoader
UnstructuredRSTLoader
A reStructuredText (`RST`) file is a file format for textual data used primarily in the Python programming language community for technical documentation.
See a usage example.
from langchain_community.document_loaders import UnstructuredRSTLoader
UnstructuredRTFLoader
See a usage example in the API documentation.
from langchain_community.document_loaders import UnstructuredRTFLoader
UnstructuredTSVLoader
A tab-separated values (`TSV`) file is a simple, text-based file format for storing tabular data. Records are separated by newlines, and values within a record are separated by tab characters.
See a usage example.
from langchain_community.document_loaders import UnstructuredTSVLoader
UnstructuredURLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredURLLoader
UnstructuredWordDocumentLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
UnstructuredXMLLoader
See a usage example.
from langchain_community.document_loaders import UnstructuredXMLLoader
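With this many file-specific loaders, one possible pattern is to pick a loader by file extension. The sketch below is an illustration under stated assumptions, not a LangChain feature. The class names come from the sections above, but the extension mapping, `loader_name_for`, and `make_loader` are hypothetical. It also assumes each mapped loader accepts a file path as its first argument, which is why `UnstructuredURLLoader` and `UnstructuredFileIOLoader` are left out.

```python
import importlib
from pathlib import Path

# Assumed extension-to-loader mapping for illustration; the class names come
# from the sections above, the extensions are this sketch's own choice.
LOADER_BY_EXTENSION = {
    ".chm": "UnstructuredCHMLoader",
    ".csv": "UnstructuredCSVLoader",
    ".eml": "UnstructuredEmailLoader",
    ".epub": "UnstructuredEPubLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".html": "UnstructuredHTMLLoader",
    ".png": "UnstructuredImageLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".odt": "UnstructuredODTLoader",
    ".org": "UnstructuredOrgModeLoader",
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".rst": "UnstructuredRSTLoader",
    ".rtf": "UnstructuredRTFLoader",
    ".tsv": "UnstructuredTSVLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".xml": "UnstructuredXMLLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, or None if unmapped."""
    return LOADER_BY_EXTENSION.get(Path(path).suffix.lower())


def make_loader(path, **kwargs):
    """Instantiate the matching loader (requires langchain-community)."""
    name = loader_name_for(path)
    if name is None:
        raise ValueError(f"No loader mapped for {path!r}")
    # Imported lazily so the lookup above works without the package installed.
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, name)(path, **kwargs)
```

Keeping the lookup separate from instantiation means the mapping can be inspected or extended without importing `langchain_community`.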