Unstructured
This notebook covers how to use the Unstructured document loader to load files of many types. Unstructured currently supports loading of text files, PowerPoints, HTML, PDFs, images, and more.
Please see this guide for more instructions on setting up Unstructured locally, including setting up required system dependencies.
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
UnstructuredLoader | langchain_unstructured | ✅ | ❌ | ✅ |
Loader features
Source | Document Lazy Loading | Native Async Support |
---|---|---|
UnstructuredLoader | ✅ | ❌ |
Setup
Credentials
By default, langchain-unstructured installs a smaller footprint that requires offloading of the partitioning logic to the Unstructured API, which requires an API key. If you use the local installation, you do not need an API key. To get your API key, head over to this site, and then set it in the cell below:
import getpass
import os

# Prompt for the key only if it is not already set in the environment
if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass(
        "Enter your Unstructured API key: "
    )
Installation
Normal Installation
The following packages are required to run the rest of this notebook.
# Install package, compatible with API partitioning
%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured "unstructured[pdf]" python-magic
Installation for Local
If you would like to run the partitioning logic locally, you will need to install a combination of system dependencies, as outlined in the Unstructured documentation here.
For example, on Macs you can install the required dependencies with:
# base dependencies
brew install libmagic poppler tesseract
# If parsing xml / html documents:
brew install libxml2 libxslt
You can install the required pip dependencies needed for local with:
pip install "langchain-unstructured[local]"
Initialization
The UnstructuredLoader allows loading from a variety of different file types. To read all about the unstructured package, please refer to their documentation. In this example, we show loading from both a text file and a PDF file.
from langchain_unstructured import UnstructuredLoader

file_paths = [
    "./example_data/layout-parser-paper.pdf",
    "./example_data/state_of_the_union.txt",
]
loader = UnstructuredLoader(file_paths)
Load
docs = loader.load()
docs[0]
INFO: pikepdf C++ to Python logger bridge initialized
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
print(docs[0].metadata)
{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}
Lazy Load
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
pages[0]
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')
Post Processing
If you need to post-process the unstructured elements after extraction, you can pass in a list of str -> str functions to the post_processors kwarg when you instantiate the UnstructuredLoader. This applies to other Unstructured loaders as well. Below is an example.
from langchain_unstructured import UnstructuredLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    post_processors=[clean_extra_whitespace],
)
docs = loader.load()
docs[5:10]
[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),
Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. 
To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]
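Any function that takes a string and returns a string fits the post_processors description above. As a minimal sketch, the hypothetical helper remove_cid_markers below (it is not part of unstructured) strips the "(cid:N)" font-glyph markers visible in the author line of the sample output, using only the standard library.

```python
import re


def remove_cid_markers(text: str) -> str:
    """Hypothetical str -> str post-processor: drop "(cid:N)" glyph markers."""
    # Remove the markers themselves, e.g. "(cid:0)".
    cleaned = re.sub(r"\(cid:\d+\)", "", text)
    # Remove parentheses left empty by the previous step.
    cleaned = re.sub(r"\(\s*\)", "", cleaned)
    # Drop whitespace that now sits directly before punctuation.
    cleaned = re.sub(r"\s+([,.;:])", r"\1", cleaned)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


sample = "Zejiang Shen1 ((cid:0)), Ruochen Zhang2"
print(remove_cid_markers(sample))
```

Such a helper could then be listed next to clean_extra_whitespace, for example post_processors=[clean_extra_whitespace, remove_cid_markers].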
Unstructured API
If you want to get up and running with smaller packages and get the most up-to-date partitioning, you can pip install unstructured-client and pip install langchain-unstructured. For more information about the UnstructuredLoader, refer to the Unstructured provider page.
The loader will process your document using the hosted Unstructured serverless API when you pass in your api_key and set partition_via_api=True. You can generate a free Unstructured API key here.
Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
import os

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path="example_data/fake.docx",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
)
docs = loader.load()
docs[0]
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')
You can also batch multiple files through the Unstructured API with a single UnstructuredLoader.
loader = UnstructuredLoader(
    file_path=["example_data/fake.docx", "example_data/fake-email.eml"],
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    partition_via_api=True,
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["filename"], ": ", docs[-1].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Preparing to split document for partition.
INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.
fake.docx : Lorem ipsum dolor sit amet.
fake-email.eml : Violets are blue
Unstructured SDK Client
Partitioning with the Unstructured API relies on the Unstructured SDK Client.
If you want to customize the client, you will have to pass an UnstructuredClient instance to the UnstructuredLoader. Below is an example showing how you can customize features of the client such as using your own requests.Session(), passing an alternative server_url, and customizing the RetryConfig object. For more information about customizing the client or what additional parameters the SDK client accepts, refer to the Unstructured Python SDK docs and the client section of the API Parameters docs. Note that all API Parameters should be passed to the UnstructuredLoader.
import os

import requests
from langchain_unstructured import UnstructuredLoader
from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig

client = UnstructuredClient(
    api_key_auth=os.getenv(
        "UNSTRUCTURED_API_KEY"
    ),  # Note: the client API param is "api_key_auth" instead of "api_key"
    client=requests.Session(),  # Define your own requests session
    server_url="https://api.unstructuredapp.io/general/v0/general",  # Define your own api url
    retry_config=RetryConfig(
        strategy="backoff",
        retry_connection_errors=True,
        backoff=BackoffStrategy(
            initial_interval=500,
            max_interval=60000,
            exponent=1.5,
            max_elapsed_time=900000,
        ),
    ),  # Define your own retry config
)

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    partition_via_api=True,
    client=client,
    split_pdf_page=True,
    split_pdf_page_range=[1, 10],
)
docs = loader.load()
print(docs[0].metadata["filename"], ": ", docs[0].page_content[:100])
INFO: Preparing to split document for partition.
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 10 (10 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 5 files with 2 page(s) each.
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned the document.
layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
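To get a feel for what the BackoffStrategy values in the example above imply, the sketch below computes an illustrative retry schedule. It rests on assumptions that the source does not state: that the intervals are in milliseconds, and that each wait is the initial interval multiplied by a power of the exponent, capped at max_interval. The function backoff_delays_ms is hypothetical and is not part of unstructured_client; the real client may add random jitter and is expected to stop retrying once max_elapsed_time has passed.

```python
def backoff_delays_ms(initial_interval, max_interval, exponent, attempts):
    """Illustrative, un-jittered wait before each retry (assumed milliseconds)."""
    return [
        min(initial_interval * exponent**n, max_interval)
        for n in range(attempts)
    ]


# Values copied from the BackoffStrategy in the example above.
delays = backoff_delays_ms(
    initial_interval=500, max_interval=60000, exponent=1.5, attempts=15
)
print(delays[:5])   # the first few waits grow geometrically
print(delays[-1])   # later waits are capped at max_interval
```

Under these assumptions the wait starts at half a second and reaches the one-minute cap after about a dozen retries.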
Chunking
The UnstructuredLoader does not support mode as a parameter for grouping text like the older loader UnstructuredFileLoader and others did. It instead supports "chunking". Chunking in unstructured differs from other chunking mechanisms you may be familiar with that form chunks based on plain-text features: character sequences like "\n\n" or "\n" that might indicate a paragraph boundary or list-item boundary. Instead, all documents are split using specific knowledge about each document format to partition the document into semantic units (document elements), and we only need to resort to text-splitting when a single element exceeds the desired maximum chunk size. In general, chunking combines consecutive elements to form chunks as large as possible without exceeding the maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements. Each "chunk" is an instance of one of these three types.
See this page for more details about chunking options, but to reproduce the same behavior as mode="single", you can set chunking_strategy="basic", max_characters=<some-really-big-number>, and include_orig_elements=False.
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    "./example_data/layout-parser-paper.pdf",
    chunking_strategy="basic",
    max_characters=1000000,
    include_orig_elements=False,
)
docs = loader.load()
print("Number of LangChain documents:", len(docs))
print("Length of text in the document:", len(docs[0].page_content))
Number of LangChain documents: 1
Length of text in the document: 42772
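The combining rule described above can be illustrated with a small, simplified sketch. It is not the unstructured implementation: it works on plain strings rather than document elements, ignores tables, and the function name basic_chunk and the "\n\n" separator are assumptions made for illustration. It only shows the idea of packing consecutive elements into the largest chunk that fits and falling back to text-splitting for a single oversized element.

```python
def basic_chunk(elements, max_characters, separator="\n\n"):
    """Greedily pack consecutive element texts into chunks (simplified sketch)."""
    chunks = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # A single element is too large: flush what we have, then
            # fall back to plain text-splitting for this element only.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate  # still fits, keep growing the chunk
        else:
            chunks.append(current)  # chunk is full, start a new one
            current = text
    if current:
        chunks.append(current)
    return chunks


elements = ["Title", "First paragraph.", "Second paragraph.", "x" * 25]
print(len(basic_chunk(elements, max_characters=30)))
print(len(basic_chunk(elements, max_characters=1000000)))
```

With a very large max_characters every element fits into one chunk, which mirrors why the example above yields a single document.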
Loading web pages
UnstructuredLoader accepts a web_url kwarg when run locally that populates the url parameter of the underlying Unstructured partition. This allows for the parsing of remotely hosted documents, such as HTML web pages.
Example usage:
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(web_url="https://www.example.com")
docs = loader.load()
for doc in docs:
    print(f"{doc}\n")
page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}
page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}
page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}
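Each element in the output above carries a category in its metadata, which makes it straightforward to keep only certain kinds of elements. The sketch below shows one way to do that. So that it runs without any extra packages, it uses a hypothetical stand-in class Doc that exposes only the page_content and metadata attributes seen in the output; the helper texts_by_category is likewise an illustration, not a LangChain API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the two attributes shown in the output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def texts_by_category(docs, category):
    """Return the text of every element whose metadata category matches."""
    return [d.page_content for d in docs if d.metadata.get("category") == category]


# Abbreviated copies of the three elements printed above.
docs = [
    Doc("Example Domain", {"category": "Title"}),
    Doc("This domain is for use in illustrative examples in documents.",
        {"category": "NarrativeText"}),
    Doc("More information...", {"category": "Title"}),
]
titles = texts_by_category(docs, "Title")
print(titles)
```

The same list comprehension should work on the documents returned by loader.load(), since it only reads page_content and metadata.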
API reference
For detailed documentation of all UnstructuredLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html
Related
- Document loader conceptual guide
- Document loader how-to guides