Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Unstructured-IO/unstructured
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. The modular functions and connectors in `unstructured` form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.
Ready to move your data processing pipeline to production, and take advantage of advanced features? Check out Unstructured Platform. In addition to better processing performance, take advantage of chunking, embedding, and image and table enrichment generation, all from a low code UI or an API. Request a demo from our sales team to learn more about how to get started.
There are several ways to use the `unstructured` library:
- Run the library in a container, or
- Install the library
- For installation with `conda` on Windows, please refer to the documentation
The following instructions are intended to help you get up and running using Docker to interact with `unstructured`. See here if you don't already have Docker installed on your machine.
NOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the corresponding image for your architecture, but you can specify with `--platform` (e.g. `--platform linux/amd64`) if needed.
We build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. To leverage this, `docker pull` from our image repository.

    docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
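The tagging scheme and the `--platform` flag described above can be combined when scripting a pull. The following is a minimal sketch: the helper names `image_ref` and `pull_command` are hypothetical, and only the registry path, the example tag, and the `--platform` flag come from the text above. It prints the command rather than running it, so it works without Docker installed.

```python
import shlex

# Registry path taken from the pull command above.
REPOSITORY = "downloads.unstructured.io/unstructured-io/unstructured"


def image_ref(tag="latest"):
    """Return the full image reference for a tag (hypothetical helper)."""
    return f"{REPOSITORY}:{tag}"


def pull_command(tag="latest", platform=None):
    """Build the `docker pull` argument list, optionally pinning a platform."""
    cmd = ["docker", "pull"]
    if platform:
        cmd += ["--platform", platform]
    cmd.append(image_ref(tag))
    return cmd


if __name__ == "__main__":
    # Show the command for a version tag on an explicit platform.
    print(shlex.join(pull_command(tag="0.5.5-dev1", platform="linux/amd64")))
```

The returned list can be passed to `subprocess.run` if you do want to execute the pull.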
Once pulled, you can create a container from this image and shell to it.
    # create the container
    docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
    # this will drop you into a bash shell where the Docker image is running
    docker exec -it unstructured bash
You can also build your own Docker image. Note that the base image is `wolfi-base`, which is updated regularly. If you are building the image locally, it is possible `docker-build` could fail due to upstream changes in `wolfi-base`.
If you only plan on parsing one type of data, you can speed up building the image by commenting out some of the packages/requirements necessary for other data types. See the Dockerfile to find out which lines are necessary for your use case.
    make docker-build
    # this will drop you into a bash shell where the Docker image is running
    make docker-start-bash

Once in the running container, you can try things directly in the Python interpreter's interactive mode.
    # this will drop you into a python console so you can run the below partition functions
    python3
    >>> from unstructured.partition.pdf import partition_pdf
    >>> elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")
    >>> from unstructured.partition.text import partition_text
    >>> elements = partition_text(filename="example-docs/fake-text.txt")
Use the following instructions to get up and running with `unstructured` and test your installation.
- Install the Python SDK to support all document types with `pip install "unstructured[all-docs]"`
  - For plain text files, HTML, XML, JSON and Emails that do not require any extra dependencies, you can run `pip install unstructured`
  - To process other doc types, you can install the extras required for those documents, such as `pip install "unstructured[docx,pptx]"`
- Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
  - `libmagic-dev` (filetype detection)
  - `poppler-utils` (images and PDFs)
  - `tesseract-ocr` (images and PDFs, install `tesseract-lang` for additional language support)
  - `libreoffice` (MS Office docs)
  - `pandoc` (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version `2.14.2` or newer. Running either `make install-pandoc` or `./scripts/install-pandoc.sh` will install the correct version for you.
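The install commands above differ only in the extras listed inside the brackets. As a small sketch of that pattern, the hypothetical helper `pip_install_command` below builds the command string from a list of extras. The extras names `docx`, `pptx`, and `all-docs` are the ones mentioned above; the helper does not check that any other name is valid.

```python
def pip_install_command(extras=()):
    """Return a pip command for unstructured with the given extras.

    Hypothetical helper for illustration; it only formats a string.
    """
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(extras))
    if not unique:
        # No extras: the base install covers plain text, HTML, XML, JSON and emails.
        return "pip install unstructured"
    joined = ",".join(unique)
    return f'pip install "unstructured[{joined}]"'


if __name__ == "__main__":
    print(pip_install_command(["docx", "pptx"]))
```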
For suggestions on how to install on Windows and to learn about dependencies for other features, see the installation documentation.
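Before parsing, it can help to check which of the system dependencies are present. The sketch below looks for one executable per package on `PATH` using the standard library. The executable names (`pdftotext`, `tesseract`, `soffice`, `pandoc`) are assumptions about what each package typically installs and are not taken from the `unstructured` documentation. `libmagic-dev` provides a shared library rather than a command, so it is not covered here.

```python
import shutil

# Package name -> executable assumed to be installed by that package.
# These executable names are assumptions, not part of the unstructured docs.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing(executables=ASSUMED_EXECUTABLES):
    """Return the sorted package names whose assumed executable is not on PATH."""
    return sorted(
        package
        for package, executable in executables.items()
        if shutil.which(executable) is None
    )


if __name__ == "__main__":
    missing = find_missing()
    print("Missing:", ", ".join(missing) if missing else "none")
```

A package reported as missing may still be fine if you never parse the document types it supports.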
At this point, you should be able to run the following code:
    from unstructured.partition.auto import partition

    elements = partition(filename="example-docs/eml/fake-email.eml")
    print("\n\n".join([str(el) for el in elements]))
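Beyond printing the elements, a quick way to inspect a result is to count elements by type. The sketch below assumes each element exposes a `category` attribute; the function name `count_categories` is hypothetical. It uses stand-in objects so that it runs without `unstructured` installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(elements):
    """Count elements by their `category` attribute (assumed to exist)."""
    return Counter(getattr(el, "category", "Unknown") for el in elements)


if __name__ == "__main__":
    # Stand-in objects that mimic the assumed element shape.
    sample = [
        SimpleNamespace(category="Title", text="Hello"),
        SimpleNamespace(category="NarrativeText", text="First paragraph."),
        SimpleNamespace(category="NarrativeText", text="Second paragraph."),
    ]
    for category, count in count_categories(sample).most_common():
        print(f"{category}: {count}")
```

With the library installed, passing the `elements` list returned by `partition` to `count_categories` should work the same way under that assumption.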
The following instructions are intended to help you get up and running with `unstructured` locally if you are planning to contribute to the project.
- Using `pyenv` to manage virtualenv's is recommended but not necessary
- Create a virtualenv to work in and activate it, e.g. for one named `unstructured`:
  - `pyenv virtualenv 3.10 unstructured`
  - `pyenv activate unstructured`
- Run `make install`
- Optional:
  - To install models and dependencies for processing images and PDFs locally, run `make install-local-inference`.
  - For processing image files, `tesseract` is required. See here for installation instructions.
  - For processing PDF files, `tesseract` and `poppler` are required. The pdf2image docs have instructions on installing `poppler` across various platforms.
Additionally, if you're planning to contribute to `unstructured`, we provide an optional `pre-commit` configuration file to ensure your code matches the formatting and linting standards used in `unstructured`. If you'd prefer not to have code changes auto-tidied before every commit, you can use `make check` to see whether any linting or formatting changes should be applied, and `make tidy` to apply them.
If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the `pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit` you can also uninstall the hooks with `pre-commit uninstall`.
In addition to developing in your local OS, we also provide a helper that uses Docker to give you a development environment:

    make docker-start-dev
This starts a Docker container with your local repo mounted to `/mnt/local_unstructured`. This Docker image allows you to develop without worrying about your OS's compatibility with the repo and its dependencies.
For more comprehensive documentation, visit https://docs.unstructured.io. You can also learn more about our other products on the documentation page, including our SaaS API.
Here are a few pages from the Open Source documentation page that are helpful for new users to review:
The following examples show how to get started with the `unstructured` library. The easiest way to parse a document in `unstructured` is to use the `partition` function. If you use the `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. You may need to install additional dependencies per doc type. For example, to install docx dependencies you need to run `pip install "unstructured[docx]"`. See our installation guide for more details.
    from unstructured.partition.auto import partition

    elements = partition("example-docs/layout-parser-paper.pdf")
Run `print("\n\n".join([str(el) for el in elements]))` to get a string representation of the output, which looks like:
    LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis

    Zejiang Shen 1 ( (cid:0) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and
    Weining Li 5

    Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural
    networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation.
    However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy
    reuse of important innovations by a wide audience. Though there have been ongoing efforts to improve reusability and
    simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none
    of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA
    is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper
    introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications.
    The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models
    for layout detection, character recognition, and many other document processing tasks. To promote extensibility,
    LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization
    pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in
    real-word use cases. The library is publicly available at https://layout-parser.github.io

    Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library ·
    Toolkit.

    Introduction

    Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks
    including document image classification [11,

See the partitioning section in our documentation for a full list of options and instructions on how to use file-specific partitioning functions.
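To make the idea of routing to a file-specific function concrete, here is a toy sketch that maps a file extension to the dotted path of one of the partitioning functions shown earlier in this README. It is illustrative only and is not how `partition` is implemented: the real function detects the file type itself, and this sketch makes no claim about how it does so. The names `ROUTES` and `route_for` are hypothetical.

```python
from pathlib import Path

# Extension -> dotted path of a file-specific function named earlier in this README.
# Toy table for illustration; NOT the detection logic used by `partition`.
ROUTES = {
    ".pdf": "unstructured.partition.pdf.partition_pdf",
    ".txt": "unstructured.partition.text.partition_text",
}


def route_for(filename):
    """Return the dotted path of the partitioner for a filename, or None."""
    return ROUTES.get(Path(filename).suffix.lower())


if __name__ == "__main__":
    for name in ("example-docs/layout-parser-paper-fast.pdf", "example-docs/fake-text.txt"):
        print(name, "->", route_for(name))
```

If you already know the file type, calling the file-specific function directly skips the detection step.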
See our security policy for information on how to report security vulnerabilities.
Encountered a bug? Please create a new GitHub issue and use our bug report template to describe the problem. To help us diagnose the issue, use the `python scripts/collect_env.py` command to gather your system's environment information and include it in your report. Your assistance helps us continuously improve our software - thank you!
| Section | Description |
|---|---|
| Company Website | Unstructured.io product and company info |
| Documentation | Full API documentation |
| Batch Processing | Ingesting batches of documents through Unstructured |
This library includes a very lightweight analytics "ping" when the library is loaded; however, you can opt out of this data collection by setting the environment variable `DO_NOT_TRACK=true` before executing any `unstructured` code. To learn more about how we collect and use this data, please read our Privacy Policy.
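When you cannot set the variable in the shell, one option is to set it from Python before the first `unstructured` import in the process. This is a minimal sketch: the helper name `opt_out_of_analytics` is hypothetical, and only the variable name and value come from the text above.

```python
import os


def opt_out_of_analytics():
    """Set DO_NOT_TRACK for this process and its children (hypothetical helper)."""
    os.environ["DO_NOT_TRACK"] = "true"


# Call this before any unstructured code is imported or executed.
opt_out_of_analytics()

# Only import the library after this point, for example:
# from unstructured.partition.auto import partition
```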