Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

www.unstructured.io/

License

Apache-2.0 license

Open-Source Pre-Processing Tools for Unstructured Data

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. The modular functions and connectors in unstructured form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

Try the Unstructured Platform Product

Ready to move your data processing pipeline to production and take advantage of advanced features? Check out Unstructured Platform. In addition to better processing performance, take advantage of chunking, embedding, and image and table enrichment generation, all from a low-code UI or an API. Request a demo from our sales team to learn more about how to get started.

✴️ Quick Start

There are several ways to use the unstructured library:

Run the library in a container or
Install the library
1. Install from PyPI
2. Install for local development
For installation with conda on Windows systems, please refer to the documentation

Run the library in a container

The following instructions are intended to help you get up and running using Docker to interact with unstructured. See here if you don't already have Docker installed on your machine.

NOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. docker pull should download the corresponding image for your architecture, but you can specify with --platform (e.g. --platform linux/amd64) if needed.

We build Docker images for all pushes to main. We tag each image with the corresponding short commit hash (e.g. fbc7a69) and the application version (e.g. 0.5.5-dev1). We also tag the most recent image with latest. To leverage this, docker pull from our image repository.

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
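As a rough illustration of the --platform note above, the following Python sketch builds the pull command for the current machine. The mapping from machine names to Docker platform strings and the helper name docker_pull_command are assumptions made for this example; they are not part of unstructured.

```python
import platform

IMAGE = "downloads.unstructured.io/unstructured-io/unstructured:latest"

# Assumed mapping from platform.machine() values to Docker platform strings.
DOCKER_PLATFORMS = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}


def docker_pull_command(machine=None, image=IMAGE):
    """Return a docker pull command, adding --platform when the machine type is recognized."""
    machine = (machine or platform.machine()).lower()
    target = DOCKER_PLATFORMS.get(machine)
    if target is None:
        # Unknown architecture: let Docker pick the image on its own.
        return f"docker pull {image}"
    return f"docker pull --platform {target} {image}"


print(docker_pull_command())
```

In most cases the plain docker pull is enough; an explicit platform is mainly useful when you want an image for an architecture other than the host's.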

Once pulled, you can create a container from this image and shell to it.

# create the container
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest

# this will drop you into a bash shell where the Docker image is running
docker exec -it unstructured bash

You can also build your own Docker image. Note that the base image is wolfi-base, which is updated regularly. If you are building the image locally, it is possible docker-build could fail due to upstream changes in wolfi-base.

If you only plan on parsing one type of data you can speed up building the image by commenting out some of the packages/requirements necessary for other data types. See Dockerfile to know which lines are necessary for your use case.

make docker-build

# this will drop you into a bash shell where the Docker image is running
make docker-start-bash

Once in the running container, you can try things directly in the Python interpreter's interactive mode.

# this will drop you into a python console so you can run the below partition functions
python3

>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")

>>> from unstructured.partition.text import partition_text
>>> elements = partition_text(filename="example-docs/fake-text.txt")

Installing the library

Use the following instructions to get up and running with unstructured and test your installation.

Install the Python SDK to support all document types with pip install "unstructured[all-docs]"
- For plain text files, HTML, XML, JSON and Emails that do not require any extra dependencies, you can run pip install unstructured
- To process other doc types, you can install the extras required for those documents, such as pip install "unstructured[docx,pptx]"
Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
- libmagic-dev (filetype detection)
- poppler-utils (images and PDFs)
- tesseract-ocr (images and PDFs, install tesseract-lang for additional language support)
- libreoffice (MS Office docs)
- pandoc (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version 2.14.2 or newer. Running either make install-pandoc or ./scripts/install-pandoc.sh will install the correct version for you.
For suggestions on how to install on Windows and to learn about dependencies for other features, see the installation documentation here.
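Before running the library, it can help to check which of the system dependencies listed above are already present. The sketch below is one possible way to do that with the Python standard library. The executable names (pdftotext for poppler-utils, soffice for libreoffice) and both helper functions are assumptions for illustration, and libmagic is skipped because it is a shared library rather than a command.

```python
import re
import shutil

# Assumed executable name for each package in the list above.
TOOLS = {
    "pdftotext": "poppler-utils",
    "tesseract": "tesseract-ocr",
    "soffice": "libreoffice",
    "pandoc": "pandoc",
}


def missing_packages(which=shutil.which):
    """Return the packages whose executable cannot be found on PATH."""
    return sorted(pkg for exe, pkg in TOOLS.items() if which(exe) is None)


def pandoc_supports_rtf(version_output):
    """Compare the first version number in `pandoc --version` output against 2.14.2."""
    match = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if match is None:
        return False
    version = tuple(int(part) for part in match.group(1).split("."))
    return version >= (2, 14, 2)


print("Not found on PATH:", missing_packages() or "none")
```

A missing entry is only a problem if you parse the matching document types, so treat the result as a hint rather than a hard requirement.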

At this point, you should be able to run the following code:

from unstructured.partition.auto import partition

elements = partition(filename="example-docs/eml/fake-email.eml")
print("\n\n".join([str(el) for el in elements]))

Installation Instructions for Local Development

The following instructions are intended to help you get up and running with unstructured locally if you are planning to contribute to the project.

Using pyenv to manage virtualenvs is recommended but not necessary
- Mac install instructions. See here for more detailed instructions.
  - brew install pyenv-virtualenv
  - pyenv install 3.10
- Linux instructions are available here.
Create a virtualenv to work in and activate it, e.g. for one named unstructured:
pyenv virtualenv 3.10 unstructured
pyenv activate unstructured
Run make install
Optional:
- To install models and dependencies for processing images and PDFs locally, run make install-local-inference.
- For processing image files, tesseract is required. See here for installation instructions.
- For processing PDF files, tesseract and poppler are required. The pdf2image docs have instructions on installing poppler across various platforms.

Additionally, if you're planning to contribute to unstructured, we provide an optional pre-commit configuration file to ensure your code matches the formatting and linting standards used in unstructured. If you'd prefer not to have code changes auto-tidied before every commit, you can use make check to see whether any linting or formatting changes should be applied, and make tidy to apply them.

If using the optional pre-commit, you'll just need to install the hooks with pre-commit install, since the pre-commit package is installed as part of make install mentioned above. If you later decide to stop using pre-commit, you can uninstall the hooks with pre-commit uninstall.

In addition to developing on your local OS, we also provide a helper that uses Docker to provide a development environment:

make docker-start-dev

This starts a Docker container with your local repo mounted to /mnt/local_unstructured. This Docker image allows you to develop without worrying about your OS's compatibility with the repo and its dependencies.

👏 Quick Tour

Documentation

For more comprehensive documentation, visit https://docs.unstructured.io. You can also learn more about our other products on the documentation page, including our SaaS API.

Here are a few pages from the Open Source documentation page that are helpful for new users to review:

PDF Document Parsing Example

The following examples show how to get started with the unstructured library. The easiest way to parse a document in unstructured is to use the partition function. If you use the partition function, unstructured will detect the file type and route it to the appropriate file-specific partitioning function. You may need to install additional dependencies per doc type. For example, to install docx dependencies you need to run pip install "unstructured[docx]". See our installation guide for more details.

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

Run print("\n\n".join([str(el) for el in elements])) to get a string representation of the output, which looks like:

LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen 1 ( (cid:0) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and
Weining Li 5

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural
networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation.
However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy
reuse of important innovations by a wide audience. Though there have been ongoing eﬀorts to improve reusability and
simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none
of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA
is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper
introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications.
The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models
for layout detection, character recognition, and many other document processing tasks. To promote extensibility,
LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization
pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in
real-word use cases. The library is publicly available at https://layout-parser.github.io

Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library ·
Toolkit.

Introduction

Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks
including document image classiﬁcation [11,

See the partitioning section in our documentation for a full list of options and instructions on how to use file-specific partitioning functions.
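Once a document has been partitioned, a common next step is to look at what kinds of elements came back. The sketch below assumes that each element exposes category and text attributes. The helpers count_categories and text_for are hypothetical names for this example, and the sample list uses stand-in objects so the code runs without the library or the example PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(elements):
    """Count elements per category, most common first."""
    return Counter(el.category for el in elements).most_common()


def text_for(elements, category):
    """Join the text of every element in the given category."""
    return "\n\n".join(el.text for el in elements if el.category == category)


# Hypothetical stand-ins for partitioned elements (assumed attributes: category, text).
sample = [
    SimpleNamespace(category="Title", text="LayoutParser"),
    SimpleNamespace(category="NarrativeText", text="Recent advances in document image analysis."),
    SimpleNamespace(category="NarrativeText", text="This paper introduces LayoutParser."),
]

print(count_categories(sample))
print(text_for(sample, "Title"))
```

With the library installed, the same helpers could be applied to the list returned by partition("example-docs/layout-parser-paper.pdf") instead of the stand-in sample.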

💂‍♂️ Security Policy

See our security policy for information on how to report security vulnerabilities.

🐛 Reporting Bugs

Encountered a bug? Please create a new GitHub issue and use our bug report template to describe the problem. To help us diagnose the issue, use the python scripts/collect_env.py command to gather your system's environment information and include it in your report. Your assistance helps us continuously improve our software - thank you!

📚 Learn more

Section	Description
Company Website	Unstructured.io product and company info
Documentation	Full API documentation
Batch Processing	Ingesting batches of documents through Unstructured

📈 Analytics

This library includes a very lightweight analytics "ping" when the library is loaded. You can opt out of this data collection by setting the environment variable DO_NOT_TRACK=true before executing any unstructured code. To learn more about how we collect and use this data, please read our Privacy Policy.
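If exporting the variable in your shell is inconvenient, one option is to set it from Python at the very top of your script. This is a minimal sketch that assumes the variable is read when the library is first imported, so the assignment has to come before any unstructured import.

```python
import os

# Set the opt-out flag first so it is in the environment before the library loads.
os.environ["DO_NOT_TRACK"] = "true"

# Import unstructured only after the variable is set, for example:
# from unstructured.partition.auto import partition
```

Setting the variable in the shell or in your container configuration achieves the same thing and covers every script at once.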