Quickstart

This quickstart uses the Unstructured open source library, which is designed as a starting point for quick prototyping and has limits. For production scenarios, use the Unstructured user interface (UI) or the Unstructured API instead.

In this quickstart, you use the Unstructured open source library (GitHub, PyPI) along with Python on your local development machine to partition a PDF file into a standard set of Unstructured document elements and metadata. You can use these elements and metadata as input into your RAG applications, AI agents, model fine-tuning tasks, and more.

Prerequisites

To complete this quickstart, you need:

A Python virtual environment manager is recommended to manage your Python code dependencies. This quickstart uses uv for managing virtual environments and venv as the virtual environment type. Installation and use of uv and venv are described in the following steps. However, uv and venv are not required to use the Unstructured open source library.
Python 3.9 or higher. You can use uv to install Python if needed, as described in the following steps.
A PDF file on your local machine. If you do not have a PDF file available, this quickstart provides a sample PDF file named layout-parser-paper.pdf that you can download in a later step. (The Unstructured open source library provides support for additional file types as well.)

Install uv

macOS, Linux

To use curl with sh:

curl -LsSf https://astral.sh/uv/install.sh | sh

To use wget with sh instead:

wget -qO- https://astral.sh/uv/install.sh | sh

Windows

To use PowerShell with irm to download the script and run it with iex:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.

Install Python

uv will detect and use Python if you already have it installed. To view a list of installed Python versions, run the following command:

uv python list

If you do not already have Python installed, you can install a version of Python for use with uv. For example, the following command installs Python 3.12:

uv python install 3.12
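
To confirm that the interpreter you plan to use meets the Python 3.9 minimum listed in the prerequisites, a small check like the following may help. This is only a sketch: the helper name meets_minimum is made up for illustration and is not part of uv or the Unstructured library.

```python
import sys

# Minimum version taken from this quickstart's prerequisites.
MINIMUM_VERSION = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old for this quickstart"
    print(f"Python {current}: {status}")
```

Running the script with the interpreter in question prints the detected version and whether it satisfies the minimum.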

Create a uv project

Use uv to create a project by switching to the directory on your development machine where you want to create the project and then running the following command:

uv init

Create a venv virtual environment

To isolate and manage your project’s code dependencies, from your project directory, use uv to create a virtual environment with venv by running one of the following commands:

# Create the virtual environment by using the current Python version:
uv venv
# Or, if you want to use a specific Python version:
uv venv --python 3.12

Activate the virtual environment

To activate the venv virtual environment, run one of the following commands:

macOS, Linux

For bash or zsh, run: source .venv/bin/activate
For fish, run: source .venv/bin/activate.fish
For csh or tcsh, run: source .venv/bin/activate.csh
For pwsh, run: .venv/bin/Activate.ps1

Windows

For cmd.exe, run: .venv\Scripts\activate.bat
For PowerShell, run: .venv\Scripts\Activate.ps1

To deactivate the virtual environment at any time, run deactivate.
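
If you are unsure whether activation worked, one way to check from Python is to compare sys.prefix with sys.base_prefix, which differ inside a venv-style virtual environment. The following is a minimal sketch; the function name in_virtual_environment is hypothetical and chosen only for this example.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from its base prefix.

    Inside a venv-style environment, sys.prefix points at the environment
    directory while sys.base_prefix points at the base Python installation.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
```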

Install the Unstructured open source library

With the virtual environment activated, use uv to install the Unstructured open source library by running the following command:

uv add unstructured

The preceding command supports plain text files (.txt), HTML files (.html), XML files (.xml), and emails (.eml, .msg, and .p7s) without any additional dependencies. To work with other file types, you must also install the matching extra dependencies, as follows, replacing <extra> with the appropriate extra for the target file type:

uv add "unstructured[<extra>]"

The following file type extras are available:

all-docs (for all supported file types in this list)
csv (for .csv files only)
docx (for .doc and .docx files only)
epub (for .epub files only)
image (for all supported image file types: .bmp, .heic, .jpeg, .png, and .tiff)
md (for .md files only)
odt (for .odt files only)
org (for .org files only)
pdf (for .pdf files only)
pptx (for .ppt and .pptx files only)
rst (for .rst files only)
rtf (for .rtf files only)
tsv (for .tsv files only)
xlsx (for .xls and .xlsx files only)

As this quickstart uses a sample PDF file, run the following command:

uv add "unstructured[pdf]"

Note that you can install multiple extras at the same time by separating them with commas, for example:

uv add "unstructured[pdf,docx]"
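
To make the mapping from file types to extras concrete, the sketch below builds the uv add command for a set of file names. The extension-to-extra table is copied from the list above; the helper name uv_add_command is hypothetical and is not provided by uv or Unstructured.

```python
from pathlib import Path

# Extensions the base install handles without extras, per this quickstart.
BASE_EXTENSIONS = {".txt", ".html", ".xml", ".eml", ".msg", ".p7s"}

# Extension-to-extra mapping, copied from the list of extras above.
EXTRA_BY_EXTENSION = {
    ".csv": "csv", ".doc": "docx", ".docx": "docx", ".epub": "epub",
    ".bmp": "image", ".heic": "image", ".jpeg": "image",
    ".png": "image", ".tiff": "image",
    ".md": "md", ".odt": "odt", ".org": "org", ".pdf": "pdf",
    ".ppt": "pptx", ".pptx": "pptx", ".rst": "rst", ".rtf": "rtf",
    ".tsv": "tsv", ".xls": "xlsx", ".xlsx": "xlsx",
}


def uv_add_command(file_names):
    """Return the uv add command covering the given files (illustrative helper)."""
    extras = set()
    for name in file_names:
        suffix = Path(name).suffix.lower()
        if suffix in BASE_EXTENSIONS:
            continue  # covered by the base install
        if suffix not in EXTRA_BY_EXTENSION:
            raise ValueError(f"No extra listed in this quickstart for {suffix!r}")
        extras.add(EXTRA_BY_EXTENSION[suffix])
    if not extras:
        return "uv add unstructured"
    joined = ",".join(sorted(extras))
    return f'uv add "unstructured[{joined}]"'


if __name__ == "__main__":
    print(uv_add_command(["report.pdf", "notes.docx"]))
```

The extras are sorted only to make the output deterministic; the order inside the brackets should not matter to the installer.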

Install system dependencies

For maximum compatibility, you should also install the following system dependencies:

libmagic-dev (for filetype detection)
poppler-utils and tesseract-ocr (for images and PDFs), and tesseract-lang (for additional language support)
libreoffice (for Microsoft Office documents)
pandoc (for .epub, .odt, and .rtf files. For .rtf files, you must have version 2.14.2 or newer. Running this script will install the correct version for you.)

Installation instructions for these system dependencies vary by operating system type. For details, follow the preceding links or see your operating system’s documentation.
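
As a rough way to see which of these tools are already on your PATH, the following sketch uses shutil.which from the Python standard library. The executable names (pdftoppm for poppler-utils, tesseract, soffice for libreoffice, and pandoc) are assumptions about what each package typically installs and may differ on your system. libmagic is a shared library rather than a command, so it is not covered here.

```python
import shutil

# Assumed executable name for each package; adjust for your platform.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the sorted package names whose executable is not found on PATH."""
    return sorted(
        package for package, exe in executables.items() if which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked executables were found.")
```

A missing entry means only that the assumed executable name was not found on PATH, not necessarily that the package is absent.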

Download the sample PDF file

Download the sample PDF file named layout-parser-paper.pdf from the following location to your local development machine: https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf (You can also use any other PDF file that you want to work with instead of this sample file, if you prefer.)

Add the Python code

In the project’s main.py file, add the following Python code, replacing <path/to> with the path to the layout-parser-paper.pdf file that you downloaded to your local development machine. (If you want to use a different PDF file, replace layout-parser-paper with the name of that PDF file instead.)

from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

file_path = "<path/to>"
base_file_name = "layout-parser-paper"

def main():
    # Partition the PDF into Unstructured document elements.
    elements = partition_pdf(filename=f"{file_path}/{base_file_name}.pdf")
    # Write the elements and their metadata to a JSON file next to the PDF.
    elements_to_json(elements=elements, filename=f"{file_path}/{base_file_name}-output.json")

if __name__ == "__main__":
    main()

Run the Python code

Use uv to run the preceding Python code by running the following command:

uv run main.py

It might take a few minutes for the command to finish.

View the output

After the command finishes running successfully, view the Unstructured elements and metadata that were generated by opening the layout-parser-paper-output.json file in your editor. This file will be in the same location as the original layout-parser-paper.pdf file. (If you used a different PDF file, the output file will be named <your-file-name>-output.json instead.)
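
Instead of reading the whole file by eye, you can get a quick overview by counting the element types it contains. The sketch below assumes the output file is a JSON array of objects that each carry a "type" field; check your own output file to confirm that shape. The function name count_element_types is hypothetical.

```python
import json
from collections import Counter


def count_element_types(json_path):
    """Count elements per type in an output JSON file.

    Assumes the file holds a JSON array of objects with a "type" key;
    objects without that key are counted as "unknown".
    """
    with open(json_path, encoding="utf-8") as f:
        elements = json.load(f)
    return Counter(element.get("type", "unknown") for element in elements)


if __name__ == "__main__":
    # Replace <path/to> with the directory used in main.py.
    counts = count_element_types("<path/to>/layout-parser-paper-output.json")
    for element_type, count in counts.most_common():
        print(f"{element_type}: {count}")
```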

Next steps

Learn more about the available partition functions in addition to partition_pdf for converting other types of files into standard Unstructured document elements and metadata.
By default, the preceding example uses the auto partitioning strategy. Learn about other available partitioning strategies for fine-tuned approaches to converting different types of files into Unstructured document elements.
Learn about available chunking functions for splitting up the text in your document elements into manageable chunks as needed to fit into your models’ limited context windows.
Learn about available cleaning functions for cleaning up your document elements’ data as needed.
Learn about available extraction functions for getting precise information out of your document elements as needed.
Learn about how to generate vector embeddings for the text in your document elements for use in RAG applications, AI agents, model fine-tuning tasks, and more.
For an additional code example, see the Unstructured Quick Tour Google Colab notebook.
The Unstructured open source library is also available as a Docker container.
The Unstructured Ingest CLI and Unstructured Ingest Python library build upon the Unstructured open source library by providing additional functionality such as batch file processing, ingesting files from remote source locations and sending the processed files’ data to remote destination locations, creating programmatic ETL pipelines, optionally processing files on Unstructured-hosted compute resources instead of locally for improved performance and quality on a pay-as-you-go basis, and more.
The Unstructured user interface (UI) and Unstructured API are superior to the Unstructured open source library, the Unstructured Ingest CLI, and the Unstructured Ingest Python library. The Unstructured UI and API are designed for production scenarios, with significantly increased performance and quality, the latest OCR and vision language models, advanced chunking strategies, security compliance, multi-user account management, job scheduling and monitoring, self-hosted deployment options, and more on a pay-as-you-go or subscription basis.

Need help?

Join the Unstructured Slack community and post your questions in the #ask-for-help-open-source-library channel.
Post your bug reports and feature requests in the Unstructured open source library GitHub repository. These bug reports and feature requests are evaluated and addressed based on the interest and availability of the open source community.
