
Quickstart

This quickstart uses the Unstructured open source library, which is designed as a starting point for quick prototyping and has limits. For production scenarios, use the Unstructured user interface (UI) or the Unstructured API instead.
In this quickstart, you use the Unstructured open source library (GitHub, PyPI) along with Python on your local development machine to partition a PDF file into a standard set of Unstructured document elements and metadata. You can use these elements and metadata as input into your RAG applications, AI agents, model fine-tuning tasks, and more.
1

Prerequisites

To complete this quickstart, you need:
  • A Python virtual environment manager is recommended to manage your Python code dependencies. This quickstart uses uv for managing virtual environments and venv as the virtual environment type. Installation and use of uv and venv are described in the following steps. However, uv and venv are not required to use the Unstructured open source library.
  • Python 3.9 or higher. You can use uv to install Python if needed, as described in the following steps.
  • A PDF file on your local machine. If you do not have a PDF file available, this quickstart provides a sample PDF file named layout-parser-paper.pdf that you can download in a later step. (The Unstructured open source library provides support for additional file types as well.)
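If you are unsure whether an existing interpreter meets the Python 3.9 requirement above, a quick check along these lines can help. This is a minimal sketch, not part of the Unstructured library; the helper name meets_minimum is hypothetical.

```python
import sys

# Minimum version stated in the prerequisites above.
MINIMUM_VERSION = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_VERSION) -> bool:
    """Return True when the given (major, minor, ...) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "meets" if meets_minimum() else "does not meet"
    print(f"Python {sys.version.split()[0]} {status} the 3.9 requirement")
```

Running the file with the interpreter you plan to use prints whether that interpreter qualifies.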
2

Install uv

On macOS or Linux, to use curl with sh:
curl -LsSf https://astral.sh/uv/install.sh | sh
To use wget with sh instead:
wget -qO- https://astral.sh/uv/install.sh | sh
On Windows, to use PowerShell with irm to download the script and run it with iex:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.
3

Install Python

uv will detect and use Python if you already have it installed. To view a list of installed Python versions, run the following command:
uv python list
If, however, you do not already have Python installed, you can install a version of Python for use with uv by running the following command. For example, this command installs Python 3.12 for use with uv:
uv python install 3.12
4

Create a uv project

Use uv to create a project by switching to the directory on your development machine where you want to create the project and then running the following command:
uv init
5

Create a venv virtual environment

To isolate and manage your project’s code dependencies, from your project directory, use uv to create a virtual environment with venv by running one of the following commands:
# Create the virtual environment by using the current Python version:
uv venv
# Or, if you want to use a specific Python version:
uv venv --python 3.12
6

Activate the virtual environment

To activate the venv virtual environment, run one of the following commands.
On macOS or Linux:
  • For bash or zsh, run source .venv/bin/activate
  • For fish, run source .venv/bin/activate.fish
  • For csh or tcsh, run source .venv/bin/activate.csh
  • For pwsh, run .venv/bin/Activate.ps1
On Windows:
  • For cmd.exe, run .venv\Scripts\activate.bat
  • For PowerShell, run .venv\Scripts\Activate.ps1
To deactivate the virtual environment at any time, run deactivate.
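To confirm from Python that the virtual environment is active, one option is to compare the interpreter's prefixes. This is a minimal sketch that relies on standard venv behavior (sys.prefix differs from sys.base_prefix inside a venv); the function name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("Environment prefix:", sys.prefix)
```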
7

Install the Unstructured open source library

With the virtual environment activated to enable code dependency isolation and management, use uv to install the Unstructured open source library by running the following command:
uv add unstructured
The preceding command supports plain text files (.txt), HTML files (.html), XML files (.xml), and emails (.eml, .msg, and .p7s) without any additional dependencies. To work with other file types, you must also install additional dependencies, as follows, replacing <extra> with the appropriate extra for the target file type:
uv add "unstructured[<extra>]"
The following file type extras are available:
  • all-docs (for all supported file types in this list)
  • csv (for .csv files only)
  • docx (for .doc and .docx files only)
  • epub (for .epub files only)
  • image (for all supported image file types: .bmp, .heic, .jpeg, .png, and .tiff)
  • md (for .md files only)
  • odt (for .odt files only)
  • org (for .org files only)
  • pdf (for .pdf files only)
  • pptx (for .ppt and .pptx files only)
  • rst (for .rst files only)
  • rtf (for .rtf files only)
  • tsv (for .tsv files only)
  • xlsx (for .xls and .xlsx files only)
As this quickstart uses a sample PDF file, run the following command:
uv add "unstructured[pdf]"
Note that you can install multiple extras at the same time by separating them with commas, for example:
uv add "unstructured[pdf,docx]"
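When a project handles several file types, it can be convenient to derive the install command from the file names. The following is a minimal sketch: the extension-to-extra mapping is copied from the list above, while the helper name install_command and the error handling are hypothetical and not part of uv or Unstructured.

```python
from pathlib import Path

# Extension-to-extra mapping, copied from the list of extras above.
EXTRA_FOR_EXTENSION = {
    ".csv": "csv", ".doc": "docx", ".docx": "docx", ".epub": "epub",
    ".bmp": "image", ".heic": "image", ".jpeg": "image", ".png": "image",
    ".tiff": "image", ".md": "md", ".odt": "odt", ".org": "org",
    ".pdf": "pdf", ".ppt": "pptx", ".pptx": "pptx", ".rst": "rst",
    ".rtf": "rtf", ".tsv": "tsv", ".xls": "xlsx", ".xlsx": "xlsx",
}

# File types that the base package handles without any extra.
NO_EXTRA_NEEDED = {".txt", ".html", ".xml", ".eml", ".msg", ".p7s"}


def install_command(filenames):
    """Build a `uv add` command covering the extras the given files need."""
    extras = set()
    for name in filenames:
        extension = Path(name).suffix.lower()
        if extension in NO_EXTRA_NEEDED:
            continue
        if extension not in EXTRA_FOR_EXTENSION:
            raise ValueError(f"No known extra for file type: {extension!r}")
        extras.add(EXTRA_FOR_EXTENSION[extension])
    if not extras:
        return "uv add unstructured"
    joined = ",".join(sorted(extras))
    return f'uv add "unstructured[{joined}]"'


if __name__ == "__main__":
    print(install_command(["report.pdf", "notes.docx", "readme.txt"]))
```

The extras are sorted only to make the output deterministic; their order inside the brackets does not matter to the installer.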
8

Install system dependencies

For maximum compatibility, you should also install the following system dependencies:
  • libmagic-dev (for filetype detection)
  • poppler-utils and tesseract-ocr (for images and PDFs), and tesseract-lang (for additional language support)
  • libreoffice (for Microsoft Office documents)
  • pandoc (for .epub, .odt, and .rtf files. For .rtf files, you must have version 2.14.2 or newer. Running this script will install the correct version for you.)
Installation instructions for these system dependencies vary by operating system type. For details, follow the preceding links or see your operating system’s documentation.
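After installing the system dependencies, you may want a quick way to see which of them are reachable on your PATH. The following is a rough sketch using only the Python standard library. The executable names (pdftoppm for poppler-utils, tesseract, soffice for LibreOffice, and pandoc) are assumptions about how these packages are typically exposed and may differ on your system; libmagic is a shared library rather than an executable, so it is not covered here.

```python
import shutil

# Package name -> executable assumed to be provided by that package.
# These executable names are assumptions, not taken from the Unstructured docs.
ASSUMED_EXECUTABLES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_system_dependencies(executables=ASSUMED_EXECUTABLES):
    """Map each package to the resolved executable path, or None if not found on PATH."""
    return {package: shutil.which(exe) for package, exe in executables.items()}


if __name__ == "__main__":
    for package, path in find_system_dependencies().items():
        print(f"{package}: {path or 'not found on PATH'}")
```

A missing entry does not necessarily mean the dependency is absent; it may simply be installed under a different name or outside PATH.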
9

Download the sample PDF file

Download the sample PDF file named layout-parser-paper.pdf from the following location to your local development machine: https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf (You can also use any other PDF file that you want to work with instead of this sample file, if you prefer.)
10

Add the Python code

In the project’s main.py file, add the following Python code, replacing <path/to> with the path to the layout-parser-paper.pdf file that you downloaded to your local development machine. (If you want to use a different PDF file, replace layout-parser-paper with the name of that PDF file instead.)
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

file_path = "<path/to>"
base_file_name = "layout-parser-paper"

def main():
    elements = partition_pdf(filename=f"{file_path}/{base_file_name}.pdf")
    elements_to_json(elements=elements, filename=f"{file_path}/{base_file_name}-output.json")

if __name__ == "__main__":
    main()
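As a variation on the preceding code, the output file name can be derived from the input path with pathlib instead of string formatting. This is a sketch under the same assumptions as the quickstart code: it uses the same partition_pdf and elements_to_json calls, while the helper names output_path_for and partition_to_json are hypothetical. The Unstructured imports are placed inside the function so that the path helper can be used on its own.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Return the sibling '<name>-output.json' path for a given PDF path."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(f"{pdf_path.stem}-output.json")


def partition_to_json(pdf_path):
    """Partition a PDF and write its elements as JSON next to the source file."""
    # Imported here so output_path_for works even without the library installed.
    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    output_path = output_path_for(pdf_path)
    elements = partition_pdf(filename=str(pdf_path))
    elements_to_json(elements=elements, filename=str(output_path))
    return output_path
```

For example, calling partition_to_json with the path to layout-parser-paper.pdf writes layout-parser-paper-output.json in the same directory, matching the naming used in this quickstart.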
11

Run the Python code

Use uv to run the preceding Python code by running the following command:
uv run main.py
It might take a few minutes for the command to finish.
12

View the output

After the command finishes running successfully, view the Unstructured elements and metadata that were generated by opening the layout-parser-paper-output.json file in your editor. This file will be in the same location as the original layout-parser-paper.pdf file. (If you used a different PDF file, the output file will be named <your-file-name>-output.json instead.)
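Besides opening the file in an editor, you can get a quick overview of the output with a few lines of Python. The following sketch assumes the output file is a JSON array of objects that each carry a "type" field alongside fields such as "element_id", "text", and "metadata"; the sample data below is hand-written for illustration and is not real output from the sample PDF. The function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_elements(json_path):
    """Read the list of element dictionaries from an output JSON file."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def count_element_types(elements):
    """Count how many elements of each type appear in the list."""
    return Counter(element.get("type", "Unknown") for element in elements)


# Hand-written sample in the assumed shape (illustrative only).
sample = [
    {"type": "Title", "element_id": "a1", "text": "Example title",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "element_id": "b2", "text": "First paragraph.",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "element_id": "c3", "text": "Second paragraph.",
     "metadata": {"page_number": 2}},
]

for element_type, count in count_element_types(sample).most_common():
    print(f"{element_type}: {count}")
```

To summarize your own output, pass the result of load_elements for your -output.json file to count_element_types instead of the sample list.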

Next steps

  • Learn more about the available partition functions in addition to partition_pdf for converting other types of files into standard Unstructured document elements and metadata.
  • By default, the preceding example uses the auto partitioning strategy. Learn about other available partitioning strategies for fine-tuned approaches to converting different types of files into Unstructured document elements.
  • Learn about available chunking functions for splitting up the text in your document elements into manageable chunks as needed to fit into your models’ limited context windows.
  • Learn about available cleaning functions for cleaning up your document elements’ data as needed.
  • Learn about available extraction functions for getting precise information out of your document elements as needed.
  • Learn about how to generate vector embeddings for the text in your document elements for use in RAG applications, AI agents, model fine-tuning tasks, and more.
  • For an additional code example, see the Unstructured Quick Tour Google Colab notebook.
  • The Unstructured open source library is also available as a Docker container.
  • The Unstructured Ingest CLI and Unstructured Ingest Python library build upon the Unstructured open source library by providing additional functionality such as batch file processing, ingesting files from remote source locations and sending the processed files’ data to remote destination locations, creating programmatic ETL pipelines, optionally processing files on Unstructured-hosted compute resources instead of locally for improved performance and quality on a pay-as-you-go basis, and more.
  • The Unstructured user interface (UI) and Unstructured API are superior to the Unstructured open source library, the Unstructured Ingest CLI, and the Unstructured Ingest Python library. The Unstructured UI and API are designed for production scenarios, with significantly increased performance and quality, the latest OCR and vision language models, advanced chunking strategies, security compliance, multi-user account management, job scheduling and monitoring, self-hosted deployment options, and more on a pay-as-you-go or subscription basis.
