Prerequisites
The following steps use uv and venv. However, uv and venv are not required to use the Unstructured open source library.
The following steps use uv to install Python if needed.
The following steps use a sample PDF file named layout-parser-paper.pdf, which you can download in a later step. (The Unstructured open source library provides support for additional file types as well.)
Install uv
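To install uv, run one of the following commands, depending on your operating system.
On macOS or Linux, use curl to download the install script and run it with sh:
    curl -LsSf https://astral.sh/uv/install.sh | sh
If curl is not available, use wget with sh instead:
    wget -qO- https://astral.sh/uv/install.sh | sh
On Windows, use irm to download the script and run it with iex:
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.
Install Python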
uv will detect and use Python if you already have it installed. To view a list of installed Python versions, run the following command:
    uv python list
If Python is not installed, you can install a version of Python for use with uv by running the following command. For example, this command installs Python 3.12 for use with uv:
    uv python install 3.12
Create a uv project
Use uv to create a project by switching to the directory on your development machine where you want to create the project and then running the following command:
    uv init
Create a venv virtual environment
Use uv to create a virtual environment with venv by running one of the following commands:
    # Create the virtual environment by using the current Python version:
    uv venv
    # Or, if you want to use a specific Python version:
    uv venv --python 3.12
Activate the virtual environment
To activate the venv virtual environment, run one of the following commands, depending on your shell:
For bash or zsh, run source .venv/bin/activate
For fish, run source .venv/bin/activate.fish
For csh or tcsh, run source .venv/bin/activate.csh
For pwsh, run .venv/bin/Activate.ps1
For cmd.exe, run .venv\Scripts\activate.bat
For PowerShell, run .venv\Scripts\Activate.ps1
If you need to deactivate the virtual environment at any time, run deactivate.
Install the Unstructured open source library
Use uv to install the Unstructured open source library by running the following command:
    uv add unstructured
This installation supports plain text files (.txt), HTML files (.html), XML files (.xml), and emails (.eml, .msg, and .p7s) without any additional dependencies.
To work with other file types, you must also install these dependencies, as follows, replacing <extra> with the appropriate extra for the target file type:
    uv add "unstructured[<extra>]"
The available extras are:
all-docs (for all supported file types in this list)
csv (for .csv files only)
docx (for .doc and .docx files only)
epub (for .epub files only)
image (for all supported image file types: .bmp, .heic, .jpeg, .png, and .tiff)
md (for .md files only)
odt (for .odt files only)
org (for .org files only)
pdf (for .pdf files only)
pptx (for .ppt and .pptx files only)
rst (for .rst files only)
rtf (for .rtf files only)
tsv (for .tsv files only)
xlsx (for .xls and .xlsx files only)
For example, to work with PDF files only:
    uv add "unstructured[pdf]"
To install multiple extras at the same time, separate them with commas. For example, to work with both PDF and Word files:
    uv add "unstructured[pdf,docx]"
Install system dependencies
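You must also install any system dependencies that your target file types require, including:
tesseract-lang (for additional language support)
pandoc (for .epub, .odt, and .rtf files. For .rtf files, you must have version 2.14.2 or newer. Running this script will install the correct version for you.)
Download the sample PDF file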
Download the sample PDF file named layout-parser-paper.pdf from the following location to your local development machine:
https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf
(You can also use any other PDF file that you want to work with instead of this sample file, if you prefer.)
Add the Python code
In your project's main.py file, add the following Python code, replacing <path/to> with the path to the directory that contains the layout-parser-paper.pdf file that you downloaded to your local development machine. (If you want to use a different PDF file, replace layout-parser-paper with the name of that PDF file instead, without the .pdf extension.)
    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    # Directory that contains the PDF file, and the file's name without its extension.
    file_path = "<path/to>"
    base_file_name = "layout-parser-paper"

    def main():
        # Partition the PDF into Unstructured document elements.
        elements = partition_pdf(filename=f"{file_path}/{base_file_name}.pdf")
        # Write the elements as JSON next to the original PDF.
        elements_to_json(elements=elements, filename=f"{file_path}/{base_file_name}-output.json")

    if __name__ == "__main__":
        main()
Run the Python code
Use uv to run the preceding Python code by running the following command:
    uv run main.py
View the output
After the code finishes running, open the layout-parser-paper-output.json file in your editor. This file will be in the same location as the original layout-parser-paper.pdf file. (If you used a different PDF file, the output file will be named <your-file-name>-output.json instead.)
Learn about other partitioning functions in addition to partition_pdf for converting other types of files into standard Unstructured document elements and metadata.
The preceding code uses the auto partitioning strategy. Learn about other available partitioning strategies for fine-tuned approaches to converting different types of files into Unstructured document elements.
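As an alternative to reading the output file in an editor, the following minimal sketch summarizes it with the Python standard library only. It assumes that the output file is a JSON array of objects that each carry "type" and "text" keys. The helper names summarize_elements and preview_text, and the relative path in the main block, are hypothetical and not part of the Unstructured library.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_elements(json_path):
    """Count elements by their "type" field in an Unstructured JSON output file."""
    # Assumption: the file holds a JSON array of objects, each with a "type" key.
    elements = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return Counter(element.get("type", "Unknown") for element in elements)


def preview_text(json_path, limit=5, width=80):
    """Return the first `limit` non-empty text snippets, cut to `width` characters."""
    elements = json.loads(Path(json_path).read_text(encoding="utf-8"))
    texts = [e.get("text", "") for e in elements if e.get("text")]
    return [text[:width] for text in texts[:limit]]


if __name__ == "__main__":
    # Hypothetical location: point this at your own output file.
    output_file = Path("layout-parser-paper-output.json")
    if output_file.exists():
        # Print element types from most to least common, then a short text preview.
        for element_type, count in summarize_elements(output_file).most_common():
            print(f"{element_type}: {count}")
        for snippet in preview_text(output_file):
            print(snippet)
    else:
        print(f"{output_file} not found; run main.py first.")
```

Counting by type gives a quick view of how the PDF was split into elements, and the text preview helps confirm that the extracted content looks reasonable before you process it further.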