Microsoft PowerPoint

Microsoft PowerPoint is a presentation program by Microsoft.

This covers how to load Microsoft PowerPoint documents into a document format that we can use downstream.

Please see this guide for more instructions on setting up Unstructured locally, including setting up required system dependencies.

# Install packages
%pip install unstructured
%pip install python-magic
%pip install python-pptx

from langchain_community.document_loaders import UnstructuredPowerPointLoader

loader = UnstructuredPowerPointLoader("./example_data/fake-power-point.pptx")

data = loader.load()

data

API Reference: UnstructuredPowerPointLoader

[Document(page_content='Adding a Bullet Slide\n\nFind the bullet slide layout\n\nUse _TextFrame.text for first bullet\n\nUse _TextFrame.add_paragraph() for subsequent bullets\n\nHere is a lot of text!\n\nHere is some text in a text box!', metadata={'source': './example_data/fake-power-point.pptx'})]

Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements".

loader = UnstructuredPowerPointLoader(
    "./example_data/fake-power-point.pptx", mode="elements"
)

data = loader.load()

data[0]

Document(page_content='Adding a Bullet Slide', metadata={'source': './example_data/fake-power-point.pptx', 'category_depth': 0, 'file_directory': './example_data', 'filename': 'fake-power-point.pptx', 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title'})
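
Because each element carries its own metadata, the separated output can be regrouped by slide. The following is a minimal sketch, not part of the loader: `Element` is a hypothetical stand-in for a LangChain Document (it has the same `page_content` and `metadata` attributes), `group_by_slide` is an illustrative helper, and the second and third sample entries are made up for the demonstration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_slide(elements):
    """Map each page_number to the list of (category, text) pairs on that slide."""
    slides = defaultdict(list)
    for el in elements:
        page = el.metadata.get("page_number")  # None if the key is absent
        category = el.metadata.get("category", "Unknown")
        slides[page].append((category, el.page_content))
    return dict(slides)


# Sample data modeled on the output above; entries two and three are illustrative.
sample = [
    Element("Adding a Bullet Slide", {"page_number": 1, "category": "Title"}),
    Element("Find the bullet slide layout", {"page_number": 1, "category": "ListItem"}),
    Element("Here is a lot of text!", {"page_number": 2, "category": "NarrativeText"}),
]

slides = group_by_slide(sample)
titles = [text for category, text in slides[1] if category == "Title"]
print(titles)  # ['Adding a Bullet Slide']
```

With real loader output, passing `data` in place of `sample` should work the same way, assuming every element exposes `page_number` and `category` as in the example above.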

Using Azure AI Document Intelligence

Azure AI Document Intelligence (formerly known as Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structures (e.g., titles and section headings), and key-value pairs from digital or scanned PDFs, images, Office files, and HTML files.
Document Intelligence supports PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, and HTML.

The current loader implementation using Document Intelligence can incorporate content page by page and turn it into LangChain documents. The default output format is Markdown, which can be easily chained with MarkdownHeaderTextSplitter for semantic document chunking. You can also use mode="single" to return plain text as a single document, or mode="page" to return plain text split by page.
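
To illustrate what header-based chunking of Markdown output means, here is a simplified, standard-library-only sketch. It is not the implementation of MarkdownHeaderTextSplitter and handles fewer cases (for example, it does not skip `#` lines inside fenced code). The function name `split_by_headers` and the sample Markdown are assumptions made up for this example, not actual Document Intelligence output.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text):
    """Split Markdown into (header_path, body) chunks, one per headed section."""
    chunks = []
    path = []  # stack of (level, title) describing the current header hierarchy
    body = []  # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append((" > ".join(title for _, title in path), text))
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = (
    "# Deck\n\nIntro text.\n\n"
    "## Slide 1\n\n- first bullet\n- second bullet\n\n"
    "## Slide 2\n\nClosing text."
)
chunks = split_by_headers(sample)
for header, text in chunks:
    print(header, "->", repr(text))
```

Each chunk keeps the chain of headers it sits under, which is the property that makes header-based splitting useful for retrieval: a chunk of bullets stays attached to the title of its section.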

Prerequisite

An Azure AI Document Intelligence resource in one of the 3 preview regions: East US, West US2, or West Europe. Follow this document to create one if you don't have one. You will pass <endpoint> and <key> as parameters to the loader.
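
One way to avoid hard-coding the endpoint and key is to read them from the environment and fail early if either is missing. This is a sketch under stated assumptions: the helper `read_azure_credentials` and both environment variable names are hypothetical, and neither the loader nor Azure requires them.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
ENDPOINT_VAR = "AZURE_DOC_INTEL_ENDPOINT"
KEY_VAR = "AZURE_DOC_INTEL_KEY"


def read_azure_credentials(env=os.environ):
    """Return (endpoint, key) from a mapping, raising if either value is missing."""
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()
    missing = [name for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key)) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return endpoint, key


# Demonstration with placeholder values instead of the real environment.
endpoint, key = read_azure_credentials(
    {ENDPOINT_VAR: "https://example.invalid/", KEY_VAR: "placeholder-key"}
)
print(endpoint)
```

The returned values can then be passed as `api_endpoint` and `api_key` when constructing AzureAIDocumentIntelligenceLoader.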

%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)

documents = loader.load()

API Reference: AzureAIDocumentIntelligenceLoader

Related

Document loader conceptual guide
Document loader how-to guides
