Microsoft SharePoint

Microsoft SharePoint is a website-based collaboration system, developed by Microsoft, that uses workflow applications, “list” databases, and other web parts and security features to help business teams work together.

This notebook covers how to load documents from the SharePoint Document Library. By default, the document loader loads pdf, doc, docx and txt files. You can load other file types by providing appropriate parsers (see more below).

Prerequisites

Register an application by following the Microsoft identity platform instructions.
When registration finishes, the Azure portal displays the app registration's Overview pane. You see the Application (client) ID. Also called the client ID, this value uniquely identifies your application in the Microsoft identity platform.
While following the registration steps in the first item, you can set the redirect URI to https://login.microsoftonline.com/common/oauth2/nativeclient
While following the registration steps in the first item, generate a new password (client_secret) under the Application Secrets section.
Follow the instructions in this document to add the following SCOPES (offline_access and Sites.Read.All) to your application.
To retrieve files from your Document Library, you will need its ID. To obtain it, you will need the values of Tenant Name, Collection ID, and Subsite ID.
To find your Tenant Name, follow the instructions in this document. Once you have it, remove .onmicrosoft.com from the value and keep the rest as your Tenant Name.
To obtain your Collection ID and Subsite ID, you will need your SharePoint site-name. Your SharePoint site URL has the following format: https://<tenant-name>.sharepoint.com/sites/<site-name>. The last part of this URL is the site-name.
To get the Site Collection ID, open this URL in the browser: https://<tenant>.sharepoint.com/sites/<site-name>/_api/site/id and copy the value of the Edm.Guid property.
To get the Subsite ID (or web ID), use: https://<tenant>.sharepoint.com/sites/<site-name>/_api/web/id and copy the value of the Edm.Guid property.
The SharePoint site ID has the following format: <tenant-name>.sharepoint.com,<Collection ID>,<subsite ID>. Keep that value for use in the next step.
Visit the Graph Explorer Playground to obtain your Document Library ID. First, make sure you are logged in with the account associated with your SharePoint site. Then make a request to https://graph.microsoft.com/v1.0/sites/<SharePoint site ID>/drive. The response returns a payload with a field id that holds the ID of your Document Library.
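The way the IDs above fit together can be sketched in a few lines of Python. This is only an illustration, not part of the loader: the helper names (tenant_name_from_domain, build_site_id, drive_url) are hypothetical, and the tenant and GUID values are placeholders.

```python
def tenant_name_from_domain(domain: str) -> str:
    """Strip the .onmicrosoft.com suffix to get the tenant name."""
    suffix = ".onmicrosoft.com"
    return domain[: -len(suffix)] if domain.endswith(suffix) else domain


def build_site_id(tenant_name: str, collection_id: str, subsite_id: str) -> str:
    """Assemble <tenant-name>.sharepoint.com,<Collection ID>,<subsite ID>."""
    return f"{tenant_name}.sharepoint.com,{collection_id},{subsite_id}"


def drive_url(site_id: str) -> str:
    """Graph URL whose response holds the Document Library ID in its `id` field."""
    return f"https://graph.microsoft.com/v1.0/sites/{site_id}/drive"


# Placeholder values for illustration only.
tenant = tenant_name_from_domain("contoso.onmicrosoft.com")
site_id = build_site_id(tenant, "<Collection ID>", "<subsite ID>")
print(drive_url(site_id))
```

The printed URL is the one to request in Graph Explorer once the placeholders are replaced with your own values.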

🧑 Instructions for ingesting your documents from SharePoint Document Library

🔑 Authentication

By default, the SharePointLoader expects the values of CLIENT_ID and CLIENT_SECRET to be stored as environment variables named O365_CLIENT_ID and O365_CLIENT_SECRET respectively. You can pass those environment variables through a .env file at the root of your application or by using the following commands in your script.

import os

os.environ['O365_CLIENT_ID'] = "YOUR CLIENT ID"
os.environ['O365_CLIENT_SECRET'] = "YOUR CLIENT SECRET"
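If you keep the credentials in a .env file instead, they have to be read into the environment before the loader is created. The sketch below is a simplified, standard-library-only reader written for illustration. The helper names parse_env_text and export_credentials are hypothetical, and a dedicated package such as python-dotenv is the more common choice in practice.

```python
import os

REQUIRED = ("O365_CLIENT_ID", "O365_CLIENT_SECRET")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def export_credentials(values: dict) -> list:
    """Copy the required variables into os.environ; return the names that are missing."""
    missing = [name for name in REQUIRED if not values.get(name)]
    for name in REQUIRED:
        if values.get(name):
            os.environ[name] = values[name]
    return missing
```

With a .env file at the application root, `export_credentials(parse_env_text(open(".env").read()))` returns an empty list when both variables were found.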

This loader uses an authentication flow called on behalf of a user. It is a two-step authentication with user consent. When you instantiate the loader, it prints a URL that the user must visit to grant the app the required permissions. After giving consent, the user must copy the URL of the resulting page and paste it back into the console. The method then returns True if the login attempt was successful.

from langchain_community.document_loaders.sharepoint import SharePointLoader

loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID")

API Reference: SharePointLoader

Once authentication has completed, the loader stores a token (o365_token.txt) in the ~/.credentials/ folder. This token can be used later to authenticate without the copy/paste steps explained earlier. To use this token for authentication, set the auth_with_token parameter to True when instantiating the loader.

from langchain_community.document_loaders.sharepoint import SharePointLoader

loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID", auth_with_token=True)

API Reference: SharePointLoader
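One way to decide between the two modes is to check whether a token from an earlier run is already on disk. The following is a minimal sketch that assumes the token location described above (~/.credentials/o365_token.txt). The helper name has_cached_token is hypothetical, and the check only tests that the file exists, not that the token is still valid.

```python
from pathlib import Path
from typing import Optional


def has_cached_token(home: Optional[Path] = None) -> bool:
    """Return True if o365_token.txt exists under <home>/.credentials/."""
    base = home if home is not None else Path.home()
    return (base / ".credentials" / "o365_token.txt").is_file()


# Skip the consent flow only when a token from an earlier run is present.
auth_with_token = has_cached_token()
print(auth_with_token)
```

The resulting boolean can then be passed as the auth_with_token argument of SharePointLoader.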

🗂️ Documents loader

📑 Loading documents from a Document Library Directory

SharePointLoader can load documents from a specific folder within your Document Library. For instance, suppose you want to load all documents stored in the Documents/marketing folder within your Document Library.

from langchain_community.document_loaders.sharepoint import SharePointLoader

loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID", folder_path="Documents/marketing", auth_with_token=True)
documents = loader.load()

API Reference: SharePointLoader

If you receive the error Resource not found for the segment, try using the folder_id instead of the folder path. The folder ID can be obtained from the Microsoft Graph API.

loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID", auth_with_token=True,
                          folder_id="<folder-id>")
documents = loader.load()
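To look up a folder ID, Microsoft Graph lets you address a drive item by its path relative to the drive root. The returned item's id field is the value to pass as folder_id. The sketch below only builds that request URL. The helper name folder_item_url is hypothetical, and sending the request (for example from Graph Explorer) still requires a signed-in session or access token.

```python
from urllib.parse import quote


def folder_item_url(document_library_id: str, folder_path: str) -> str:
    """Graph URL addressing a drive item by its path relative to the drive root."""
    clean = folder_path.strip("/")
    # quote() keeps "/" separators and percent-encodes spaces and other characters.
    return (
        "https://graph.microsoft.com/v1.0/drives/"
        f"{document_library_id}/root:/{quote(clean)}"
    )


print(folder_item_url("<document-library-id>", "Documents/marketing"))
```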

If you wish to load documents from the root directory, you can omit folder_id, folder_path and object_ids, and the loader will load the root directory.

# loads documents from root directory
loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID", auth_with_token=True)
documents = loader.load()

Combined with recursive=True, this loads all documents from the whole Document Library:

# loads documents from the root directory and, recursively, from all of its subfolders
loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID",
                          recursive=True,
                          auth_with_token=True)
documents = loader.load()

📑 Loading documents from a list of Document IDs

Another possibility is to provide a list of object_id values, one for each document you want to load. For that, you will need to query the Microsoft Graph API to find the IDs of all the documents you are interested in. This link provides a list of endpoints that are helpful for retrieving the document IDs.

For instance, to retrieve information about all objects stored in the data/finance/ folder, you need to make a request to: https://graph.microsoft.com/v1.0/drives/<document-library-id>/root:/data/finance:/children. Once you have the list of IDs you are interested in, you can instantiate the loader with the following parameters.

from langchain_community.document_loaders.sharepoint import SharePointLoader

loader = SharePointLoader(document_library_id="YOUR DOCUMENT LIBRARY ID", object_ids=["ID_1", "ID_2"], auth_with_token=True)
documents = loader.load()

API Reference: SharePointLoader
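The list passed as object_ids can be pulled out of the /children response mentioned above. The sketch below works on an already-decoded JSON payload. The helper name file_ids_from_children is hypothetical and the sample payload is made up, trimmed to the fields used here (Graph marks files with a file facet and folders with a folder facet).

```python
def file_ids_from_children(payload: dict) -> list:
    """Collect the `id` of every file entry in a /children response."""
    return [
        item["id"]
        for item in payload.get("value", [])
        if "file" in item  # folders carry a "folder" facet instead
    ]


# Trimmed, made-up response in the shape returned for /children.
sample = {
    "value": [
        {"id": "ID_1", "name": "q1.pdf", "file": {"mimeType": "application/pdf"}},
        {"id": "ID_2", "name": "q2.docx", "file": {}},
        {"id": "ID_3", "name": "archive", "folder": {"childCount": 4}},
    ]
}
print(file_ids_from_children(sample))  # ['ID_1', 'ID_2']
```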

📑 Choosing supported file types and preferred parsers

By default, SharePointLoader loads the file types defined in document_loaders/parsers/registry using the default parsers (see below).

def _get_default_parser() -> BaseBlobParser:
    """Get default mime-type based parser."""
    return MimeTypeBasedParser(
        handlers={
            "application/pdf": PyMuPDFParser(),
            "text/plain": TextParser(),
            "application/msword": MsWordParser(),
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
                MsWordParser()
            ),
        },
        fallback_parser=None,
    )

You can override this behavior by passing the handlers argument to SharePointLoader. Pass a dictionary mapping either file extensions (like "doc", "pdf", etc.) or MIME types (like "application/pdf", "text/plain", etc.) to parsers. Note that you must use either file extensions or MIME types exclusively and cannot mix them.

Do not include the leading dot for file extensions.

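from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser
from langchain_community.document_loaders.sharepoint import SharePointLoader

# using file extensions:
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "mp3": OpenAIWhisperParser(),
}

# using MIME types (an alternative to the dictionary above):
handlers = {
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    "audio/mpeg": OpenAIWhisperParser(),
}

loader = SharePointLoader(document_library_id="...",
                          handlers=handlers)  # pass handlers to SharePointLoader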
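The two key rules above (no mixing of extensions and MIME types, no leading dot) can be checked before the dictionary reaches the loader. The following is a minimal sketch with a hypothetical helper, handler_key_kind. It is not the loader's own validation, and it assumes that any key containing "/" is meant as a MIME type.

```python
def handler_key_kind(handlers: dict) -> str:
    """Return "extension" or "mime"; raise ValueError on empty, dotted or mixed keys."""
    if not handlers:
        raise ValueError("handlers is empty")
    kinds = set()
    for key in handlers:
        if key.startswith("."):
            raise ValueError(f"drop the leading dot: {key!r}")
        # Assumption: keys containing "/" are MIME types, everything else is an extension.
        kinds.add("mime" if "/" in key else "extension")
    if len(kinds) > 1:
        raise ValueError("use either file extensions or MIME types, not both")
    return kinds.pop()


# Placeholder strings stand in for parser objects.
print(handler_key_kind({"doc": "parser", "pdf": "parser"}))  # extension
```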

In case multiple file extensions map to the same MIME type, the last dictionary item will apply. Example:

# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used
# to parse all jpg/jpeg files.
handlers = {
    "jpg": FirstParser(),
    "jpeg": SecondParser(),
}
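The "last item wins" behavior follows from re-keying the dictionary by MIME type. The sketch below reproduces the effect with Python's standard mimetypes table. It assumes, for illustration only, that extensions are resolved in a similar way, and the loader's actual resolution logic may differ. The helper name collapse_to_mime is hypothetical, and strings stand in for parser objects.

```python
import mimetypes

# A fresh MimeTypes instance uses only the module's built-in table.
_table = mimetypes.MimeTypes()


def collapse_to_mime(handlers: dict) -> dict:
    """Re-key extension handlers by MIME type; later entries overwrite earlier ones."""
    by_mime = {}
    for ext, parser in handlers.items():
        mime, _ = _table.guess_type(f"file.{ext}")
        if mime is not None:
            by_mime[mime] = parser
    return by_mime


print(collapse_to_mime({"jpg": "FirstParser", "jpeg": "SecondParser"}))
# {'image/jpeg': 'SecondParser'}
```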

Related

Document loader conceptual guide
Document loader how-to guides
