
Microsoft OneDrive

Microsoft OneDrive (formerly SkyDrive) is a file hosting service operated by Microsoft.

This notebook covers how to load documents from OneDrive. By default the document loader loads pdf, doc, docx and txt files. You can load other file types by providing appropriate parsers (see more below).

Prerequisites

  1. Register an application by following the Microsoft identity platform instructions.
  2. When registration finishes, the Azure portal displays the app registration's Overview pane, which shows the Application (client) ID. Also called the client ID, this value uniquely identifies your application in the Microsoft identity platform.
  3. While following the steps in item 1, you can set the redirect URI to http://localhost:8000/callback
  4. While following the steps in item 1, generate a new password (client_secret) under the Application Secrets section.
  5. Follow the instructions in this document to add the following SCOPES (offline_access and Files.Read.All) to your application.
  6. Visit the Graph Explorer Playground to obtain your OneDrive ID. First make sure you are logged in with the account associated with your OneDrive account. Then make a request to https://graph.microsoft.com/v1.0/me/drive; the response returns a payload with a field id that holds the ID of your OneDrive account.
  7. Install the o365 package using the command pip install o365.
  8. At the end of these steps you must have the following values:
     • CLIENT_ID
     • CLIENT_SECRET
     • DRIVE_ID
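
As a rough illustration of step 6, the sketch below pulls the id field out of a /me/drive response body. The helper name extract_drive_id and the sample payload are hypothetical; a real response from Graph Explorer contains many more fields, and this is not part of the loader itself.

```python
import json

GRAPH_DRIVE_URL = "https://graph.microsoft.com/v1.0/me/drive"


def extract_drive_id(response_body: str) -> str:
    """Return the `id` field from a /me/drive JSON response body."""
    payload = json.loads(response_body)
    drive_id = payload.get("id")
    if not drive_id:
        raise ValueError("response has no 'id' field; check that you are signed in")
    return drive_id


# Hypothetical, trimmed response body used only for illustration.
sample = '{"id": "EXAMPLE-DRIVE-ID", "driveType": "personal"}'
DRIVE_ID = extract_drive_id(sample)
print(DRIVE_ID)
```

The value stored in DRIVE_ID is what the later examples pass as drive_id.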

🧑 Instructions for ingesting your documents from OneDrive

🔑 Authentication

By default, the OneDriveLoader expects the values of CLIENT_ID and CLIENT_SECRET to be stored in environment variables named O365_CLIENT_ID and O365_CLIENT_SECRET respectively. You can provide those environment variables through a .env file at the root of your application or by using the following commands in your script.

import os

os.environ["O365_CLIENT_ID"] = "YOUR CLIENT ID"
os.environ["O365_CLIENT_SECRET"] = "YOUR CLIENT SECRET"
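
Before creating the loader, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_credentials is an assumption for illustration and is not part of LangChain or o365.

```python
import os

REQUIRED_VARS = ("O365_CLIENT_ID", "O365_CLIENT_SECRET")


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these variables before creating the loader:", ", ".join(missing))
```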

This loader uses an authentication flow called on behalf of a user. It is a two-step authentication with user consent. When you instantiate the loader, it prints a URL that the user must visit to grant the app the required permissions. The user must then copy the URL of the resulting page and paste it back into the console. The login method then returns True if the attempt was successful.

from langchain_community.document_loaders.onedrive import OneDriveLoader

loader = OneDriveLoader(drive_id="YOUR DRIVE ID")
API Reference: OneDriveLoader

Once authentication is complete, the loader stores a token (o365_token.txt) in the ~/.credentials/ folder. This token can be used later to authenticate without the copy/paste steps explained earlier. To use it, set the auth_with_token parameter to True when instantiating the loader.

from langchain_community.document_loaders.onedrive import OneDriveLoader

loader = OneDriveLoader(drive_id="YOUR DRIVE ID", auth_with_token=True)
API Reference: OneDriveLoader
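
One way to decide the value of auth_with_token is to check whether the token file described above already exists. This is a sketch under that assumption; has_cached_token is a hypothetical helper, and it only checks that the file is present, not that the token is still valid.

```python
from pathlib import Path


def has_cached_token(credentials_dir=None):
    """Return True if o365_token.txt exists in the credentials folder."""
    base = Path(credentials_dir) if credentials_dir else Path.home() / ".credentials"
    return (base / "o365_token.txt").is_file()


# The result could be passed when creating the loader, for example:
# OneDriveLoader(drive_id="YOUR DRIVE ID", auth_with_token=has_cached_token())
use_token = has_cached_token()
print("auth_with_token =", use_token)
```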

🗂️ Documents loader

📑 Loading documents from a OneDrive Directory

OneDriveLoader can load documents from a specific folder within your OneDrive. For instance, suppose you want to load all documents stored in the Documents/clients folder within your OneDrive.

from langchain_community.document_loaders.onedrive import OneDriveLoader

loader = OneDriveLoader(drive_id="YOUR DRIVE ID", folder_path="Documents/clients", auth_with_token=True)
documents = loader.load()
API Reference: OneDriveLoader

📑 Loading documents from a list of Document IDs

Another possibility is to provide a list of object_id values, one for each document you want to load. For that, you will need to query the Microsoft Graph API to find the IDs of all the documents you are interested in. This link provides a list of endpoints that are helpful for retrieving document IDs.

For instance, to retrieve information about all objects stored at the root of the drive, you need to make a request to: https://graph.microsoft.com/v1.0/drives/{YOUR DRIVE ID}/root/children. Once you have the list of IDs you are interested in, you can instantiate the loader with the following parameters.

from langchain_community.document_loaders.onedrive import OneDriveLoader

loader = OneDriveLoader(drive_id="YOUR DRIVE ID", object_ids=["ID_1", "ID_2"], auth_with_token=True)
documents = loader.load()
API Reference: OneDriveLoader
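
The Graph request mentioned above returns a JSON payload whose value array holds one entry per item. The sketch below builds the request URL and collects the IDs of file items from such a payload. The helper names and the sample payload are hypothetical, and the code assumes file entries carry a file facet while folders carry a folder facet.

```python
import json


def children_url(drive_id: str) -> str:
    """Build the Graph URL that lists the children of the drive root."""
    return f"https://graph.microsoft.com/v1.0/drives/{drive_id}/root/children"


def file_ids(response_body: str) -> list:
    """Collect the IDs of file items, skipping folders."""
    items = json.loads(response_body).get("value", [])
    return [item["id"] for item in items if "file" in item]


# Hypothetical, trimmed response body used only for illustration.
sample = json.dumps(
    {
        "value": [
            {"id": "ID_1", "name": "report.pdf", "file": {"mimeType": "application/pdf"}},
            {"id": "ID_2", "name": "clients", "folder": {"childCount": 3}},
            {"id": "ID_3", "name": "notes.txt", "file": {"mimeType": "text/plain"}},
        ]
    }
)
print(children_url("YOUR DRIVE ID"))
print(file_ids(sample))
```

The resulting list can be passed to the loader as object_ids.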

📑 Choosing supported file types and preferred parsers

By default OneDriveLoader loads the file types defined in document_loaders/parsers/registry using the default parsers (see below).

def _get_default_parser() -> BaseBlobParser:
    """Get default mime-type based parser."""
    return MimeTypeBasedParser(
        handlers={
            "application/pdf": PyMuPDFParser(),
            "text/plain": TextParser(),
            "application/msword": MsWordParser(),
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
                MsWordParser()
            ),
        },
        fallback_parser=None,
    )

You can override this behavior by passing the handlers argument to OneDriveLoader. Pass a dictionary mapping either file extensions (like "doc", "pdf", etc.) or MIME types (like "application/pdf", "text/plain", etc.) to parsers. Note that you must use either file extensions or MIME types exclusively and cannot mix them.

Do not include the leading dot for file extensions.

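from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.pdf import PDFMinerParser

# using file extensions:
handlers = {
    "doc": MsWordParser(),
    "pdf": PDFMinerParser(),
    "mp3": OpenAIWhisperParser(),
}

# using MIME types:
handlers = {
    "application/msword": MsWordParser(),
    "application/pdf": PDFMinerParser(),
    "audio/mpeg": OpenAIWhisperParser(),
}

loader = OneDriveLoader(
    drive_id="...",
    handlers=handlers,  # pass handlers to OneDriveLoader
)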
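
The two rules above (no mixing of extensions and MIME types, no leading dot) can be checked before the dictionary is handed to the loader. This is a minimal sketch; handler_key_style is a hypothetical helper and is not the validation the loader itself performs. It treats any key containing "/" as a MIME type.

```python
def handler_key_style(handlers: dict) -> str:
    """Return "mime" or "extension"; raise if keys are mixed or malformed."""
    if not handlers:
        raise ValueError("handlers is empty")
    if any(key.startswith(".") for key in handlers):
        raise ValueError("file extensions must not include the leading dot")
    # A key with a slash is assumed to be a MIME type, otherwise an extension.
    mime_flags = {"/" in key for key in handlers}
    if len(mime_flags) > 1:
        raise ValueError("use either file extensions or MIME types, not both")
    return "mime" if mime_flags.pop() else "extension"


# Placeholder objects stand in for real parser instances.
print(handler_key_style({"doc": object(), "pdf": object()}))
print(handler_key_style({"application/pdf": object()}))
```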

In case multiple file extensions map to the same MIME type, the last dictionary item will apply. Example:

# 'jpg' and 'jpeg' both map to 'image/jpeg' MIME type. SecondParser() will be used
# to parse all jpg/jpeg files.
handlers = {
    "jpg": FirstParser(),
    "jpeg": SecondParser(),
}
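
To see why the last item wins, consider a simplified model in which each extension is translated to its MIME type and stored in a new dictionary. This sketch is an assumption about the general mechanism, not the loader's actual code; EXTENSION_TO_MIME is a small illustrative table and the parser values are plain strings standing in for parser objects.

```python
# Illustrative lookup table; a real mapping covers many more extensions.
EXTENSION_TO_MIME = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "pdf": "application/pdf",
}


def to_mime_handlers(handlers: dict) -> dict:
    """Re-key extension handlers by MIME type; later entries overwrite earlier ones."""
    resolved = {}
    for extension, parser in handlers.items():
        resolved[EXTENSION_TO_MIME[extension]] = parser
    return resolved


handlers = {"jpg": "FirstParser", "jpeg": "SecondParser"}
print(to_mime_handlers(handlers))
```

Because Python dictionaries preserve insertion order, the entry written last for image/jpeg is the one that remains.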
