How to create a custom Document Loader

Overview

Applications based on LLMs frequently entail extracting data from databases or files, like PDFs, and converting it into a format that LLMs can utilize. In LangChain, this usually involves creating Document objects, which encapsulate the extracted text (page_content) along with metadata—a dictionary containing details about the document, such as the author's name or the date of publication.

Document objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the Document to generate a desired response (e.g., summarizing the document). Documents can be either used immediately or indexed into a vectorstore for future retrieval and use.
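As a rough illustration of that flow, the sketch below formats a document into a summarization prompt. It uses a stand-in Document dataclass rather than the LangChain class so that it runs without any installation; the format_prompt helper and the prompt wording are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a LangChain Document (assumption: only these two fields matter here)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_prompt(doc: Document) -> str:
    """Build a summarization prompt from a document (hypothetical helper)."""
    author = doc.metadata.get("author", "unknown")
    return (
        f"Summarize the following text written by {author}:\n\n"
        f"{doc.page_content}"
    )


doc = Document(
    page_content="Cats sleep most of the day.",
    metadata={"author": "A. Writer"},
)
prompt = format_prompt(doc)
print(prompt)
```

The metadata travels with the text, so the prompt can cite the author without re-reading the original source.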

The main abstractions for Document Loading are:

Component	Description
Document	Contains `text` and `metadata`
BaseLoader	Used to convert raw data into `Documents`
Blob	A representation of binary data that's located either in a file or in memory
BaseBlobParser	Logic to parse a `Blob` to yield `Document` objects

This guide will demonstrate how to write custom document loading and file parsing logic; specifically, we'll see how to:

Create a standard document Loader by sub-classing from BaseLoader.
Create a parser using BaseBlobParser and use it in conjunction with Blob and BlobLoaders. This is useful primarily when working with files.

Standard Document Loader

A document loader can be implemented by sub-classing from BaseLoader, which provides a standard interface for loading documents.

Interface

Method Name	Explanation
lazy_load	Used to load documents one by one lazily. Use for production code.
alazy_load	Async variant of `lazy_load`
load	Used to load all the documents into memory eagerly. Use for prototyping or interactive work.
aload	Async variant of `load`: loads all the documents into memory eagerly. Use for prototyping or interactive work. Added in 2024-04 to LangChain.

The load method is a convenience method meant solely for prototyping work; it just invokes list(self.lazy_load()).
The alazy_load method has a default implementation that delegates to lazy_load. If you're using async, we recommend overriding the default implementation and providing a native async implementation.
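The relationship between these four methods can be sketched with a simplified stand-in base class. This is not LangChain's actual source: SketchLoader and NumbersLoader are hypothetical names, items are plain strings instead of Document objects, and the default alazy_load here simply iterates the synchronous generator, which only approximates the delegation described above.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class SketchLoader:
    """Simplified stand-in for a loader base class (not the real BaseLoader)."""

    def lazy_load(self) -> Iterator[str]:
        raise NotImplementedError

    def load(self) -> List[str]:
        # Eager convenience wrapper: materialize everything lazy_load yields.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[str]:
        # Default async variant: delegate to the synchronous generator.
        for item in self.lazy_load():
            yield item

    async def aload(self) -> List[str]:
        # Eager async wrapper built on top of alazy_load.
        return [item async for item in self.alazy_load()]


class NumbersLoader(SketchLoader):
    """Only lazy_load is implemented; the other three methods are inherited."""

    def __init__(self, count: int) -> None:
        self.count = count

    def lazy_load(self) -> Iterator[str]:
        for i in range(self.count):
            yield f"item {i}"


loader = NumbersLoader(count=3)
eager = loader.load()
async_eager = asyncio.run(loader.aload())
print(eager)
print(async_eager)
```

In an environment that already runs an event loop, such as a notebook, `await loader.aload()` would replace the asyncio.run call.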

important

When implementing a document loader do NOT provide parameters via the lazy_load or alazy_load methods.

All configuration is expected to be passed through the initializer (__init__). This was a design choice made by LangChain to make sure that once a document loader has been instantiated it has all the information needed to load documents.
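A minimal sketch of that convention, using a hypothetical PrefixLoader that does not depend on LangChain: every option (lines, prefix, skip_blank are illustrative names) is fixed in the initializer, and lazy_load takes no arguments.

```python
from typing import Iterator, List


class PrefixLoader:
    """Hypothetical loader whose behavior is fully determined at construction time."""

    def __init__(
        self, lines: List[str], prefix: str = "", skip_blank: bool = True
    ) -> None:
        self.lines = lines
        self.prefix = prefix
        self.skip_blank = skip_blank

    def lazy_load(self) -> Iterator[str]:  # no parameters besides self
        for line in self.lines:
            if self.skip_blank and not line.strip():
                continue
            yield f"{self.prefix}{line}"


loader = PrefixLoader(["a", "", "b"], prefix="> ")
result = list(loader.lazy_load())
print(result)
```

Because the loader carries its own configuration, any code that receives it can call lazy_load() without knowing how it was set up.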

Installation

Install langchain-core and langchain_community.

%pip install -qU langchain_core langchain_community

Implementation

Let's create an example of a standard document loader that loads a file and creates a document from each line in the file.

from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

API Reference: BaseLoader | Document

Test 🧪

To test out the document loader, we need a file with some quality content.

with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")

%pip install -q aiofiles

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)


<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱
' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱
' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

## Test out the async implementation
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)


<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱
' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱
' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}

tip

load() can be helpful in an interactive environment such as a Jupyter notebook.

Avoid using it for production code since eager loading assumes that all the content can fit into memory, which is not always the case, especially for enterprise data.

loader.load()

[Document(metadata={'line_number': 0, 'source': './meow.txt'}, page_content='meow meow🐱 \n'),
 Document(metadata={'line_number': 1, 'source': './meow.txt'}, page_content=' meow meow🐱 \n'),
 Document(metadata={'line_number': 2, 'source': './meow.txt'}, page_content=' meow😻😻')]
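To see why lazy loading matters, the sketch below counts how many lines a loader actually reads when the caller only wants the first two items. It is stdlib-only: CountingLoader is a hypothetical stand-in, and plain strings are used in place of Document objects.

```python
import itertools
from typing import Iterator, List


class CountingLoader:
    """Hypothetical loader that records how many lines it has produced."""

    def __init__(self, lines: List[str]) -> None:
        self.lines = lines
        self.lines_read = 0

    def lazy_load(self) -> Iterator[str]:
        for line in self.lines:
            self.lines_read += 1
            yield line

    def load(self) -> List[str]:
        return list(self.lazy_load())


source = [f"line {i}" for i in range(1000)]

# Lazy: stop pulling from the generator after two items.
lazy = CountingLoader(source)
first_two = list(itertools.islice(lazy.lazy_load(), 2))
lazy_count = lazy.lines_read

# Eager: everything is materialized before slicing.
eager = CountingLoader(source)
first_two_eager = eager.load()[:2]
eager_count = eager.lines_read

print(first_two, lazy_count, eager_count)
```

Both approaches return the same two items, but the eager path processes all 1000 lines first, which is the cost the tip above warns about.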

Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
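A rough stdlib-only illustration of that decoupling: a single parsing function works on any binary stream, whether the bytes came from memory or from a file on disk. The function name parse_lines and the temporary file are illustrative assumptions, not part of LangChain.

```python
import io
import os
import tempfile
from typing import BinaryIO, Iterator, Tuple


def parse_lines(stream: BinaryIO) -> Iterator[Tuple[int, str]]:
    """Parsing logic: decode a binary stream into (line_number, text) pairs."""
    for line_number, raw in enumerate(stream, start=1):
        yield line_number, raw.decode("utf-8").rstrip("\n")


# Loading strategy 1: bytes that already live in memory.
from_memory = list(parse_lines(io.BytesIO(b"alpha\nbeta")))

# Loading strategy 2: bytes read from a file on disk.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\nbeta")
    tmp_path = tmp.name
try:
    with open(tmp_path, "rb") as f:
        from_file = list(parse_lines(f))
finally:
    os.remove(tmp_path)

print(from_memory)
print(from_file)
```

The parser never learns where the bytes came from, so adding a new storage location does not require touching the parsing code.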

BaseBlobParser

A BaseBlobParser is an interface that accepts a blob and outputs a list of Document objects. A blob is a representation of data that lives either in memory or in a file. LangChain Python has a Blob primitive which is inspired by the Blob WebAPI spec.

from langchain_core.document_loaders import BaseBlobParser, Blob


class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )

API Reference: BaseBlobParser | Blob

blob = Blob.from_path("./meow.txt")
parser = MyParser()

list(parser.lazy_parse(blob))

[Document(metadata={'line_number': 1, 'source': './meow.txt'}, page_content='meow meow🐱 \n'),
 Document(metadata={'line_number': 2, 'source': './meow.txt'}, page_content=' meow meow🐱 \n'),
 Document(metadata={'line_number': 3, 'source': './meow.txt'}, page_content=' meow😻😻')]

Using the blob API also allows one to load content directly from memory without having to read it from a file!

blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))

[Document(metadata={'line_number': 1, 'source': None}, page_content='some data from memory\n'),
 Document(metadata={'line_number': 2, 'source': None}, page_content='meow')]

Blob

Let's take a quick look through some of the Blob API.

blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})

blob.encoding

'utf-8'

blob.as_bytes()

b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'

blob.as_string()

'meow meow🐱 \n meow meow🐱 \n meow😻😻'

blob.as_bytes_io()

<contextlib._GeneratorContextManager at 0x74b8d42e9940>

blob.metadata

{'foo': 'bar'}

blob.source

'./meow.txt'
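As a mental model only, the accessors above can be approximated with a small dataclass. MiniBlob is a hypothetical, simplified stand-in that holds in-memory bytes; it is not the real Blob implementation. Its as_bytes_io is written as a context manager, which is consistent with the _GeneratorContextManager output shown above.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class MiniBlob:
    """Hypothetical, simplified stand-in for Blob (in-memory data only)."""

    data: bytes
    encoding: str = "utf-8"
    source: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def as_bytes(self) -> bytes:
        return self.data

    def as_string(self) -> str:
        return self.data.decode(self.encoding)

    @contextlib.contextmanager
    def as_bytes_io(self) -> Iterator[io.BytesIO]:
        # Must be used in a `with` statement to obtain the stream.
        yield io.BytesIO(self.data)


blob = MiniBlob(data=b"meow\npurr", metadata={"foo": "bar"})
with blob.as_bytes_io() as f:
    lines = [line.decode(blob.encoding) for line in f]

print(blob.as_string())
print(lines)
```

Exposing the same data as bytes, as a string, and as a stream lets a parser pick whichever form suits the format it handles.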

Blob Loaders

While a parser encapsulates the logic needed to parse binary data into documents, blob loaders encapsulate the logic that's necessary to load blobs from a given storage location.

At the moment, LangChain supports FileSystemBlobLoader and CloudBlobLoader.

You can use the FileSystemBlobLoader to load blobs and then use the parser to parse them.

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

filesystem_blob_loader = FileSystemBlobLoader(
    path=".", glob="*.mdx", show_progress=True
)

API Reference: FileSystemBlobLoader

%pip install -q tqdm

parser = MyParser()
for blob in filesystem_blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break

Or, you can use CloudBlobLoader to load blobs from a cloud storage location (supports s3://, az://, gs://, file:// schemes).

%pip install -q 'cloudpathlib[s3]'

from cloudpathlib import S3Client, S3Path
from langchain_community.document_loaders.blob_loaders import CloudBlobLoader

client = S3Client(no_sign_request=True)
client.set_as_default_client()

path = S3Path(
    "s3://bucket-01", client=client
)  # Supports s3://, az://, gs://, file:// schemes.

cloud_loader = CloudBlobLoader(path, glob="**/*.pdf", show_progress=True)

for blob in cloud_loader.yield_blobs():
    print(blob)

API Reference: CloudBlobLoader
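To make the blob loader idea concrete, here is a stdlib-only sketch of a loader that walks a directory and yields the raw bytes of each matching file. MiniFileSystemBlobLoader is a hypothetical stand-in that yields (source, bytes) tuples instead of Blob objects; only the yield_blobs method name mirrors the loaders above.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


class MiniFileSystemBlobLoader:
    """Hypothetical stand-in: yields (source, bytes) pairs for files matching a glob."""

    def __init__(self, path: str, glob: str = "**/*") -> None:
        self.path = Path(path)
        self.glob = glob

    def yield_blobs(self) -> Iterator[Tuple[str, bytes]]:
        # Sort for a deterministic order; skip directories.
        for file_path in sorted(self.path.glob(self.glob)):
            if file_path.is_file():
                yield str(file_path), file_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "a.mdx").write_bytes(b"# Title\n")
    (Path(tmp_dir) / "b.txt").write_bytes(b"ignored\n")
    loader = MiniFileSystemBlobLoader(tmp_dir, glob="*.mdx")
    blobs = list(loader.yield_blobs())

names = [Path(source).name for source, _ in blobs]
print(names)
```

The loader knows nothing about file formats; it only decides which files to hand over, leaving interpretation to a parser.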

Generic Loader

LangChain has a GenericLoader abstraction which composes a BlobLoader with a BaseBlobParser.

GenericLoader is meant to provide standardized classmethods that make it easy to use existing BlobLoader implementations. At the moment, the FileSystemBlobLoader and CloudBlobLoader are supported. See example below:

from langchain_community.document_loaders.generic import GenericLoader

generic_loader_filesystem = GenericLoader(
    blob_loader=filesystem_blob_loader, blob_parser=parser
)
for idx, doc in enumerate(generic_loader_filesystem.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

API Reference: GenericLoader

100%|██████████| 7/7 [00:00<00:00, 1224.82it/s]
page_content='# Text embedding models
' metadata={'line_number': 1, 'source': 'embed_text.mdx'}
page_content='
' metadata={'line_number': 2, 'source': 'embed_text.mdx'}
page_content=':::info
' metadata={'line_number': 3, 'source': 'embed_text.mdx'}
page_content='Head to [Integrations](/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.
' metadata={'line_number': 4, 'source': 'embed_text.mdx'}
page_content=':::
' metadata={'line_number': 5, 'source': 'embed_text.mdx'}
... output truncated for demo purposes
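The composition itself is small. The following stdlib-only sketch shows the general shape under simplifying assumptions: MiniGenericLoader and line_parser are hypothetical, blobs are (source, bytes) tuples, and documents are plain dicts rather than Document objects.

```python
from typing import Callable, Iterable, Iterator, Tuple

RawBlob = Tuple[str, bytes]  # (source, data): stand-in for a Blob


def line_parser(blob: RawBlob) -> Iterator[dict]:
    """Parser side: turn one blob into one dict per line."""
    source, data = blob
    for line_number, line in enumerate(data.decode("utf-8").splitlines(), start=1):
        yield {
            "page_content": line,
            "metadata": {"line_number": line_number, "source": source},
        }


class MiniGenericLoader:
    """Hypothetical composition of a blob source with a parser."""

    def __init__(
        self,
        blobs: Iterable[RawBlob],
        parser: Callable[[RawBlob], Iterator[dict]],
    ) -> None:
        self.blobs = blobs
        self.parser = parser

    def lazy_load(self) -> Iterator[dict]:
        # For each blob, stream out every document the parser produces.
        for blob in self.blobs:
            yield from self.parser(blob)


loader = MiniGenericLoader(
    blobs=[("a.txt", b"one\ntwo"), ("b.txt", b"three")],
    parser=line_parser,
)
docs = list(loader.lazy_load())
print(docs)
```

Swapping either argument changes where the data comes from or how it is parsed without affecting the other half.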

Custom Generic Loader

If you really like creating classes, you can sub-class and create a class to encapsulate the logic together.

You can sub-class from GenericLoader to load content using an existing loader.

from typing import Any


class MyCustomLoader(GenericLoader):
    @staticmethod
    def get_parser(**kwargs: Any) -> BaseBlobParser:
        """Override this method to associate a default parser with the class."""
        return MyParser()

loader = MyCustomLoader.from_filesystem(path=".", glob="*.mdx", show_progress=True)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")

100%|██████████| 7/7 [00:00<00:00, 814.86it/s]
page_content='# Text embedding models
' metadata={'line_number': 1, 'source': 'embed_text.mdx'}
page_content='
' metadata={'line_number': 2, 'source': 'embed_text.mdx'}
page_content=':::info
' metadata={'line_number': 3, 'source': 'embed_text.mdx'}
page_content='Head to [Integrations](/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.
' metadata={'line_number': 4, 'source': 'embed_text.mdx'}
page_content=':::
' metadata={'line_number': 5, 'source': 'embed_text.mdx'}
... output truncated for demo purposes
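One plausible way to picture the override hook used above is a classmethod constructor that asks the class for its default parser. The sketch below is a stdlib-only illustration of that pattern, not GenericLoader's actual code: MiniLoader, ShoutingLoader, from_texts, and upper_parser are all hypothetical names.

```python
from typing import Callable, Iterator, List

Parser = Callable[[str], Iterator[str]]


def upper_parser(text: str) -> Iterator[str]:
    """Hypothetical parser that upper-cases its input."""
    yield text.upper()


class MiniLoader:
    """Hypothetical base class with an overridable default-parser hook."""

    def __init__(self, texts: List[str], parser: Parser) -> None:
        self.texts = texts
        self.parser = parser

    @staticmethod
    def get_parser() -> Parser:
        raise NotImplementedError("Subclasses supply a default parser.")

    @classmethod
    def from_texts(cls, texts: List[str]) -> "MiniLoader":
        # The classmethod asks the (possibly overridden) hook for a parser.
        return cls(texts, cls.get_parser())

    def lazy_load(self) -> Iterator[str]:
        for text in self.texts:
            yield from self.parser(text)


class ShoutingLoader(MiniLoader):
    @staticmethod
    def get_parser() -> Parser:
        return upper_parser


loader = ShoutingLoader.from_texts(["meow", "purr"])
result = list(loader.lazy_load())
print(result)
```

The subclass only states which parser it prefers; construction and iteration stay in the base class.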
