JSONLoader
This notebook provides a quick overview for getting started with the JSON document loader. For detailed documentation of all JSONLoader features and configurations, head to the API reference.
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
JSONLoader | langchain_community | ✅ | ❌ | ✅ |
Loader features
Source | Document Lazy Loading | Native Async Support |
---|---|---|
JSONLoader | ✅ | ❌ |
Setup
To access the JSON document loader you'll need to install the langchain-community integration package as well as the jq Python package.
Credentials
No credentials are required to use the JSONLoader class.
To enable automated tracing of your model calls, set your LangSmith API key:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
Installation
Install langchain_community and jq:
%pip install -qU langchain_community jq
Initialization
Now we can instantiate our loader object and load documents:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./example_data/facebook_chat.json",
    jq_schema=".messages[].content",
    text_content=False,
)
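To make the jq_schema more concrete, the following is a minimal pure-Python sketch of what the expression .messages[].content selects. The sample data is hypothetical (only its shape is assumed to match facebook_chat.json), and the list comprehension merely mimics the jq expression; it is not how JSONLoader evaluates the schema, which relies on the jq package.

```python
import json

# Hypothetical sample mirroring the shape the jq schema expects;
# the real facebook_chat.json is not reproduced here.
raw = """
{
  "messages": [
    {"sender_name": "User 2", "content": "Bye!"},
    {"sender_name": "User 1", "content": "Oh no worries! Bye"}
  ]
}
"""

data = json.loads(raw)

# Pure-Python equivalent of the jq expression `.messages[].content`:
# iterate over every element of "messages" and take its "content" field.
contents = [message["content"] for message in data["messages"]]

# Each selected value becomes one document; the example output in this
# notebook shows seq_num starting at 1, so we number from 1 here too.
for seq_num, text in enumerate(contents, start=1):
    print(seq_num, text)
```

Each value the schema yields corresponds to one Document, which is why a schema ending in .content produces one document per message.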
Load
docs = loader.load()
docs[0]
Document(metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}, page_content='Bye!')
print(docs[0].metadata)
{'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}
Lazy Load
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(pages)
        pages = []
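The loop above only processes full pages of ten documents, so a final partial page would be left unhandled. One way to cover that case is a small batching helper, sketched below. The names batched and fake_lazy_load are hypothetical; fake_lazy_load is a stand-in for loader.lazy_load() so the example runs without any data files.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(n: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields placeholder strings."""
    for i in range(n):
        yield f"doc-{i}"


# With 25 items and a page size of 10, the last batch holds the remaining 5.
batch_sizes = [len(batch) for batch in batched(fake_lazy_load(25), 10)]
print(batch_sizes)
```

With a real loader, the same helper could be called as batched(loader.lazy_load(), 10), with the paged operation applied to each yielded list.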
Read from JSON Lines file
If you want to load documents from a JSON Lines file, pass json_lines=True and specify a jq_schema that extracts page_content from a single JSON object.
loader = JSONLoader(
    file_path="./example_data/facebook_chat_messages.jsonl",
    jq_schema=".content",
    text_content=False,
    json_lines=True,
)

docs = loader.load()
print(docs[0])
page_content='Bye!' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}
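As a rough illustration of why the schema targets a single object, the sketch below parses a JSON Lines payload one line at a time and applies the equivalent of .content to each record. The payload is hypothetical, and the plain dictionary lookup only imitates the jq expression; it is not JSONLoader's implementation.

```python
import json

# Hypothetical two-line JSON Lines payload; each line is a complete JSON object.
jsonl_text = (
    '{"sender_name": "User 2", "content": "Bye!"}\n'
    '{"sender_name": "User 1", "content": "See you later"}\n'
)

# Parse every non-empty line independently, as a JSON Lines reader would.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Equivalent of jq_schema=".content" applied to each line's object.
page_contents = [record["content"] for record in records]
print(page_contents)
```

Because every line is its own document root, there is no enclosing messages array to iterate over, which is why the schema is .content instead of .messages[].content.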
Read specific content keys
Another option is to set jq_schema='.' and provide a content_key in order to only load specific content:
loader = JSONLoader(
    file_path="./example_data/facebook_chat_messages.jsonl",
    jq_schema=".",
    content_key="sender_name",
    json_lines=True,
)

docs = loader.load()
print(docs[0])
page_content='User 2' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}
JSON file with jq schema content_key
To load documents from a JSON file using a content_key that is itself a jq expression, set is_content_key_jq_parsable=True. Ensure that content_key is a valid jq expression that can be applied to each record selected by jq_schema.
loader = JSONLoader(
    file_path="./example_data/facebook_chat.json",
    jq_schema=".messages[]",
    content_key=".content",
    is_content_key_jq_parsable=True,
)

docs = loader.load()
print(docs[0])
page_content='Bye!' metadata={'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1}
Extracting metadata
Generally, we want to include metadata available in the JSON file in the documents that we create from the content.
The following demonstrates how metadata can be extracted using the JSONLoader.
There are some key changes to note. In the initial example, where we didn't collect the metadata, we could specify directly in the schema where the value for page_content is extracted from.
In this example, we have to tell the loader to iterate over the records in the messages field. The jq_schema then has to be .messages[].
This allows us to pass each record (a dict) into the metadata_func, which we have to implement. The metadata_func is responsible for identifying which pieces of information in the record should be included in the metadata stored in the final Document object.
Additionally, we now have to explicitly specify in the loader, via the content_key argument, the key in the record from which the value for page_content is extracted.
# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")

    return metadata


loader = JSONLoader(
    file_path="./example_data/facebook_chat.json",
    jq_schema=".messages[]",
    content_key="content",
    metadata_func=metadata_func,
)

docs = loader.load()
print(docs[0].metadata)
{'source': '/Users/isaachershenson/Documents/langchain/docs/docs/integrations/document_loaders/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}
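Since metadata_func is an ordinary function, it can be exercised on its own before it is wired into a loader. The sketch below calls it with a hypothetical record and a hypothetical base metadata dict; the assumption that the loader supplies source and seq_num in that dict is inferred from the example output above, and the shortened source path is illustrative only.

```python
from typing import Any, Dict


def metadata_func(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy selected fields from the record into the document metadata."""
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


# Hypothetical record and base metadata, shaped like the example output above.
record = {"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}
base_metadata = {"source": "example_data/facebook_chat.json", "seq_num": 1}

# Pass a copy so the base dict is left untouched by the in-place updates.
result = metadata_func(record, dict(base_metadata))
print(result)

# Keys missing from the record come back as None because of dict.get.
sparse = metadata_func({"content": "Hi"}, {"seq_num": 2})
print(sparse)
```

Using record.get rather than indexing means a record that lacks a field yields None for that metadata entry instead of raising a KeyError.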
API reference
For detailed documentation of all JSONLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html
Related
- Document loader conceptual guide
- Document loader how-to guides