How to load Markdown
Markdown is a lightweight markup language for creating formatted text using a plain-text editor.
Here we cover how to loadMarkdown
documents into LangChainDocument objects that we can use downstream.
We will cover:
- Basic usage;
- Parsing of Markdown into elements such as titles, list items, and text.
LangChain implements anUnstructuredMarkdownLoader object which requires theUnstructured package. First we install it:
%pip install"unstructured[md]" nltk
Basic usage will ingest a Markdown file to a single document. Here we demonstrate on LangChain's readme:
from langchain_community.document_loadersimport UnstructuredMarkdownLoader
from langchain_core.documentsimport Document
markdown_path="../../../README.md"
loader= UnstructuredMarkdownLoader(markdown_path)
data= loader.load()
assertlen(data)==1
assertisinstance(data[0], Document)
readme_content= data[0].page_content
print(readme_content[:250])
API Reference:UnstructuredMarkdownLoader |Document
🦜️🔗 LangChain
⚡ Build context-aware reasoning applications ⚡
Looking for the JS/TS library? Check out LangChain.js.
To help you ship LangChain apps to production faster, check out LangSmith.
LangSmith is a unified developer platform for building,
Retain Elements
Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifyingmode="elements"
.
loader= UnstructuredMarkdownLoader(markdown_path, mode="elements")
data= loader.load()
print(f"Number of documents:{len(data)}\n")
for documentin data[:2]:
print(f"{document}\n")
Number of documents: 66
page_content='🦜️🔗 LangChain' metadata={'source': '../../../README.md', 'category_depth': 0, 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'Title'}
page_content='⚡ Build context-aware reasoning applications ⚡' metadata={'source': '../../../README.md', 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'NarrativeText'}
Note that in this case we recover three distinct element types:
print(set(document.metadata["category"]for documentin data))
{'ListItem', 'NarrativeText', 'Title'}