
How to load Markdown

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

Here we cover how to load Markdown documents into LangChain Document objects that we can use downstream.

We will cover:

  • Basic usage;
  • Parsing of Markdown into elements such as titles, list items, and text.

LangChain implements an UnstructuredMarkdownLoader object, which requires the Unstructured package. First we install it:

%pip install "unstructured[md]" nltk

Basic usage loads a Markdown file into a single document. Here we demonstrate on LangChain's README:

from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document

markdown_path = "../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

# In the default mode the whole file comes back as one Document.
data = loader.load()
assert len(data) == 1
assert isinstance(data[0], Document)

# Preview the first 250 characters of the loaded text.
readme_content = data[0].page_content
print(readme_content[:250])
🦜️🔗 LangChain

⚡ Build context-aware reasoning applications ⚡

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith.
LangSmith is a unified developer platform for building,

Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying mode="elements".

loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")

data = loader.load()
print(f"Number of documents: {len(data)}\n")

for document in data[:2]:
    print(f"{document}\n")
Number of documents: 66

page_content='🦜️🔗 LangChain' metadata={'source': '../../../README.md', 'category_depth': 0, 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'Title'}

page_content='⚡ Build context-aware reasoning applications ⚡' metadata={'source': '../../../README.md', 'last_modified': '2024-06-28T15:20:01', 'languages': ['eng'], 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'filetype': 'text/markdown', 'file_directory': '../../..', 'filename': 'README.md', 'category': 'NarrativeText'}
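
To see how the two modes relate, the sketch below rejoins element-level documents into a single string. It assumes the combined text separates elements with a blank line, which is what the README preview printed earlier suggests. The helper `merge_elements` and the sample objects are hypothetical stand-ins, so the snippet runs without a Markdown file or the Unstructured package.

```python
from types import SimpleNamespace


def merge_elements(documents, separator="\n\n"):
    """Join the page_content of element-level documents into one string.

    The blank-line separator is an assumption, not a documented guarantee.
    """
    return separator.join(doc.page_content for doc in documents)


# Hypothetical stand-ins for documents loaded with mode="elements";
# only the page_content attribute is needed here.
elements = [
    SimpleNamespace(page_content="🦜️🔗 LangChain"),
    SimpleNamespace(page_content="⚡ Build context-aware reasoning applications ⚡"),
]

merged = merge_elements(elements)
print(merged)
```

In practice you would pass the list returned by `loader.load()` instead of the stand-ins.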

Note that for this README we recover three distinct element types:

print(set(document.metadata["category"] for document in data))
{'ListItem', 'NarrativeText', 'Title'}
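
Once elements carry a category, a common next step is to bucket them by type, for example to pull out only the titles. The following is a minimal sketch rather than part of the loader's API: `group_by_category` is a hypothetical helper, and the sample objects only mimic the `page_content` and `metadata` attributes of a Document so the snippet runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(documents):
    """Map each element category to the list of texts in that category."""
    groups = defaultdict(list)
    for doc in documents:
        # Fall back to "Unknown" if an element carries no category.
        category = doc.metadata.get("category", "Unknown")
        groups[category].append(doc.page_content)
    return dict(groups)


# Hypothetical stand-ins for documents loaded with mode="elements".
sample = [
    SimpleNamespace(page_content="🦜️🔗 LangChain", metadata={"category": "Title"}),
    SimpleNamespace(
        page_content="⚡ Build context-aware reasoning applications ⚡",
        metadata={"category": "NarrativeText"},
    ),
    SimpleNamespace(page_content="Another heading", metadata={"category": "Title"}),
]

grouped = group_by_category(sample)
print(grouped["Title"])
```

With real data, `group_by_category(data)` would return one list per category found in the file.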
