
How to load HTML

The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.

This guide covers how to load HTML documents into LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.

Loading HTML with Unstructured

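%pip install unstructured
from langchain_community.document_loaders import UnstructuredHTMLLoader

# Path to the sample HTML file, relative to the notebook's location in the repository
file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

# Parse the file with Unstructured and return a list of Document objects
loader = UnstructuredHTMLLoader(file_path)
data = loader.load()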

print(data)
[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]
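The file_path above points at a sample file inside the LangChain repository, so the snippet will not run as-is elsewhere. A minimal sketch for creating a stand-in file follows. The markup in SAMPLE_HTML is an assumption inferred from the outputs shown on this page (a title, one heading, one paragraph), and write_sample_html is a hypothetical helper, not part of LangChain; the real fake-content.html may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for fake-content.html. The real file's markup may
# differ; this only mirrors the title, heading, and paragraph seen in the
# outputs on this page.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample_html(directory: str) -> str:
    """Write the stand-in HTML file into `directory` and return its path."""
    path = Path(directory) / "fake-content.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return str(path)


# Use a fresh temporary directory so nothing in the working tree is touched.
file_path = write_sample_html(tempfile.mkdtemp())
print(file_path)
```

The resulting file_path can be passed to either loader on this page in place of the repository-relative path.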

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.

%pip install bs4
from langchain_community.document_loaders import BSHTMLLoader

# Reuses the file_path defined in the Unstructured example
loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)
[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]
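To make the split between page_content and metadata concrete, here is a rough sketch of the same idea using only Python's standard-library html.parser. This is not how BSHTMLLoader is implemented; SimpleDocument, _TextAndTitleParser, and html_to_document are hypothetical names for illustration, and whitespace handling will not match BeautifulSoup's output.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextAndTitleParser(HTMLParser):
    """Collects every text node, and separately the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # All text goes into the page content, including the title text,
        # which mirrors the BSHTMLLoader output shown above.
        self.text_parts.append(data)
        if self._in_title:
            self.title_parts.append(data)


def html_to_document(html: str, source: str) -> SimpleDocument:
    """Build a document with the page text plus source/title metadata."""
    parser = _TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return SimpleDocument(
        page_content="".join(parser.text_parts),
        metadata={
            "source": source,
            "title": "".join(parser.title_parts).strip(),
        },
    )


doc = html_to_document(
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>",
    source="example.html",
)
print(doc.metadata)
print(doc.page_content)
```

As in the loader output above, the title text appears both in the page content and under the title key of the metadata.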
