How to load HTML

The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.

This guide covers how to load HTML documents into LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, both of which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.

Loading HTML with Unstructured

%pip install unstructured

from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

API Reference: UnstructuredHTMLLoader

[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]
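If the example file is not available locally, a stand-in can be written to a temporary directory and passed to the loader as file_path. The sketch below is only that: the markup in SAMPLE_HTML is an assumption reconstructed from the printed output in this guide (a title, one heading, one paragraph), not a copy of the real fake-content.html, and write_sample_html is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

# Assumed markup: reconstructed from the loader output shown in this guide,
# not copied from the real example file.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Test Title</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""


def write_sample_html(directory: str) -> str:
    """Write the stand-in HTML file into `directory` and return its path."""
    path = Path(directory) / "fake-content.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return str(path)


if __name__ == "__main__":
    # The printed path can be used as file_path with the loaders in this guide.
    sample_path = write_sample_html(tempfile.mkdtemp())
    print(sample_path)
```

Because the markup is a guess, the exact whitespace a loader returns for this file may differ from the output printed in this guide.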

Loading HTML with BeautifulSoup4

We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.

%pip install bs4

from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

API Reference: BSHTMLLoader

[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]
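To make the split between page_content and metadata more concrete, here is a rough sketch of the same idea using only Python's standard-library html.parser. It is not how BSHTMLLoader is implemented (that loader relies on BeautifulSoup4); the names TextAndTitleParser and html_to_record are hypothetical, and the result is a plain dict rather than a LangChain Document.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Collect every non-blank text node, and separately the <title> text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return  # skip whitespace-only text between tags
        self.text_parts.append(stripped)
        if self.in_title:
            self.title_parts.append(stripped)


def html_to_record(html: str, source: str) -> dict:
    """Return a dict shaped like the loader output: text plus metadata."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.text_parts),
        "metadata": {"source": source, "title": " ".join(parser.title_parts)},
    }


if __name__ == "__main__":
    html = (
        "<html><head><title>Test Title</title></head>"
        "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
    )
    record = html_to_record(html, source="fake-content.html")
    print(record["metadata"]["title"])
    print(record["page_content"])
```

Note that the title text appears in both places, just as in the BSHTMLLoader output above: it is part of the document text and is also copied into metadata. This sketch collapses whitespace between elements, so its page_content will not match the loader's output character for character.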
