How to load HTML
The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.
This guide covers how to load HTML documents into LangChain Document objects that we can use downstream.
Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, both of which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.
Loading HTML with Unstructured
%pip install unstructured
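from langchain_community.document_loaders import UnstructuredHTMLLoader

# Path to the HTML file to load (relative to the notebook's location)
file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"
loader = UnstructuredHTMLLoader(file_path)
# load() returns a list of Document objects
data = loader.load()
print(data)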
API Reference: UnstructuredHTMLLoader
[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'})]
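The file path above is relative to the LangChain repository, so it will not resolve elsewhere. To try the loaders on your own machine, one option is to write a small HTML file first. The sketch below does that using only the standard library. The markup is an assumption reconstructed from the output shown in this guide, not the actual contents of fake-content.html, and `write_sample_html` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

# Assumed markup: chosen so that it contains the title, heading and
# paragraph text seen in the loader output. The real file may differ.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample_html(directory: str) -> str:
    """Hypothetical helper: write the sample page and return its path."""
    path = Path(directory) / "fake-content.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return str(path)


# Write into a fresh temporary directory so nothing existing is overwritten.
tmp_dir = tempfile.mkdtemp()
file_path = write_sample_html(tmp_dir)
print(file_path)
```

The resulting `file_path` can then be passed to a loader in place of the repository-relative path.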
Loading HTML with BeautifulSoup4
We can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader. This will extract the text from the HTML into page_content, and the page title as title into metadata.
%pip install bs4
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()
print(data)
API Reference: BSHTMLLoader
[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]
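To make the split between page_content and metadata concrete, here is a rough standard-library sketch of the same idea: collect all text from the page and record the contents of the title tag separately. It is not how BSHTMLLoader is implemented. It uses Python's `html.parser` rather than BeautifulSoup4, returns a plain dict rather than a Document, and `TextAndTitleParser` and `html_to_record` are hypothetical names.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Hypothetical helper: collects all text plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Every text node goes into the page text; text inside <title>
        # is additionally tracked for the metadata.
        self.text_parts.append(data)
        if self._in_title:
            self.title_parts.append(data)


def html_to_record(html, source):
    """Return a dict shaped like the loader output (an illustration only)."""
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    # Drop whitespace-only fragments and join the rest line by line.
    lines = [part.strip() for part in parser.text_parts if part.strip()]
    return {
        "page_content": "\n".join(lines),
        "metadata": {
            "source": source,
            "title": "".join(parser.title_parts).strip(),
        },
    }


record = html_to_record(
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>",
    source="example.html",
)
print(record["metadata"])
print(record["page_content"])
```

As in the BSHTMLLoader output, the title text appears both in the page text and as a separate metadata entry. Whitespace handling here is a simplification and will not match the loader's output exactly.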