How to load web pages
This guide covers how to load web pages into the LangChain Document format that we use downstream. Web pages contain text, images, and other multimedia elements, and are typically represented with HTML. They may include links to other pages or resources.
LangChain integrates with a host of parsers that are appropriate for web pages. The right parser will depend on your needs. Below we demonstrate two possibilities:
- Simple and fast parsing, in which we recover one Document per web page with its content represented as a "flattened" string;
- Advanced parsing, in which we recover multiple Document objects per page, allowing one to identify and traverse sections, links, tables, and other structures.
Setup
For the "simple and fast" parsing, we will need langchain-community and the beautifulsoup4 library:
%pip install -qU langchain-community beautifulsoup4
For advanced parsing, we will use langchain-unstructured:
%pip install -qU langchain-unstructured
Simple and fast text extraction
If you are looking for a simple string representation of text that is embedded in a web page, the method below is appropriate. It will return a list of Document objects -- one per page -- containing a single string of the page's text. Under the hood it uses the beautifulsoup4 Python library.
LangChain document loaders implement lazy_load and its async variant, alazy_load, which return iterators of Document objects. We will use these below.
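The cells in this guide call `async for` at the top level, which works in a Jupyter/IPython session but not in a plain Python script. As a minimal sketch of the two iteration styles, the block below uses a hypothetical `ToyLoader` and a stand-in `Document` dataclass (neither is part of LangChain) so that it runs without network access:

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Iterator


@dataclass
class Document:
    """Stand-in for LangChain's Document (assumption: only these two fields matter here)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader that mimics the lazy_load / alazy_load pair."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages  # maps URL -> already-fetched page text

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document at a time instead of building a full list up front.
        for url, text in self.pages.items():
            yield Document(page_content=text, metadata={"source": url})

    async def alazy_load(self) -> AsyncIterator[Document]:
        # Async variant: the same documents, consumed with `async for`.
        for doc in self.lazy_load():
            await asyncio.sleep(0)  # placeholder for real network I/O
            yield doc


async def collect(loader: ToyLoader) -> list[Document]:
    # In a script, `async for` has to live inside a coroutine.
    return [doc async for doc in loader.alazy_load()]


loader = ToyLoader({"https://example.com/a": "Page A", "https://example.com/b": "Page B"})
sync_docs = list(loader.lazy_load())
async_docs = asyncio.run(collect(loader))
print(len(sync_docs), len(async_docs))
```

With a real loader, the same `collect` pattern should apply: pass the loader in and run the coroutine with `asyncio.run`.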
import bs4
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"

loader = WebBaseLoader(web_paths=[page_url])
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]
USER_AGENT environment variable not set, consider setting it to identify your requests.
print(f"{doc.metadata}\n")
print(doc.page_content[:500].strip())
{'source': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'title': 'How to add memory to chatbots | \uf8ffü¶úÔ∏è\uf8ffüîó LangChain', 'description': 'A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:', 'language': 'en'}
How to add memory to chatbots | ü¶úÔ∏èüîó LangChain
Skip to main contentShare your thoughts on AI agents. Take the 3-min survey.IntegrationsAPI ReferenceMoreContributingPeopleLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1üí¨SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a Simple LLM Application with LCELBuild a Query Analysis SystemBuild a ChatbotConversational RAGBuild an Extraction ChainBuild an AgentTaggingd
This is essentially a dump of the text from the page's HTML. It may contain extraneous information like headings and navigation bars. If you are familiar with the expected HTML, you can specify desired <div> classes and other parameters via BeautifulSoup. Below we parse only the body text of the article:
loader = WebBaseLoader(
    web_paths=[page_url],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="theme-doc-markdown markdown"),
    },
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]
print(f"{doc.metadata}\n")
print(doc.page_content[:500])
{'source': 'https://python.langchain.com/docs/how_to/chatbots_memory/'}
How to add memory to chatbots | A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including: | Simply stuffing previous messages into a chat model prompt. | The above, but trimming old messages to reduce the amount of distracting information the model has to deal with. | More complex modifications like synthesizing summaries for long running conversations. | We'll go into more detail on a few techniq
print(doc.page_content[-500:])
a greeting. Nemo then asks the AI how it is doing, and the AI responds that it is fine.'), | HumanMessage(content='What did I say my name was?'), | AIMessage(content='You introduced yourself as Nemo. How can I assist you today, Nemo?')] | Note that invoking the chain again will generate another summary generated from the initial summary plus new messages and so on. You could also design a hybrid approach where a certain number of messages are retained in chat history while others are summarized.
Note that this required advance technical knowledge of how the body text is represented in the underlying HTML.
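To see what `parse_only` and the `get_text` arguments do in isolation, the sketch below applies them with BeautifulSoup directly to a small inline HTML string. The HTML is made up for illustration, and we assume `bs_kwargs` and `bs_get_text_kwargs` are forwarded to the `BeautifulSoup` constructor and to `get_text` respectively:

```python
import bs4

# Made-up HTML for illustration: a nav bar, an article body, and a footer.
html = """
<html><body>
  <nav class="navbar">Skip to main content</nav>
  <div class="theme-doc-markdown markdown">
    <h1>How to add memory to chatbots</h1>
    <p>A key feature of chatbots is their ability to use previous turns as context.</p>
  </div>
  <footer>Copyright</footer>
</body></html>
"""

# Only build the parse tree for elements whose class attribute matches this string.
strainer = bs4.SoupStrainer(class_="theme-doc-markdown markdown")
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=strainer)

# Join the surviving text nodes with " | " and trim whitespace around each one.
text = soup.get_text(separator=" | ", strip=True)
print(text)
```

The nav bar and footer text never reach the output because they are filtered out at parse time, not removed afterwards.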
We can parameterize WebBaseLoader with a variety of settings, allowing for specification of request headers, rate limits, and parsers and other kwargs for BeautifulSoup. See its API reference for details.
Advanced parsing
This method is appropriate if we want more granular control or processing of the page content. Below, instead of generating one Document per page and controlling its content via BeautifulSoup, we generate multiple Document objects representing distinct structures on a page. These structures can include section titles and their corresponding body texts, lists or enumerations, tables, and more.
Under the hood it uses the langchain-unstructured library. See the integration docs for more information about using Unstructured with LangChain.
from langchain_unstructured import UnstructuredLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"
loader = UnstructuredLoader(web_url=page_url)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
INFO: Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO: NumExpr defaulting to 8 threads.
Note that with no advance knowledge of the page HTML structure, we recover a natural organization of the body text:
for doc in docs[:5]:
    print(doc.page_content)
How to add memory to chatbots
A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.
Extracting content from specific sections
Each Document object represents an element of the page. Its metadata contains useful information, such as its category:
for doc in docs[:5]:
    print(f"{doc.metadata['category']}: {doc.page_content}")
Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use content of previous conversation turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
ListItem: The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
ListItem: More complex modifications like synthesizing summaries for long running conversations.
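The category field makes it straightforward to count or filter elements. As a rough sketch, the block below uses a stand-in `Document` dataclass (not LangChain's class) populated with the five elements printed above, so it runs without loading the page:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document, for illustration only."""

    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    Document("How to add memory to chatbots", {"category": "Title"}),
    Document(
        "A key feature of chatbots is their ability to use content of previous "
        "conversation turns as context. This state management can take several "
        "forms, including:",
        {"category": "NarrativeText"},
    ),
    Document("Simply stuffing previous messages into a chat model prompt.", {"category": "ListItem"}),
    Document(
        "The above, but trimming old messages to reduce the amount of distracting "
        "information the model has to deal with.",
        {"category": "ListItem"},
    ),
    Document(
        "More complex modifications like synthesizing summaries for long running conversations.",
        {"category": "ListItem"},
    ),
]

# Count how many elements fall into each category.
category_counts = Counter(doc.metadata["category"] for doc in docs)

# Keep only the list items, e.g. to render them as bullets downstream.
list_items = [doc.page_content for doc in docs if doc.metadata["category"] == "ListItem"]

print(dict(category_counts))
print(len(list_items))
```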
Elements may also have parent-child relationships -- for example, a paragraph might belong to a section with a title. If a section is of particular interest (e.g., for indexing) we can isolate the corresponding Document objects.
As an example, below we load the content of the "Setup" sections for two web pages:
from typing import List

from langchain_core.documents import Document


async def _get_setup_docs_from_url(url: str) -> List[Document]:
    loader = UnstructuredLoader(web_url=url)

    setup_docs = []
    parent_id = -1
    async for doc in loader.alazy_load():
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]
        if doc.metadata.get("parent_id") == parent_id:
            setup_docs.append(doc)

    return setup_docs


page_urls = [
    "https://python.langchain.com/docs/how_to/chatbots_memory/",
    "https://python.langchain.com/docs/how_to/chatbots_tools/",
]
setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)

from collections import defaultdict

setup_text = defaultdict(str)

for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)
{'https://python.langchain.com/docs/how_to/chatbots_memory/': "You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\n[33mWARNING: You are using pip version 22.0.4; however, version 23.3.2 is available.\nYou should consider upgrading via the '/Users/jacoblee/.pyenv/versions/3.10.5/bin/python -m pip install --upgrade pip' command.[0m[33m\n[0mNote: you may need to restart the kernel to use updated packages.\n",
'https://python.langchain.com/docs/how_to/chatbots_tools/': "For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.\nYou'll need to sign up for an account on the Tavily website, and install the following packages:\n%pip install --upgrade --quiet langchain-community langchain-openai tavily-python\n\n# Set env var OPENAI_API_KEY or load from a .env file:\nimport dotenv\n\ndotenv.load_dotenv()\nYou will also need your OpenAI key set as OPENAI_API_KEY and your Tavily API key set as TAVILY_API_KEY.\n"}
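The parent/child filter used above can be exercised offline. The following sketch runs the same idea over plain dictionaries; the helper `select_section`, the element IDs (`t1`, `n1`, ...), and the sample texts are hypothetical and only mimic the `category`, `element_id`, and `parent_id` metadata keys seen in this guide:

```python
def select_section(elements: list[dict], title_prefix: str) -> list[dict]:
    """Return the elements whose parent is the first-matching or latest-matching title."""
    parent_id = None
    selected = []
    for element in elements:
        meta = element["metadata"]
        if meta["category"] == "Title" and element["text"].startswith(title_prefix):
            # Remember the ID of the section title we care about.
            parent_id = meta["element_id"]
        elif parent_id is not None and meta.get("parent_id") == parent_id:
            # Any later element pointing at that title belongs to the section.
            selected.append(element)
    return selected


# Hypothetical elements in document order.
elements = [
    {"text": "How to add memory to chatbots", "metadata": {"category": "Title", "element_id": "t1"}},
    {"text": "A key feature of chatbots.", "metadata": {"category": "NarrativeText", "element_id": "n1", "parent_id": "t1"}},
    {"text": "Setup", "metadata": {"category": "Title", "element_id": "t2"}},
    {"text": "You'll need to install a few packages.", "metadata": {"category": "NarrativeText", "element_id": "n2", "parent_id": "t2"}},
    {"text": "%pip install --upgrade --quiet langchain langchain-openai", "metadata": {"category": "NarrativeText", "element_id": "n3", "parent_id": "t2"}},
    {"text": "Message passing", "metadata": {"category": "Title", "element_id": "t3"}},
    {"text": "The simplest form of memory.", "metadata": {"category": "NarrativeText", "element_id": "n4", "parent_id": "t3"}},
]

setup_elements = select_section(elements, "Setup")
setup_texts = [element["text"] for element in setup_elements]
print(setup_texts)
```

This sketch uses `None` plus an explicit guard instead of the `-1` sentinel, so that elements with no `parent_id` at all are never matched by accident.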
Vector search over page content
Once we have loaded the page contents into LangChain Document objects, we can index them (e.g., for a RAG application) in the usual way. Below we use OpenAI embeddings, although any LangChain embeddings model will suffice.
%pip install -qU langchain-openai
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(setup_docs, OpenAIEmbeddings())
retrieved_docs = vector_store.similarity_search("Install Tavily", k=2)
for doc in retrieved_docs:
    print(f"Page {doc.metadata['url']}: {doc.page_content[:300]}\n")
INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Page https://python.langchain.com/docs/how_to/chatbots_tools/: You'll need to sign up for an account on the Tavily website, and install the following packages:
Page https://python.langchain.com/docs/how_to/chatbots_tools/: For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.
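To make the retrieval step concrete without an API key, here is a toy sketch of similarity search. It swaps the OpenAI embeddings for simple word-count vectors and ranks by cosine similarity. The functions `embed`, `cosine`, and `similarity_search` are illustrative stand-ins, not how InMemoryVectorStore or OpenAIEmbeddings are implemented:

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (a stand-in for a real embedding model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, texts: list[str], k: int = 2) -> list[str]:
    # Rank all texts by similarity to the query and keep the top k.
    query_vector = embed(query)
    ranked = sorted(texts, key=lambda text: cosine(query_vector, embed(text)), reverse=True)
    return ranked[:k]


texts = [
    "You'll need to sign up for an account on the Tavily website, and install the following packages:",
    "You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:",
    "dotenv.load_dotenv()",
]

top = similarity_search("Install Tavily", texts, k=2)
for text in top:
    print(text)
```

Word-count vectors only match exact tokens, whereas a learned embedding model can also match paraphrases, which is why the guide uses one.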
Other web page loaders
For a list of available LangChain web page loaders, please see this table.