Sitemap
Extending from the `WebBaseLoader`, `SitemapLoader` loads a sitemap from a given URL, and then scrapes and loads all pages in the sitemap, returning each page as a Document.
The scraping is done concurrently, with a reasonable limit on concurrent requests defaulting to 2 per second. If you aren't concerned about being a good citizen, control the server being scraped, or don't care about load, you can increase this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
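If you do decide to raise the limit, a minimal sketch is below, assuming `requests_per_second` passed to the constructor is forwarded through to the underlying `WebBaseLoader` (the equivalent attribute form is shown later on this page):

```python
from langchain_community.document_loaders.sitemap import SitemapLoader

# Sketch: raise the concurrency cap at construction time.
loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    requests_per_second=5,  # default is 2; higher values risk being blocked
)
```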
Overview
Integration details
Class | Package | Local | Serializable | JS support
---|---|---|---|---
SitemapLoader | langchain_community | ✅ | ❌ | ✅
Loader features
Source | Document Lazy Loading | Native Async Support
---|---|---
SitemapLoader | ✅ | ❌
Setup
To access the `SitemapLoader` document loader you'll need to install the `langchain-community` integration package.
Credentials
No credentials are needed to run this.
To enable automated tracing of your model calls, set your LangSmith API key:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"
Installation
Install `langchain-community`.
%pip install -qU langchain-community
Fix notebook asyncio bug
import nest_asyncio

# Patch the running event loop so the loader's asyncio-based scraping
# works inside Jupyter, which already runs its own loop.
nest_asyncio.apply()
Initialization
Now we can instantiate our loader object and load documents:
from langchain_community.document_loaders.sitemap import SitemapLoader

sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")
Load
docs = sitemap_loader.load()
docs[0]
Fetching pages: 100%|##########| 28/28 [00:04<00:00, 6.83it/s]
Document(metadata={'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}, page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n')
print(docs[0].metadata)
{'source': 'https://api.python.langchain.com/en/stable/', 'loc': 'https://api.python.langchain.com/en/stable/', 'lastmod': '2024-05-15T00:29:42.163001+00:00', 'changefreq': 'weekly', 'priority': '1'}
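Each Document carries the sitemap's per-URL metadata (`loc`, `lastmod`, `changefreq`, `priority`), so you can post-filter loaded pages without re-scraping. A small sketch, assuming every entry carries an ISO-8601 `lastmod` string like the output above:

```python
from datetime import datetime, timezone

# Sketch: keep only pages modified since 2024.
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
recent = [
    doc for doc in docs if datetime.fromisoformat(doc.metadata["lastmod"]) >= cutoff
]
print(len(recent))
```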
You can change the `requests_per_second` parameter to increase the max concurrent requests, and use `requests_kwargs` to pass kwargs when sending requests.
sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` issue
sitemap_loader.requests_kwargs = {"verify": False}
Lazy Load
You can also load the pages lazily in order to minimize the memory footprint.
page = []
for doc in sitemap_loader.lazy_load():
    page.append(doc)
    if len(page) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)
        page = []
Fetching pages: 100%|##########| 28/28 [00:01<00:00, 19.06it/s]
Filtering sitemap URLs
Sitemaps can be massive files, with thousands of URLs. Often you don't need every single one of them. You can filter the URLs by passing a list of strings or regex patterns to the `filter_urls` parameter. Only URLs that match one of the patterns will be loaded.
loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()
documents[0]
Document(page_content='\n\n\n\n\n\n\n\n\n\nLangChain Python API Reference Documentation.\n\n\nYou will be automatically redirected to the new location of this page.\n\n', metadata={'source': 'https://api.python.langchain.com/en/latest/', 'loc': 'https://api.python.langchain.com/en/latest/', 'lastmod': '2024-02-12T05:26:10.971077+00:00', 'changefreq': 'daily', 'priority': '0.9'})
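Because the entries are treated as regex patterns, you can also match more precisely than a bare URL prefix. A sketch; the pattern below is purely illustrative:

```python
loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    # Illustrative regex: only /en/latest/ pages whose path starts with "core".
    filter_urls=[r"https://api\.python\.langchain\.com/en/latest/core.*"],
)
```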
Add custom scraping rules
The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.
The following example shows how to develop and use a custom function to avoid navigation and header elements.
Import the `beautifulsoup4` library and define the custom function.
pip install beautifulsoup4
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())
Add your custom function to the `SitemapLoader` object.
loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)
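Calling `load()` as before applies the custom function to each scraped page; the short sketch below just checks the result:

```python
docs = loader.load()

# The nav/header text should no longer appear in the page content.
print(docs[0].page_content[:200])
```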
Local Sitemap
The sitemap loader can also be used to load local files.
sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()
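If you don't have a local sitemap on hand, here is a minimal, self-contained sketch: write a tiny sitemap file and point the loader at it. The path and URL are illustrative, and note that with `is_local=True` the loader reads the sitemap itself from disk but still fetches the pages it lists over the network:

```python
from pathlib import Path

from langchain_community.document_loaders.sitemap import SitemapLoader

# Illustrative file path and contents; any standard sitemap XML works.
Path("example_data").mkdir(exist_ok=True)
Path("example_data/sitemap.xml").write_text(
    """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://python.langchain.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>"""
)

loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)
docs = loader.load()
```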
API reference
For detailed documentation of all SitemapLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.sitemap.SitemapLoader.html#langchain_community.document_loaders.sitemap.SitemapLoader
Related
- Document loader conceptual guide
- Document loader how-to guides