Async Chromium
Chromium is one of the browsers supported by Playwright, a library used to control browser automation.
By runningp.chromium.launch(headless=True)
, we are launching a headless instance of Chromium.
Headless mode means that the browser is running without a graphical user interface.
In the below example we'll use theAsyncChromiumLoader
to load the page, and then theHtml2TextTransformer
to strip out the HTML tags and other semantic information.
%pip install--upgrade--quiet playwright beautifulsoup4 html2text
!playwright install
Note: If you are using Jupyter notebooks, you might also need to install and applynest_asyncio
before loading the documents like this:
!pip install nest-asyncio
import nest_asyncio
nest_asyncio.apply()
from langchain_community.document_loadersimport AsyncChromiumLoader
urls=["https://docs.smith.langchain.com/"]
loader= AsyncChromiumLoader(urls, user_agent="MyAppUserAgent")
docs= loader.load()
docs[0].page_content[0:100]
API Reference:AsyncChromiumLoader
'<!DOCTYPE html><html lang="en" dir="ltr" class="docs-wrapper docs-doc-page docs-version-2.0 plugin-d'
Now let's transform the documents into a more readable format using the transformer:
from langchain_community.document_transformersimport Html2TextTransformer
html2text= Html2TextTransformer()
docs_transformed= html2text.transform_documents(docs)
docs_transformed[0].page_content[0:500]
API Reference:Html2TextTransformer
'Skip to main content\n\nGo to API Docs\n\nSearch`⌘``K`\n\nGo to App\n\n * Quick start\n * Tutorials\n\n * How-to guides\n\n * Concepts\n\n * Reference\n\n * Pricing\n * Self-hosting\n\n * LangGraph Cloud\n\n * * Quick start\n\nOn this page\n\n# Get started with LangSmith\n\n**LangSmith** is a platform for building production-grade LLM applications. It\nallows you to closely monitor and evaluate your application, so you can ship\nquickly and with confidence. Use of LangChain is not necessary - LangSmith\nworks on it'
Related
- Document loaderconceptual guide
- Document loaderhow-to guides