Scrapfly Python SDK for headless browsers and proxy rotation
```shell
pip install scrapfly-sdk
```
You can also install extra dependencies:

- `pip install "scrapfly-sdk[speedups]"` for performance improvements
- `pip install "scrapfly-sdk[concurrency]"` for out-of-the-box concurrency (asyncio / thread)
- `pip install "scrapfly-sdk[scrapy]"` for Scrapy integration
- `pip install "scrapfly-sdk[webhook-server]"` for a native webhook server using Flask
- `pip install "scrapfly-sdk[all]"` for everything!
To use the built-in HTML parser (via the `ScrapeApiResponse.selector` property), either `parsel` or `scrapy` must also be installed.
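The selector exposes parsel-style CSS and XPath queries on the scrape result. A minimal sketch (the URL and query are illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

# Get your API key from https://www.scrapfly.io/
scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Scrape a page, then query its HTML through the parsel-backed selector
api_response = scrapfly.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
print(api_response.selector.css("title::text").get())
```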
For usage references and examples, please check out the /examples folder in this repository.
This SDK covers the following Scrapfly API endpoints:

- Web Scraping API
- Extraction API
- Screenshot API
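As a rough sketch of the latter two endpoints (assuming the `ScreenshotConfig` and `ExtractionConfig` helpers shipped with recent SDK versions; the extraction prompt and parameters are illustrative):

```python
from scrapfly import ScrapflyClient, ScreenshotConfig, ExtractionConfig

scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Capture a screenshot of a rendered page
screenshot_response = scrapfly.screenshot(
    screenshot_config=ScreenshotConfig(url="https://web-scraping.dev/products")
)

# Extract data from a previously retrieved HTML document
extraction_response = scrapfly.extract(
    extraction_config=ExtractionConfig(
        body="<html>...</html>",  # the document to process
        content_type="text/html",
        extraction_prompt="List all product names",  # illustrative AI-extraction prompt
    )
)
```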
Scrapfly Python SDKs are integrated with LlamaIndex and LangChain. Both frameworks allow training Large Language Models (LLMs) using augmented context.
This augmented context is achieved by training LLMs on top of private or domain-specific data for common use cases:
- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
- Document Understanding and Extraction
- Autonomous Agents that can perform research and take actions
In the context of web scraping, web page data can be extracted as Text or Markdown using Scrapfly's format feature to train LLMs with the scraped data.
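With the SDK directly, this looks roughly like the following sketch (assuming the `format` parameter of `ScrapeConfig` mirrors the API's format feature):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Ask the API to convert the scraped page to markdown, ready for LLM ingestion
result = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    format="markdown",  # assumption: "markdown" or "text", per the format feature
))
print(result.content)
```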
Install llama-index, llama-index-readers-web, and scrapfly-sdk using pip:
```shell
pip install llama-index llama-index-readers-web scrapfly-sdk
```
Scrapfly is available at LlamaIndex as a data connector, known as a `Reader`. This reader is used to gather web page data into a `Document` representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the LlamaIndex use cases for more.
```python
import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(urls=["https://web-scraping.dev/products"])

# After creating the documents, train them with an LLM
# LlamaIndex uses OpenAI by default, other options can be found at the examples directory:
# https://docs.llamaindex.ai/en/stable/examples/llm/openai/

# Add your OpenAI key (a paid subscription must exist) from: https://platform.openai.com/api-keys/
os.environ["OPENAI_API_KEY"] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."
```
The `load_data` function accepts a ScrapeConfig object to use the desired Scrapfly API parameters:
```python
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
```
Install langchain, langchain-community, and scrapfly-sdk using pip:
```shell
pip install langchain langchain-community scrapfly-sdk
```
Scrapfly is available at LangChain as a document loader, known as a `Loader`. This loader is used to gather web page data into a `Document` representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the LangChain tutorials for further use cases.
```python
import os

from langchain import hub  # pip install langchainhub
from langchain_chroma import Chroma  # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a retriever
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."
```
To use the full Scrapfly features with LangChain, pass a ScrapeConfig object to the `ScrapflyLoader`:
```python
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
```
You can create a free account on Scrapfly to get your API Key.
The asyncio-pool dependency has been dropped.
scrapfly.concurrent_scrape is now an async generator. If the concurrency is `None` or not defined, the max concurrency allowed by your current subscription is used.
```python
async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):
    print(result)
```
The brotli argument is deprecated and will be removed in the next minor version. In most cases it provides no size benefit versus gzip and uses more CPU.
- Better error logging
- Async improvements for concurrent scraping with asyncio
- Scrapy media pipelines are now supported out of the box