Scrapfly Python SDK for headless browsers and proxy rotation
```shell
pip install scrapfly-sdk
```
You can also install extra dependencies:

- `pip install "scrapfly-sdk[speedups]"` for performance improvements
- `pip install "scrapfly-sdk[concurrency]"` for out-of-the-box concurrency (asyncio / thread)
- `pip install "scrapfly-sdk[scrapy]"` for Scrapy integration
- `pip install "scrapfly-sdk[webhook-server]"` for a native webhook server using Flask
- `pip install "scrapfly-sdk[all]"` for everything!
To use the built-in HTML parser (via the `ScrapeApiResponse.selector` property), either `parsel` or `scrapy` must also be installed.
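The selector exposes parsel-style CSS and XPath queries on the scrape result. A minimal sketch (the URL and query are illustrative):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

# Get your API key from https://www.scrapfly.io/
scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Scrape a page, then query its HTML through the parsel-backed selector
api_response = scrapfly.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
print(api_response.selector.css("title::text").get())
```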
For usage references and examples, please check out the /examples folder in this repository.
This SDK covers the following Scrapfly API endpoints:

- Web Scraping API
- Extraction API
- Screenshot API
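As a rough sketch of the latter two endpoints (assuming the `ScreenshotConfig` and `ExtractionConfig` helpers shipped with recent SDK versions; the extraction prompt and parameters are illustrative):

```python
from scrapfly import ScrapflyClient, ScreenshotConfig, ExtractionConfig

scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Capture a screenshot of a rendered page
screenshot_response = scrapfly.screenshot(
    screenshot_config=ScreenshotConfig(url="https://web-scraping.dev/products")
)

# Extract data from a previously retrieved HTML document
extraction_response = scrapfly.extract(
    extraction_config=ExtractionConfig(
        body="<html>...</html>",  # the document to process
        content_type="text/html",
        extraction_prompt="List all product names",  # illustrative AI-extraction prompt
    )
)
```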
Scrapfly Python SDKs are integrated with LlamaIndex and LangChain. Both frameworks allow training Large Language Models (LLMs) using augmented context.
This augmented context is achieved by training LLMs on top of private or domain-specific data for common use cases:
- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
- Document Understanding and Extraction
- Autonomous Agents that can perform research and take actions
In the context of web scraping, web page data can be extracted as Text or Markdown using Scrapfly's format feature to train LLMs with the scraped data.
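With the SDK directly, this looks roughly like the following sketch (assuming the `format` parameter of `ScrapeConfig` mirrors the API's format feature):

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="Your Scrapfly API key")

# Ask the API to convert the scraped page to markdown, ready for LLM ingestion
result = scrapfly.scrape(ScrapeConfig(
    url="https://web-scraping.dev/products",
    format="markdown",  # assumption: "markdown" or "text", per the format feature
))
print(result.content)
```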
Install llama-index, llama-index-readers-web, and scrapfly-sdk using pip:
```shell
pip install llama-index llama-index-readers-web scrapfly-sdk
```
Scrapfly is available at LlamaIndex as a data connector, known as a `Reader`. This reader is used to gather web page data into a `Document` representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the LlamaIndex use cases for more.
```python
import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(urls=["https://web-scraping.dev/products"])

# After creating the documents, train them with an LLM
# LlamaIndex uses OpenAI by default, other options can be found at the examples directory:
# https://docs.llamaindex.ai/en/stable/examples/llm/openai/

# Add your OpenAI key (a paid subscription must exist) from: https://platform.openai.com/api-keys/
os.environ["OPENAI_API_KEY"] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."
```
The `load_data` function accepts a ScrapeConfig object to use the desired Scrapfly API parameters:
```python
from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
```
Install langchain, langchain-community, and scrapfly-sdk using pip:
```shell
pip install langchain langchain-community scrapfly-sdk
```
Scrapfly is available at LangChain as a document loader, known as a `Loader`. This loader is used to gather web page data into a `Document` representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the LangChain tutorials for further use cases.
```python
import os

from langchain import hub  # pip install langchainhub
from langchain_chroma import Chroma  # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a retriever
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."
```
To use the full Scrapfly features with LangChain, pass a ScrapeConfig object to the `ScrapflyLoader`:
```python
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
```
You can create a free account on Scrapfly to get your API Key.
The asyncio-pool dependency has been dropped.
scrapfly.concurrent_scrape is now an async generator. If the concurrency is `None` or not defined, the max concurrency allowed by your current subscription is used.
```python
async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):
    print(result)
```
The brotli argument is deprecated and will be removed in the next minor version. In most cases it provides no size benefit versus gzip and uses more CPU.
- Better error logging
- Async improvements for concurrent scraping with asyncio
- Scrapy media pipelines are now supported out of the box