# python-cache-tutorial

A guide to caching web scraping scripts in Python.
This article will show you how to use caching in Python for your web scraping tasks. You can read the full article on our blog, where we delve deeper into the different caching strategies.

There are different ways to implement caching in Python for different caching strategies. Here we'll see two methods of Python caching for a simple web scraping example. If you're new to web scraping, take a look at our step-by-step Python web scraping guide.
We'll use the `requests` library to make HTTP requests to a website. Install it with pip by entering the following command in your terminal:

```bash
python -m pip install requests
```
The other libraries we'll use in this project, `time` and `functools`, come natively with Python 3.11.2, so you don't have to install them.
A decorator in Python is a function that accepts another function as an argument and outputs a new function. We can alter the behavior of the original function using a decorator without changing its source code.

One common use case for decorators is to implement caching. This involves creating a dictionary to store the function's results and then saving them in the cache for future use.
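Before we get to caching, here's a minimal standalone sketch of how a decorator wraps another function. The names `log_calls` and `greet` are purely illustrative and not part of the scraping example:

```python
import functools

def log_calls(func):
    # functools.wraps preserves the wrapped function's name and docstring
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__} with arguments {args}')
        return func(*args, **kwargs)
    return wrapper

@log_calls
def greet(name):
    return f'Hello, {name}!'

print(greet('world'))  # Logs the call, then prints "Hello, world!"
```

The decorator replaces `greet` with `wrapper`, so every call passes through the extra logging step first; the caching decorators below follow the same pattern, just with a lookup instead of a log line.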
Let’s start by creating a simple function that takes a URL as a functionargument, requests that URL, and returns the response text:
```python
def get_html_data(url):
    response = requests.get(url)
    return response.text
```
Now, let's move toward creating a memoized version of this function:
```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```
The `wrapper` function determines whether the current input arguments have been previously cached and, if so, returns the previously cached result. If not, the code calls the original function and caches the result before returning it. In this case, we define a `memoize` decorator that generates a `cache` dictionary to hold the results of previous function calls.
By adding `@memoize` above the function definition, we can use the memoize decorator to enhance the `get_html_data` function. This generates a new memoized function that we've called `get_html_data_cached`. It only makes a single network request for a URL and then stores the response in the cache for further requests.
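You can verify the cache-hit behavior without making any network requests by memoizing a deliberately slow pure function. This is a standalone sketch; `slow_square` and `call_count` are illustrative names, not part of the scraping example:

```python
import time

def memoize(func):
    cache = {}
    def wrapper(*args):
        if args in cache:
            return cache[args]
        result = func(*args)
        cache[args] = result
        return result
    return wrapper

call_count = 0

@memoize
def slow_square(n):
    global call_count
    call_count += 1
    time.sleep(0.1)  # Simulate an expensive operation such as a network request
    return n * n

slow_square(4)
slow_square(4)     # Served from the cache; the function body doesn't run again
print(call_count)  # 1
```

The second call returns instantly because `(4,)` is already a key in `cache`, so the function body (and its simulated delay) runs only once.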
Let's use the `time` module to compare the execution speeds of the `get_html_data` function and the memoized `get_html_data_cached` function:
```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s what the complete code looks like:
```python
# Import the required modules
import time
import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
And here’s the output:
Notice the time difference between the two functions. Both take almost the same time here; the real advantage of caching appears on repeated access.

Since we're making only one request, the memoized function still has to fetch the data from the website on its first call. Therefore, with our example, a significant difference in execution time isn't expected. However, if you increase the number of calls to these functions, the time difference will grow significantly (see the Performance Comparison section).
Another method to implement caching in Python is to use the built-in `@lru_cache` decorator from `functools`. This decorator implements a cache using the least recently used (LRU) caching strategy. The LRU cache is a fixed-size cache, which means it'll discard the data from the cache that hasn't been used recently.
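The eviction behavior is easiest to see with a very small cache. This standalone sketch (the `square` function is illustrative) caps the cache at two entries and inspects it with the decorator's built-in `cache_info()` method:

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def square(n):
    return n * n

square(1)
square(2)
square(3)  # Cache is full, so the least recently used entry (for 1) is evicted
square(1)  # Recomputed from scratch: a miss, not a hit
print(square.cache_info())  # CacheInfo(hits=0, misses=4, maxsize=2, currsize=2)
```

All four calls are misses: the fourth call recomputes `square(1)` because its entry was evicted to make room for `square(3)`.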
To use the `@lru_cache` decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import it from the `functools` module before using the decorator:
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```
In the above example, the `get_html_data_lru` function is memoized using the `@lru_cache` decorator. The cache can grow indefinitely when the `maxsize` option is set to `None`.
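With an unbounded cache, nothing is ever evicted, so every repeated call with the same argument registers as a hit. A quick standalone check (the `double` function is illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(n):
    return n * 2

for _ in range(5):
    double(10)  # The first call is a miss; the next four are cache hits
print(double.cache_info())  # CacheInfo(hits=4, misses=1, maxsize=None, currsize=1)
```

This is convenient for a scraper that revisits the same URLs, but keep in mind that an unbounded cache holds every response in memory for the lifetime of the process.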
To use the `@lru_cache` decorator, just add it above the `get_html_data_lru` function. Here's the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time
import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```
This produced the following output:
In the following table, we've recorded the execution times of all three functions for different numbers of requests:
| No. of requests | Time taken by normal function | Time taken by memoized function (manual decorator) | Time taken by memoized function (`lru_cache` decorator) |
|---|---|---|---|
| 1 | 2.1 seconds | 2.0 seconds | 1.7 seconds |
| 10 | 17.3 seconds | 2.1 seconds | 1.8 seconds |
| 20 | 32.2 seconds | 2.2 seconds | 2.1 seconds |
| 30 | 57.3 seconds | 2.22 seconds | 2.12 seconds |
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:

The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.
Feel free to visit our blog for an array of intriguing web scraping topics that will keep you hooked!