oxylabs/python-cache-tutorial

A guide to caching web scraping scripts in Python.
This article will show you how to use caching in Python with your web scraping tasks. You can read the full article on our blog, where we delve deeper into the different caching strategies.
There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our step-by-step Python web scraping guide.
We’ll use the requests library to make HTTP requests to a website. Install it with pip by entering the following command in your terminal:
```bash
python -m pip install requests
```
Other libraries we’ll use in this project, specifically time and functools, come natively with Python 3.11.2, so you don’t have to install them.
A decorator in Python is a function that accepts another function as an argument and outputs a new function. We can alter the behavior of the original function using a decorator without changing its source code.
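To make this concrete, here’s a minimal sketch of a decorator before we get to caching. The names log_calls and greet are purely illustrative, not part of this tutorial’s code:

```python
# Illustrative sketch: a decorator that adds logging around any function.
def log_calls(func):
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__}')  # behavior added by the decorator
        return func(*args, **kwargs)       # the original function runs unchanged
    return wrapper

@log_calls
def greet(name):
    return f'Hello, {name}!'

print(greet('world'))  # prints "Calling greet", then "Hello, world!"
```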
One common use case for decorators is to implement caching. This involves creating a dictionary to store the function's results and then saving them in the cache for future use.
Let’s start by creating a simple function that takes a URL as a function argument, requests that URL, and returns the response text:
```python
import requests

def get_html_data(url):
    response = requests.get(url)
    return response.text
```
Now, let's move on to creating a memoized version of this function:
```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```
Here, we define a memoize decorator that creates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current input arguments have been cached before and, if so, returns the previously cached result. If not, it calls the original function and caches the result before returning it.
By adding @memoize above the function definition, we can use the memoize decorator to enhance the get_html_data function. This generates a new memoized function that we’ve called get_html_data_cached. It only makes a single network request for a URL and then stores the response in the cache for further requests.
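As a quick sanity check, here’s a sketch (assuming the definitions above) showing that calling the memoized function twice with the same URL triggers only one network request:

```python
# First call: cache miss, so a real network request is made.
html_1 = get_html_data_cached('https://books.toscrape.com/')
# Second call: cache hit, the stored response text is returned immediately.
html_2 = get_html_data_cached('https://books.toscrape.com/')
print(html_1 is html_2)  # True: both names point to the same cached object
```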
Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:
```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s what the complete code looks like:
```python
# Import the required modules
import time

import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize decorator to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for the normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for the memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s the output:
Notice the time difference between the two functions. Both take almost the same time here; the real advantage of caching only shows up on re-access.
Since we’re making only one request, the memoized function still has to fetch the data over the network on its first call. Therefore, with our example, a significant time difference in execution isn’t expected. However, if you increase the number of calls to these functions, the time difference will significantly increase (see Performance Comparison).
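For example, here’s a rough sketch (assuming the functions defined above) that repeats each call ten times to make the caching benefit visible:

```python
import time

# Ten calls with the normal function: every call hits the network.
start_time = time.time()
for _ in range(10):
    get_html_data('https://books.toscrape.com/')
print('10 calls (normal function):', time.time() - start_time)

# Ten calls with the memoized function: only the first hits the network.
start_time = time.time()
for _ in range(10):
    get_html_data_cached('https://books.toscrape.com/')
print('10 calls (memoized function):', time.time() - start_time)
```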
Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. This decorator implements caching using the least recently used (LRU) strategy. The LRU cache is a fixed-size cache, which means that once it fills up, it discards the entries that haven’t been used recently.
To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import the functools module before using the decorator:
```python
from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```
In the above example, the get_html_data_lru function is memoized using the @lru_cache decorator. The cache can grow indefinitely when the maxsize option is set to None.
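If you’d rather bound the cache, you can pass a number instead. The sketch below (the function name and the maxsize value of 128 are just for illustration) also uses the decorator’s built-in cache_info() to inspect hits and misses:

```python
from functools import lru_cache

import requests

@lru_cache(maxsize=128)  # keep at most 128 responses, evicting the least recently used
def get_html_data_bounded(url):
    response = requests.get(url)
    return response.text

get_html_data_bounded('https://books.toscrape.com/')  # miss: fetched over the network
get_html_data_bounded('https://books.toscrape.com/')  # hit: served from the cache
print(get_html_data_bounded.cache_info())  # e.g., CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
```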
To use the @lru_cache decorator, just add it above the get_html_data_lru function. Here’s the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time

import requests

# Function for getting HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Getting time for the normal function to extract HTML content
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Getting time for the memoized function (LRU cache) to extract HTML content
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```
This produced the following output:
In the following table, we’ve listed the execution times of all three functions for different numbers of requests:
| No. of requests | Time taken by normal function | Time taken by memoized function (manual decorator) | Time taken by memoized function (lru_cache decorator) |
|---|---|---|---|
| 1 | 2.1 seconds | 2.0 seconds | 1.7 seconds |
| 10 | 17.3 seconds | 2.1 seconds | 1.8 seconds |
| 20 | 32.2 seconds | 2.2 seconds | 2.1 seconds |
| 30 | 57.3 seconds | 2.22 seconds | 2.12 seconds |
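As a rough sketch of how such numbers could be reproduced (assuming the three functions defined above; exact timings will vary with your network), you could time each function over n repeated calls:

```python
import time

def time_calls(func, url, n):
    """Time n repeated calls to func(url), in seconds."""
    start_time = time.time()
    for _ in range(n):
        func(url)
    return time.time() - start_time

url = 'https://books.toscrape.com/'
for n in (1, 10, 20, 30):
    # Note: the cached functions keep their cache between rounds, so only
    # their very first call anywhere actually hits the network.
    print(f'{n} requests: '
          f'normal={time_calls(get_html_data, url, n):.2f}s, '
          f'manual={time_calls(get_html_data_cached, url, n):.2f}s, '
          f'lru={time_calls(get_html_data_lru, url, n):.2f}s')
```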
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:
The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.
Feel free to visit our blog for an array of intriguing web scraping topics that will keep you hooked!