oxylabs/python-cache-tutorial

A guide to caching web scraping scripts in Python.



This article will show you how to use caching in Python with your web scraping tasks. You can read the full article on our blog, where we delve deeper into the different caching strategies.

How to implement a cache in Python

There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our step-by-step Python web scraping guide.

Install the required libraries

We’ll use the requests library to make HTTP requests to a website. Install it with pip by entering the following command in your terminal:

python -m pip install requests

Other libraries we’ll use in this project, specifically time and functools, come natively with Python 3.11.2, so you don’t have to install them.

Method 1: Python caching using a manual decorator

A decorator in Python is a function that accepts another function as an argument and returns a new function. We can alter the behavior of the original function using a decorator without changing its source code.
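For illustration, here’s a minimal decorator of our own (a hypothetical log_calls, not part of the tutorial’s code) that prints the name of the wrapped function each time it’s called:

import functools

def log_calls(func):
    # functools.wraps copies the wrapped function's name and docstring onto the wrapper
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__}')
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

add(2, 3)  # prints "Calling add" and returns 5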

One common use case for decorators is to implement caching. This involves creating a dictionary that acts as a cache: the function’s results are stored in it and reused on future calls.

Let’s start by creating a simple function that takes a URL as a function argument, requests that URL, and returns the response text:

import requests

def get_html_data(url):
    response = requests.get(url)
    return response.text

Now, let's move toward creating a memoized version of this function:

def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

Here, we define a memoize decorator that creates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current input arguments have already been cached and, if so, returns the cached result. If not, it calls the original function and caches the result before returning it.

By adding @memoize above the function definition, we apply the memoize decorator to the get_html_data logic. This produces a new memoized function that we’ve called get_html_data_cached. It makes only a single network request for each URL and then serves the cached response on further requests.
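To see the cache at work, you could time two consecutive calls with the same URL (a quick sketch assuming the definitions above; exact timings will vary):

import time

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')  # first call: network request, result cached
print('First call:', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')  # second call: served from the cache dictionary
print('Second call:', time.time() - start_time)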

Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:

import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)

Here’s what the complete code looks like:

# Import the required modules
import time

import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize decorator to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)

And here’s the output:

Notice the time difference between the two functions. Both take almost the same time on this first call; the real advantage of caching shows up when the data is accessed again.

Since we’re making only one request, the memoized function still has to fetch the data over the network the first time. Therefore, with our example, a significant difference in execution time isn’t expected. However, if you increase the number of calls to these functions, the time difference will grow significantly (see Performance comparison).
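For instance, a rough sketch that repeats each call ten times (an arbitrary count for illustration) makes the gap obvious, since only the first cached call touches the network:

import time

start_time = time.time()
for _ in range(10):
    get_html_data('https://books.toscrape.com/')         # ten network requests
print('Normal function, 10 calls:', time.time() - start_time)

start_time = time.time()
for _ in range(10):
    get_html_data_cached('https://books.toscrape.com/')  # one network request, nine cache hits
print('Memoized function, 10 calls:', time.time() - start_time)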

Method 2: Python caching using LRU cache decorator

Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. This decorator implements a cache using the least recently used (LRU) caching strategy. The LRU cache is a fixed-size cache, which means that once it fills up, it’ll discard the entries that haven’t been used recently.
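As a standalone illustration of the eviction behavior (using a toy square function of our own, not the scraping code), lru_cache also exposes a cache_info() method for inspecting hits and misses:

from functools import lru_cache

@lru_cache(maxsize=2)
def square(n):
    print(f'computing {n}')
    return n * n

square(1)  # computed and cached
square(2)  # computed and cached
square(3)  # computed; the cache is full, so the least recently used entry (1) is evicted
square(1)  # computed again because it was evicted
print(square.cache_info())  # CacheInfo(hits=0, misses=4, maxsize=2, currsize=2)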

To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator above its definition. Make sure to import it from functools before using it:

from functools import lru_cache

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

In the above example, the get_html_data_lru function is memoized using the @lru_cache decorator. The cache can grow indefinitely when the maxsize option is set to None.
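As a side note, since Python 3.9 the standard library also offers functools.cache, a shorthand for lru_cache(maxsize=None); the function above could equivalently be written as:

import requests
from functools import cache

@cache  # equivalent to @lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text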

With the decorator in place above the get_html_data_lru function, here’s the complete code sample:

# Import the required modules
import time
from functools import lru_cache

import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)

This produced the following output:

Performance comparison

The following table shows the execution times of all three functions for different numbers of requests:

No. of requests | Normal function | Memoized (manual decorator) | Memoized (lru_cache decorator)
1               | 2.1 seconds     | 2.0 seconds                 | 1.7 seconds
10              | 17.3 seconds    | 2.1 seconds                 | 1.8 seconds
20              | 32.2 seconds    | 2.2 seconds                 | 2.1 seconds
30              | 57.3 seconds    | 2.22 seconds                | 2.12 seconds

As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:

The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.

Feel free to visit our blog for an array of intriguing web scraping topics that will keep you hooked!

