# python-cache-tutorial

A guide to caching web scraping scripts in Python.
This article will show you how to use caching in Python for your web scraping tasks. You can read the full article on our blog, where we delve deeper into the different caching strategies.

There are different ways to implement caching in Python for different caching strategies. Here we'll see two methods of Python caching for a simple web scraping example. If you're new to web scraping, take a look at our step-by-step Python web scraping guide.
We'll use the `requests` library to make HTTP requests to a website. Install it with pip by entering the following command in your terminal:

```bash
python -m pip install requests
```
The other libraries we'll use in this project, `time` and `functools`, come natively with Python 3.11.2, so you don't have to install them.
A decorator in Python is a function that accepts another function as an argument and outputs a new function. We can alter the behavior of the original function using a decorator without changing its source code.

One common use case for decorators is to implement caching. This involves creating a dictionary to store the function's results and then saving them in the cache for future use.
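Before we get to caching, here's a minimal standalone sketch of how a decorator wraps another function. The names `log_calls` and `greet` are purely illustrative and not part of the scraping example:

```python
import functools

def log_calls(func):
    # functools.wraps preserves the wrapped function's name and docstring
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__} with arguments {args}')
        return func(*args, **kwargs)
    return wrapper

@log_calls
def greet(name):
    return f'Hello, {name}!'

print(greet('world'))  # Logs the call, then prints "Hello, world!"
```

The decorator replaces `greet` with `wrapper`, so every call passes through the extra logging step first; the caching decorators below follow the same pattern, just with a lookup instead of a log line.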
Let’s start by creating a simple function that takes a URL as a functionargument, requests that URL, and returns the response text:
```python
def get_html_data(url):
    response = requests.get(url)
    return response.text
```
Now, let's move toward creating a memoized version of this function:
```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```
The `wrapper` function determines whether the current input arguments have been previously cached and, if so, returns the previously cached result. If not, the code calls the original function and caches the result before returning it. In this case, we define a `memoize` decorator that generates a `cache` dictionary to hold the results of previous function calls.
By adding `@memoize` above the function definition, we can use the memoize decorator to enhance the `get_html_data` function. This generates a new memoized function that we've called `get_html_data_cached`. It only makes a single network request for a URL and then stores the response in the cache for further requests.
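You can verify the cache-hit behavior without making any network requests by memoizing a deliberately slow pure function. This is a standalone sketch; `slow_square` and `call_count` are illustrative names, not part of the scraping example:

```python
import time

def memoize(func):
    cache = {}
    def wrapper(*args):
        if args in cache:
            return cache[args]
        result = func(*args)
        cache[args] = result
        return result
    return wrapper

call_count = 0

@memoize
def slow_square(n):
    global call_count
    call_count += 1
    time.sleep(0.1)  # Simulate an expensive operation such as a network request
    return n * n

slow_square(4)
slow_square(4)     # Served from the cache; the function body doesn't run again
print(call_count)  # 1
```

The second call returns instantly because `(4,)` is already a key in `cache`, so the function body (and its simulated delay) runs only once.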
Let's use the `time` module to compare the execution speeds of the `get_html_data` function and the memoized `get_html_data_cached` function:
```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s what the complete code looks like:
```python
# Import the required modules
import time
import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
And here’s the output:
Notice the time difference between the two functions. Both take almost the same time here; the real advantage of caching appears on repeated access.

Since we're making only one request, the memoized function still has to fetch the data from the website on its first call. Therefore, with our example, a significant difference in execution time isn't expected. However, if you increase the number of calls to these functions, the time difference will grow significantly (see the Performance Comparison section).
Another method to implement caching in Python is to use the built-in `@lru_cache` decorator from `functools`. This decorator implements a cache using the least recently used (LRU) caching strategy. The LRU cache is a fixed-size cache, which means it'll discard the data from the cache that hasn't been used recently.
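The eviction behavior is easiest to see with a very small cache. This standalone sketch (the `square` function is illustrative) caps the cache at two entries and inspects it with the decorator's built-in `cache_info()` method:

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def square(n):
    return n * n

square(1)
square(2)
square(3)  # Cache is full, so the least recently used entry (for 1) is evicted
square(1)  # Recomputed from scratch: a miss, not a hit
print(square.cache_info())  # CacheInfo(hits=0, misses=4, maxsize=2, currsize=2)
```

All four calls are misses: the fourth call recomputes `square(1)` because its entry was evicted to make room for `square(3)`.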
To use the `@lru_cache` decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import it from the `functools` module before using the decorator:
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```
In the above example, the `get_html_data_lru` function is memoized using the `@lru_cache` decorator. The cache can grow indefinitely when the `maxsize` option is set to `None`.
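With an unbounded cache, nothing is ever evicted, so every repeated call with the same argument registers as a hit. A quick standalone check (the `double` function is illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def double(n):
    return n * 2

for _ in range(5):
    double(10)  # The first call is a miss; the next four are cache hits
print(double.cache_info())  # CacheInfo(hits=4, misses=1, maxsize=None, currsize=1)
```

This is convenient for a scraper that revisits the same URLs, but keep in mind that an unbounded cache holds every response in memory for the lifetime of the process.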
To use the `@lru_cache` decorator, just add it above the `get_html_data_lru` function. Here's the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time
import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```
This produced the following output:
In the following table, we've recorded the execution times of all three functions for different numbers of requests:
| No. of requests | Time taken by normal function | Time taken by memoized function (manual decorator) | Time taken by memoized function (`lru_cache` decorator) |
|---|---|---|---|
| 1 | 2.1 seconds | 2.0 seconds | 1.7 seconds |
| 10 | 17.3 seconds | 2.1 seconds | 1.8 seconds |
| 20 | 32.2 seconds | 2.2 seconds | 2.1 seconds |
| 30 | 57.3 seconds | 2.22 seconds | 2.12 seconds |
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:

The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.
Feel free to visit our blog for an array of intriguing web scraping topics that will keep you hooked!