oxylabs/python-cache-tutorial

A guide to caching web scraping scripts in Python.
This article will show you how to use caching in Python with your web scraping tasks. You can read the full article on our blog, where we delve deeper into the different caching strategies.
There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our step-by-step Python web scraping guide.
We’ll use the requests library to make HTTP requests to a website. Install it with pip by entering the following command in your terminal:
```bash
python -m pip install requests
```
Other libraries we’ll use in this project, specifically time and functools, come natively with Python 3.11.2, so you don’t have to install them.
A decorator in Python is a function that accepts another function as an argument and outputs a new function. We can alter the behavior of the original function using a decorator without changing its source code.
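To make this concrete, here’s a minimal sketch of a decorator before we get to caching. The names log_calls and greet are purely illustrative, not part of this tutorial’s code:

```python
# Illustrative sketch: a decorator that adds logging around any function.
def log_calls(func):
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__}')  # behavior added by the decorator
        return func(*args, **kwargs)       # the original function runs unchanged
    return wrapper

@log_calls
def greet(name):
    return f'Hello, {name}!'

print(greet('world'))  # prints "Calling greet", then "Hello, world!"
```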
One common use case for decorators is to implement caching. This involves creating a dictionary to store the function's results and then saving them in the cache for future use.
Let’s start by creating a simple function that takes a URL as a function argument, requests that URL, and returns the response text:
```python
import requests

def get_html_data(url):
    response = requests.get(url)
    return response.text
```
Now, let's move on to creating a memoized version of this function:
```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```
Here, we define a memoize decorator that creates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current input arguments have been cached before and, if so, returns the previously cached result. If not, it calls the original function and caches the result before returning it.
By adding @memoize above the function definition, we can use the memoize decorator to enhance the get_html_data function. This generates a new memoized function that we’ve called get_html_data_cached. It only makes a single network request for a URL and then stores the response in the cache for further requests.
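As a quick sanity check, here’s a sketch (assuming the definitions above) showing that calling the memoized function twice with the same URL triggers only one network request:

```python
# First call: cache miss, so a real network request is made.
html_1 = get_html_data_cached('https://books.toscrape.com/')
# Second call: cache hit, the stored response text is returned immediately.
html_2 = get_html_data_cached('https://books.toscrape.com/')
print(html_1 is html_2)  # True: both names point to the same cached object
```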
Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:
```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s what the complete code looks like:
```python
# Import the required modules
import time

import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize decorator to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for the normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for the memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s the output:
Notice the time difference between the two functions. Both take almost the same time here; the real advantage of caching only shows up on re-access.
Since we’re making only one request, the memoized function still has to fetch the data over the network on its first call. Therefore, with our example, a significant time difference in execution isn’t expected. However, if you increase the number of calls to these functions, the time difference will significantly increase (see Performance Comparison).
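For example, here’s a rough sketch (assuming the functions defined above) that repeats each call ten times to make the caching benefit visible:

```python
import time

# Ten calls with the normal function: every call hits the network.
start_time = time.time()
for _ in range(10):
    get_html_data('https://books.toscrape.com/')
print('10 calls (normal function):', time.time() - start_time)

# Ten calls with the memoized function: only the first hits the network.
start_time = time.time()
for _ in range(10):
    get_html_data_cached('https://books.toscrape.com/')
print('10 calls (memoized function):', time.time() - start_time)
```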
Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. This decorator implements caching using the least recently used (LRU) strategy. The LRU cache is a fixed-size cache, which means that once it fills up, it discards the entries that haven’t been used recently.
To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import the functools module before using the decorator:
```python
from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```
In the above example, the get_html_data_lru function is memoized using the @lru_cache decorator. The cache can grow indefinitely when the maxsize option is set to None.
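If you’d rather bound the cache, you can pass a number instead. The sketch below (the function name and the maxsize value of 128 are just for illustration) also uses the decorator’s built-in cache_info() to inspect hits and misses:

```python
from functools import lru_cache

import requests

@lru_cache(maxsize=128)  # keep at most 128 responses, evicting the least recently used
def get_html_data_bounded(url):
    response = requests.get(url)
    return response.text

get_html_data_bounded('https://books.toscrape.com/')  # miss: fetched over the network
get_html_data_bounded('https://books.toscrape.com/')  # hit: served from the cache
print(get_html_data_bounded.cache_info())  # e.g., CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
```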
To use the @lru_cache decorator, just add it above the get_html_data_lru function. Here’s the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time

import requests

# Function for getting HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Getting time for the normal function to extract HTML content
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Getting time for the memoized function (LRU cache) to extract HTML content
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```
This produced the following output:
In the following table, we’ve listed the execution times of all three functions for different numbers of requests:
| No. of requests | Time taken by normal function | Time taken by memoized function (manual decorator) | Time taken by memoized function (lru_cache decorator) |
|---|---|---|---|
| 1 | 2.1 seconds | 2.0 seconds | 1.7 seconds |
| 10 | 17.3 seconds | 2.1 seconds | 1.8 seconds |
| 20 | 32.2 seconds | 2.2 seconds | 2.1 seconds |
| 30 | 57.3 seconds | 2.22 seconds | 2.12 seconds |
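As a rough sketch of how such numbers could be reproduced (assuming the three functions defined above; exact timings will vary with your network), you could time each function over n repeated calls:

```python
import time

def time_calls(func, url, n):
    """Time n repeated calls to func(url), in seconds."""
    start_time = time.time()
    for _ in range(n):
        func(url)
    return time.time() - start_time

url = 'https://books.toscrape.com/'
for n in (1, 10, 20, 30):
    # Note: the cached functions keep their cache between rounds, so only
    # their very first call anywhere actually hits the network.
    print(f'{n} requests: '
          f'normal={time_calls(get_html_data, url, n):.2f}s, '
          f'manual={time_calls(get_html_data_cached, url, n):.2f}s, '
          f'lru={time_calls(get_html_data_lru, url, n):.2f}s')
```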
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:
The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.
Feel free to visit our blog for an array of intriguing web scraping topics that will keep you hooked!