
Below is a portion of the code I have written to scrape the bikesales.com.au website for details of bikes for sale (the full code is here). It finds all the 'href' attributes on each search page and requests the HTML for each href, corresponding to each bike for sale. My code works correctly; however, I had to add retry attempts with exponential backoff to avoid the following error:

ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)

I would like to avoid the backoff approach if possible.

```python
import time
from contextlib import closing

from requests import get
from requests.exceptions import RequestException
from bs4 import BeautifulSoup


def get_html_content(url, multiplier=1):
    """
    Retrieve the contents of the url.
    """
    # Be a responsible scraper.
    # The multiplier is used to exponentially increase the delay when
    # there are several attempts at connecting to the url.
    time.sleep(2 * multiplier)
    # Get the html from the url.
    try:
        with closing(get(url)) as resp:
            content_type = resp.headers['Content-Type'].lower()
            if is_good_response(resp):
                return resp.content
            else:
                # Unable to get the url response.
                return None
    except RequestException as e:
        print("Error during requests to {0} : {1}".format(url, str(e)))


if __name__ == '__main__':
    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_html_content(url)
    html = BeautifulSoup(content, 'html.parser')
    BikeList = html.findAll("a", {"class": "item-link-container"})

    # Cycle through the list of bikes on each search page.
    for bike in BikeList:
        # Get the URL for each bike.
        individualBikeURL = bike.attrs['href']
        BikeContent = get_html_content(baseUrl + individualBikeURL)

        # Reset the multiplier for each new url.
        multiplier = 1

        # Occasionally the connection is lost, so try again. I'm not sure
        # why; it might be that the site is guarding against scraping software.
        # If the initial attempt was unsuccessful, retry with an increasing delay.
        while BikeContent is None:
            # Limit the exponential delay to 16x.
            if multiplier < 16:
                multiplier *= 2
            BikeContent = get_html_content(baseUrl + individualBikeURL, multiplier)
```
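To make the backoff schedule concrete, the delays this loop produces can be tabulated with a small hypothetical helper that mirrors the multiplier logic:

```python
def backoff_delays(attempts):
    """Mirror the retry loop above: double the multiplier up to a cap of 16,
    sleeping 2 * multiplier seconds before each retry."""
    multiplier, delays = 1, []
    for _ in range(attempts):
        if multiplier < 16:
            multiplier *= 2
        delays.append(2 * multiplier)
    return delays

backoff_delays(6)  # → [4, 8, 16, 32, 32, 32]
```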

My question is: is there something I am missing in the implementation of the request, or is this just a result of the site denying scraping tools?
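(One standard way to avoid hand-rolling backoff is to let `requests` delegate retries to urllib3's `Retry` via a mounted `HTTPAdapter`. A minimal sketch; the retry count and status list below are illustrative, not tuned for bikesales.com.au:)

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 5 times with exponential backoff between attempts,
# retrying on connection errors and on these server-side status codes.
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=(500, 502, 503, 504))

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

# session.get(...) now retries transparently; no manual sleep loop needed.
```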

asked May 29, 2018 at 1:25 by theotheraussie
  • I would be interested to know if this code will work on any website or just a specific website? – Commented May 29, 2018 at 13:37
  • I've edited the question to more clearly indicate that the code functions. Please confirm that my edit was correct. – Commented May 29, 2018 at 13:39
  • Could you include the code for is_good_response? – Commented May 29, 2018 at 13:45
  • @Peilonrayz github.com/beaubellamy/BikeSalesScraper/blob/master/… – Commented May 29, 2018 at 13:47
  • @Malachi: I've used the same code to scrape one other website 'skiresort.info' without any issues. – Commented May 30, 2018 at 5:49

1 Answer

  1. I assume is_good_response is just checking for a 200 response code.
  2. Merge is_good_response, get_html_content, and the insides of the for loop in your main together.

This makes the main code:

```python
from requests import get
from bs4 import BeautifulSoup

if __name__ == '__main__':
    baseUrl = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_bike(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})
    for bike in bike_list:
        individualBikeURL = bike.attrs['href']
        bike_content = get_bike(baseUrl + individualBikeURL)
```

Where we will be focusing on:

```python
def get_bike(url):
    multiplier = 1
    while True:
        time.sleep(2 * multiplier)
        try:
            with closing(get(url)) as resp:
                content_type = resp.headers['Content-Type'].lower()
                if 200 <= resp.status_code < 300:
                    return resp.content
        except RequestException as e:
            print("Error during requests to {0} : {1}".format(url, str(e)))
        if multiplier < 16:
            multiplier *= 2
```
  1. Allow a retry argument, which should act differently for different values:

     • None – don't retry.
     • -1 – retry infinitely.
     • n – retry with exponential delays up to \$2^n\$.
     • iterator – loop through it for the delays.

    We can also add another function to work the same way your previous code did.

  2. You shouldn't need to use contextlib.closing, as Response.close "should not normally need to be called explicitly."

  3. You don't need content_type in get_bike.
  4. You should use *args and **kwargs so you can use requests.get's arguments if you ever need to.
  5. You can allow this to work with post and other request methods if you take the method as a parameter.
```python
import itertools
import collections.abc
import time

import requests.exceptions


def request(method, retry=None, *args, **kwargs):
    if retry is None:
        retry = iter(())
    elif retry == -1:
        retry = (2**i for i in itertools.count())
    elif isinstance(retry, int):
        retry = (2**i for i in range(retry))
    elif isinstance(retry, collections.abc.Iterable):
        pass
    else:
        raise ValueError('Unknown retry {retry}'.format(retry=retry))

    for sleep in itertools.chain([0], retry):
        if sleep:
            time.sleep(sleep)
        try:
            resp = method(*args, **kwargs)
            if 200 <= resp.status_code < 300:
                return resp.content
        except requests.exceptions.RequestException as e:
            print('Error during request to {0} : {1}'.format(args, str(e)))
    return None


def bike_retrys():
    for i in range(5):
        yield 2**i
    while True:
        yield 16
```
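The capped schedule that bike_retrys produces can be checked in isolation (the generator is restated here so the snippet runs standalone):

```python
import itertools

def bike_retrys():
    # 1, 2, 4, 8, 16, then 16 forever: exponential backoff capped at 16s.
    for i in range(5):
        yield 2**i
    while True:
        yield 16

list(itertools.islice(bike_retrys(), 8))  # → [1, 2, 4, 8, 16, 16, 16, 16]
```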

To improve the rest of the code:

  1. Use snake case.
  2. Constants should be in upper snake case.
  3. Use the above code.
  4. Use import requests, rather than from requests import get.
  5. You can make a little helper function to call request, so usage is cleaner.
```python
import requests
from bs4 import BeautifulSoup


def get_bike(*args, **kwargs):
    return request(requests.get, bike_retrys(), *args, **kwargs)


if __name__ == '__main__':
    BASE_URL = 'https://www.bikesales.com.au/'
    url = 'https://www.bikesales.com.au/bikes/?q=Service%3D%5BBikesales%5D'
    content = get_bike(url)
    html = BeautifulSoup(content, 'html.parser')
    bike_list = html.findAll("a", {"class": "item-link-container"})
    for bike in bike_list:
        bike_content = get_bike(BASE_URL + bike.attrs['href'])
```
answered May 29, 2018 at 14:39 by Peilonrayz
  • if 200 <= resp.status_code < 300 => if resp.ok? – Commented May 29, 2018 at 15:45
  • @MathiasEttinger I didn't know resp.ok was a thing. However, from the documentation, it is the same as 200 <= resp.status_code < 400. – Commented May 29, 2018 at 15:48
  • Right, but since allow_redirects=False is not used here, all 3xx are converted to the final element. – Commented May 29, 2018 at 15:57
  • @MathiasEttinger I'll admit I don't know much about 3xx. From what you put, it'd be better to use it whether we use allow_redirects or not. I'll edit my answer in a bit, or you can if you want. – Commented May 29, 2018 at 16:12
