
I've created a script in Python that scrapes proxies (ones that are supposed to support "https") from a website. The script then uses those proxies at random to scrape the titles of different coffee shops from another website, and it is supposed to switch to a new proxy with every new request. I've tried my best to make it robust, and the scraper is working fine at the moment.

I'd be happy to shake off any redundancy in my script (i.e. keep it DRY) or to apply any other change that makes it better.

This is the complete approach:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import choice

links = ['https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'.format(page) for page in range(1, 6)]

def get_proxies():
    link = 'https://www.sslproxies.org/'
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies  #producing list of proxies that supports "https"

def check_proxy(session, proxy_list=get_proxies(), validated=False):
    proxy = choice(proxy_list)
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        print(session.get('https://httpbin.org/ip').json())
        validated = True  #try to make sure it is a working proxy
        return
    except Exception:
        pass

    while True:
        proxy = choice(proxy_list)
        session.proxies = {'https': 'https://{}'.format(proxy)}
        if not validated:  #otherwise get back to ensure it does fetch a working proxy
            print("-------go validate--------")
            return

def parse_content(url):
    ua = UserAgent()
    session = requests.Session()
    session.headers = {'User-Agent': ua.random}
    check_proxy(session)  #collect a working proxy to be used to fetch a valid response

    while True:
        try:
            response = session.get(url)
            break  #as soon as it fetches a valid response, it will break out of the while loop to continue with the rest
        except Exception as e:
            session.headers = {'User-Agent': ua.random}
            check_proxy(session)  #if exception is raised, start over again
            parse_content(url)

    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    for link in links:
        parse_content(link)
asked Jun 4, 2018 at 21:11 by SIM
  • except/pass is usually a bad idea. You'll want to at least know which exceptions to swallow and which ones to print. (Commented Jun 5, 2018 at 1:19; see the sketch after these comments.)
  • Why are you reassigning session.proxies? (Commented Jun 5, 2018 at 1:20)
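To illustrate the first comment, here is a minimal sketch (not part of the original post) of narrowing the bare except down to the requests exceptions a dead proxy typically raises. The proxy_works helper name is made up for the example:

import requests

def proxy_works(session, proxy):
    # Hypothetical helper for illustration only: returns True if the proxy
    # can reach httpbin over HTTPS, False if it fails in an expected way.
    session.proxies = {'https': 'https://{}'.format(proxy)}
    try:
        response = session.get('https://httpbin.org/ip', timeout=10)
        response.raise_for_status()
        return True
    except (requests.exceptions.ProxyError,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.HTTPError) as err:
        # These are the failures a bad proxy is expected to cause: log and skip.
        print('proxy {} failed: {}'.format(proxy, err))
        return False

Any other exception type would propagate instead of being silently swallowed, which is the point the commenter is making.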

1 Answer


From your description you want your code to perform these tasks:

  1. Get a list of proxies
  2. That support https
  3. That are actually working

You also want that list to be randomized (your description implies no repetition, though your code allows repeats).
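As a quick illustration of that difference (a sketch with placeholder proxy addresses, not from the original answer): random.choice draws independently and can return the same proxy twice, while shuffling once and then iterating gives a random order without repeats:

from random import choice, shuffle

proxies = ['1.2.3.4:8080', '5.6.7.8:3128', '9.10.11.12:80']  # placeholder addresses

# Independent draws: the same proxy may come up more than once.
repeated_picks = [choice(proxies) for _ in range(5)]

# Shuffle once, then iterate: random order, each proxy used at most once.
shuffled = proxies[:]
shuffle(shuffled)
proxy_iter = iter(shuffled)
print(next(proxy_iter))  # next proxy in the shuffled order; StopIteration once exhausted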

I would use a couple of generators for that:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import shuffle

def get_proxies(link):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    https_proxies = filter(lambda item: "yes" in item.text,
                           soup.select("table.table tr"))
    for item in https_proxies:
        yield "{}:{}".format(item.select_one("td").text,
                             item.select_one("td:nth-of-type(2)").text)

def get_random_proxies_iter():
    proxies = list(get_proxies('https://www.sslproxies.org/'))
    shuffle(proxies)
    return iter(proxies)  # iter so we can call next on it to get the next proxy

def get_proxy(session, proxies, validated=False):
    session.proxies = {'https': 'https://{}'.format(next(proxies))}
    if validated:
        while True:
            try:
                return session.get('https://httpbin.org/ip').json()
            except Exception:
                session.proxies = {'https': 'https://{}'.format(next(proxies))}

def get_response(url):
    session = requests.Session()
    ua = UserAgent()
    proxies = get_random_proxies_iter()
    while True:
        try:
            session.headers = {'User-Agent': ua.random}
            print(get_proxy(session, proxies, validated=True))  # collect a working proxy to be used to fetch a valid response
            return session.get(url)  # as soon as it fetches a valid response, it will break out of the while loop
        except StopIteration:
            raise  # No more proxies left to try
        except Exception:
            pass  # Other errors: try again

def parse_content(url):
    response = get_response(url)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
    links = [url.format(page) for page in range(1, 6)]
    for link in links:
        parse_content(link)

This makes sure that no proxy is reused while scraping a single page, and the order in which the proxies are tried is different for each page. If you don't want the same proxies to be tried again for a new page, just call get_random_proxies_iter outside of parse_content and feed it all the way down to get_proxy.
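For completeness, a hedged sketch of that last suggestion, assuming the other functions from the code above (get_proxy, get_random_proxies_iter, the imports) stay unchanged; only get_response, parse_content, and the main block are rewired to share one iterator:

def get_response(url, proxies):
    session = requests.Session()
    ua = UserAgent()
    while True:
        try:
            session.headers = {'User-Agent': ua.random}
            get_proxy(session, proxies, validated=True)
            return session.get(url)
        except StopIteration:
            raise  # the shared pool is exhausted
        except Exception:
            pass  # dead proxy or bad response: try the next one

def parse_content(url, proxies):
    response = get_response(url, proxies)
    soup = BeautifulSoup(response.text, 'lxml')
    for items in soup.select(".info span[itemprop='name']"):
        print(items.text)

if __name__ == '__main__':
    url = 'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={}'
    shared_proxies = get_random_proxies_iter()  # built once and shared, so no proxy is tried more than once across all pages
    for page in range(1, 6):
        parse_content(url.format(page), shared_proxies)

With this wiring the proxy pool is consumed across the whole run, so the loop stops with StopIteration once every proxy has been used, rather than retrying proxies that already failed on an earlier page.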

answered Jun 5, 2018 at 9:32 by Graipher
