I've been working on speeding up my web scraping with the `asyncio` library. I have a working solution, but am unsure how Pythonic it is or whether I am using the library properly. Any input would be appreciated.
```python
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []
    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)
    return chapters


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))


if __name__ == '__main__':
    main()
```

- For future readers attempting to run this from a Jupyter notebook: note that Jupyter's Tornado 5.0 update will result in `RuntimeError: This event loop is already running` / `Unclosed client session` when running this. Resolution: stackoverflow.com/questions/47518874/… – QHarr, Nov 21, 2018 at 10:41
1 Answer
Looks ... great? Not a lot to complain about really.
The semaphore doesn't do anything when used like this, though; it should be passed in from the top to protect the `get`/`aiohttp.request` call. You can see that if you `print` something right before the HTTP request.
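As a rough sketch of what that looks like (my illustration, not the answerer's code; whether the `with` block lives in `get` or in `extract_text` is a matter of taste, and the revised code below keeps it in `extract_text`):

```python
import asyncio
import aiohttp


@asyncio.coroutine
def get(url, sem):
    # sem is a single asyncio.Semaphore(5) created once at the top level and
    # passed down, so at most five requests are in flight at any moment.
    # The print makes that visible: URLs appear in batches rather than all at once.
    with (yield from sem):
        print('fetching', url)
        response = yield from aiohttp.request('GET', url)
        return (yield from response.read_and_close())
```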
Also, the result of `asyncio.as_completed` will be in random order, so be sure to sort the resulting chapters somehow, e.g. by returning both the URL and the collected text from `extract_text`.
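For instance (again a sketch of mine, not part of the answer; the helper name and the plain lexicographic sort are assumptions, and a natural sort on the chapter number in the URL might be more appropriate):

```python
def order_chapters(results):
    """Put the (url, text) pairs gathered via asyncio.as_completed back into a
    stable order; completion order depends on network timing and is effectively
    arbitrary."""
    return [text for url, text in sorted(results)]
```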
A couple of small things as well:
- List comprehensions are okay, but with just a single argument it can be shorter and equally performant just to use `map`.
- The URL constants should ideally be defined on the top level; at least `base_url` can also be defined by concatenating with `start_url`. Alternatively they could be passed in to `generate_links`. Then again, it's unlikely that another blog has the exact same layout?
- The manual `append` in `run` seems unnecessary; I'd rewrite it into a list of generators and use a list comprehension instead.
- At the moment `generate_links` is called from `run`; I think it makes more sense to call it from the `main` function: it doesn't need to run concurrently, and you could think of a situation where you'd pass in the result of a different function to be fetched and collected.
All in all, I'd maybe change things to the code below. Of course, if you were to add things to it, I'd recommend looking into command line arguments and configuration files, ...
```python
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run(links):
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    return [(yield from f) for f in asyncio.as_completed(fetchers)]


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))


if __name__ == '__main__':
    main()
```

- Shouldn't you run `loop.close()` at the end of `main`? – Derek Adair, May 3, 2016 at 0:53
- No idea, I expected the original code to be correct. Does it matter in `main`? Lastly, while I found some examples with `close`, there are some without ... so I'm not sure about it. Do you have a good reference for it? – ferada, May 3, 2016 at 8:07
- I'm very new to asyncio, and I'm building a crawler. The only reference I have is the asyncio docs which I linked. Also this. It clears the queue and "shuts down the executor"... whatever that means, I was hoping you may know! ha! – Derek Adair, May 3, 2016 at 14:54
- Maybe only useful when you run `loop.run_forever()` instead of `loop.run_until_complete()`. – Derek Adair, May 3, 2016 at 15:01
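Regarding the `loop.close()` question above, here is a small sketch of that variant (my addition, not something from the answer or comments): wrapping the call in `try`/`finally` makes the cleanup explicit while leaving the rest of `main` unchanged.

```python
def main():
    loop = asyncio.get_event_loop()
    try:
        chapters = loop.run_until_complete(run(generate_links()))
        print(len(chapters))
    finally:
        # Closing the loop frees its resources; it mostly matters when the
        # process keeps running afterwards instead of exiting straight away.
        loop.close()
```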