I've been working on speeding up my web scraping with the `asyncio` library. I have a working solution, but am unsure how Pythonic it is or whether I am using the library properly. Any input would be appreciated.
```python
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    sem = asyncio.Semaphore(5)
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return b'\n'.join(etree.tostring(paragraph) for paragraph in paragraphs)


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = 'https://twigserial.wordpress.com/category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run():
    links = generate_links()
    chapters = []
    for f in asyncio.as_completed([extract_text(link) for link in links]):
        result = yield from f
        chapters.append(result)
    return chapters


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run())
    print(len(chapters))


if __name__ == '__main__':
    main()
```

- For future readers attempting to run this from a Jupyter notebook: note that Jupyter's Tornado 5.0 update will result in `RuntimeError: This event loop is already running` / `Unclosed client session` when running this. Resolution: stackoverflow.com/questions/47518874/… – QHarr, Nov 21, 2018 at 10:41
1 Answer
Looks ... great? Not a lot to complain about really.
The semaphore doesn't do anything when used like this, though; it should be passed in from the top to protect the `get`/`aiohttp.request` call. You can see that if you `print` something right before the HTTP request.
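As a rough sketch of what that looks like (my illustration, not the answerer's code; whether the `with` block lives in `get` or in `extract_text` is a matter of taste, and the revised code below keeps it in `extract_text`):

```python
import asyncio
import aiohttp


@asyncio.coroutine
def get(url, sem):
    # sem is a single asyncio.Semaphore(5) created once at the top level and
    # passed down, so at most five requests are in flight at any moment.
    # The print makes that visible: URLs appear in batches rather than all at once.
    with (yield from sem):
        print('fetching', url)
        response = yield from aiohttp.request('GET', url)
        return (yield from response.read_and_close())
```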
Also, the result of `asyncio.as_completed` will be in random order, so be sure to sort the resulting chapters somehow, e.g. by returning both the URL and the collected text from `extract_text`.
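For instance (again a sketch of mine, not part of the answer; the helper name and the plain lexicographic sort are assumptions, and a natural sort on the chapter number in the URL might be more appropriate):

```python
def order_chapters(results):
    """Put the (url, text) pairs gathered via asyncio.as_completed back into a
    stable order; completion order depends on network timing and is effectively
    arbitrary."""
    return [text for url, text in sorted(results)]
```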
A couple of small things as well:
- List comprehensions are okay, but with just a single argument it can be shorter and equally performant just to use `map`.
- The URL constants should ideally be defined on the top level; at least `base_url` can also be defined by concatenating with `start_url`. Alternatively they could be passed in to `generate_links`. Then again, it's unlikely that another blog has the exact same layout?
- The manual `append` in `run` seems unnecessary; I'd rewrite it into a list of generators and use a list comprehension instead.
- At the moment `generate_links` is called from `run`; I think it makes more sense to call it from the `main` function: it doesn't need to run concurrently, and you could think of a situation where you'd pass in the result of a different function to be fetched and collected.
All in all, I'd maybe change things to the code below. Of course, if you were to add things to it, I'd recommend looking into command line arguments and configuration files, ...
```python
import aiohttp
import asyncio
import requests
from lxml import etree


@asyncio.coroutine
def get(*args, **kwargs):
    """
    A wrapper method for aiohttp's get method. Taken from Georges Dubus' article at
    http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
    """
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.read_and_close())


@asyncio.coroutine
def extract_text(url, sem):
    """
    Given the url for a chapter, extract the relevant text from it

    :param url: the url for the chapter to scrape
    :return: a string containing the chapter's text
    """
    with (yield from sem):
        page = yield from get(url)
    tree = etree.HTML(page)
    paragraphs = tree.findall('.//*/div[@class="entry-content"]/p')[1:-1]
    return url, b'\n'.join(map(etree.tostring, paragraphs))


def generate_links():
    """
    Generate the links to each of the chapters

    :return: A list of strings containing every url to visit
    """
    start_url = 'https://twigserial.wordpress.com/'
    base_url = start_url + 'category/story/'
    tree = etree.HTML(requests.get(start_url).text)
    xpath = './/*/option[@class="level-2"]/text()'
    return [base_url + suffix.strip() for suffix in tree.xpath(xpath)]


@asyncio.coroutine
def run(links):
    sem = asyncio.Semaphore(5)
    fetchers = [extract_text(link, sem) for link in links]
    return [(yield from f) for f in asyncio.as_completed(fetchers)]


def main():
    loop = asyncio.get_event_loop()
    chapters = loop.run_until_complete(run(generate_links()))
    print(len(chapters))


if __name__ == '__main__':
    main()
```

- Shouldn't you run `loop.close()` at the end of `main`? – Derek Adair, May 3, 2016 at 0:53
- No idea, I expected the original code to be correct. Does it matter in `main`? Lastly, while I found some examples with `close`, there are some without ... so I'm not sure about it. Do you have a good reference for it? – ferada, May 3, 2016 at 8:07
- I'm very new to asyncio, and I'm building a crawler. The only reference I have is the asyncio docs which I linked. Also this. It clears the queue and "shuts down the executor"... whatever that means, I was hoping you may know! ha! – Derek Adair, May 3, 2016 at 14:54
- Maybe only useful when you run `loop.run_forever()` instead of `loop.run_until_complete()`. – Derek Adair, May 3, 2016 at 15:01
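Regarding the `loop.close()` question above, here is a small sketch of that variant (my addition, not something from the answer or comments): wrapping the call in `try`/`finally` makes the cleanup explicit while leaving the rest of `main` unchanged.

```python
def main():
    loop = asyncio.get_event_loop()
    try:
        chapters = loop.run_until_complete(run(generate_links()))
        print(len(chapters))
    finally:
        # Closing the loop frees its resources; it mostly matters when the
        # process keeps running afterwards instead of exiting straight away.
        loop.close()
```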