A comparison of asynchronous and synchronous web scraping methods with practical examples.
- Sending asynchronous HTTP requests
- Sending synchronous HTTP requests
- Comparing the performance of sync and async
In this tutorial, we will focus on scraping multiple URLs using the asynchronous method, and by comparing it to the synchronous one, we will demonstrate why it can be more beneficial. See the full blog post for more information on asynchronous web scraping.
You can also check out one of our videos for a visual representation of the same web scraping tutorial.
Let’s take a look at the asynchronous Python tutorial. For this use case, we will use the `aiohttp` module.
Note that the main function is marked as asynchronous. We use an asyncio loop to prevent the script from exiting until the main function completes.
```python
import asyncio


async def main():
    print('Saving the output of extracted information')


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
Once again, it is a good idea to track the performance of your script. For that purpose, let's write code that tracks script execution time.
First, record the time at the start of the script. Then, type in any code that you need to measure (currently a single `print` statement). Finally, calculate how much time has passed by taking the current time and subtracting the time at the start of the script. Once we have how much time has passed, we print it while rounding the resulting float to two decimal places.
```python
import asyncio
import time


async def main():
    start_time = time.time()
    print('Saving the output of extracted information')
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
Time to read the csv file that contains the URLs. The file will contain a single column called `url`. There, you will see all the URLs that need to be scraped for data.
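Note that urls.csv has to exist next to the script. Its exact contents are not included here, but the expected shape is a header row followed by one URL per line, along these lines (the URLs below are only placeholders, not taken from the tutorial):

```
url
https://example.com/catalogue/book-one/index.html
https://example.com/catalogue/book-two/index.html
```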
Next, we open up urls.csv, then load it using the csv module and loop over each and every URL in the csv file. Additionally, we need to create an async task for every URL we are going to scrape.
```python
import asyncio
import csv
import time


async def main():
    start_time = time.time()

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            # the url from csv can be found in csv_row['url']
            print(csv_row['url'])

    print('Saving the output of extracted information')
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
Later in the function, we wait for all the scraping tasks to complete before moving on.
```python
import asyncio
import csv
import time


async def main():
    start_time = time.time()
    tasks = []

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
All that's left is scraping! But before doing that, remember to take a look at the data you're scraping.
The title of the book can be extracted from an `<h1>` tag that is wrapped by a `<div>` tag with a `product_main` class. Regarding the product information, it can be found in a table with a `table-striped` class.
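To see those two selectors in isolation before wiring them into the scraper, here is a minimal sketch that runs them against a simplified, made-up HTML fragment mimicking the structure described above (the markup is illustrative, not copied from the target page):

```python
from bs4 import BeautifulSoup

# Illustrative markup only: an <h1> inside a div.product_main,
# plus a product table with the "table table-striped" classes.
html = '''
<div class="product_main">
    <h1>Example Book Title</h1>
</div>
<table class="table table-striped">
    <tr><th>UPC</th><td>abc123</td></tr>
    <tr><th>Availability</th><td>In stock</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
book_name = soup.select_one('.product_main').h1.text
rows = soup.select('.table.table-striped tr')
product_info = {row.th.text: row.td.text for row in rows}

print(book_name)     # Example Book Title
print(product_info)  # {'UPC': 'abc123', 'Availability': 'In stock'}
```

These are exactly the two lookups the `scrape` function below relies on.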
The scrape function makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then we use the knowledge about where the data is stored in HTML tags to extract the book name into the `book_name` variable and collect all product information into a `product_info` dictionary.
```python
import asyncio
import csv
import time

import aiohttp
from bs4 import BeautifulSoup


async def scrape(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.text()
            soup = BeautifulSoup(body, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}


async def main():
    start_time = time.time()
    tasks = []

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
The URL is scraped; however, no results can be seen. For that, you need to add another function – `save_product`.

`save_product` takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it.
```python
import asyncio
import csv
import json
import time

import aiohttp
from bs4 import BeautifulSoup


async def save_product(book_name, product_info):
    json_file_name = book_name.replace(' ', '_')
    with open(f'data/{json_file_name}.json', 'w') as book_file:
        json.dump(product_info, book_file)


async def scrape(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.text()
            soup = BeautifulSoup(body, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}
            await save_product(book_name, product_info)


async def main():
    start_time = time.time()
    tasks = []

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            task = asyncio.create_task(scrape(csv_row['url']))
            tasks.append(task)

    print('Saving the output of extracted information')
    await asyncio.gather(*tasks)
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())
```
Lastly, you can run the script and see the data.
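If you want a quick sanity check of what was written out, a small sketch like this (assuming the data/ directory layout used throughout the tutorial) prints every saved product dictionary:

```python
import json
import pathlib

# Print the contents of every JSON file the scraper saved into data/.
for json_path in sorted(pathlib.Path('data').glob('*.json')):
    with open(json_path) as book_file:
        print(json_path.stem, json.load(book_file))
```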
Now let's look at the synchronous approach. In this part of the tutorial, we are going to scrape the URLs defined in urls.csv one at a time. For this particular use case, the Python `requests` module is an ideal tool.
```python
def main():
    print('Saving the output of extracted information')


main()
```
Tracking the performance of your script is always a good idea. Therefore, the next step is to add code that tracks script execution time.
First, record the time at the very start of the script. Then, type in any code that needs to be measured – in this case, we are using a single `print` statement. Finally, calculate how much time has passed. This can be done by taking the current time and subtracting the time at the start of the script. Once we know how much time has passed, we can print it while rounding the resulting float to two decimal places.
```python
import time


def main():
    start_time = time.time()
    print('Saving the output of extracted information')
    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


main()
```
Now that the preparations are done, it's time to read the csv file that contains the URLs. There, you will see a single column called `url`, which will contain the URLs that have to be scraped for data.
Next, we have to open up urls.csv. After that, load it using the csv module and loop over each and every URL from the csv file.
```python
import csv
import time


def main():
    start_time = time.time()
    print('Saving the output of extracted information')

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            # the url from csv can be found in csv_row['url']
            print(csv_row['url'])

    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


main()
```
At this point, the job is almost done - all that's left to do is to scrape it, although before you do that, take a look at the data you're scraping.
The title of the book “A Light in the Attic” can be extracted from an `<h1>` tag that is wrapped by a `<div>` tag with a `product_main` class. As for the product information, it can be found in a table with a `table-striped` class, which you can see in the developer tools.
Now, let's use what we've learned and create a `scrape` function.
The scrape function makes a request to the URL we loaded from the csv file. Once the request is done, it loads the response HTML using the BeautifulSoup module. Then, we use the knowledge about where the data is stored in HTML tags to extract the book name into the `book_name` variable and collect all product information into a `product_info` dictionary.
```python
import csv
import time

import requests
from bs4 import BeautifulSoup


def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}


def main():
    start_time = time.time()
    print('Saving the output of extracted information')

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            scrape(csv_row['url'])

    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


main()
```
The URL is scraped; however, no results are seen yet. For that, it's time to add yet another function – `save_product`.

`save_product` takes two parameters: the book name and the product info dictionary. Since the book name contains spaces, we first replace them with underscores. Finally, we create a JSON file and dump all the info we have into it. Make sure you create a data directory in the folder of your script where all the JSON files are going to be saved.
```python
import csv
import json
import time

import requests
from bs4 import BeautifulSoup


def save_product(book_name, product_info):
    json_file_name = book_name.replace(' ', '_')
    with open(f'data/{json_file_name}.json', 'w') as book_file:
        json.dump(product_info, book_file)


def scrape(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}
    save_product(book_name, product_info)


def main():
    start_time = time.time()
    print('Saving the output of extracted information')

    with open('urls.csv') as file:
        csv_reader = csv.DictReader(file)
        for csv_row in csv_reader:
            scrape(csv_row['url'])

    time_difference = time.time() - start_time
    print('Scraping time: %.2f seconds.' % time_difference)


main()
```
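As an alternative to creating the data directory by hand, a one-line sketch such as this, placed near the top of either script, creates it on demand:

```python
import os

# Make sure the output directory exists before any open('data/...') call runs.
os.makedirs('data', exist_ok=True)
```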
Now, it's time to run the script and see the data. Here, we can also see how much time the scraping took – in this case it’s 17.54 seconds.
Now that we have carefully gone through the process of making requests with both synchronous and asynchronous methods, we can run the requests once again and compare the performance of the two scripts.
The time difference is huge – while the async web scraping code was able to execute all the tasks in around 3 seconds, it took almost 16 for the synchronous one. Because the asynchronous version overlaps the time spent waiting on network responses instead of handling the URLs one at a time, scraping asynchronously is indeed more beneficial due to its noticeable time efficiency.
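If you want to reproduce the comparison on your own machine, a minimal sketch along these lines works, assuming the two versions are saved as sync_scraper.py and async_scraper.py (hypothetical file names; use whatever names you chose):

```python
import subprocess
import time

# Hypothetical file names for the two scripts above -- adjust to your own.
for script in ('sync_scraper.py', 'async_scraper.py'):
    start = time.perf_counter()
    subprocess.run(['python', script], check=True)
    print('%s finished in %.2f seconds' % (script, time.perf_counter() - start))
```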