DEV Community

Ramakrushna Mohapatra
Web Scraping Using Python BeautifulSoup

What is Web Scraping?

There are mainly two ways to extract data from a website:

  • Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.
  • Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction. This blog discusses the steps involved in web scraping using a Python web-scraping library called Beautiful Soup.

Getting Started:

We are going to use Python as our scraping language, together with a simple and powerful library called BeautifulSoup.

  • For Mac users, Python is pre-installed in OS X. Open up Terminal and type python --version. You should see your Python version (3.6 for me).
  • For Windows users, please install Python through the official website.
  • Next we need to get the BeautifulSoup library using pip, a package management tool for Python.

In the command prompt, type:

pip install BeautifulSoup4

Note: If the above command fails, try adding sudo in front of it.

Scraping Rules and Regulations:

Before we start scraping, we need to know the rules that apply to scraping a site. Read the points below before scraping random sites, because scraping data from a site is not always legal. So, please follow these points:

  • You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
  • Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
  • The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
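The "one request per second" guideline above can be enforced with a small throttle helper. This is an illustrative sketch of my own (the Throttle class is not from the original post); in a real scraper you would call wait() before each requests.get():

```python
import time

class Throttle:
    """Ensure successive requests are at least `delay` seconds apart."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = None

    def wait(self):
        # Sleep only for whatever part of the delay has not already elapsed
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

# Hypothetical usage with requests (not executed here):
#   throttle = Throttle(delay=1.0)
#   for url in urls:
#       throttle.wait()
#       page = requests.get(url)
```

This keeps the politeness logic in one place instead of scattering time.sleep() calls through the scraper.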

Steps Needed for Web Scraping:

  • Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library for Python called requests. Install requests using cmd:

    pip install requests

  • Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract data simply through string processing. We need a parser which can create a nested/tree structure of the HTML data. There are many HTML parser libraries available, but one of the most advanced is html5lib. Install html5lib using cmd:

    pip install html5lib

  • Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will be using another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files. The installation of bs4 was already done above.

The image below illustrates the way scraping works:
[Image: how scraping works]
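The three steps above (fetch, parse, traverse) can be sketched end to end. To keep the sketch runnable without a network connection, it parses an inline HTML snippet instead of a live page; in a real scraper the `html` string would come from `requests.get(url).text` (step 1):

```python
from bs4 import BeautifulSoup

# In a real scraper: html = requests.get(url).text  (step 1: fetch)
html = """
<table>
  <tr><td>India</td><td>100</td></tr>
  <tr><td>USA</td><td>200</td></tr>
</table>
"""

# Step 2: parse the HTML into a tree ('html5lib' can be passed in place
# of the built-in 'html.parser' once it is installed)
soup = BeautifulSoup(html, 'html.parser')

# Step 3: traverse the tree -- here, collect the text of every <td> cell
cells = [td.text for td in soup.find_all('td')]
print(cells)  # ['India', '100', 'USA', '200']
```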

Scraping Covid-19 Data:

We will extract data in the form of a table from the worldometers site. The code is presented below, step by step, with descriptions:

Code

# importing modules
import requests
from bs4 import BeautifulSoup
# texttable renders the results as a text table (pip install texttable)
import texttable as tt

# URL for scraping data
url = 'https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/'

# get URL html
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

data = []
# soup.find_all('td') will scrape every cell in the page's table
data_iterator = iter(soup.find_all('td'))
# data_iterator is the iterator of the table
# This loop will keep repeating till there is data available in the iterator
while True:
    try:
        country = next(data_iterator).text
        confirmed = next(data_iterator).text
        deaths = next(data_iterator).text
        continent = next(data_iterator).text

        # For 'confirmed' and 'deaths', remove the commas and convert to int
        data.append((
            country,
            int(confirmed.replace(',', '')),
            int(deaths.replace(',', '')),
            continent
        ))
    # StopIteration is raised when there are no more elements left to iterate through
    except StopIteration:
        break

# Sort the data by the number of confirmed cases
data.sort(key=lambda row: row[1], reverse=True)

# create texttable object
table = tt.Texttable()
table.add_rows([(None, None, None, None)] + data)  # empty first row for the headers
table.set_cols_align(('c', 'c', 'c', 'c'))  # 'l' denotes left, 'c' center, 'r' right
table.header((' Country ', ' Number of cases ', ' Deaths ', ' Continent '))
print(table.draw())

The output of the code will be something like this:
[Image: table output]

Conclusion

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries like html5lib, lxml, html.parser, etc., so the BeautifulSoup object and the choice of parser library can be specified at the same time.
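For example, a minimal sketch of swapping parsers (the alternative lines are commented out since they need separate installs):

```python
from bs4 import BeautifulSoup

html = '<p>Hello, <b>world</b>!</p>'

# The second argument picks the underlying parser; the BeautifulSoup
# interface stays the same either way.
soup = BeautifulSoup(html, 'html.parser')   # built-in, no extra install
# soup = BeautifulSoup(html, 'html5lib')    # most lenient; pip install html5lib
# soup = BeautifulSoup(html, 'lxml')        # fastest; pip install lxml

print(soup.find('b').text)  # world
```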
So, this was a simple example of how to create a web scraper in Python. From here, you can try to scrape any other website of your choice. In case of any queries, post them below in the comments section.

You can view my other blogs at the links below:

README Add to you Github Profile
Tic-Tac-Toe game Using Pygame

Happy Coding! Cheers.
