
Python Web Scraping - Data Extraction
Analyzing a web page means understanding its structure. Why does this matter for web scraping? This chapter answers that question in detail.
Web page Analysis
Web page analysis is important because, without it, we cannot know in which form (structured or unstructured) we will receive the data from that web page after extraction. We can do web page analysis in the following ways −
Viewing Page Source
This is a way to understand how a web page is structured by examining its source code. To do this, right-click the page and select the View page source option. The data of interest then appears in the form of HTML. The main drawback is that the raw source keeps its original whitespace and formatting, which can make it hard to read.
Inspecting Page Source by Clicking Inspect Element Option
This is another way of analyzing a web page. The difference is that it resolves the issue of formatting and whitespace in the source code of the web page. To use it, right-click and select the Inspect or Inspect element option from the menu. It provides information about a particular area or element of that web page.
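The structured view that Inspect gives can be approximated in code. The following is a minimal sketch, assuming a made-up HTML string and a hypothetical OutlineParser helper built on Python's standard html.parser module; it prints one indented line per opening tag so that the nesting is visible regardless of how the raw source was formatted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collects one indented line per opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current nesting depth.
        self.lines.append("  " * self.depth + f"<{tag}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


# Made-up, badly formatted markup standing in for a real page source.
raw_source = ('<html><body><table><tr><td>IN</td>   '
              '<td><img src="in.png"></td></tr></table></body></html>')

parser = OutlineParser()
parser.feed(raw_source)
print("\n".join(parser.lines))
```

This only outlines the tag hierarchy; a browser's Inspect panel also shows attributes, styles and the live state of the page.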
Different Ways to Extract Data from Web Page
The following methods are mostly used for extracting data from a web page −
Regular Expression
Regular expressions are a highly specialized programming language embedded in Python. We can use them through the re module of Python. They are also called REs, regexes or regex patterns. With the help of regular expressions, we can specify rules for the set of possible strings we want to match in the data.
If you want to learn more about regular expressions in general, go to the link https://www.tutorialspoint.com/automata_theory/regular_expressions.htm and if you want to know more about the re module or regular expressions in Python, you can follow the link https://www.tutorialspoint.com/python/python_reg_expressions.htm.
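As a small illustration of specifying a rule for a set of strings, the sketch below keeps only the candidates that consist of exactly six digits. The candidate values are made up for the illustration.

```python
import re

# Made-up candidate strings; only one of them is exactly six digits long.
candidates = ["110001", "11000", "1100011", "11A001"]

# Rule: the whole string must be six digits, nothing more and nothing less.
six_digits = re.compile(r"\d{6}")

valid = [c for c in candidates if six_digits.fullmatch(c)]
print(valid)  # ['110001']
```

Using fullmatch rather than search means that longer strings which merely contain six digits are rejected.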
Example
In the following example, we are going to scrape data about India from http://example.webscraping.com after matching the contents of <td> with the help of a regular expression.
import re
import urllib.request
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()
re.findall('<td>(.*?)</td>',text)
Output
The corresponding output will be as shown here −
[ '<img src="/places/static/images/flags/in.png" />', '3,287,590 square kilometres', '1,173,108,018', 'IN', 'India', 'New Delhi', '<a href="/places/default/continent/AS">AS</a>', '.in', 'INR', 'Rupee', '91', '######', '^(\\d{6})$', 'enIN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc', '<div> <a href="/places/default/iso/CN">CN </a> <a href="/places/default/iso/NP">NP </a> <a href="/places/default/iso/MM">MM </a> <a href="/places/default/iso/BT">BT </a> <a href="/places/default/iso/PK">PK </a> <a href="/places/default/iso/BD">BD </a> </div>']
Observe that the above output contains the details about India, extracted with a regular expression. Note that some of the matched cells still contain nested HTML tags such as <a> and <img>.
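Some cells in that output still carry markup. One possible second pass, sketched below with a hypothetical strip_tags helper, removes anything that looks like a tag and collapses the leftover whitespace. The cells are copied from the output above (the second one is shortened), and the tag pattern is a rough assumption that would not cope with every kind of HTML.

```python
import re

# Cells taken from the output above; the <div> cell is shortened for brevity.
cells = [
    '<a href="/places/default/continent/AS">AS</a>',
    '<div> <a href="/places/default/iso/CN">CN </a> '
    '<a href="/places/default/iso/NP">NP </a> </div>',
    'New Delhi',
]


def strip_tags(fragment):
    """Drop anything that looks like a tag, then collapse whitespace (hypothetical helper)."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(no_tags.split())


cleaned = [strip_tags(cell) for cell in cells]
print(cleaned)  # ['AS', 'CN NP', 'New Delhi']
```

For markup that is more deeply nested or irregular, an HTML parser is usually a safer choice than a second regular expression.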
Beautiful Soup
Suppose we want to collect all the hyperlinks from a web page. We can then use a parser called BeautifulSoup, which is documented in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is commonly used with requests, because it cannot fetch a web page by itself and needs an input document to create a soup object. The following Python script gathers the title of a web page; hyperlinks can be collected from the same soup object in a similar way.
Installing Beautiful Soup
Using the pip command, we can install beautifulsoup either in our virtual environment or in a global installation.
(base) D:\ProgramData>pip install bs4
Collecting bs4
   Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in d:\programdata\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
   Running setup.py bdist_wheel for bs4 ... done
   Stored in directory: C:\Users\gaurav\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Example
Note that this example builds on the requests Python module. We use r.text to create a soup object, which is then used to fetch details such as the title of the web page.
First, we need to import necessary Python modules −
import requests
from bs4 import BeautifulSoup
In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/ −
r = requests.get('https://authoraditiagarwal.com/')
Now we need to create a Soup object as follows −
soup = BeautifulSoup(r.text, 'lxml')
print(soup.title)
print(soup.title.text)
Output
The corresponding output will be as shown here −
<title>Learn and Grow with Aditi Agarwal</title>
Learn and Grow with Aditi Agarwal
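The script above prints only the title. Collecting hyperlinks from a soup object can be sketched as follows. To keep the sketch runnable without a network call, it parses a made-up HTML string in place of r.text, and it uses Python's built-in 'html.parser' instead of 'lxml'; both choices are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for r.text, so no network request is needed.
sample_html = """
<html><head><title>Sample page</title></head>
<body>
  <a href="/about">About</a>
  <a href="https://example.com/blog">Blog</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

# 'html.parser' ships with Python; 'lxml' as used above would also work here.
soup = BeautifulSoup(sample_html, "html.parser")

# href=True skips <a> tags that carry no href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(soup.title.text)  # Sample page
print(links)            # ['/about', 'https://example.com/blog']
```

Relative links such as /about would still need to be joined with the page URL, for example with urllib.parse.urljoin, before they can be requested.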
Lxml
Another Python library we are going to discuss for web scraping is lxml. It is a high-performance HTML and XML parsing library. It is comparatively fast and straightforward. You can read more about it at https://lxml.de/.
Installing lxml
Using the pip command, we can install lxml either in our virtual environment or in a global installation.
(base) D:\ProgramData>pip install lxml
Collecting lxml
   Downloading https://files.pythonhosted.org/packages/b9/55/bcc78c70e8ba30f51b5495eb0e3e949aa06e4a2de55b3de53dc9fa9653fa/lxml-4.2.5-cp36-cp36m-win_amd64.whl (3.6MB)
   100% || 3.6MB 64kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.5
Example: Data extraction using lxml and requests
In the following example, we scrape a particular element of a web page from authoraditiagarwal.com by using lxml and requests −
First, we need to import requests, and html from the lxml library, as follows −
import requests
from lxml import html
Now we need to provide the URL of the web page to scrape −
url = 'https://authoraditiagarwal.com/leadershipmanagement/'
Now we need to provide the path (XPath) to a particular element of that web page −
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())
Output
The corresponding output will be as shown here −
The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate daily progress to the stakeholders. It tracks the completion of work for a given sprint or an iteration. The horizontal axis represents the days within a Sprint. The vertical axis represents the hours remaining to complete the committed work.
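In the example above, tree[0] raises an IndexError if the XPath matches nothing, which can happen whenever the page layout changes. The following is a minimal sketch of a more defensive lookup, assuming a made-up HTML fragment (the id panel-demo is illustrative, not taken from the real page) and a hypothetical first_text helper.

```python
from lxml import html

# Made-up markup; the id and the text are illustrative only.
sample = """
<div id="panel-demo">
  <div><p>First paragraph with <b>bold</b> text.</p><p>Second paragraph.</p></div>
</div>
"""


def first_text(tree, xpath_expr):
    """Return the text of the first node matching xpath_expr, or None (hypothetical helper)."""
    nodes = tree.xpath(xpath_expr)
    return nodes[0].text_content().strip() if nodes else None


tree = html.fromstring(sample)
found = first_text(tree, '//*[@id="panel-demo"]/div/p[1]')
missing = first_text(tree, '//*[@id="no-such-id"]/p[1]')

print(found)    # First paragraph with bold text.
print(missing)  # None
```

Returning None instead of raising lets the calling code decide whether a missing element is an error or simply an empty result.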