A web page is not only a document or container of content. The rapid development in computing and web-related technologies today has transformed the web, with more security features being implemented and the web becoming a dynamic, real-time source of information. Many scraping communities gather historic data; some analyze hourly data or the latest obtained data.
At our end, we (users) use web browsers (such as Google Chrome, Mozilla Firefox, and Safari) as an application to access information from the web. Web browsers provide various document-based functionalities to users and contain application-level features that are often useful to web developers.
Web pages that users view or explore through their browsers are not just single documents. Various technologies exist that can be used to develop websites or web pages. A web page is a document that contains blocks of HTML tags. Most of the time, it is built with various sub-blocks linked as dependent or independent components from various interlinked technologies, including JavaScript and Cascading Style Sheets (CSS).
An understanding of the general concepts of web pages and the techniques of web development, along with the technologies found inside web pages, will provide more flexibility and control in the scraping process. A lot of the time, a developer can also employ reverse-engineering techniques.
Reverse engineering is an activity that involves breaking down and examining the concepts that were required to build certain products. For more information on reverse engineering, please refer to the GlobalSpec article How Does Reverse Engineering Work?, available at https://insights.globalspec.com/article/7367/how-does-reverse-engineering-work.
Here, we will introduce and explore a few of the available web technologies that can help and guide us in the process of data extraction.
HTTP
Hypertext Transfer Protocol (HTTP) is an application protocol that transfers resources (web-based), such as HTML documents, between a client and a web server. HTTP is a stateless protocol that follows the client-server model. Clients (web browsers) and web servers communicate or exchange information using HTTP requests and HTTP responses, as seen in Figure 1.2:
Figure 1.2: HTTP (client and server or request-response communication)
Requests and responses are cyclic in nature – they are like questions and answers from clients to the server, and vice versa.
Another encrypted and more secure version of the HTTP protocol is Hypertext Transfer Protocol Secure (HTTPS). It uses Secure Sockets Layer (SSL) (learn more about SSL at https://developer.mozilla.org/en-US/docs/Glossary/SSL) and Transport Layer Security (TLS) (learn more about TLS at https://developer.mozilla.org/en-US/docs/Glossary/TLS) to communicate encrypted content between a client and a server. This type of security allows clients to exchange sensitive data with a server in a safe manner. Activities such as banking, online shopping, and e-payment gateways use HTTPS to make sensitive data safe and prevent it from being exposed.
Important note
An HTTP request URL begins with http://, for example, http://www.packtpub.com, and an HTTPS request URL begins with https://, such as https://www.packtpub.com.
You have now learned a bit about HTTP. In the next section, you will learn about HTTP requests (or HTTP request methods).
HTTP requests (or HTTP request methods)
Web browsers or clients submit their requests to the server. Requests are forwarded to the server using various methods (commonly known as HTTP request methods), such as GET and POST:
- GET: This is the most common method for requesting information. It is considered a safe method, as the resource state is not altered here. It is also used to provide query strings, such as https://www.google.com/search?q=world%20cup%20football&source=hp, which requests information from Google based on the q (world cup football) and source (hp) parameters sent with the request. Information or queries (q and source in this example) with values are displayed in the URL.
- POST: Used to make a secure request to the server. The requested resource state can be altered. Data posted or sent to the requested URL is not visible in the URL but is transferred with the request body. It is used to submit information to the server in a secure way, such as for logins and user registrations (see the sketch after this list).
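To make these methods concrete, here is a minimal sketch using Python and the third-party requests package (assumed to be installed, for example, with pip install requests); httpbin.org is used here purely as a test endpoint:

import requests

# GET: query parameters (such as q and source above) stay visible in the URL
get_response = requests.get(
    "https://httpbin.org/get",
    params={"q": "world cup football", "source": "hp"},
)
print(get_response.url)  # final URL with the query string attached

# POST: data travels in the request body, not in the URL
post_response = requests.post(
    "https://httpbin.org/post",
    data={"username": "demo", "password": "not-a-real-password"},
)
print(post_response.status_code)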
We will explore more about HTTP methods in the Implementing HTTP methods section of Chapter 2.
There are two main parts to HTTP communication, as seen in Figure 1.2. With a basic idea about HTTP requests, let’s explore HTTP responses in the next section.
HTTP responses
The server processes the requests, and sometimes also the specified HTTP headers. When requests are received and processed, the server returns its response to the browser. Most of the time, responses are in HTML format, but they can also be in JavaScript, JavaScript Object Notation (JSON), or other document formats.
A response contains status codes, the meaning of which can be revealed using Developer Tools (DevTools). The following list contains a few status codes along with some brief information about what they mean:
- 200: OK, request succeeded
- 404: Not found, requested resource cannot be found
- 500: Internal server error
- 204: No content to be sent
- 401: Unauthorized request was made to the server
There are also some groups of responses that can be identified from a range of HTTP response statuses (a short status-checking sketch follows this list):
- 100–199: Informational responses
- 200–299: Successful responses
- 300–399: Redirection responses
- 400–499: Client error
- 500–599: Server error
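As an illustrative sketch (again assuming the third-party requests package), the status code of a response can be checked before any extraction work is attempted; note that requests follows redirects by default, so 3xx statuses are rarely seen directly:

import requests

response = requests.get("https://www.example.com")

# response.status_code holds the numeric HTTP status of the reply
if response.status_code == 200:
    print("OK: request succeeded")
elif 300 <= response.status_code < 400:
    print("Redirection response")
elif 400 <= response.status_code < 500:
    print("Client error, for example, 404 Not Found")
else:
    print("Server error or other status:", response.status_code)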
Important note
For more information on cookies, HTTP, HTTP responses, and status codes, please consult the official documentation at https://www.w3.org/Protocols/ and https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
Now that we have a basic idea about HTTP responses and requests, let us explore HTTP cookies (one of the most important factors in web scraping).
HTTP cookies
HTTP cookies are data sent by the server to the browser. This data is generated and stored by websites on your system or computer. It helps to identify HTTP requests from the user to the website. Cookies contain information regarding session management, user preferences, and user behavior.
The server identifies and communicates with the browser based on the information stored in the cookies. Data stored in cookies helps a website to access and transfer certain saved values, such as the session ID and expiration date and time, providing a quick interaction between the web request and response.
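A brief sketch of inspecting cookies with the requests package; the URL here is only an example, and any site that sets cookies will do:

import requests

# a Session object keeps cookies across requests, much like a browser does
session = requests.Session()
response = session.get("https://www.example.com")

# print the cookies the server has set for this session
for cookie in session.cookies:
    print(cookie.name, "=", cookie.value)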
Figure 1.3 displays the list of request cookies from https://www.fifa.com/fifaplus/en, collected using Chrome DevTools:
Figure 1.3: Request cookies
We will explore and collect more information about and from browser-based DevTools in the upcoming sections and Chapter 3.
Important note
For more information about cookies, please visit About Cookies at http://www.aboutcookies.org/ and All About Cookies at http://www.allaboutcookies.org/.
Similar to the role of cookies, HTTP proxies are also quite important in scraping. We will explore more about proxies in the next section, and also in some later chapters.
HTTP proxies
A proxy server acts as an intermediate server between a client and the main web server. The web browser sends requests to the server that are actually passed through the proxy, and the proxy returns the response from the server to the client.
Proxies are often used for monitoring/filtering, performance improvement, translation, and security for internet-related resources. Proxies can also be bought as a service, which may also be used to deal with cross-domain resources. There are also various forms of proxy implementation, such as web proxies (which can be used to bypass IP blocking), CGI proxies, and DNS proxies.
You can buy proxies from, or have a contract with, a proxy seller or a similar organization. They will provide you with various types of proxies according to the country in which you are operating. Proxy switching is done frequently during crawling, and a proxy also allows us to bypass restricted content. Normally, if a request is routed through a proxy, our IP is somewhat safe and not revealed, as the receiver will just see the third-party proxy in their details or server logs. You can even access sites that aren’t available in your location (that is, when you see an access denied in your country message) by switching to a different proxy.
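A minimal sketch of routing a request through a proxy with the requests package; the proxy address below is a placeholder, not a working server:

import requests

# hypothetical proxy address - replace it with one supplied by your proxy provider
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

response = requests.get("https://www.example.com", proxies=proxies, timeout=10)
print(response.status_code)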
Cookie-based parameters, parameters passed in HTTP GET requests, HTML form-related HTTP POST requests, and modified or adapted headers will all be crucial in managing code (that is, scripts) and accessing content during the web scraping process.
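For example, custom headers and cookies can be attached to a request as plain dictionaries (a sketch using the requests package; the header and cookie values are purely illustrative):

import requests

headers = {
    # many sites expect a browser-like User-Agent header
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
cookies = {"sessionid": "example-session-value"}  # hypothetical cookie value

response = requests.get(
    "https://www.example.com", headers=headers, cookies=cookies
)
print(response.request.headers)  # headers that were actually sent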
Important note
Details on HTTP, headers, cookies, and so on will be explored more in an upcoming section, Data-finding techniques used in web pages. Please visit the HTTP page in the MDN web docs (https://developer.mozilla.org/en-US/docs/Web/HTTP) for more detailed information on HTTP and related concepts. Please visit https://www.softwaretestinghelp.com/best-proxy-server/ for information on the best proxy servers.
You now understand general concepts regarding HTTP (including requests, responses, cookies, and proxies). Next, we will understand the technology that is used to create web content or make content available in some predefined formats.
HTML
Websites are made up of pages or documents containing text, images, style sheets, and scripts, among other things. They are often built with markup languages such as Hypertext Markup Language (HTML) and Extensible Hypertext Markup Language (XHTML).
HTML is often referred to as the standard markup language used for building a web page. Since the early 1990s, HTML has been used independently as well as in conjunction with server-based scripting languages, such as PHP, ASP, and JSP. XHTML is an advanced and extended version of HTML, which is the primary markup language for web documents. XHTML is also stricter than HTML and, from a coding perspective, is known as an application built with Extensible Markup Language (XML).
HTML defines and contains the content of a web page. Data that can be extracted, and any information-revealing data sources, can be found inside HTML pages within a predefined instruction set or markup elements called tags. HTML tags are normally named placeholders carrying certain predefined attributes, for example, <a>, <b>, <table>, <img>, and <script>.
HTML is a container or type of markup language. Various factors are involved in building HTML; the next section defines these factors with some examples.
HTML elements and attributes
HTML elements (also referred to as document nodes) are the building blocks of web documents. HTML elements are built with a start tag, <..>, and an end tag, </..>, with certain content inside them. An HTML element can also contain attributes, usually defined as attribute-name = attribute-value, which provide additional information to the element:
<p>normal paragraph tags</p>
<h1>heading tags; there are also h2, h3, h4, h5, h6</h1>
<a href="https://www.google.com">Click here for Google.com</a>
<img src="myphoto1.jpg" width="300" height="300" alt="Picture" />
<br />
The preceding code can be broken down as follows:
- <p> and <h1> are HTML elements containing general text information (element content).
- <a> is defined with an href attribute that contains the actual link that will be processed when the text Click here for Google.com is clicked. The link refers to https://www.google.com/.
- The <img> image tag also contains a few attributes, such as src and alt, along with their respective values. src holds the resource (the image address or image URL) as a value, whereas alt holds the value for alternative text (mostly displayed when there is a slow connection or the image is not able to load) for <img>.
- <br/> represents a line break in HTML and has no attributes or text content. It is used to insert a new line in the layout of the document.
HTML elements can also be nested in a tree-like structure with a parent-child hierarchy, as follows:
<div>
    <p>
        <b>Paragraph Content</b>
        <img src="mylogo.png" alt="Logo" class="logo"/>
    </p>
    <p>
        <h3>Paragraph Title: Web Scraping</h3>
    </p>
</div>
As seen in the preceding code, two <p> child elements are found inside an HTML <div> block. Both child elements carry certain attributes and various child elements as their content. Normally, HTML documents are built with the aforementioned structure.
As seen in the preceding code block, some elements also carry a few extra key-value pairs (such as class="logo"). The next section explores these.
Global attributes
HTML elements can contain some additional information, such as key-value pairs. These are also known as HTML element attributes. Attributes hold values and provide identification, or contain additional information that can be helpful in many aspects during scraping activities, such as identifying exact web elements, extracting values or text from them, and traversing (moving along) elements.
There are certain attributes that are common to HTML elements or can be applied to all HTML elements. The following list mentions some of the attributes that are identified as global attributes (https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes):
- id: This attribute’s value should be unique to the element it is applied to
- class: This attribute’s values are mostly used with CSS, providing equal state formatting options, and can be used with multiple elements
- style: This specifies inline CSS styles for an element
- lang: This helps to identify the language of the text
Important note
The id and class attributes are mostly used to identify or format individual elements or groups of them. These attributes can also be managed by CSS and other scripting languages. These attributes can be identified by placing # and ., respectively, in front of the attribute value when used with CSS selectors, or while traversing and applying parsing techniques.
HTML element attributes can also be overwritten or implemented dynamically using scripting languages. As displayed in the following example, itemprop attributes are used to add properties to an element, whereas data-* is used to store data that is native to the element itself:
<div itemscope itemtype="http://schema.org/Place">
    <h1 itemprop="university">University of Helsinki</h1>
    <span>Subject:
        <span itemprop="subject1">Artificial Intelligence</span>
    </span>
    <span itemprop="subject2">Data Science</span>
</div>
<img src="logo.png" data-course-id="324" data-title="Predictive Analysis" data-x="12345" data-y="54321" data-z="56743"/>
HTML tags and attributes are very helpful when extracting data.
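As a small sketch of that idea, Python's built-in html.parser module can walk through tags and attributes without any third-party packages (the snippet being parsed here is made up for illustration):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            print("link:", attrs["href"])
        elif tag == "img" and "src" in attrs:
            print("image:", attrs["src"])

html = '<a href="https://www.google.com">Google</a><img src="myphoto1.jpg" alt="Picture"/>'
LinkCollector().feed(html)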
Important note
Please visit https://www.w3.org or https://www.w3schools.com/html for more detailed information on HTML.
In Chapter 3, we will explore these attributes using different tools. We will also perform various logical operations and use them for extracting or scraping purposes.
We now have some idea about HTML and a few important attributes related to HTML. In the next section, we will learn the basics of XML, also known as the parent of markup languages.
XML
XML is a markup language used for distributing data over the internet, with a set of rules for encoding documents that are readable and easily exchangeable between machines and documents. XML files are recognized by the .xml extension.
XML emphasizes the usability of textual data across various formats and systems. XML is designed to carry portable data or data stored in tags that is not predefined with HTML tags. In XML documents, tags are created by the document developer or an automated program to describe the content.
The following code displays some example XML content:
<employees>
    <employee>
        <fullName>Shiba Chapagain</fullName>
        <gender>Female</gender>
    </employee>
    <employee>
        <fullName>Aasira Chapagain</fullName>
        <gender>Female</gender>
    </employee>
</employees>
In the preceding code, the <employees> parent node has two <employee> child nodes, which in turn contain the other child nodes of <fullName> and <gender>.
XML is an open standard, using the Unicode character set. XML is used to share data across various platforms and has been adopted by various web applications. Many websites use XML data, implementing its contents with the use of scripting languages and presenting it in HTML or other document formats for the end user to view.
Extraction tasks can also be performed on XML documents to obtain the content in a desired format, or filtered to meet a specific need for data. In addition, behind-the-scenes data may only be obtainable from certain websites in this form.
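As a minimal sketch, the employees document above can be parsed with Python's built-in xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

xml_content = """
<employees>
    <employee>
        <fullName>Shiba Chapagain</fullName>
        <gender>Female</gender>
    </employee>
    <employee>
        <fullName>Aasira Chapagain</fullName>
        <gender>Female</gender>
    </employee>
</employees>
"""

root = ET.fromstring(xml_content)          # root is the <employees> element
for employee in root.findall("employee"):  # iterate over each <employee> child
    name = employee.find("fullName").text
    gender = employee.find("gender").text
    print(name, "-", gender)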
Important note
Please visit https://www.w3.org/XML/ and https://www.w3schools.com/xml/ for more information on XML.
So far, we have explored markup language-based technologies, such as HTML and XML, that place and hold content. These technologies are somewhat static in nature. The next section is about JavaScript, which provides dynamism to the web with the help of scripts.
JavaScript
JavaScript (also known as JS or JScript) is a programming language used to program HTML and web applications that run in the browser. JavaScript is mostly preferred for adding dynamic features and providing user-based interaction inside web pages. JavaScript, HTML, and CSS are among the most-used web technologies, and now they are also used with headless browsers (you can read more about headless browsers at https://oxylabs.io/blog/what-is-headless-browser). The client-side availability of the JavaScript engine has also strengthened its usage in application testing and debugging.
<script> contains programming logic with JavaScript variables, operators, functions, arrays, loops, conditions, and events, targeting the HTML Document Object Model (DOM). JavaScript code can be added to HTML using <script>, as seen in the following code, or included from an external file:
<!DOCTYPE html>
<html>
<head>
    <script>
        function placeTitle() {
            document.getElementById("innerDiv").innerHTML = "Welcome to Web Scraping";
        }
    </script>
</head>
<body>
    <div>Press the button: <p id="innerDiv"></p></div>
    <button name="btnTitle" type="submit" onclick="placeTitle()">
        Load Page Title!
    </button>
</body>
</html>
As seen in the preceding code, the HTML <head> tag contains <script> with the placeTitle() JavaScript function. The function fires as soon as <button> is clicked and changes the content of the empty <p> element with id="innerDiv" to display the text Welcome to Web Scraping.
Important note
The HTML DOM is a standard for how to get, change, add, or delete HTML elements. Please visit the page on JavaScript HTML DOM on W3Schools (https://www.w3schools.com/js/js_htmldom.asp) for more detailed information.
The dynamic manipulation of HTML content, elements, attribute values, CSS, and HTML events with accessible internal functions and programming features makes JavaScript very popular in web development. There are many web-based technologies related to JavaScript, including JSON, JavaScript Query (jQuery), AngularJS, and Asynchronous JavaScript and XML (AJAX), among many more. Some of these will be discussed in the following subsections.
jQuery
jQuery, or more specifically JavaScript-based DOM-related query, is a JavaScript library that addresses incompatibilities across browsers, providing API features to handle the HTML DOM, events, and animations. jQuery has been acclaimed globally for providing interactivity to the web and for the way JavaScript is used to code. jQuery is lightweight in comparison to other JavaScript frameworks, is easy to implement, and takes a short and readable coding approach.
jQuery is a huge topic and requires adequate knowledge of JavaScript before embarking on it. We will use a jQuery-like Python-based library in Chapter 4.
Important note
For more information on jQuery, please visit https://www.w3schools.com/jquery/ and http://jquery.com/.
jQuery is mostly used for DOM-based activities, as discussed in this section, whereas AJAX is a collection of technologies, which we are going to learn about in the next section.
AJAX
AJAX is a web development technique that uses a group of web technologies on the client side to create asynchronous web applications.
JavaScript XMLHttpRequest (XHR) objects are used to execute AJAX on web pages and load page content without refreshing or reloading the page. Please visit the AJAX page on W3Schools (https://www.w3schools.com/js/js_ajax_intro.asp) for more information on AJAX. From a scraping point of view, a basic overview of JavaScript functionality will be valuable to understand how a page is built or manipulated, as well as to identify the dynamic components used.
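When a page loads its data through XHR calls, the underlying endpoint can sometimes be requested directly instead of rendering JavaScript. The sketch below uses the requests package; the URL is purely hypothetical, standing in for an endpoint discovered in the browser DevTools Network panel:

import requests

# hypothetical JSON endpoint spotted in the Network panel while the page loads
xhr_url = "https://www.example.com/api/items?page=1"

response = requests.get(xhr_url, headers={"X-Requested-With": "XMLHttpRequest"})
if response.ok:
    data = response.json()  # many XHR endpoints return JSON rather than HTML
    print(type(data))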
Important note
Please visit https://developer.mozilla.org/en-US/docs/Web/JavaScript, https://www.javascript.com/, https://www.w3schools.com/js/js_intro.asp, and https://www.w3schools.com/js/js_ajax_intro.asp for more information on JavaScript and AJAX.
We have learned about a few JavaScript-based techniques and technologies that are commonly deployed in web development today. In the next section, we will learn about data-storing objects.
JSON
JSON is a format used for storing and transporting data from a server to a web page. It is language-independent and preferred in web-based data interchange actions due to its size and readability. JSON files are files that have the .json extension.
JSON data is normally formatted as name:value pairs, which are evaluated as a JavaScript object and follow JavaScript operations. JSON and XML are often compared, as they both carry and exchange data between various web resources. JSON is usually ranked higher than XML for its structure, which is simple, readable, self-descriptive, understandable, and easy to process.
For web applications using JavaScript, AJAX, or RESTful services, JSON is preferred over XML due to its fast and easy operation. JSON and JavaScript objects are interchangeable. JSON is not a markup language, and it doesn’t contain any tags or attributes. Instead, it is a text-only format that can be accessed through a server, as well as managed by any programming language.
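Because JSON maps directly onto Python dictionaries and lists, the standard library json module is enough to move between the two; a tiny sketch follows (the sample string is made up for illustration):

import json

raw = '{"firstName": "Aasira", "cityName": "Kathmandu", "subjects": ["AI", "Data Science"]}'

record = json.loads(raw)             # JSON text -> Python dict
print(record["firstName"])           # Aasira
print(record["subjects"][0])         # AI

text = json.dumps(record, indent=2)  # Python dict -> JSON text
print(text)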
JSON objects can also be expressed as arrays, dictionaries, and lists:
{"mymembers":[{ "firstName":"Aasira", "lastName":"Chapagain","cityName":"Kathmandu"},{ "firstName":"Rakshya", "lastName":"Dhungel","cityName":"New Delhi"},{ "firstName":"Shiba", "lastName":"Paudel","cityName":"Biratnagar"},]}You have learned about JSON, which is a content holder. In the following section, we will discuss HTML styling using CSS and providing HTML tags withextra identification.
Important note
JSON is also known for the mixture of dictionary and list objects it maps to in Python. JSON is written as a string, and we can find plenty of websites that convert JSON strings into JSON objects, for example, https://jsonformatter.org/, https://jsonlint.com/, and https://www.freeformatter.com/json-formatter.html.
Please visit http://www.json.org/, https://jsonlines.org/, and https://www.w3schools.com/js/js_json_intro.asp for more information regarding JSON and JSON Lines.
CSS
The web-based technologies we have introduced so far deal with content, including binding, development, and processing. CSS describes the display properties of HTML elements and the appearance of web pages. CSS is used for styling and providing the desired appearance and presentation of HTML elements.
By using CSS, developers/designers can control the layout and presentation of a web document. CSS can be applied to a distinct element in a page, or it can be included through a separate document. Styling details can be described using the <style> tag.
The <style> tag can contain details targeting repeated and various elements in a block. As seen in the following code, multiple <a> elements exist, some of which also possess the class and id global attributes:
<html>
<head>
<style>
    a { color: blue; }
    h1 { color: black; text-decoration: underline; }
    #idOne { color: red; }
    .classOne { color: orange; }
</style>
</head>
<body>
    <h1> Welcome to Web Scraping </h1>
    Links:
    <a href="https://www.google.com"> Google </a>
    <a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
    <a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
</html>
Elements that are provided with CSS properties or styled inside the <style> tags in the preceding code block will result in the output shown in Figure 1.4:
Figure 1.4: Output of the HTML code using CSS
Although CSS is used to manage the appearance of HTML elements, CSS selectors (patterns used to select elements or the position of elements) often play a major role in the scraping process. We will be exploring CSS selectors in detail in Chapter 3.
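As a quick preview (a minimal sketch, assuming the third-party beautifulsoup4 package is installed), CSS selectors can pick out elements from the HTML shown above by tag, class, or id:

from bs4 import BeautifulSoup

html = """
<body>
<a href="https://www.google.com"> Google </a>
<a class='classOne' href="https://www.yahoo.com"> Yahoo </a>
<a id='idOne' href="https://www.wikipedia.org"> Wikipedia </a>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.select("a"))           # all <a> elements
print(soup.select("a.classOne"))  # <a> elements with class="classOne"
print(soup.select("a#idOne"))     # the <a> element with id="idOne"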
Important note
Please visit https://www.w3.org/Style/CSS/ and https://www.w3schools.com/css/ for more detailed information on CSS.
In this section, you were introduced to some of the technologies that can be used for web scraping. In the upcoming section, you will learn about data-finding techniques. Most of them are built with web technologies you have already been introduced to.