- [Introduction](#introduction)
- [Be polite](#be-polite)
- [Let's get to it](#lets-get-to-it)
- [Inspecting the site](#inspecting-the-site)
- [Requesting and parsing the data](#requesting-and-parsing-the-data)
- [Extracting the data](#extracting-the-data)
- [Conclusion](#conclusion)

## Introduction

If you’re here, that means you are interested in finding out more about how to scrape and enjoy all the data that you gather. However, before we dive into it, we first need to understand what web scraping is. In general terms, scraping is the process of acquiring a web page with all of its information and then extracting selected fields for further processing. Usually, the purpose of gathering that information is so that a person can easily monitor it. Some examples could be reviews, prices, weather reports, billboard hits, and so on.

## Be polite

Just as you are polite and caring in the real world, you should be so online as well. Before you start scraping, make sure that the website you’re targeting allows it. You can do that by checking its **robots.txt** file. If the site doesn’t condone crawling or scraping of its content, be kind and respect the owner’s wishes. Failing to do so might get your IP blocked or even lead to legal action taken against you, so be wary. Moreover, check if the site you’re targeting has an API. If it does, just use that – it will be easier to get the needed data, and you won’t put unnecessary load on the site’s infrastructure.

## Let’s get to it

As mentioned, we will be using these libraries:

- Requests
- BeautifulSoup 4

The page we’re going to scrape is http://books.toscrape.com/. It doesn’t have a robots.txt file, but I think we can agree that the name of the site is asking you to scrape it. But before we carry on with the coding part, let's inspect the website first.

### Inspecting the site

So, this is what the main page of the website looks like. We can see it contains books, their titles, prices, ratings, availability information, and a list of genres in the sidebar.

<p align="center">
  <img src="https://i.imgur.com/ovjMkS6.png" alt="books.toscrape.com Main window" width="600" height="500">
</p>

When we select a specific book, we are greeted with even more information, such as its description, how many are in stock, the number of reviews, etc.

Great! Now we can think about what information we’d like to extract from this site. Generally, when scraping, we want to get valuable information which we could use later on. In this example, the most important points would be the price and the title of the book, so we could, for example, make a comparison with books on another website. We can also extract the direct link to a book, so it would be easier to reach later on. Finally, it would be great to know if the book is even available. As a finishing touch, we can scrape its description as well – perhaps it might catch your eye and you’ll read it.

So, now that we know exactly what we want to get from the site, we can go on and inspect those elements to see where they can be found later. Just a note: you don’t need to memorise everything now; when scraping, you’ll have to go back to the HTML code numerous times. Let’s have a look at the site’s code and inspect the elements we need. To do so, just right-click anywhere on the site with your mouse and select “Inspect”.

Once you do that, a gazillion things will open – but don’t worry, we don't need to go through all of them. After a quick inspection, we can see that all the data we need is located in the article element with the class name **product_pod**. The same goes for all the other books as well.

<p align="center">
  <img src="https://i.imgur.com/QbdDzyW.png" alt="books.toscrape.com Inspecting the HTML" width="1000" height="500">
</p>

This means that all the data we need will be nested in that article element. Now, let’s inspect the price. We can see that the price value is the text of the **price_color** paragraph. And if we inspect In stock, we can see that it is the text value of the **instock availability** paragraph. Now go on and get familiar with the rest of the elements we’ll be scraping. Once you're done, let’s get coding and turn our data extraction wishes into reality.

### Requesting and parsing the data

First off, let’s import the libraries we’ll be using. We’ll need the **Requests** library to send HTTP requests and **BeautifulSoup** to parse the responses we receive from the website. Go ahead and import them.

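In code, those imports are simply:

```python
import requests
from bs4 import BeautifulSoup
```
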
Then, we’ll need to write a GET request to retrieve the contents of the site. Let’s assign the response to the variable **r**.

<p align="center">
  <img src="https://i.imgur.com/7kAQtEY.png" alt="Writing the request" width="600" height="50">
</p>

The **requests.get** function has only one required argument, which is the URL of the site you are targeting. However, because we wish to use a proxy to reach the content, we need to pass in an additional **proxies** parameter. As you can see, both values are already assigned to variables, so let’s have a look at them.

For the proxy, we first need to specify its kind, in this case, HTTP. Then, we have to enter the Smartproxy user’s username and password, separated by a colon, as well as the endpoint which we’ll be using to connect to the proxy server. And, well, the **url** is the address of the site we wish to scrape.

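Here’s a minimal sketch of that setup. The username, password, endpoint, and port below are placeholders, not real values – swap in your own Smartproxy credentials and gateway:

```python
# Placeholder credentials and endpoint - replace with your own.
proxies = {
    'http': 'http://username:password@endpoint:port'
}
url = 'http://books.toscrape.com/'

# Send the GET request through the proxy.
r = requests.get(url, proxies=proxies)
```
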
At the moment, the variable **r** holds the full response data from the website, including the status code, headers, the URL itself, and, most importantly, the content we need. You can print it out with **print(r.content)**, and you’ll see that it’s the HTML code of the site you inspected previously. However, this time it’s on your device! (Except that it’s awkwardly formatted and unreadable, but we’ll fix that.)

To start working with the HTML code, we first need to parse it with BeautifulSoup – that is, make a parse tree which we can use to extract the necessary information. Let’s create a variable called **html**. We’ll use it to store the parsed **r.content**. To parse the HTML code, we just need to call the BeautifulSoup class and pass in the content and ‘html.parser’ (‘cause, you know, we are parsing HTML content here) as arguments. Try printing it out!

<p align="center">
  <img src="https://i.imgur.com/fzV4P8D.png" alt="Parsing the HTML" width="600" height="50">
</p>

As you may have noticed in the image above, I used **prettify()**. It’s a method that comes with BeautifulSoup, and it makes the HTML even more readable by adding indentation and things like that.

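As a plain-code version of that step:

```python
# Build a parse tree from the raw HTML content.
html = BeautifulSoup(r.content, 'html.parser')

# prettify() adds indentation so the structure is easier to read.
print(html.prettify())
```
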
### Extracting the data

As we found out earlier, all the data we need can be found in the **product_pod** articles. So, to make our lives easier, we can collect and work only with them. This way, we won’t need to parse all of the site’s HTML each time we want to get any data about a book. To do so, we can use one of BeautifulSoup’s methods called **find_all()**; it will find all instances of the specified content.

So, in our case, we need to find and assign all **article** elements with the **product_pod** class to a variable. Let’s call it **all_books**. Now we need to parse through the **html** variable, which we created earlier and which holds the entire HTML of the site. We’ll use the **find_all()** method to do so. As arguments for the **find_all()** method, we need to pass in two attributes: ‘article’, which is the tag of the content, and the class **product_pod**. Please note that because class is a Python keyword, it can’t be used as an argument name, so you need to add a trailing underscore. Here’s how it should look:

<p align="center">
  <img src="https://i.imgur.com/io8Kr21.png" alt="Assigning book data to a variable" width="600" height="50">
</p>

Now, if you print out **all_books**, you’ll see that it contains a list of all the **product_pod** articles found on the page.

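Written out, that step might look like this:

```python
# Collect every <article class="product_pod"> element on the page.
all_books = html.find_all('article', class_='product_pod')
print(all_books)
```
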
We’ve narrowed down the HTML to just the part we need. Now we can start gathering data about the books. Because **all_books** is a list containing all the necessary information about each book on the page, we’ll need to cycle through it using a **for** loop. Like this:

```python
for book in all_books:
    ...  # we'll fill in the loop body step by step below
```

**book** is just a variable we created, which we’ll call to get specific information in each loop. You can name it however you wish, but in our case, **book** is exactly what we’re working with on each iteration through the **all_books** list. Remember that we want to find the title, price, availability, description, and the link to each book. Let’s get started!

When inspecting the site, we can see that the title is located in the h3 element, which is the only one in the **product_pod** article we’re working with.

<p align="center">
  <img src="https://i.imgur.com/odjbJLJ.png" alt="Inspecting the title" width="600" height="500">
</p>

BeautifulSoup allows you to find a specific element very easily, just by specifying the HTML tags. To find the title, all we need to write is this:

<p align="center">
  <img src="https://i.imgur.com/OeRHGkb.png" alt="Assigning the title" width="600" height="50">
</p>

Once again, **book** is just the current iteration’s **product_pod** article, so we can just add **.h3** and **.a** to specify which component’s data we want to assign to the title. We could also add **.text** to get the text value of **book.h3.a** – which is indeed the title – but, if you noticed, longer titles are cut short and have “...” at the end for styling purposes. That’s not really what we need. Instead, we need to get the value of the title attribute, which can be done by just adding **‘title’** in square brackets.

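Inside the loop, that line would look like this:

```python
# The 'title' attribute holds the full, untruncated title.
title = book.h3.a['title']
```
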
If we run **print(title)**, you’ll see that we have successfully extracted the titles of all the books on the page.

Some objects are not as easy to extract. They may be located in paragraphs nested in other paragraphs, further nested in other div containers. In such cases, it’s easier to use the **find()** method. It’s very similar to the **find_all()** method; however, it only returns the first element found – in our case, that’s exactly what we need. To find the price, we want to find the paragraph with the **price_color** class and extract its text.

<p align="center">
  <img src="https://i.imgur.com/QSQ9RyR.png" alt="Assigning the price" width="600" height="50">
</p>

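A sketch of that line:

```python
# find() returns the first match - here, the price paragraph.
price = book.find('p', class_='price_color').text
```
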
To find out if the book is in stock, we need to do the same thing we did with the price – simply specify a different paragraph, the one with the **instock availability** class. If you were to print out the availability just like that, you’d see a lot of blank lines. It’s just the way the site’s HTML is styled. To combat that, we can use a simple Python method called **strip()**, which removes the leading and trailing whitespace from a string. If you’ve done everything correctly, it should look like this:

<p align="center">
  <img src="https://i.imgur.com/8cKQuyN.png" alt="Assigning the availability" width="600" height="50">
</p>

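As code, under the same assumptions:

```python
# strip() trims the surrounding whitespace from the raw text.
availability = book.find('p', class_='instock availability').text.strip()
```
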
Furthermore, we need to get the description of the book. The problem is that it’s located on another page, dedicated to that specific book. First, we need to get the link to said book and make another HTTP request to retrieve the description. While inspecting, you’ll see that the link occupies the same place as the title. You can create a new variable, copy the command you used for the title, and just change the value in the square brackets to **‘href’**, as that’s what we’re looking for there.

<p align="center">
  <img src="https://i.imgur.com/iD0YDZO.png" alt="Assigning the link to a book" width="600" height="50">
</p>

But, if you print out **link_to_book**, you’ll see that it contains only a part of the link – the location of the book on the site, but not the domain. One easy way to solve this is to assign the website’s domain to a variable and then add **link_to_book** to it, like this:

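A sketch of those two lines – the **link** variable name is just an assumption, name it however you like:

```python
link_to_book = book.h3.a['href']  # a relative path only
# Prepend the domain to get a usable, absolute URL.
link = 'http://books.toscrape.com/' + link_to_book
```
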
Boom! Now you have the complete link, which we can use to extract the book’s description.

To get the description, we need to make another request inside the **for** loop, so that we get one for each of the books. Basically, we need to do the same thing we did in the beginning: send a GET request to the link and parse the HTML response with BeautifulSoup.

When inspecting the HTML of a book’s page, we can see that the description is just plain text stored in a paragraph. However, this paragraph is not the first one in the **product_page** article and does not have a class. If we just try to use **find()** without any additional parameters, it will return the price, because that’s the value located in the very first paragraph.

<p align="center">
  <img src="https://i.imgur.com/b0SdKSD.png" alt="Inspecting the product description" width="600" height="500">
</p>

In such a case, when using the **find()** method, we need to state that the paragraph we’re looking for has no class (no sass intended). We can do so by specifying that **class_** equals **None**.

<p align="center">
  <img src="https://i.imgur.com/WWFltGp.png" alt="Assigning the description" width="600" height="50">
</p>

And, of course, because we just want the text value, we add **.text** at the very end.

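Put together – still inside the **for** loop – this step might look like the sketch below. The names **r2** and **html2** are just illustrative, and it assumes the description is the first classless paragraph on the book’s page:

```python
# Fetch the individual book's page through the same proxy.
r2 = requests.get(link, proxies=proxies)
html2 = BeautifulSoup(r2.content, 'html.parser')

# The description is the first <p> without a class attribute.
description = html2.find('p', class_=None).text
```
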
That’s it! We’ve gathered all the information we needed. We can now print it all out and check what we’ve got. Just a quick note: because the description might be quite long, you can trim it by adding **[:x]**, where x is the number of characters you want to print. Some Python tricks for you!

<p align="center">
  <img src="https://i.imgur.com/rwnjf5X.png" alt="Printing the variables" width="600" height="200">
</p>

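For example, a final print block inside the loop could be:

```python
print(title)
print(price)
print(availability)
print(link)
print(description[:150])  # trim the description to 150 characters
```
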
### Conclusion

To conclude, I would just like to note that there really are a thousand ways to get the data you need, using different functions, loops, and so on. But we sure hope that by the end of this article you have a better idea of what, when, and how to scrape – and that you’ll do it with **proxies**!