This vignette introduces you to the basics of web scraping with rvest. You’ll first learn the basics of HTML and how to use CSS selectors to refer to specific elements, then you’ll learn how to use rvest functions to get data out of HTML and into R.

HTML basics

HTML stands for “HyperText Markup Language” and looks like this:

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>

HTML has a hierarchical structure formed by elements, which consist of a start tag (e.g. <tag>), optional attributes (id='first'), an end tag (like </tag>), and contents (everything in between the start and end tag).

Since < and > are used for start and end tags, you can’t write them directly. Instead you have to use the HTML escapes &gt; (greater than) and &lt; (less than). And since those escapes use &, if you want a literal ampersand you have to escape it as &amp;. There is a wide range of possible HTML escapes, but you don’t need to worry about them too much because rvest automatically handles them for you.
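As a small sketch of that automatic handling (the HTML fragment is invented for illustration, and it uses minimal_html(), html_element(), and html_text2(), which are introduced later in this vignette), the escapes appear only in the source string and come back as literal characters:

```r
library(rvest)

# The source HTML uses escapes for <, >, and &
html <- minimal_html("<p>1 &lt; 2 &amp; 3 &gt; 2</p>")

# The extracted text contains the literal characters
html |> html_element("p") |> html_text2()
#> [1] "1 < 2 & 3 > 2"
```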

Elements

All up, there are over 100 HTML elements. Some of the most important are:

Every HTML page must be in an <html> element, and it must have two children: <head>, which contains document metadata like the page title, and <body>, which contains the content you see in the browser.
Block tags like <h1> (heading 1), <p> (paragraph), and <ol> (ordered list) form the overall structure of the page.
Inline tags like <b> (bold), <i> (italics), and <a> (links) format text inside block tags.

If you encounter a tag that you’ve never seen before, you can find out what it does with a little googling. I recommend the MDN Web Docs, which are produced by Mozilla, the company that makes the Firefox web browser.

Most elements can have content in between their start and end tags. This content can either be text or more elements. For example, the following HTML contains a paragraph of text, with one word in bold.

<p>
  Hi! My <b>name</b> is Hadley.
</p>

The children of a node are only its child elements, so the <p> element above has one child, the <b> element. The <b> element has no children, but it does have contents (the text “name”).

Some elements, like <img>, can’t have children. These elements depend solely on attributes for their behavior.
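The distinction between children and contents can be checked directly. This is a minimal sketch using rvest’s html_children() and html_name() on the paragraph described above; the variable names p and b are just illustrative:

```r
library(rvest)

html <- minimal_html("<p>Hi! My <b>name</b> is Hadley.</p>")
p <- html |> html_element("p")

# <p> has exactly one child element: <b>
p |> html_children() |> html_name()
#> [1] "b"

# <b> has no child elements, only text contents
b <- p |> html_element("b")
length(html_children(b))
#> [1] 0
html_text2(b)
#> [1] "name"
```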

Attributes

Tags can have named attributes which look like name1='value1' name2='value2'. Two of the most important attributes are id and class, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page. These are often useful when scraping data off a page.
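To see which attributes an element carries, one option is rvest’s html_attrs(), which returns all of them at once. The heading below is a made-up example, not taken from a real page:

```r
library(rvest)

html <- minimal_html("<h1 id='first' class='title'>A heading</h1>")

# html_attrs() returns every attribute of the element as a named character vector
attrs <- html |> html_element("h1") |> html_attrs()
names(attrs)
#> [1] "id"    "class"
attrs[["class"]]
#> [1] "title"
```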

Reading HTML with rvest

You’ll usually start the scraping process with read_html(). This returns an xml_document object which you’ll then manipulate using rvest functions:

library(rvest)
html <- read_html("http://rvest.tidyverse.org/")
class(html)
#> [1] "xml_document" "xml_node"

For examples and experimentation, rvest also includes a function that lets you create an xml_document from literal HTML:

html <- minimal_html("
  <p>This is a paragraph<p>
  <ul>
    <li>This is a bulleted list</li>
  </ul>
")
html
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n<p>This is a paragraph</p>\n<p>\n  </p>\n<ul>\n<li>This is a bull ...

Regardless of how you get the HTML, you’ll need some way to identify the elements that contain the data you care about. rvest provides two options: CSS selectors and XPath expressions. Here I’ll focus on CSS selectors because they’re simpler but still sufficiently powerful for most scraping tasks.

CSS selectors

CSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.

CSS selectors can be quite complex, but fortunately you only need the simplest for rvest, because you can also write R code for more complicated situations. The four most important selectors are:

p: selects all <p> elements.
.title: selects all elements with class “title”.
p.special: selects all <p> elements with class “special”.
#title: selects the element with the id attribute that equals “title”. Id attributes must be unique within a document, so this will only ever select a single element.

If you want to learn more CSS selectors I recommend starting with the fun CSS dinner tutorial and then referring to the MDN web docs.

Let’s try out the most important selectors with a simple example:

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

In rvest you can extract a single element with html_element() or all matching elements with html_elements(). Both functions take a document and a CSS selector:

html |> html_element("h1")
#> {html_node}
#> <h1>
html |> html_elements("p")
#> {xml_nodeset (2)}
#> [1] <p id="first">This is a paragraph</p>
#> [2] <p class="important">This is an important paragraph</p>
html |> html_elements(".important")
#> {xml_nodeset (1)}
#> [1] <p class="important">This is an important paragraph</p>
html |> html_elements("#first")
#> {xml_nodeset (1)}
#> [1] <p id="first">This is a paragraph</p>
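The combined tag-plus-class form (like p.special in the list of selectors) only differs from a bare class selector when several kinds of element share the class. A small sketch, using an invented page where a <p> and a <div> both have class “important”:

```r
library(rvest)

html <- minimal_html("
  <p class='important'>An important paragraph</p>
  <div class='important'>An important div</div>
  <p>An ordinary paragraph</p>
")

# .important matches any element with that class
html |> html_elements(".important") |> html_name()
#> [1] "p"   "div"

# p.important additionally requires the element to be a <p>
html |> html_elements("p.important") |> html_text2()
#> [1] "An important paragraph"
```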

Selectors can also be combined in various ways using combinators. The most important combinator is “ ” (a single space), the descendant combinator: p a selects all <a> elements that are nested anywhere inside a <p> element, not just its direct children.
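Here is a rough illustration of the descendant combinator on an invented fragment. The child combinator (>) is standard CSS but is not otherwise covered in this vignette; it appears only as a contrast:

```r
library(rvest)

html <- minimal_html("
  <p>See <a href='/a'>A</a> and <b><a href='/b'>B</a></b>.</p>
  <div><a href='/c'>C</a></div>
")

# Descendant combinator: any <a> nested inside a <p>, at any depth
html |> html_elements("p a") |> html_attr("href")
#> [1] "/a" "/b"

# Child combinator: only <a> elements that are direct children of a <p>
html |> html_elements("p > a") |> html_attr("href")
#> [1] "/a"
```

The link inside the <div> is never matched, because it has no <p> ancestor.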

If you don’t know exactly what selector you need, I highly recommend using SelectorGadget, which lets you automatically generate the selector you need by supplying positive and negative examples in the browser.

Extracting data

Now that you’ve got the elements you care about, you’ll need to get data out of them. You’ll usually get the data from either the text contents or an attribute. But sometimes (if you’re lucky!), the data you need will be in an HTML table.

Text

Use html_text2() to extract the plain text contents of an HTML element:

html <- minimal_html("
  <ol>
    <li>apple &amp; pear</li>
    <li>banana</li>
    <li>pineapple</li>
  </ol>
")
html |> html_elements("li") |> html_text2()
#> [1] "apple & pear" "banana"       "pineapple"

Note that the escaped ampersand is automatically converted to &; you’ll only ever see HTML escapes in the source HTML, not in the data returned by rvest.

You might wonder why I used html_text2(), since it seems to give the same result as html_text():

html |> html_elements("li") |> html_text()
#> [1] "apple & pear" "banana"       "pineapple"

The main difference is how the two functions handle white space. In HTML, white space is largely ignored, and it’s the structure of the elements that defines how text is laid out. html_text2() does its best to follow the same rules, giving you something similar to what you’d see in the browser. Take this example, which contains a bunch of white space that HTML ignores.

html <- minimal_html("<body>
  <p>
  This is
  a
  paragraph.</p><p>This is another paragraph.

  It has two sentences.</p>
")

html_text2() gives you what you expect: two paragraphs of text separated by a blank line.

html |> html_element("body") |> html_text2() |> cat()
#> This is a paragraph.
#>
#> This is another paragraph. It has two sentences.

Whereas html_text() returns the garbled raw underlying text:

html |> html_element("body") |> html_text() |> cat()
#>
#>
#>   This is
#>   a
#>   paragraph.This is another paragraph.
#>
#>   It has two sentences.
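Line breaks are another place where the two functions differ. In this small sketch (the fragment is invented for illustration), html_text2() turns a <br> into a newline, while html_text() simply joins the surrounding text:

```r
library(rvest)

html <- minimal_html("<p>First line<br>Second line</p>")

# html_text2() converts <br> to a newline, as a browser would render it
html |> html_element("p") |> html_text2()
#> [1] "First line\nSecond line"

# html_text() just concatenates the raw text
html |> html_element("p") |> html_text()
#> [1] "First lineSecond line"
```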

Attributes

Attributes are used to record the destination of links (the href attribute of <a> elements) and the source of images (the src attribute of the <img> element):

html <- minimal_html("
  <p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>
  <img src='https://cataas.com/cat' width='100' height='200'>
")

The value of an attribute can be retrieved with html_attr():

html |> html_elements("a") |> html_attr("href")
#> [1] "https://en.wikipedia.org/wiki/Cat"
html |> html_elements("img") |> html_attr("src")
#> [1] "https://cataas.com/cat"

Note that html_attr() always returns a string, so you may need to post-process with as.integer()/readr::parse_integer() or similar.

html |> html_elements("img") |> html_attr("width")
#> [1] "100"
html |> html_elements("img") |> html_attr("width") |> as.integer()
#> [1] 100

Tables

HTML tables are composed of four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data). Here’s a simple HTML table with two columns and three rows:

html <- minimal_html("
  <table>
    <tr>
      <th>x</th>
      <th>y</th>
    </tr>
    <tr>
      <td>1.5</td>
      <td>2.7</td>
    </tr>
    <tr>
      <td>4.9</td>
      <td>1.3</td>
    </tr>
    <tr>
      <td>7.2</td>
      <td>8.1</td>
    </tr>
  </table>
  ")

Because tables are a common way to store data, rvest includes the handy html_table(), which converts a table into a data frame:

html |> html_element("table") |> html_table()
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1   1.5   2.7
#> 2   4.9   1.3
#> 3   7.2   8.1
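Real pages often contain more than one table. As a sketch under that assumption (both tables below are invented), selecting with html_elements() and then calling html_table() gives back a list with one data frame per table:

```r
library(rvest)

html <- minimal_html("
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1.5</td><td>2.7</td></tr>
  </table>
  <table>
    <tr><th>name</th><th>n</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
")

# html_elements() selects every table; html_table() then returns a list
tables <- html |> html_elements("table") |> html_table()
length(tables)
#> [1] 2

# Each list entry is a data frame, one per table
dim(tables[[2]])
#> [1] 2 2
names(tables[[2]])
#> [1] "name" "n"
```

You can then pick out the table you need by position, for example tables[[2]].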

Element vs elements

When using rvest, your eventual goal is usually to build up a data frame, and you want each row to correspond to some repeated unit on the HTML page. In this case, you should generally start by using html_elements() to select the elements that contain each observation, then use html_element() to extract the variables from each observation. This guarantees that you’ll get the same number of values for each variable because html_element() always returns the same number of outputs as inputs.

To illustrate this problem, take a look at this simple example I constructed using a few entries from dplyr::starwars:

html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
  ")

If you try to extract name, species, and weight directly, you end up with one vector of length four and two vectors of length three, and no way to align them:

html |> html_elements("b") |> html_text2()
#> [1] "C-3PO"  "R2-D2"  "Yoda"   "R4-P17"
html |> html_elements("i") |> html_text2()
#> [1] "droid" "droid" "droid"
html |> html_elements(".weight") |> html_text2()
#> [1] "167 kg" "96 kg"  "66 kg"

Instead, use html_elements() to find an element that corresponds to each character, then use html_element() to extract each variable for all observations:

characters <- html |> html_elements("li")
characters |> html_element("b") |> html_text2()
#> [1] "C-3PO"  "R2-D2"  "Yoda"   "R4-P17"
characters |> html_element("i") |> html_text2()
#> [1] "droid" "droid" NA      "droid"
characters |> html_element(".weight") |> html_text2()
#> [1] "167 kg" "96 kg"  "66 kg"  NA

html_element() automatically fills in NA when no elements match, keeping all of the variables aligned and making it easy to create a data frame:

data.frame(
  name = characters |> html_element("b") |> html_text2(),
  species = characters |> html_element("i") |> html_text2(),
  weight = characters |> html_element(".weight") |> html_text2()
)
#>     name species weight
#> 1  C-3PO   droid 167 kg
#> 2  R2-D2   droid  96 kg
#> 3   Yoda    <NA>  66 kg
#> 4 R4-P17   droid   <NA>
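Extracted text is always a character vector, so a column like weight usually needs one more step before analysis. The following is a minimal sketch using base R on a shortened copy of the list; the names weight_chr and weight_kg are illustrative, and it assumes every weight is written in the same “kg” format:

```r
library(rvest)

html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> weighs <span class='weight'>167 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")
characters <- html |> html_elements("li")

weight_chr <- characters |> html_element(".weight") |> html_text2()
weight_chr
#> [1] "167 kg" NA

# Strip the unit, then convert to numeric; missing values stay NA
weight_kg <- as.numeric(sub(" kg$", "", weight_chr))
weight_kg
#> [1] 167  NA
```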
