Title: Easily Harvest (Scrape) Web Pages
Version: 1.0.5
Description: Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.
License: MIT + file LICENSE
URL: https://rvest.tidyverse.org/, https://github.com/tidyverse/rvest
BugReports: https://github.com/tidyverse/rvest/issues
Depends: R (≥ 4.1)
Imports: cli, glue, httr (≥ 0.5), lifecycle (≥ 1.0.3), magrittr, rlang (≥ 1.1.0), selectr, tibble, xml2 (≥ 1.4.0)
Suggests: chromote, covr, knitr, purrr, R6, readr, repurrrsive, rmarkdown, spelling, stringi (≥ 0.3.1), testthat (≥ 3.0.2), tidyr, webfakes
VignetteBuilder: knitr
Config/Needs/website: tidyverse/tidytemplate
Config/testthat/edition: 3
Config/testthat/parallel: true
Encoding: UTF-8
Language: en-US
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-08-29 12:57:41 UTC; hadleywickham
Author: Hadley Wickham [aut, cre], Posit Software, PBC [cph, fnd]
Maintainer: Hadley Wickham <hadley@posit.co>
Repository: CRAN
Date/Publication: 2025-08-29 14:00:02 UTC

rvest: Easily Harvest (Scrape) Web Pages

Description

Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.

Author(s)

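Maintainer: Hadley Wickham <hadley@posit.co>

Other contributors:

  • Posit Software, PBC [copyright holder, funder]

See Also

Useful links:

  • https://rvest.tidyverse.org/

  • https://github.com/tidyverse/rvest

  • Report bugs at https://github.com/tidyverse/rvest/issues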


Interact with a live web page

Description

[Experimental]

You construct a LiveHTML object with read_html_live() and then interact, like you're a human, using the methods described below. When debugging a scraping script it is particularly useful to use $view(), which will open a live preview of the site, so you can actually see each of the operations performed on the real site.

rvest provides relatively simple methods for scrolling, typing, and clicking. For richer interaction, you probably want to use a package that exposes a more powerful user interface, like selenider.

Public fields

session

Underlying chromote session object. For expert use only.

Methods

Public methods


Method new()

Initialize the object.

Usage
LiveHTML$new(url)
Arguments
url

URL to page.


Method print()

Called when print()ed

Usage
LiveHTML$print(...)
Arguments
...

Ignored


Method view()

Display a live view of the site

Usage
LiveHTML$view()

Method html_elements()

Extract HTML elements from the current page.

Usage
LiveHTML$html_elements(css, xpath)
Arguments
css, xpath

CSS selector or xpath expression.


Method click()

Simulate a click on an HTML element.

Usage
LiveHTML$click(css, n_clicks = 1)
Arguments
css

CSS selector.

n_clicks

Number of clicks


Method get_scroll_position()

Get the current scroll position.

Usage
LiveHTML$get_scroll_position()

Method scroll_into_view()

Scroll selected element into view.

Usage
LiveHTML$scroll_into_view(css)
Arguments
css

CSS selector.


Method scroll_to()

Scroll to specified location

Usage
LiveHTML$scroll_to(top = 0, left = 0)
Arguments
top, left

Number of pixels from top/left respectively.


Method scroll_by()

Scroll by the specified amount

Usage
LiveHTML$scroll_by(top = 0, left = 0)
Arguments
top, left

Number of pixels to scroll up/down and left/right respectively.


Method type()

Type text in the selected element

Usage
LiveHTML$type(css, text)
Arguments
css

CSS selector.

text

A single string containing the text to type.


Method press()

Simulate pressing a single key (including special keys).

Usage
LiveHTML$press(css, key_code, modifiers = character())
Arguments
css

CSS selector.

key_code

Name of key. You can see a complete list of known keys at https://pptr.dev/api/puppeteer.keyinput.

modifiers

A character vector of modifiers. Must be one or more of "Shift", "Control", "Alt", or "Meta".


Method clone()

The objects of this class are cloneable with this method.

Usage
LiveHTML$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 
# To retrieve data for this paginated site, we need to repeatedly push
# the "Load More" button
sess <- read_html_live("https://www.bodybuilding.com/exercises/finder")
sess$view()

sess |> html_elements(".ExResult-row") |> length()
sess$click(".ExLoadMore-btn")
sess |> html_elements(".ExResult-row") |> length()
sess$click(".ExLoadMore-btn")
sess |> html_elements(".ExResult-row") |> length()

## End(Not run)
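
The typing and key-press methods listed above can be combined in the same way as the click example. The following is only a sketch: live_search() is a hypothetical helper, not part of rvest, and its url, box_css, and query arguments are placeholders supplied by the caller. Calling it requires the chromote package and a local Chrome installation, and the page may need a moment to update after the key press.

```r
library(rvest)

# Hypothetical helper (not an rvest function): type a query into an input
# box on a live page and press Enter. All three arguments are assumptions
# supplied by the caller.
live_search <- function(url, box_css, query) {
  sess <- read_html_live(url)       # starts a chromote session
  sess$scroll_into_view(box_css)    # make sure the input is visible
  sess$type(box_css, query)         # type the text into the element
  sess$press(box_css, "Enter")      # simulate a single key press
  sess                              # return the LiveHTML object for scraping
}
```

The returned object can then be passed to html_elements() like any other rvest input.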

Make link to google form given id

Description

Make link to google form given id

Usage

google_form(x)

Arguments

x

Unique identifier for form


Get element attributes

Description

html_attr() gets a single attribute; html_attrs() gets all attributes.

Usage

html_attr(x, name, default = NA_character_)
html_attrs(x)

Arguments

x

A document (from read_html()), node set (from html_elements()), node (from html_element()), or session (from session()).

name

Name of attribute to retrieve.

default

A string used as a default value when the attribute does not exist in every element.

Value

A character vector (for html_attr()) or list (html_attrs()) the same length as x.

Examples

html <- minimal_html('<ul>
  <li><a href="https://a.com">a</a></li>
  <li><a href="https://c.com">b</a></li>
  <li><a href="https://c.com">b</a></li>
  </ul>')

html |> html_elements("a") |> html_attrs()

html |> html_elements("a") |> html_attr("href")
html |> html_elements("li") |> html_attr("class")
html |> html_elements("li") |> html_attr("class", default = "inactive")
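
As a further sketch of how default behaves, the snippet below uses a made-up document in which only one link carries a title attribute. The markup and the "untitled" placeholder are illustrative assumptions, not taken from the package examples.

```r
library(rvest)

# Illustrative markup: only the first link has a title attribute
html <- minimal_html('
  <a href="https://a.com" title="first">a</a>
  <a href="https://b.com">b</a>
')
links <- html |> html_elements("a")

# One value per element; elements lacking the attribute get the default
titles <- links |> html_attr("title", default = "untitled")
titles

# One named character vector per element, holding all of its attributes
all_attrs <- links |> html_attrs()
all_attrs
```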

Get element children

Description

Get element children

Usage

html_children(x)

Arguments

x

A document (from read_html()), node set (from html_elements()), node (from html_element()), or session (from session()).

Examples

html <- minimal_html("<ul><li>1<li>2<li>3</ul>")
ul <- html_elements(html, "ul")
html_children(ul)

html <- minimal_html("<p>Hello <b>Hadley</b><i>!</i>")
p <- html_elements(html, "p")
html_children(p)

Select elements from an HTML document

Description

html_element() and html_elements() find HTML elements using CSS selectors or XPath expressions. CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need.

Usage

html_element(x, css, xpath)
html_elements(x, css, xpath)

Arguments

x

Either a document, a node set or a single node.

css, xpath

Elements to select. Supply one of css or xpath depending on whether you want to use a CSS selector or XPath 1.0 expression.

Value

html_element() returns a nodeset the same length as the input. html_elements() flattens the output so there's no direct way to map the output to the input.

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in https://www.w3.org/TR/2011/REC-css3-selectors-20110929/, with some exceptions.

Examples

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html |> html_element("h1")
html |> html_elements("p")
html |> html_elements(".important")
html |> html_elements("#first")

# html_element() vs html_elements() --------------------------------------
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")
li <- html |> html_elements("li")

# When applied to a node set, html_elements() returns all matching elements
# beneath any of the inputs, flattening results into a new node set.
li |> html_elements("i")

# When applied to a node set, html_element() always returns a vector the
# same length as the input, using a "missing" element where needed.
li |> html_element("i")
# and html_text() and html_attr() will return NA
li |> html_element("i") |> html_text2()
li |> html_element("span") |> html_attr("class")
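
The examples above all use CSS selectors. As a minimal sketch of the xpath argument, the snippet below selects the same paragraph both ways; the XPath expression is an illustrative choice, not taken from the package documentation.

```r
library(rvest)

html <- minimal_html("
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

# Select by CSS selector
by_css <- html |>
  html_elements(css = "p.important") |>
  html_text2()

# Select the same element with an XPath 1.0 expression
by_xpath <- html |>
  html_elements(xpath = "//p[@class='important']") |>
  html_text2()

identical(by_css, by_xpath)
```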

Guess faulty character encoding

Description

html_encoding_guess() helps you handle web pages that declare an incorrect encoding. Use html_encoding_guess() to generate a list of possible encodings, then try each out by using the encoding argument of read_html(). html_encoding_guess() replaces the deprecated guess_encoding().

Usage

html_encoding_guess(x)

Arguments

x

A character vector.

Examples

# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)
x |> html_elements("p") |> html_text()

html_encoding_guess(x)
# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") |> html_elements("p") |> html_text()
read_html(path, encoding = "ISO-8859-2") |> html_elements("p") |> html_text()

Parse forms and set values

Description

Use html_form() to extract a form, set values with html_form_set(), and submit it with html_form_submit().

Usage

html_form(x, base_url = NULL)
html_form_set(form, ...)
html_form_submit(form, submit = NULL)

Arguments

x

A document (from read_html()), node set (from html_elements()), node (from html_element()), or session (from session()).

base_url

Base url of underlying HTML document. The default, NULL, uses the url of the HTML document underlying x.

form

A form

...

<dynamic-dots> Name-value pairs giving fields to modify.

Provide a character vector to set multiple checkboxes in a set orselect multiple values from a multi-select.

submit

Which button should be used to submit the form?

  • NULL, the default, uses the first button.

  • A string selects a button by its name.

  • A number selects a button using its relative position.

Value

See Also

HTML 4.01 form specification: https://www.w3.org/TR/html401/interact/forms.html

Examples

html <- read_html("http://www.google.com")
search <- html_form(html)[[1]]

search <- search |> html_form_set(q = "My little pony", hl = "fr")

# Or if you have a list of values, use !!!
vals <- list(q = "web scraping", hl = "en")
search <- search |> html_form_set(!!!vals)

# To submit and get result:
## Not run: 
resp <- html_form_submit(search)
read_html(resp)

## End(Not run)
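
The example above depends on a live site. A minimal offline sketch of the same extract-then-set workflow is shown below; the form markup, the field names q and lang, and the https://example.com base URL are all made up for illustration.

```r
library(rvest)

# Illustrative form; field names and action are assumptions
html <- minimal_html('
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="text" name="lang" value="en">
    <input type="submit" name="go" value="Search">
  </form>
')

# base_url is supplied explicitly because an inline document has no URL
form <- html_form(html, base_url = "https://example.com")[[1]]

# Set two fields by name; the modified form is returned
form <- form |> html_form_set(q = "web scraping", lang = "fr")
form
```

Printing the form lists its fields with their current values, which is a convenient check before calling html_form_submit().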

Get element name

Description

Get element name

Usage

html_name(x)

Arguments

x

A document (from read_html()), node set (from html_elements()), node (from html_element()), or session (from session()).

Value

A character vector the same length as x

Examples

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

html |>
  html_element("div") |>
  html_children() |>
  html_name()
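
The same pattern can be tried without network access. The sketch below uses made-up inline markup to list the tag names of an element's direct children.

```r
library(rvest)

# Illustrative markup: a div with three direct children
html <- minimal_html("<div><h1>Title</h1><p>Body</p><img src='x.png'></div>")

kids <- html |>
  html_element("div") |>
  html_children()

# One tag name per child element
tags <- html_name(kids)
tags
```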

Parse an html table into a data frame

Description

The algorithm mimics what a browser does, but repeats the values of merged cells in every cell that they cover.

Usage

html_table(
  x,
  header = NA,
  trim = TRUE,
  fill = deprecated(),
  dec = ".",
  na.strings = "NA",
  convert = TRUE
)

Arguments

x

A document (from read_html()), node set (from html_elements()), node (from html_element()), or session (from session()).

header

Use first row as header? If NA, will use first row if it consists of <th> tags.

If TRUE, column names are left exactly as they are in the source document, which may require post-processing to generate a valid data frame.

trim

Remove leading and trailing whitespace within each cell?

fill

Deprecated - missing cells in tables are now always automatically filled with NA.

dec

The character used as decimal place marker.

na.strings

Character vector of values that will be converted to NA if convert is TRUE.

convert

If TRUE, will run type.convert() to interpret texts as integer, double, or NA.

Value

When applied to a single element, html_table() returns a single tibble. When applied to multiple elements or a document, html_table() returns a list of tibbles.

Examples

sample1 <- minimal_html("<table>
  <tr><th>Col A</th><th>Col B</th></tr>
  <tr><td>1</td><td>x</td></tr>
  <tr><td>4</td><td>y</td></tr>
  <tr><td>10</td><td>z</td></tr>
</table>")
sample1 |>
  html_element("table") |>
  html_table()

# Values in merged cells will be duplicated
sample2 <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td>1</td><td>2</td><td>3</td></tr>
  <tr><td colspan='2'>4</td><td>5</td></tr>
  <tr><td>6</td><td colspan='2'>7</td></tr>
</table>")
sample2 |>
  html_element("table") |>
  html_table()

# If a row is missing cells, they'll be filled with NAs
sample3 <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan='2'>1</td><td>2</td></tr>
  <tr><td colspan='2'>3</td></tr>
  <tr><td>4</td></tr>
</table>")
sample3 |>
  html_element("table") |>
  html_table()
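
The dec, na.strings, and convert arguments are not exercised above. The following sketch uses a made-up table with comma decimals, a custom missing-value marker, and identifiers with leading zeros to show how they interact.

```r
library(rvest)

# Illustrative table: zero-padded ids, comma decimals, "n/a" for missing
html <- minimal_html("
  <table>
    <tr><th>id</th><th>score</th></tr>
    <tr><td>007</td><td>1,5</td></tr>
    <tr><td>042</td><td>n/a</td></tr>
  </table>
")
tbl <- html |> html_element("table")

# With conversion: scores are parsed as doubles and "n/a" becomes NA,
# but the ids are read as integers and lose their leading zeros
converted <- tbl |> html_table(dec = ",", na.strings = "n/a")
converted

# Without conversion: every cell stays a character string
raw <- tbl |> html_table(convert = FALSE)
raw
```

Setting convert = FALSE is a reasonable choice when a column holds identifiers rather than numbers; individual columns can then be converted afterwards.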

Get element text

Description

There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() which returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.

html_text2() is usually what you want, but it is much slower than html_text() so for simple applications where performance is important you may want to use html_text() instead.

Usage

html_text(x, trim = FALSE)
html_text2(x, preserve_nbsp = FALSE)

Arguments

x

A document, node, or node set.

trim

If TRUE, will trim leading and trailing spaces.

preserve_nbsp

Should non-breaking spaces be preserved? By default, html_text2() converts to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; will appear in strings as "\ua0". This often causes confusion because it prints the same way as " ".

Value

A character vector the same length as x

Examples

# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This is another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html |> html_element("p") |> html_text() |> writeLines()

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html |> html_element("p") |> html_text2() |> writeLines()

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html |> html_element("p") |> html_text()
x2 <- html |> html_element("p") |> html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
x2
# But aren't actually the same:
x1 == x2
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
charToRaw(x2)
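
The preserve_nbsp argument is not used in the examples above. As a short sketch on the same one-paragraph document, the snippet below compares the default result with preserve_nbsp = TRUE.

```r
library(rvest)

html <- minimal_html("<p>x&nbsp;y</p>")
p <- html |> html_element("p")

# Default: the non-breaking space becomes an ordinary space
plain <- p |> html_text2()

# preserve_nbsp = TRUE keeps the non-breaking space (U+00A0)
kept <- p |> html_text2(preserve_nbsp = TRUE)

plain == "x y"
kept == "x\u00a0y"
```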

Create an HTML document from inline HTML

Description

Create an HTML document from inline HTML

Usage

minimal_html(html, title = "")

Arguments

html

HTML contents of page.

title

Page title (required by HTML spec).

Examples

minimal_html("<p>test</p>")

Static web scraping (with xml2)

Description

read_html() works by performing an HTTP request then parsing the HTML received using the xml2 package. This is "static" scraping because it operates only on the raw HTML file. While this works for most sites, in some cases you will need to use read_html_live() if the parts of the page you want to scrape are dynamically generated with javascript.

Generally, we recommend using read_html() if it works, as it will be faster and more robust, because it has fewer external dependencies (i.e. it doesn't rely on the Chrome web browser installed on your computer).

Usage

read_html(
  x,
  encoding = "",
  ...,
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
)

Arguments

x

Usually a string representing a URL. See xml2::read_html() for other options.

encoding

Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default.

...

Additional arguments passed on to methods.

options

Set parsing options for the libxml2 parser. Zero or more of

RECOVER

recover on errors

NOENT

substitute entities

DTDLOAD

load the external subset

DTDATTR

default DTD attributes

DTDVALID

validate with the DTD

NOERROR

suppress error reports

NOWARNING

suppress warning reports

PEDANTIC

pedantic error reporting

NOBLANKS

remove blank nodes

SAX1

use the SAX1 interface internally

XINCLUDE

Implement XInclude substitution

NONET

Forbid network access

NODICT

Do not reuse the context dictionary

NSCLEAN

remove redundant namespaces declarations

NOCDATA

merge CDATA as text nodes

NOXINCNODE

do not generate XINCLUDE START/END nodes

COMPACT

compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree)

OLD10

parse using XML-1.0 before update 5

NOBASEFIX

do not fixup XINCLUDE xml:base uris

HUGE

relax any hardcoded limit from the parser

OLDSAX

parse using SAX2 interface before 2.7.0

IGNORE_ENC

ignore internal document encoding hint

BIG_LINES

Store big lines numbers in text PSVI field

Examples

# Start by reading a HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars |> html_elements("section")
films

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films |>
  html_element("h2") |>
  html_text2()
title

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films |>
  html_element("h2") |>
  html_attr("data-id") |>
  readr::parse_integer()
episode
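
The encoding argument described above can be tried offline. The sketch below writes a small file by hand in ISO-8859-1, with no encoding declaration, and then reads it back with an explicit default; the file contents and the temporary path are illustrative assumptions.

```r
library(rvest)

# Write "<p>café</p>" as ISO-8859-1 bytes (0xE9 is "é" in that encoding),
# with no encoding declaration inside the document
path <- tempfile(fileext = ".html")
writeBin(c(charToRaw("<p>caf"), as.raw(0xe9), charToRaw("</p>")), path)

# Supply the encoding explicitly so the byte is decoded correctly
text <- read_html(path, encoding = "ISO-8859-1") |>
  html_element("p") |>
  html_text()
text

unlink(path)
```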

Live web scraping (with chromote)

Description

[Experimental]

read_html() operates on the HTML source code downloaded from the server. This works for most websites but can fail if the site uses javascript to generate the HTML. read_html_live() provides an alternative interface that runs a live web browser (Chrome) in the background. This allows you to access elements of the HTML page that are generated dynamically by javascript and to interact with the live page by clicking on buttons or typing in forms.

Behind the scenes, this function uses the chromote package, which requires that you have a copy of Google Chrome installed on your machine.

Usage

read_html_live(url)

Arguments

url

Website url to read from.

Value

read_html_live() returns an R6 LiveHTML object. You can interact with this object using the usual rvest functions, or call its methods, like $click(), $scroll_to(), and $type() to interact with the live page like a human would.

Examples

## Not run: 
# When we retrieve the raw HTML for this site, it doesn't contain the
# data we're interested in:
static <- read_html("https://www.forbes.com/top-colleges/")
static |> html_element("table")

# Instead, we need to run the site in a real web browser, causing it to
# download a JSON file and then dynamically generate the html:

dynamic <- read_html_live("https://www.forbes.com/top-colleges/")
# You may need to click the cookie consent banner if it appears
dynamic$view()
# Now we can find the table
dynamic |> html_element("table")
# And extract data from it
dynamic |>
  html_element("table") |>
  html_table()

## End(Not run)

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

magrittr

%>%

xml2

url_absolute


Functions renamed in rvest 1.0.0

Description

[Deprecated]

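rvest 1.0.0 renamed a number of functions to ensure that every function has a common prefix, matching tidyverse conventions that emerged since rvest was first created.

  • set_values() -> html_form_set()

  • submit_form() -> session_submit()

  • xml_tag() -> html_name()

  • xml_node() & html_node() -> html_element()

  • xml_nodes() & html_nodes() -> html_elements()

(html_node() and html_nodes() are only superseded because they're so widely used.)

Additionally all session related functions gained a common prefix:

  • html_session() -> session()

  • forward() -> session_forward()

  • back() -> session_back()

  • jump_to() -> session_jump_to()

  • follow_link() -> session_follow_link()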

Usage

set_values(form, ...)
submit_form(session, form, submit = NULL, ...)
xml_tag(x)
xml_node(...)
xml_nodes(...)
html_nodes(...)
html_node(...)
back(x)
forward(x)
jump_to(x, url, ...)
follow_link(x, ...)
html_session(url, ...)
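
As a small migration sketch, the snippet below shows a pre-1.0.0 pipeline in a comment and its current equivalent in code. The markup is made up for illustration, and the old-to-new pairing assumes that html_nodes() corresponds to html_elements() and xml_tag() to html_name().

```r
library(rvest)

html <- minimal_html("<ul><li>a</li><li>b</li></ul>")

# Older code might have read:
#   html |> html_nodes("li") |> xml_tag()
# The current spelling of the same pipeline:
items <- html |> html_elements("li")
tags <- items |> html_name()
tags
```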

Repair faulty encoding

Description

[Deprecated]

This function has been deprecated because it doesn't work. Instead, re-read the HTML file with the correct encoding argument.

Usage

repair_encoding(x, from = NULL)

Arguments

from

The encoding that the string is actually in. If NULL, guess_encoding will be used.


Simulate a session in web browser

Description

This set of functions allows you to simulate a user interacting with a website, using forms and navigating from page to page.

Usage

session(url, ...)
is.session(x)
session_jump_to(x, url, ...)
session_follow_link(x, i, css, xpath, ...)
session_back(x)
session_forward(x)
session_history(x)
session_submit(x, form, submit = NULL, ...)

Arguments

url

A URL, either relative or absolute, to navigate to.

...

Any additional httr config to use throughout the session.

x

A session.

i

An integer to select the ith link or a string to match the first link containing that text (case sensitive).

css, xpath

Elements to select. Supply one of css or xpath depending on whether you want to use a CSS selector or XPath 1.0 expression.

form

An html_form to submit

submit

Which button should be used to submit the form?

  • NULL, the default, uses the first button.

  • A string selects a button by its name.

  • A number selects a button using its relative position.

Examples

s <- session("http://hadley.nz")
s |>
  session_jump_to("hadley.jpg") |>
  session_jump_to("/") |>
  session_history()

s |>
  session_jump_to("hadley.jpg") |>
  session_back() |>
  session_history()

s |>
  session_follow_link(css = "p a") |>
  html_elements("p")
