| Title: | Parse XML |
|---|---|
| Description: | Bindings to 'libxml2' for working with XML data using a simple, consistent interface based on 'XPath' expressions. Also supports XML schema validation; for 'XSLT' transformations see the 'xslt' package. |
| Authors: | Hadley Wickham [aut], Jim Hester [aut], Jeroen Ooms [aut, cre], Posit Software, PBC [cph, fnd], R Foundation [ctb] (Copy of R-project homepage cached as example) |
| Maintainer: | Jeroen Ooms <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.5.1 |
| Built: | 2025-11-26 14:26:38 UTC |
| Source: | https://github.com/r-lib/xml2 |
This turns an XML document (or node or nodeset) into the equivalent R list. Note that this is as_list(), not as.list(): lapply() automatically calls as.list() on its inputs, so we can't override the default.
    as_list(x, ns = character(), ...)
x | A document, node, or node set. |
ns | Optionally, a named vector giving prefix-url pairs, as produced by |
... | Needed for compatibility with generic. Unused. |
as_list currently only handles the four most common types of children that an element might have:
Other elements, converted to lists.
Attributes, stored as R attributes. Attributes that have special meanings in R (class(), comment(), dim(), dimnames(), names(), row.names() and tsp()) are escaped with '.'
Text, stored as a character vector.
    as_list(read_xml("<foo> a <b /><c><![CDATA[<d></d>]]></c></foo>"))
    as_list(read_xml("<foo> <bar><baz /></bar> </foo>"))
    as_list(read_xml("<foo id = 'a'></foo>"))
    as_list(read_xml("<foo><bar id='a'/><bar id='b'/></foo>"))
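As a rough sketch of how the conversion rules above combine (the document and its element names are made up for illustration), the list returned for a document is keyed by the root element's name, XML attributes come back as R attributes, and text appears as character elements:

```r
library(xml2)

# Hypothetical document: one attribute on the root and one text-only child.
doc <- read_xml("<foo id='a'><bar>text</bar></foo>")
lst <- as_list(doc)

names(lst)            # name of the root element
attr(lst$foo, "id")   # the XML attribute, stored as an R attribute
lst$foo$bar[[1]]      # the text content of <bar>
```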
This turns an R list into the equivalent XML document. Not all R lists will produce valid XML, in particular there can only be one root node and all child nodes need to be named (or empty) lists. R attributes become XML attributes and R names become XML node names.
    as_xml_document(x, ...)
x | A document, node, or node set. |
... | Needed for compatibility with generic. Unused. |
    as_xml_document(list(x = list()))
    # Nesting multiple nodes
    as_xml_document(list(foo = list(bar = list(baz = list()))))
    # attributes are stored as R attributes
    as_xml_document(list(foo = structure(list(), id = "a")))
    as_xml_document(list(foo = list(
      bar = structure(list(), id = "a"),
      bar = structure(list(), id = "b")
    )))
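To check what a list turned into, one option is to inspect the resulting document with the accessor functions. This is a minimal sketch; the names `foo`, `bar` and `id` are purely illustrative:

```r
library(xml2)

# A single root (foo) holding two children that share a name but differ
# in their id attribute.
doc <- as_xml_document(list(foo = list(
  bar = structure(list(), id = "a"),
  bar = structure(list(), id = "b")
)))

xml_name(doc)                   # name of the root node
kids <- xml_children(doc)
xml_name(kids)                  # names of the child nodes
xml_attr(kids, "id")            # R attributes became XML attributes
```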
Libcurl implementation of C_download (the "internal" download method) with added support for https, ftps, gzip, etc. Default behavior is identical to download.file(), but the request can be fully configured by passing a custom curl::handle().
    download_xml(
      url,
      file = basename(url),
      quiet = TRUE,
      mode = "wb",
      handle = curl::new_handle()
    )
    download_html(
      url,
      file = basename(url),
      quiet = TRUE,
      mode = "wb",
      handle = curl::new_handle()
    )
url | A character string naming the URL of a resource to be downloaded. |
file | A character string with the name where the downloaded file is saved. |
quiet | If |
mode | A character string specifying the mode with which to write the file. Useful values are |
handle | a curl handle object |
The main difference between curl_download and curl_fetch_disk is that curl_download checks the http status code before starting the download, and raises an error when status is non-successful. The behavior of curl_fetch_disk on the other hand is to proceed as normal and write the error page to disk in case of a non success response.
The curl_download function does support resuming and removes the temporary file if the download did not complete successfully. For a more advanced download interface which supports concurrent requests and resuming large files, have a look at the multi_download function.
Path of downloaded file (invisibly).
    ## Not run:
    download_html("http://tidyverse.org/index.html")
    ## End(Not run)
Read HTML or XML.
    read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")
    read_html(
      x,
      encoding = "",
      ...,
      options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
    )
    ## S3 method for class 'character'
    read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")
    ## S3 method for class 'raw'
    read_xml(
      x,
      encoding = "",
      base_url = "",
      ...,
      as_html = FALSE,
      options = "NOBLANKS"
    )
    ## S3 method for class 'connection'
    read_xml(
      x,
      encoding = "",
      n = 64 * 1024,
      verbose = FALSE,
      ...,
      base_url = "",
      as_html = FALSE,
      options = "NOBLANKS"
    )
x | A string, a connection, or a raw vector. A string can be either a path, a url or literal xml. Urls will be converted into connections either using If a connection, the complete connection is read into a raw vector before being parsed. |
encoding | Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default. |
... | Additional arguments passed on to methods. |
as_html | Optionally parse an xml file as if it's html. |
options | Set parsing options for the libxml2 parser. Zero or more of
|
base_url | When loading from a connection, raw vector or literal html/xml, this allows you to specify a base url for the document. Base urls are used to turn relative urls into absolute urls. |
n | If |
verbose | When reading from a slow connection, this prints some output on every iteration so you know it's working. |
An XML document. HTML is normalised to valid XML - this may not be exactly the same transformation performed by the browser, but it's a reasonable approximation.
When performing web scraping tasks it is both good practice, and often required, to set the user agent request header to a specific value. Sometimes this value is assigned to emulate a browser in order to have content render in a certain way (e.g. Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0 to emulate more recent Windows browsers). Most often, this value should be set to provide the web resource owner information on who you are and the intent of your actions like this Google scraping bot user agent identifier: Googlebot/2.1 (+http://www.google.com/bot.html).
You can set the HTTP user agent for URL-based requests using httr::set_config() and httr::user_agent():
httr::set_config(httr::user_agent("[email protected]; +https://example.com/info.html"))
httr::set_config() changes the configuration globally; httr::with_config() can be used to change configuration temporarily.
    # Literal xml/html is useful for small examples
    read_xml("<foo><bar /></foo>")
    read_html("<html><title>Hi<title></html>")
    read_html("<html><title>Hi")
    # From a local path
    read_html(system.file("extdata", "r-project.html", package = "xml2"))
    ## Not run:
    # From a url
    cd <- read_xml(xml2_example("cd_catalog.xml"))
    me <- read_html("http://had.co.nz")
    ## End(Not run)
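The different input types can be sketched as follows. The raw-vector content and the `base_url` value are illustrative assumptions, not taken from the package examples:

```r
library(xml2)

# Parse from a raw vector; base_url here is a made-up address.
raw_doc <- charToRaw("<foo><bar>1</bar></foo>")
doc <- read_xml(raw_doc, base_url = "http://example.com/data/")
xml_name(doc)

# read_html() tolerates the missing closing tag and wraps the fragment
# in a complete html document.
html <- read_html("<p>Hi")
xml_name(html)
xml_text(xml_find_first(html, "//p"))
```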
Convert between relative and absolute urls.
    url_absolute(x, base)
    url_relative(x, base)
x | A character vector of urls relative to that base |
base | A string giving a base url. |
A character vector of urls
xml_url to retrieve the URL associated with a document
    url_absolute(c(".", "..", "/", "/x"), "http://hadley.nz/a/b/c/d")
    url_relative("http://hadley.nz/a/c", "http://hadley.nz")
    url_relative("http://hadley.nz/a/c", "http://hadley.nz/")
    url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b")
    url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b/")
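A typical use is resolving links scraped from a page against the page's own address. The sketch below uses a made-up base url and link values:

```r
library(xml2)

# Hypothetical page address and relative links found on that page.
base <- "http://example.com/docs/guide/index.html"
links <- c("intro.html", "../api/", "/about")

abs_links <- url_absolute(links, base)
abs_links
```

Each link is resolved relative to the directory of the base url, so a leading `/` restarts from the server root and `..` climbs one directory.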
Escape and unescape urls.
    url_escape(x, reserved = "")
    url_unescape(x)
x | A character vector of urls. |
reserved | A string containing additional characters to avoid escaping. |
    url_escape("a b c")
    url_escape("a b c", "")
    url_unescape("a%20b%2fc")
    url_unescape("%C2%B5")
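The two functions are inverses of each other, which the following small sketch illustrates (the input string is an arbitrary example):

```r
library(xml2)

original <- "a b c"
escaped <- url_escape(original)   # spaces become %20
restored <- url_unescape(escaped)

escaped
restored
```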
Parse a url into its component pieces.
    url_parse(x)
x | A character vector of urls. |
A data frame with one row for each element of x and columns: scheme, server, port, user, path, query, fragment.
    url_parse("http://had.co.nz/")
    url_parse("http://had.co.nz:1234/")
    url_parse("http://had.co.nz:1234/?a=1&b=2")
    url_parse("http://had.co.nz:1234/?a=1&b=2#def")
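Because the result is a data frame, individual components can be pulled out by column name. A brief sketch, using a made-up url:

```r
library(xml2)

# Hypothetical url with every component except a user name.
parts <- url_parse("http://example.org:8080/search?q=xml#results")

parts$scheme
parts$server
parts$port
parts$query
parts$fragment
```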
This writes out both XML and normalised HTML. The default behavior will output the same format which was read. If you want to force output pass options = "as_xml" or options = "as_html" respectively.
    write_xml(x, file, ...)
    ## S3 method for class 'xml_document'
    write_xml(x, file, ..., options = "format", encoding = "UTF-8")
    write_html(x, file, ...)
    ## S3 method for class 'xml_document'
    write_html(x, file, ..., options = "format", encoding = "UTF-8")
x | A document or node to write to disk. It's not possible to save nodesets containing more than one node. |
file | Path to file or connection to write to. |
... | additional arguments passed to methods. |
options | default: ‘format’. Zero or more of
|
encoding | The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding. |
    h <- read_html("<p>Hi!</p>")
    tmp <- tempfile(fileext = ".xml")
    write_xml(h, tmp, options = "format")
    readLines(tmp)
    # write formatted HTML output
    write_html(h, tmp, options = "format")
    readLines(tmp)
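A simple way to confirm that a document survives being written is to read the file back in. This is a sketch with an invented document, using a temporary file:

```r
library(xml2)

doc <- read_xml("<root><item>1</item></root>")
tmp <- tempfile(fileext = ".xml")

write_xml(doc, tmp)
back <- read_xml(tmp)            # parse the file that was just written
item_text <- xml_text(xml_find_first(back, "//item"))
item_text

unlink(tmp)                      # clean up the temporary file
```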
xml_attrs() retrieves all attribute values as a named character vector; xml_attrs() <- or xml_set_attrs() sets all attribute values. xml_attr() retrieves the value of a single attribute and xml_attr() <- or xml_set_attr() modifies its value. If the attribute doesn't exist, it will return default, which defaults to NA. xml_has_attr() tests if an attribute is present.
    xml_attr(x, attr, ns = character(), default = NA_character_)
    xml_has_attr(x, attr, ns = character())
    xml_attrs(x, ns = character())
    xml_attr(x, attr, ns = character()) <- value
    xml_set_attr(x, attr, value, ns = character())
    xml_attrs(x, ns = character()) <- value
    xml_set_attrs(x, value, ns = character())
x | A document, node, or node set. |
attr | Name of attribute to extract. |
ns | Optionally, a named vector giving prefix-url pairs, as produced by |
default | Default value to use when attribute is not present. |
value | character vector of new value. |
xml_attr() returns a character vector. NA is used to represent attributes that aren't defined.
xml_has_attr() returns a logical vector.
xml_attrs() returns a named character vector if x is a single node, or a list of character vectors if given a nodeset.
    x <- read_xml("<root id='1'><child id ='a' /><child id='b' d='b'/></root>")
    xml_attr(x, "id")
    xml_attr(x, "apple")
    xml_attrs(x)
    kids <- xml_children(x)
    kids
    xml_attr(kids, "id")
    xml_has_attr(kids, "id")
    xml_attrs(kids)
    # Missing attributes give missing values
    xml_attr(xml_children(x), "d")
    xml_has_attr(xml_children(x), "d")
    # If the document has a namespace, use the ns argument and
    # qualified attribute names
    x <- read_xml('<root xmlns:b="http://bar.com" xmlns:f="http://foo.com"><doc b:id="b" f:id="f" id=""/></root>')
    doc <- xml_children(x)[[1]]
    ns <- xml_ns(x)
    xml_attrs(doc)
    xml_attrs(doc, ns)
    # If you don't supply a ns spec, you get the first matching attribute
    xml_attr(doc, "id")
    xml_attr(doc, "b:id", ns)
    xml_attr(doc, "id", ns)
    # Can set a single attribute with `xml_attr() <-` or `xml_set_attr()`
    xml_attr(doc, "id") <- "one"
    xml_set_attr(doc, "id", "two")
    # Or set multiple attributes with `xml_attrs()` or `xml_set_attrs()`
    xml_attrs(doc) <- c("b:id" = "one", "f:id" = "two", "id" = "three")
    xml_set_attrs(doc, c("b:id" = "one", "f:id" = "two", "id" = "three"))
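The `default` argument is handy when a missing attribute should map to something other than NA. A small sketch, with an invented node and attribute names:

```r
library(xml2)

node <- read_xml("<item id='1'/>")

id <- xml_attr(node, "id")                              # attribute exists
fallback <- xml_attr(node, "lang", default = "none")    # attribute is absent

# Setting an attribute modifies the node in place.
xml_set_attr(node, "lang", "en")
has_lang <- xml_has_attr(node, "lang")
lang <- xml_attr(node, "lang", default = "none")
```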
Construct a cdata node
    xml_cdata(content)
content | The CDATA content, does not include |
    x <- xml_new_root("root")
    xml_add_child(x, xml_cdata("<d/>"))
    as.character(x)
xml_children returns only elements, xml_contents returns all nodes. xml_length returns the number of children. xml_parent returns the parent node, xml_parents returns all parents up to the root. xml_siblings returns all nodes at the same level. xml_child makes it easy to specify a specific child to return.
    xml_children(x)
    xml_child(x, search = 1, ns = xml_ns(x))
    xml_contents(x)
    xml_parents(x)
    xml_siblings(x)
    xml_parent(x)
    xml_length(x, only_elements = TRUE)
    xml_root(x)
x | A document, node, or node set. |
search | For |
ns | Optionally, a named vector giving prefix-url pairs, as produced by |
only_elements | For |
A node or nodeset (possibly empty). Results are always de-duplicated.
    x <- read_xml("<foo> <bar><boo /></bar> <baz/> </foo>")
    xml_children(x)
    xml_children(xml_children(x))
    xml_siblings(xml_children(x)[[1]])
    # Note that each unique node only appears once in the output
    xml_parent(xml_children(x))
    # Mixed content
    x <- read_xml("<foo> a <b/> c <d>e</d> f</foo>")
    # Children gets the elements, contents gets all node types
    xml_children(x)
    xml_contents(x)
    xml_length(x)
    xml_length(x, only_elements = FALSE)
    # xml_child makes it easier to select specific children
    xml_child(x)
    xml_child(x, 2)
    xml_child(x, "baz")
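The difference between elements and all nodes shows up clearly on mixed content. The sketch below reuses the mixed-content document from the examples and counts both:

```r
library(xml2)

x <- read_xml("<foo> a <b/> c <d>e</d> f</foo>")

n_elements <- xml_length(x)                         # <b/> and <d>
n_nodes <- xml_length(x, only_elements = FALSE)     # elements plus text nodes
child_names <- xml_name(xml_children(x))

# Navigating back up from the first child reaches the root element.
parent_name <- xml_name(xml_parent(xml_child(x, 1)))
```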
Construct a comment node
    xml_comment(content)
content | The comment content |
    x <- xml_new_document()
    r <- xml_add_child(x, "root")
    xml_add_child(r, xml_comment("Hello!"))
    as.character(x)
This is used to create simple document type definitions. If you need to create a more complicated definition with internal subsets it is recommended to parse a string directly with read_xml().
    xml_dtd(name = "", external_id = "", system_id = "")
name | The name of the declaration |
external_id | The external ID of the declaration |
system_id | The system ID of the declaration |
    r <- xml_new_root(
      xml_dtd(
        "html",
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
      )
    )
    # Use read_xml directly for more complicated DTD
    d <- read_xml(
      '<!DOCTYPE doc [ <!ELEMENT doc (#PCDATA)> <!ENTITY foo " test "> ]> <doc>This is a valid document &foo; !</doc>'
    )
XPath is like regular expressions for trees - it's worth learning if you're trying to extract nodes from arbitrary locations in a document. Use xml_find_all to find all matches - if there's no match you'll get an empty result. Use xml_find_first to find a specific match - if there's no match you'll get an xml_missing node.
    xml_find_all(x, xpath, ns = xml_ns(x), ...)
    ## S3 method for class 'xml_nodeset'
    xml_find_all(x, xpath, ns = xml_ns(x), flatten = TRUE, ...)
    xml_find_first(x, xpath, ns = xml_ns(x))
    xml_find_num(x, xpath, ns = xml_ns(x))
    xml_find_int(x, xpath, ns = xml_ns(x))
    xml_find_chr(x, xpath, ns = xml_ns(x))
    xml_find_lgl(x, xpath, ns = xml_ns(x))
x | A document, node, or node set. |
xpath | A string containing an xpath (1.0) expression. |
ns | Optionally, a named vector giving prefix-url pairs, as produced by |
... | Further arguments passed to or from other methods. |
flatten | A logical indicating whether to return a single, flattened nodeset or a list of nodesets. |
xml_find_all returns a nodeset if applied to a node, and a nodeset or a list of nodesets if applied to a nodeset. If there are no matches, the nodeset(s) will be empty. Within each nodeset, the result will always be unique; repeated nodes are automatically de-duplicated.
xml_find_first returns a node if applied to a node, and a nodeset if applied to a nodeset. The output is always the same size as the input. If there are no matches, xml_find_first will return a missing node; if there are multiple matches, it will return the first only.
xml_find_num, xml_find_chr, xml_find_lgl return numeric, character and logical results respectively.
xml_find_one() has been deprecated. Instead use xml_find_first().
xml_ns_strip() to remove the default namespaces
    x <- read_xml("<foo><bar><baz/></bar><baz/></foo>")
    xml_find_all(x, ".//baz")
    xml_path(xml_find_all(x, ".//baz"))
    # Note the difference between .// and //
    # //  finds anywhere in the document (ignoring the current node)
    # .// finds anywhere beneath the current node
    (bar <- xml_find_all(x, ".//bar"))
    xml_find_all(bar, ".//baz")
    xml_find_all(bar, "//baz")
    # Find all vs find one -----------------------------------------------------
    x <- read_xml("<body> <p>Some <b>text</b>.</p> <p>Some <b>other</b> <b>text</b>.</p> <p>No bold here!</p></body>")
    para <- xml_find_all(x, ".//p")
    # By default, if you apply xml_find_all to a nodeset, it finds all matches,
    # de-duplicates them, and returns as a single nodeset. This means you
    # never know how many results you'll get
    xml_find_all(para, ".//b")
    # If you set flatten to FALSE, though, xml_find_all will return a list of
    # nodesets, where each nodeset contains the matches for the corresponding
    # node in the original nodeset.
    xml_find_all(para, ".//b", flatten = FALSE)
    # xml_find_first only returns the first match per input node. If there are 0
    # matches it will return a missing node
    xml_find_first(para, ".//b")
    xml_text(xml_find_first(para, ".//b"))
    # Namespaces ---------------------------------------------------------------
    # If the document uses namespaces, you'll need to use xml_ns to form
    # a unique mapping between full namespace url and a short prefix
    x <- read_xml('<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com"> <f:doc><g:baz /></f:doc> <f:doc><g:baz /></f:doc> </root>')
    xml_find_all(x, ".//f:doc")
    xml_find_all(x, ".//f:doc", xml_ns(x))
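The typed variants pair naturally with XPath functions that return a number or a boolean rather than nodes. A minimal sketch on an invented document, also showing how the missing node from xml_find_first surfaces as NA:

```r
library(xml2)

x <- read_xml("<body><p>Some <b>text</b>.</p><p>No bold here!</p></body>")
para <- xml_find_all(x, ".//p")

# One result per paragraph; the second paragraph has no <b>, so its
# entry is a missing node and its text is NA.
first_bold <- xml_text(xml_find_first(para, ".//b"))

# XPath expressions that evaluate to a number or a boolean.
n_bold <- xml_find_num(x, "count(//b)")
has_italic <- xml_find_lgl(x, "boolean(//i)")
```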
The (tag) name of an xml element.
Modify the (tag) name of an element
    xml_name(x, ns = character())
    xml_name(x, ns = character()) <- value
    xml_set_name(x, value, ns = character())
x | A document, node, or node set. |
ns | Optionally, a named vector giving prefix-url pairs, as produced by |
value | a character vector with replacement name. |
A character vector.
    x <- read_xml("<bar>123</bar>")
    xml_name(x)
    y <- read_xml("<bar><baz>1</baz>abc<foo /></bar>")
    z <- xml_children(y)
    xml_name(xml_children(y))
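Renaming works on a node object, and because nodes refer to the underlying document the change is visible from the document as well. A short sketch; the new name `qux` is an arbitrary choice:

```r
library(xml2)

doc <- read_xml("<bar><baz>1</baz><foo/></bar>")
node <- xml_child(doc, "baz")

xml_name(node) <- "qux"                      # rename the element in place
new_names <- xml_name(xml_children(doc))     # the document reflects the change
new_names
```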
xml_new_document creates only a new document without a root node. In most cases you should instead use xml_new_root, which creates a new document and assigns the root node in one step.
    xml_new_document(version = "1.0", encoding = "UTF-8")
    xml_new_root(
      .value,
      ...,
      .copy = inherits(.value, "xml_node"),
      .version = "1.0",
      .encoding = "UTF-8"
    )
version | The version number of the document. |
encoding | The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding. |
.value | node to insert. |
... | If named, attributes or namespaces to set on the node; if unnamed, text to assign to the node. |
.copy | whether to copy the |
.version | The version number of the document, passed to |
.encoding | The encoding of the document, passed to |
An xml_document object.
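Building a small document from scratch might look like the sketch below. The element names, text and attribute are invented; the unnamed argument becomes the node's text and the named argument becomes an attribute:

```r
library(xml2)

# Create the document and its root node in one step.
root <- xml_new_root("catalog")

# Unnamed "First entry" is the text; named id = "1" is an attribute.
item <- xml_add_child(root, "item", "First entry", id = "1")

xml_name(root)
xml_text(item)
xml_attr(item, "id")
```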
xml_ns extracts all namespaces from a document, matching each unique namespace url with the prefix it was first associated with. Default namespaces are named d1, d2 etc. Use xml_ns_rename to change the prefixes. Once you have a namespace object, you can pass it to other functions to work with fully qualified names instead of local names.
    xml_ns(x)
    xml_ns_rename(old, ...)
x | A document, node, or node set. |
old,... | An existing xml_namespace object followed by name-value (old prefix-new prefix) pairs to replace. |
A character vector with class xml_namespace so the default display is a little nicer.
    x <- read_xml('<root> <doc1 xmlns = "http://foo.com"><baz /></doc1> <doc2 xmlns = "http://bar.com"><baz /></doc2> </root>')
    xml_ns(x)
    # When there are default namespaces, it's a good idea to rename
    # them to give informative names:
    ns <- xml_ns_rename(xml_ns(x), d1 = "foo", d2 = "bar")
    ns
    # Now we can pass ns to other xml functions to use fully qualified names
    baz <- xml_children(xml_children(x))
    xml_name(baz)
    xml_name(baz, ns)
    xml_find_all(x, "//baz")
    xml_find_all(x, "//foo:baz", ns)
    str(as_list(x))
    str(as_list(x, ns))
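The practical consequence of a default namespace is that an unqualified XPath no longer matches. A compact sketch, with an invented namespace url and prefix:

```r
library(xml2)

doc <- read_xml('<root><doc1 xmlns="http://foo.com"><baz/></doc1></root>')

# Give the automatically named default namespace (d1) a readable prefix.
ns <- xml_ns_rename(xml_ns(doc), d1 = "foo")

n_unqualified <- length(xml_find_all(doc, "//baz"))          # no match
n_qualified <- length(xml_find_all(doc, "//foo:baz", ns))    # one match
```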
Strip the default namespaces from a document
    xml_ns_strip(x)
x | A document, node, or node set. |
    x <- read_xml(
      "<foo xmlns = 'http://foo.com'> <baz/> <bar xmlns = 'http://bar.com'> <baz/> </bar> </foo>"
    )
    # Need to specify the default namespaces to find the baz nodes
    xml_find_all(x, "//d1:baz")
    xml_find_all(x, "//d2:baz")
    # After stripping the default namespaces you can find both baz nodes directly
    xml_ns_strip(x)
    xml_find_all(x, "//baz")
This is useful when you want to figure out where nodes matching an xpath expression live in a document.
    xml_path(x)
x | A document, node, or node set. |
A character vector.
    x <- read_xml("<foo><bar><baz /></bar><baz /></foo>")
    xml_path(xml_find_all(x, ".//baz"))
xml_add_sibling() and xml_add_child() are used to insert a node as a sibling or a child. xml_add_parent() adds a new parent in between the input node and the current parent. xml_replace() replaces an existing node with a new node. xml_remove() removes a node from the tree.
    xml_replace(.x, .value, ..., .copy = TRUE)
    xml_add_sibling(.x, .value, ..., .where = c("after", "before"), .copy = TRUE)
    xml_add_child(.x, .value, ..., .where = length(xml_children(.x)), .copy = TRUE)
    xml_add_parent(.x, .value, ...)
    xml_remove(.x, free = FALSE)
.x | a document, node or nodeset. |
.value | node to insert. |
... | If named, attributes or namespaces to set on the node; if unnamed, text to assign to the node. |
.copy | whether to copy the |
.where | to add the new node, for |
free | When removing the node also free the memory used for that node. Note if you use this option you cannot use any existing objects pointing to the node or its children, it is likely to crash R or return garbage. |
Care needs to be taken when using xml_remove(),
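Inserting and removing nodes can be sketched as follows. The document is invented, and `free` is left at its default of FALSE so that existing R objects pointing at the removed node stay safe to hold:

```r
library(xml2)

doc <- read_xml("<list><item>a</item><item>c</item></list>")
first <- xml_child(doc, 1)

# Insert a new <item> right after the first one (the default position).
xml_add_sibling(first, "item", "b")
after_insert <- xml_text(xml_children(doc))

# Remove the last item from the tree.
xml_remove(xml_child(doc, 3))
after_remove <- xml_text(xml_children(doc))
```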
Serializing XML objects to connections.
    xml_serialize(object, connection, ...)
    xml_unserialize(connection, ...)
object | R object to serialize. |
connection | an open connection or (for |
... | Additional arguments passed to |
For serialize, NULL unless connection = NULL, when the result is returned in a raw vector.
For unserialize, an R object.
    library(xml2)
    x <- read_xml("<a> <b><c>123</c></b> <b><c>456</c></b></a>")
    b <- xml_find_all(x, "//b")
    out <- xml_serialize(b, NULL)
    xml_unserialize(out)
The namespace to be set must be already defined in one of the node's ancestors.
    xml_set_namespace(.x, prefix = "", uri = "")
.x | a node |
prefix | The namespace prefix to use |
uri | The namespace URI to use |
the node (invisibly)
Show the structure of an html/xml document without displaying any of the values. This is useful if you want to get a high level view of the way a document is organised. Compared to xml_structure, html_structure prints the id and class attributes.
    xml_structure(x, indent = 2, file = "")
    html_structure(x, indent = 2, file = "")
x | HTML/XML document (or part thereof) |
indent | Number of spaces to indent |
file | a connection, or a character string naming the file to print to. If |
    xml_structure(read_xml("<a><b><c/><c/></b><d/></a>"))
    rproj <- read_html(system.file("extdata", "r-project.html", package = "xml2"))
    xml_structure(rproj)
    xml_structure(xml_find_all(rproj, ".//p"))
    h <- read_html("<body><p id = 'a'></p><p class = 'c d'></p></body>")
    html_structure(h)
xml_text returns a character vector, xml_double returns a numeric vector, xml_integer returns an integer vector.
    xml_text(x, trim = FALSE)
    xml_text(x) <- value
    xml_set_text(x, value)
    xml_double(x)
    xml_integer(x)
x | A document, node, or node set. |
trim | If |
value | character vector with replacement text. |
A character vector, the same length as x.
    x <- read_xml("<p>This is some text. This is <b>bold!</b></p>")
    xml_text(x)
    xml_text(xml_children(x))
    x <- read_xml("<x>This is some text. <x>This is some nested text.</x></x>")
    xml_text(x)
    xml_text(xml_find_all(x, "//x"))
    x <- read_xml("<p> Some text </p>")
    xml_text(x, trim = TRUE)
    # xml_double() and xml_integer() are useful for extracting numeric attributes
    x <- read_xml("<plot><point x='1' y='2' /><point x='2' y='1' /></plot>")
    xml_integer(xml_find_all(x, "//@x"))
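For non-integer values, xml_double can be used in the same way as xml_integer. A small sketch with made-up coordinates, alongside the effect of `trim`:

```r
library(xml2)

plot <- read_xml("<plot><point x='1.5'/><point x='2'/></plot>")
xs <- xml_double(xml_find_all(plot, "//@x"))   # attribute values as numbers

p <- read_xml("<p> Some text </p>")
raw_text <- xml_text(p)                  # surrounding whitespace is kept
clean_text <- xml_text(p, trim = TRUE)   # surrounding whitespace is dropped
```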
Determine the type of a node.
    xml_type(x)
x | A document, node, or node set. |
    x <- read_xml("<foo> a <b /> <![CDATA[ blah]]></foo>")
    xml_type(x)
    xml_type(xml_contents(x))
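One use of the node type is filtering a mixed set of contents down to a single kind of node. This is a sketch on an invented document containing text, an element and a comment:

```r
library(xml2)

x <- read_xml("<foo>a<b/><!--c--></foo>")
contents <- xml_contents(x)
types <- xml_type(contents)
types

# Keep only the element nodes.
elements <- contents[types == "element"]
xml_name(elements)
```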
This is useful for interpreting relative urls with url_relative().
    xml_url(x)
x | A node or document. |
A character vector of length 1. Returns NA if the name is not set.
    catalog <- read_xml(xml2_example("cd_catalog.xml"))
    xml_url(catalog)
    x <- read_xml("<foo/>")
    xml_url(x)
Validate an XML document against an XML 1.0 schema.
    xml_validate(x, schema)
x | A document, node, or node set. |
schema | an XML document containing the schema |
TRUE or FALSE
    # Example from https://msdn.microsoft.com/en-us/library/ms256129(v=vs.110).aspx
    doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
    schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
    xml_validate(doc, schema)
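A schema does not have to come from a file; a small one can be parsed from a string. The sketch below uses an invented one-element schema and two invented documents, one that conforms and one that does not:

```r
library(xml2)

# Hypothetical schema: the document must be a single <note> holding a string.
schema <- read_xml(
  '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
     <xs:element name="note" type="xs:string"/>
   </xs:schema>'
)

good <- read_xml("<note>hi</note>")
bad <- read_xml("<memo>hi</memo>")    # root element not declared in the schema

valid <- as.logical(xml_validate(good, schema))
invalid <- as.logical(xml_validate(bad, schema))
```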
xml2 comes bundled with a number of sample files in its ‘inst/extdata’ directory. This function makes them easy to access.
    xml2_example(path = NULL)
path | Name of file. If |