
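The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column has the appropriate type. Parsing takes place in three basic stages: the flat file is split into a rectangular matrix of strings, the type of each column is determined, and each column of strings is parsed into a vector of a more specific type.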

It’s easiest to learn how this works in the opposite order. Below, you’ll learn how the vector parsers, the column specification, and the rectangular parsers work, in that order.

Each parse_*() function is coupled with a col_*() function, which will be used in the process of parsing a complete tibble.

Vector parsers

It’s easiest to learn the vector parsers using the parse_ functions. These all take a character vector and some options. They return a new vector the same length as the old, along with an attribute describing any problems.
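As a minimal sketch of that "attribute describing any problems", the example below parses a vector with one deliberately invalid element (the input values are made up for illustration) and then inspects the result with problems():

```r
library(readr)

# "abc" cannot be parsed as an integer, so that element becomes NA
# and a warning reports the parsing failure.
x <- parse_integer(c("1", "abc", "3"))

# The result has the same length as the input.
length(x)
#> [1] 3

# problems() retrieves the attribute describing each failure.
problems(x)
```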

Atomic vectors

parse_logical(), parse_integer(), parse_double(), and parse_character() are straightforward parsers that produce the corresponding atomic vector.

parse_integer(c("1", "2", "3"))
#> [1] 1 2 3
parse_double(c("1.56", "2.34", "3.56"))
#> [1] 1.56 2.34 3.56
parse_logical(c("true", "false"))
#> [1]  TRUE FALSE

By default, readr expects . as the decimal mark and , as the grouping mark. You can override this default using locale(), as described in vignette("locales").
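For instance, a short sketch of overriding the marks for numbers written in a continental European style (the input strings are illustrative only):

```r
library(readr)

# Treat "," as the decimal mark.
parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

# Set both marks explicitly: "." groups digits, "," marks the decimals.
parse_number("1.234,5", locale = locale(decimal_mark = ",", grouping_mark = "."))
#> [1] 1234.5
```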

Flexible numeric parser

parse_integer() and parse_double() are strict: the input string must be a single number with no leading or trailing characters. parse_number() is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:

parse_number(c("0%", "10%", "150%"))
#> [1]   0  10 150
parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50   12.45

Date/times

readr supports three types of date/time data:

dates: number of days since 1970-01-01.
times: number of seconds since midnight.
datetimes: number of seconds since midnight 1970-01-01.

parse_datetime("2010-10-01 21:45")
#> [1] "2010-10-01 21:45:00 UTC"
parse_date("2010-10-01")
#> [1] "2010-10-01"
parse_time("1:00pm")
#> 13:00:00

Each function takes a format argument which describes the format of the string. If not specified, it uses a default value:

parse_datetime() recognises ISO8601 datetimes.
parse_date() uses the date_format specified by the locale(). The default value is %AD which uses an automatic date parser that recognises dates of the format Y-m-d or Y/m/d.
parse_time() uses the time_format specified by the locale(). The default value is %At which uses an automatic time parser that recognises times of the form H:M optionally followed by seconds and am/pm.

In most cases, you will need to supply a format, as documented in parse_datetime():

parse_datetime("1 January, 2010", "%d %B, %Y")
#> [1] "2010-01-01 UTC"
parse_datetime("02/02/15", "%m/%d/%y")
#> [1] "2015-02-02 UTC"
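The format matters because the same string can be read in more than one way. The sketch below, using made-up input strings, shows an ambiguous date parsed under two formats, and a month name in French parsed by combining a format with a locale:

```r
library(readr)

# Same string, two different interpretations depending on the format.
parse_date("01/02/15", "%m/%d/%y")
#> [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
#> [1] "2015-02-01"

# %B matches full month names in the language of the locale.
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"
```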

Factors

When reading a column that has a known set of values, you can read directly into a factor. parse_factor() will generate a warning if a value is not in the supplied levels.

parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c
parse_factor(c("a", "b", "d"), levels = c("a", "b", "c"))
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set      d
#> [1] a    b    <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr>
#> 1     3    NA value in level set d
#> Levels: a b c

Column specification

It would be tedious if you had to specify the type of every column when reading a file. Instead, readr uses some heuristics to guess the type of each column. You can access these results yourself using guess_parser():

guess_parser(c("a", "b", "c"))
#> [1] "character"
guess_parser(c("1", "2", "3"))
#> [1] "double"
guess_parser(c("1,000", "2,000", "3,000"))
#> [1] "number"
guess_parser(c("2001/10/10"))
#> [1] "date"

The guessing policies are described in the documentation for the individual functions. Guesses are fairly strict. For example, we don’t guess that currencies are numbers, even though we can parse them:

guess_parser("$1,234")
#> [1] "character"
parse_number("$1,234")
#> [1] 1234

There are two parsers that will never be guessed: col_skip() and col_factor(). You will always need to supply these explicitly.

You can see the specification that readr would generate for a file by using spec_csv(), spec_tsv() and so on:

x <- spec_csv(readr_example("challenge.csv"))

For bigger files, you can often make the specification simpler by changing the default column type using cols_condense():

mtcars_spec <- spec_csv(readr_example("mtcars.csv"))
mtcars_spec
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
cols_condense(mtcars_spec)
#> cols(
#>   .default = col_double()
#> )

By default, readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in challenge.csv the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows used for guessing:

x <- spec_csv(readr_example("challenge.csv"), guess_max = 1001)

Another way is to manually specify the col_types, as described below.

Rectangular parsers

readr comes with five parsers for rectangular file formats:

read_csv() and read_csv2() for csv files
read_tsv() for tab-separated files
read_fwf() for fixed-width files
read_log() for web log files

Each of these functions first calls spec_xxx() (as described above), and then parses the file according to that column specification:

df1 <- read_csv(readr_example("challenge.csv"))
#> Rows: 2000 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (1): x
#> date (1): y
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Instead, readr will generate a data frame of problems. The first few will be printed out, and you can access them all with problems():

problems(df1)
#> # A tibble: 0 × 5
#> # ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

You’ve already seen one way of handling bad guesses: increasing the number of rows used to guess the type of each column.

df2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
#> Rows: 2000 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (1): x
#> date (1): y
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Another approach is to manually supply the column specification.

Overriding the defaults

In the previous examples, you may have noticed that readr printed the column specification that it used to parse the file:

#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

You can also access it after the fact using spec():

spec(df1)
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
spec(df2)
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

(This also allows you to access the full column specification if you’re reading a very wide file. By default, readr will only print the specification of the first 20 columns.)

If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.

df3 <- read_csv(
  readr_example("challenge.csv"),
  col_types = list(
    x = col_double(),
    y = col_date(format = "")
  )
)

In general, it’s good practice to supply an explicit column specification. It is more work, but it ensures that you get warnings if the data changes in unexpected ways. To be really strict, you can use stop_for_problems(df3). This will throw an error if there are any parsing problems, forcing you to fix those problems before proceeding with the analysis.
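A rough sketch of that strict workflow follows. It assumes a small hypothetical file, written to a temporary path, whose second value does not match the declared integer type; the tryCatch() wrapper is only there so the example runs to completion instead of halting on the error.

```r
library(readr)

# Hypothetical file: the value "oops" is not a valid integer.
path <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "oops"), path)

# Reading succeeds, but the bad value is recorded as a parsing problem.
df <- read_csv(path, col_types = cols(x = col_integer()))

# stop_for_problems() turns recorded parsing problems into an error,
# which is caught here and converted to a message string.
result <- tryCatch(
  {
    stop_for_problems(df)
    "no problems"
  },
  error = function(e) "parsing problems found"
)
result
```

In a real analysis script you would normally call stop_for_problems() directly, so that the script halts as soon as the data stops matching the specification.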

Available column specifications

The available specifications are (with string abbreviations in brackets):

col_logical() [l], containing only T, F, TRUE or FALSE.
col_integer() [i], integers.
col_double() [d], doubles.
col_character() [c], everything else.
col_factor(levels, ordered) [f], a fixed set of values.
col_date(format = "") [D]: with the locale’s date_format.
col_time(format = "") [t]: with the locale’s time_format.
col_datetime(format = "") [T]: ISO8601 date times.
col_number() [n], numbers containing the grouping_mark.
col_skip() [_, -], don’t import this column.
col_guess() [?], parse using the “best” type based on the input.

Use the col_types argument to override the default choices. There are two ways to use it:

With a string: "dc__d": read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with types that take additional parameters.)

With a (named) list of col objects:

read_csv("iris.csv", col_types = list(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))

Or, with their abbreviations:

read_csv("iris.csv", col_types = list(
  Sepal.Length = "d",
  Sepal.Width = "d",
  Petal.Length = "d",
  Petal.Width = "d",
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))

Any omitted columns will be parsed automatically, so the previous call will lead to the same result as:

read_csv("iris.csv", col_types = list(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))

You can also set a default type that will be used instead of relying on the automatic detection for columns you don’t specify:

read_csv("iris.csv", col_types = list(
  Species = col_factor(c("setosa", "versicolor", "virginica")),
  .default = col_double()
))

If you only want to read specified columns, use cols_only():

read_csv("iris.csv", col_types = cols_only(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
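The compact string form of col_types described above has no example of its own, so here is a minimal sketch. The five-column file and its contents are hypothetical and are written to a temporary path so that the example is self-contained:

```r
library(readr)

# Hypothetical five-column file written to a temporary path.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e", "1.5,x,9,9,2", "2.5,y,9,9,4"), path)

# One character per column:
# d = double, c = character, _ = skip, _ = skip, d = double
df <- read_csv(path, col_types = "dc__d")

# The two skipped columns (c and d) are not imported.
names(df)
#> [1] "a" "b" "e"
```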

Output

The output of all these functions is a tibble. Note that characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE) and column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Row names are never set.

Attributes store the column specification (spec()) and any parsing problems (problems()).
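To make these properties concrete, the sketch below reads a small hypothetical file whose headers are not valid R identifiers. It assumes a readr version that supports the show_col_types argument, which is used here only to silence the column specification message:

```r
library(readr)

# Hypothetical file with headers that are not syntactic R names.
path <- tempfile(fileext = ".csv")
writeLines(c("first name,2020", "Ada,1"), path)

df <- read_csv(path, show_col_types = FALSE)

# Column names are kept exactly as they appear in the file.
names(df)
#> [1] "first name" "2020"

# Text columns stay character; they are not converted to factors.
class(df$`first name`)
#> [1] "character"

# The column specification travels with the tibble.
spec(df)
```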