Arrow has a rich data type system that includes direct analogs of many R data types, and many data types that do not have a counterpart in R. This article describes the Arrow type system, compares it to R data types, and outlines the default mappings used when data are transferred between R and Arrow. At the end of the article there are two lookup tables: one describing the default “R to Arrow” type mappings and the other describing the “Arrow to R” mappings.
Motivating example
To illustrate the conversion that needs to take place, consider the differences between the output we obtain when we use dplyr::glimpse() to inspect the starwars data in its original format (a data frame in R) and the output we obtain when we convert it to an Arrow Table first by calling arrow_table():
    glimpse(starwars)
    ## Rows: 87
    ## Columns: 14
    ## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
    ## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
    ## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
    ## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
    ## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
    ## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
    ## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
    ## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
    ## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
    ## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
    ## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
    ## $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J~
    ## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
    ## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

    glimpse(arrow_table(starwars))
    ## Table
    ## 87 rows x 14 columns
    ## $ name       <string> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia~
    ## $ height     <int32> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180~
    ## $ mass       <double> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ~
    ## $ hair_color <string> "blond", NA, NA, "none", "brown", "brown, grey", "brown"~
    ## $ skin_color <string> "fair", "gold", "white, blue", "white", "light", "light"~
    ## $ eye_color  <string> "blue", "yellow", "red", "yellow", "brown", "blue", "blu~
    ## $ birth_year <double> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.~
    ## $ sex        <string> "male", "none", "none", "male", "female", "male", "femal~
    ## $ gender     <string> "masculine", "masculine", "masculine", "masculine", "fem~
    ## $ homeworld  <string> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan",~
    ## $ species    <string> "Human", "Droid", "Droid", "Human", "Human", "Human", "H~
    ## $ films      <list<...>> <"A New Hope", "The Empire Strikes Back", "Return of the~
    ## $ vehicles   <list<...>> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I~
    ## $ starships  <list<...>> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1~
    ## Call `print()` for full schema details

The data represented are essentially the same, but the descriptions of the data types for the columns have changed. For example:
- name is labelled <chr> (character vector) in the data frame; it is labelled <string> (a string type, also referred to as utf8 type) in the Arrow Table
- height is labelled <int> (integer vector) in the data frame; it is labelled <int32> (32 bit signed integer) in the Arrow Table
- mass is labelled <dbl> (numeric vector) in the data frame; it is labelled <double> (64 bit floating point number) in the Arrow Table
Some of these differences are purely cosmetic: integers in R are in fact 32 bit signed integers, so the underlying data types in Arrow and R are direct analogs of one another. In other cases the differences are purely about the implementation: Arrow and R have different ways to store a vector of strings, but at a high level of abstraction the R character type and the Arrow string type can be viewed as direct analogs. In some cases, however, there are no clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it does not have an analog of POSIXlt; conversely, while R can represent 32 bit signed integers, it does not have an equivalent of a 64 bit unsigned integer.
When the arrow package converts between R data and Arrow data, it will first check to see if a Schema has been provided (see schema() for more information), and if none is available it will attempt to guess the appropriate type by following the default mappings. A complete listing of these mappings is provided at the end of the article, but the most common cases are depicted in the illustration below:

In this image, black boxes refer to R data types and light blue boxes refer to Arrow data types. Directional arrows specify conversions (e.g., the bidirectional arrow between the logical R type and the boolean Arrow type means that an R logical converts to an Arrow boolean and vice versa). Solid lines indicate that this conversion rule is always the default; dashed lines mean that it only sometimes applies (the rules and special cases are described below).
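The Schema check described above can be sketched as follows. This is a minimal illustration, assuming the arrow package is installed; the data frame and its column names (id, label) are made up for the example.

```r
library(arrow)

# A small, hypothetical data frame used only for illustration
df <- data.frame(id = c(1L, 2L, 3L), label = c("a", "b", "c"))

# No Schema supplied: arrow falls back on the default mappings
default_tbl <- arrow_table(df)
default_tbl$schema

# Schema supplied: the requested types are used instead of the defaults
custom_tbl <- arrow_table(
  df,
  schema = schema(id = int64(), label = utf8())
)
custom_tbl$schema
```

Comparing the two printed schemas should show the id column stored as int32 in the first Table and as int64 in the second.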
Logical/boolean types
Arrow and R both use three-valued logic. In R, logical values can be TRUE or FALSE, with NA used to represent missing data. In Arrow, the corresponding boolean type can take values true, false, or null, as shown below:
    chunked_array(c(TRUE, FALSE, NA), type = boolean()) # default
    ## ChunkedArray
    ## <bool>
    ## [
    ##   [
    ##     true,
    ##     false,
    ##     null
    ##   ]
    ## ]

It is not strictly necessary to set type = boolean() in this example because the default behavior in arrow is to translate R logical vectors to Arrow booleans and vice versa. However, for the sake of clarity we will specify the data types explicitly throughout this article. We will likewise use chunked_array() to create Arrow data from R objects and as.vector() to create R data from Arrow objects, but similar results are obtained if we use other methods.
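As a small sketch of the round trip just described (assuming the arrow package is installed), the following compares an explicitly typed Chunked Array with one whose type is inferred, and then converts back to R:

```r
library(arrow)

x <- c(TRUE, FALSE, NA)

explicit <- chunked_array(x, type = boolean())  # type stated explicitly
inferred <- chunked_array(x)                    # type chosen by the default mapping

# Both objects should report the same Arrow type
explicit$type$ToString()
inferred$type$ToString()

# Converting back yields an R logical vector, with null mapped to NA
identical(as.vector(inferred), x)
```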
Integer types
Base R natively supports only one type of integer, using 32 bits to represent signed numbers between -2147483648 and 2147483647, though R can also support 64 bit integers via the bit64 package. Arrow inherits signed and unsigned integer types from C++ in 8 bit, 16 bit, 32 bit, and 64 bit versions:
| Description | Data Type Function | Smallest Value | Largest Value |
|---|---|---|---|
| 8 bit unsigned | uint8() | 0 | 255 |
| 16 bit unsigned | uint16() | 0 | 65535 |
| 32 bit unsigned | uint32() | 0 | 4294967295 |
| 64 bit unsigned | uint64() | 0 | 18446744073709551615 |
| 8 bit signed | int8() | -128 | 127 |
| 16 bit signed | int16() | -32768 | 32767 |
| 32 bit signed | int32() | -2147483648 | 2147483647 |
| 64 bit signed | int64() | -9223372036854775808 | 9223372036854775807 |
By default, arrow translates R integers to the int32 type in Arrow, but you can override this by explicitly specifying another integer type:
    chunked_array(c(10L, 3L, 200L), type = int32()) # default
    ## ChunkedArray
    ## <int32>
    ## [
    ##   [
    ##     10,
    ##     3,
    ##     200
    ##   ]
    ## ]

    chunked_array(c(10L, 3L, 200L), type = int64())
    ## ChunkedArray
    ## <int64>
    ## [
    ##   [
    ##     10,
    ##     3,
    ##     200
    ##   ]
    ## ]

If the value in R does not fall within the permissible range for the corresponding Arrow type, arrow throws an error:
    chunked_array(c(10L, 3L, 200L), type = int8())
    ## Error: Invalid: value outside of range

When translating from Arrow to R, integer types always translate to R integers unless one of the following exceptions applies:
- If the value of an Arrow uint32 or uint64 falls outside the range allowed for R integers, the result will be a numeric vector in R
- If the value of an Arrow int64 variable falls outside the range allowed for R integers, the result will be a bit64::integer64 vector in R
- If the user sets options(arrow.int64_downcast = FALSE), the Arrow int64 type always yields a bit64::integer64 vector in R regardless of the value
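The three exceptions listed above can be explored with a short sketch. It assumes the arrow and bit64 packages are installed; the specific values are arbitrary choices that fall inside or outside the 32 bit signed range.

```r
library(arrow)

# int64 values that fit in 32 bits: expected to come back as a plain R integer
small_int64 <- chunked_array(c(10L, 3L, 200L), type = int64())
class(as.vector(small_int64))

# An int64 value beyond the int32 range, created from a bit64::integer64 vector
big_int64 <- chunked_array(bit64::as.integer64("3000000000"))
class(as.vector(big_int64))

# A uint32 value beyond the int32 range: expected to come back as a double
big_uint32 <- chunked_array(3e9)$cast(uint32())
class(as.vector(big_uint32))

# With downcasting switched off, int64 should always yield integer64
old_opts <- options(arrow.int64_downcast = FALSE)
no_downcast_class <- class(as.vector(small_int64))
options(old_opts)  # restore the previous setting
no_downcast_class
```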
Floating point numeric types
R has one double-precision (64 bit) numeric type, which translates to the Arrow 64 bit floating point type by default. Arrow supports both single-precision (32 bit) and double-precision (64 bit) floating point numbers, specified using the float32() and float64() data type functions. Both of these are translated to doubles in R. Examples are shown below:
    chunked_array(c(0.1, 0.2, 0.3), type = float64()) # default
    ## ChunkedArray
    ## <double>
    ## [
    ##   [
    ##     0.1,
    ##     0.2,
    ##     0.3
    ##   ]
    ## ]

    chunked_array(c(0.1, 0.2, 0.3), type = float32())
    ## ChunkedArray
    ## <float>
    ## [
    ##   [
    ##     0.1,
    ##     0.2,
    ##     0.3
    ##   ]
    ## ]

    arrow_double <- chunked_array(c(0.1, 0.2, 0.3), type = float64())
    as.vector(arrow_double)
    ## [1] 0.1 0.2 0.3

Note that the Arrow specification also permits half-precision (16 bit) floating point numbers, but these have not yet been implemented.
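One practical consequence of the float32 mapping is that a value stored in single precision and translated back to R is still a double, but not necessarily the same double. A minimal sketch, assuming the arrow package is installed:

```r
library(arrow)

x <- 0.1

# Store the value in single precision, then translate it back to R
single <- chunked_array(x, type = float32())
back <- as.vector(single)

is.double(back)   # the result is an R double
back == x         # FALSE: the nearest float32 to 0.1 is not the nearest double
abs(back - x)     # the difference is tiny but non-zero
```

If exact round trips matter, keeping the default float64 type avoids this loss of precision.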
Fixed point decimal types
Arrow also contains decimal() data types, in which numeric values are specified in decimal format rather than binary. Decimals in Arrow come in four varieties (32 bit, 64 bit, 128 bit, and 256 bit versions), but in most cases users should be able to use the more general decimal() data type function rather than the specific decimal32(), decimal64(), decimal128(), and decimal256() functions.
The decimal types in Arrow are fixed-precision numbers (rather than floating-point), which means it is necessary to explicitly specify the precision and scale arguments:
- precision specifies the number of significant digits to store.
- scale specifies the number of digits that should be stored after the decimal point. If you set scale = 2, exactly two digits will be stored after the decimal point. If you set scale = 0, values will be rounded to the nearest whole number. Negative scales are also permitted (handy when dealing with extremely large numbers), so scale = -2 stores the value to the nearest 100.
Because R does not have any way to create decimal types natively, the example below is a little circuitous. First we create some floating point numbers as Chunked Arrays, and then explicitly cast these to decimal types within Arrow. This is possible because Chunked Array objects possess a cast() method:
    arrow_floating <- chunked_array(c(.01, .1, 1, 10, 100))
    arrow_decimals <- arrow_floating$cast(decimal(precision = 5, scale = 2))
    arrow_decimals
    ## ChunkedArray
    ## <decimal32(5, 2)>
    ## [
    ##   [
    ##     0.01,
    ##     0.10,
    ##     1.00,
    ##     10.00,
    ##     100.00
    ##   ]
    ## ]

Though not natively used in R, decimal types can be useful in situations where it is especially important to avoid problems that arise in floating point arithmetic.
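To make the floating point issue concrete, here is a hedged sketch that first shows a well-known surprise in base R and then stores some values as fixed-point decimals. It assumes the arrow package is installed; decimal128() is used rather than decimal() purely so that the storage width is explicit, and the example values are arbitrary.

```r
library(arrow)

# Binary floating point cannot represent these values exactly
0.1 + 0.2 == 0.3   # FALSE

# Store prices with exactly two digits after the decimal point
prices <- chunked_array(c(0.10, 0.20, 19.99))
prices_decimal <- prices$cast(decimal128(precision = 6, scale = 2))
prices_decimal$type$ToString()

# Translating back to R gives ordinary doubles again
prices_r <- as.vector(prices_decimal)
prices_r
```

Because the values return to R as doubles, any exactness guarantees apply only while the data remain in Arrow.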
String/character types
R uses a single character type to represent strings whereas Arrow has two types. In the Arrow C++ library these types are referred to as strings and large_strings, but to avoid ambiguity in the arrow R package they are defined using the utf8() and large_utf8() data type functions. The distinction between these two Arrow types is unlikely to be important for R users, though the difference is discussed in the article on data object layout.
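A brief sketch of the two string types, assuming the arrow package is installed. It creates the same data under both types and checks that each translates back to the same R character vector:

```r
library(arrow)

words <- c("oh", "well", "whatever")

regular <- chunked_array(words, type = utf8())
large <- chunked_array(words, type = large_utf8())

# The Arrow types differ...
regular$type$ToString()
large$type$ToString()

# ...but the R representation is the same character vector in both cases
identical(as.vector(regular), as.vector(large))
```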
The default behavior is to translate R character vectors to the utf8/string type, and to translate both Arrow types to R character vectors:
    strings <- chunked_array(c("oh", "well", "whatever"))
    strings
    ## ChunkedArray
    ## <string>
    ## [
    ##   [
    ##     "oh",
    ##     "well",
    ##     "whatever"
    ##   ]
    ## ]

    as.vector(strings)
    ## [1] "oh"       "well"     "whatever"

Factor/dictionary types
The analog of R factors in Arrow is the dictionary type. Factors translate to dictionaries and vice versa. To illustrate this, let’s create a small factor object in R:
    fct <- factor(c("cat", "dog", "pig", "dog"))
    fct
    ## [1] cat dog pig dog
    ## Levels: cat dog pig

When translated to Arrow, this is the dictionary that results:
    dict <- chunked_array(fct, type = dictionary())
    dict
    ## ChunkedArray
    ## <dictionary<values=string, indices=int32>>
    ## [
    ##
    ##   -- dictionary:
    ##     [
    ##       "cat",
    ##       "dog",
    ##       "pig"
    ##     ]
    ##   -- indices:
    ##     [
    ##       0,
    ##       1,
    ##       2,
    ##       1
    ##     ]
    ## ]

When translated back to R, we recover the original factor:
    as.vector(dict)
    ## [1] cat dog pig dog
    ## Levels: cat dog pig

Arrow dictionaries are slightly more flexible than R factors: values in a dictionary do not necessarily have to be strings, but labels in a factor do. As a consequence, non-string values in an Arrow dictionary are coerced to strings when translated to R.
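The coercion of non-string dictionary values can be sketched as follows. This assumes the arrow package is installed and builds an integer-valued dictionary with the Arrow compute function "dictionary_encode" (invoked through call_function()); that route is simply one convenient way to obtain such an array, not the only one.

```r
library(arrow)

# An ordinary integer Array with repeated values
codes <- Array$create(c(10L, 20L, 10L))

# Dictionary-encode it: the dictionary values are integers, not strings
dict_int <- call_function("dictionary_encode", codes)
dict_int$type$ToString()

# Translating to R produces a factor; arrow may warn that the values
# are being coerced to character levels, so the warning is suppressed here
fct_back <- suppressWarnings(as.vector(dict_int))
levels(fct_back)
```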
Date types
In R, dates are typically represented using the Date class. Internally a Date object is a numeric type whose value counts the number of days since the beginning of the Unix epoch (1 January 1970). Arrow supplies two data types that can be used to represent dates: the date32 type and the date64 type. The date32 type is similar to the Date class in R: internally it stores a 32 bit integer that counts the number of days since 1 January 1970. The default in arrow is to translate R Date objects to Arrow date32 types:
    nirvana_album_dates <- as.Date(c("1989-06-15", "1991-09-24", "1993-09-13"))
    nirvana_album_dates
    ## [1] "1989-06-15" "1991-09-24" "1993-09-13"

    nirvana_32 <- chunked_array(nirvana_album_dates, type = date32()) # default
    nirvana_32
    ## ChunkedArray
    ## <date32[day]>
    ## [
    ##   [
    ##     1989-06-15,
    ##     1991-09-24,
    ##     1993-09-13
    ##   ]
    ## ]

Arrow also supplies a higher-precision date64 type, in which the date is represented as a 64 bit integer that encodes the number of milliseconds since 1970-01-01 00:00 UTC:
    nirvana_64 <- chunked_array(nirvana_album_dates, type = date64())
    nirvana_64
    ## ChunkedArray
    ## <date64[ms]>
    ## [
    ##   [
    ##     1989-06-15,
    ##     1991-09-24,
    ##     1993-09-13
    ##   ]
    ## ]

The translation from Arrow to R differs. Internally the date32 type is very similar to an R Date, so these objects are translated to R as Dates:
    class(as.vector(nirvana_32))
    ## [1] "Date"

However, because date64 types are specified to millisecond-level precision, they are translated to R as POSIXct times to avoid the possibility of losing relevant information:
    class(as.vector(nirvana_64))
    ## [1] "POSIXct" "POSIXt"

Temporal/timestamp types
In R there are two classes used to represent date and time information, POSIXct and POSIXlt. Arrow only has one: the timestamp type. Arrow timestamps are loosely analogous to the POSIXct class. Internally, a POSIXct object represents the date as a numeric variable that stores the number of seconds since 1970-01-01 00:00 UTC. Internally, an Arrow timestamp is a 64 bit integer counting the number of time units (seconds, milliseconds, microseconds, or nanoseconds, depending on the unit of the type) since 1970-01-01 00:00 UTC.
Arrow and R both support timezone information, but display it differently in the printed object. In R, local time is printed with the timezone name adjacent to it:
    sydney_newyear <- as.POSIXct("2000-01-01 00:01", tz = "Australia/Sydney")
    sydney_newyear
    ## [1] "2000-01-01 00:01:00 AEDT"

When translated to Arrow, this POSIXct object becomes an Arrow timestamp object. When printed, however, the temporal instant is always displayed in UTC rather than local time:
    sydney_newyear_arrow <- chunked_array(sydney_newyear, type = timestamp())
    sydney_newyear_arrow
    ## ChunkedArray
    ## <timestamp[s]>
    ## [
    ##   [
    ##     1999-12-31 13:01:00
    ##   ]
    ## ]

The temporal instant itself is not lost, which we can easily see by translating the sydney_newyear_arrow object back to an R POSIXct object. Note that in this example the type was given as timestamp() with no timezone argument, and the result is displayed in UTC:
    as.vector(sydney_newyear_arrow)
    ## [1] "1999-12-31 13:01:00 UTC"

For POSIXlt objects the behaviour is different. Internally a POSIXlt object is a list specifying the “local time” in terms of a variety of human-relevant fields. There is no analogous class to this in Arrow, so the default behaviour is to translate it to an Arrow struct.
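Returning to the POSIXct example, the following sketch shows what happens to the timezone when no explicit type is supplied and arrow infers the timestamp type from the R object. It assumes the arrow package is installed and that the system timezone database includes "Australia/Sydney".

```r
library(arrow)

sydney_newyear <- as.POSIXct("2000-01-01 00:01", tz = "Australia/Sydney")

# No explicit type: the timestamp type is inferred from the POSIXct object
inferred <- chunked_array(sydney_newyear)
inferred$type$ToString()

# Translate back to R and inspect the timezone attribute
round_trip <- as.vector(inferred)
attr(round_trip, "tzone")

# The underlying instant is the same regardless of how it is displayed
as.numeric(round_trip) == as.numeric(sydney_newyear)
```

A timezone can also be requested explicitly through the timezone argument of timestamp().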
Time of day types
Base R does not have a class to represent the time of day independent of the date (i.e., it is not possible to specify “3pm” without referring to a specific day), but it can be done with the help of the hms package. Internally, hms objects are always stored as the number of seconds since 00:00:00.
Arrow has two data types for this purpose. For time32 types, data are stored as a 32 bit integer that is interpreted either as the number of seconds or the number of milliseconds since 00:00:00. Note the difference between the following:
    time_of_day <- hms::hms(56, 34, 12)
    chunked_array(time_of_day, type = time32(unit = "s"))
    ## ChunkedArray
    ## <time32[s]>
    ## [
    ##   [
    ##     12:34:56
    ##   ]
    ## ]

    chunked_array(time_of_day, type = time32(unit = "ms"))
    ## ChunkedArray
    ## <time32[ms]>
    ## [
    ##   [
    ##     12:34:56.000
    ##   ]
    ## ]

A time64 object is similar, but stores the time of day using a 64 bit integer and can represent the time at higher precision. It is possible to choose microseconds (unit = "us") or nanoseconds (unit = "ns"), as shown below:
    chunked_array(time_of_day, type = time64(unit = "us"))
    ## ChunkedArray
    ## <time64[us]>
    ## [
    ##   [
    ##     12:34:56.000000
    ##   ]
    ## ]

    chunked_array(time_of_day, type = time64(unit = "ns"))
    ## ChunkedArray
    ## <time64[ns]>
    ## [
    ##   [
    ##     12:34:56.000000000
    ##   ]
    ## ]

All versions of time32 and time64 objects in Arrow translate to hms times in R.
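As a quick check of that last statement, this sketch (assuming the arrow and hms packages are installed) sends the same time of day through a time32 and a time64 type and inspects what comes back:

```r
library(arrow)

time_of_day <- hms::hms(56, 34, 12)  # 12:34:56

as_seconds <- chunked_array(time_of_day, type = time32(unit = "s"))
as_nanos <- chunked_array(time_of_day, type = time64(unit = "ns"))

r_from_seconds <- as.vector(as_seconds)
r_from_nanos <- as.vector(as_nanos)

# Both results should carry the hms class
class(r_from_seconds)
class(r_from_nanos)

# hms stores seconds since midnight: 12 * 3600 + 34 * 60 + 56
as.numeric(r_from_nanos)
```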
Duration types
Lengths of time are represented as difftime objects in R. The analogous data type in Arrow is the duration type. A duration type is stored as a 64 bit integer, which can represent the number of seconds (the default, unit = "s"), milliseconds (unit = "ms"), microseconds (unit = "us"), or nanoseconds (unit = "ns"). To illustrate this we’ll create a difftime in R corresponding to 278 seconds:
    len <- as.difftime(278, units = "secs")
    len
    ## Time difference of 278 secs

The translation to Arrow looks like this:
    chunked_array(len, type = duration(unit = "s")) # default
    ## ChunkedArray
    ## <duration[s]>
    ## [
    ##   [
    ##     278
    ##   ]
    ## ]

    chunked_array(len, type = duration(unit = "ns"))
    ## ChunkedArray
    ## <duration[ns]>
    ## [
    ##   [
    ##     278000000000
    ##   ]
    ## ]

Regardless of the underlying unit, duration objects in Arrow translate to difftime objects in R.
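The round trip can be sketched as below, assuming the arrow package is installed. The duration is stored in nanoseconds inside Arrow and then translated back to R:

```r
library(arrow)

len <- as.difftime(278, units = "secs")

# Store the duration at nanosecond resolution in Arrow
len_ns <- chunked_array(len, type = duration(unit = "ns"))

# Translate back to R and express the result in seconds
back <- as.vector(len_ns)
class(back)
as.numeric(back, units = "secs")
```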
List of default translations
The discussion above covers the most common cases. The two tables in this section provide a more complete list of how arrow translates between R data types and Arrow data types. In these tables, entries with a - are not currently implemented.
Translations from R to Arrow
| Original R type | Arrow type after translation |
|---|---|
| logical | boolean |
| integer | int32 |
| double (“numeric”) | float64 (1) |
| character | utf8 (2) |
| factor | dictionary |
| raw | uint8 |
| Date | date32 |
| POSIXct | timestamp |
| POSIXlt | struct |
| data.frame | struct |
| list (3) | list |
| bit64::integer64 | int64 |
| hms::hms | time32 |
| difftime | duration |
| vctrs::vctrs_unspecified | null |
1: float64 and double are the same concept and data type in Arrow C++; however, only float64() is used in arrow as the function double() already exists in base R

2: If the character vector exceeds 2GB of strings, it will be converted to a large_utf8 Arrow type

3: Only lists where all elements are the same type are able to be translated to Arrow list type (which is a “list of” some type).
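The default mappings in this table can be inspected directly. The sketch below assumes a version of the arrow package that provides infer_type(); the list of example values is illustrative only.

```r
library(arrow)

# One example value per R type of interest
r_values <- list(
  logical = TRUE,
  integer = 1L,
  double = 1.5,
  character = "a",
  factor = factor("a"),
  raw = as.raw(1),
  Date = Sys.Date(),
  difftime = as.difftime(1, units = "secs")
)

# Ask arrow which Arrow type it would choose for each value by default
arrow_types <- vapply(
  r_values,
  function(x) infer_type(x)$ToString(),
  character(1)
)
arrow_types
```

The printed names are the Arrow C++ spellings, so float64 appears as double, boolean as bool, and utf8 as string.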
Translations from Arrow to R
| Original Arrow type | R type after translation |
|---|---|
| boolean | logical |
| int8 | integer |
| int16 | integer |
| int32 | integer |
| int64 | integer (1) |
| uint8 | integer |
| uint16 | integer |
| uint32 | integer (1) |
| uint64 | integer (1) |
| float16 | - (2) |
| float32 | double |
| float64 | double |
| utf8 | character |
| large_utf8 | character |
| binary | arrow_binary (3) |
| large_binary | arrow_large_binary (3) |
| fixed_size_binary | arrow_fixed_size_binary (3) |
| date32 | Date |
| date64 | POSIXct |
| time32 | hms::hms |
| time64 | hms::hms |
| timestamp | POSIXct |
| duration | difftime |
| decimal | double |
| dictionary | factor (4) |
| list | arrow_list (5) |
| large_list | arrow_large_list (5) |
| fixed_size_list | arrow_fixed_size_list (5) |
| struct | data.frame |
| null | vctrs::vctrs_unspecified |
| map | arrow_list (5) |
| union | - (2) |
1: These integer types may contain values that exceed the range of R’s integer type (32 bit signed integer). When they do, uint32 and uint64 are converted to double (“numeric”) and int64 is converted to bit64::integer64. This conversion can be disabled (so that int64 always yields a bit64::integer64 vector) by setting options(arrow.int64_downcast = FALSE).

2: Some Arrow data types do not currently have an R equivalent and will raise an error if cast to or mapped to via a schema.

3: arrow*_binary classes are implemented as lists of raw vectors.

4: Due to the limitation of R factors, Arrow dictionary values are coerced to string when translated to R if they are not already strings.

5: arrow*_list classes are implemented as subclasses of vctrs_list_of with a ptype attribute set to what an empty Array of the value type converts to.
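Footnote 5 can be illustrated with a short sketch, assuming the arrow package is installed. It builds an Arrow list from an R list of integer vectors and inspects the class and ptype of the result after translating back to R:

```r
library(arrow)

# An R list whose elements all share one type (integer)
nested <- chunked_array(list(1:3, 4:5, integer(0)))
nested$type$ToString()

# Translate back to R and look at how the list is represented
r_list <- as.vector(nested)
class(r_list)
attr(r_list, "ptype")
```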
Further reading
- To learn more about how data types are specified through schema() metadata, see the metadata article.
- For additional details on data types, see the data types article.