- stringr::str_replace_na() binding implemented (#47521).
- hms::hms() bindings (#47278).
- hms::hms() and hms::as_hms() bindings implemented to create and manipulate time-of-day variables (#46206).
- atan(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh(), and expm1() bindings added (#44953).
- check_directory_existence_before_creation option in S3FileSystem to reduce I/O calls on cloud storage.
- case_when() now correctly detects objects that are not in the global environment.
- blob::blob is now supported in addition to arrow_binary when binary data is converted to R objects. This change is the first step in eventually deprecating the arrow_binary class in favor of the blob class in the blob package (see GH-45709).

This release primarily updates the underlying Arrow C++ version used by the package to version 19.0.1 and includes all changes from the 19.0.0 and 19.0.1 releases. For what's changed in Arrow C++ 19.0.0, please see the blog post and changelog. For what's changed in Arrow C++ 19.0.1, please see the blog post and changelog.
- %in% (#43446).
- str_sub binding updated to properly handle negative end values.
- User-written functions that wrap supported bindings are now translated: previously time_hours <- function(mins) mins / 60 worked, but time_hours_rounded <- function(mins) round(mins / 60) did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see register_scalar_function(). (#41223)
- mutate() expressions can now include aggregations, such as x - mean(x). (#41350)
- summarize() supports more complex expressions, and correctly handles cases where column names are reused in expressions. (#41223)
- The na_matches argument to the dplyr::*_join() functions is now supported. This argument controls whether NA values are considered equal when joining. (#41358)
- When using pull on grouped datasets, it now returns the expected column. (#43172)
- Bindings for base::prod have been added so you can now use it in your dplyr pipelines (i.e., tbl |> summarize(prod(col))) without having to pull the data into R.
- Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL (#38377).
- The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow::, which makes the output more portable (@orgadish, #38144).
- SystemRequirements (#39602).
- Improved handling when sub, gsub, stringr::str_replace, or stringr::str_replace_all are passed a length > 1 vector of values in pattern (@abfleishman, #39219).
- Documentation added to ?open_dataset describing how to use the ND-JSON support added in arrow 13.0.0.
- For S3 file systems (s3_bucket, S3FileSystem), the debug log level for S3 can be set with the AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem for more information. (#38267)
- Using arrow with duckdb (to_duckdb()) no longer results in warnings when quitting your R session. (#38495)
- Set LIBARROW_BINARY=true for the old behavior (#39861).
- A version mismatch between the R package and Arrow C++ can be allowed with ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true; at least Arrow C++ 13.0.0 is required (#39739).
- When using open_dataset(), the partition variables are now included in the resulting dataset (#37658).
- write_csv_dataset() now wraps write_dataset() and mirrors the syntax of write_csv_arrow().
- open_delim_dataset() now accepts the quoted_na argument, allowing empty strings to be parsed as NA values (#37828).
- schema() can now be called on data.frame objects to retrieve their inferred Arrow schema (#37843).
- read_csv2_arrow() added (#38002).
- Documentation for CsvParseOptions object creation now contains more information about default values.
- stringr modifier functions (fixed(), regex(), etc.) now allow variables to be reliably used in their arguments (#36784).
- ParquetReaderProperties options allow users to work with Parquet files with unusually large metadata (#36992).
- Error messages for add_filename() are improved (@amoeba, #37372).
- create_package_with_all_dependencies() now properly escapes paths on Windows (#37226).
- Objects whose only class is data.frame now have the class attribute dropped, so file-reading functions and arrow_table() always return tibbles, which makes the type of returned objects consistent.
- Calling as.data.frame() on Arrow Tabular objects now always returns a data.frame object (#34775).
- open_dataset() now works with ND-JSON files (#35055).
- Calling schema() on multiple Arrow objects now returns the object's schema (#35543).
- The .by/by argument is now supported in the arrow implementation of dplyr verbs.
- dplyr::case_when() now accepts a .default parameter to match the update in dplyr 1.1.0 (#35502).
- arrow_array() can be used to create Arrow Arrays (#36381).
- scalar() can be used to create Arrow Scalars (#36265).
- Calling RecordBatchReader::ReadNext() from DuckDB from the main R thread (#36307).
- set_io_thread_count() with num_threads < 2 (#36304).
- strptime() in arrow will return a timezone-aware timestamp if %z is part of the format string (#35671).
- group_by() and across() now match dplyr behavior.
- The read_parquet() and read_feather() functions can now accept URL arguments (#33287, #34708).
- The json_credentials argument in GcsFileSystem$create() now accepts a file path containing the appropriate authentication token.
- The $options member of GcsFileSystem objects can now be inspected.
- The read_csv_arrow() and read_json_arrow() functions now accept literal text input wrapped in I() to improve compatibility with readr::read_csv().
- $ and [[ in dplyr expressions (#18818, #19706).
- FetchNode and OrderByNode improve performance and simplify building query plans from dplyr expressions (#34437, #34685).
- arrow_table() (#35038, #35039).
- data.frame objects with NULL column names can be converted to a Table (#15247, #34798).
- The open_csv_dataset() family of functions (#33998, #34710).
- The dplyr::n() function is now mapped to the count_all kernel to improve performance and simplify the R implementation (#33892, #33917).
- Improved the s3_bucket() filesystem helper with endpoint_override and fixed surprising behaviour that occurred when passing some combinations of arguments.
- Fixed behavior when a schema is supplied and col_names = TRUE in open_csv_dataset() (#34217, #34092).
- open_csv_dataset() allows a schema to be specified. (#34217)
- dplyr:::check_names() (#34369).
- map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. (#14521)
- open_csv_dataset(), open_tsv_dataset(), and open_delim_dataset() all wrap open_dataset(); they don't provide new functionality, but allow for readr-style options to be supplied, making it simpler to switch between individual file-reading and dataset functionality. (#33614)
- The col_names parameter allows specification of column names when opening a CSV dataset.
- The parse_options, read_options, and convert_options parameters for reading individual files (read_*_arrow() functions) and datasets (open_dataset() and the new open_*_dataset() functions) can be passed in as lists. (#15270)
- read_csv_arrow(). (#14930)
- join_by() has been implemented for dplyr joins on Arrow objects (equality conditions only). (#33664)
- When chained dplyr::group_by()/dplyr::summarise() calls are used. (#14905)
- dplyr::summarize() works with division when the divisor is a variable. (#14933)
- dplyr::right_join() correctly coalesces keys. (#15077)
- lubridate::with_tz() and lubridate::force_tz().
- stringr::str_remove() and stringr::str_remove_all() (#14644).
- POSIXlt objects. (#15277)
- Array$create() can create Decimal arrays. (#15211)
- StructArray$create() can be used to create StructArray objects. (#14922)
- lubridate::as_datetime() on Arrow objects can handle time in sub-seconds.
- head() can be called after as_record_batch_reader(). (#14518)
- as.Date() can go from timestamp[us] to timestamp[s]. (#14935)
- check_dots_empty().

Minor improvements and fixes:
- The .data pronoun in dplyr::group_by() (#14484).

Several new functions can be used in queries:

- dplyr::across() can be used to apply the same computation across multiple columns, and the where() selection helper is supported in across().
- add_filename() can be used to get the filename a row came from (only available when querying a Dataset).
- The slice_* family: dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_head(), dplyr::slice_tail(), and dplyr::slice_sample().

The package now has documentation that lists all dplyr methods and R function mappings that are supported on Arrow data, along with notes about any differences in functionality between queries evaluated in R versus in Acero, the Arrow query engine. See ?acero.
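A rough sketch of these query features; the dataset path and column names here are hypothetical:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")   # hypothetical directory of Parquet files

# across() with the where() helper inside a grouped summary
ds %>%
  group_by(year) %>%                    # `year` is an assumed column
  summarise(across(where(is.numeric), mean)) %>%
  collect()

# Record which file each row came from, then take a few rows
ds %>%
  mutate(file = add_filename()) %>%
  slice_head(n = 5) %>%
  collect()
```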
A few new features and bugfixes were implemented for joins:
- The keep argument is now supported, allowing separate columns for the left and right hand side join keys in join output.
- Full joins now coalesce the join keys (when keep = FALSE), avoiding the issue where the join keys would be all NA for rows in the right hand side without any matches on the left.

Some changes to improve the consistency of the API:

- dplyr::pull() will return a ChunkedArray instead of an R vector by default. The current default behavior is deprecated. To update to the new behavior now, specify pull(as_vector = FALSE) or set options(arrow.pull_as_vector = FALSE) globally.
- Calling dplyr::compute() on a query that is grouped returns a Table instead of a query object.

Finally, long-running queries can now be cancelled and will abort their computation immediately.
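A small sketch of the new pull() and compute() behavior (the table and column here are made up):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(1L, 2L, 2L, 3L))

# Opt in to the future default: pull() returns a ChunkedArray, not an R vector
tbl %>% pull(x, as_vector = FALSE)

# Or opt in globally
options(arrow.pull_as_vector = FALSE)

# compute() on a grouped query now returns a Table
tbl %>% group_by(x) %>% compute()
```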
- as_arrow_array() can now take blob::blob and vctrs::list_of, which convert to binary and list arrays, respectively. Also fixed an issue where as_arrow_array() ignored the type argument when passed a StructArray.
- The unique() function works on Table, RecordBatch, Dataset, and RecordBatchReader.
- write_feather() can take compression = FALSE to choose writing uncompressed files.
- Also, a breaking change for IPC files in write_dataset(): passing "ipc" or "feather" to format will now write files with the .arrow extension instead of .ipc or .feather.
As of version 10.0.0, arrow requires C++17 to build. This means that:
- On Windows, you need R >= 4.0. Version 9.0.0 was the last version to support R 3.6.
- On CentOS 7, you can build the latest version of arrow, but you first need to install a newer compiler than the default system compiler, gcc 4.8. See vignette("install", package = "arrow") for guidance. Note that you only need the newer compiler to build arrow: installing a binary package, as from RStudio Package Manager, or loading a package you've already installed works fine with the system defaults.
- dplyr::union and dplyr::union_all (#13090).
- dplyr::glimpse (#13563).
- show_exec_plan() can be added to the end of a dplyr pipeline to show the underlying plan, similar to dplyr::show_query(). dplyr::show_query() and dplyr::explain() also work and show the same output, but may change in the future. (#13541)
- User-defined functions are supported in queries; use register_scalar_function() to create them. (#13397)
- map_batches() returns a RecordBatchReader and requires that the function it maps returns something coercible to a RecordBatch through the as_record_batch() S3 function. It can also run in streaming fashion if passed .lazy = TRUE. (#13170, #13650)
- Functions can be called with package-qualified names (stringr::, lubridate::) within queries. For example, stringr::str_length will now dispatch to the same kernel as str_length. (#13160)
- Support for the lubridate::parse_date_time() datetime parser: (#12589, #13196, #13506)
  - orders with year, month, day, hours, minutes, and seconds components are supported.
  - The orders argument in the Arrow binding works as follows: orders are transformed into formats which subsequently get applied in turn. There is no select_formats parameter and no inference takes place (as is the case in lubridate::parse_date_time()).
- lubridate date and datetime parsers such as lubridate::ymd(), lubridate::yq(), and lubridate::ymd_hms() (#13118, #13163, #13627).
- lubridate::fast_strptime() (#13174).
- lubridate::floor_date(), lubridate::ceiling_date(), and lubridate::round_date() (#12154).
- strptime() supports the tz argument to pass timezones. (#13190)
- lubridate::qday() (day of quarter).
- exp() and sqrt(). (#13517)
- read_ipc_file() and write_ipc_file() are added. These functions are almost the same as read_feather() and write_feather(), but differ in that they only target IPC files (Feather V2 files), not Feather V1 files.
- read_arrow() and write_arrow(), deprecated since 1.0.0 (July 2020), have been removed. Instead of these, use read_ipc_file() and write_ipc_file() for IPC files, or read_ipc_stream() and write_ipc_stream() for IPC streams. (#13550)
- write_parquet() now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments properties and arrow_properties have been removed; if you need to deal with these lower-level properties objects directly, use ParquetFileWriter, which write_parquet() wraps. (#13555)
- write_dataset() preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (#13105)
- Reading and writing functions (such as write_csv_arrow()) will automatically (de-)compress data if the file path contains a compression extension (e.g. "data.csv.gz"). This works locally as well as on remote filesystems like S3 and GCS. (#13183)
- FileSystemFactoryOptions can be provided to open_dataset(), allowing you to pass options such as which file prefixes to ignore. (#13171)
- By default, S3FileSystem will not create or delete buckets. To enable that, pass the configuration option allow_bucket_creation or allow_bucket_deletion. (#13206)
- GcsFileSystem and gs_bucket() allow connecting to Google Cloud Storage. (#10999, #13601)
- The $num_rows() method returns a double (previously an integer), avoiding integer overflow on larger tables. (#13482, #13514)
- The arrow.dev_repo for nightly builds of the R package and prebuilt libarrow binaries is now https://nightlies.apache.org/arrow/r/.
- open_dataset(): the skip argument for skipping header rows in CSV datasets.
- UnionDataset.
- In {dplyr} queries:
  - Queries can operate on a RecordBatchReader. This allows, for example, results from DuckDB to be streamed back into Arrow rather than materialized before continuing the pipeline.
  - dplyr::rename_with().
  - dplyr::count() returns an ungrouped dataframe.
- write_dataset() has more options for controlling row group and file sizes when writing partitioned datasets, such as max_open_files, max_rows_per_file, min_rows_per_group, and max_rows_per_group.
- write_csv_arrow() accepts a Dataset or an Arrow dplyr query.
- options(use_threads = FALSE) no longer crashes R. That option is set by default on Windows.
- dplyr joins support the suffix argument to handle overlap in column names.
- is.na() no longer misses any rows.
- map_batches() correctly accepts Dataset objects.
- read_csv_arrow()'s readr-style type T is mapped to timestamp(unit = "ns") instead of timestamp(unit = "s").
- {lubridate} features and fixes:
  - lubridate::tz() (timezone), lubridate::semester(), lubridate::dst() (daylight savings time boolean), lubridate::date(), lubridate::epiyear() (year according to the epidemiological week calendar).
  - lubridate::month() works with integer inputs.
  - lubridate::make_date() & lubridate::make_datetime() + base::ISOdatetime() & base::ISOdate() to create date-times from numeric representations.
  - lubridate::decimal_date() and lubridate::date_decimal().
  - lubridate::make_difftime() (duration constructor).
  - lubridate::duration helper functions, such as lubridate::dyears(), lubridate::dhours(), lubridate::dseconds().
  - lubridate::leap_year().
  - lubridate::as_date() and lubridate::as_datetime().
  - base::difftime and base::as.difftime().
  - base::as.Date() to convert to date.
  - base::format().
  - strptime() returns NA instead of erroring in case of format mismatch, just like base::strptime().
- as_arrow_array() and as_arrow_table() for main Arrow objects. This includes Arrow tables, record batches, arrays, chunked arrays, record batch readers, schemas, and data types. This allows other packages to define custom conversions from their types to Arrow objects, including extension arrays. See ?new_extension_type.
- Any object for which vctrs::vec_is() returns TRUE (i.e., any object that can be used as a column in a tibble::tibble()) can be converted, provided that the underlying vctrs::vec_data() can be converted to an Arrow Array.

Arrow arrays and tables can be easily concatenated:

- Arrays can be concatenated with concat_arrays() or, if zero-copy is desired and chunking is acceptable, using ChunkedArray$create().
- ChunkedArrays can be concatenated with c().
- RecordBatches and Tables support cbind().
- Tables support rbind(); concat_tables() is also provided to concatenate tables while unifying schemas.
- sqrt(), log(), and exp() work with Arrow arrays and scalars.
- read_* and write_* functions support R Connection objects for reading and writing files.
- median() and quantile() will warn only once about approximate calculations regardless of interactivity.
- Array$cast() can cast StructArrays into another struct type with the same field names and structure (or a subset of fields) but different field types.
- Fixed a bug where set_io_thread_count() would set the CPU count instead of the IO thread count.
- RandomAccessFile has a $ReadMetadata() method that provides useful metadata provided by the filesystem.
- The grepl binding returns FALSE for NA inputs (previously it returned NA), to match the behavior of base::grepl().
- create_package_with_all_dependencies() works on Windows and Mac OS, instead of only Linux.
- {lubridate} features: week(), more of the is.*() functions, and the label argument to month() have been implemented.
- More complex expressions inside summarize(), such as ifelse(n() > 1, mean(y), mean(z)), are supported.
- tibble and data.frame can be used to create columns of tibbles or data.frames respectively (e.g. ... %>% mutate(df_col = tibble(a, b)) %>% ...).
- Dictionaries (the R factor type) are supported inside of coalesce().
- open_dataset() accepts the partitioning argument when reading Hive-style partitioned files, even though it is not required.
- The map_batches() function for custom operations on datasets has been restored.
- The encoding argument can be specified when reading.
- open_dataset() correctly ignores byte-order marks (BOMs) in CSVs, as already was true for reading single files.
- head() no longer hangs on large CSV datasets.
- write_csv_arrow() now follows the signature of readr::write_csv().
- A $code() method on a schema or type. This allows you to easily get the code needed to create a schema from an object that already has one.
- The Duration type has been mapped to R's difftime class.
- The decimal256() type is supported. The decimal() function has been revised to call either decimal256() or decimal128() based on the value of the precision argument.
- write_parquet() uses a reasonable guess at chunk_size instead of always writing a single chunk. This improves the speed of reading and writing large Parquet files.
- write_parquet() no longer drops attributes for grouped data.frames.
- proxy_options.
- The source build uses pkg-config to search for system dependencies (such as libz) and links to them if present. This new default will make building Arrow from source quicker on systems that have these dependencies installed already. To retain the previous behavior of downloading and building all dependencies, set ARROW_DEPENDENCY_SOURCE=BUNDLED.
- glue, which arrow depends on transitively, has dropped support for it.
- str_count() in dplyr queries.

There are now two ways to query Arrow data:
dplyr::summarize(), both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include n(), n_distinct(), min(), max(), sum(), mean(), var(), sd(), any(), and all(). median() and quantile() with one probability are also supported and currently return approximate results using the t-digest algorithm.

Along with summarize(), you can also call count(), tally(), and distinct(), which effectively wrap summarize().

This enhancement does change the behavior of summarize() and collect() in some cases: see "Breaking changes" below for details.
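A sketch of a grouped aggregation over a multi-file Dataset (the path and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/sales")     # a directory of Parquet files

ds %>%
  group_by(region) %>%
  summarize(
    n_rows        = n(),
    total         = sum(amount, na.rm = TRUE),
    median_amount = median(amount)      # approximate, via t-digest
  ) %>%
  collect()
```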
In addition to summarize(), mutating and filtering equality joins (inner_join(), left_join(), right_join(), full_join(), semi_join(), and anti_join()) are also supported natively in Arrow.

Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.
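For instance, an equality join between a Dataset and a small in-memory lookup table (all names here are illustrative):

```r
library(arrow)
library(dplyr)

orders <- open_dataset("path/to/orders")      # hypothetical dataset

customers <- arrow_table(
  customer_id = c(1L, 2L, 3L),
  segment     = c("retail", "wholesale", "retail")
)

orders %>%
  left_join(customers, by = "customer_id") %>%
  filter(segment == "retail") %>%
  collect()
```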
New non-aggregating compute functions include string functions like str_to_title() and strftime() as well as compute functions for extracting date parts (e.g. year(), month()) from dates. This is not a complete list of additional compute functions; for an exhaustive list of available compute functions see list_compute_functions().
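For example, to see what is available in your build (the pattern argument filters function names):

```r
library(arrow)

# All compute functions exposed by the underlying Arrow C++ library
head(list_compute_functions())

# Only the functions whose names match a pattern
list_compute_functions(pattern = "^min")
```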
We've also worked to fill in support for all data types, such as Decimal, for functions added in previous releases. All type limitations mentioned in previous release notes should no longer be valid, and if you find a function that is not implemented for a certain data type, please report an issue.

If you have the duckdb package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the to_duckdb() function. This allows you to use duckdb's dbplyr methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before to_duckdb() is evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff does not copy the data; instead it uses Arrow's C interface (just like passing Arrow data between R and Python). This means no serialization or data copying costs are incurred.

You can also take a duckdb tbl and call to_arrow() to stream data to Arrow's query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in DuckDB, then evaluate the rest in Arrow.
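A sketch of that round trip (requires the duckdb package; the dataset path and columns are made up):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")        # hypothetical

ds %>%
  filter(year == 2021) %>%                   # evaluated in Arrow before the handoff
  to_duckdb() %>%                            # no-copy handoff via the C interface
  group_by(month) %>%
  summarise(total = sum(value)) %>%          # evaluated by DuckDB
  to_arrow() %>%                             # stream the result back into Arrow
  filter(total > 0) %>%                      # evaluated in Arrow again
  collect()
```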
- Row order of query results is no longer guaranteed; if you need a stable order, explicitly arrange() the query result. For calls to summarize(), you can set options(arrow.summarise.sort = TRUE) to match the current dplyr behavior of sorting on the grouping columns.
- dplyr::summarize() on an in-memory Arrow Table or RecordBatch no longer eagerly evaluates. Call compute() or collect() to evaluate the query.
- head() and tail() also no longer eagerly evaluate, both for in-memory data and for Datasets. Also, because row order is no longer deterministic, they will effectively give you a random slice of data from somewhere in the dataset unless you arrange() to specify sorting.
- Custom metadata such as that used by sf columns is no longer saved by default; convert such columns yourself (e.g. sf::st_as_binary(col)) or use the sfarrow package, which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set options(arrow.preserve_row_level_metadata = TRUE). We will be removing this option in a coming release. We strongly recommend avoiding this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides sf columns that is impacted by this, please report an issue.
- A smaller installation can be requested with LIBARROW_MINIMAL=true. This will have the core Arrow/Feather components but excludes Parquet, Datasets, compression libraries, and other optional features.
- The create_package_with_all_dependencies() function (also available on GitHub without installing the arrow package) will download all third-party C++ dependencies and bundle them inside the R source package. Run this function on a system connected to the network to produce the "fat" source package, then copy that .tar.gz package to your offline machine and install.
- System dependencies (such as libz) can be used by setting ARROW_DEPENDENCY_SOURCE=AUTO. This is not the default in this release (BUNDLED, i.e. download and build all dependencies) but may become the default in the future.
- The JSON reading features (read_json_arrow()) are now optional and still on by default; set ARROW_JSON=OFF before building to disable them.
- options(arrow.use_altrep = FALSE).
- Field objects can now be created as non-nullable, and schema() now optionally accepts a list of Fields (see the sketch after this list).
- write_parquet() no longer errors when used with a grouped data.frame.
- case_when() now errors cleanly if an expression is not supported in Arrow.
- open_dataset() now works on CSVs without header rows.
- Fixed an issue where the readr-style types T and t were reversed in read_csv_arrow().
- Fixed log(..., base = b) where b is something other than 2, e, or 10.
- Table$create() now has the alias arrow_table().

This patch version contains fixes for some sanitizer and compiler warnings.
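As referenced above, a small sketch of building a schema from Field objects, including a non-nullable field:

```r
library(arrow)

fields <- list(
  field("id", int32(), nullable = FALSE),   # non-nullable column
  field("name", utf8())
)

sch <- schema(fields)
sch
```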
There are now more than 250 compute functions available for use in dplyr::filter(), mutate(), etc. Additions in this release include:

- String functions: strsplit() and str_split(); strptime(); paste(), paste0(), and str_c(); substr() and str_sub(); str_like(); str_pad(); stri_reverse().
- lubridate methods such as year(), month(), wday(), and so on.
- Math functions: logarithms (log() et al.); trigonometry (sin(), cos(), et al.); abs(); sign(); pmin() and pmax(); ceiling(), floor(), and trunc().
- Conditional functions, with some limitations: ifelse() and if_else() for all but Decimal types; case_when() for logical, numeric, and temporal types only; coalesce() for all but lists/structs. Note also that in this release, factors/dictionaries are converted to strings in these functions.
- is.* functions are supported and can be used inside relocate().

The print method for arrow_dplyr_query now includes the expression and the resulting type of columns derived by mutate().

transmute() now errors if passed arguments .keep, .before, or .after, for consistency with the behavior of dplyr on data.frames.
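A small sketch using a few of the bindings listed above (the table and columns are made up):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(-1.5, 2.25, NA), s = c("a", "bb", "ccc"))

tbl %>%
  mutate(
    abs_x  = abs(x),
    ceil_x = ceiling(x),
    s_len  = str_length(s),
    s_pad  = str_pad(s, width = 4, pad = "0")
  ) %>%
  collect()
```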
- write_csv_arrow() to use Arrow to write a data.frame to a single CSV file.
- write_dataset(format = "csv", ...) to write a Dataset to CSVs, including with partitioning.
- reticulate::py_to_r() and r_to_py() methods. Along with the addition of the Scanner$ToRecordBatchReader() method, you can now build up a Dataset query in R and pass the resulting stream of batches to another tool in process.
- C interface methods are exposed (e.g. Array$export_to_c(), RecordBatch$import_from_c()), similar to how they are in pyarrow. This facilitates their use in other packages. See the py_to_r() and r_to_py() methods for usage examples.
- Converting an R data.frame to an Arrow Table uses multithreading across columns.
- options(arrow.use_altrep = FALSE).
- is.na() now evaluates to TRUE on NaN values in floating point number fields, for consistency with base R.
- is.nan() now evaluates to FALSE on NA values in floating point number fields and FALSE on all values in non-floating point fields, for consistency with base R.
- Additional methods for Array, ChunkedArray, RecordBatch, and Table: na.omit() and friends, any()/all().
- Scalar values in RecordBatch$create() and Table$create() are recycled.
- arrow_info() includes details on the C++ build, such as compiler version.
- match_arrow() now converts x into an Array if it is not a Scalar, Array or ChunkedArray and no longer dispatches base::match().
- The full build (LIBARROW_MINIMAL=false) includes both jemalloc and mimalloc, and it still has jemalloc as default, though this is configurable at runtime with the ARROW_DEFAULT_MEMORY_POOL environment variable.
- LIBARROW_MINIMAL, LIBARROW_DOWNLOAD, and NOT_CRAN are now case-insensitive in the Linux build script.

Many more dplyr verbs are supported on Arrow objects:

- dplyr::mutate() is now supported in Arrow for many applications. For queries on Table and RecordBatch that are not yet supported in Arrow, the implementation falls back to pulling data into an in-memory R data.frame first, as in the previous release. For queries on Dataset (which can be larger than memory), it raises an error if the function is not implemented. The main mutate() features that cannot yet be called on Arrow objects are (1) mutate() after group_by() (which is typically used in combination with aggregation) and (2) queries that use dplyr::across().
- dplyr::transmute() (which calls mutate()).
- dplyr::group_by() now preserves the .drop argument and supports on-the-fly definition of columns.
- dplyr::relocate() to reorder columns.
- dplyr::arrange() to sort rows.
- dplyr::compute() to evaluate the lazy expressions and return an Arrow Table. This is equivalent to dplyr::collect(as_data_frame = FALSE), which was added in 2.0.0.

Over 100 functions can now be called on Arrow objects inside a dplyr verb:

- nchar(), tolower(), and toupper(), along with their stringr spellings str_length(), str_to_lower(), and str_to_upper(), are supported in Arrow dplyr calls. str_trim() is also supported.
- sub(), gsub(), and grepl(), along with str_replace(), str_replace_all(), and str_detect(), are supported.
- cast(x, type) and dictionary_encode() allow changing the type of columns in Arrow objects; as.numeric(), as.character(), etc. are exposed as similar type-altering conveniences.
- dplyr::between(); the Arrow version also allows the left and right arguments to be columns in the data and not just scalars.
- Arrow compute functions can be called directly inside a dplyr verb. This enables you to access Arrow functions that don't have a direct R mapping. See list_compute_functions() for all available functions, which are available in dplyr prefixed by arrow_.
- dplyr::filter(arrow_dataset, string_column == 3) will error with a message about the type mismatch between the numeric 3 and the string type of string_column.
- open_dataset() now accepts a vector of file paths (or even a single file path). Among other things, this enables you to open a single very large file and use write_dataset() to partition it without having to read the whole file into memory.
- write_dataset() now defaults to format = "parquet" and better validates the format argument.
- Invalid input for schema in open_dataset() is now correctly handled.
- The Scanner$Scan() method has been removed; use Scanner$ScanBatches().
- value_counts() to tabulate values in an Array or ChunkedArray, similar to base::table().
- StructArray objects gain data.frame-like methods, including names(), $, [[, and dim().
- Columns can be modified by assignment (<-) with either $ or [[.
- The Schema can now be edited by assigning in new types. This enables using the CSV reader to detect the schema of a file, modify the Schema object for any columns that you want to read in as a different type, and then use that Schema to read the data.
- Creating a Table with a schema, with columns of different lengths, and with scalar value recycling.
- When string data contains embedded nul (\0) characters, the error message now informs you that you can set options(arrow.skip_nul = TRUE) to strip them out. It is not recommended to set this option by default since this code path is significantly slower, and most string data does not contain nuls.
- read_json_arrow() now accepts a schema: read_json_arrow("file.json", schema = schema(col_a = float64(), col_b = string())).
- See vignette("install", package = "arrow") for details. This allows a faster, smaller package build in cases where that is useful, and it enables a minimal, functioning R package build on Solaris.
- FORCE_BUNDLED_BUILD=true.
- arrow now uses the mimalloc memory allocator by default on macOS, if available (as it is in CRAN binaries), instead of jemalloc. There are configuration issues with jemalloc on macOS, and benchmark analysis shows that this has negative effects on performance, especially on memory-intensive workflows. jemalloc remains the default on Linux; mimalloc is default on Windows.
- Setting the ARROW_DEFAULT_MEMORY_POOL environment variable to switch memory allocators now works correctly when the Arrow C++ library has been statically linked (as is usually the case when installing from CRAN).
- The arrow_info() function now reports on the additional optional features, as well as the detected SIMD level. If key features or compression libraries are not enabled in the build, arrow_info() will refer to the installation vignette for guidance on how to install a more complete build, if desired.
- vignette("developing", package = "arrow").
- You can use ARROW_HOME to point to a specific directory where the Arrow libraries are. This is similar to passing INCLUDE_DIR and LIB_DIR.
- flight_get() and flight_put() (renamed from push_data() in this release) can handle both Tables and RecordBatches.
- flight_put() gains an overwrite argument to optionally check for the existence of a resource with the same name.
- list_flights() and flight_path_exists() enable you to see available resources on a Flight server.
- Schema objects now have r_to_py and py_to_r methods.
- Arithmetic operations (+, *, etc.) are supported on Arrays and ChunkedArrays and can be used in filter expressions in Arrow dplyr pipelines.
- Columns can be added, replaced, or removed by assignment (<-) with either $ or [[.
- names().
- The rlang pronouns .data and .env are now fully supported in Arrow dplyr pipelines.
- New option arrow.skip_nul (default FALSE, as in base::scan()) allows conversion of Arrow string (utf8()) type data containing embedded nul \0 characters to R. If set to TRUE, nuls will be stripped and a warning is emitted if any are found.
- arrow_info() for an overview of various run-time and build-time Arrow configurations, useful for debugging.
- Set the environment variable ARROW_DEFAULT_MEMORY_POOL before loading the Arrow package to change memory allocators. Windows packages are built with mimalloc; most others are built with both jemalloc (used by default) and mimalloc. These alternative memory allocators are generally much faster than the system memory allocator, so they are used by default when available, but sometimes it is useful to turn them off for debugging purposes. To disable them, set ARROW_DEFAULT_MEMORY_POOL=system.
- sf tibbles are faithfully preserved and roundtripped (#8549).
- See schema() for more details.
- write_parquet() can now write RecordBatches.
- readr's problems attribute is removed when converting to an Arrow RecordBatch or Table to prevent large amounts of metadata from accumulating inadvertently (#9092).
- SubTreeFileSystem gains a useful print method and no longer errors when printing.
- Nightly development versions of the conda r-arrow package are available with conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow.
- cmake versions.
- See vignette("install", package = "arrow"), especially for known CentOS issues.
- OS detection uses the distro package. If your OS isn't correctly identified, please report an issue there.
- write_dataset() to write Feather or Parquet files with partitioning. See the end of vignette("dataset", package = "arrow") for discussion and examples.
- head(), tail(), and take ([) methods. head() is optimized but the others may not be performant.
- collect() gains an as_data_frame argument, default TRUE but when FALSE allows you to evaluate the accumulated select and filter query but keep the result in Arrow, not an R data.frame.
- read_csv_arrow() supports specifying column types, both with a Schema and with the compact string representation for types used in the readr package. It also has gained a timestamp_parsers argument that lets you express a set of strptime parse strings that will be tried to convert columns designated as Timestamp type.
- S3 support requires libcurl and openssl, as well as a sufficiently modern compiler. See vignette("install", package = "arrow") for details.
- File readers and writers (read_parquet(), write_feather(), et al.), as well as open_dataset() and write_dataset(), allow you to access resources on S3 (or on file systems that emulate S3) either by providing an s3:// URI or by providing a FileSystem$path(). See vignette("fs", package = "arrow") for examples.
- copy_files() allows you to recursively copy directories of files from one file system to another, such as from S3 to your local machine.

Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces. The arrow R package now provides methods for connecting to Flight RPC servers to send and receive data.
See vignette("flight", package = "arrow") for an overview.
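A minimal sketch of talking to a Flight server (the host, port, and path are assumptions, and a running Flight server is required):

```r
library(arrow)

client <- flight_connect(host = "localhost", port = 8089)

# Upload a data.frame under a path, then read it back as a Table
flight_put(client, data.frame(x = 1:3), path = "example/data")
flight_get(client, "example/data")

# List resources available on the server
list_flights(client)
```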
- Comparison (==, >, etc.) and boolean (&, |, !) operations, along with is.na, %in% and match (called match_arrow()), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
- min(), max(), and unique() are implemented for Arrays and ChunkedArrays.
- dplyr filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
- dim() (nrow) for dplyr queries on Table/RecordBatch is now supported.
- arrow now depends on cpp11, which brings more robust UTF-8 handling and faster compilation.
- Conversion of the Int64 type when all values fit in an R 32-bit integer now correctly inspects all chunks in a ChunkedArray, and this conversion can be disabled (so that Int64 always yields a bit64::integer64 vector) by setting options(arrow.int64_downcast = FALSE).
- ParquetFileReader has additional methods for accessing individual columns or row groups from the file.
- Fixes involving ParquetFileWriter, an invalid ArrowObject pointer from a saved R object, and converting deeply nested structs from Arrow to R.
- The properties and arrow_properties arguments to write_parquet() are deprecated.
- A %in% expression now faithfully returns all relevant rows.
- Dataset file paths can contain . or _; files and subdirectories starting with those prefixes are still ignored.
- open_dataset("~/path") now correctly expands the path.
- The version option to write_parquet() is now correctly implemented.
- The parquet-cpp library has been fixed.
- Detection of cmake is more robust, and you can now specify a /path/to/cmake by setting the CMAKE environment variable.
- vignette("arrow", package = "arrow") includes tables that explain how R types are converted to Arrow types and vice versa.
- Additional type support: uint64, binary, fixed_size_binary, large_binary, large_utf8, large_list, list of structs.
- character vectors that exceed 2GB are converted to the Arrow large_utf8 type.
- POSIXlt objects can now be converted to Arrow (struct).
- R attributes() are preserved in Arrow metadata when converting to an Arrow RecordBatch or Table and are restored when converting from Arrow. This means that custom subclasses, such as haven::labelled, are preserved in round trip through Arrow.
- Metadata can be assigned, e.g. batch$metadata$new_key <- "new value".
- int64, uint32, and uint64 are now converted to R integer if all values fit in bounds.
- date32 is now converted to R Date with double underlying storage. Even though the data values themselves are integers, this provides more strict round-trip fidelity.
- When converting to an R factor, dictionary ChunkedArrays that do not have identical dictionaries are properly unified.
- RecordBatch{File,Stream}Writer will write V5, but you can specify an alternate metadata_version. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 to write V4 without changing any other code.
- Datasets can be opened directly from S3, e.g. ds <- open_dataset("s3://..."). Note that this currently requires a special C++ library build with additional dependencies; this is not yet available in CRAN releases or in nightly packages.
- sum() and mean() are implemented for Array and ChunkedArray.
- dimnames() and as.list().
- reticulate.
- The coerce_timestamps option to write_parquet() is now correctly implemented.
- The type definition is honored if provided by the user.
- read_arrow and write_arrow are now deprecated; use the read/write_feather() and read/write_ipc_stream() functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
- The previously deprecated FileStats, read_record_batch, and read_table have been removed.
- jemalloc is included, and Windows packages use mimalloc.
- The CC and CXX values that R uses.
- dplyr 1.0.
- reticulate::r_to_py() conversion now correctly works automatically, without having to call the method yourself.

This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and it allows files to be compressed using either lz4 or zstd. write_feather() can write either version 2 or version 1 Feather files, and read_feather() automatically detects which file version it is reading.
Related to this change, several functions around reading and writing data have been reworked. read_ipc_stream() and write_ipc_stream() have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 is the IPC file format).

Behavior has been standardized: all read_<format>() functions return an R data.frame (default) or a Table if the argument as_data_frame = FALSE; all write_<format>() functions return the data object, invisibly. To facilitate some workflows, a special write_to_raw() function is added to wrap write_ipc_stream() and return the raw vector containing the buffer that was written.

To achieve this standardization, read_table(), read_record_batch(), read_arrow(), and write_arrow() have been deprecated.
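For example (a sketch using a temporary file):

```r
library(arrow)

tf <- tempfile(fileext = ".feather")

# write_* functions return their input invisibly
write_feather(mtcars, tf)

# read_* functions return a data.frame by default, or a Table with as_data_frame = FALSE
df  <- read_feather(tf)
tab <- read_feather(tf, as_data_frame = FALSE)

# Serialize to an IPC stream held in a raw vector
buf <- write_to_raw(mtcars)
read_ipc_stream(buf)
```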
The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use reticulate to share data between R and Python (pyarrow) efficiently.

See vignette("python", package = "arrow") for details.
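A sketch of sharing a Table with pyarrow through reticulate (this assumes pyarrow is installed in the Python environment that reticulate uses):

```r
library(arrow)
library(reticulate)

pa  <- import("pyarrow")
tab <- Table$create(x = 1:3, y = c("a", "b", "c"))

# Hand the Table to Python without copying the data
py_tab <- r_to_py(tab)
py_tab$num_rows

# And bring a pyarrow object back into R
r_tab <- py_to_r(py_tab)
```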
- A dim() method, which sums rows across all files (#6635).
- Combine multiple datasets into a UnionDataset with the c() method.
- Dataset filtering treats NA as FALSE, consistent with dplyr::filter().
- vignette("dataset", package = "arrow") now has correct, executable code.
- NOT_CRAN=true. See vignette("install", package = "arrow") for details and more options.
- unify_schemas() to create a Schema containing the union of fields in multiple schemas.
- read_feather() and other reader functions close any file connections they open.
- Fixed behavior when the R.oo package is also loaded.
- FileStats is renamed to FileInfo, and the original spelling has been deprecated.
- install_arrow() now installs the latest release of arrow, including Linux dependencies, either for CRAN releases or for development builds (if nightly = TRUE).
- When the LIBARROW_DOWNLOAD or NOT_CRAN environment variable is set.
- write_feather(), write_arrow() and write_parquet() now return their input, similar to the write_* functions in the readr package (#6387, @boshek).
- R list objects are converted to a ListArray when all list elements are the same type (#6275).

This release includes a dplyr interface to Arrow Datasets, which let you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with open_dataset() and then use dplyr methods to select(), filter(), etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. dplyr methods are conditionally loaded if you have dplyr available; it is not a hard dependency.

See vignette("dataset", package = "arrow") for details.
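A sketch of the basic workflow (the directory and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_directory")  # many files, one logical dataset

ds %>%
  select(passenger_count, fare_amount) %>%
  filter(passenger_count > 1) %>%
  collect()                                      # only now is data pulled into R
```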
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.

See vignette("install", package = "arrow") for details.
- Tables and RecordBatches also have dplyr methods.
- In addition to dplyr, [ methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ Filter, Slice, and Take methods for efficient access, depending on the type of selection vector.
- An array_expression class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as arrow_table[arrow_table$var1 > 5, ] without having to pull everything into R first.
- write_parquet() now supports compression.
- codec_is_available() returns TRUE or FALSE depending on whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy).
- Non-character values are converted to character (as R factor levels are required to be) instead of raising an error.
- Objects are now created with Class$create() methods. Notably, arrow::array() and arrow::table() have been removed in favor of Array$create() and Table$create(), eliminating the package startup message about masking base functions. For more information, see the new vignette("arrow").
- ARROW_PRE_0_15_IPC_FORMAT=1.
- The as_tibble argument in the read_*() functions has been renamed to as_data_frame (#5399).
- The arrow::Column class has been removed, as it was removed from the C++ library.
- Table and RecordBatch objects have S3 methods that enable you to work with them more like data.frames. Extract columns, subset, and so on. See ?Table and ?RecordBatch for examples.
- read_csv_arrow() supports more parsing options, including col_names, na, quoted_na, and skip.
- read_parquet() and read_feather() can ingest data from a raw vector (#5141).
- File paths that need expanding, such as ~/file.parquet, are handled properly (#5169).
- double() and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (#5198, #5201)

Initial CRAN release of the arrow package. Key features include: