- stringr::str_replace_na() binding implemented (#47521).
- hms::hms() bindings (#47278).
- hms::hms() and hms::as_hms() bindings implemented to create and manipulate time-of-day variables (#46206).
- atan(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh(), and expm1() bindings added (#44953).
- check_directory_existence_before_creation option in S3FileSystem to reduce I/O calls on cloud storage.
- case_when() now correctly detects objects that are not in the global environment.
- blob::blob is now supported in addition to arrow_binary when binary data is converted to R objects. This change is the first step in eventually deprecating the arrow_binary class in favor of the blob class in the blob package (see GH-45709).

This release primarily updates the underlying Arrow C++ version used by the package to version 19.0.1 and includes all changes from the 19.0.0 and 19.0.1 releases. For what's changed in Arrow C++ 19.0.0, please see the blog post and changelog. For what's changed in Arrow C++ 19.0.1, please see the blog post and changelog.
- %in% (#43446).
- str_sub binding updated to properly handle negative end values.
- User-written functions that wrap supported bindings are now translated: previously time_hours <- function(mins) mins / 60 worked, but time_hours_rounded <- function(mins) round(mins / 60) did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see register_scalar_function(). (#41223)
- mutate() expressions can now include aggregations, such as x - mean(x). (#41350)
- summarize() supports more complex expressions, and correctly handles cases where column names are reused in expressions. (#41223)
- The na_matches argument to the dplyr::*_join() functions is now supported. This argument controls whether NA values are considered equal when joining. (#41358)
- When using pull on grouped datasets, it now returns the expected column. (#43172)
- Bindings for base::prod have been added so you can now use it in your dplyr pipelines (i.e., tbl |> summarize(prod(col))) without having to pull the data into R.
- Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL (#38377).
- The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow::, which makes the output more portable (@orgadish, #38144).
- SystemRequirements (#39602).
- Improved handling when sub, gsub, stringr::str_replace, or stringr::str_replace_all are passed a length > 1 vector of values in pattern (@abfleishman, #39219).
- Documentation added to ?open_dataset describing how to use the ND-JSON support added in arrow 13.0.0.
- For S3 file systems (s3_bucket, S3FileSystem), the debug log level for S3 can be set with the AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem for more information. (#38267)
- Using arrow with duckdb (to_duckdb()) no longer results in warnings when quitting your R session. (#38495)
- Set LIBARROW_BINARY=true for the old behavior (#39861).
- A version mismatch between the R package and Arrow C++ can be allowed with ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true; at least Arrow C++ 13.0.0 is required (#39739).
- When using open_dataset(), the partition variables are now included in the resulting dataset (#37658).
- write_csv_dataset() now wraps write_dataset() and mirrors the syntax of write_csv_arrow().
- open_delim_dataset() now accepts the quoted_na argument, allowing empty strings to be parsed as NA values (#37828).
- schema() can now be called on data.frame objects to retrieve their inferred Arrow schema (#37843).
- read_csv2_arrow() added (#38002).
- Documentation for CsvParseOptions object creation now contains more information about default values.
- stringr modifier functions (fixed(), regex(), etc.) now allow variables to be reliably used in their arguments (#36784).
- ParquetReaderProperties options allow users to work with Parquet files with unusually large metadata (#36992).
- Error messages for add_filename() are improved (@amoeba, #37372).
- create_package_with_all_dependencies() now properly escapes paths on Windows (#37226).
- Objects whose only class is data.frame now have the class attribute dropped, so file-reading functions and arrow_table() always return tibbles, which makes the type of returned objects consistent.
- Calling as.data.frame() on Arrow Tabular objects now always returns a data.frame object (#34775).
- open_dataset() now works with ND-JSON files (#35055).
- Calling schema() on multiple Arrow objects now returns the object's schema (#35543).
- The .by/by argument is now supported in the arrow implementation of dplyr verbs.
- dplyr::case_when() now accepts a .default parameter to match the update in dplyr 1.1.0 (#35502).
- arrow_array() can be used to create Arrow Arrays (#36381).
- scalar() can be used to create Arrow Scalars (#36265).
- Calling RecordBatchReader::ReadNext() from DuckDB from the main R thread (#36307).
- set_io_thread_count() with num_threads < 2 (#36304).
- strptime() in arrow will return a timezone-aware timestamp if %z is part of the format string (#35671).
- group_by() and across() now match dplyr behavior.
- The read_parquet() and read_feather() functions can now accept URL arguments (#33287, #34708).
- The json_credentials argument in GcsFileSystem$create() now accepts a file path containing the appropriate authentication token.
- The $options member of GcsFileSystem objects can now be inspected.
- The read_csv_arrow() and read_json_arrow() functions now accept literal text input wrapped in I() to improve compatibility with readr::read_csv().
- $ and [[ in dplyr expressions (#18818, #19706).
- FetchNode and OrderByNode improve performance and simplify building query plans from dplyr expressions (#34437, #34685).
- arrow_table() (#35038, #35039).
- data.frame objects with NULL column names can be converted to a Table (#15247, #34798).
- The open_csv_dataset() family of functions (#33998, #34710).
- The dplyr::n() function is now mapped to the count_all kernel to improve performance and simplify the R implementation (#33892, #33917).
- Improved the s3_bucket() filesystem helper with endpoint_override and fixed surprising behaviour that occurred when passing some combinations of arguments.
- Fixed behavior when a schema is supplied and col_names = TRUE in open_csv_dataset() (#34217, #34092).
- open_csv_dataset() allows a schema to be specified. (#34217)
- dplyr:::check_names() (#34369).
- map_batches() is lazy by default; it now returns a RecordBatchReader instead of a list of RecordBatch objects unless lazy = FALSE. (#14521)
- open_csv_dataset(), open_tsv_dataset(), and open_delim_dataset() all wrap open_dataset(); they don't provide new functionality, but allow for readr-style options to be supplied, making it simpler to switch between individual file-reading and dataset functionality. (#33614)
- The col_names parameter allows specification of column names when opening a CSV dataset.
- The parse_options, read_options, and convert_options parameters for reading individual files (read_*_arrow() functions) and datasets (open_dataset() and the new open_*_dataset() functions) can be passed in as lists. (#15270)
- read_csv_arrow(). (#14930)
- join_by() has been implemented for dplyr joins on Arrow objects (equality conditions only). (#33664)
- When chained dplyr::group_by()/dplyr::summarise() calls are used. (#14905)
- dplyr::summarize() works with division when the divisor is a variable. (#14933)
- dplyr::right_join() correctly coalesces keys. (#15077)
- lubridate::with_tz() and lubridate::force_tz().
- stringr::str_remove() and stringr::str_remove_all() (#14644).
- POSIXlt objects. (#15277)
- Array$create() can create Decimal arrays. (#15211)
- StructArray$create() can be used to create StructArray objects. (#14922)
- lubridate::as_datetime() on Arrow objects can handle time in sub-seconds.
- head() can be called after as_record_batch_reader(). (#14518)
- as.Date() can go from timestamp[us] to timestamp[s]. (#14935)
- check_dots_empty().

Minor improvements and fixes:
- The .data pronoun in dplyr::group_by() (#14484).

Several new functions can be used in queries:

- dplyr::across() can be used to apply the same computation across multiple columns, and the where() selection helper is supported in across().
- add_filename() can be used to get the filename a row came from (only available when querying a Dataset).
- The slice_* family: dplyr::slice_min(), dplyr::slice_max(), dplyr::slice_head(), dplyr::slice_tail(), and dplyr::slice_sample().

The package now has documentation that lists all dplyr methods and R function mappings that are supported on Arrow data, along with notes about any differences in functionality between queries evaluated in R versus in Acero, the Arrow query engine. See ?acero.
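A rough sketch of these query features; the dataset path and column names here are hypothetical:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")   # hypothetical directory of Parquet files

# across() with the where() helper inside a grouped summary
ds %>%
  group_by(year) %>%                    # `year` is an assumed column
  summarise(across(where(is.numeric), mean)) %>%
  collect()

# Record which file each row came from, then take a few rows
ds %>%
  mutate(file = add_filename()) %>%
  slice_head(n = 5) %>%
  collect()
```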
A few new features and bugfixes were implemented for joins:
- The keep argument is now supported, allowing separate columns for the left and right hand side join keys in join output.
- Full joins now coalesce the join keys (when keep = FALSE), avoiding the issue where the join keys would be all NA for rows in the right hand side without any matches on the left.

Some changes to improve the consistency of the API:

- dplyr::pull() will return a ChunkedArray instead of an R vector by default. The current default behavior is deprecated. To update to the new behavior now, specify pull(as_vector = FALSE) or set options(arrow.pull_as_vector = FALSE) globally.
- Calling dplyr::compute() on a query that is grouped returns a Table instead of a query object.

Finally, long-running queries can now be cancelled and will abort their computation immediately.
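A small sketch of the new pull() and compute() behavior (the table and column here are made up):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(1L, 2L, 2L, 3L))

# Opt in to the future default: pull() returns a ChunkedArray, not an R vector
tbl %>% pull(x, as_vector = FALSE)

# Or opt in globally
options(arrow.pull_as_vector = FALSE)

# compute() on a grouped query now returns a Table
tbl %>% group_by(x) %>% compute()
```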
- as_arrow_array() can now take blob::blob and vctrs::list_of, which convert to binary and list arrays, respectively. Also fixed an issue where as_arrow_array() ignored the type argument when passed a StructArray.
- The unique() function works on Table, RecordBatch, Dataset, and RecordBatchReader.
- write_feather() can take compression = FALSE to choose writing uncompressed files.
- Also, a breaking change for IPC files in write_dataset(): passing "ipc" or "feather" to format will now write files with the .arrow extension instead of .ipc or .feather.
As of version 10.0.0, arrow requires C++17 to build. This means that:
- On Windows, you need R >= 4.0. Version 9.0.0 was the last version to support R 3.6.
- On CentOS 7, you can build the latest version of arrow, but you first need to install a newer compiler than the default system compiler, gcc 4.8. See vignette("install", package = "arrow") for guidance. Note that you only need the newer compiler to build arrow: installing a binary package, as from RStudio Package Manager, or loading a package you've already installed works fine with the system defaults.
- dplyr::union and dplyr::union_all (#13090).
- dplyr::glimpse (#13563).
- show_exec_plan() can be added to the end of a dplyr pipeline to show the underlying plan, similar to dplyr::show_query(). dplyr::show_query() and dplyr::explain() also work and show the same output, but may change in the future. (#13541)
- User-defined functions are supported in queries; use register_scalar_function() to create them. (#13397)
- map_batches() returns a RecordBatchReader and requires that the function it maps returns something coercible to a RecordBatch through the as_record_batch() S3 function. It can also run in streaming fashion if passed .lazy = TRUE. (#13170, #13650)
- Functions can be called with package-qualified names (stringr::, lubridate::) within queries. For example, stringr::str_length will now dispatch to the same kernel as str_length. (#13160)
- Support for the lubridate::parse_date_time() datetime parser: (#12589, #13196, #13506)
  - orders with year, month, day, hours, minutes, and seconds components are supported.
  - The orders argument in the Arrow binding works as follows: orders are transformed into formats which subsequently get applied in turn. There is no select_formats parameter and no inference takes place (as is the case in lubridate::parse_date_time()).
- lubridate date and datetime parsers such as lubridate::ymd(), lubridate::yq(), and lubridate::ymd_hms() (#13118, #13163, #13627).
- lubridate::fast_strptime() (#13174).
- lubridate::floor_date(), lubridate::ceiling_date(), and lubridate::round_date() (#12154).
- strptime() supports the tz argument to pass timezones. (#13190)
- lubridate::qday() (day of quarter).
- exp() and sqrt(). (#13517)
- read_ipc_file() and write_ipc_file() are added. These functions are almost the same as read_feather() and write_feather(), but differ in that they only target IPC files (Feather V2 files), not Feather V1 files.
- read_arrow() and write_arrow(), deprecated since 1.0.0 (July 2020), have been removed. Instead of these, use read_ipc_file() and write_ipc_file() for IPC files, or read_ipc_stream() and write_ipc_stream() for IPC streams. (#13550)
- write_parquet() now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments properties and arrow_properties have been removed; if you need to deal with these lower-level properties objects directly, use ParquetFileWriter, which write_parquet() wraps. (#13555)
- write_dataset() preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (#13105)
- Reading and writing functions (such as write_csv_arrow()) will automatically (de-)compress data if the file path contains a compression extension (e.g. "data.csv.gz"). This works locally as well as on remote filesystems like S3 and GCS. (#13183)
- FileSystemFactoryOptions can be provided to open_dataset(), allowing you to pass options such as which file prefixes to ignore. (#13171)
- By default, S3FileSystem will not create or delete buckets. To enable that, pass the configuration option allow_bucket_creation or allow_bucket_deletion. (#13206)
- GcsFileSystem and gs_bucket() allow connecting to Google Cloud Storage. (#10999, #13601)
- The $num_rows() method returns a double (previously an integer), avoiding integer overflow on larger tables. (#13482, #13514)
- The arrow.dev_repo for nightly builds of the R package and prebuilt libarrow binaries is now https://nightlies.apache.org/arrow/r/.
- open_dataset(): the skip argument for skipping header rows in CSV datasets.
- UnionDataset.
- In {dplyr} queries:
  - Queries can operate on a RecordBatchReader. This allows, for example, results from DuckDB to be streamed back into Arrow rather than materialized before continuing the pipeline.
  - dplyr::rename_with().
  - dplyr::count() returns an ungrouped dataframe.
- write_dataset() has more options for controlling row group and file sizes when writing partitioned datasets, such as max_open_files, max_rows_per_file, min_rows_per_group, and max_rows_per_group.
- write_csv_arrow() accepts a Dataset or an Arrow dplyr query.
- options(use_threads = FALSE) no longer crashes R. That option is set by default on Windows.
- dplyr joins support the suffix argument to handle overlap in column names.
- is.na() no longer misses any rows.
- map_batches() correctly accepts Dataset objects.
- read_csv_arrow()'s readr-style type T is mapped to timestamp(unit = "ns") instead of timestamp(unit = "s").
- {lubridate} features and fixes:
  - lubridate::tz() (timezone), lubridate::semester(), lubridate::dst() (daylight savings time boolean), lubridate::date(), lubridate::epiyear() (year according to the epidemiological week calendar).
  - lubridate::month() works with integer inputs.
  - lubridate::make_date() & lubridate::make_datetime() + base::ISOdatetime() & base::ISOdate() to create date-times from numeric representations.
  - lubridate::decimal_date() and lubridate::date_decimal().
  - lubridate::make_difftime() (duration constructor).
  - lubridate::duration helper functions, such as lubridate::dyears(), lubridate::dhours(), lubridate::dseconds().
  - lubridate::leap_year().
  - lubridate::as_date() and lubridate::as_datetime().
  - base::difftime and base::as.difftime().
  - base::as.Date() to convert to date.
  - base::format().
  - strptime() returns NA instead of erroring in case of format mismatch, just like base::strptime().
- as_arrow_array() and as_arrow_table() for main Arrow objects. This includes Arrow tables, record batches, arrays, chunked arrays, record batch readers, schemas, and data types. This allows other packages to define custom conversions from their types to Arrow objects, including extension arrays. See ?new_extension_type.
- Any object for which vctrs::vec_is() returns TRUE (i.e., any object that can be used as a column in a tibble::tibble()) can be converted, provided that the underlying vctrs::vec_data() can be converted to an Arrow Array.

Arrow arrays and tables can be easily concatenated:

- Arrays can be concatenated with concat_arrays() or, if zero-copy is desired and chunking is acceptable, using ChunkedArray$create().
- ChunkedArrays can be concatenated with c().
- RecordBatches and Tables support cbind().
- Tables support rbind(); concat_tables() is also provided to concatenate tables while unifying schemas.
- sqrt(), log(), and exp() work with Arrow arrays and scalars.
- read_* and write_* functions support R Connection objects for reading and writing files.
- median() and quantile() will warn only once about approximate calculations regardless of interactivity.
- Array$cast() can cast StructArrays into another struct type with the same field names and structure (or a subset of fields) but different field types.
- Fixed a bug where set_io_thread_count() would set the CPU count instead of the IO thread count.
- RandomAccessFile has a $ReadMetadata() method that provides useful metadata provided by the filesystem.
- The grepl binding returns FALSE for NA inputs (previously it returned NA), to match the behavior of base::grepl().
- create_package_with_all_dependencies() works on Windows and Mac OS, instead of only Linux.
- {lubridate} features: week(), more of the is.*() functions, and the label argument to month() have been implemented.
- More complex expressions inside summarize(), such as ifelse(n() > 1, mean(y), mean(z)), are supported.
- tibble and data.frame can be used to create columns of tibbles or data.frames respectively (e.g. ... %>% mutate(df_col = tibble(a, b)) %>% ...).
- Dictionaries (the R factor type) are supported inside of coalesce().
- open_dataset() accepts the partitioning argument when reading Hive-style partitioned files, even though it is not required.
- The map_batches() function for custom operations on datasets has been restored.
- The encoding argument can be specified when reading.
- open_dataset() correctly ignores byte-order marks (BOMs) in CSVs, as already was true for reading single files.
- head() no longer hangs on large CSV datasets.
- write_csv_arrow() now follows the signature of readr::write_csv().
- A $code() method on a schema or type. This allows you to easily get the code needed to create a schema from an object that already has one.
- The Duration type has been mapped to R's difftime class.
- The decimal256() type is supported. The decimal() function has been revised to call either decimal256() or decimal128() based on the value of the precision argument.
- write_parquet() uses a reasonable guess at chunk_size instead of always writing a single chunk. This improves the speed of reading and writing large Parquet files.
- write_parquet() no longer drops attributes for grouped data.frames.
- proxy_options.
- The source build uses pkg-config to search for system dependencies (such as libz) and links to them if present. This new default will make building Arrow from source quicker on systems that have these dependencies installed already. To retain the previous behavior of downloading and building all dependencies, set ARROW_DEPENDENCY_SOURCE=BUNDLED.
- glue, which arrow depends on transitively, has dropped support for it.
- str_count() in dplyr queries.

There are now two ways to query Arrow data:
dplyr::summarize(), both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include n(), n_distinct(), min(), max(), sum(), mean(), var(), sd(), any(), and all(). median() and quantile() with one probability are also supported and currently return approximate results using the t-digest algorithm.

Along with summarize(), you can also call count(), tally(), and distinct(), which effectively wrap summarize().

This enhancement does change the behavior of summarize() and collect() in some cases: see "Breaking changes" below for details.
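A sketch of a grouped aggregation over a multi-file Dataset (the path and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/sales")     # a directory of Parquet files

ds %>%
  group_by(region) %>%
  summarize(
    n_rows        = n(),
    total         = sum(amount, na.rm = TRUE),
    median_amount = median(amount)      # approximate, via t-digest
  ) %>%
  collect()
```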
In addition to summarize(), mutating and filtering equality joins (inner_join(), left_join(), right_join(), full_join(), semi_join(), and anti_join()) are also supported natively in Arrow.

Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.
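For instance, an equality join between a Dataset and a small in-memory lookup table (all names here are illustrative):

```r
library(arrow)
library(dplyr)

orders <- open_dataset("path/to/orders")      # hypothetical dataset

customers <- arrow_table(
  customer_id = c(1L, 2L, 3L),
  segment     = c("retail", "wholesale", "retail")
)

orders %>%
  left_join(customers, by = "customer_id") %>%
  filter(segment == "retail") %>%
  collect()
```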
New non-aggregating compute functions include string functions like str_to_title() and strftime() as well as compute functions for extracting date parts (e.g. year(), month()) from dates. This is not a complete list of additional compute functions; for an exhaustive list of available compute functions see list_compute_functions().
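For example, to see what is available in your build (the pattern argument filters function names):

```r
library(arrow)

# All compute functions exposed by the underlying Arrow C++ library
head(list_compute_functions())

# Only the functions whose names match a pattern
list_compute_functions(pattern = "^min")
```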
We've also worked to fill in support for all data types, such as Decimal, for functions added in previous releases. All type limitations mentioned in previous release notes should no longer be valid, and if you find a function that is not implemented for a certain data type, please report an issue.

If you have the duckdb package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the to_duckdb() function. This allows you to use duckdb's dbplyr methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before to_duckdb() is evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff does not copy the data; instead it uses Arrow's C interface (just like passing Arrow data between R and Python). This means no serialization or data copying costs are incurred.

You can also take a duckdb tbl and call to_arrow() to stream data to Arrow's query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in DuckDB, then evaluate the rest in Arrow.
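A sketch of that round trip (requires the duckdb package; the dataset path and columns are made up):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")        # hypothetical

ds %>%
  filter(year == 2021) %>%                   # evaluated in Arrow before the handoff
  to_duckdb() %>%                            # no-copy handoff via the C interface
  group_by(month) %>%
  summarise(total = sum(value)) %>%          # evaluated by DuckDB
  to_arrow() %>%                             # stream the result back into Arrow
  filter(total > 0) %>%                      # evaluated in Arrow again
  collect()
```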
- Row order of query results is no longer guaranteed; if you need a stable order, explicitly arrange() the query result. For calls to summarize(), you can set options(arrow.summarise.sort = TRUE) to match the current dplyr behavior of sorting on the grouping columns.
- dplyr::summarize() on an in-memory Arrow Table or RecordBatch no longer eagerly evaluates. Call compute() or collect() to evaluate the query.
- head() and tail() also no longer eagerly evaluate, both for in-memory data and for Datasets. Also, because row order is no longer deterministic, they will effectively give you a random slice of data from somewhere in the dataset unless you arrange() to specify sorting.
- Custom metadata such as that used by sf columns is no longer saved by default; convert such columns yourself (e.g. sf::st_as_binary(col)) or use the sfarrow package, which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set options(arrow.preserve_row_level_metadata = TRUE). We will be removing this option in a coming release. We strongly recommend avoiding this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides sf columns that is impacted by this, please report an issue.
- A smaller installation can be requested with LIBARROW_MINIMAL=true. This will have the core Arrow/Feather components but excludes Parquet, Datasets, compression libraries, and other optional features.
- The create_package_with_all_dependencies() function (also available on GitHub without installing the arrow package) will download all third-party C++ dependencies and bundle them inside the R source package. Run this function on a system connected to the network to produce the "fat" source package, then copy that .tar.gz package to your offline machine and install.
- System dependencies (such as libz) can be used by setting ARROW_DEPENDENCY_SOURCE=AUTO. This is not the default in this release (BUNDLED, i.e. download and build all dependencies) but may become the default in the future.
- The JSON reading features (read_json_arrow()) are now optional and still on by default; set ARROW_JSON=OFF before building to disable them.
- options(arrow.use_altrep = FALSE).
- Field objects can now be created as non-nullable, and schema() now optionally accepts a list of Fields (see the sketch after this list).
- write_parquet() no longer errors when used with a grouped data.frame.
- case_when() now errors cleanly if an expression is not supported in Arrow.
- open_dataset() now works on CSVs without header rows.
- Fixed an issue where the readr-style types T and t were reversed in read_csv_arrow().
- Fixed log(..., base = b) where b is something other than 2, e, or 10.
- Table$create() now has the alias arrow_table().

This patch version contains fixes for some sanitizer and compiler warnings.
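As referenced above, a small sketch of building a schema from Field objects, including a non-nullable field:

```r
library(arrow)

fields <- list(
  field("id", int32(), nullable = FALSE),   # non-nullable column
  field("name", utf8())
)

sch <- schema(fields)
sch
```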
There are now more than 250 compute functions available for use in dplyr::filter(), mutate(), etc. Additions in this release include:

- String functions: strsplit() and str_split(); strptime(); paste(), paste0(), and str_c(); substr() and str_sub(); str_like(); str_pad(); stri_reverse().
- lubridate methods such as year(), month(), wday(), and so on.
- Math functions: logarithms (log() et al.); trigonometry (sin(), cos(), et al.); abs(); sign(); pmin() and pmax(); ceiling(), floor(), and trunc().
- Conditional functions, with some limitations: ifelse() and if_else() for all but Decimal types; case_when() for logical, numeric, and temporal types only; coalesce() for all but lists/structs. Note also that in this release, factors/dictionaries are converted to strings in these functions.
- is.* functions are supported and can be used inside relocate().

The print method for arrow_dplyr_query now includes the expression and the resulting type of columns derived by mutate().

transmute() now errors if passed arguments .keep, .before, or .after, for consistency with the behavior of dplyr on data.frames.
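A small sketch using a few of the bindings listed above (the table and columns are made up):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(-1.5, 2.25, NA), s = c("a", "bb", "ccc"))

tbl %>%
  mutate(
    abs_x  = abs(x),
    ceil_x = ceiling(x),
    s_len  = str_length(s),
    s_pad  = str_pad(s, width = 4, pad = "0")
  ) %>%
  collect()
```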
- write_csv_arrow() to use Arrow to write a data.frame to a single CSV file.
- write_dataset(format = "csv", ...) to write a Dataset to CSVs, including with partitioning.
- reticulate::py_to_r() and r_to_py() methods. Along with the addition of the Scanner$ToRecordBatchReader() method, you can now build up a Dataset query in R and pass the resulting stream of batches to another tool in process.
- C interface methods are exposed (e.g. Array$export_to_c(), RecordBatch$import_from_c()), similar to how they are in pyarrow. This facilitates their use in other packages. See the py_to_r() and r_to_py() methods for usage examples.
- Converting an R data.frame to an Arrow Table uses multithreading across columns.
- options(arrow.use_altrep = FALSE).
- is.na() now evaluates to TRUE on NaN values in floating point number fields, for consistency with base R.
- is.nan() now evaluates to FALSE on NA values in floating point number fields and FALSE on all values in non-floating point fields, for consistency with base R.
- Additional methods for Array, ChunkedArray, RecordBatch, and Table: na.omit() and friends, any()/all().
- Scalar values in RecordBatch$create() and Table$create() are recycled.
- arrow_info() includes details on the C++ build, such as compiler version.
- match_arrow() now converts x into an Array if it is not a Scalar, Array or ChunkedArray and no longer dispatches base::match().
- The full build (LIBARROW_MINIMAL=false) includes both jemalloc and mimalloc, and it still has jemalloc as default, though this is configurable at runtime with the ARROW_DEFAULT_MEMORY_POOL environment variable.
- LIBARROW_MINIMAL, LIBARROW_DOWNLOAD, and NOT_CRAN are now case-insensitive in the Linux build script.

Many more dplyr verbs are supported on Arrow objects:

- dplyr::mutate() is now supported in Arrow for many applications. For queries on Table and RecordBatch that are not yet supported in Arrow, the implementation falls back to pulling data into an in-memory R data.frame first, as in the previous release. For queries on Dataset (which can be larger than memory), it raises an error if the function is not implemented. The main mutate() features that cannot yet be called on Arrow objects are (1) mutate() after group_by() (which is typically used in combination with aggregation) and (2) queries that use dplyr::across().
- dplyr::transmute() (which calls mutate()).
- dplyr::group_by() now preserves the .drop argument and supports on-the-fly definition of columns.
- dplyr::relocate() to reorder columns.
- dplyr::arrange() to sort rows.
- dplyr::compute() to evaluate the lazy expressions and return an Arrow Table. This is equivalent to dplyr::collect(as_data_frame = FALSE), which was added in 2.0.0.

Over 100 functions can now be called on Arrow objects inside a dplyr verb:

- nchar(), tolower(), and toupper(), along with their stringr spellings str_length(), str_to_lower(), and str_to_upper(), are supported in Arrow dplyr calls. str_trim() is also supported.
- sub(), gsub(), and grepl(), along with str_replace(), str_replace_all(), and str_detect(), are supported.
- cast(x, type) and dictionary_encode() allow changing the type of columns in Arrow objects; as.numeric(), as.character(), etc. are exposed as similar type-altering conveniences.
- dplyr::between(); the Arrow version also allows the left and right arguments to be columns in the data and not just scalars.
- Arrow compute functions can be called directly inside a dplyr verb. This enables you to access Arrow functions that don't have a direct R mapping. See list_compute_functions() for all available functions, which are available in dplyr prefixed by arrow_.
- dplyr::filter(arrow_dataset, string_column == 3) will error with a message about the type mismatch between the numeric 3 and the string type of string_column.
- open_dataset() now accepts a vector of file paths (or even a single file path). Among other things, this enables you to open a single very large file and use write_dataset() to partition it without having to read the whole file into memory.
- write_dataset() now defaults to format = "parquet" and better validates the format argument.
- Invalid input for schema in open_dataset() is now correctly handled.
- The Scanner$Scan() method has been removed; use Scanner$ScanBatches().
- value_counts() to tabulate values in an Array or ChunkedArray, similar to base::table().
- StructArray objects gain data.frame-like methods, including names(), $, [[, and dim().
- Columns can be modified by assignment (<-) with either $ or [[.
- The Schema can now be edited by assigning in new types. This enables using the CSV reader to detect the schema of a file, modify the Schema object for any columns that you want to read in as a different type, and then use that Schema to read the data.
- Creating a Table with a schema, with columns of different lengths, and with scalar value recycling.
- When string data contains embedded nul (\0) characters, the error message now informs you that you can set options(arrow.skip_nul = TRUE) to strip them out. It is not recommended to set this option by default since this code path is significantly slower, and most string data does not contain nuls.
- read_json_arrow() now accepts a schema: read_json_arrow("file.json", schema = schema(col_a = float64(), col_b = string())).
- See vignette("install", package = "arrow") for details. This allows a faster, smaller package build in cases where that is useful, and it enables a minimal, functioning R package build on Solaris.
- FORCE_BUNDLED_BUILD=true.
- arrow now uses the mimalloc memory allocator by default on macOS, if available (as it is in CRAN binaries), instead of jemalloc. There are configuration issues with jemalloc on macOS, and benchmark analysis shows that this has negative effects on performance, especially on memory-intensive workflows. jemalloc remains the default on Linux; mimalloc is default on Windows.
- Setting the ARROW_DEFAULT_MEMORY_POOL environment variable to switch memory allocators now works correctly when the Arrow C++ library has been statically linked (as is usually the case when installing from CRAN).
- The arrow_info() function now reports on the additional optional features, as well as the detected SIMD level. If key features or compression libraries are not enabled in the build, arrow_info() will refer to the installation vignette for guidance on how to install a more complete build, if desired.
- vignette("developing", package = "arrow").
- You can use ARROW_HOME to point to a specific directory where the Arrow libraries are. This is similar to passing INCLUDE_DIR and LIB_DIR.
- flight_get() and flight_put() (renamed from push_data() in this release) can handle both Tables and RecordBatches.
- flight_put() gains an overwrite argument to optionally check for the existence of a resource with the same name.
- list_flights() and flight_path_exists() enable you to see available resources on a Flight server.
- Schema objects now have r_to_py and py_to_r methods.
- Arithmetic operations (+, *, etc.) are supported on Arrays and ChunkedArrays and can be used in filter expressions in Arrow dplyr pipelines.
- Columns can be added, replaced, or removed by assignment (<-) with either $ or [[.
- names().
- The rlang pronouns .data and .env are now fully supported in Arrow dplyr pipelines.
- New option arrow.skip_nul (default FALSE, as in base::scan()) allows conversion of Arrow string (utf8()) type data containing embedded nul \0 characters to R. If set to TRUE, nuls will be stripped and a warning is emitted if any are found.
- arrow_info() for an overview of various run-time and build-time Arrow configurations, useful for debugging.
- Set the environment variable ARROW_DEFAULT_MEMORY_POOL before loading the Arrow package to change memory allocators. Windows packages are built with mimalloc; most others are built with both jemalloc (used by default) and mimalloc. These alternative memory allocators are generally much faster than the system memory allocator, so they are used by default when available, but sometimes it is useful to turn them off for debugging purposes. To disable them, set ARROW_DEFAULT_MEMORY_POOL=system.
- sf tibbles are faithfully preserved and roundtripped (#8549).
- See schema() for more details.
- write_parquet() can now write RecordBatches.
- readr's problems attribute is removed when converting to an Arrow RecordBatch or Table to prevent large amounts of metadata from accumulating inadvertently (#9092).
- SubTreeFileSystem gains a useful print method and no longer errors when printing.
- Nightly development versions of the conda r-arrow package are available with conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow.
- cmake versions.
- See vignette("install", package = "arrow"), especially for known CentOS issues.
- OS detection uses the distro package. If your OS isn't correctly identified, please report an issue there.
- write_dataset() to write Feather or Parquet files with partitioning. See the end of vignette("dataset", package = "arrow") for discussion and examples.
- head(), tail(), and take ([) methods. head() is optimized but the others may not be performant.
- collect() gains an as_data_frame argument, default TRUE but when FALSE allows you to evaluate the accumulated select and filter query but keep the result in Arrow, not an R data.frame.
- read_csv_arrow() supports specifying column types, both with a Schema and with the compact string representation for types used in the readr package. It also has gained a timestamp_parsers argument that lets you express a set of strptime parse strings that will be tried to convert columns designated as Timestamp type.
- S3 support requires libcurl and openssl, as well as a sufficiently modern compiler. See vignette("install", package = "arrow") for details.
- File readers and writers (read_parquet(), write_feather(), et al.), as well as open_dataset() and write_dataset(), allow you to access resources on S3 (or on file systems that emulate S3) either by providing an s3:// URI or by providing a FileSystem$path(). See vignette("fs", package = "arrow") for examples.
- copy_files() allows you to recursively copy directories of files from one file system to another, such as from S3 to your local machine.

Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces. The arrow R package now provides methods for connecting to Flight RPC servers to send and receive data.
See vignette("flight", package = "arrow") for an overview.
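A minimal sketch of talking to a Flight server (the host, port, and path are assumptions, and a running Flight server is required):

```r
library(arrow)

client <- flight_connect(host = "localhost", port = 8089)

# Upload a data.frame under a path, then read it back as a Table
flight_put(client, data.frame(x = 1:3), path = "example/data")
flight_get(client, "example/data")

# List resources available on the server
list_flights(client)
```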
- Comparison (==, >, etc.) and boolean (&, |, !) operations, along with is.na, %in% and match (called match_arrow()), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
- min(), max(), and unique() are implemented for Arrays and ChunkedArrays.
- dplyr filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
- dim() (nrow) for dplyr queries on Table/RecordBatch is now supported.
- arrow now depends on cpp11, which brings more robust UTF-8 handling and faster compilation.
- Conversion of the Int64 type when all values fit in an R 32-bit integer now correctly inspects all chunks in a ChunkedArray, and this conversion can be disabled (so that Int64 always yields a bit64::integer64 vector) by setting options(arrow.int64_downcast = FALSE).
- ParquetFileReader has additional methods for accessing individual columns or row groups from the file.
- Fixes involving ParquetFileWriter, an invalid ArrowObject pointer from a saved R object, and converting deeply nested structs from Arrow to R.
- The properties and arrow_properties arguments to write_parquet() are deprecated.
- A %in% expression now faithfully returns all relevant rows.
- Dataset file paths can contain . or _; files and subdirectories starting with those prefixes are still ignored.
- open_dataset("~/path") now correctly expands the path.
- The version option to write_parquet() is now correctly implemented.
- The parquet-cpp library has been fixed.
- Detection of cmake is more robust, and you can now specify a /path/to/cmake by setting the CMAKE environment variable.
- vignette("arrow", package = "arrow") includes tables that explain how R types are converted to Arrow types and vice versa.
- Additional type support: uint64, binary, fixed_size_binary, large_binary, large_utf8, large_list, list of structs.
- character vectors that exceed 2GB are converted to the Arrow large_utf8 type.
- POSIXlt objects can now be converted to Arrow (struct).
- R attributes() are preserved in Arrow metadata when converting to an Arrow RecordBatch or Table and are restored when converting from Arrow. This means that custom subclasses, such as haven::labelled, are preserved in round trip through Arrow.
- Metadata can be assigned, e.g. batch$metadata$new_key <- "new value".
- int64, uint32, and uint64 are now converted to R integer if all values fit in bounds.
- date32 is now converted to R Date with double underlying storage. Even though the data values themselves are integers, this provides more strict round-trip fidelity.
- When converting to an R factor, dictionary ChunkedArrays that do not have identical dictionaries are properly unified.
- RecordBatch{File,Stream}Writer will write V5, but you can specify an alternate metadata_version. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 to write V4 without changing any other code.
- Datasets can be opened directly from S3, e.g. ds <- open_dataset("s3://..."). Note that this currently requires a special C++ library build with additional dependencies; this is not yet available in CRAN releases or in nightly packages.
- sum() and mean() are implemented for Array and ChunkedArray.
- dimnames() and as.list().
- reticulate.
- The coerce_timestamps option to write_parquet() is now correctly implemented.
- The type definition is honored if provided by the user.
- read_arrow and write_arrow are now deprecated; use the read/write_feather() and read/write_ipc_stream() functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
- The previously deprecated FileStats, read_record_batch, and read_table have been removed.
- jemalloc is included, and Windows packages use mimalloc.
- The CC and CXX values that R uses.
- dplyr 1.0.
- reticulate::r_to_py() conversion now correctly works automatically, without having to call the method yourself.

This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and it allows files to be compressed using either lz4 or zstd. write_feather() can write either version 2 or version 1 Feather files, and read_feather() automatically detects which file version it is reading.
Related to this change, several functions around reading and writing data have been reworked. read_ipc_stream() and write_ipc_stream() have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 is the IPC file format).

Behavior has been standardized: all read_<format>() functions return an R data.frame (default) or a Table if the argument as_data_frame = FALSE; all write_<format>() functions return the data object, invisibly. To facilitate some workflows, a special write_to_raw() function is added to wrap write_ipc_stream() and return the raw vector containing the buffer that was written.

To achieve this standardization, read_table(), read_record_batch(), read_arrow(), and write_arrow() have been deprecated.
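For example (a sketch using a temporary file):

```r
library(arrow)

tf <- tempfile(fileext = ".feather")

# write_* functions return their input invisibly
write_feather(mtcars, tf)

# read_* functions return a data.frame by default, or a Table with as_data_frame = FALSE
df  <- read_feather(tf)
tab <- read_feather(tf, as_data_frame = FALSE)

# Serialize to an IPC stream held in a raw vector
buf <- write_to_raw(mtcars)
read_ipc_stream(buf)
```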
The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use reticulate to share data between R and Python (pyarrow) efficiently.

See vignette("python", package = "arrow") for details.
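A sketch of sharing a Table with pyarrow through reticulate (this assumes pyarrow is installed in the Python environment that reticulate uses):

```r
library(arrow)
library(reticulate)

pa  <- import("pyarrow")
tab <- Table$create(x = 1:3, y = c("a", "b", "c"))

# Hand the Table to Python without copying the data
py_tab <- r_to_py(tab)
py_tab$num_rows

# And bring a pyarrow object back into R
r_tab <- py_to_r(py_tab)
```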
- A dim() method, which sums rows across all files (#6635).
- Combine multiple datasets into a UnionDataset with the c() method.
- Dataset filtering treats NA as FALSE, consistent with dplyr::filter().
- vignette("dataset", package = "arrow") now has correct, executable code.
- NOT_CRAN=true. See vignette("install", package = "arrow") for details and more options.
- unify_schemas() to create a Schema containing the union of fields in multiple schemas.
- read_feather() and other reader functions close any file connections they open.
- Fixed behavior when the R.oo package is also loaded.
- FileStats is renamed to FileInfo, and the original spelling has been deprecated.
- install_arrow() now installs the latest release of arrow, including Linux dependencies, either for CRAN releases or for development builds (if nightly = TRUE).
- When the LIBARROW_DOWNLOAD or NOT_CRAN environment variable is set.
- write_feather(), write_arrow() and write_parquet() now return their input, similar to the write_* functions in the readr package (#6387, @boshek).
- R list objects are converted to a ListArray when all list elements are the same type (#6275).

This release includes a dplyr interface to Arrow Datasets, which let you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with open_dataset() and then use dplyr methods to select(), filter(), etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. dplyr methods are conditionally loaded if you have dplyr available; it is not a hard dependency.

See vignette("dataset", package = "arrow") for details.
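A sketch of the basic workflow (the directory and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_directory")  # many files, one logical dataset

ds %>%
  select(passenger_count, fare_amount) %>%
  filter(passenger_count > 1) %>%
  collect()                                      # only now is data pulled into R
```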
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.

See vignette("install", package = "arrow") for details.
- Tables and RecordBatches also have dplyr methods.
- In addition to dplyr, [ methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ Filter, Slice, and Take methods for efficient access, depending on the type of selection vector.
- An array_expression class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as arrow_table[arrow_table$var1 > 5, ] without having to pull everything into R first.
- write_parquet() now supports compression.
- codec_is_available() returns TRUE or FALSE depending on whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy).
- Non-character values are converted to character (as R factor levels are required to be) instead of raising an error.
- Objects are now created with Class$create() methods. Notably, arrow::array() and arrow::table() have been removed in favor of Array$create() and Table$create(), eliminating the package startup message about masking base functions. For more information, see the new vignette("arrow").
- ARROW_PRE_0_15_IPC_FORMAT=1.
- The as_tibble argument in the read_*() functions has been renamed to as_data_frame (#5399).
- The arrow::Column class has been removed, as it was removed from the C++ library.
- Table and RecordBatch objects have S3 methods that enable you to work with them more like data.frames. Extract columns, subset, and so on. See ?Table and ?RecordBatch for examples.
- read_csv_arrow() supports more parsing options, including col_names, na, quoted_na, and skip.
- read_parquet() and read_feather() can ingest data from a raw vector (#5141).
- File paths that need expanding, such as ~/file.parquet, are handled properly (#5169).
- double() and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (#5198, #5201)

Initial CRAN release of the arrow package. Key features include: