
Installing on Linux

Source: vignettes/install.Rmd

In most cases, install.packages("arrow") should just work. There are things you can do to make the installation faster, documented in this article. If for some reason installation does not work, set the environment variable ARROW_R_DEV=true, retry, and share the logs with us.

Background

The Apache Arrow project is implemented in multiple languages, and the R package depends on the Arrow C++ library (referred to from here on as libarrow). This means that when you install arrow, you need both the R and C++ versions. If you install arrow from CRAN on a machine running Windows or macOS, when you call install.packages("arrow"), a precompiled binary containing both the R package and libarrow will be downloaded. However, CRAN does not host R package binaries for Linux, and so you must choose from one of the alternative approaches.

This article outlines the recommended approaches to installing arrow on Linux, starting from the simplest and least customizable to the most complex but with more flexibility to customize your installation.

The primary audience for this document is arrow R package users on Linux, and not Arrow developers. Additional resources for developers are listed at the end of this article.

System dependencies

The arrow package is designed to work with very minimal system requirements, but there are a few things to note.

Compilers

As of version 10.0.0, arrow requires a C++17 compiler to build. For gcc, this generally means version 7 or newer. Most contemporary Linux distributions have a new enough compiler; however, CentOS 7 is a notable exception, as it ships with gcc 4.8.
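As a rough pre-flight check, the compiler requirement can be tested from R before starting a long source build. This is only a sketch: the helper name gcc_meets_minimum() is hypothetical and not part of arrow, and it assumes gcc is on the PATH and prints a bare version string such as 9.4.0.

```r
# Hypothetical helper (not part of arrow): TRUE if a gcc version string is at
# least the given minimum (7, per the text above). Unparseable input gives FALSE.
gcc_meets_minimum <- function(version_string, minimum = "7") {
  isTRUE(numeric_version(version_string, strict = FALSE) >= numeric_version(minimum))
}

# Ask gcc for its version, if gcc is installed at all.
if (nzchar(Sys.which("gcc"))) {
  gcc_version <- system2("gcc", c("-dumpfullversion", "-dumpversion"), stdout = TRUE)
  if (length(gcc_version) > 0) {
    message("gcc ", gcc_version[1], " new enough: ", gcc_meets_minimum(gcc_version[1]))
  }
}
```

If the check reports FALSE, a newer compiler is likely needed before a source build can succeed.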

Libraries

Optional support for reading from cloud storage, namely AWS S3 and Google Cloud Storage (GCS), requires additional system dependencies:

  • CURL: install libcurl-devel (rpm) or libcurl4-openssl-dev (deb)
  • OpenSSL >= 1.0.2: install openssl-devel (rpm) or libssl-dev (deb)

The prebuilt binaries come with S3 and GCS support enabled, so you will need to meet these system requirements in order to use them. If you’re building everything from source, the install script will check for the presence of these dependencies and turn off S3 and GCS support in the build if the prerequisites are not met; installation will succeed but without S3 or GCS functionality. If afterwards you install the missing system requirements, you’ll need to reinstall the package in order to enable S3 and GCS support.
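For scripted setups, the package names listed above can be kept in a small lookup. The sketch below is illustrative only: s3_gcs_system_packages() is a hypothetical helper, not an arrow function, and it simply returns the names given in this section for each package format.

```r
# Hypothetical helper (not part of arrow): system packages needed for
# S3/GCS support, as listed above, keyed by package format.
s3_gcs_system_packages <- function(format = c("deb", "rpm")) {
  format <- match.arg(format)
  switch(format,
    deb = c("libcurl4-openssl-dev", "libssl-dev"),
    rpm = c("libcurl-devel", "openssl-devel")
  )
}

# Print the names to pass to your system package manager.
cat(s3_gcs_system_packages("deb"), sep = "\n")
```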

Install release version (easy way)

On macOS and Windows, when you run install.packages("arrow") and install arrow from CRAN, you get an R binary package that contains a precompiled version of libarrow. Installing binaries is much easier than installing from source, but CRAN does not host binaries for Linux. This means that the default behaviour when you run install.packages() on Linux is to retrieve the source version of the R package and compile both the R package and libarrow from source. We’ll talk about this scenario in the next section (the “less easy” way), but first we’ll suggest two faster alternatives that are usually much easier.

Binary R package with libarrow binary via RSPM/conda

Graphic showing R and C++ logo inside the package icon

If you want a quicker installation process, and by default a more fully-featured build, you could install arrow from RStudio’s public package manager, which hosts binaries for both Windows and Linux.

For example, if you are using Ubuntu 20.04 (Focal):

options(
  HTTPUserAgent = sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
  )
)

install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest")

Note that the User Agent header must be specified as in the example above. Please check the RStudio Package Manager: Admin Guide for more details.

For other Linux distributions, to get the relevant URL, you can visit the RSPM site, click on ‘binary’, and select your preferred distribution.
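The repository URL in the Ubuntu example appears to embed the distribution codename. The sketch below assumes other distributions follow the same pattern, which should be confirmed on the RSPM site; rspm_repo_url() is a hypothetical helper, not part of arrow or RSPM.

```r
# Hypothetical helper (not part of arrow): build a package manager repo URL
# following the pattern of the Ubuntu 20.04 (focal) example above.
# Assumption: other distributions use the same URL layout.
rspm_repo_url <- function(distro_codename) {
  sprintf(
    "https://packagemanager.rstudio.com/all/__linux__/%s/latest",
    distro_codename
  )
}

rspm_repo_url("focal")
```

The result can be passed as the repos argument of install.packages(), together with the HTTPUserAgent option shown earlier.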

Similarly, if you use conda to manage your R environment, you can get the latest official release of the R package including libarrow via:

# Using the --strict-channel-priority flag on `conda install` causes very long
# solve times, so we add it directly to the config
conda config --set channel_priority strict
conda install -c conda-forge r-arrow

R source package with libarrow binary

Graphic showing R logo in folder icon, then a plus sign, then C++ logo inside the package icon

Another way of achieving faster installation with all key features enabled is to use static libarrow binaries we host. These are used automatically on many Linux distributions (x86_64 architecture only), according to the allowlist. If your distribution isn’t in the list, you can opt in by setting the NOT_CRAN environment variable before you call install.packages():

Sys.setenv("NOT_CRAN" = "true")
install.packages("arrow")

This installs the source version of the R package, but during the installation process will check for compatible libarrow binaries that we host and use those if available. If no binary is available or can’t be found, then this option falls back to the full source build described under “Install release version (less easy)” below, but setting the environment variable results in a more fully-featured build than default.

The libarrow binaries include support for AWS S3 and GCS, so they require the libcurl and openssl libraries installed separately, as noted above. If you don’t have these installed, the libarrow binary won’t be used, and you will fall back to the full source build (with S3 and GCS support disabled).

If the internet access of your computer doesn’t allow downloading the libarrow binaries (e.g. if access is limited to CRAN), you can first identify the right source and version by trying to install on the offline computer:

Sys.setenv("NOT_CRAN" = "true", "LIBARROW_BUILD" = FALSE, "ARROW_R_DEV" = TRUE)
install.packages("arrow")
# This will fail if no internet access, but will print the binaries URL

Then you can obtain the libarrow binaries (using a computer with internet access) and transfer the zip file to the target computer. Now you just have to tell the installer to use that pre-downloaded file:

# Watch out: release numbers of the pre-downloaded libarrow must match CRAN!
Sys.setenv("ARROW_DOWNLOADED_BINARIES" = "/path/to/downloaded/libarrow.zip")
install.packages("arrow")

Install release version (less easy)

Graphic showing R inside a folder icon, then a plus sign, then C++ logo inside a folder icon

The “less easy” way to install arrow is to install both the R package and the underlying Arrow C++ library (libarrow) from source. This method is somewhat more difficult because compiling and installing R packages with C++ dependencies generally requires installing system packages, which you may not have privileges to do, and/or building the C++ dependencies separately, which introduces all sorts of additional ways for things to go wrong.

Installing from the full source build of arrow, compiling both C++ and R bindings, will handle most of the dependency management for you, but it is much slower than using binaries. However, if using binaries isn’t an option for you, or you wish to customize your Linux installation, the instructions in this section explain how to do that.

Basic configuration

If you wish to install libarrow from source instead of looking for pre-compiled binaries, you can set the LIBARROW_BINARY variable.

Sys.setenv("LIBARROW_BINARY" = FALSE)

By default, this is set to TRUE, and so libarrow will only be built from source if this environment variable is set to FALSE or no compatible binary for your OS can be found.

When compiling libarrow from source, you have the power to really fine-tune which features to install. You can set the environment variable LIBARROW_MINIMAL to FALSE to enable a more full-featured build including S3 support and alternative memory allocators.

Sys.setenv("LIBARROW_MINIMAL" = FALSE)

By default this variable is unset, which builds many commonly used features such as Parquet support but disables some features that are more costly to build, like S3 and GCS support. If set to TRUE, a trimmed-down version of arrow is installed with all optional features disabled.

Note that in this guide, you will have seen us mention the environment variable NOT_CRAN. This is a convenience variable which, when set to TRUE, automatically sets LIBARROW_MINIMAL to FALSE and LIBARROW_BINARY to TRUE.
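The effect of NOT_CRAN can be pictured as filling in defaults for the other two variables. The sketch below models that behaviour on a plain named character vector; it is not the package's actual configure code, the function name resolve_not_cran() is hypothetical, and the "only when not already set" rule follows the NOT_CRAN description later in this article.

```r
# Sketch of the documented NOT_CRAN behaviour; this is not the package's
# actual configure code. `env` is a named character vector of variables.
resolve_not_cran <- function(env) {
  is_true <- function(x) !is.na(x) && tolower(x) == "true"
  if (is_true(env["NOT_CRAN"])) {
    # Only fill in values the user has not set explicitly.
    if (is.na(env["LIBARROW_BINARY"])) env["LIBARROW_BINARY"] <- "true"
    if (is.na(env["LIBARROW_MINIMAL"])) env["LIBARROW_MINIMAL"] <- "false"
  }
  env
}

resolve_not_cran(c(NOT_CRAN = "true"))
```

An explicit LIBARROW_MINIMAL or LIBARROW_BINARY value passed in the vector is left untouched, which mirrors the precedence described for the real variables.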

Building libarrow from source requires more time and resources than installing a binary. We recommend that you set the environment variable ARROW_R_DEV to TRUE for more verbose output during the installation process if anything goes wrong.

Sys.setenv("ARROW_R_DEV" = TRUE)

Once you have set these variables, call install.packages() to install arrow using this configuration.

The section below discusses environment variables you can set before calling install.packages("arrow") to build from source and customise your configuration.

Handling libarrow dependencies

When you build libarrow from source, its dependencies will be automatically downloaded. The environment variable ARROW_DEPENDENCY_SOURCE controls whether the libarrow installation also downloads or installs all dependencies (when set to BUNDLED), uses only system-installed dependencies (when set to SYSTEM) or checks system-installed dependencies first and only installs dependencies which aren’t already present (when set to AUTO, the default).
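Because only three values are meaningful here, a small wrapper can guard against typos before a long build starts. This is a sketch under that assumption; set_dependency_source() is a hypothetical helper, not an arrow function.

```r
# Hypothetical helper (not part of arrow): validate and set
# ARROW_DEPENDENCY_SOURCE before calling install.packages("arrow").
set_dependency_source <- function(mode = c("AUTO", "BUNDLED", "SYSTEM")) {
  mode <- match.arg(mode)  # errors on anything other than the three values
  Sys.setenv(ARROW_DEPENDENCY_SOURCE = mode)
  invisible(mode)
}

# Download and build every dependency rather than using system copies.
set_dependency_source("BUNDLED")
Sys.getenv("ARROW_DEPENDENCY_SOURCE")
```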

These dependencies vary by platform; however, if you wish to install these yourself prior to libarrow installation, we recommend that you take a look at the docker file for whichever of our CI builds (the ones ending in “cpp” are for building Arrow’s C++ libraries, aka libarrow) corresponds most closely to your setup. This will contain the most up-to-date information about dependencies and minimum versions.

If downloading dependencies at build time is not an option, as when building on a system that is disconnected or behind a firewall, there are a few options. See “Offline builds” below.

Dependencies for S3 and GCS support

Support for working with data in S3 and GCS is not enabled in the default source build, and it has additional system requirements as described above. To enable it, set the environment variable LIBARROW_MINIMAL=false or NOT_CRAN=true to choose the full-featured build, or more selectively set ARROW_S3=ON and/or ARROW_GCS=ON.

When either feature is enabled, the install script will check for the presence of the required dependencies, and if the prerequisites are not met, it will turn off S3 and GCS support; installation will succeed but without S3 or GCS functionality. If afterwards you install the missing system requirements, you’ll need to reinstall the package in order to enable S3 and GCS support.
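Since a build can succeed with cloud support silently switched off, it is worth checking the result afterwards. The sketch below sets the selective flags mentioned above and, if arrow is already installed, reports what the installed build supports; the install call is left commented out so the snippet has no side effects beyond setting environment variables.

```r
# Request S3 and GCS support for the next source build.
Sys.setenv(ARROW_S3 = "ON", ARROW_GCS = "ON")
# install.packages("arrow")  # uncomment to build with these settings

# After installation, report whether the features were actually compiled in.
if (requireNamespace("arrow", quietly = TRUE)) {
  print(c(s3 = arrow::arrow_with_s3(), gcs = arrow::arrow_with_gcs()))
}
```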

Advanced configuration

In this section, we describe how to fine-tune your installation at a more granular level.

libarrow configuration

Some features are optional when you build Arrow from source; you can configure whether these components are built via the use of environment variables. The names of the environment variables which control these features and their default values are shown below.

| Name | Description | Default Value |
| --- | --- | --- |
| ARROW_S3 | S3 support (if dependencies are met)* | OFF |
| ARROW_GCS | GCS support (if dependencies are met)* | OFF |
| ARROW_JEMALLOC | The jemalloc memory allocator | ON |
| ARROW_MIMALLOC | The mimalloc memory allocator | ON |
| ARROW_PARQUET | | ON |
| ARROW_DATASET | | ON |
| ARROW_JSON | The JSON parsing library | ON |
| ARROW_WITH_RE2 | The RE2 regular expression library, used in some string compute functions | ON |
| ARROW_WITH_UTF8PROC | The UTF8Proc string library, used in many other string compute functions | ON |
| ARROW_WITH_BROTLI | Compression algorithm | ON |
| ARROW_WITH_BZ2 | Compression algorithm | ON |
| ARROW_WITH_LZ4 | Compression algorithm | ON |
| ARROW_WITH_SNAPPY | Compression algorithm | ON |
| ARROW_WITH_ZLIB | Compression algorithm | ON |
| ARROW_WITH_ZSTD | Compression algorithm | ON |
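Several of these flags can be set together before a source build. The sketch below converts a named logical vector into ON/OFF environment variables; set_libarrow_features() is a hypothetical convenience wrapper, not an arrow function, and the chosen features are only an example.

```r
# Hypothetical helper (not part of arrow): turn a named logical vector into
# ON/OFF environment variables for the libarrow build.
set_libarrow_features <- function(features) {
  values <- ifelse(features, "ON", "OFF")
  names(values) <- names(features)
  do.call(Sys.setenv, as.list(values))
  invisible(values)
}

# Example: enable S3 but skip two of the compression libraries.
set_libarrow_features(c(
  ARROW_S3 = TRUE,
  ARROW_WITH_BROTLI = FALSE,
  ARROW_WITH_BZ2 = FALSE
))
```

The variables only take effect for a subsequent install.packages("arrow") call that builds libarrow from source.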

R package configuration

There are a number of other variables that affect the configure script and the bundled build script. All boolean variables are case-insensitive.

| Name | Description | Default |
| --- | --- | --- |
| LIBARROW_BUILD | Allow building from source | true |
| LIBARROW_BINARY | Try to install libarrow binary instead of building from source | (unset) |
| LIBARROW_DOWNLOAD | Set to false to explicitly forbid fetching a libarrow binary | (unset) |
| LIBARROW_MINIMAL | Build with minimal features enabled | (unset) |
| NOT_CRAN | Set LIBARROW_BINARY=true and LIBARROW_MINIMAL=false | false |
| ARROW_R_DEV | More verbose messaging and regenerates some code | false |
| ARROW_USE_PKG_CONFIG | Use pkg-config to search for libarrow install | true |
| LIBARROW_DEBUG_DIR | Directory to save source build logs | (unset) |
| CMAKE | Alternative CMake path | (unset) |

See below for more in-depth explanations of these environment variables.

  • LIBARROW_BINARY: By default on many distributions, or if explicitly set to true, the script will determine whether there is a prebuilt libarrow that will work with your system. You can set it to false to skip this option altogether, or you can specify a string “distro-version” that corresponds to a binary that is available, to override what this function may discover by default. Possible values are: “linux-openssl-1.0”, “linux-openssl-1.1”, “linux-openssl-3.0”.
  • LIBARROW_BUILD: If set to false, the build script will not attempt to build the C++ from source. This means you will only get a working arrow R package if a prebuilt binary is found. Use this if you want to avoid compiling the C++ library, which may be slow and resource-intensive, and ensure that you only use a prebuilt binary.
  • LIBARROW_MINIMAL: If set to false, the build script will enable some optional features, including S3 support and additional alternative memory allocators. This will increase the source build time but results in a more fully functional library. If set to true, it turns off Parquet, Datasets, compression libraries, and other optional features. This is not commonly used but may be helpful if needing to compile on a platform that does not support these features, e.g. Solaris.
  • NOT_CRAN: If this variable is set to true, as the devtools package does, the build script will set LIBARROW_BINARY=true and LIBARROW_MINIMAL=false unless those environment variables are already set. This provides for a more complete and fast installation experience for users who already have NOT_CRAN=true as part of their workflow, without requiring additional environment variables to be set.
  • ARROW_R_DEV: If set to true, more verbose messaging will be printed in the build script. arrow::install_arrow(verbose = TRUE) sets this. This variable also is needed if you’re modifying C++ code in the package: see the developer guide article.
  • ARROW_USE_PKG_CONFIG: If set to false, the configure script won’t look for Arrow libraries on your system and instead will look to download/build them. Use this if you have a version mismatch between installed system libraries and the version of the R package you’re installing.
  • LIBARROW_DEBUG_DIR: If the C++ library building from source fails (cmake), there may be messages telling you to check some log file in the build directory. However, when the library is built during R package installation, that location is in a temp directory that is already deleted. To capture those logs, set this variable to an absolute (not relative) path and the log files will be copied there. The directory will be created if it does not exist.
  • CMAKE: When building the C++ library from source, you can specify a /path/to/cmake to use a different version than whatever is found on the $PATH.
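As an illustration of the LIBARROW_DEBUG_DIR requirement, the sketch below builds an absolute path under the home directory and enables verbose output at the same time. The directory name arrow-build-logs is only an example, and the install call is left commented out.

```r
# Keep source-build logs: LIBARROW_DEBUG_DIR must be an absolute path.
# The directory name here is only an example; it need not exist yet.
debug_dir <- file.path(normalizePath("~"), "arrow-build-logs")
Sys.setenv(LIBARROW_DEBUG_DIR = debug_dir, ARROW_R_DEV = "true")
# install.packages("arrow")  # uncomment to run the build with these settings
```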

Using install_arrow()

The previous instructions are useful for a fresh arrow installation, but arrow provides the function install_arrow(). There are three common use cases for this function:

  • You have arrow installed and want to upgrade to a different version
  • You want to try to reinstall and fix issues with Linux C++ binaries
  • You want to install a development build

Examples of using install_arrow() are shown below:

install_arrow()               # latest release
install_arrow(nightly = TRUE) # install development version
install_arrow(verbose = TRUE) # verbose output to debug install errors

Although this function is part of the arrow package, it is also available as a standalone script, so you can access it without first installing the package:

source("https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R")

Notes:

  • install_arrow() does not require environment variables to be set in order to satisfy C++ dependencies.
  • unlike packages like tensorflow, blogdown, and others that require external dependencies, you do not need to run install_arrow() after a successful arrow installation.

Offline installation

The install-arrow.R file mentioned in the previous section includes a function called create_package_with_all_dependencies(). Normally, when installing on a computer with internet access, the build process will download third-party dependencies as needed. This function provides a way to download them in advance, which can be useful when installing Arrow on a computer without internet access. The process is as follows:

Step 1. Using a computer with internet access, download dependencies:

Step 2. On the computer without internet access, install the prepared package:

  • Install the arrow package from the copied file:

    install.packages(
      "my_arrow_pkg.tar.gz",
      dependencies = c("Depends", "Imports", "LinkingTo")
    )

    This installation will build from source, so cmake must be available

  • Run arrow_info() to check installed capabilities
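The two steps can be sketched as a pair of wrapper functions. This is an illustration under stated assumptions, not the documented procedure: the wrapper names are hypothetical, and the sketch assumes create_package_with_all_dependencies() takes the destination file as its first argument. Nothing is downloaded or installed until a wrapper is called.

```r
# Step 1 (online machine): build a source bundle that includes third-party
# dependencies. Wrapped in a function so nothing is downloaded until called.
# Assumption: the destination file is the first argument.
make_offline_bundle <- function(dest_file = "my_arrow_pkg.tar.gz") {
  source("https://raw.githubusercontent.com/apache/arrow/main/r/R/install-arrow.R")
  create_package_with_all_dependencies(dest_file)
}

# Step 2 (offline machine): install the copied bundle, then inspect the result.
install_offline_bundle <- function(pkg_file = "my_arrow_pkg.tar.gz") {
  install.packages(pkg_file, dependencies = c("Depends", "Imports", "LinkingTo"))
  arrow::arrow_info()
}
```

Between the two steps, the bundle file has to be copied to the offline computer by whatever means are available.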

Notes:

  • arrow can be installed on a computer without internet access without using this function, but many useful features will be disabled, as they depend on third-party components. More precisely, arrow::arrow_info()$capabilities() will be FALSE for every capability.

  • If you are using binary packages you shouldn’t need to use this function. You can download the appropriate binary from your package repository, transfer that to the offline computer, and install that.

  • If you’re using RStudio Package Manager on Linux (RSPM), and you want to make a source bundle with this function, make sure to set the first repository in options("repos") to be a mirror that contains source packages. That is, the repository needs to be something other than the RSPM binary mirror URLs.

Offline installation (alternative)

A second method for offline installation is a little more hands-on. Follow these steps if you wish to try it:

  • Download the dependency files (cpp/thirdparty/download_dependencies.sh may be helpful)
  • Copy the directory of dependencies to the offline computer
  • Create the environment variable ARROW_THIRDPARTY_DEPENDENCY_DIR on the offline computer, pointing to the copied directory.
  • Install the arrow package as usual.
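The last two steps above can be sketched from within R. The helper use_thirdparty_dir() is hypothetical and not part of arrow; it only checks that the copied directory exists before pointing the build at it, and the path in the commented call is a placeholder.

```r
# Hypothetical helper (not part of arrow): point the build at a directory of
# pre-downloaded dependency files, failing early if the directory is missing.
use_thirdparty_dir <- function(dir) {
  if (!dir.exists(dir)) stop("Dependency directory not found: ", dir)
  Sys.setenv(ARROW_THIRDPARTY_DEPENDENCY_DIR = normalizePath(dir))
  invisible(normalizePath(dir))
}

# use_thirdparty_dir("/path/to/copied/dependencies")  # placeholder path
# install.packages("arrow")
```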

For offline installation using libarrow binaries, see the section “R source package with libarrow binary” above.

Troubleshooting

The intent is that install.packages("arrow") will just work and handle all C++ dependencies, but depending on your system, you may have better results if you tune one of several parameters. Here are some known complications and ways to address them.

Package failed to build C++ dependencies

If you see a message like

------------------------- NOTE ---------------------------
There was an issue preparing the Arrow C++ libraries.
See https://arrow.apache.org/docs/r/articles/install.html
---------------------------------------------------------

in the output when the package fails to install, that means that installation failed to retrieve or build the libarrow version compatible with the current version of the R package.

Please check the “Known installation issues” below to see if any apply, and if none apply, set the environment variable ARROW_R_DEV=TRUE for more verbose output and try installing again. Then, please report an issue and include the full installation output.

Using system libraries

If a system library or other installed Arrow is found but it doesn’t match the R package version (for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), it is likely that the R bindings will fail to compile. Because the Apache Arrow project is under active development, it is essential that the versions of libarrow and the R package match. When install.packages("arrow") has to download libarrow, the install script ensures that you fetch the libarrow version that corresponds to your R package version. However, if you are using a version of libarrow already on your system, a version match isn’t guaranteed.

To fix a version mismatch, you can either update your libarrow system packages to match the R package version, or set the environment variable ARROW_USE_PKG_CONFIG=FALSE to tell the configure script not to look for a system version of libarrow. (The latter is the default of install_arrow().) System libarrow versions are available corresponding to all CRAN releases but not for nightly or dev versions, so depending on the R package version you’re installing, a system libarrow version may not be an option.
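The decision described above can be sketched as a simple version comparison. The helper versions_match() is hypothetical and not part of arrow, and the two version strings are the illustrative ones from the example in this section; in practice you would supply the versions you actually have.

```r
# Hypothetical helper (not part of arrow): are two version strings identical
# when compared as versions?
versions_match <- function(r_pkg_version, libarrow_version) {
  numeric_version(r_pkg_version) == numeric_version(libarrow_version)
}

# Illustrative values from the example above: R package 2.0.0, libarrow 1.0.0.
if (!versions_match("2.0.0", "1.0.0")) {
  # Tell the configure script not to look for a system libarrow.
  Sys.setenv(ARROW_USE_PKG_CONFIG = "false")
}
```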

Note also that once you have a working R package installation based on system (shared) libraries, if you update your system libarrow installation, you’ll need to reinstall the R package to match its version. Similarly, if you’re using libarrow system libraries, running update.packages() after a new release of the arrow package will likely fail unless you first update the libarrow system packages.

Using prebuilt binaries

If the R package finds and downloads a prebuilt binary of libarrow, but then the arrow package can’t be loaded, perhaps with “undefined symbols” errors, please report an issue. This is likely a compiler mismatch and may be resolvable by setting some environment variables to instruct R to compile the packages to match libarrow.

A workaround would be to set the environment variable LIBARROW_BINARY=FALSE and retry installation: this value instructs the package to build libarrow from source instead of downloading the prebuilt binary. That should guarantee that the compiler settings match.

If a prebuilt libarrow binary wasn’t found for your operating system but you think it should have been, please report an issue and share the console output. You may also set the environment variable ARROW_R_DEV=TRUE for additional debug messages.

Building libarrow from source

If building libarrow from source fails, check the error message. (If you don’t see an error message, only the ----- NOTE -----, set the environment variable ARROW_R_DEV=TRUE to increase verbosity and retry installation.) The install script should work everywhere, so if libarrow fails to compile, please report an issue so that we can improve the script.

Contributing

We are constantly working to make the installation process as painless as possible. If you find ways to improve the process, please report an issue so that we can document it. Similarly, if you find that your Linux distribution or version is not supported, we would welcome the contribution of Docker images (hosted on Docker Hub) that we can use in our continuous integration and hopefully improve our coverage. If you do contribute a Docker image, it should be as minimal as possible, containing only R and the dependencies it requires. For reference, see the images that R-hub uses.

You can test the arrow R package installation using the docker compose setup included in the apache/arrow git repository. For example,

R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker compose build r
R_ORG=rhub R_IMAGE=ubuntu-release R_TAG=latest docker compose run r

installs the arrow R package, including libarrow, on the rhub/ubuntu-release image.

Further reading

  • To learn about installing development versions, see the article on installing nightly builds.
  • If you’re contributing to the Arrow project, see the Arrow R developers guide for resources to help you set up your development environment.
  • Arrow developers may also wish to read a more detailed discussion of the code run during the installation process, described in the install details article.
