Working with data stored in cloud storage systems like Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS) is a very common task. Because of this, the Arrow C++ library provides a toolkit aimed at making it as simple to work with cloud storage as it is to work with the local filesystem.
To make this work, the Arrow C++ library contains a general-purpose interface for file systems, and the arrow package exposes this interface to R users. For instance, if you want to, you can create a LocalFileSystem object that allows you to interact with the local file system in the usual ways: copying, moving, and deleting files, obtaining information about files and folders, and so on (see help("FileSystem", package = "arrow") for details). In general you probably don't need this functionality because you already have tools for working with your local file system, but this interface becomes much more useful in the context of remote file systems. Currently there is a specific implementation for Amazon S3 provided by the S3FileSystem class, and another one for Google Cloud Storage provided by GcsFileSystem.
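As an illustration of that shared interface, here is a minimal sketch using LocalFileSystem (the directory name is a hypothetical placeholder, and the methods shown are only a small subset of those documented in help("FileSystem", package = "arrow")):

library(arrow)

fs <- LocalFileSystem$create()
fs$CreateDir("example-dir")    # create a directory
fs$GetFileInfo("example-dir")  # inspect it
fs$DeleteDir("example-dir")    # remove it again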
This article provides an overview of working with both S3 and GCS data using the Arrow toolkit.
S3 and GCS support on Linux
Before you start, make sure that your arrow install has support for S3 and/or GCS enabled. For most users this will be true by default, because the Windows and macOS binary packages hosted on CRAN include S3 and GCS support. You can check whether support is enabled via helper functions:
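For example, a minimal check using the arrow_with_s3() and arrow_with_gcs() helpers:

library(arrow)

arrow_with_s3()   # TRUE if S3 support is compiled in
arrow_with_gcs()  # TRUE if GCS support is compiled in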
If these return TRUE then the relevant support is enabled.
In some cases you may find that your system does not have support enabled. The most common case for this occurs on Linux when installing arrow from source. In this situation S3 and GCS support is not always enabled by default, and there are additional system requirements involved. See the installation article for details on how to resolve this.
Connecting to cloud storage
One way of working with filesystems is to create ?FileSystem objects. ?S3FileSystem objects can be created with the s3_bucket() function, which automatically detects the bucket's AWS region. Similarly, ?GcsFileSystem objects can be created with the gs_bucket() function. The resulting FileSystem will consider paths relative to the bucket's path (so, for example, you don't need to prefix the bucket path when listing a directory).
With a FileSystem object, you can point to specific files in it with the $path() method and pass the result to file readers and writers (read_parquet(), write_feather(), et al.).
In real-world analysis, the reason users work with cloud storage is often to access large data sets. An example of this is discussed in the datasets article, but new users may prefer to work with a much smaller data set while learning how the arrow cloud storage interface works. To that end, the examples in this article rely on a multi-file Parquet dataset that stores a copy of the diamonds data made available through the ggplot2 package, documented in help("diamonds", package = "ggplot2"). The cloud storage version of this data set consists of 5 Parquet files totaling less than 1MB in size.
The diamonds data set is hosted on both S3 and GCS, in a bucket named voltrondata-labs-datasets. To create an S3FileSystem object that refers to that bucket, use the following command:
bucket <- s3_bucket("voltrondata-labs-datasets")

To do this for the GCS version of the data, the command is as follows:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)

Note that anonymous = TRUE is required for GCS if credentials have not been configured.
Within this bucket there is a folder called diamonds. We can call bucket$ls("diamonds") to list the files stored in this folder, or bucket$ls("diamonds", recursive = TRUE) to recursively search subfolders. Note that on GCS, you should always set recursive = TRUE because directories often don't appear in the results.
Here's what we get when we list the files stored in the GCS bucket:
bucket$ls("diamonds", recursive=TRUE)## [1] "diamonds/cut=Fair/part-0.parquet"## [2] "diamonds/cut=Good/part-0.parquet"## [3] "diamonds/cut=Ideal/part-0.parquet"## [4] "diamonds/cut=Premium/part-0.parquet"## [5] "diamonds/cut=Very Good/part-0.parquet"There are 5 Parquet files here, one corresponding to each of the“cut” categories in thediamonds data set. We can specifythe path to a specific file by callingbucket$path():
parquet_good <- bucket$path("diamonds/cut=Good/part-0.parquet")

We can use read_parquet() to read from this path directly into R:
diamonds_good <- read_parquet(parquet_good)
diamonds_good

## # A tibble: 4,906 × 9
##    carat color clarity depth table price     x     y     z
##    <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
##  2  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
##  3  0.3  J     SI1      64      55   339  4.25  4.28  2.73
##  4  0.3  J     SI1      63.4    54   351  4.23  4.29  2.7
##  5  0.3  J     SI1      63.8    56   351  4.23  4.26  2.71
##  6  0.3  I     SI2      63.3    56   351  4.26  4.3   2.71
##  7  0.23 F     VS1      58.2    59   402  4.06  4.08  2.37
##  8  0.23 E     VS1      64.1    59   402  3.83  3.85  2.46
##  9  0.31 H     SI1      64      54   402  4.29  4.31  2.75
## 10  0.26 D     VS2      65.2    56   403  3.99  4.02  2.61
## # … with 4,896 more rows
## # ℹ Use `print(n = ...)` to see more rows

Note that this will be slower to read than if the file were local.
Connecting directly with a URI
In most use cases, the easiest and most natural way to connect to cloud storage in arrow is to use the FileSystem objects returned by s3_bucket() and gs_bucket(), especially when multiple file operations are required. However, in some cases you may want to download a file directly by specifying the URI. This is permitted by arrow, and functions like read_parquet(), write_feather(), open_dataset(), etc. will all accept URIs to cloud resources hosted on S3 or GCS. The format of an S3 URI is as follows:
s3://[access_key:secret_key@]bucket/path[?region=]

For GCS, the URI format looks like this:
gs://[access_key:secret_key@]bucket/path
gs://anonymous@bucket/path

For example, the Parquet file storing the "good cut" diamonds that we downloaded earlier in the article is available on both S3 and GCS. The relevant URIs are as follows:
uri<-"s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"uri<-"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"Note that “anonymous” is required on GCS for public buckets.Regardless of which version you use, you can pass this URI toread_parquet() as if the file were stored locally:
df <- read_parquet(uri)

URIs accept additional options in the query parameters (the part after the ?) that are passed down to configure the underlying file system. The options are separated by &. For example,
s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true

is equivalent to:
bucket <- S3FileSystem$create(
  endpoint_override = "https://storage.googleapis.com",
  allow_bucket_creation = TRUE
)
bucket$path("voltrondata-labs-datasets/")

Both tell the S3FileSystem object that it should allow the creation of new buckets and that it should talk to Google Cloud Storage instead of S3. The latter works because GCS implements an S3-compatible API – see File systems that emulate S3 below – but if you want better support for GCS you should use a GcsFileSystem, via a URI that starts with gs://.
Also note that parameters in the URI need to be percent-encoded, which is why :// is written as %3A%2F%2F.
For S3, only the following options can be included in the URI as query parameters: region, scheme, endpoint_override, access_key, secret_key, allow_bucket_creation, allow_bucket_deletion, and check_directory_existence_before_creation. For GCS, the supported parameters are scheme, endpoint_override, and retry_limit_seconds.
In GCS, a useful option is retry_limit_seconds, which sets the number of seconds a request may spend retrying before returning an error. The current default is 15 minutes, so in many interactive contexts it's nice to set a lower value:
gs://anonymous@voltrondata-labs-datasets/diamonds/?retry_limit_seconds=10

Authentication
S3 Authentication
To access private S3 buckets, you typically need two secret parameters: an access_key, which is like a user id, and a secret_key, which is like a token or password. There are a few options for passing these credentials:
- Include them in the URI, like s3://access_key:secret_key@bucket-name/path/to/file. Be sure to URL-encode your secrets if they contain special characters like "/" (e.g., URLencode("123/456", reserved = TRUE)).
- Pass them as access_key and secret_key to S3FileSystem$create() or s3_bucket() (see the sketch after this list).
- Set them as environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively.
- Define them in a ~/.aws/credentials file, according to the AWS documentation.
- Use an AccessRole for temporary access by passing the role_arn identifier to S3FileSystem$create() or s3_bucket().
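For example, a minimal sketch of the second option (the bucket name and the environment variable names used to hold the secrets are hypothetical placeholders):

bucket <- s3_bucket(
  "my-private-bucket",                       # hypothetical bucket name
  access_key = Sys.getenv("MY_ACCESS_KEY"),  # read secrets from wherever you store them
  secret_key = Sys.getenv("MY_SECRET_KEY")
)
df <- read_parquet(bucket$path("path/to/file.parquet"))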
GCS Authentication
The simplest way to authenticate with GCS is to run the gcloud command to set up application default credentials:
gcloud auth application-default login

To manually configure credentials, you can pass either access_token and expiration, for using temporary tokens generated elsewhere, or json_credentials, to reference a downloaded credentials file.
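For example, a hedged sketch of the json_credentials route (the bucket name and file path are hypothetical placeholders; depending on your arrow version, json_credentials may expect the contents of the downloaded file rather than its path):

creds <- paste(readLines("service-account.json"), collapse = "\n")  # hypothetical credentials file
bucket <- gs_bucket("my-private-bucket", json_credentials = creds)  # hypothetical bucket name
df <- read_parquet(bucket$path("path/to/file.parquet"))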
If you haven't configured credentials, then to access public buckets, you must pass anonymous = TRUE or anonymous as the user in a URI:
bucket<-gs_bucket("voltrondata-labs-datasets", anonymous=TRUE)fs<-GcsFileSystem$create(anonymous=TRUE)df<-read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")Using a proxy server
If you need to use a proxy server to connect to an S3 bucket, you can provide a URI in the form http://user:password@host:port to proxy_options. For example, a local proxy server running on port 1316 can be used like this:
bucket <- s3_bucket(
  bucket = "voltrondata-labs-datasets",
  proxy_options = "http://localhost:1316"
)

File systems that emulate S3
The S3FileSystem machinery enables you to work with any file system that provides an S3-compatible interface. For example, MinIO is an object-storage server that emulates the S3 API. If you were to run minio server locally with its default settings, you could connect to it with arrow using S3FileSystem like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)

or, as a URI, it would be
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000

(Note the URL escaping of the : in endpoint_override.)
Among other applications, this can be useful for testing out codelocally before running on a remote S3 bucket.
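For instance, here is a hedged sketch of that workflow; it assumes the minio file system created above and a bucket named test-bucket that already exists on the local MinIO server:

# write the data frame read earlier to the local MinIO server, then read it back
write_parquet(diamonds_good, minio$path("test-bucket/diamonds-good.parquet"))
read_parquet(minio$path("test-bucket/diamonds-good.parquet"))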
Disabling environment variables
As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:
Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name

You can unset these environment variables using Sys.unsetenv(), for example:
Sys.unsetenv("AWS_DEFAULT_REGION")Sys.unsetenv("AWS_S3_ENDPOINT")By default, the AWS SDK tries to retrieve metadata about userconfiguration, which can cause conflicts when passing in connectiondetails via URI (for example when accessing a MINIO bucket). To disablethe use of AWS environment variables, you can set environment variableAWS_EC2_METADATA_DISABLED toTRUE.
Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)

Further reading
- To learn more about FileSystem classes, including S3FileSystem and GcsFileSystem, see help("FileSystem", package = "arrow").
- To see a data analysis example that relies on data hosted on cloud storage, see the dataset article.