The goal of theRAthena package is to provide aDBI-compliant interface to Amazon’s Athena (https://aws.amazon.com/athena/) usingBoto3SDK. This allows for an efficient, easy setup connection to Athena usingtheBoto3 SDK as a driver.
NOTE:Before usingRAthena you musthave an aws account or have access to aws account with permissionsallowing you to use Athena.
Before installingRAthena ensure thatPython 3+ is installed onto your machine:https://www.python.org/downloads/. To installBoto3 either it can be installed via pip command or usingRAthena installation function:
pip install boto3>=1.14.0RAthena Method (afterRAthena has been installed thismethod can be used)
RAthena::install_boto()To installRAthena you can get it from CRAN with:
install.packages("RAthena")Or to get the development version from Github with:
remotes::install_github("dyfanjones/rathena")The most basic way to connect to AWS Athena is to hard-code youraccess key and secret access key. However this method isnot recommended as your credentials are hard-coded.
library(DBI)con<-dbConnect(RAthena::athena(),aws_access_key_id='YOUR_ACCESS_KEY_ID',aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',s3_staging_dir='s3://path/to/query/bucket/',region_name='eu-west-1')The next method is to use profile names set up by AWS CLI or createdmanually in the~/.aws directory. To create the profilenames manually please refer to:https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html.
RAthena is compatible with AWS CLI. This allows your aws credentialsto be stored and not be hard coded in your connection.
To install AWS CLI please refer to:https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html,to configure AWS CLI please refer to:https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Once AWS CLI has been set up you will be able to connect to Athena byonly putting thes3_staging_dir.
Using default profile name:
library(DBI)con<-dbConnect(RAthena::athena(),s3_staging_dir ='s3://path/to/query/bucket/')Connecting to Athena using profile name other thandefault.
library(DBI)con<-dbConnect(RAthena::athena(),profile_name ="your_profile",s3_staging_dir ='s3://path/to/query/bucket/')library(RAthena)get_session_token("YOUR_PROFILE_NAME",serial_number='arn:aws:iam::123456789012:mfa/user',token_code ="531602",set_env =TRUE)# Connect to Athena using temporary credentialscon<-dbConnect(athena(),s3_staging_dir ='s3://path/to/query/bucket/')Another method in connecting to Athena is to use Amazon Resource Name(ARN) role.
Setting credentials in environmental variables:
library(RAthena)assume_role(profile_name ="YOUR_PROFILE_NAME",role_arn ="arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",set_env =TRUE)# Connect to Athena using temporary credentialscon<-dbConnect(athena(),s3_staging_dir ='s3://path/to/query/bucket/')Connecting to Athena directly using ARN role:
library(DBI) con<-dbConnect(athena(),profile_name ="YOUR_PROFILE_NAME",role_arn ="arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",s3_staging_dir ='s3://path/to/query/bucket/')To change the duration of ARN role session please change theparameterduration_seconds. By defaultduration_seconds is set to 3600 seconds (1 hour).
Connect to athena, and send a query and return results back to R.
library(DBI)con<-dbConnect(RAthena::athena(),aws_access_key_id='YOUR_ACCESS_KEY_ID',aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',s3_staging_dir='s3://path/to/query/bucket/',region_name='eu-west-1')res<-dbExecute(con,"SELECT * FROM one_row")dbFetch(res)dbClearResult(res)To retrieve query in 1 step.
dbGetQuery(con,"SELECT * FROM one_row")To create a tables in athena,dbExecute will send thequery to athena and wait until query has been executed. This makes itand idea method to create tables within athena.
query<-"CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string, number string, processId string, browserCookie string, requestEndTime string, timers struct<modelLookup:string, requestTime:string>, threadId string, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;"dbExecute(con, query)RAthena has 2 extra function to return extra information aroundAthena tables:dbGetParitiions anddbShow
dbGetPartitions will return all the partitions (returnsdata.frame):
RAthena::dbGetPartition(con,"impressions")dbShow will return the table’s ddl, so you will able tosee how the table was constructed in Athena (returns SQL character):
RAthena::dbShow(con,"impressions")library(DBI)con<-dbConnect(RAthena::athena(),s3_staging_dir ='s3://path/to/query/bucket/')RAthena has created a method to send data.frame from R to Athena.
# Check existing tablesdbListTables(con)# Upload iris to AthenadbWriteTable(con,"iris", iris,partition=c("TIMESTAMP"=format(Sys.Date(),"%Y%m%d")))# Read in iris from AthenadbReadTable(con,"iris")# Check new existing tables in AthenadbListTables(con)# Check if iris exists in AthenadbExistsTable(con,"iris")Please check outRAthena method fordbWriteTablefor more information in how to upload data to AWS Athena and AWS S3.
For more information around how to get the most out of AWS Athenawhen uploading data please check out:Top10 Performance Tuning Tips for Amazon Athena
Creating a connection to Athena and query and already existing tableiris that was created in previous example.
library(DBI)library(dplyr)con<-dbConnect(RAthena::athena(),aws_access_key_id='YOUR_ACCESS_KEY_ID',aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',s3_staging_dir='s3://path/to/query/bucket/',region_name='eu-west-1')tbl(con,sql("SELECT * FROM iris"))# Source: SQL [?? x 5]# Database: Athena 1.9.210 [eu-west-1/default] sepal_length sepal_width petal_length petal_width species <dbl> <dbl> <dbl> <dbl> <chr> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # … with more rowsdplyr provides lazy querying with allows to short handtbl(con, sql("SELECT * FROM iris")) totbl(con, "iris"). For more information please look athttps://solutions.posit.co/connections/db/r-packages/dplyr/.
tbl(con,"iris")# Source: table<iris> [?? x 5]# Database: Athena 1.9.210 [eu-west-1/default] sepal_length sepal_width petal_length petal_width species <dbl> <dbl> <dbl> <dbl> <chr> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # … with more rowsQuerying Athena withprofile_name instead of hard codingaws_access_key_id andaws_secret_access_key.By usingprofile_name extra Meta Data is returned in thequery to give users extra information.
con<-dbConnect(RAthena::athena(),profile_name ="your_profile",s3_staging_dir='s3://path/to/query/bucket/')tbl(con,"iris"))%>%filter(petal_length<1.3)# Source: lazy query [?? x 5]# Database: Athena 1.9.210 [your_profile@eu-west-1/default] sepal_length sepal_width petal_length petal_width species <dbl> <dbl> <dbl> <dbl> <chr> 1 4.7 3.2 1.3 0.2 setosa 2 4.3 3 1.1 0.1 setosa 3 5.8 4 1.2 0.2 setosa 4 5.4 3.9 1.3 0.4 setosa 5 4.6 3.6 1 0.2 setosa 6 5 3.2 1.2 0.2 setosa 7 5.5 3.5 1.3 0.2 setosa 8 4.4 3 1.3 0.2 setosa 9 5 3.5 1.3 0.3 setosa 10 4.5 2.3 1.3 0.3 setosa # … with more rowstbl(con,"iris")%>%select(contains("sepal"),contains("petal"))# Source: lazy query [?? x 4]# Database: Athena 1.9.210 [your_profile@eu-west-1/default] sepal_length sepal_width petal_length petal_width <dbl> <dbl> <dbl> <dbl> 1 5.1 3.5 1.4 0.2 2 4.9 3 1.4 0.2 3 4.7 3.2 1.3 0.2 4 4.6 3.1 1.5 0.2 5 5 3.6 1.4 0.2 6 5.4 3.9 1.7 0.4 7 4.6 3.4 1.4 0.3 8 5 3.4 1.5 0.2 9 4.4 2.9 1.4 0.210 4.9 3.1 1.5 0.1# … with more rowsUpload data usingdplyr functioncopy_toandcompute.
library(DBI)library(dplyr)con<-dbConnect(RAthena::athena(),profile_name ="your_profile",s3_staging_dir='s3://path/to/query/bucket/')Write data.frame to Athena table
copy_to(con, mtcars,s3_location ="s3://mybucket/data/")Write Athena table from tbl_sql
athena_mtcars<-tbl(con,"mtcars")mtcars_filter<- athena_mtcars%>%filter(gear>=4)Create athena with unique table name
mtcars_filer%>%compute()Create athena with specified name and s3 location
mtcars_filer%>%compute("mtcars_filer",s3_location ="s3://mybucket/mtcars_filer/")# Disconnect from AthenadbDisconnect(con)Creating work group:
library(RAthena)library(DBI)con<-dbConnect(RAthena::athena(),profile_name ="your_profile",encryption_option ="SSE_S3",s3_staging_dir='s3://path/to/query/bucket/')create_work_group(con,"demo_work_group",description ="This is a demo work group",tags =tag_options(key="demo_work_group",value ="demo_01"))List work groups:
list_work_groups(con)[[1]][[1]]$Name[1] "demo_work_group"[[1]]$State[1] "ENABLED"[[1]]$Description[1] "This is a demo work group"[[1]]$CreationTime2019-09-06 18:51:28.902000+01:00[[2]][[2]]$Name[1] "primary"[[2]]$State[1] "ENABLED"[[2]]$Description[1] ""[[2]]$CreationTime2019-08-22 16:14:47.902000+01:00Update work group:
update_work_group(con,"demo_work_group",description ="This is a demo work group update")Return work group meta data:
get_work_group(con,"demo_work_group")$Name[1] "demo_work_group"$State[1] "ENABLED"$Configuration$Configuration$ResultConfiguration$Configuration$ResultConfiguration$OutputLocation[1] "s3://path/to/query/bucket/"$Configuration$ResultConfiguration$EncryptionConfiguration$Configuration$ResultConfiguration$EncryptionConfiguration$EncryptionOption[1] "SSE_S3"$Configuration$EnforceWorkGroupConfiguration[1] FALSE$Configuration$PublishCloudWatchMetricsEnabled[1] FALSE$Configuration$BytesScannedCutoffPerQuery[1] 10000000$Configuration$RequesterPaysEnabled[1] FALSE$Description[1] "This is a demo work group update"$CreationTime2019-09-06 18:51:28.902000+01:00Connect to Athena using work group:
con<-dbConnect(RAthena::athena(),profile_name ="your_profile",work_group ="demo_work_group")Delete work group:
delete_work_group(con,"demo_work_group")pyAthena - A python wrapper of the python packageBoto3 using the sqlAlchemy framework:https://github.com/laughingman7743/PyAthenapyAthenaJDBC - A python interface into AWS Athena’sJDBC drivers:https://github.com/laughingman7743/PyAthenaJDBCAWR.Athena - A R wrapper of RJDBC for the AWS Athena’sJDBC drivers:https://github.com/nfultz/AWR.Athenanoctua - A R wrapper of the R AWS SDKpaws to develop aDBI interfacehttps://github.com/DyfanJones/noctuaawsathena - rJava Interface to AWS Athena SDKhttps://github.com/hrbrmstr/awsathenametis - Helpers for Accessing and Querying AmazonAthena using R, Including a lightweight RJDBC shimhttps://github.com/hrbrmstr/metismetisjars - JARs formetishttps://github.com/hrbrmstr/metis-jarsmetis.tidy - Access and Query Amazon Athena via theTidyversehttps://github.com/hrbrmstr/metis-tidyawsathena andmetis family of packages arecurrently used in production every day to analyze petabytes of internetscan and honeypot data.
The reason whyRAthena stands slightly apart fromAWR.Athena is thatAWR.Athena uses the AthenaJDBC drivers andRAthena uses the Python AWS SDKBoto3. The ultimate goal is to provide an extra method forR users to interface with AWS Athena. AspyAthena is themost similar project, this project has used an appropriate name toreflect this …RAthena.