- `EXPLAIN (TYPE validate)` (noctua #225). Thanks to @tyner for raising the issue.
- `dbSendQuery` (noctua #223). Thanks to @tyner for raising the issue.
- `RAthena_options` (noctua #226).
- `SelectedEngineVersion` in `update_work_group` (noctua #224). Thanks to @tyner for raising the issue.
- `dbExistsTable`: catch updated AWS error message.
- `dbplyr` 2.3.3.9000+.
- `AWS_ROLE_ARN`: this caused confusion when connecting through web identity (#177).
- `dbplyr::in_catalog` when working with `dplyr::tbl` (#178).
- Query information could report a negative data-scanned value, e.g. `INFO: (Data scanned: -43839744 Bytes)`.
- New `clear_s3_resource` parameter in `RAthena_options` to prevent the AWS Athena output S3 resource from being cleared up by `dbClearResult` (#168). Thanks to @juhoautio for the request.
- `boto3.session.Session` class and `client` method (#169).
- New `endpoint_override` parameter allows the default endpoint for each service to be overridden accordingly (#169). Thanks to @aoyh for the request and for checking the package in development.
- `test_data` to use the `size` parameter explicitly.
- `RAthena_options` can change one parameter at a time without affecting other pre-configured settings.
- New `retry_quiet` parameter in the `RAthena_options` function.
- `dbplyr` 2.0.0 backend API.
- `dplyr` to benefit from AWS Athena `unload` methods (noctua #174).
- `dbGetQuery`, `dbExecute`, `dbSendQuery`, `dbSendStatement` work on older versions of R (noctua #170).
- AWS Athena `UNLOAD` (noctua #160). This is to take advantage of the read/write speed `parquet` has to offer.

```python
import awswrangler as wr
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

# Read 10 files from the 1890 decade (~1GB)
df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa")

wr.catalog.table(database="awswrangler_test", table="noaa")
```

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# Query ran using CSV output
system.time({
  df <- dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  57.004   8.430 160.567

RAthena::RAthena_options(cache_size = 1)

# Query ran using UNLOAD Parquet output
system.time({
  df <- dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = TRUE)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  21.622   2.350  39.232

# Query ran using cache
system.time({
  df <- dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = TRUE)
})
# Info: (Data scanned: 80.88 MB)
#    user  system elapsed
#  13.738   1.886  11.029
```

- `sql_translate_env` correctly translates the R functions `quantile` and `median` to their AWS Athena equivalents (noctua #153).
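As a rough illustration of the translation above, `dbplyr::translate_sql` can be used to inspect the SQL that `quantile` and `median` now generate. This is a minimal sketch: the column name is arbitrary and the exact Athena SQL emitted (for example an `approx_percentile` call) depends on the backend version, so no output is shown.

```r
library(DBI)
library(dbplyr)

con <- dbConnect(RAthena::athena())

# inspect the SQL the Athena backend now generates for these R functions;
# both are expected to map to Athena/Presto percentile-style aggregates
dbplyr::translate_sql(median(price, na.rm = TRUE), con = con)
dbplyr::translate_sql(quantile(price, 0.25, na.rm = TRUE), con = con)
```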
- AWS Athena `timestamp with time zone` data type.
- R `list` class when converting data to AWS Athena SQL format:

```r
library(data.table)
library(DBI)

x <- 5

dt <- data.table(
  var1 = sample(LETTERS, size = x, replace = TRUE),
  var2 = rep(list(list("var3" = 1:3, "var4" = list("var5" = letters[1:5]))), x)
)

con <- dbConnect(RAthena::athena())

#> Version: 2.2.0
sqlData(con, dt)
# Registered S3 method overwritten by 'jsonify':
#   method     from
#   print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                                   var2
# 1:    1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2:    2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3:    3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4:    4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5:    5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}

#> Version: 2.1.0
sqlData(con, dt)
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
#    var1                                        var2
# 1:    1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2:    2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3:    3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4:    4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5:    5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
```

v-2.2.0 now converts lists into JSON lines format so that AWS Athena can parse them with its SQL array/map/JSON functions. A small downside is that an S3 method conflict occurs when `jsonify` is called to convert lists into JSON lines. `jsonify` was chosen in favour of `jsonlite` due to its performance improvements (noctua #156).
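A minimal sketch of why the JSON lines representation is useful: once a list column has been written as a JSON string, Athena's JSON functions can unpack it in SQL. The table name below is illustrative and reuses the `dt` object from the example above.

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# write the data.table from the example above; the list column `var2`
# is stored as a JSON string in the resulting Athena table
dbWriteTable(con, "list_example", dt, overwrite = TRUE)

# Athena/Presto JSON functions can then parse the stored JSON server-side
dbGetQuery(con, "
  SELECT var1,
         json_extract(var2, '$.var4.var5') AS var5
  FROM list_example
")
```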
- `dbIsValid` wrongly stated the connection was valid for the result class when the connection class had been disconnected.
- `sql_translate_env`: `paste` translation broke with the latest version of `dbplyr`. The new method is compatible with `dbplyr` >= 1.4.3 (noctua #149).
- `sql_translate_env`: add support for `stringr`/`lubridate` style functions, similar to the Postgres backend.
- `dbConnect`: add `timezone` parameter so that the time zone between R and AWS Athena is consistent (noctua #149).
- `AthenaConnection` class: `ptr` and `info` slots changed from `list` to `environment` within the `AthenaConnection` class. This allows the class to be updated by reference and simplifies notation when viewing the class from the RStudio environment tab.
- `AthenaResult` class: `info` slot changed from `list` to `environment`. This allows the class to be updated by reference.

By utilising environments for `AthenaConnection` and `AthenaResult`, all `AthenaResult` objects created from an `AthenaConnection` will point to the same `ptr` and `info` environments for their connection. Previously `ptr` and `info` would be copied, meaning a modification would not affect the child or parent class, for example:
```r
# Old Method
library(DBI)

con <- dbConnect(RAthena::athena(), rstudio_conn_tab = FALSE)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# nothing happened
res@connection@info$made_up
# > NULL

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# nothing happened
con@info$made_up
# > "helloworld"

# New Method
library(DBI)

con <- dbConnect(RAthena::athena(), rstudio_conn_tab = FALSE)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# picked up change
res@connection@info$made_up
# > "helloworld"

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# picked up change
con@info$made_up
# > "oh no!"
```

- AWS Athena data types [`array`, `row`, `map`, `json`, `binary`, `ipaddress`] (noctua #135). Conversion types can be changed through `dbConnect` and `RAthena_options`:

```r
library(DBI)
library(RAthena)

# default conversion methods
con <- dbConnect(RAthena::athena())

# change json conversion method
RAthena_options(json = "character")
RAthena:::athena_option_env$json
# [1] "character"

# change json conversion to custom method
RAthena_options(json = jsonify::from_json)
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

# change bigint conversion without affecting custom json conversion methods
RAthena_options(bigint = "numeric")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

RAthena:::athena_option_env$bigint
# [1] "numeric"

# change binary conversion without affecting bigint or json methods
RAthena_options(binary = "character")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
#     json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>

RAthena:::athena_option_env$bigint
# [1] "numeric"

RAthena:::athena_option_env$binary
# [1] "character"

# no conversion for json objects
con2 <- dbConnect(RAthena::athena(), json = "character")

# use custom json parser
con <- dbConnect(RAthena::athena(), json = jsonify::from_json)
```

- `rstudio_conn_tab` within `dbConnect`.
- AWS Athena uses the `float` data type for DDL only, and `RAthena` was wrongly parsing the `float` data type back to R. AWS Athena instead uses the data type `real` in SQL functions like `select cast` (https://docs.aws.amazon.com/athena/latest/ug/data-types.html). `RAthena` now correctly parses `real` to R's data type `double` (noctua #133).
- Iterate over AWS responses to get all results from the AWS Glue catalogue (noctua #137).
- `dbGetPartition` gains a `.format` parameter. This simply tidies up the default AWS Athena partition format:

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

dbGetPartition(con, "test_df2", .format = TRUE)
# Info: (Data scanned: 0 Bytes)
#    year month day
# 1: 2020    11  17

dbGetPartition(con, "test_df2")
# Info: (Data scanned: 0 Bytes)
#                   partition
# 1: year=2020/month=11/day=17
```

- `dbConnect` gains a `bigint` parameter, to align with other DBI interfaces such as `RPostgres`. `bigint` can now be returned in the following formats: ["integer64", "integer", "numeric", "character"].

```r
library(DBI)

con <- dbConnect(RAthena::athena(), bigint = "numeric")
```

When switching between the different file parsers, `bigint` will be represented according to the file parser, i.e. `data.table`: "integer64" -> `vroom`: "I".
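A short sketch of the interplay described above, assuming the parser-switching syntax shown elsewhere in this changelog; the exact column class returned depends on the file parser and the `bigint` setting in use.

```r
library(DBI)
library(RAthena)

# with the data.table parser, bigint columns default to bit64::integer64
con <- dbConnect(RAthena::athena(), bigint = "integer64")

# switch to the vroom parser; the equivalent vroom column type is "I",
# so the same connection-level bigint choice is honoured by vroom instead
RAthena_options("vroom")

df <- dbGetQuery(con, "SELECT CAST(1 AS BIGINT) AS big_number")
class(df$big_number)
# expected: "integer64" (exact class depends on parser and bigint setting)
```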
- `dbRemoveTable`: check if the key has "." or ends with "/" before adding "/" to the end (noctua #125).
- `Error: write_parquet requires the arrow package, please install it first and try again` is raised when the `arrow` package is required but not installed.
- `sql_escape_date` added to the `dplyr_integration.R` backend (#121).
- Allow `RAthena` to append to a static AWS S3 location using `uuid`.
- `use_deprecated_int96_timestamps` set to `TRUE`. This puts the `POSIXct` data type into a `java.sql.Timestamp` compatible format, such as `yyyy-MM-dd HH:mm:ss[.f...]`. Thanks to Christian N Wolz for highlighting this issue.
- `s3_upload_location` simplified how the S3 location is built. The `s3.location` parameter is now not affected; instead only additional components, e.g. name, schema and partition, are appended.
- The `dbplyr` v-2.0.0 function `in_schema` now wraps strings in quotes, which breaks `db_query_fields.AthenaConnection`. `db_query_fields.AthenaConnection` now removes any quotation from the string so that it can search AWS Glue for table metadata (noctua #117).
- A new parameter, `keyboard_interrupt`, has been added to `dbConnect` to control whether AWS Athena keeps running after R has been interrupted. Example:

```r
# Stop AWS Athena when R has been interrupted:
con <- dbConnect(RAthena::athena())

# Let AWS Athena keep running when R has been interrupted:
con <- dbConnect(RAthena::athena(), keyboard_interrupt = FALSE)
```

- `RAthena` would return a `data.frame` for utility SQL queries regardless of the backend file parser. This is due to AWS Athena outputting SQL UTILITY queries as a text file that has to be read in line by line. Now `RAthena` returns the correct data format based on the file parser set in `RAthena_options`; for example, `RAthena_options("vroom")` will return tibbles.
- `dbClearResult` when the user doesn't have permission to delete AWS S3 objects (noctua #96).
- `RAthena_options` contains two new parameters to control how `RAthena` handles retries.
- `dbFetch` is able to return data from AWS Athena in chunks. This has been achieved by passing `NextToken` to the `AthenaResult` S4 class. This method won't be as fast as `n = -1`, as each chunk has to be processed into data frame format.

```r
library(DBI)

con <- dbConnect(RAthena::athena())

res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
```

- `dbWriteTable` opts to use `alter table` instead of the standard `msck repair table`. This improves performance when appending to tables with a high number of existing partitions.
- `dbWriteTable` now allows JSON to be appended to JSON DDLs created with the Openx-JsonSerDe library.
- `dbConvertTable` brings `dplyr::compute` functionality to the base package, allowing `RAthena` to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (#37).
- `dplyr::compute` updated to give the same functionality as `dbConvertTable`.
- The error message when `boto3` is not detected has been updated, as several users were not sure how to get `RAthena` set up:

```r
stop(
  "Boto3 is not detected please install boto3 using either: `pip install boto3 numpy` in terminal or `install_boto()`.",
  "\nIf this doesn't work please set the python you are using with `reticulate::use_python()` or `reticulate::use_condaenv()`",
  call. = FALSE
)
```

- `region_name` check before making a connection to AWS Athena (#110).
- `dbWriteTable` would throw a `throttling error` every now and again; `retry_api_call` has been built to handle the parsing of data between R and AWS S3.
- `dbWriteTable` did not clear down all metadata when uploading to AWS Athena.
- `dbWriteTable` added support for DDL structures for users who have created DDLs outside of `RAthena`.
- `RAthena` retry functionality.
- `\dontrun` (#108).
- To align with `pyathena`, `RAthena_options` now has a new parameter, `cache_size`. This implements local caching in R environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing S3's Athena output when caching isn't disabled.
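A minimal sketch of the caching behaviour described above; the query and cache size are illustrative.

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# keep up to 10 query executions in the local cache
RAthena_options(cache_size = 10)

# first run executes on AWS Athena
df1 <- dbGetQuery(con, "SELECT * FROM iris")

# an identical query can then reuse the cached query execution
# instead of re-running it on AWS Athena
df2 <- dbGetQuery(con, "SELECT * FROM iris")
```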
- `RAthena_options` now has a `clear_cache` parameter to clear down all cached data.
- `dbRemoveTable` now utilises AWS Glue to remove tables from the AWS Glue catalogue. This is a performance enhancement:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = TRUE))
#   user  system elapsed
#  0.131   0.037   2.404

# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)

# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = TRUE))
#   user  system elapsed
#  0.065   0.009   1.303
```

- `dbWriteTable` now supports uploading JSON lines (http://jsonlines.org/) format up to AWS Athena (#88):

```r
library(DBI)

con <- dbConnect(RAthena::athena())

dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
```

- `dbWriteTable`: when appending to an existing table, the compressed file type was incorrectly returned.
- `install_boto` adds `numpy` to the `RAthena` environment install, as `reticulate` appears to favour environments with `numpy` (https://github.com/rstudio/reticulate/issues/216).
- The RStudio connection tab ran into an issue when the Glue table isn't stored correctly (#92).
- `AWS_REGION` in `dbConnect`.
- `fwrite` (>= 1.12.4): https://github.com/Rdatatable/data.table/blob/master/NEWS.md.
- `sql_translate_env` (#44):

```r
# Before
dbplyr::translate_sql("2019-01-01", con = con)
# '2019-01-01'

# Now
dbplyr::translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
```

- `paste`/`paste0` would use the default `dplyr` `sql_translate_env` (`concat_ws`). `paste0` now uses Presto's `concat` function and `paste` now uses pipes to get extra flexibility for custom separator values:

```r
# R code:
paste("hi", "bye", sep = "-")

# SQL translation:
# ('hi'||'-'||'bye')
```

- When `append` is set to `TRUE`, the existing `s3.location` will be utilised (#73).
- `db_compute` returned the table name, however when a user wished to write the table to another location (#74), an error would be raised: `Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist`. This has now been fixed, with `db_compute` returning `dbplyr::in_schema`:

```r
library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

tbl(con, "iris") %>%
  compute(name = "temp.iris")
```

- `dbListFields` didn't display partitioned columns. This has now been fixed: the call to AWS Glue has been altered to include more metadata, allowing column names and partitions to be returned.
- `dbListFields`
- `RAthena_options`
- `vroom` has been restricted to >= 1.2.0 due to `integer64` support and changes to the `vroom` API.
- `dbStatistics` is a wrapper around `boto3` `get_query_execution`, returning statistics for `RAthena::dbSendQuery` results (#67).
- `dbGetQuery` has a new parameter, `statistics`, to print out `dbStatistics` before returning Athena results (#67). See the sketch at the end of this list.
- `s3.location` now follows the new syntax `s3://bucket/{schema}/{table}/{partition}/{table_file}` to align with `pyathena` and to allow tables with the same name but in different schemas to be uploaded to S3 (#73).
- `dplyr::tbl` when calling Athena using the ident method (noctua #64):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

# ident method:
t1 <- system.time(tbl(con, "iris"))

# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))

# ident method
#   user  system elapsed
#  0.082   0.012   0.288

# sub query method
#   user  system elapsed
#  0.993   0.138   3.660
```

- `dplyr` `sql_translate_env`: expected results have now been updated to take into account the bug fix with date fields.
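As referenced in the `dbStatistics` entries above, a hedged sketch of how the query statistics helpers might be used; the structure of the returned statistics comes from `boto3`'s `get_query_execution` and may vary.

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# statistics for a result object, taken from boto3's get_query_execution
res <- dbSendQuery(con, "SELECT * FROM iris")
stats <- dbStatistics(res)
dbClearResult(res)

# or print the statistics as part of dbGetQuery
df <- dbGetQuery(con, "SELECT * FROM iris", statistics = TRUE)
```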
- The default delimited file parser has been changed from `data.table` to `vroom`. From now on it is possible to change the file parser using `RAthena_options`, for example:

```r
library(RAthena)
RAthena_options("vroom")
```

- `dbGetTables`, which returns the Athena hierarchy as a data.frame (see the sketch after this list).
- `vroom`
- Updated R documentation to `roxygen2` 7.0.2.
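A small sketch of `dbGetTables`; the exact columns of the returned data.frame (e.g. schema and table names) depend on the package version.

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# returns the Athena hierarchy (one row per table, with its schema)
# as a data.frame, without running a query per schema
tables <- dbGetTables(con)
head(tables)
```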
- `dbWriteTable`'s `append` parameter checks and uses the existing AWS Athena DDL file type. If `file.type` doesn't match the Athena DDL file type, the user will receive a warning message:

```r
warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type, '".', call. = FALSE)
```

- `tolower` conversion, due to request #41.
- `dbRemoveTable` can now remove the S3 files for the AWS Athena table being removed.
- `as.character` was getting wrongly translated (#45).
- `INTEGER` was being incorrectly translated in `sql_translate_env.R`.
- Data transfer.
- `dbRemoveTable`: new parameters are added in the unit test.
- `sql_translate_env` unit test updated to cater for the bug fix.
- `dbWriteTable` will now split `gzip` compressed files to improve AWS Athena performance. By default, `gzip` compressed files will be split into 20. Performance results:
```r
library(DBI)

X <- 1e8

df <- data.frame(
  w = runif(X),
  x = 1:X,
  y = sample(letters, X, replace = TRUE),
  z = sample(c(TRUE, FALSE), X, replace = TRUE)
)

con <- dbConnect(RAthena::athena())

# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = TRUE, max.batch = nrow(df), overwrite = TRUE)        # no splits
dbWriteTable(con, "test_split2", df, compress = TRUE, max.batch = 0.05 * nrow(df), overwrite = TRUE) # 20 splits
dbWriteTable(con, "test_split3", df, compress = TRUE, max.batch = 0.1 * nrow(df), overwrite = TRUE)  # 10 splits
```

AWS Athena performance results from the AWS console (query executed: `select count(*) from ....`):
```r
library(DBI)

X <- 1e8

df <- data.frame(
  w = runif(X),
  x = 1:X,
  y = sample(letters, X, replace = TRUE),
  z = sample(c(TRUE, FALSE), X, replace = TRUE)
)

con <- dbConnect(RAthena::athena())

# default will now split compressed file into 20 equal size files.
dbWriteTable(con, "test_split1", df, compress = TRUE, overwrite = TRUE)
```

Added an information message to inform the user about which files have been added to the S3 location if the user is overwriting an Athena table.
- The `copy_to` method now supports `compress` and `max_batch`, to align with `dbWriteTable`.
- `dbWriteTable`: `POSIXct` to Athena. This class was converted incorrectly and AWS Athena would return `NA` instead. `RAthena` will now correctly convert `POSIXct` to `timestamp` and will also correctly read `timestamp` back into `POSIXct`.
- `NA` in string format. Before, `RAthena` would return `NA` in string class as `""`; this has now been fixed.
- `RAthena` would translate output into a vector with the current method `dbFetch` `n = 0`.
- `sql_translate_env`: previously `RAthena` would take the default `dplyr::sql_translate_env`; `RAthena` now has a custom method that uses data types from https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html.
- `POSIXct` class has now been added to the data transfer unit test.
- `dplyr` `sql_translate_env`: tests that R functions are correctly translated into Athena SQL syntax.
- Bug when `dbWriteTable` is called. The bug is due to the function `sqlCreateTable`, which `dbWriteTable` calls; the parameters `table` and `fields` were set to `NULL`. This has now been fixed.
- The `s3.location` parameter in `dbWriteTable` can now be made nullable.
- The `sqlCreateTable` info message will now only inform the user if column names have changed, and display the column names that have changed.
- `upload_data` has been rebuilt: the old "horrible" `if` statement with `paste` has been removed and the function now relies on `sprintf` to construct the S3 location path. This method is a lot clearer in how the S3 location is created, and it enables `dbWriteTable` to be simplified.
- `dbWriteTable` can now upload data to the default `s3_staging` directory created in `dbConnect`. This simplifies `dbWriteTable` to:

```r
library(DBI)

con <- dbConnect(RAthena::athena())
dbWriteTable(con, "iris", iris)
```

- The `dbWriteTable` data transfer test now tests `compress` and the default `s3.location` when transferring data.
- `data.table::fread`. This enables data types to be read in correctly, without requiring a second stage to convert data types once the data has been read into R.
- `data.table::fread` and `data.table::fwrite` have replaced the disabled `utils` functions from the namespace: `write.table`, `read.csv`.
- `data.table` added to the namespace.
- `bigint` is converted into R's `bit64::integer64` and vice versa.
- `bigint` to `integer64` in the data transfer unit test.
- `dbConnect` method.
- `dbFetch` with chunk sizes between 0 and 999: fixed an error where the `for` loop would return an error instead of breaking.
- `py_error` function: set the `call.` parameter to `FALSE`.
- The `AthenaQuery` S4 class changed to `AthenaResult`.
- `dbFetch`: added data type collection.
- `dbFetch`: replaced the S3 search for the query key with the output location from Athena.
- `dbClearResult`: changed error handling to return the Python error as a warning, to warn when the user doesn't have permission to delete the S3 resource.
- `dbClearResult`: replaced the S3 search for the query key with the output location from Athena.
- `dbListTables` now returns a vector of tables from AWS Glue instead of using an AWS Athena query. This method increases the speed of the call.
- `dbListFields` now returns column names from AWS Glue instead of using an AWS Athena query. This method increases the speed of the call.
- `dbExistsTable` now returns a boolean from AWS Glue instead of using an AWS Athena query. This method increases the speed of the call.
- `create_work_group`: creates a workgroup with the specified name.
- `delete_work_group`: deletes the workgroup with the specified name.
- `list_work_group`: lists available workgroups for the account.
- `get_work_group`: returns information about the workgroup with the specified name.
- `update_work_group`: updates the workgroup with the specified name. The workgroup's name cannot be changed.
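A hedged sketch of how the lower-level workgroup helpers above might be combined; the argument names (e.g. `work_group`, `description`) are assumptions for illustration and may differ from the exported signatures.

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# create a workgroup, inspect it, then tidy up
# (argument names below are illustrative assumptions)
create_work_group(con, work_group = "rathena_demo", description = "demo work group")
list_work_group(con)
get_work_group(con, work_group = "rathena_demo")
delete_work_group(con, work_group = "rathena_demo")
```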
- `get_session_token` to create temporary session credentials.
- `assume_role` to assume an AWS ARN role.
- `dbConnect`
- `set_aws_env` to set AWS tokens as environment variables.
- `get_aws_env` to return expected results from system variables.
- `tag_options` to create tag options for `create_work_group`.
- `work_group_config` and `work_group_config_update` to create the workgroup configuration.
- `AthenaConnection`
- `dbColumnInfo` method: returns a data.frame containing `field_name` and `type` (see the sketch at the end of this list).
- `time_check` to check how long is left on the Athena connection; if less than 15 minutes, a warning message is output to notify the user.
- `db_collect` for better integration with `dplyr`.
- `db_save_query` for better integration with `dplyr`.
- `db_copy_to` for better integration with `dplyr`.
- `dbFetch`: Athena data type misalignment.
- `AthenaConnection`: `request` builds the Athena query request.
- `db_desc`
- `dbConnect`
- `stop_query_execution` added to `dbClearResult` if the query is still running.
- `dbWriteTable`
- `waiter` changed to `poll`, to align with Python's polling.
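A short sketch of `dbColumnInfo` on a result object, matching the description above (a data.frame of `field_name` and `type`); the query is illustrative.

```r
library(DBI)

con <- dbConnect(RAthena::athena())

res <- dbSendQuery(con, "SELECT * FROM iris")

# one row per column in the result: field_name and type
dbColumnInfo(res)

dbClearResult(res)
```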