The ltertools package
The goal of ltertools is to centralize the R functions created by members of the Long Term Ecological Research (LTER) community. Many of these functions likely have broad relevance that extends beyond the context of their creation, and this package is an attempt to share those tools and limit the amount of “re-inventing the wheel” that we each do in our own silos.
The conceptual theme of functions in ltertools is necessarily broad given the scope of the community we aim to serve. That said, the identity of this package will likely become clearer as we accrue contributed functions. This vignette describes the main functions of ltertools as they currently exist.
The LTER Network is hypothesis-driven with a focus on long term data from sites in the network. This results in data that may reasonably be compared but are potentially formatted quite differently, based on the logic of the investigators responsible for each dataset. Data harmonization (the process of resolving these formatting inconsistencies to facilitate combination/comparison across projects) is therefore a significant hurdle for many projects using LTER data. We suggest a “column key”-based approach that has the potential to greatly simplify harmonization efforts.
This method requires researchers to develop a 3-column key that contains (1) the name of each raw data file to be harmonized, (2) the names of all of the columns in each of those files, and (3) the “tidy name” that corresponds to each raw column name. Each dataset can then be read in and have its raw names replaced with the tidy ones specified in the key. Once this has been done to all files in the specified folder, they can be combined by their newly consistent column names. A visual version of this column key approach to harmonization is included for convenience here:
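The renaming-and-combining idea behind the column key can also be sketched in a few lines of base R. This is only an illustration of the concept under simple assumptions, not how ltertools implements it: the file names, the objects `raw_tables`, `key`, `renamed`, and `combined`, and the choice to hold the tables in memory rather than in a folder are all hypothetical.

```r
# Two hypothetical raw tables with inconsistent column names
raw_tables <- list(
  "a.csv" = data.frame(xx = 1:2, yy = c("a", "b")),
  "b.csv" = data.frame(NUMBERS = 3:4, LETTERS = c("c", "d"))
)

# Three-column key: source file, raw column name, tidy column name
key <- data.frame(
  source    = c("a.csv", "a.csv", "b.csv", "b.csv"),
  raw_name  = c("xx", "yy", "NUMBERS", "LETTERS"),
  tidy_name = c("numbers", "letters", "numbers", "letters")
)

# Rename each table's columns using only the key rows for that file
renamed <- lapply(names(raw_tables), function(file_name) {
  tab <- raw_tables[[file_name]]
  sub_key <- key[key$source == file_name, ]
  names(tab) <- sub_key$tidy_name[match(names(tab), sub_key$raw_name)]
  # Use one shared column order before stacking
  tab[, c("numbers", "letters")]
})

# Stack the tables now that their column names agree
combined <- do.call(rbind, renamed)
combined
```

The key is the only place where file-specific knowledge lives, so adding another file means adding rows to the key rather than writing new renaming code.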
To demonstrate this workflow, we will need to create some example data tables and export them to a temporary directory (so that they can be read back in, as is required by the harmonization functions).
    # Generate two simple tables
    ## Dataframe 1
    df1 <- data.frame("xx" = c(1:3),
                      "unwanted" = c("not", "needed", "column"),
                      "yy" = letters[1:3])

    ## Dataframe 2
    df2 <- data.frame("LETTERS" = letters[4:7],
                      "NUMBERS" = c(4:7),
                      "BONUS" = c("plantae", "animalia", "fungi", "protista"))

    # Generate a known temporary folder for exporting
    temp_folder <- tempdir()

    # Export both files to that folder
    utils::write.csv(x = df1, file = file.path(temp_folder, "df1.csv"), row.names = FALSE)
    utils::write.csv(x = df2, file = file.path(temp_folder, "df2.csv"), row.names = FALSE)

While the raw data must be in a folder, the data key is a dataframe in R to allow more flexibility on the user’s end about file format / storage (many LTER working groups like generating their key collaboratively as a Google Sheet). For this example, we can generate a data key manually here.
    # Generate a key that matches the data we created above
    key_obj <- data.frame("source" = c(rep("df1.csv", 3), rep("df2.csv", 3)),
                          "raw_name" = c("xx", "unwanted", "yy", "LETTERS", "NUMBERS", "BONUS"),
                          "tidy_name" = c("numbers", NA, "letters", "letters", "numbers", "kingdom"))

    # Check that out
    key_obj
    #>    source raw_name tidy_name
    #> 1 df1.csv       xx   numbers
    #> 2 df1.csv unwanted      <NA>
    #> 3 df1.csv       yy   letters
    #> 4 df2.csv  LETTERS   letters
    #> 5 df2.csv  NUMBERS   numbers
    #> 6 df2.csv    BONUS   kingdom

With some example files and a key object generated, we can now demonstrate the actual workflow! The most fundamental of these functions is harmonize. This function requires the “column key” described above, as well as the folder containing the raw data files to which the key refers. The raw data format can also be set to any of CSV, TXT, XLS, and/or XLSX. There is a quiet argument that will silence messages about key-to-data mismatches (either expected-but-missing column names or unexpected columns).
    # Use the key to harmonize our example data
    harmony <- ltertools::harmonize(key = key_obj, raw_folder = temp_folder,
                                    data_format = "csv", quiet = TRUE)

    # Check the structure of that
    utils::str(harmony)
    #> 'data.frame':    7 obs. of  4 variables:
    #>  $ source : chr  "df1.csv" "df1.csv" "df1.csv" "df2.csv" ...
    #>  $ numbers: chr  "1" "2" "3" "4" ...
    #>  $ letters: chr  "a" "b" "c" "d" ...
    #>  $ kingdom: chr  NA NA NA "plantae" ...

For users that want help generating the column key, we have created the begin_key function. This function accepts the raw folder and data format arguments included in harmonize, with an additional (optional) guess_tidy argument. If TRUE, that argument attempts to “guess” the desired tidy name for each raw column name; it does this by standardizing casing and removing special characters. This may be ideal if you anticipate that many of your raw data files differ only in casing/special characters rather than being phrased incompatibly. For our example here we’ll allow the key to “guess”.
    # Generate a column key with "guesses" at tidy column names
    test_key <- ltertools::begin_key(raw_folder = temp_folder, data_format = "csv",
                                     guess_tidy = TRUE)

    # Examine what that generated
    test_key
    #>    source raw_name tidy_name
    #> 1 df1.csv       xx        xx
    #> 2 df1.csv unwanted  unwanted
    #> 3 df1.csv       yy        yy
    #> 4 df2.csv  LETTERS   letters
    #> 5 df2.csv  NUMBERS   numbers
    #> 6 df2.csv    BONUS     bonus

Users that have already embraced harmonize may want to add a ‘batch’ of new raw data files to an existing key. In such cases, begin_key can work, but it is cumbersome to sort manually through its output for only the new rows to add to your existing column key. The expand_key function exists to simplify this process. It creates an output much like that of begin_key but it only includes rows for data files that are not already in (A) the existing key object or (B) the harmonized data object.
    # Make another simple 'raw' file
    df3 <- data.frame("xx" = c(10:15),
                      "letters" = letters[10:15])

    # Export this locally to the temp folder too
    utils::write.csv(x = df3, file = file.path(temp_folder, "df3.csv"), row.names = FALSE)

    # Identify what needs to be added to the existing column key
    ltertools::expand_key(key = key_obj, raw_folder = temp_folder, harmonized_df = harmony,
                          data_format = "csv", guess_tidy = TRUE)
    #>    source raw_name tidy_name
    #> 1 df3.csv       xx        xx
    #> 2 df3.csv  letters   letters

Sometimes it is convenient to read in all of the data files from a specified folder. read offers the chance to do just that and returns a list where the name of each element is the corresponding file name and the content of each element is the full data table. Users may specify the data format or formats they wish to read in. Currently, read supports CSV, TXT, XLS, and XLSX files.
We can demonstrate this with the test CSVs we created to demonstrate the harmonization workflow earlier.
    # Read in all of the CSVs that we created above
    data_list <- ltertools::read(raw_folder = temp_folder, data_format = "csv")

    # Check the structure of that
    utils::str(data_list)
    #> List of 3
    #>  $ df1.csv:'data.frame': 3 obs. of  3 variables:
    #>   ..$ xx      : int [1:3] 1 2 3
    #>   ..$ unwanted: chr [1:3] "not" "needed" "column"
    #>   ..$ yy      : chr [1:3] "a" "b" "c"
    #>  $ df2.csv:'data.frame': 4 obs. of  3 variables:
    #>   ..$ LETTERS: chr [1:4] "d" "e" "f" "g"
    #>   ..$ NUMBERS: int [1:4] 4 5 6 7
    #>   ..$ BONUS  : chr [1:4] "plantae" "animalia" "fungi" "protista"
    #>  $ df3.csv:'data.frame': 6 obs. of  2 variables:
    #>   ..$ xx     : int [1:6] 10 11 12 13 14 15
    #>   ..$ letters: chr [1:6] "j" "k" "l" "m" ...

On a different note, the solar_day_info function allows you to identify the time of sunrise, sunset, and solar noon (as well as the total length of the day) for each day between a specified start and end date at a set of latitude/longitude coordinates. All times are in UTC and the information retrieved is returned as a dataframe.
    # Identify day information in Santa Barbara (California) for six days
    ltertools::solar_day_info(lat = 34.41, lon = -119.71,
                              start_date = "2022-02-07", end_date = "2022-02-12",
                              quiet = TRUE)
    #>         date    sunrise     sunset solar_noon day_length time_zone
    #> 1 2022-02-07 2:49:59 PM 1:35:56 AM 8:12:57 PM   10:45:57       UTC
    #> 2 2022-02-08 2:49:05 PM 1:36:54 AM 8:13:00 PM   10:47:49       UTC
    #> 3 2022-02-09 2:48:10 PM 1:37:52 AM 8:13:01 PM   10:49:42       UTC
    #> 4 2022-02-10 2:47:14 PM 1:38:50 AM 8:13:02 PM   10:51:36       UTC
    #> 5 2022-02-11 2:46:16 PM 1:39:48 AM 8:13:02 PM   10:53:32       UTC
    #> 6 2022-02-12 2:45:17 PM 1:40:45 AM 8:13:01 PM   10:55:28       UTC

Much of the synthesis work with LTER data (and indeed many ecological research projects generally) requires quantification of variation. To that end, we’ve written a function that simply calculates the coefficient of variation (standard deviation divided by mean) for a vector of numbers. Because sd and mean both support an argument for defining how missing values are handled, our cv function does as well.
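As a rough illustration of that calculation, the coefficient of variation can be written in a couple of lines of base R. The helper name `cv_sketch` is hypothetical and this is not the package's implementation; it only mirrors the "standard deviation divided by mean" definition and passes a missing-value argument through to `sd` and `mean`.

```r
# Hypothetical helper (not the package's own code): CV = sd / mean
cv_sketch <- function(x, na_rm = TRUE) {
  stats::sd(x, na.rm = na_rm) / mean(x, na.rm = na_rm)
}

# Same numbers as the package example in this vignette
cv_sketch(c(4, 5, 6, 4, 5, 5))
#> [1] 0.1557461

# With a missing value, na_rm = FALSE propagates NA
cv_sketch(c(4, 5, NA), na_rm = FALSE)
#> [1] NA
```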
    # Calculate CV (excluding missing values)
    ltertools::cv(x = c(4, 5, 6, 4, 5, 5), na_rm = TRUE)
    #> [1] 0.1557461

We also included a simple function for converting temperature values (convert_temp) among different accepted units. Simply specify the values to convert, their current units, and the units to which you would like to convert, and the function will perform the needed arithmetic. Units are case-insensitive and support either the one-letter abbreviation or the full name of the unit.
    # Convert some temperatures from Fahrenheit to Kelvin
    ltertools::convert_temp(value = c(0, 32, 110), from = "Fahrenheit", to = "k")
    #> [1] 255.3722 273.1500 316.4833

Note that we chose this function’s naming convention in part to allow for an ecosystem of related ‘unit conversion’ functions that may prove worthwhile to develop.
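For readers curious about the arithmetic behind the Fahrenheit-to-Kelvin example above, it can be sketched in base R as follows. The helper `f_to_k` is a hypothetical name used only for illustration; it covers a single unit pair and makes no claim about how convert_temp is written internally.

```r
# Hypothetical helper (not part of ltertools): Fahrenheit to Kelvin
f_to_k <- function(value) {
  # Convert to Celsius first, then shift to Kelvin
  celsius <- (value - 32) * 5 / 9
  celsius + 273.15
}

f_to_k(c(0, 32, 110))
#> [1] 255.3722 273.1500 316.4833
```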
The LTER Network is composed of many separate sites. While all of these sites are “long term” they do vary slightly in when they were created. For those interested in knowing the temporal coverage of data from a particular site or group of sites, site_timeline can prove a helpful function. This function creates a ggplot2 timeline where sites are on the vertical axis and years are on the horizontal. Lines are colored based on the habitat of the site and there is support for a user-defined set of hexadecimal colors, though by default an internal palette is used.
Sites can be specified by their three-letter site code, or all sites in a particular habitat can be included.
    # Check the timeline for all grassland or forest LTER sites
    ltertools::site_timeline(habitats = c("grassland", "forest"))

Running the function without specifying site codes or habitat types will result in a timeline of all active LTER sites.