| Title: | Intensive Care Unit Data with R |
| Description: | Focused on (but not exclusive to) data sets hosted on PhysioNet (https://physionet.org), 'ricu' provides utilities for download, setup and access of intensive care unit (ICU) data sets. In addition to functions for running arbitrary queries against available data sets, a system for defining clinical concepts and encoding their representations in tabular ICU data is presented. |
| Version: | 0.6.3 |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| Language: | en-US |
| URL: | https://github.com/eth-mds/ricu, https://physionet.org |
| BugReports: | https://github.com/eth-mds/ricu/issues |
| Depends: | R (≥ 3.5.0) |
| Imports: | data.table, curl, assertthat, fst, readr, jsonlite, methods, stats, prt (≥ 0.1.2), tibble, backports, rlang, vctrs, cli (≥ 2.1.0), fansi, openssl, utils |
| Suggests: | xml2, covr, testthat (≥ 3.0.0), withr, mockthat, pkgload, mimic.demo, eicu.demo, progress, knitr, rmarkdown, ggplot2, cowplot, survival, forestmodel, rticles, kableExtra, units, pdftools, magick, pillar |
| RoxygenNote: | 7.3.2 |
| Additional_repositories: | https://eth-mds.github.io/physionet-demo |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2025-09-03 21:17:03 UTC; nbennett |
| Author: | Nicolas Bennett [aut, cre], Drago Plecko [aut], Ida-Fong Ukor [aut] |
| Maintainer: | Nicolas Bennett <r@nbenn.ch> |
| Repository: | CRAN |
| Date/Publication: | 2025-09-03 21:50:09 UTC |
ricu: Intensive Care Unit Data with R
Description
Focused on (but not exclusive to) data sets hosted on PhysioNet (https://physionet.org), 'ricu' provides utilities for download, setup and access of intensive care unit (ICU) data sets. In addition to functions for running arbitrary queries against available data sets, a system for defining clinical concepts and encoding their representations in tabular ICU data is presented.
Author(s)
Maintainer: Nicolas Bennett r@nbenn.ch
Authors:
Drago Plecko drago.plecko@stat.math.ethz.ch
Ida-Fong Ukor ida-fong.ukor@monashhealth.org
See Also
Useful links:
Report bugs at https://github.com/eth-mds/ricu/issues
Internal item callback utilities
Description
The utility function add_concept() is exported for convenience when adding external datasets and integrating concepts that require other concepts. While this could be solved by defining a rec_cncpt, in some scenarios this might not be ideal, as it might only be required that item implementations for certain data sources pull in additional information. Examples for this include vasopressor rates, which might rely on patient weight, and blood cell counts when expressed as ratios. For performance reasons, the pulled-in concept is internally cached, as it might be used unchanged many times when loading several concepts that each need to pull in the given concept. Cache persistence is session-level and this utility is therefore intended to be used somewhat sparingly.
Usage
add_concept(x, env, concept, var_name = concept, aggregate = NULL)

add_weight(x, env, var_name = "weight")

calc_dur(x, val_var, min_var, max_var, grp_var = NULL)

combine_callbacks(...)

Arguments
x | Object in loading |
env | Data source environment as available as |
| concept | String valued concept name that will be loaded from the default dictionary |
var_name | String valued variable name |
aggregate | Forwarded to |
val_var | String valued column name corresponding to the value variable |
min_var,max_var | Column names denoting start and end times |
grp_var | Optional grouping variable (for example linking infusions) |
... | Functions which will be successively applied |
Value
A copy of x with the requested concept merged in.
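As an illustration of the intended use, a custom item callback might pull in patient weight in order to express an absolute infusion rate per kilogram of body weight. The following sketch is hypothetical (the callback signature is simplified and the unit conversion is an assumption, not part of ricu):

```r
# Hypothetical item callback: convert an absolute infusion rate into a
# weight-relative one by pulling in the `weight` concept via add_weight()
norm_rate_per_kg <- function(x, val_var, env, ...) {
  # merges the (cached) `weight` concept into `x`, adding a `weight` column
  x <- add_weight(x, env, var_name = "weight")

  # divide the loaded values by patient weight (units assumed consistent)
  x <- x[, c(val_var) := get(val_var) / weight]

  # drop the helper column again before returning
  x[, weight := NULL]
}
```

Such a callback would then be referenced from an item definition in a concept dictionary, rather than called directly.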
Data attach utilities
Description
Making a dataset available to ricu consists of 3 steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading loads the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently and attaching sets up the data for use by the package.
Usage
attach_src(x, ...)

## S3 method for class 'src_cfg'
attach_src(x, assign_env = NULL, data_dir = src_data_dir(x), ...)

## S3 method for class 'character'
attach_src(x, assign_env = NULL, data_dir = src_data_dir(x), ...)

detach_src(x)

setup_src_env(x, ...)

## S3 method for class 'src_cfg'
setup_src_env(x, data_dir = src_data_dir(x), link_env = NULL, ...)

Arguments
x | Data source to attach |
... | Forwarded to further calls to |
| assign_env,link_env | Environment in which the data source will become available |
data_dir | Directory used to look for |
Details
Attaching a dataset sets up two types of S3 classes: a single src_env object, containing as many src_tbl objects as tables are associated with the dataset. A src_env is an environment with an id_cfg attribute, as well as sub-classes as specified by the data source class_prefix configuration setting (see load_src_cfg()). All src_env objects created by calling attach_src() represent environments that are direct descendants of the data environment and are bound to the respective dataset name within that environment. For more information on src_env and src_tbl objects, refer to new_src_tbl().
If set up correctly, it is not necessary for the user to directly call attach_src(). When the package is loaded, the default data sources (see auto_attach_srcs()) are attached automatically. This default can be controlled by setting the environment variable RICU_SRC_LOAD to a comma separated list of data source names before loading the library. Setting this environment variable as
Sys.setenv(RICU_SRC_LOAD = "mimic_demo,eicu_demo")
will change the default of loading both MIMIC-III and eICU, alongside the respective demo datasets, as well as HiRID and AUMC, to just the two demo datasets. For setting an environment variable upon startup of the R session, refer to base::.First.sys().
Attaching a dataset during package namespace loading will both instantiate a corresponding src_env in the data environment and for convenience also assign this object into the package namespace, such that for example the MIMIC-III demo dataset not only is available as ricu::data::mimic_demo, but also as ricu::mimic_demo (or, if the package namespace is attached, simply as mimic_demo). Dataset attaching using attach_src() does not need to happen during namespace loading, but can be triggered by the user at any time. If such a convenience link as described above is desired by the user, an environment such as .GlobalEnv has to be passed as assign_env to attach_src().
Data sets are set up as src_env objects irrespective of whether all (or any) of the required data is available. If some (or all) data is missing, the user is asked for permission to download in interactive sessions and an error is thrown in non-interactive sessions. Downloading demo datasets requires no further information, but access to full-scale datasets (even though they are publicly available) is guarded by access credentials (see download_src()).
While attach_src() provides the main entry point, src_env objects are instantiated by the S3 generic function setup_src_env() and the wrapping function serves to catch errors that might be caused by config file parsing issues, so as not to break attaching of the package namespace. Apart from this, attach_src() also provides the convenience linking into the package namespace (or a user-specified environment) described above.
A src_env object created by setup_src_env() does not directly contain src_tbl objects bound to names, but rather an active binding (see base::makeActiveBinding()) per table. These active bindings check for availability of required files and evaluate to corresponding src_tbl objects if these checks pass, and ask for user input otherwise. As src_tbl objects are intended to be read-only, assignment is not possible except for the value NULL, which resets the internally cached src_tbl that is created on first successful access.
Value
Both attach_src() and setup_src_env() are called for side effects and therefore return invisibly. While attach_src() returns NULL, setup_src_env() returns the newly created src_env object.
Examples
## Not run:
Sys.setenv(RICU_SRC_LOAD = "")
library(ricu)

ls(envir = data)
exists("mimic_demo")

attach_src("mimic_demo", assign_env = .GlobalEnv)

ls(envir = data)
exists("mimic_demo")

mimic_demo
## End(Not run)
ICU class data reshaping
Description
Utilities for reshaping id_tbl and ts_tbl objects.
Usage
cbind_id_tbl(
  ...,
  keep.rownames = FALSE,
  check.names = FALSE,
  key = NULL,
  stringsAsFactors = FALSE
)

rbind_id_tbl(
  ...,
  use.names = TRUE,
  fill = FALSE,
  idcol = NULL,
  ignore.attr = FALSE
)

## S3 method for class 'id_tbl'
merge(x, y, by = NULL, by.x = NULL, by.y = NULL, ...)

## S3 method for class 'id_tbl'
split(x, ...)

rbind_lst(x, ...)

merge_lst(x)

unmerge(x, col_groups = as.list(data_vars(x)), by = meta_vars(x), na_rm = TRUE)

Arguments
... | Objects to combine |
| keep.rownames,check.names,key,stringsAsFactors | Forwarded to data.table::data.table |
| use.names,fill,idcol | Forwarded to data.table::rbindlist |
x,y | Objects to combine |
by,by.x,by.y | Column names used for combining data |
| col_groups | A list of character vectors defining the grouping of non-by columns |
| na_rm | Logical flag indicating whether to remove rows that have all missing entries in the respective |
Value
Either id_tbl or ts_tbl objects (depending on inputs) or lists thereof in case of split() and unmerge().
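As a minimal sketch of the merge behavior, consider two patient-level tables sharing an ID column (this assumes the id_tbl() constructor, which by default treats the first column as the ID; values are illustrative):

```r
library(ricu)

# two patient-level tables with a common ID column
age <- id_tbl(patient_id = 1:3, age = c(71, 58, 64))
wgt <- id_tbl(patient_id = 2:4, weight = c(82, 70, 95))

# the id_tbl merge method joins on the shared meta (ID) columns by default
merge(age, wgt)              # inner join: patients 2 and 3 only
merge(age, wgt, all = TRUE)  # outer join over patients 1-4, with NAs
```

The result is again an id_tbl, so further ricu operations can be chained on it.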
Switch between id types
Description
ICU datasets such as MIMIC-III or eICU typically represent patients by multiple ID systems, such as patient IDs, hospital stay IDs and ICU admission IDs. Even if the raw data is available in only one such ID system, given a mapping of IDs alongside start and end times, it is possible to convert data from one ID system to another. The function change_id() provides such a conversion utility, internally either calling upgrade_id() when moving to an ID system with higher cardinality or downgrade_id() when the target ID system is of lower cardinality.
Usage
change_id(x, target_id, src, ..., keep_old_id = TRUE, id_type = FALSE)

upgrade_id(x, target_id, src, cols = time_vars(x), ...)

downgrade_id(x, target_id, src, cols = time_vars(x), ...)

## S3 method for class 'ts_tbl'
upgrade_id(x, target_id, src, cols = time_vars(x), ...)

## S3 method for class 'id_tbl'
upgrade_id(x, target_id, src, cols = time_vars(x), ...)

## S3 method for class 'ts_tbl'
downgrade_id(x, target_id, src, cols = time_vars(x), ...)

## S3 method for class 'id_tbl'
downgrade_id(x, target_id, src, cols = time_vars(x), ...)

Arguments
x |
|
target_id | The destination id name |
src | Passed to |
... | Passed to |
keep_old_id | Logical flag indicating whether to keep the previous IDcolumn |
id_type | Logical flag indicating whether |
cols | Column names that require time-adjustment |
Details
In order to provide ID system conversion for a data source, the (internal) function id_map() must be able to construct an ID mapping for that data source. Constructing such a mapping can be expensive relative to the frequency with which it might be re-used and therefore id_map() provides caching infrastructure. The mapping itself is constructed by the (internal) function id_map_helper(), which is expected to provide source and destination ID columns, as well as start and end columns corresponding to the destination ID, relative to the source ID system. In the following example, we request a mapping for mimic_demo, with ICU stay IDs as source and hospital admissions as destination IDs.
id_map_helper(mimic_demo, "icustay_id", "hadm_id")
#> # An `id_tbl`: 136 x 4
#> # Id var:     `icustay_id`
#>     icustay_id hadm_id hadm_id_start hadm_id_end
#>          <int>   <int>        <drtn>      <drtn>
#>   1     201006  198503    -3291 mins   9113 mins
#>   2     201204  114648       -2 mins   6949 mins
#>   3     203766  126949    -1336 mins   8818 mins
#>   4     204132  157609       -1 mins  10103 mins
#>   5     204201  177678     -369 mins   9444 mins
#> ...
#> 132     295043  170883   -10413 mins  31258 mins
#> 133     295741  176805       -2 mins   3152 mins
#> 134     296804  110244    -1295 mins   4598 mins
#> 135     297782  167612       -1 mins    207 mins
#> 136     298685  151323       -1 mins  19082 mins
#> # i 131 more rows
Both start and end columns encode the hospital admission windows relative to each corresponding ICU stay start time. It therefore comes as no surprise that most start times are negative (hospital admission typically occurs before ICU stay start time), while end times are often days in the future (as hospital discharge typically occurs several days after ICU admission).
In order to use the ID conversion infrastructure offered by ricu for a new dataset, it typically suffices to provide an id_cfg entry in the source configuration (see load_src_cfg()), outlining the available ID systems alongside an ordering, as well as potentially a class-specific implementation of id_map_helper() for the given source class, specifying the corresponding time windows in 1 minute resolution (for every possible pair of IDs).
While both up- and downgrades for id_tbl objects, as well as downgrades for ts_tbl objects, are simple merge operations based on the ID mapping provided by id_map(), ID upgrades for ts_tbl objects are slightly more involved. As an example, consider the following setting: we have data associated with hadm_id IDs and times relative to hospital admission:
               1      2       3        4       5       6        7      8
data        ---*------*-------*--------*-------*-------*--------*------*---
               3h     10h     18h      27h     35h     43h      52h    59h

            0h     7h                26h    37h               53h        62h
hadm_id     |--------------------------------------------------------------|
icustay_id         |------------------|     |------------------|
                   0h                19h    0h                16h
                         ICU_1                     ICU_2
The mapping of data points from hadm_id to icustay_id is created as follows: ICU stay end times mark boundaries and all data that is recorded after the last ICU stay ended is assigned to the last ICU stay. Therefore data points 1-3 are assigned to ICU_1, while 4-8 are assigned to ICU_2. Times have to be shifted as well, as timestamps are expected to be relative to the current ID system. Data points 1-3 therefore are assigned to time stamps -4h, 3h and 11h, while data points 4-8 are assigned to -10h, -2h, 6h, 15h and 22h. Implementation-wise, the mapping is computed using an efficient data.table rolling join.
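The boundary logic described above can be sketched in plain base R, using the toy numbers from the figure (this is only an illustration of the assignment rule; the actual implementation uses a data.table rolling join):

```r
# data point times (hours, relative to hospital admission)
times <- c(3, 10, 18, 27, 35, 43, 52, 59)

# ICU stay windows, also relative to hospital admission
icu_start <- c(ICU_1 = 7,  ICU_2 = 37)
icu_end   <- c(ICU_1 = 26, ICU_2 = 53)

# ICU stay end times act as boundaries; everything after the last end
# time falls through to the last stay
idx <- findInterval(times, head(icu_end, -1)) + 1L

names(icu_start)[idx]   # "ICU_1" for points 1-3, "ICU_2" for points 4-8
times - icu_start[idx]  # -4  3 11 -10  -2   6  15  22
```

The shifted times reproduce the -4h, 3h, 11h and -10h through 22h values given in the text.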
Value
An object of the same type asx with modified IDs.
Examples
if (require(mimic.demo)) {
tbl <- mimic_demo$labevents
dat <- load_difftime(tbl, itemid == 50809, c("charttime", "valuenum"))
dat

change_id(dat, "icustay_id", tbl, keep_old_id = FALSE)
}
Internal utilities for ICU data classes
Description
Internal utilities for ICU data classes
Usage
col_renamer(x, new, old = colnames(x), skip_absent = FALSE, by_ref = FALSE)

Arguments
x | Object to query |
| new,old | Replacement names and existing column names for renaming columns |
skip_absent | Logical flag for ignoring non-existent column names |
| by_ref | Logical flag indicating whether to perform the operation by reference |
ICU datasets
Description
The Laboratory for Computational Physiology (LCP) at MIT hosts several large-scale databases of hospital intensive care units (ICUs), two of which can be either downloaded in full (MIMIC-III and eICU) or as demo subsets (MIMIC-III demo and eICU demo), while a third data set is available only in full (HiRID). While demo data sets are freely available, full download requires credentialed access, which can be gained by applying for an account with PhysioNet. Even though registration is required, the described datasets are all publicly available. With AmsterdamUMCdb, a non-PhysioNet hosted data source is available as well. As with the PhysioNet datasets, access is public but has to be granted by the data collectors.
Usage
data

Format
The exported data environment contains all datasets that have been made available to ricu. For datasets that are attached during package loading (see attach_src()), shortcuts to the datasets are set up in the package namespace, allowing the object ricu::data::mimic_demo to be accessed as ricu::mimic_demo (or, in case the package namespace has been attached, simply as mimic_demo). Datasets that are made available after the package namespace has been sealed will have their proxy object by default located in .GlobalEnv. Datasets are represented by src_env objects, while individual tables are src_tbl objects; neither represents in-memory data, but rather data stored on disk, subsets of which can be loaded into memory.
Details
Setting up a dataset for use with ricu requires a configuration object. For the included datasets, configuration can be loaded from

system.file("extdata", "config", "data-sources.json", package = "ricu")

by calling load_src_cfg() and for datasets that are external to ricu, additional configuration can be made available by setting the environment variable RICU_CONFIG_PATH (for more information, refer to load_src_cfg()). Using the dataset configuration object, data can be downloaded (download_src()), imported (import_src()) and attached (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading loads the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently (by converting it to .fst format) and attaching sets up the data for use by the package. For more information on the individual steps, refer to the respective documentation pages.
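Put together, a one-time setup for the MIMIC-III demo dataset might look roughly as follows (a sketch, assuming load_src_cfg() returns a named list of configuration objects keyed by source name):

```r
library(ricu)

# load the configuration object for the MIMIC-III demo dataset
cfg <- load_src_cfg("mimic_demo")[["mimic_demo"]]

download_src(cfg)  # one-time: fetch the raw .csv data
import_src(cfg)    # one-time: convert to .fst for efficient access
attach_src(cfg, assign_env = .GlobalEnv)  # repeated on every package load
```

After this, `mimic_demo` is available for interactive exploration as shown in the following output.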
A dataset that has been successfully made available can interactively be explored by typing its name into the console, and individual tables can be inspected using the $ function. For example, for the MIMIC-III demo dataset and the icustays table, this gives
mimic_demo
#> <mimic_demo_env[25]>
#>   admissions       callout            caregivers       chartevents
#>   [129 x 19]       [77 x 24]          [7,567 x 4]      [758,355 x 15]
#>   cptevents        d_cpt              d_icd_diagnoses  d_icd_procedures
#>   [1,579 x 12]     [134 x 9]          [14,567 x 4]     [3,882 x 4]
#>   d_items          d_labitems         datetimeevents   diagnoses_icd
#>   [12,487 x 10]    [753 x 6]          [15,551 x 14]    [1,761 x 5]
#>   drgcodes         icustays           inputevents_cv   inputevents_mv
#>   [297 x 8]        [136 x 12]         [34,799 x 22]    [13,224 x 31]
#>   labevents        microbiologyevents outputevents     patients
#>   [76,074 x 9]     [2,003 x 16]       [11,320 x 13]    [100 x 8]
#>   prescriptions    procedureevents_mv procedures_icd   services
#>   [10,398 x 19]    [753 x 25]         [506 x 5]        [163 x 6]
#>   transfers
#>   [524 x 13]

mimic_demo$icustays
#> # <mimic_tbl>: [136 x 12]
#> # ID options:  subject_id (patient) < hadm_id (hadm) < icustay_id (icustay)
#> # Defaults:    `intime` (index), `last_careunit` (val)
#> # Time vars:   `intime`, `outtime`
#>     row_id subject_id hadm_id icustay_id dbsource   first_careunit last_careunit
#>      <int>      <int>   <int>      <int> <chr>      <chr>          <chr>
#>   1  12742      10006  142345     206504 carevue    MICU           MICU
#>   2  12747      10011  105331     232110 carevue    MICU           MICU
#>   3  12749      10013  165520     264446 carevue    MICU           MICU
#>   4  12754      10017  199207     204881 carevue    CCU            CCU
#>   5  12755      10019  177759     228977 carevue    MICU           MICU
#> ...
#> 132  42676      44083  198330     286428 metavision CCU            CCU
#> 133  42691      44154  174245     217724 metavision MICU           MICU
#> 134  42709      44212  163189     239396 metavision MICU           MICU
#> 135  42712      44222  192189     238186 metavision CCU            CCU
#> 136  42714      44228  103379     217992 metavision SICU           SICU
#> # i 131 more rows
#> # i 5 more variables: first_wardid <int>, last_wardid <int>, intime <dttm>,
#> #   outtime <dttm>, los <dbl>
Table subsets can be loaded into memory, for example using the base::subset() function, which uses non-standard evaluation (NSE) to determine a row-subsetting. This design choice stems from the fact that some tables can have on the order of 10^8 rows, which makes loading full tables into memory an expensive operation. Table subsets loaded into memory are represented as data.table objects. Extending the above example, if only ICU stays corresponding to the patient with subject_id == 10124 are of interest, the respective data can be loaded as
subset(mimic_demo$icustays, subject_id == 10124)
#>    row_id subject_id hadm_id icustay_id dbsource first_careunit last_careunit
#>     <int>      <int>   <int>      <int>   <char>         <char>        <char>
#> 1:  12863      10124  182664     261764  carevue           MICU          MICU
#> 2:  12864      10124  170883     222779  carevue           MICU          MICU
#> 3:  12865      10124  170883     295043  carevue            CCU           CCU
#> 4:  12866      10124  170883     237528  carevue           MICU          MICU
#>    first_wardid last_wardid              intime             outtime     los
#>           <int>       <int>              <POSc>              <POSc>   <num>
#> 1:           23          23 2192-03-29 10:46:51 2192-04-01 06:36:00  2.8258
#> 2:           50          50 2192-04-16 20:58:32 2192-04-20 08:51:28  3.4951
#> 3:            7           7 2192-04-24 02:29:49 2192-04-26 23:59:45  2.8958
#> 4:           23          23 2192-04-30 14:50:44 2192-05-15 23:34:21 15.3636
Much care has been taken to make ricu extensible to new datasets. For example the publicly available ICU database AmsterdamUMCdb, provided by the Amsterdam University Medical Center, currently is not part of the core datasets of ricu, but code for integrating this dataset is available on GitHub.
MIMIC-III
The Medical Information Mart for Intensive Care (MIMIC) database holds detailed clinical data from roughly 60,000 patient stays in Beth Israel Deaconess Medical Center (BIDMC) intensive care units between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). For further information, please refer to the MIMIC-III documentation.
The corresponding demo dataset contains the full data of a randomly selected subset of 100 patients from the patient cohort with confirmed in-hospital mortality. The only notable data omission is the noteevents table, which contains unstructured text reports on patients.
eICU
More recently, Philips Healthcare and LCP began assembling the eICU Collaborative Research Database as a multi-center resource for ICU data. Combining data of several critical care units throughout the continental United States from the years 2014 and 2015, this database contains de-identified health data associated with over 200,000 admissions, including vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. For further information, please refer to the eICU documentation.
For the demo subset, data associated with over 2,500 unit stays selected from 20 of the larger hospitals is included. An important caveat that applies to the eICU-based datasets is considerable variability among the large number of hospitals in terms of data availability.
HiRID
Moving to higher time-resolution, HiRID is a freely accessible critical care dataset containing data relating to almost 34,000 patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters, collected during the period from January 2008 to June 2016. Depending on the type of measurement, time resolution can be on the order of 2 minutes.
AmsterdamUMCdb
With similar time-resolution (for vital-sign measurements) as HiRID, AmsterdamUMCdb contains data from 23,000 admissions of adult patients from 2003-2016 to the department of Intensive Care of Amsterdam University Medical Center. In total, nearly 10^9 individual observations consisting of vital signs, clinical scoring systems, device data and lab results, as well as nearly 5*10^6 medication entries, alongside de-identified demographic information corresponding to the 20,000 individual patients, are spread over 7 tables.
MIMIC-IV
The latest v2.2 release of MIMIC-IV is available in ricu. Building on the success of MIMIC-III, this next iteration contains data on patients admitted to an ICU or the emergency department between 2008 and 2019 at BIDMC. Therefore, relative to MIMIC-III, patients admitted prior to 2008 (whose data is stored in a CareVue-based system) have been removed, while data from 2012 onward has been added. This simplifies data queries considerably, as the CareVue/MetaVision data split in MIMIC-III no longer applies. While addition of ED data is planned, this is not part of the initial v1.0 release and is currently not supported by ricu. For further information, please refer to the MIMIC-IV documentation.
SICdb
The Salzburg ICU database (SICdb) originates from the University Hospital of Salzburg. In ricu, version v1.0.6 is currently supported. For further information, please refer to the SICdb documentation.
References
Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database(version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
MIMIC-III, a freely accessible critical care database. Johnson AEW, PollardTJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA,and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.
Johnson, A., Pollard, T., Badawi, O., & Raffa, J. (2019). eICUCollaborative Research Database Demo (version 2.0). PhysioNet.https://doi.org/10.13026/gxmm-es70.
The eICU Collaborative Research Database, a freely available multi-centerdatabase for critical care research. Pollard TJ, Johnson AEW, Raffa JD,Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI:http://dx.doi.org/10.1038/sdata.2018.178.
Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., &Merz, T. (2020). HiRID, a high time-resolution ICU dataset (version 1.0).PhysioNet. https://doi.org/10.13026/hz5m-md48.
Hyland, S.L., Faltys, M., Hüser, M. et al. Early prediction of circulatoryfailure in the intensive care unit using machine learning. Nat Med 26,364–373 (2020). https://doi.org/10.1038/s41591-020-0789-4
Thoral PJ, Peppink JM, Driessen RH, et al (2020) AmsterdamUMCdb: The FirstFreely Accessible European Intensive Care Database from the ESICM DataSharing Initiative. https://www.amsterdammedicaldatascience.nl.
Elbers, Dr. P.W.G. (Amsterdam UMC) (2019): AmsterdamUMCdb v1.0.2. DANS.https://doi.org/10.17026/dans-22u-f8vd
Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R.(2021). MIMIC-IV (version 1.0). PhysioNet.https://doi.org/10.13026/s6n6-xd98.
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark,R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet:Components of a new research resource for complex physiologic signals.Circulation (Online). 101 (23), pp. e215–e220.
File system utilities
Description
Determine the location where to place data meant to persist between individual sessions.
Usage
data_dir(subdir = NULL, create = TRUE)

src_data_dir(srcs)

auto_attach_srcs()

config_paths()

get_config(name, cfg_dirs = config_paths(), combine_fun = c, ...)

set_config(x, name, dir = file.path("inst", "extdata", "config"), ...)

Arguments
| subdir | A string specifying a directory that will be made sure to exist below the data directory. |
create | Logical flag indicating whether to create the specifieddirectory |
srcs | Character vector of data source names, an object for which an |
name | File name of the configuration file ( |
cfg_dirs | Character vector of directories searched for config files |
| combine_fun | If multiple files are found, a function for combining returned lists |
... | Passed to |
x | Object to be written |
dir | Directory to write the file to (created if non-existent) |
Details
For data, the default location depends on the operating system as
| Platform | Location |
| Linux | ~/.local/share/ricu |
| macOS | ~/Library/Application Support/ricu |
| Windows | %LOCALAPPDATA%/ricu |
If the default storage directory does not exist, it will only be created upon user consent (requiring an interactive session).
The environment variable RICU_DATA_PATH can be used to overwrite the default location. If desired, this variable can be set in an R startup file to make it apply to all R sessions. For example, it could be set within:
A project-local .Renviron;

The user-level .Renviron;

A file at $(R RHOME)/etc/Renviron.site.
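For example, a user-level .Renviron might contain the following two entries (the path and source names are illustrative, not defaults):

```
RICU_DATA_PATH=/home/user/ricu-data
RICU_SRC_LOAD=mimic_demo,eicu_demo
```

With this in place, every new R session stores data under the given path and attaches only the two demo datasets.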
Any directory specified as environment variable will recursively be created.
Data source directories typically are sub-directories to data_dir(), named the same as the respective dataset. For demo datasets corresponding to mimic and eicu, file location however deviates from this scheme. The function src_data_dir() is used to determine the expected data location of a given dataset.
Configuration files used both for data source configuration, as well as for dictionary definitions, potentially involve multiple files that are read and merged. For that reason, get_config() will iterate over directories passed as cfg_dirs and look for the specified file (with suffix .json appended; the file might be missing in some of the queried directories). All found files are read by jsonlite::read_json() and the resulting lists are combined by reduction with the binary function passed as combine_fun.
With default arguments, get_config() will simply concatenate lists corresponding to files found in the default config locations as returned by config_paths(): first the directory specified by the environment variable RICU_CONFIG_PATH (if set), followed by the directory at
system.file("extdata", "config", package = "ricu")

Further arguments are passed to jsonlite::read_json(), which is called with slightly modified defaults: simplifyVector = TRUE, simplifyDataFrame = FALSE and simplifyMatrix = FALSE.
The utility function set_config() writes the list passed as x to the file dir/name.json, using jsonlite::write_json(), also with slightly modified defaults (which can be overridden by passing arguments as ...): null = "null", auto_unbox = TRUE and pretty = TRUE.
Whenever the package namespace is attached, a summary of dataset availability is printed using the utility functions auto_attach_srcs() and src_data_avail(). While the former simply returns a character vector of data sources that are configured for automatically being set up on package loading, the latter returns a summary of the number of available tables per dataset. Finally, is_data_avail() returns a named logical vector indicating which data sources have all required data available.
Value
Functions data_dir(), src_data_dir() and config_paths() return file paths as character vectors, auto_attach_srcs() returns a character vector of data source names, src_data_avail() returns a data.frame describing availability of data sources and is_data_avail() a named logical vector. Configuration utilities get_config() and set_config() read and write list objects from/to JSON format.
Examples
Sys.setenv(RICU_DATA_PATH = tempdir())
identical(data_dir(), tempdir())

dir.exists(file.path(tempdir(), "some_subdir"))
some_subdir <- data_dir("some_subdir")
dir.exists(some_subdir)

cfg <- get_config("concept-dict")

identical(
  cfg,
  get_config("concept-dict", system.file("extdata", "config", package = "ricu"))
)
Data download utilities
Description
Making a dataset available to ricu consists of 3 steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading loads the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently (by converting it to .fst format) and attaching sets up the data for use by the package.
Usage
download_src(x, data_dir = src_data_dir(x), ...)

## S3 method for class 'src_cfg'
download_src(x, data_dir = src_data_dir(x), tables = NULL, force = FALSE, ...)

## S3 method for class 'aumc_cfg'
download_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  token = NULL,
  verbose = TRUE,
  ...
)

## S3 method for class 'character'
download_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  user = NULL,
  pass = NULL,
  verbose = TRUE,
  ...
)

Arguments
x | Object specifying the source configuration |
| data_dir | Destination directory where the downloaded data is written to. |
... | Generic consistency |
tables | Character vector specifying the tables to download. If |
force | Logical flag; if |
token | Download token for AmsterdamUMCdb (see 'Details') |
verbose | Logical flag indicating whether to print progress information |
user,pass | PhysioNet credentials; if NULL and credentials are required, they are queried interactively |
Details
Downloads by ricu focus on data hosted by PhysioNet and tools are currently available for downloading the datasets MIMIC-III, eICU and HiRID (see data). While credentials are required for downloading any of the three datasets, demo datasets for both MIMIC-III and eICU are available without having to log in. Even though access to the full datasets is credentialed, the datasets are in fact publicly available. For setting up an account, please refer to the registration form.
PhysioNet credentials can either be entered in an interactive session, passed as function arguments user/pass or as environment variables RICU_PHYSIONET_USER/RICU_PHYSIONET_PASS. For setting environment variables on session startup, refer to base::.First.sys() and for setting environment variables in general, refer to base::Sys.setenv(). If the openssl package is available, SHA256 hashes of downloaded files are verified using openssl::sha256().
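For non-interactive use, the credentials could for example be set at the top of a script before triggering a download (a sketch; the values shown are placeholders, not real credentials, and the table name is chosen purely for illustration):

```r
# Placeholder credentials -- substitute your own PhysioNet login
Sys.setenv(
  RICU_PHYSIONET_USER = "my-user",
  RICU_PHYSIONET_PASS = "my-pass"
)

# Subsequent downloads of credentialed sources then proceed without
# prompting for user input
download_src("mimic", tables = "admissions")
```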
Demo datasets MIMIC-III demo and eICU demo can either be installed as R packages directly by running
install.packages(
  c("mimic.demo", "eicu.demo"),
  repos = "https://eth-mds.github.io/physionet-demo"
)

or downloaded and imported using download_src() and import_src(). Furthermore, ricu specifies mimic.demo and eicu.demo as Suggests dependencies; therefore, passing dependencies = TRUE when calling install.packages() for installing ricu will automatically install the demo datasets as well.
While the included data downloaders are intended for data hosted by PhysioNet, download_src() is an S3 generic function that can be extended to new classes. Method dispatch is intended to occur on objects that inherit from or can be coerced to src_cfg. For more information on data source configuration, refer to load_src_cfg().
As such, with the addition of the AmsterdamUMCdb dataset, which unfortunately is not hosted on PhysioNet, a separate downloader for that dataset is available as well. Currently this requires both availability of the CRAN package xml2, as well as the command line utility 7zip. Furthermore, data access has to be requested and for non-interactive download the download token has to be made available as environment variable RICU_AUMC_TOKEN or passed as token argument to download_src(). The download token can be retrieved from the URL provided when access is granted, by extracting the string following token=:
https://example.org/?s=download&token=0c27af59-72d1-0349-aa59-00000a8076d9
would translate to
Sys.setenv(RICU_AUMC_TOKEN = "0c27af59-72d1-0349-aa59-00000a8076d9")
If the dependencies outlined above are not fulfilled, download and archive extraction can be carried out manually into the corresponding folder and import_src() can be run.
Value
Called for side effects and returns NULL invisibly.
Examples
## Not run:
dir <- tempdir()
list.files(dir)

download_src("mimic_demo", data_dir = dir)
list.files(dir)

unlink(dir, recursive = TRUE)
## End(Not run)

Time series utility functions
Description
ICU data as handled by ricu is mostly comprised of time series data and as such, several utility functions are available for working with time series data, in addition to a class dedicated to representing time series data (see ts_tbl()). Some terminology to begin with: a time series is considered to have gaps if, per (combination of) ID variable value(s), some time steps are missing. Expanding and collapsing mean to change between representations where time steps are explicit or encoded as interval with start and end times. For sliding window-type operations, slide() means to iterate over time-windows, slide_index() means to iterate over certain time-windows, selected relative to the index, and hop() means to iterate over time-windows selected in absolute terms.
Usage
expand(
  x,
  start_var = index_var(x),
  end_var = NULL,
  step_size = time_step(x),
  new_index = start_var,
  keep_vars = NULL,
  aggregate = FALSE
)

collapse(
  x,
  id_vars = NULL,
  index_var = NULL,
  start_var = "start",
  end_var = "end",
  env = NULL,
  as_win_tbl = TRUE,
  ...
)

has_no_gaps(x)

has_gaps(...)

is_regular(x)

fill_gaps(x, limits = collapse(x), start_var = "start", end_var = "end")

remove_gaps(x)

slide(x, expr, before, after = hours(0L), ...)

slide_index(x, expr, index, before, after = hours(0L), ...)

hop(
  x,
  expr,
  windows,
  full_window = FALSE,
  lwr_col = "min_time",
  upr_col = "max_time",
  left_closed = TRUE,
  right_closed = TRUE,
  eval_env = NULL,
  ...
)

Arguments
x | ts_tbl object to operate on |
start_var,end_var | Name of the columns that represent lower and upperwindows bounds |
step_size | Controls the step size used to interpolate between start_var and end_var |
new_index | Name of the new index column |
keep_vars | Names of the columns to hold onto |
aggregate | Function for aggregating values in overlapping intervals |
id_vars,index_var | ID and index variables |
env | Environment used as parent to the environment used to evaluate expressions passed as ... |
as_win_tbl | Logical flag indicating whether to return a win_tbl |
... | Passed to |
limits | A table with columns for lower and upper window bounds or a length 2 difftime vector |
expr | Expression (quoted) to be evaluated over each window |
before,after | Time span to look back/forward |
index | A vector of times around which windows are spanned (relativeto the index) |
windows | An id_tbl object defining, per ID, the windows over which expr is evaluated |
full_window | Logical flag controlling how the situation is handledwhere the sliding window extends beyond available data |
lwr_col,upr_col | Names of columns (in |
left_closed,right_closed | Logical flag indicating whether intervalsare closed (default) or open. |
eval_env | Environment in which expr is evaluated |
Details
A gap in a ts_tbl object is a missing time step, i.e. a missing entry in the sequence seq(min(index), max(index), by = interval) in at least one group (as defined by id_vars(), where the extrema are calculated per group). In this case, has_gaps() will return TRUE. The function is_regular() checks whether the time series has no gaps, in addition to the object being sorted and unique (see is_sorted() and is_unique()). In order to transform a time series containing gaps into a regular time series, fill_gaps() will fill missing time steps with NA values in all data_vars() columns, while remove_gaps() provides the inverse operation of removing time steps that consist of NA values in data_vars() columns.
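The gap-related functions described above could be used as follows (a sketch, assuming ricu is attached; the column names id, time and val are made up for illustration):

```r
# A ts_tbl with a missing time step (hour 3) for the single patient ID 1;
# the interval is auto-detected as one hour
tbl <- ts_tbl(id = c(1, 1, 1), time = hours(c(1, 2, 4)),
              val = c(0.4, 0.5, 0.7), id_vars = "id")

has_gaps(tbl)          # the hour-3 entry is missing
reg <- fill_gaps(tbl)  # inserts an NA-valued row at hour 3
is_regular(reg)
remove_gaps(reg)       # drops the NA-only row again
```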
An expand() operation performed on an object inheriting from data.table yields a ts_tbl where time-steps encoded by columns start_var and end_var are made explicit, with values in keep_vars being appropriately repeated. The inverse operation is available as collapse(), which groups by id_vars, represents index_var as group-wise extrema in two new columns start_var and end_var and allows for further data summary using .... An aspect to keep in mind when applying expand() to a win_tbl object is that values simply are repeated for all time-steps that fall into a given validity interval. This gives correct results when a win_tbl for example contains data on infusions as rates, but might not lead to correct results when infusions are represented as drug amounts administered over a given time-span. In such a scenario it might be desirable to evenly distribute the total amount over the corresponding time steps (currently not implemented).
Sliding-window type operations are available as slide(), slide_index() and hop() (function naming is inspired by the CRAN package slider). The most flexible of the three, hop() takes as input a ts_tbl object x containing the data, an id_tbl object windows, containing for each ID the desired windows represented by two columns lwr_col and upr_col, as well as an expression expr to be evaluated per window. At the other end of the spectrum, slide() spans windows for every ID and available time-step using the arguments before and after, while slide_index() can be seen as a compromise between the two, where windows are spanned for certain time-points, specified by index.
Value
Most functions return ts_tbl objects, with the exception of has_gaps()/has_no_gaps()/is_regular(), which return logical flags.
Examples
if (FALSE) {

tbl <- ts_tbl(x = 1:5, y = hours(1:5), z = hours(2:6), val = rnorm(5),
              index_var = "y")
exp <- expand(tbl, "y", "z", step_size = 1L, new_index = "y",
              keep_vars = c("x", "val"))
col <- collapse(exp, start_var = "y", end_var = "z", val = unique(val))
all.equal(tbl, col, check.attributes = FALSE)

tbl <- ts_tbl(x = rep(1:5, 1:5), y = hours(sequence(1:5)), z = 1:15)
win <- id_tbl(x = c(3, 4), a = hours(c(2, 1)), b = hours(c(3, 4)))
hop(tbl, list(z = sum(z)), win, lwr_col = "a", upr_col = "b")
slide_index(tbl, list(z = sum(z)), hours(c(4, 5)), before = hours(2))
slide(tbl, list(z = sum(z)), before = hours(2))

tbl <- ts_tbl(x = rep(3:4, 3:4), y = hours(sequence(3:4)), z = 1:7)
has_no_gaps(tbl)
is_regular(tbl)

tbl[1, 2] <- hours(2)
has_no_gaps(tbl)
is_regular(tbl)

tbl[6, 2] <- hours(2)
has_no_gaps(tbl)
is_regular(tbl)
}

Data loading utilities
Description
Two important tools for smoothing out differences among used datasets are id_origin(), which returns origin times for a given ID, and id_map(), which returns a mapping between two ID systems alongside start and end columns of the target ID system relative to the source ID system. As both these functions are called frequently during data loading and might involve somewhat expensive operations, both rely on internal helper functions (id_orig_helper() and id_map_helper()) which perform the heavy lifting, while the exported functions wrap those helpers, providing a memoization layer. When adding a new data source, a class specific implementation of the S3 generic function id_map_helper() might be required, as this is used during data loading using load_id() and load_ts() via change_id().
Usage
id_origin(x, id, origin_name = NULL, copy = TRUE)

id_orig_helper(x, id)

## S3 method for class 'src_env'
id_orig_helper(x, id)

## S3 method for class 'miiv_env'
id_orig_helper(x, id)

id_windows(x, copy = TRUE)

id_win_helper(x)

## S3 method for class 'mimic_env'
id_win_helper(x)

## S3 method for class 'eicu_env'
id_win_helper(x)

## S3 method for class 'sic_env'
id_win_helper(x)

## S3 method for class 'hirid_env'
id_win_helper(x)

## S3 method for class 'aumc_env'
id_win_helper(x)

## S3 method for class 'miiv_env'
id_win_helper(x)

id_map(x, id_var, win_var, in_time = NULL, out_time = NULL)

id_map_helper(x, id_var, win_var)

## S3 method for class 'src_env'
id_map_helper(x, id_var, win_var)

Arguments
x | Object identifying the ID system (passed to as_src_env()) |
id | ID name for which to return origin times |
origin_name | String-valued name which will be used to label the origincolumn |
copy | Logical flag indicating whether to return a copy of the memoized result |
id_var | Type of ID all returned times are relative to |
win_var | Type of ID for which the in/out times is returned |
in_time,out_time | Column names of the returned in/out times |
Details
For the internal datasets, id_map_helper() relies on yet another S3 generic function, id_windows(), which provides a table containing all available ID systems, as well as all ID windows for a given data source. As for the other two functions, the same helper-function approach is in place, with the data loading function id_win_helper(). The function id_map_helper() is then implemented in a data source agnostic manner (dispatching on the src_env class), providing subsetting of this larger ID map table and ensuring timestamps are relative to the correct ID system. For adding a new data source however, this layer can be forgone. Similarly for id_origin(): this is used for the internal datasets in load_difftime(). An implementation of load_difftime(), specific to a new data source, can be provided that does not rely on id_windows(), making this function irrelevant for this specific dataset.
Value
- id_origin()/id_orig_helper(): an id_tbl with admission time stamps corresponding to the selected ID
- id_windows()/id_win_helper(): an id_tbl holding all IDs and their respective start and end times
- id_map()/id_map_helper(): an id_tbl containing the selected IDs and, depending on values passed as in_time and out_time, start and end times of the ID passed as win_var
Tabular ICU data classes
Description
In order to simplify handling of tabular ICU data, ricu provides S3 classes, id_tbl, ts_tbl, and win_tbl. These classes essentially consist of a data.table object, alongside some meta data, and S3 dispatch is used to enable more natural behavior for some data manipulation tasks. For example, when merging two tables, a default for the by argument can be chosen more sensibly if columns representing patient ID and timestamp information can be identified.
Usage
id_tbl(..., id_vars = 1L)

is_id_tbl(x)

as_id_tbl(x, id_vars = NULL, by_ref = FALSE)

ts_tbl(..., id_vars = 1L, index_var = NULL, interval = NULL)

is_ts_tbl(x)

as_ts_tbl(x, id_vars = NULL, index_var = NULL, interval = NULL, by_ref = FALSE)

win_tbl(..., id_vars = NULL, index_var = NULL, interval = NULL, dur_var = NULL)

is_win_tbl(x)

as_win_tbl(
  x,
  id_vars = NULL,
  index_var = NULL,
  interval = NULL,
  dur_var = NULL,
  by_ref = FALSE
)

## S3 method for class 'id_tbl'
as.data.table(x, keep.rownames = FALSE, by_ref = FALSE, ...)

## S3 method for class 'id_tbl'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)

validate_tbl(x)

Arguments
... | Forwarded to data.table::data.table() |
id_vars | Column name(s) to be used as id_vars |
x | Object to query/operate on |
by_ref | Logical flag indicating whether to perform the operation byreference |
index_var | Column name of the index column |
interval | Time series interval length specified as scalar-valued difftime object |
dur_var | Column name of the duration column |
keep.rownames | Default is FALSE. |
row.names | NULL or a character vector giving the row names for the data frame. |
optional | logical. If TRUE, setting row names and converting column names (to syntactic names) is optional. |
Details
The introduced classes are designed for several often encountered datascenarios:
- id_tbl objects can be used to represent static (with respect to relevant time scales) patient data such as patient age and such an object is simply a data.table combined with a non-zero length character vector valued attribute marking the columns tracking patient ID information (id_vars). All further columns are considered as data_vars.
- ts_tbl objects are used for grouped time series data. A data.table object again is augmented by attributes, including a non-zero length character vector identifying patient ID columns (id_vars), a string tracking the column holding time-stamps (index_var) and a scalar difftime object determining the time-series step size interval. Again, all further columns are treated as data_vars.
- win_tbl: In addition to representing grouped time-series data as does a ts_tbl, win_tbl objects also encode a validity interval for each time-stamped measurement (as dur_var). This can for example be useful when a drug is administered at a certain infusion rate for a given time period.
Owing to the nested structure of required meta data, ts_tbl inherits from id_tbl and win_tbl from ts_tbl. Furthermore, both classes inherit from data.table. As such, data.table reference semantics are available for some operations, indicated by presence of a by_ref argument. By default, by_ref is set to FALSE as this is in line with base R behavior, at the cost of potentially incurring unnecessary data copies. Some care has to be taken when passing by_ref = TRUE and enabling by reference operations as this can have side effects (see examples).
For instantiating ts_tbl objects, both index_var and interval can be automatically determined if not specified. For the index column, the only requirement is that a single difftime column is present, while for the time step, the minimal difference between two consecutive observations is chosen (and all differences are therefore required to be multiples of the minimum difference). Similarly, for a win_tbl, exactly two difftime columns are required, where the first is assumed to correspond to the index_var and the second to the dur_var.
Upon instantiation, the data might be rearranged: columns are reordered such that ID columns are moved to the front, followed by the index column, and a data.table::key() is set on meta columns, causing rows to be sorted accordingly. Moving meta columns to the front is done for reasons of convenience for printing, while setting a key on meta columns is done to improve efficiency of subsequent transformations such as merging or grouped operations. Furthermore, NA values in either ID or index columns are not allowed and therefore corresponding rows are silently removed.
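Following the auto-detection rules outlined above, a win_tbl could be instantiated from a table with exactly two difftime columns, the first serving as index and the second as duration (a sketch, assuming ricu is attached; all column names are chosen for illustration):

```r
# 'start' (first difftime column) becomes the index_var,
# 'dur' (second difftime column) the dur_var
tbl <- win_tbl(
  id    = c(1, 1, 2),
  start = hours(c(1, 4, 2)),
  dur   = mins(c(30, 60, 90)),
  rate  = c(0.1, 0.3, 0.2),
  id_vars = "id"
)

index_var(tbl)  # "start"
dur_var(tbl)    # "dur"
```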
Coercion between id_tbl and ts_tbl (and win_tbl) by default keeps intersecting attributes fixed and new attributes are by default inferred as for class instantiation. Each class comes with a class-specific implementation of the S3 generic function validate_tbl() which returns TRUE if the object is considered valid, or a string outlining the type of validation failure that was encountered. Validity requires
- inheriting from data.table and unique column names
- for id_tbl, that all columns specified by the non-zero length character vector holding onto the id_vars specification are available
- for ts_tbl, that the string-valued index_var column is available and does not intersect with id_vars and that the index column obeys the specified interval
- for win_tbl, that the string-valued dur_var corresponds to a difftime vector and is not among the columns marked as index or ID variables
Finally, inheritance can be checked by calling is_id_tbl() and is_ts_tbl(). Note that due to ts_tbl inheriting from id_tbl, is_id_tbl() returns TRUE for both id_tbl and ts_tbl objects (and similarly for win_tbl), while is_ts_tbl() only returns TRUE for ts_tbl objects.
Value
Constructors id_tbl()/ts_tbl()/win_tbl(), as well as coercion functions as_id_tbl()/as_ts_tbl()/as_win_tbl() return id_tbl/ts_tbl/win_tbl objects respectively, while inheritance testers is_id_tbl()/is_ts_tbl()/is_win_tbl() return logical flags and validate_tbl() returns either TRUE or a string describing the validation failure.
Relationship todata.table
Both id_tbl and ts_tbl inherit from data.table and as such, functions intended for use with data.table objects can be applied to id_tbl and ts_tbl as well. But there are some caveats: many functions introduced by data.table are not S3 generic and therefore they would have to be masked in order to retain control over how they operate on objects inheriting from data.table. Take for example the function data.table::setnames(), which changes column names by reference. Using this function, the name of an index column of an id_tbl object can be changed without updating the attribute marking the column as such, thus leaving the object in an inconsistent state. Instead of masking the function setnames(), an alternative is provided as rename_cols(). In places where it is possible, the appropriate function (such as base::names<-() or base::colnames<-()) is seamlessly inserted, and the responsibility for not using data.table::setnames() in a way that breaks the id_tbl object is left to the user.
Owing to data.table heritage, one of the functions that is often called on id_tbl and ts_tbl objects is the base S3 generic base::`[`(). As this function is capable of modifying the object in a way that makes it incompatible with attached meta data, an attempt is made at preserving as much as possible and if all fails, a data.table object is returned instead of an object inheriting from id_tbl. If for example the index column is removed (or modified in a way that makes it incompatible with the interval specification) from a ts_tbl, an id_tbl is returned. If however the ID column is removed, the only sensible thing to return is a data.table (see examples).
Examples
tbl <- id_tbl(a = 1:10, b = rnorm(10))
is_id_tbl(tbl)
is_ts_tbl(tbl)

dat <- data.frame(a = 1:10, b = hours(1:10), c = rnorm(10))
tbl <- as_ts_tbl(dat, "a")
is_id_tbl(tbl)
is_ts_tbl(tbl)

tmp <- as_id_tbl(tbl)
is_ts_tbl(tbl)
is_ts_tbl(tmp)

tmp <- as_id_tbl(tbl, by_ref = TRUE)
is_ts_tbl(tbl)
is_ts_tbl(tmp)

tbl <- id_tbl(a = 1:10, b = rnorm(10))
names(tbl) <- c("c", "b")
tbl

tbl <- id_tbl(a = 1:10, b = rnorm(10))
validate_tbl(data.table::setnames(tbl, c("c", "b")))

tbl <- id_tbl(a = 1:10, b = rnorm(10))
validate_tbl(rename_cols(tbl, c("c", "b")))

tbl <- ts_tbl(a = rep(1:2, each = 5), b = hours(rep(1:5, 2)), c = rnorm(10))
tbl[, c("a", "c"), with = FALSE]
tbl[, c("b", "c"), with = FALSE]
tbl[, list(a, b = as.double(b), c)]

ICU class meta data utilities
Description
The two data classes id_tbl and ts_tbl, used by ricu to represent ICU patient data, consist of a data.table alongside some meta data. This includes marking columns that have special meaning and, for data representing measurements ordered in time, the step size. The following utility functions can be used to extract columns and column names with special meaning, as well as query a ts_tbl object regarding its time series related meta data.
Usage
id_vars(x)

id_var(x)

id_col(x)

index_var(x)

index_col(x)

dur_var(x)

dur_col(x)

dur_unit(x)

meta_vars(x)

data_vars(x)

data_var(x)

data_col(x)

interval(x)

time_unit(x)

time_step(x)

time_vars(x)

Arguments
x | Object to query |
Details
The following functions can be used to query an object for columns orcolumn names that represent a distinct aspect of the data:
- id_vars(): ID variables are one or more column names with the interaction of corresponding columns identifying a grouping of the data. Most commonly this is some sort of patient identifier.
- id_var(): This function either fails or returns a string and can therefore be used in case only a single column provides grouping information.
- id_col(): Again, in case only a single column provides grouping information, this column can be extracted using this function.
- index_var(): Suitable for use as index variable is a column that encodes a temporal ordering of observations as difftime vector. Only a single column can be marked as index variable and this function queries a ts_tbl object for its name.
- index_col(): Similarly to id_col(), this function extracts the column with the given designation. As a ts_tbl object is required to have exactly one column marked as index, this function always returns for ts_tbl objects (and fails for id_tbl objects).
- dur_var(): For win_tbl objects, this returns the name of the column encoding the data validity interval.
- dur_col(): Similarly to index_col(), this returns the difftime vector corresponding to the dur_var().
- meta_vars(): For ts_tbl objects, meta variables represent the union of ID and index variables (for win_tbl, this also includes the dur_var()), while for id_tbl objects meta variables consist of ID variables.
- data_vars(): Data variables on the other hand are all columns that are not meta variables.
- data_var(): Similarly to id_var(), this function either returns the name of a single data variable or fails.
- data_col(): Building on data_var(), in situations where only a single data variable is present, it is returned, or if multiple data columns exist, an error is thrown.
- time_vars(): Time variables are all columns in an object inheriting from data.frame that are of type difftime. Therefore in a ts_tbl object the index column is one of (potentially) several time variables. For a win_tbl, however, the dur_var() is not among the time_vars().
- interval(): The time series interval length is represented by a scalar-valued difftime object.
- time_unit(): The time unit of the time series interval, represented by a string such as "hours" or "mins" (see difftime).
- time_step(): The time series step size represented by a numeric value in the unit as returned by time_unit().
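As an illustration of the win_tbl-specific behavior described above (a sketch, assuming ricu is attached; column names are chosen for illustration):

```r
# A win_tbl with index column 'start' and duration column 'dur'
tbl <- win_tbl(id = 1:3, start = hours(1:3), dur = mins(c(30, 60, 90)),
               val = rnorm(3), id_vars = "id")

meta_vars(tbl)  # ID, index and duration variables
data_vars(tbl)  # all remaining columns, here "val"
dur_unit(tbl)   # time unit of the dur_var column
time_vars(tbl)  # difftime columns; per the above, excludes the dur_var
```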
Value
Mostly column names as character vectors, in case of id_var(), index_var(), data_var() and time_unit() of length 1, else of variable length. Functions id_col(), index_col() and data_col() return table columns as vectors, while interval() returns a scalar valued difftime object and time_step() a number.
Examples
tbl <- id_tbl(a = rep(1:2, each = 5), b = rep(1:5, 2), c = rnorm(10),
              id_vars = c("a", "b"))
id_vars(tbl)
tryCatch(id_col(tbl), error = function(...) "no luck")
data_vars(tbl)
data_col(tbl)

tmp <- as_id_tbl(tbl, id_vars = "a")
id_vars(tmp)
id_col(tmp)

tbl <- ts_tbl(a = rep(1:2, each = 5), b = hours(rep(1:5, 2)), c = rnorm(10))
index_var(tbl)
index_col(tbl)

identical(index_var(tbl), time_vars(tbl))

interval(tbl)
time_unit(tbl)
time_step(tbl)

Data import utilities
Description
Making a dataset available to ricu consists of 3 steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading loads the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently and attaching sets up the data for use by the package.
Usage
import_src(x, ...)

## S3 method for class 'src_cfg'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  ...
)

## S3 method for class 'aumc_cfg'
import_src(x, ...)

## S3 method for class 'character'
import_src(
  x,
  data_dir = src_data_dir(x),
  tables = NULL,
  force = FALSE,
  verbose = TRUE,
  cleanup = FALSE,
  ...
)

import_tbl(x, ...)

## S3 method for class 'tbl_cfg'
import_tbl(
  x,
  data_dir = src_data_dir(x),
  progress = NULL,
  cleanup = FALSE,
  ...
)

Arguments
x | Object specifying the source configuration |
... | Passed to downstream methods (finally to readr::read_csv/readr::read_csv_chunked) / generic consistency |
data_dir | Destination directory where the downloaded data is written to. |
tables | Character vector specifying the tables to import. If NULL, all available tables are used. |
force | Logical flag; if TRUE, existing data is re-imported and overwritten |
verbose | Logical flag indicating whether to print progress information |
cleanup | Logical flag indicating whether to remove raw csv files afterconversion to fst |
progress | Either NULL or a progress bar object as created by progress::progress_bar |
Details
In order to speed up data access operations, ricu does not directly use the PhysioNet provided CSV files, but converts all data to fst::fst() format, which allows for random row and column access. Large tables are split into chunks in order to keep memory requirements reasonably low.
The one-time step per dataset of data import is fairly resource intensive:depending on CPU and available storage system, it will take on the order ofan hour to run to completion and depending on the dataset, somewherebetween 50 GB and 75 GB of temporary disk space are required as tables areuncompressed, in case of partitioned data, rows are reordered and the dataagain is saved to a storage efficient format.
The S3 generic function import_src() performs import of an entire data source, internally calling the S3 generic function import_tbl() in order to perform import of individual tables. Method dispatch is intended to occur on objects inheriting from src_cfg and tbl_cfg respectively. Such objects can be generated from JSON based configuration files which contain information such as table names, column types or row numbers, in order to provide safety in parsing of .csv files. For more information on data source configuration, refer to load_src_cfg().
Current import capabilities include re-saving a .csv file to .fst at once (used for smaller sized tables), reading a large .csv file using the readr::read_csv_chunked() API, while partitioning chunks and reassembling sub-partitions (used for splitting a large file into partitions), as well as re-partitioning an already partitioned table according to a new partitioning scheme. Care has been taken to keep the maximal memory requirements for this reasonably low, such that data import is feasible on laptop class hardware.
Value
Called for side effects and returns NULL invisibly.
Examples
## Not run:
dir <- tempdir()
list.files(dir)

download_src("mimic_demo", dir)
list.files(dir)

import_src("mimic_demo", dir)
list.files(dir)

unlink(dir, recursive = TRUE)
## End(Not run)

Load concept data
Description
Concept objects are used in ricu as a way to specify how a clinical concept, such as heart rate, can be loaded from a data source. Building on this abstraction, load_concepts() powers concise loading of data with data source specific preprocessing hidden away from the user, thereby providing a data source agnostic interface to data loading. At the default value of the argument merge_data, a tabular data structure (either a ts_tbl or an id_tbl, depending on what kind of concepts are requested), inheriting from data.table, is returned, representing the data in wide format (i.e. returning concepts as columns).
Usage
load_concepts(x, ...)

## S3 method for class 'character'
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)

## S3 method for class 'integer'
load_concepts(
  x,
  src = NULL,
  concepts = NULL,
  ...,
  dict_name = "concept-dict",
  dict_dirs = NULL
)

## S3 method for class 'numeric'
load_concepts(x, ...)

## S3 method for class 'concept'
load_concepts(
  x,
  src = NULL,
  aggregate = NULL,
  merge_data = TRUE,
  verbose = TRUE,
  ...
)

## S3 method for class 'cncpt'
load_concepts(x, aggregate = NULL, ..., progress = NULL)

## S3 method for class 'num_cncpt'
load_concepts(x, aggregate = NULL, ..., progress = NULL)

## S3 method for class 'unt_cncpt'
load_concepts(x, aggregate = NULL, ..., progress = NULL)

## S3 method for class 'fct_cncpt'
load_concepts(x, aggregate = NULL, ..., progress = NULL)

## S3 method for class 'lgl_cncpt'
load_concepts(x, aggregate = NULL, ..., progress = NULL)

## S3 method for class 'rec_cncpt'
load_concepts(
  x,
  aggregate = NULL,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...,
  progress = NULL
)

## S3 method for class 'item'
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  progress = NULL,
  ...
)

## S3 method for class 'itm'
load_concepts(
  x,
  patient_ids = NULL,
  id_type = "icustay",
  interval = hours(1L),
  ...
)

Arguments
x | Object specifying the data to be loaded |
... | Passed to downstream methods |
src | A character vector, used to subset the concepts |
concepts | The concepts to be used, or NULL, in which case the full concept dictionary is loaded |
dict_name,dict_dirs | In case no concepts are passed as x, these parameters are used to load a concept dictionary |
aggregate | Controls how data within concepts is aggregated |
merge_data | Logical flag, specifying whether to merge concepts intowide format or return a list, each entry corresponding to a concept |
verbose | Logical flag for muting informational output |
progress | Either NULL or a progress bar object as created by progress::progress_bar |
patient_ids | Optional vector of patient ids to subset the fetched data with |
id_type | String specifying the patient id type to return |
interval | The time interval used to discretize time stamps with, specified as scalar-valued difftime object |
Details
In order to allow for a large degree of flexibility (and extensibility), which is much needed owing to considerable heterogeneity presented by different data sources, several nested S3 classes are involved in representing a concept and load_concepts() follows this hierarchy of classes recursively when resolving a concept. An outline of this hierarchy can be described as
- concept: contains many cncpt objects (of potentially differing sub-types), each comprising of some meta-data and an item object
- item: contains many itm objects (of potentially differing sub-types), each encoding how to retrieve a data item.
The design choice for wrapping a vector of cncpt objects with a container class concept is motivated by the requirement of having several different sub-types of cncpt objects (all inheriting from the parent type cncpt), while retaining control over how this homogeneous w.r.t. parent type, but heterogeneous w.r.t. sub-type vector of objects behaves in terms of S3 generic functions.
Value
An id_tbl/ts_tbl or a list thereof, depending on loaded concepts and the value passed as merge_data.
Concept
Top-level entry points are either a character vector of concept names or an integer vector of concept IDs (matched against omopid fields), which are used to subset a concept object or an entire concept dictionary, or a concept object. When passing a character/integer vector as first argument, the most important further arguments at that level control from where the dictionary is taken (dict_name or dict_dirs). At concept level, the most important additional arguments control the result structure: data merging can be disabled using merge_data and data aggregation is governed by the aggregate argument.
Data aggregation is important for merging several concepts into a wide-format table, as this requires data to be unique per observation (i.e. by either ID or combination of ID and index). Several value types are acceptable as aggregate argument, the most important being FALSE, which disables aggregation, NULL, which auto-determines a suitable aggregation function, or a string which is ultimately passed to dt_gforce() where it identifies a function such as sum(), mean(), min() or max(). More information on aggregation is available as aggregate(). If the object passed as aggregate is scalar, it is applied to all requested concepts in the same way. In order to customize aggregation per concept, a named object (with names corresponding to concepts) of the same length as the number of requested concepts may be passed.
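For example, per-concept aggregation could be requested by passing a named character vector (a sketch; assumes the mimic_demo source is attached and that the concepts hr and map are available in the default dictionary):

```r
# Aggregate heart rate by max and mean arterial pressure by min,
# per ID and time step
load_concepts(c("hr", "map"), "mimic_demo",
              aggregate = c(hr = "max", map = "min"))

# A scalar value applies to all requested concepts alike
load_concepts(c("hr", "map"), "mimic_demo", aggregate = "median")
```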
Under the hood, a concept object comprises several cncpt objects with varying sub-types (for example num_cncpt, representing continuous numeric data, or fct_cncpt, representing categorical data). This implementation detail is of no further importance for understanding concept loading; for more information, please refer to the concept documentation. The only argument that is introduced at cncpt level is progress, which controls progress reporting. If called directly, the default value of NULL yields messages, sent to the terminal. Internally, if called from load_concepts() at concept level (with verbose set to TRUE), a progress::progress_bar is set up in a way that allows nested messages to be captured and not interrupt progress reporting (see msg_progress()).
Item
A single cncpt object contains an item object, which in turn is composed of several itm objects with varying sub-types, the relationship of item to itm being that of concept to cncpt, and the rationale for this implementation choice is the same as previously: a container class used to represent a vector of objects of varying sub-types, all inheriting from a common super-type. For more information on the item class, please refer to the relevant documentation. Arguments introduced at item level include patient_ids, id_type and interval. Acceptable values for interval are scalar-valued base::difftime() objects (see also helper functions such as hours()) and this argument essentially controls the time-resolution of the returned time series. Of course, the limiting factor is the raw time resolution, which is on the order of hours for data sets like MIMIC-III or eICU but can be much higher for a data set like HiRID. The argument id_type is used to specify what kind of ID system should be used to identify different time series in the returned data. A data set like MIMIC-III, for example, makes possible the resolution of data to 3 nested ID systems:
patient (subject_id): identifies a person

hadm (hadm_id): identifies a hospital admission (several of which are possible for a given person)

icustay (icustay_id): identifies an admission to an ICU and again has a one-to-many relationship to hadm.
Acceptable argument values are strings that match ID systems as specified by the data source configuration. Finally, patient_ids is used to define a patient cohort for which data can be requested. Values may either be a vector of IDs (which are assumed to be of the same type as specified by the id_type argument) or a tabular object inheriting from data.frame, which must contain a column named after the data set-specific ID system identifier (for MIMIC-III and an id_type argument of hadm, for example, that would be hadm_id).
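The two accepted shapes of patient_ids can be sketched as follows (the function `resolve_patient_ids` is hypothetical, written here only to illustrate the described behavior, not taken from ricu):

```r
# Hypothetical helper (not ricu code): normalize a `patient_ids` argument
# that may be either a plain ID vector or a data.frame-like cohort table
# containing a column named after the data set-specific ID identifier.
resolve_patient_ids <- function(patient_ids, id_col = "hadm_id") {
  if (inherits(patient_ids, "data.frame")) {
    stopifnot(id_col %in% colnames(patient_ids))
    unique(patient_ids[[id_col]])
  } else {
    unique(patient_ids)
  }
}

cohort <- data.frame(hadm_id = c(1L, 1L, 3L), age = c(70, 70, 55))
resolve_patient_ids(cohort)          # IDs taken from a tabular cohort
resolve_patient_ids(c(5L, 7L, 5L))   # IDs taken from a plain vector
```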
Extensions
The presented hierarchy of S3 classes is designed with extensibility in mind: while the current range of functionality covers settings encountered when dealing with the included concepts and datasets, further data sets and/or clinical concepts might necessitate different behavior for data loading. For this reason, various parts in the cascade of calls to load_concepts() can be adapted for new requirements by defining new sub-classes to cncpt or itm and providing methods for the generic function load_concepts() specific to these new classes. At cncpt level, method dispatch defaults to load_concepts.cncpt() if no method specific to the new class is provided, while at itm level, no default function is available.
Roughly speaking, the semantics for the two functions are as follows:
cncpt: Called with arguments x (the current cncpt object), aggregate (controlling how aggregation per time-point and ID is handled), ... (further arguments passed to downstream methods) and progress (controlling progress reporting), this function should be able to load and aggregate data for the given concept. Usually this involves extracting the item object and calling load_concepts() again, dispatching on the item class with arguments x (the given item), arguments passed as ..., as well as progress.

itm: Called with arguments x (the current object inheriting from itm), patient_ids (NULL or a patient ID selection), id_type (a string specifying what ID system to retrieve), and interval (the time series interval), this function actually carries out the loading of individual data items, using the specified ID system, rounding times to the correct interval and subsetting on patient IDs. As return value, an object of class as specified by the target entry is expected and all data_vars() should be named consistently, as data corresponding to multiple itm objects is concatenated in row-wise fashion as in base::rbind().
Examples
if (require(mimic.demo)) {
dat <- load_concepts("glu", "mimic_demo")

gluc <- concept("gluc",
  item("mimic_demo", "labevents", "itemid", list(c(50809L, 50931L)))
)

identical(load_concepts(gluc), dat)

class(dat)
class(load_concepts(c("sex", "age"), "mimic_demo"))
}
Load concept dictionaries
Description
Data concepts can be specified in JSON format as a concept dictionary, which can be read and parsed into concept/item objects. Dictionary loading can either be performed on the default included dictionary or on a user-specified custom dictionary. Furthermore, a mechanism is provided for adding concepts and/or data sources to the existing dictionary (see the Details section).
Usage
load_dictionary(
  src = NULL,
  concepts = NULL,
  name = "concept-dict",
  cfg_dirs = NULL
)

concept_availability(dict = NULL, include_rec = FALSE, ...)

explain_dictionary(
  dict = NULL,
  cols = c("name", "category", "description"),
  ...
)

Arguments
src |
|
concepts | A character vector used to subset the concept dictionary or |
name | Name of the dictionary to be read |
cfg_dirs | File name of the dictionary |
dict | A dictionary ( |
include_rec | Logical flag indicating whether to include |
... | Forwarded to |
cols | Columns to include in the output of |
Details
A default dictionary is provided at
system.file(file.path("extdata", "config", "concept-dict.json"), package = "ricu")

and can be loaded into an R session by calling get_config("concept-dict"). The default dictionary can be extended by adding a file concept-dict.json to the path specified by the environment variable RICU_CONFIG_PATH. New concepts can be added to this file and existing concepts can be extended (by adding new data sources). Alternatively, load_dictionary() can be called on non-default dictionaries using the file argument.
As an example of specifying a concept as a JSON object, the numeric concept for glucose is given by
{
  "glu": {
    "unit": "mg/dL",
    "min": 0,
    "max": 1000,
    "description": "glucose",
    "category": "chemistry",
    "sources": {
      "mimic_demo": [
        {
          "ids": [50809, 50931],
          "table": "labevents",
          "sub_var": "itemid"
        }
      ]
    }
  }
}

Using such a specification, constructors for cncpt and itm objects are called either using default arguments or as specified by the JSON object, with the above corresponding to a call like
concept(
  name = "glu",
  items = item(
    src = "mimic_demo", table = "labevents", sub_var = "itemid",
    ids = list(c(50809L, 50931L))
  ),
  description = "glucose",
  category = "chemistry",
  unit = "mg/dL",
  min = 0,
  max = 1000
)
The arguments src and concepts can be used to only load a subset of a dictionary by specifying a character vector of data sources and/or concept names.
A summary of item availability for a set of concepts can be created using concept_availability(). This produces a logical matrix with TRUE entries corresponding to concepts where, for the given data source, at least a single item has been defined. If data is loaded for a combination of concept and data source where the corresponding entry is FALSE, this will yield either a zero-row id_tbl object or an object inheriting from id_tbl where the column corresponding to the concept is NA throughout, depending on whether the concept was loaded alongside other concepts where data is available or not.
Whether to include rec_cncpt concepts in the overview produced by concept_availability() can be controlled via the logical flag include_rec. A recursive concept is considered available simply if all its building blocks are available. This can, however, lead to slightly confusing output, as a recursive concept might not strictly depend on one of its sub-concepts but handle such missingness by design. In such a scenario, the availability summary might report FALSE even though data can still be produced.
Value
A concept object containing several data concepts as cncpt objects.
Examples
if (require(mimic.demo)) {
head(load_dictionary("mimic_demo"))
load_dictionary("mimic_demo", c("glu", "lact"))
}
Load data as id_tbl or ts_tbl objects
Description
Building on functionality provided by load_src() and load_difftime(), load_id() and load_ts() load data from disk as id_tbl and ts_tbl objects respectively. Over load_difftime(), both load_id() and load_ts() provide a way to specify meta_vars() (as id_var and index_var arguments), as well as an interval size (as interval argument) for time series data.
Usage
load_id(x, ...)

## S3 method for class 'src_tbl'
load_id(
  x,
  rows,
  cols = colnames(x),
  id_var = id_vars(x),
  interval = hours(1L),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'character'
load_id(x, src, ...)

## S3 method for class 'itm'
load_id(
  x,
  cols = colnames(x),
  id_var = id_vars(x),
  interval = hours(1L),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'fun_itm'
load_id(x, ...)

## Default S3 method:
load_id(x, ...)

load_ts(x, ...)

## S3 method for class 'src_tbl'
load_ts(
  x,
  rows,
  cols = colnames(x),
  id_var = id_vars(x),
  index_var = ricu::index_var(x),
  interval = hours(1L),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'character'
load_ts(x, src, ...)

## S3 method for class 'itm'
load_ts(
  x,
  cols = colnames(x),
  id_var = id_vars(x),
  index_var = ricu::index_var(x),
  interval = hours(1L),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'fun_itm'
load_ts(x, ...)

## Default S3 method:
load_ts(x, ...)

load_win(x, ...)

## S3 method for class 'src_tbl'
load_win(
  x,
  rows,
  cols = colnames(x),
  id_var = id_vars(x),
  index_var = ricu::index_var(x),
  interval = hours(1L),
  dur_var = ricu::dur_var(x),
  dur_is_end = TRUE,
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'character'
load_win(x, src, ...)

## S3 method for class 'itm'
load_win(
  x,
  cols = colnames(x),
  id_var = id_vars(x),
  index_var = ricu::index_var(x),
  interval = hours(1L),
  dur_var = ricu::dur_var(x),
  dur_is_end = TRUE,
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'fun_itm'
load_win(x, ...)

## Default S3 method:
load_win(x, ...)

Arguments
x | Object for which to load data |
... | Generic consistency |
rows | Expression used for row subsetting (NSE) |
cols | Character vector of column names |
id_var | The column defining the id of |
interval | The time interval used to discretize time stamps with, specified as |
time_vars | Character vector enumerating the columns to be treated as timestamps and thus returned as |
src | Passed to |
index_var | The column defining the index of |
dur_var | The column used for determining durations |
dur_is_end | Logical flag indicating whether to use durations as-is or to calculate them by subtracting the |
Details
While for load_difftime() the ID variable can be suggested, the function only returns a best effort at fulfilling this request. In some cases, where the data does not allow for the desired ID type, data is returned using the ID system (among all available ones for the given table) with highest cardinality. Both load_id() and load_ts() are guaranteed to return an object with id_vars() set as requested by the id_var argument. Internally, the change of ID system is performed by change_id().
Additionally, while times returned by load_difftime() are in 1 minute resolution, the time series step size can be specified by the interval argument when calling load_id() or load_ts(). This rounding and potential change of time unit is performed by change_interval() on all columns specified by the time_vars argument. All time stamps are relative to the origin provided by the ID system. This means that for an id_var corresponding to hospital IDs, times are relative to hospital admission.
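The rounding step can be illustrated with a small self-contained sketch (the function `round_to_interval` is hypothetical and only mimics the described behavior of change_interval(); it is not ricu code, and the choice of rounding down is an assumption):

```r
# Hypothetical sketch (not ricu code): round difftime stamps down onto a
# grid defined by a scalar `interval`, as described for change_interval().
round_to_interval <- function(x, interval = as.difftime(1, units = "hours")) {
  stopifnot(inherits(x, "difftime"), inherits(interval, "difftime"))
  # express the interval in the units of x, then bin by flooring
  step  <- as.numeric(interval, units = units(x))
  x_num <- floor(as.numeric(x) / step) * step
  as.difftime(x_num, units = units(x))
}

times <- as.difftime(c(10, 75, 130), units = "mins")
round_to_interval(times)  # 0, 60 and 120 minutes on a 1-hour grid
```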
When load_id() (or load_ts()) is called on itm objects instead of src_tbl (or objects that can be coerced to src_tbl), the row-subsetting is performed according to the specification as provided by the itm object. Furthermore, at default settings, columns are returned as required by the itm object, and id_var (as well as index_var) are set accordingly if specified by the itm, or set to default values for the given src_tbl object if not explicitly specified.
Value
An id_tbl or a ts_tbl object.
Examples
if (require(mimic.demo)) {
load_id("admissions", "mimic_demo", cols = "admission_type")

dat <- load_ts(mimic_demo$labevents, itemid %in% c(50809L, 50931L),
               cols = c("itemid", "valuenum"))

glu <- new_itm(src = "mimic_demo", table = "labevents",
               sub_var = "itemid", ids = c(50809L, 50931L))

identical(load_ts(glu), dat)
}
Low level functions for loading data
Description
Data loading involves a cascade of S3 generic functions, which can individually be adapted to the specifics of individual data sources. At the lowest level, load_src() is called, followed by load_difftime(). Functions further up the chain are described in load_id().
Usage
load_src(x, ...)

## S3 method for class 'src_tbl'
load_src(x, rows, cols = colnames(x), ...)

## S3 method for class 'character'
load_src(x, src, ...)

load_difftime(x, ...)

## S3 method for class 'mimic_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'eicu_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'hirid_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'aumc_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'miiv_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'sic_tbl'
load_difftime(
  x,
  rows,
  cols = colnames(x),
  id_hint = id_vars(x),
  time_vars = ricu::time_vars(x),
  ...
)

## S3 method for class 'character'
load_difftime(x, src, ...)

Arguments
x | Object for which to load data |
... | Generic consistency |
rows | Expression used for row subsetting (NSE) |
cols | Character vector of column names |
src | Passed to |
id_hint | String valued id column selection (not necessarily honored) |
time_vars | Character vector enumerating the columns to be treated as timestamps and thus returned as |
Details
A function extending the S3 generic load_src() is expected to load a subset of rows/columns from a tabular data source. While the column specification is provided as character vector of column names, the row subsetting involves non-standard evaluation (NSE). Data sets that are included with ricu are represented by prt objects, which use rlang::eval_tidy() to evaluate NSE expressions. Furthermore, prt objects potentially represent tabular data split into partitions and row-subsetting expressions are evaluated per partition (see the part_safe flag in prt::subset.prt()). The return value of load_src() is expected to be of type data.table.
Timestamps are represented differently among the included data sources: while MIMIC-III and HiRID use absolute date/times, eICU provides temporal information as minutes relative to ICU admission. Other data sources, such as the ICU dataset provided by Amsterdam UMC, opt for relative times as well, though in milliseconds rather than minutes since admission. In order to smooth out such discrepancies, the next function in the data loading hierarchy is load_difftime(). This function is expected to call load_src() in order to load a subset of rows/columns from a table stored on disk and convert all columns that represent timestamps (as specified by the argument time_vars) into base::difftime() vectors using mins as time unit.
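For sources with absolute date/times, the conversion step can be sketched in base R as follows (the function `rel_minutes` is hypothetical, shown only to illustrate the described difftime conversion; it is not the package implementation):

```r
# Hypothetical sketch (not ricu code): convert absolute timestamps into
# minute-valued difftime vectors relative to a per-patient origin, such
# as hospital or ICU admission.
rel_minutes <- function(times, origin) {
  difftime(times, origin, units = "mins")
}

adm  <- as.POSIXct("2130-01-01 08:00:00", tz = "UTC")
meas <- as.POSIXct(c("2130-01-01 08:30:00", "2130-01-02 08:00:00"),
                   tz = "UTC")

rel_minutes(meas, adm)  # 30 and 1440 minutes after admission
```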
The returned object should be of type id_tbl, with the ID vars identifying the ID system the times are relative to. If, for example, all times are relative to ICU admission, the ICU stay ID should be returned as ID column. The argument id_hint may suggest an ID type, but if this ID is not available in the raw data, load_difftime() may return data using a different ID system. In MIMIC-III, for example, data in the labevents table is available for subject_id (patient ID) or hadm_id (hospital admission ID). If data is requested for icustay_id (ICU stay ID), this request cannot be fulfilled and data is returned using the ID system with the highest cardinality (among the available ones). Utilities such as change_id() can then be used later to resolve data to icustay_id.
Value
Adata.table object.
Examples
if (require(mimic.demo)) {
tbl <- mimic_demo$labevents
col <- c("charttime", "value")

load_src(tbl, itemid == 50809)

colnames(
  load_src("labevents", "mimic_demo", itemid == 50809, cols = col)
)

load_difftime(tbl, itemid == 50809)

colnames(
  load_difftime(tbl, itemid == 50809, col)
)

id_vars(
  load_difftime(tbl, itemid == 50809, id_hint = "icustay_id")
)

id_vars(
  load_difftime(tbl, itemid == 50809, id_hint = "subject_id")
)
}
Load configuration for a data source
Description
For a data source to be accessible by ricu, a configuration object inheriting from the S3 class src_cfg is required. Such objects can be generated from JSON based configuration files, using load_src_cfg(). Information encoded by this configuration object includes available ID systems (mainly for use in change_id()), default column names per table for columns with special meaning (such as index column, value columns, unit columns, etc.), as well as a specification used for initial setup of the dataset, which includes file names and column names alongside their data types.
Usage
load_src_cfg(src = NULL, name = "data-sources", cfg_dirs = NULL)

Arguments
src | (Optional) name(s) of data sources used for subsetting |
name | String valued name of a config file which will be looked up in the default config directories |
cfg_dirs | Additional directory/ies to look for configuration files |
Details
Configuration files are looked for as files name with added suffix .json, starting with the directory (or directories) supplied as cfg_dirs argument, followed by the directory specified by the environment variable RICU_CONFIG_PATH, and finally in extdata/config of the package install directory. If files with matching names are found in multiple places, they are concatenated such that in case of name clashes, the earlier hits take precedence over the later ones. The following JSON code blocks show excerpts of the config file available at
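The lookup order can be sketched as a small helper (`config_paths` below is hypothetical, written only to illustrate the described precedence; it is not part of ricu):

```r
# Hypothetical sketch (not ricu code): candidate config file paths in
# precedence order -- user-supplied dirs first, then RICU_CONFIG_PATH,
# then the package install directory. Earlier hits win on name clashes.
config_paths <- function(name, cfg_dirs = NULL) {
  dirs <- c(
    cfg_dirs,
    Sys.getenv("RICU_CONFIG_PATH", unset = NA),
    system.file("extdata", "config", package = "ricu")
  )
  # drop unset env vars and missing install dirs
  dirs <- dirs[!is.na(dirs) & nzchar(dirs)]
  file.path(dirs, paste0(name, ".json"))
}

config_paths("concept-dict", cfg_dirs = "/tmp/my-ricu-config")
</imports>
```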
system.file("extdata", "config", "data-sources.json", package = "ricu")

A data source configuration entry in a config file starts with a name, followed by optional entries class_prefix and further (variable) key-value pairs, such as a URL. For more information on class_prefix, please refer to the end of this section. Further entries include id_cfg and tables, which are explained in more detail below. As outline, this gives, for the data source mimic_demo, the following JSON object:
{
  "name": "mimic_demo",
  "class_prefix": ["mimic_demo", "mimic"],
  "url": "https://physionet.org/files/mimiciii-demo/1.4",
  "id_cfg": { ... },
  "tables": { ... }
}

The id_cfg entry is used to specify the available ID systems for a data source and how they relate to each other. An ID system within the context of ricu is a patient identifier, of which typically several are present in a data set. In MIMIC-III, for example, three ID systems are available: patient IDs (subject_id), hospital admission IDs (hadm_id) and ICU stay IDs (icustay_id). Furthermore, there is a one-to-many relationship between subject_id and hadm_id, as well as between hadm_id and icustay_id. Required for defining an ID system are a name, a position entry which orders the ID systems by their cardinality, a table entry, alongside column specifications id, start and end, which define how the IDs themselves, combined with start and end times, can be loaded from a table. This gives the following specification for the ICU stay ID system in MIMIC-III:
{
  "icustay": {
    "id": "icustay_id",
    "position": 3,
    "start": "intime",
    "end": "outtime",
    "table": "icustays"
  }
}

Tables are defined by a name and entries files, defaults, and cols, as well as optional entries num_rows and partitioning. As files entry, a character vector of file names is expected. For all of MIMIC-III, a single .csv file corresponds to a table, but for example for HiRID, some tables are distributed in partitions. The defaults entry consists of key-value pairs, identifying columns in a table with special meaning, such as the default index column or the set of all columns that represent timestamps. This gives as an example for a table entry for the chartevents table in MIMIC-III a JSON object like:
{
  "chartevents": {
    "files": "CHARTEVENTS.csv.gz",
    "defaults": {
      "index_var": "charttime",
      "val_var": "valuenum",
      "unit_var": "valueuom",
      "time_vars": ["charttime", "storetime"]
    },
    "num_rows": 330712483,
    "cols": { ... },
    "partitioning": {
      "col": "itemid",
      "breaks": [127, 210, 425, 549, 643, 741, 1483, 3458, 3695, 8440,
                 8553, 220274, 223921, 224085, 224859, 227629]
    }
  }
}

The optional num_rows entry is used when importing data (see import_src()) as a sanity check, which is not performed if this entry is missing from the data source configuration. The remaining table entry, partitioning, is optional in the sense that if it is missing, the table is not partitioned, and if it is present, the table will be partitioned accordingly when being imported (see import_src()). In order to specify a partitioning, two entries are required, col and breaks, where the former denotes a column and the latter a numeric vector which is used to construct intervals according to which col is binned. As such, currently col is required to be of numeric type. A partitioning entry as in the example above will assign rows corresponding to itemid 1 through 126 to partition 1, 127 through 209 to partition 2, and so on up to partition 17.
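The binning of a numeric column at the configured breaks can be sketched in base R (the helper `assign_partition` is made up for illustration and is not the package's import code):

```r
# Hypothetical sketch (not ricu code): assign rows to partitions by
# binning a numeric column at the configured breaks. Values below the
# first break map to partition 1, values in [breaks[1], breaks[2]) to
# partition 2, and so on.
assign_partition <- function(col, breaks) {
  findInterval(col, breaks) + 1L
}

breaks <- c(127, 210, 425)
assign_partition(c(1, 126, 127, 209, 500), breaks)
```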
Column specifications consist of a name and a spec entry, alongside a name which determines the column name that will be used by ricu. The spec entry is expected to be the name of a column specification function of the readr package (see readr::cols()) and all further entries in a cols object are used as arguments to the readr column specification. For the admissions table of MIMIC-III, the columns hadm_id and admittime are represented by:
{
  ...,
  "hadm_id": {
    "name": "HADM_ID",
    "spec": "col_integer"
  },
  "admittime": {
    "name": "ADMITTIME",
    "spec": "col_datetime",
    "format": "%Y-%m-%d %H:%M:%S"
  },
  ...
}

Internally, a src_cfg object consists of further S3 classes, which are instantiated when loading a JSON source configuration file. Functions for creating and manipulating src_cfg and related objects are marked internal, but a brief overview is given here nevertheless:
src_cfg: wraps objects id_cfg, col_cfg and optionally tbl_cfg

id_cfg: contains information on ID systems and is created from id_cfg entries in config files

col_cfg: contains column default settings represented by defaults entries in table configuration blocks

tbl_cfg: used when importing data and therefore encompasses information in files, num_rows and cols entries of table configuration blocks
A src_cfg can be instantiated without corresponding tbl_cfg but consequently cannot be used for data import (see import_src()). In that sense, table config entries files and cols are optional as well, with the restriction that the data source has to be already available in .fst format.
An example for such a slimmed down config file is available at
system.file("extdata", "config", "demo-sources.json", package = "ricu")

The class_prefix entry in a data source configuration is used to create sub-classes to src_cfg, id_cfg, col_cfg and tbl_cfg classes and is passed on to constructors of src_env (new_src_env()) and src_tbl (new_src_tbl()) objects. As an example, for the above class_prefix value of c("mimic_demo", "mimic"), the corresponding src_cfg will be assigned classes c("mimic_demo_cfg", "mimic_cfg", "src_cfg") and consequently the src_tbl objects will inherit from "mimic_demo_tbl", "mimic_tbl" and "src_tbl". This can be used to adapt the behavior of involved S3 generic functions to specifics of the different data sources. An example for this is how load_difftime() uses these sub-classes to smooth out different time-stamp representations. Furthermore, such a design was chosen with extensibility in mind. Currently, download_src() is designed around data sources hosted on PhysioNet, but in order to include a dataset external to PhysioNet, the download_src() generic can simply be extended for the new class.
Value
A list of data source configurations as src_cfg objects.
Examples
cfg <- load_src_cfg("mimic_demo")
str(cfg, max.level = 1L)

cfg <- cfg[["mimic_demo"]]
str(cfg, max.level = 1L)

cols <- as_col_cfg(cfg)

index_var(head(cols))
time_vars(head(cols))

as_id_cfg(cfg)
Utility functions
Description
Several utility functions exported for convenience.
Usage
min_or_na(x)

max_or_na(x)

is_val(x, val)

not_val(x, val)

is_true(x)

is_false(x)

last_elem(x)

first_elem(x)

Arguments
x | Object to use |
val | Value to compare against or to use as replacement |
Details
The two functions min_or_na() and max_or_na() overcome a design choice of base::min() (or base::max()) that can yield undesirable results: if called on a vector of all missing values with na.rm = TRUE, Inf (and -Inf, respectively) is returned. This is changed to returning a missing value of the same type as x.
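The documented behavior can be illustrated with a small re-implementation (`min_or_na_sketch` below is a hypothetical stand-in, not the package source):

```r
# Hypothetical re-implementation (not the package source) illustrating
# the documented all-NA behavior of min_or_na().
min_or_na_sketch <- function(x) {
  if (all(is.na(x))) {
    x[1L]                 # an NA of the same type as x
  } else {
    min(x, na.rm = TRUE)
  }
}

suppressWarnings(min(rep(NA_real_, 3), na.rm = TRUE))  # Inf
min_or_na_sketch(rep(NA_real_, 3))                     # NA_real_
min_or_na_sketch(c(NA, 2, 1))                          # 1
```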
The functions is_val() and not_val() (as well as, analogously, is_true() and is_false()) return logical vectors of the same length as the value passed as x, with non-base R semantics of comparing against NA: instead of returning c(NA, TRUE) for c(NA, 5) == 5, is_val() will return c(FALSE, TRUE). Passing NA as val might lead to unintended results, but no warning is thrown.
Finally, first_elem() and last_elem() have the same semantics as utils::head() and utils::tail() with n = 1L, and replace_na() will replace all occurrences of NA in x with val; it can be called both on objects inheriting from data.table (in which case data.table::setnafill() is called internally) and on other objects.
Value
min_or_na()/max_or_na(): scalar-valued extrema of a vector

is_val()/not_val()/is_true()/is_false(): logical vector of the same length as the object passed as x

first_elem()/last_elem(): single element of the object passed as x

replace_na(): modified version of the object passed as x
Examples
some_na <- c(NA, sample(1:10, 5), NA)
identical(min(some_na, na.rm = TRUE), min_or_na(some_na))

all_na <- rep(NA, 5)
min(all_na, na.rm = TRUE)
min_or_na(all_na)

is_val(some_na, 5)
some_na == 5

is_val(some_na, NA)

identical(first_elem(letters), head(letters, n = 1L))
identical(last_elem(letters), tail(letters, n = 1L))

replace_na(some_na, 11)
replace_na(all_na, 11)
replace_na(1:5, 11)

tbl <- ts_tbl(a = 1:10, b = hours(1:10), c = c(NA, 1:5, NA, 8:9, NA))
res <- replace_na(tbl, 0)
identical(tbl, res)
Message signaling nested with progress reporting
Description
In order to not interrupt progress reporting by a progress::progress_bar, messages are wrapped with class msg_progress, which causes them to be captured and printed after progress bar completion. This function is intended to be used when signaling messages in callback functions.
Usage
msg_progress(..., envir = parent.frame())

fmt_msg(msg, envir = parent.frame(), indent = 0L, exdent = 0L)

Arguments
... | Passed to |
envir | Environment in which objects from |
msg | String valued message |
indent,exdent | Vector valued and mapped to |
Value
Called for side effects and returns NULL invisibly.
Examples
msg_progress("Foo", "bar")

capt_fun <- function(x) {
  message("captured: ", conditionMessage(x))
}

tryCatch(msg_progress("Foo", "bar"), msg_progress = capt_fun)
Data Concepts
Description
Concept objects are used in ricu as a way to specify how a clinical concept, such as heart rate, can be loaded from a data source and are mainly consumed by load_concepts(). Several functions are available for constructing concept (and related auxiliary) objects either from code or by parsing a JSON formatted concept dictionary using load_dictionary().
Usage
new_cncpt(
  name,
  items,
  description = name,
  omopid = NA_integer_,
  category = NA_character_,
  aggregate = NULL,
  ...,
  target = "ts_tbl",
  class = "num_cncpt"
)

is_cncpt(x)

init_cncpt(x, ...)

## S3 method for class 'num_cncpt'
init_cncpt(x, unit = NULL, min = NULL, max = NULL, ...)

## S3 method for class 'unt_cncpt'
init_cncpt(x, unit = NULL, min = NULL, max = NULL, ...)

## S3 method for class 'fct_cncpt'
init_cncpt(x, levels, ...)

## S3 method for class 'cncpt'
init_cncpt(x, ...)

## S3 method for class 'rec_cncpt'
init_cncpt(
  x,
  callback = paste0("rename_data_var('", x[["name"]], "')"),
  interval = NULL,
  ...
)

new_concept(x)

concept(...)

is_concept(x)

as_concept(x)

Arguments
name | The name of the concept |
items | Zero or more |
description | String-valued concept description |
omopid | OMOP identifier |
category | String-valued category |
aggregate | NULL or a string denoting a function used to aggregate per id and, if applicable, per time step |
... | Further specification of the |
target | The target object yielded by loading |
class |
|
x | Object to query/dispatch on |
unit | A string, specifying the measurement unit of the concept (can be |
min,max | Scalar valued; defines a range of plausible values for anumeric concept |
levels | A vector of possible values a categorical concept may take on |
callback | Name of a function to be called on the returned data, used for data cleanup operations |
interval | Time interval used for data loading; if NULL, the respective interval passed as argument to |
Details
In order to allow for a large degree of flexibility (and extensibility), which is much needed owing to the considerable heterogeneity presented by different data sources, several nested S3 classes are involved in representing a concept. An outline of this hierarchy can be described as
concept: contains many cncpt objects (of potentially differing sub-types), each comprising some meta-data and an item object

item: contains many itm objects (of potentially differing sub-types), each encoding how to retrieve a data item.
The design choice for wrapping a vector of cncpt objects with a container class concept is motivated by the requirement of having several different sub-types of cncpt objects (all inheriting from the parent type cncpt), while retaining control over how this vector of objects (homogeneous w.r.t. parent type, but heterogeneous w.r.t. sub-type) behaves in terms of S3 generic functions.
Each individual cncpt object contains the following information: a string-valued name, an item vector containing itm objects, a string-valued description (can be missing), a string-valued category designation (can be missing), a character vector-valued specification for an aggregation function and a target class specification (e.g. id_tbl or ts_tbl). Additionally, a sub-class to cncpt has to be specified, each representing a different data scenario and holding further class-specific information. The following sub-classes to cncpt are available:
num_cncpt: The most widely used concept type is intended for concepts representing numerical measurements. Additional information that can be specified includes a string-valued unit specification, alongside a plausible range which can be used during data loading.

fct_cncpt: In case of categorical concepts, such as sex, a set of factor levels can be specified, against which the loaded data is checked.

lgl_cncpt: A special case of fct_cncpt, this allows only for logical values (TRUE, FALSE and NA).

rec_cncpt: More involved concepts, such as a SOFA score, can pull in other concepts. Recursive concepts can build on other recursive concepts up to arbitrary recursion depth. Owing to the more complicated nature of such concepts, a callback function can be specified which is used in data loading for concept-specific post-processing steps.

unt_cncpt: A recent (experimental) addition which inherits from num_cncpt but instead of manual unit conversion, leverages
Class instantiation is organized in the same fashion as for item objects: concept() maps vector-valued arguments to new_cncpt(), which internally calls the S3 generic function init_cncpt(), while new_concept() instantiates a concept object from a list of cncpt objects (created by calls to new_cncpt()). Coercion is only possible from list and cncpt, by calling as_concept(), and inheritance can be checked using is_concept() or is_cncpt().
Value
Constructors and coercion functions return cncpt and concept objects, while inheritance tester functions return logical flags.
Examples
if (require(mimic.demo)) {
gluc <- concept("glu",
                item("mimic_demo", "labevents", "itemid",
                     list(c(50809L, 50931L))),
                description = "glucose", category = "chemistry",
                unit = "mg/dL", min = 0, max = 1000)
is_concept(gluc)
identical(gluc, load_dictionary("mimic_demo", "glu"))
gl1 <- new_cncpt("glu",
                 item("mimic_demo", "labevents", "itemid",
                      list(c(50809L, 50931L))),
                 description = "glucose")
is_cncpt(gl1)
is_concept(gl1)
conc <- concept(c("glu", "lact"),
                list(
                  item("mimic_demo", "labevents", "itemid",
                       list(c(50809L, 50931L))),
                  item("mimic_demo", "labevents", "itemid", 50813L)
                ),
                description = c("glucose", "lactate"))
conc
identical(as_concept(gl1), conc[1L])
}

Data items
Description
Item objects are used in ricu as a way to specify how individual data items corresponding to clinical concepts (see also concept()), such as heart rate, can be loaded from a data source. Several functions are available for constructing item (and related auxiliary) objects either from code or by parsing a JSON formatted concept dictionary using load_dictionary().
Usage
new_itm(src, ..., interval = NULL, target = NA_character_, class = "sel_itm")

is_itm(x)

init_itm(x, ...)

## S3 method for class 'sel_itm'
init_itm(x, table, sub_var, ids, callback = "identity_callback", ...)

## S3 method for class 'hrd_itm'
init_itm(x, table, sub_var, ids, callback = "identity_callback", ...)

## S3 method for class 'col_itm'
init_itm(x, table, unit_val = NULL, callback = "identity_callback", ...)

## S3 method for class 'rgx_itm'
init_itm(x, table, sub_var, regex, callback = "identity_callback", ...)

## S3 method for class 'fun_itm'
init_itm(x, callback, ...)

## S3 method for class 'itm'
init_itm(x, ...)

new_item(x)

item(...)

as_item(x)

is_item(x)

Arguments
src | The data source name |
... | Further specification of the |
interval | A default data loading interval (either specified as scalar |
target | Item target class (e.g. "id_tbl"), |
class | Sub class for customizing |
x | Object to query/dispatch on |
table | Name of the table containing the data |
sub_var | Column name used for subsetting |
ids | Vector of ids used to subset table rows. If |
callback | Name of a function to be called on the returned data used for data cleanup operations (or a string that evaluates to a function) |
unit_val | String valued unit to be used in case no |
regex | String-valued regular expression which will be evaluated by |
Details
In order to allow for a large degree of flexibility (and extensibility), which is much needed owing to considerable heterogeneity presented by different data sources, several nested S3 classes are involved in representing a concept. An outline of this hierarchy can be described as
- concept: contains many cncpt objects (of potentially differing sub-types), each comprising some meta data and an item object
- item: contains many itm objects (of potentially differing sub-types), each encoding how to retrieve a data item.
The design choice of wrapping a vector of itm objects in a container class item is motivated by the requirement of having several different sub-types of itm objects (all inheriting from the parent type itm), while retaining control over how this vector of objects, homogeneous w.r.t. parent type but heterogeneous w.r.t. sub-type, behaves in terms of S3 generic functions.
The following sub-classes to itm are available, each representing a different data scenario:
- sel_itm: The most widely used item class is intended for the situation where rows of interest can be identified by looking for occurrences of a set of IDs (ids) in a column (sub_var). An example for this is heart rate hr on mimic, where the IDs 211 and 220045 are looked up in the itemid column of chartevents.
- col_itm: This item class can be used if no row-subsetting is required. An example for this is heart rate (hr) on eicu, where the table vitalperiodic contains an entire column dedicated to heart rate measurements.
- rgx_itm: As an alternative to the value-matching approach of sel_itm objects, this class identifies rows using regular expressions. Used for example for insulin on eicu, where the regular expression ^insulin (250.+)?\\(((ml|units)/hr)?\\)$ is matched against the drugname column of infusiondrug. The regular expression is evaluated by base::grepl() with ignore.case = TRUE.
- fun_itm: Intended for the scenario where data of interest is not directly available from a table, this itm class offers most flexibility. A function can be specified as callback and this function will be called with arguments x (the object itself), patient_ids, id_type and interval (see load_concepts()) and is expected to return an object as specified by the target entry.
- hrd_itm: A special case of sel_itm for HiRID data where measurement units are not available as a separate column, but as a separate table with units fixed per concept.
All itm objects have to specify a data source (src) as well as a sub-class. Further arguments then are specific to the respective sub-class and encode information that defines data loading, such as the table to query, the column name and values to use for identifying relevant rows, etc. The S3 generic function init_itm() is responsible for input validation of class-specific arguments, as well as class initialization. A list of itm objects, created by calls to new_itm(), can be passed to new_item() in order to instantiate an item object. An alternative constructor for item objects is given by item(), which calls new_itm() on the passed arguments (see examples). Finally, as_item() can be used for coercion of related objects such as list, concept, and the like. Several additional S3 generic functions exist for manipulation of item-like objects but are marked internal (see item/concept utilities).
Value
Constructors and coercion functions return itm and item objects, while inheritance tester functions return logical flags.
Examples
if (require(mimic.demo)) {
gluc <- item("mimic_demo", "labevents", "itemid",
             list(c(50809L, 50931L)), unit_var = TRUE,
             target = "ts_tbl")
is_item(gluc)
all.equal(gluc, as_item(load_dictionary("mimic_demo", "glu")))
hr1 <- new_itm(src = "mimic_demo", table = "chartevents",
               sub_var = "itemid", ids = c(211L, 220045L))
hr2 <- item(src = c("mimic_demo", "eicu_demo"),
            table = c("chartevents", "vitalperiodic"),
            sub_var = list("itemid", NULL),
            val_var = list(NULL, "heartrate"),
            ids = list(c(211L, 220045L), FALSE),
            class = c("sel_itm", "col_itm"))
hr3 <- new_itm(src = "eicu_demo", table = "vitalperiodic",
               val_var = "heartrate", class = "col_itm")
identical(as_item(hr1), hr2[1])
identical(new_item(list(hr1)), hr2[1])
identical(hr2, as_item(list(hr1, hr3)))
}

Internal utilities for working with data source configurations
Description
Data source configuration objects store information on data sources used throughout ricu. This includes URLs for data set downloading, column specifications used for data set importing, default values per table for important columns (such as index columns) when loading data, and how different patient identifiers used throughout a dataset relate to one another. Per dataset, a src_cfg object is created from a JSON file (see load_src_cfg()), consisting of several helper classes compartmentalizing the pieces of information outlined above. Alongside constructors for the various classes, several utilities, such as inheritance checks, coercion functions, as well as functions to extract pieces of information from these objects, are provided.
Usage
new_src_cfg(name, id_cfg, col_cfg, tbl_cfg, ..., class_prefix = name)

new_id_cfg(src, name, id, pos = seq_along(name), start = NULL,
           end = NULL, table = NULL, class_prefix = src)

new_col_cfg(src, table, ..., class_prefix = src)

new_tbl_cfg(src, table, files = NULL, cols = NULL, num_rows = NULL,
            partitioning = NULL, ..., class_prefix = src)

is_src_cfg(x)

as_src_cfg(x)

is_id_cfg(x)

as_id_cfg(x)

is_col_cfg(x)

as_col_cfg(x)

is_tbl_cfg(x)

as_tbl_cfg(x)

src_name(x)

tbl_name(x)

src_extra_cfg(x)

src_prefix(x)

src_url(x)

id_var_opts(x)

default_vars(x, type)

Arguments
name | Name of the data source |
id_cfg | An |
col_cfg | A list of |
tbl_cfg | A list of |
... | Further objects to add (such as a URL specification) |
class_prefix | A character vector of class prefixes that are added to the instantiated classes |
src | Data source name |
id,start,end | Name(s) of ID column(s), as well as respective start and end timestamps |
pos | Integer valued position, ordering IDs by their cardinality |
table | Table name |
cols | List containing a list per column, each holding string valued entries |
num_rows | A count indicating the expected number of rows |
partitioning | A table partitioning is defined by a column name and avector of numeric values that are passed as |
x | Object to coerce/query |
Details
The following classes are used to represent data source configurationobjects:
- src_cfg: wraps objects id_cfg, col_cfg and optionally tbl_cfg
- id_cfg: contains information on ID systems and is created from id_cfg entries in config files
- col_cfg: contains column default settings represented by defaults entries in table configuration blocks
- tbl_cfg: used when importing data and therefore encompasses information in files, num_rows and cols entries of table configuration blocks
Represented by a col_cfg, a table can have some of its columns marked as default columns for the following concepts, and further column meanings can be specified via ...:
- id_col: column will be used as id for icu_tbl objects
- index_col: column represents a timestamp variable and will be used as such for ts_tbl objects
- val_col: column contains the measured variable of interest
- unit_col: column specifies the unit of measurement in the corresponding val_col
Alongside constructors (new_*()), inheritance checking functions (is_*()), as well as coercion functions (as_*()), relevant utility functions include:
- src_url(): retrieve the URL of a data source
- id_var_opts(): column name(s) corresponding to ID systems
- src_name(): name of the data source
- tbl_name(): name of a table
Coercion between objects can, under some circumstances, yield list-of-object return types. For example, when coercing src_cfg to tbl_cfg, this will result in a list of tbl_cfg objects, as multiple tables typically correspond to a data source.
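As a brief sketch of this behavior (assuming the mimic_demo source is configured and load_src_cfg() returns a named list of src_cfg objects):

```r
library(ricu)

# load the configuration shipped for the MIMIC-III demo dataset
cfg <- load_src_cfg("mimic_demo")[["mimic_demo"]]

src_name(cfg)  # name of the data source
src_url(cfg)   # download URL stored in the configuration

# coercion to tbl_cfg yields a list-of object: one tbl_cfg per table
tbls <- as_tbl_cfg(cfg)
vapply(tbls, tbl_name, character(1L))
```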
Value
Constructors new_*() as well as coercion functions as_*() return the respective objects, while inheritance tester functions is_*() return a logical flag.
- src_url(): string valued data source URL
- id_var_opts(): character vector of ID variable options
- src_name(): string valued data source name
- tbl_name(): string valued table name
Data source environments
Description
Attaching a data source (see attach_src()) instantiates two types of S3 classes: a single src_env object, representing the data source as a collection of tables, as well as one src_tbl object per table, representing the given table. Upon package loading, src_env objects including the respective src_tbl objects are created for all data sources that are configured for auto-attaching, irrespective of whether data is actually available. If some (or all) data is missing, the user is asked for permission to download in interactive sessions and an error is thrown in non-interactive sessions. See setup_src_env() for manually downloading and setting up data sources.
Usage
new_src_tbl(files, col_cfg, tbl_cfg, prefix, src_env)

is_src_tbl(x)

as_src_tbl(x, ...)

## S3 method for class 'src_env'
as_src_tbl(x, tbl, ...)

new_src_env(x, env = new.env(parent = data_env()), link = NULL)

is_src_env(x)

## S3 method for class 'src_env'
as.list(x, ...)

as_src_env(x)

attached_srcs()

is_tbl_avail(tbl, env)

src_tbl_avail(env, tbls = ls(envir = env))

src_data_avail(src = auto_attach_srcs())

is_data_avail(src = auto_attach_srcs())

Arguments
files | File names of |
col_cfg | Coerced to |
tbl_cfg | Coerced to |
prefix | Character vector valued data source name(s) (used as class prefix) |
src_env | The data source environment (as |
x | Object to test/coerce |
tbl | String-valued table name |
env | Environment used as |
link | |
tbls | Character vector of table names |
src | Character vector of data source names or any other object (orlist thereof) for which an |
Details
A src_env object is an environment with attributes src_name (a string-valued data source name, such as mimic_demo) and id_cfg (describing the possible patient IDs for the given data source). In addition to the src_env class attribute, sub-classes are defined by the source class_prefix configuration setting (see load_src_cfg()). Such data source environments are intended to contain several corresponding src_tbl objects (or rather active bindings that evaluate to src_tbl objects; see setup_src_env()).
The S3 class src_tbl inherits from prt, which represents a partitioned fst file. In addition to the prt object, meta data in the form of col_cfg and tbl_cfg is associated with a src_tbl object (see load_src_cfg()). Furthermore, sub-classes are added as specified by the source configuration class_prefix entry, as with src_env objects. This allows certain functionality, for example data loading, to be adapted to data source-specific requirements.
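A brief sketch of working with these classes (assuming the mimic.demo demo data package is installed):

```r
library(ricu)

# resolve the data source environment by name and check its class
env <- as_src_env("mimic_demo")
is_src_env(env)

# pull out a single src_tbl object; it inherits from prt
tbl <- as_src_tbl(env, "labevents")
is_src_tbl(tbl)
```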
Instantiation and set up of src_env objects is possible irrespective of whether the underlying data is available. If some (or all) data is missing, the user is asked for permission to download in interactive sessions and an error is thrown in non-interactive sessions upon first access of a src_tbl bound as set up by setup_src_env(). Data availability can be checked with the following utilities:
- is_tbl_avail(): Returns a logical flag indicating whether all required data for the table passed as tbl (which may be a string or any object that has a tbl_name() implementation) is available from the environment env (requires an as_src_env() method).
- src_tbl_avail(): Returns a named logical vector, indicating which tables have all required data available. As above, both tbls (arbitrary length) and env (scalar-valued) may be character vectors or objects with corresponding tbl_name() and as_src_env() methods.
- src_data_avail(): The most comprehensive data availability report can be generated by calling src_data_avail(), returning a data.frame with columns name (the data source name), available (logical vector indicating whether all data is available), tables (the number of available tables) and total (the total number of tables). As input, src may be an arbitrary length character vector, an object for which an as_src_env() method is defined, or an arbitrary-length list thereof.
- is_data_avail(): Returns a named logical vector, indicating for which data sources all required data is available. As above, src may be an arbitrary length character vector, an object for which an as_src_env() method is defined, or an arbitrary-length list thereof.
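For illustration, the availability utilities might be used as follows (output depends on which data sets have been set up locally):

```r
library(ricu)

# data.frame summarizing all auto-attached sources
src_data_avail()

# named logical vector for selected sources
is_data_avail(c("mimic_demo", "eicu_demo"))

# per-table availability for a single source
src_tbl_avail(as_src_env("mimic_demo"))
```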
Value
The constructors new_src_env()/new_src_tbl(), as well as coercion functions as_src_env()/as_src_tbl(), return src_env and src_tbl objects respectively, while inheritance testers is_src_env()/is_src_tbl() return logical flags. For data availability utilities, see the Details section.
Concept callback functions
Description
Owing to increased complexity and more diverse applications, recursive concepts (class rec_cncpt) may specify callback functions to be called on corresponding data objects to perform post-processing steps.
Usage
pafi(..., match_win = hours(2L),
     mode = c("match_vals", "extreme_vals", "fill_gaps"),
     fix_na_fio2 = TRUE, interval = NULL)

safi(..., match_win = hours(2L),
     mode = c("match_vals", "extreme_vals", "fill_gaps"),
     fix_na_fio2 = TRUE, interval = NULL)

vent_ind(..., match_win = hours(6L), min_length = mins(30L), interval = NULL)

gcs(..., valid_win = hours(6L),
    sed_impute = c("max", "prev", "none", "verb"),
    set_na_max = TRUE, interval = NULL)

urine24(..., min_win = hours(12L), limits = NULL,
        start_var = "start", end_var = "end", interval = NULL)

vaso60(..., max_gap = mins(5L), interval = NULL)

vaso_ind(..., interval = NULL)

supp_o2(..., interval = NULL)

avpu(..., interval = NULL)

bmi(..., interval = NULL)

norepi_equiv(..., interval = NULL)

Arguments
... | Data input used for concept calculation |
match_win | Time-span during which matching of values is allowed |
mode | Method for matching PaO2 and FiO2 values |
fix_na_fio2 | Logical flag indicating whether to impute missing FiO2 values with 21 |
interval | Expected time series step size (determined from data if |
min_length | Minimal time span between a ventilation start and end time |
valid_win | Maximal time window for which a GCS value is valid if no newer measurement is available |
sed_impute | Imputation scheme for values taken when patient was sedated (i.e. unconscious). |
set_na_max | Logical flag controlling imputation of missing GCS values with the respective maximum values |
min_win | Minimal time span required for calculation of urine/24h |
limits | Passed to |
start_var,end_var | Passed to |
max_gap | Maximum time gap between administration windows that are merged (can be negative). |
Details
Several concept callback functions are exported, mainly for documenting their arguments, as default values oftentimes represent somewhat arbitrary choices and passing non-default values might be of interest for investigating stability with respect to such choices. Furthermore, default values might not be ideal for some datasets and/or analysis tasks.
pafi
In order to calculate the PaO2/FiO2 ratio (or Horowitz index) for a given time point, both a PaO2 and an FiO2 measurement are required. As the two are often not measured at the same time, some form of imputation or matching procedure is required. Several options are available:
- match_vals allows for a time difference of maximally match_win between two measurements for calculating their ratio
- extreme_vals uses the worst PaO2 and FiO2 values within the time window spanned by match_win
- fill_gaps represents a variation of extreme_vals, where ratios are evaluated at every time point as specified by interval, as opposed to only the time points where either a PaO2 or an FiO2 measurement is available
Finally, fix_na_fio2 imputes all remaining missing FiO2 values with 21, the percentage (by volume) of oxygen in (tropospheric) air.
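Assuming the mimic.demo package is installed, non-default matching behavior can be requested by forwarding callback arguments through load_concepts() (a sketch; forwarding of these arguments via ... is assumed here):

```r
library(ricu)

# worst PaO2/FiO2 values within a 1-hour window, instead of the
# default pairwise matching over 2 hours
load_concepts("pafi", "mimic_demo", mode = "extreme_vals",
              match_win = hours(1L), verbose = FALSE)
```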
vent_ind
Building on the atomic concepts vent_start and vent_end, vent_ind determines time windows during which patients are mechanically ventilated by combining start and end events that are separated by at most match_win and at least min_length. Durations are represented by the dur_var column in the returned win_tbl and the data_var column simply indicates the ventilation status with TRUE values. Currently, no clear distinction between invasive and non-invasive ventilation is made.
sed_gcs
In order to construct an indicator for patient sedation (used within the context of gcs), information from the two concepts ett_gcs and rass is pooled: a patient is considered sedated if intubated or if the Richmond Agitation-Sedation Scale value is -2 or lower.
gcs
Aggregating components of the Glasgow Coma Scale into a total score (whenever the total score tgcs is not already available) requires coinciding availability of an eye (egcs), verbal (vgcs) and motor (mgcs) score. In order to match values, a last observation carry forward imputation scheme over the time span specified by valid_win is performed. Furthermore, passing "max" as sed_impute will assume maximal points for time steps where the patient is sedated (as indicated by sed_gcs), passing "prev" will assign the last value observed before the current sedation window, and passing "none" will in turn use raw values. Finally, passing TRUE as set_na_max will assume maximal points for missing values (after matching and potentially applying sed_impute).
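For example, to carry the last pre-sedation value into sedation windows instead of assuming maximal points (a sketch, assuming mimic.demo is installed):

```r
library(ricu)

load_concepts("gcs", "mimic_demo", sed_impute = "prev",
              set_na_max = FALSE, verbose = FALSE)
```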
urine24
Single urine output events are aggregated into a 24 hour moving window sum. At the default value of limits = NULL, moving window evaluation begins with the first and ends with the last available measurement. This can however be extended by passing an id_tbl object, such as is for example returned by stay_windows(), to cover full stay windows. In order to provide data earlier than 24 hours before the evaluation start point, min_win specifies the minimally required data window and the evaluation scheme is adjusted for shorter than 24 hour windows.
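Evaluation over full stay windows can be sketched as follows (assuming stay_windows() accepts a source name and an id_type argument, and that mimic.demo is installed):

```r
library(ricu)

# id_tbl of per-stay windows, used to extend the evaluation range
# beyond the first/last available measurements
win <- stay_windows("mimic_demo", id_type = "icustay")
load_concepts("urine24", "mimic_demo", limits = win, verbose = FALSE)
```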
vaso60
Building on concepts for drug administration rate and drug administration durations, administration events are filtered out if they do not fall into administration windows of at least 1h. The max_gap argument can be used to control how far apart windows can be in order to be merged (negative times are possible as well, meaning that even overlapping windows can be considered as individual windows).
Value
Either an id_tbl or ts_tbl, depending on the type of concept.
Internal utilities for item/concept objects
Description
Several internal utilities for modifying, querying and subsetting item and concept objects, including getters and setters for itm variables, callback functions, cncpt target classes, as well as utilities for data loading, such as prepare_query(), which creates a row-subsetting expression, do_callback(), which applies a callback function to data, or do_itm_load(), which performs data loading corresponding to an itm object.
Usage
prepare_query(x)

try_add_vars(x, ..., var_lst = NULL, type = c("data_vars", "meta_vars"))

get_itm_var(x, var = NULL, type = c("data_vars", "meta_vars"))

set_callback(x, fun)

do_callback(x, ...)

do_itm_load(x, id_type = "icustay", interval = hours(1L))

n_tick(x)

set_target(x, target)

get_target(x)

subset_src(x, src)

## S3 method for class 'item'
subset_src(x, src)

## S3 method for class 'cncpt'
subset_src(x, src)

## S3 method for class 'concept'
subset_src(x, src)

Arguments
x | Object defining the row-subsetting |
... | Variable specification |
var_lst | List-based variable specification |
type | Variable type (either data or meta) |
var | Variable name ( |
fun | Callback function (passed as string) |
id_type | String specifying the patient id type to return |
interval | The time interval used to discretize time stamps with, specified as |
src | Character vector of data source name(s) |
Value
- prepare_query(): an unevaluated expression used for row-subsetting
- try_add_vars(): a (potentially) modified item object with added variables
- get_itm_var(): character vector of itm variables
- set_callback(): a modified object with added callback function
- do_callback(): result of the callback function applied to data, most likely id_tbl/ts_tbl
- do_itm_load(): result of item loading (id_tbl/ts_tbl)
- n_tick(): integer valued number of progress bar ticks
- set_target(): a modified object with newly set target class
- get_target(): string valued target class of an object
- subset_src(): an object of the same type as the object passed as x
ICU class data utilities
Description
Several utility functions for working with id_tbl and ts_tbl objects are available, including functions for changing column names, removing columns, as well as aggregating or removing rows. An important thing to note is that as id_tbl (and consequently ts_tbl) inherits from data.table, there are several functions provided by the data.table package that are capable of modifying id_tbl in a way that results in an object with inconsistent state. An example for this is data.table::setnames(): if an ID column or the index column name is modified without updating the attribute marking the column as such, this leads to an invalid object. As data.table::setnames() is not an S3 generic function, the only way to control its behavior with respect to id_tbl objects is masking the function. As such an approach has its own downsides, a separate function, rename_cols(), is provided, which is able to handle column renaming correctly.
Usage
rename_cols(x, new, old = colnames(x), skip_absent = FALSE,
            by_ref = FALSE, ...)

rm_cols(x, cols, skip_absent = FALSE, by_ref = FALSE)

change_interval(x, new_interval, cols = time_vars(x), by_ref = FALSE)

change_dur_unit(x, new_unit, by_ref = FALSE)

rm_na(x, cols = data_vars(x), mode = c("all", "any"))

## S3 method for class 'id_tbl'
sort(x, decreasing = FALSE, by = meta_vars(x),
     reorder_cols = TRUE, by_ref = FALSE, ...)

is_sorted(x)

## S3 method for class 'id_tbl'
duplicated(x, incomparables = FALSE, by = meta_vars(x), ...)

## S3 method for class 'id_tbl'
anyDuplicated(x, incomparables = FALSE, by = meta_vars(x), ...)

## S3 method for class 'id_tbl'
unique(x, incomparables = FALSE, by = meta_vars(x), ...)

is_unique(x, ...)

## S3 method for class 'id_tbl'
aggregate(x, expr = NULL, by = meta_vars(x), vars = data_vars(x),
          env = NULL, ...)

dt_gforce(x,
          fun = c("mean", "median", "min", "max", "sum", "prod",
                  "var", "sd", "first", "last", "any", "all"),
          by = meta_vars(x), vars = data_vars(x),
          na_rm = !fun %in% c("first", "last"))

replace_na(x, val, type = "const", ...)

Arguments
x | Object to query |
new,old | Replacement names and existing column names for renaming columns |
skip_absent | Logical flag for ignoring non-existent column names |
by_ref | Logical flag indicating whether to perform the operation by reference |
... | Ignored |
cols | Column names of columns to consider |
new_interval | Replacement interval length specified as scalar-valued |
new_unit | New |
mode | Switch between |
decreasing | Logical flag indicating the sort order |
by | Character vector indicating which combinations of columns from |
reorder_cols | Logical flag indicating whether to move the |
incomparables | Not used. Here for S3 method consistency |
expr | Expression to apply over groups |
vars | Column names to apply the function to |
env | Environment to look up names in |
fun | Function name (as string) to apply over groups |
na_rm | Logical flag indicating how to treat |
val | Replacement value (if |
type | Character, one of "const", "locf" or "nocb". Defaults to |
Details
Apart from a function for renaming columns while respecting attributes marking columns as index or ID columns, several other utility functions are provided to make handling of id_tbl and ts_tbl objects more convenient.
Sorting
An id_tbl or ts_tbl object is considered sorted when rows are in ascending order according to columns as specified by meta_vars(). This means that for an id_tbl object rows have to be ordered by id_vars() and for a ts_tbl object rows have to be ordered first by id_vars(), followed by the index_var(). Calling the S3 generic function base::sort() on an object that inherits from id_tbl using default arguments yields an object that is considered sorted. For convenience (mostly in printing), the columns by which the table was sorted are moved to the front (this can be disabled by passing FALSE as reorder_cols argument). Internally, sorting is handled by either setting a data.table::key() in case decreasing = FALSE or by calling data.table::setorder() in case decreasing = TRUE.
Uniqueness
An object inheriting from id_tbl is considered unique if it is unique in terms of the columns as specified by meta_vars(). This means that for an id_tbl object, either zero or a single row is allowed per combination of values in columns id_vars() and consequently for ts_tbl objects a maximum of one row is allowed per combination of time step and ID. In order to create a unique id_tbl object from a non-unique id_tbl object, aggregate() will combine observations that represent repeated measurements within a group.
Aggregating
In order to turn a non-unique id_tbl or ts_tbl object into an object considered unique, the S3 generic function stats::aggregate() is available. This applies the expression (or function specification) passed as expr to each combination of grouping variables. The columns to be aggregated can be controlled using the vars argument and the grouping variables can be changed using the by argument. The argument expr is fairly flexible: it can take an expression that will be evaluated in the context of the data.table in a clean environment inheriting from env, it can be a function, or it can be a string, in which case dt_gforce() is called. The default value NULL chooses a string dependent on data types, where numeric resolves to median, logical to sum and character to first.
As aggregation is used in concept loading (see load_concepts()), performance is important. For this reason, dt_gforce() allows for any of the available functions to be applied using the GForce optimization of data.table (see data.table::datatable.optimize).
Value
Most of the utility functions return an object inheriting from id_tbl, potentially modified by reference, depending on the type of the object passed as x. The functions is_sorted(), anyDuplicated() and is_unique() return logical flags, while duplicated() returns a logical vector of length nrow(x).
Examples
tbl <- id_tbl(a = rep(1:5, 4), b = rep(1:2, each = 10), c = rnorm(20),
              id_vars = c("a", "b"))
is_unique(tbl)
is_sorted(tbl)
is_sorted(tbl[order(c)])
identical(aggregate(tbl, list(c = sum(c))), aggregate(tbl, "sum"))
tbl <- aggregate(tbl, "sum")
is_unique(tbl)
is_sorted(tbl)

Utilities for difftime
Description
As base::difftime() vectors are used throughout ricu, a set of wrapper functions is exported for conveniently instantiating base::difftime() vectors with given time units.
Usage
secs(...)

mins(...)

hours(...)

days(...)

weeks(...)

Arguments
... | Numeric vector to coerce to |
Value
Vector valued time differences as a difftime object.
Examples
hours(1L)
mins(NA_real_)
secs(1:10)
hours(numeric(0L))

Sepsis 3 label
Description
The sepsis 3 label consists of a suspected infection combined with an acuteincrease in SOFA score.
Usage
sep3(..., si_window = c("first", "last", "any"),
     delta_fun = delta_cummin, sofa_thresh = 2L,
     si_lwr = hours(48L), si_upr = hours(24L),
     keep_components = FALSE, interval = NULL)

delta_cummin(x)

delta_start(x)

delta_min(x, shifts = seq.int(0L, 23L))

Arguments
... | Data objects |
si_window | Switch that can be used to filter SI windows |
delta_fun | Function used to determine the SOFA increase during an SI window |
sofa_thresh | Required SOFA increase to trigger Sepsis 3 |
si_lwr,si_upr | Lower/upper extent of SI windows |
keep_components | Logical flag indicating whether to return the individual components alongside the aggregated score |
interval | Time series interval (only used for checking consistency of input data) |
x | Vector of SOFA scores |
shifts | Vector of time shifts (multiples of the current interval) over which |
Details
The Sepsis-3 Consensus (Singer et al.) defines sepsis as an acute increase in the SOFA score (see sofa_score()) of 2 points or more within the suspected infection (SI) window (see susp_inf()).
A patient can potentially have multiple SI windows. The argument si_window is used to control which SI window we focus on (options are "first", "last", "any").
Further, although a 2 or more point increase in the SOFA score is defined, it is not perfectly clear to which value the increase refers. For this, the delta_fun argument is used. If the increase is required to happen with respect to the minimal SOFA value (within the SI window) up to the current time, the delta_cummin function should be used. If, however, we are looking for an increase with respect to the start of the SI window, then the delta_start function should be used. Lastly, the increase might be defined with respect to values of the previous 24 hours, in which case the delta_min function is used.
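The semantics of the first two delta functions can be pictured with plain vector arithmetic (an illustrative sketch of the intended behavior, not the package internals):

```r
sofa <- c(4, 3, 3, 5, 6, 4)

# delta_cummin: increase relative to the running minimum so far
sofa - cummin(sofa)   # 0 0 0 2 3 1

# delta_start: increase relative to the first value in the window
sofa - sofa[1L]       # 0 -1 -1 1 2 0
```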
References
Singer M, Deutschman CS, Seymour CW, et al. The Third InternationalConsensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA.2016;315(8):801–810. doi:10.1001/jama.2016.0287
Data setup
Description
Making a dataset available to ricu consists of 3 steps: downloading (download_src()), importing (import_src()) and attaching (attach_src()). While downloading and importing are one-time procedures, attaching of the dataset is repeated every time the package is loaded. Briefly, downloading retrieves the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently, and attaching sets up the data for use by the package. The download and import steps can be combined using setup_src_data().
Usage
setup_src_data(x, ...)

Arguments
x | Object specifying the source configuration |
... | Forwarded to |
Details
If setup_src_data() is called on data sources that have all data available with force = FALSE, nothing happens apart from a message being displayed. If only a subset of tables is missing, only these tables are downloaded (whenever possible) and imported. Passing force = TRUE attempts to re-download and import the entire data set. If the data source is available as a data package (as is the case for the two demo datasets), data is not downloaded and imported; instead, this package is installed.
In most scenarios, setup_src_data() does not need to be called by users, as upon package loading, all configured data sources are set up in a way that enables download of missing data upon first access (given user consent). However, instead of resolving data missingness by accessing each affected data source one by one, setup_src_data() is exported for convenience.
Value
Called for side effects and returns NULL invisibly.
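As a usage sketch (assuming, as with attach_src(), that a source name can be passed as a string; the demo dataset is served from the package repository listed under Additional_repositories):

```r
library(ricu)

# one-off setup: downloads (where needed) and imports all tables of the
# MIMIC-III demo; with all data present and force = FALSE this is a no-op
setup_src_data("mimic_demo")
```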
SIRS score label
Description
The SIRS (Systemic Inflammatory Response Syndrome) score is a commonly used assessment tool for tracking a patient's well-being in an ICU.
Usage
sirs_score(..., win_length = hours(24L), keep_components = FALSE, interval = NULL)

qsofa_score(..., win_length = hours(24L), keep_components = FALSE, interval = NULL)

news_score(..., win_length = hours(24L), keep_components = FALSE, interval = NULL)

mews_score(..., win_length = hours(24L), keep_components = FALSE, interval = NULL)

Arguments
... | Data input used for score evaluation |
win_length | Window used for carry forward |
keep_components | Logical flag indicating whether to return theindividual components alongside the aggregated score |
interval | Time series interval (only used for checking consistencyof input data) |
SOFA score label
Description
The SOFA (Sequential Organ Failure Assessment) score is a commonly used assessment tool for tracking a patient's status during a stay at an ICU. Organ function is quantified by aggregating 6 individual scores, representing the respiratory, cardiovascular, hepatic, coagulation, renal and neurological systems. The function sofa_score() is used as callback function to the sofa concept but is exported as there are a few arguments that can be used to modify some aspects of the presented SOFA implementation. Internally, sofa_score() first calls sofa_window(), followed by sofa_compute(); arguments passed as ... will be forwarded to the respective internally called function.
Usage
sofa_score(..., worst_val_fun = max_or_na, explicit_wins = FALSE, win_length = hours(24L), keep_components = FALSE, interval = NULL)

sofa_resp(..., interval = NULL)

sofa_coag(..., interval = NULL)

sofa_liver(..., interval = NULL)

sofa_cardio(..., interval = NULL)

sofa_cns(..., interval = NULL)

sofa_renal(..., interval = NULL)

Arguments
... | Concept data, either passed as list or individual argument |
worst_val_fun | Function(s) used to calculate worst values over windows |
explicit_wins | The default |
win_length | Time-frame to look back and apply the |
keep_components | Logical flag indicating whether to return theindividual components alongside the aggregated score (with a suffix |
interval | Time series interval (only used for checking consistencyof input data, |
Details
The function sofa_score() calculates, for each component, the worst value over a moving window as specified by win_length, using the function passed as worst_val_fun. The default function max_or_na() returns NA instead of -Inf/Inf in the case where no measurement is available over an entire window. When calculating the overall score by summing up components per time step, a NA value is treated as 0.
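These two aggregation rules can be sketched in base R (max_or_na_sketch() and total_sofa_sketch() are hypothetical stand-ins for the internal behavior described above):

```r
# worst value over a window: a max() that yields NA (not -Inf) when
# every value in the window is missing, mirroring max_or_na()
max_or_na_sketch <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

# total score at one time step: missing component scores count as 0
total_sofa_sketch <- function(components) {
  sum(vapply(components, function(x) if (is.na(x)) 0 else x, numeric(1)))
}

max_or_na_sketch(c(NA, NA))                              # NA
max_or_na_sketch(c(NA, 3, 1))                            # 3
total_sofa_sketch(list(resp = 2, coag = NA, liver = 1))  # 3
```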
Building on separate concepts, measurements for each component are converted to a component score using the definition by Vincent et al.:
| SOFA score | 1 | 2 | 3 | 4 |
| Respiration | | | | |
| PaO2/FiO2 [mmHg] | < 400 | < 300 | < 200 | < 100 |
| and mechanical ventilation | | | yes | yes |
| Coagulation | | | | |
| Platelets [×10³/mm³] | < 150 | < 100 | < 50 | < 20 |
| Liver | | | | |
| Bilirubin [mg/dl] | 1.2-1.9 | 2.0-5.9 | 6.0-11.9 | > 12.0 |
| Cardiovascularᵃ | | | | |
| MAP | < 70 mmHg | | | |
| or dopamine | | ≤ 5 | > 5 | > 15 |
| or dobutamine | | any dose | | |
| or epinephrine | | | ≤ 0.1 | > 0.1 |
| or norepinephrine | | | ≤ 0.1 | > 0.1 |
| Central nervous system | | | | |
| Glasgow Coma Score | 13-14 | 10-12 | 6-9 | < 6 |
| Renal | | | | |
| Creatinine [mg/dl] | 1.2-1.9 | 2.0-3.4 | 3.5-4.9 | > 5.0 |
| or urine output [ml/day] | | | < 500 | < 200 |

ᵃ Adrenergic agents administered for at least 1 h (doses given are in μg/kg·min)
By default, for each patient, a score is calculated for every time step, from the first available measurement to the last. If, instead of a regularly evaluated score, only certain time points are of interest, this can be specified using the explicit_wins argument: passing for example hours(24, 48) will yield for every patient a score at hours 24 and 48 relative to the origin of the current ID system (for example ICU stay).
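A usage sketch against the MIMIC-III demo dataset (forwarding of explicit_wins from load_concepts() to the sofa callback is assumed as described above):

```r
library(ricu)

# hourly SOFA scores for every patient in the demo cohort
sofa_all <- load_concepts("sofa", "mimic_demo", verbose = FALSE)

# scores only at hours 24 and 48 relative to ICU admission
sofa_24_48 <- load_concepts("sofa", "mimic_demo",
                            explicit_wins = hours(24L, 48L),
                            verbose = FALSE)
```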
Value
A ts_tbl object.
References
Vincent, J.-L., Moreno, R., Takala, J. et al. The SOFA (Sepsis-related OrganFailure Assessment) score to describe organ dysfunction/failure. IntensiveCare Med 22, 707–710 (1996). https://doi.org/10.1007/BF01709751
Stays
Description
Building on functionality offered by the (internal) function id_map(), stay windows as well as (in case of differing values being passed as id_type and win_type) an ID mapping are computed.
Usage
stay_windows(x, ...)

## S3 method for class 'src_env'
stay_windows(x, id_type = "icustay", win_type = id_type, in_time = "start", out_time = "end", interval = hours(1L), patient_ids = NULL, ...)

## S3 method for class 'character'
stay_windows(x, ...)

## S3 method for class 'list'
stay_windows(x, ..., patient_ids = NULL)

## Default S3 method:
stay_windows(x, ...)

Arguments
x | Data source (is coerced to |
... | Generic consistency |
id_type | Type of ID all returned times are relative to |
win_type | Type of ID for which the in/out times are returned |
in_time,out_time | Column names of the returned in/out times |
interval | The time interval used to discretize time stamps with,specified as |
patient_ids | Patient IDs used to subset the result |
Value
An id_tbl containing the selected IDs and, depending on the values passed as in_time and out_time, start and end times of the ID passed as win_type.
See Also
change_id
Suspicion of infection label
Description
Suspected infection is defined as the co-occurrence of antibiotic treatment and body-fluid sampling.
Usage
susp_inf(..., abx_count_win = hours(24L), abx_min_count = 1L, positive_cultures = FALSE, si_mode = c("and", "or", "abx", "samp"), abx_win = hours(24L), samp_win = hours(72L), by_ref = TRUE, keep_components = FALSE, interval = NULL)

Arguments
... | Data and further arguments are passed to |
abx_count_win | Time span during which to apply the |
abx_min_count | Minimal number of antibiotic administrations |
positive_cultures | Logical flag indicating whether to requirecultures to be positive |
si_mode | Switch between |
abx_win | Time-span within which sampling has to occur |
samp_win | Time-span within which antibiotic administration has tooccur |
by_ref | Logical flag indicating whether to process data by reference |
keep_components | Logical flag indicating whether to return theindividual components alongside the aggregated score |
interval | Time series interval (only used for checking consistencyof input data) |
Details
Suspected infection can occur in one of the two following ways:
administration of antibiotics followed by a culture sampling within abx_win hours

         abx_win
   |---------------|
  ABX          sampling (last possible)
culture sampling followed by an antibiotic administration within samp_win hours

                      samp_win
   |---------------------------------------------|
  sampling                      ABX (last possible)
The default values of abx_win and samp_win are 24 and 72 hours respectively, as per Singer et al.
The earlier of the two times (fluid sampling, antibiotic treatment) is taken as the time of suspected infection (SI time). The suspected infection window (SI window) is defined to start si_lwr hours before the SI time and end si_upr hours after the SI time. The default values of 48 and 24 hours (respectively) are chosen as used by Seymour et al. (see Supplemental Material).
               48h                     24h
   |------------------------------|---------------|
                               SI time
For some datasets, however, information on body fluid sampling is not available for the majority of patients (eICU data). Therefore, an alternative definition of suspected infection is required. For this, we use administration of multiple antibiotics (the argument abx_min_count determines the required number) within abx_count_win hours. The first time of antibiotic administration is taken as the SI time in this case.
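For a single patient with one antibiotic and one sampling time, the default ("and") logic can be sketched in base R (si_time_sketch() is a hypothetical helper; times are in hours relative to admission):

```r
si_time_sketch <- function(abx_time, samp_time,
                           abx_win = 24, samp_win = 72) {
  # ABX first: sampling must follow within abx_win hours;
  # sampling first: ABX must follow within samp_win hours
  ok <- (abx_time <= samp_time & samp_time - abx_time <= abx_win) |
        (samp_time < abx_time & abx_time - samp_time <= samp_win)
  if (!ok) return(NA_real_)
  min(abx_time, samp_time)  # the earlier event marks the SI time
}

si_time_sketch(abx_time = 10, samp_time = 20)  # 10
si_time_sketch(abx_time = 10, samp_time = 40)  # NA (sampling too late)
si_time_sketch(abx_time = 50, samp_time = 10)  # 10
```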
References
Singer M, Deutschman CS, Seymour CW, et al. The Third InternationalConsensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA.2016;315(8):801–810. doi:10.1001/jama.2016.0287
Seymour CW, Liu VX, Iwashyna TJ, et al. Assessment of Clinical Criteria forSepsis: For the Third International Consensus Definitions for Sepsis andSeptic Shock (Sepsis-3). JAMA. 2016;315(8):762–774.doi:10.1001/jama.2016.0288
Item callback utilities
Description
For concept loading, item callback functions are used in order to handle item-specific post-processing steps, such as converting measurement units, mapping a set of values to another, or more involved data transformations, like turning absolute drug administration rates into rates that are relative to body weight. Item callback functions are called by load_concepts() with arguments x (the data), a variable number of name/string pairs specifying roles of columns for the given item, followed by env, the data source environment as src_env object. Item callback functions can be specified by their name or using function factories such as transform_fun(), apply_map() or convert_unit().
Usage
transform_fun(fun, ...)

binary_op(op, y)

comp_na(op, y)

set_val(val)

apply_map(map, var = "val_var")

convert_unit(fun, new, rgx = NULL, ignore_case = TRUE, ...)

Arguments
fun | Function(s) used for transforming matching values |
... | Further arguments passed to downstream function |
op | Function taking two arguments, such as |
y | Value passed as second argument to function |
val | Value to replace every element of x with |
map | Named atomic vector used for mapping a set of values (the namesof |
var | Argument which is used to determine the column the mapping isapplied to |
new | Name(s) of transformed units |
rgx | Regular expression(s) used for identifying observations based ontheir current unit of measurement, |
ignore_case | Forwarded to |
Details
The most straightforward setting is where a function is simply referred to by its name. For example in eICU, age is available as a character vector due to ages 90 and above being represented by the string "> 89". A function such as the following turns this into a numeric vector, replacing occurrences of "> 89" by the number 90.
eicu_age <- function(x, val_var, ...) {
  data.table::set(
    data.table::set(x, which(x[[val_var]] == "> 89"), j = val_var, value = 90),
    j = val_var, value = as.numeric(x[[val_var]])
  )
}

This function then is specified as item callback function for items corresponding to eICU data sources of the age concept as
item(src = "eicu_demo", table = "patient", val_var = "age",
     callback = "eicu_age", class = "col_itm")
The string passed as callback argument is evaluated, meaning that an expression can be passed which evaluates to a function that in turn can be used as callback. Several function factories are provided which return functions suitable for use as item callbacks: transform_fun() creates a function that transforms the val_var column using the function supplied as fun argument, apply_map() can be used to map one set of values to another (again using the val_var column) and convert_unit() is intended for converting a subset of rows (identified by matching rgx against the unit_var column) by applying fun to the val_var column and setting new as the transformed unit name (arguments are not limited to scalar values). As transformations require unary functions, two utility functions, binary_op() and comp_na(), are provided which can be used to fix the second argument of binary functions such as * or ==. Taking all this together, an item callback function for dividing the val_var column by 2 could be specified as "transform_fun(binary_op(`/`, 2))". The supplied function factories create functions that operate on the data using by-reference semantics. Furthermore, during concept loading, progress is reported by a progress::progress_bar. In order to signal a message without disrupting the current loading status, see msg_progress().
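The factory pattern itself can be sketched at the vector level in base R (the *_sketch names are hypothetical; ricu's binary_op() and comp_na() additionally integrate with the by-reference transform_fun() machinery):

```r
# fix the second argument of a binary function, yielding a unary one
binary_op_sketch <- function(op, y) function(x) op(x, y)

# same, but NA inputs compare to FALSE rather than propagating NA
comp_na_sketch <- function(op, y) function(x) !is.na(x) & op(x, y)

half <- binary_op_sketch(`/`, 2)
half(c(4, 10))             # 2 5

at_least_4 <- comp_na_sketch(`>=`, 4)
at_least_4(c(3, 4, NA))    # FALSE TRUE FALSE
```

The NA handling in comp_na_sketch() is why the documented gte_4 example yields FALSE (not NA) for missing measurements.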
Value
Callback function factories such as transform_fun(), apply_map() or convert_unit() return functions suitable as item callback functions, while transform function generators such as binary_op() and comp_na() return functions that apply a transformation to a vector.
Examples
dat <- ts_tbl(x = rep(1:2, each = 5), y = hours(rep(1:5, 2)), z = 1:10)

subtract_3 <- transform_fun(binary_op(`-`, 3))
subtract_3(data.table::copy(dat), val_var = "z")

gte_4 <- transform_fun(comp_na(`>=`, 4))
gte_4(data.table::copy(dat), val_var = "z")

map_letters <- apply_map(setNames(letters[1:9], 1:9))
res <- map_letters(data.table::copy(dat), val_var = "z")
res

not_b <- transform_fun(comp_na(`!=`, "b"))
not_b(res, val_var = "z")

Internal utilities for ICU data objects
Description
In order to remove all id_tbl/ts_tbl-related attributes, as well as extra class labels, the exported but marked internal function unclass_tbl() can be used. This function provides what one might expect from an id_tbl/ts_tbl-specific implementation of the S3 generic function data.table::as.data.table(). The inverse functionality is provided by reclass_tbl(), which attempts to add attributes as seen in template to the object passed as x. The logical flag stop_on_fail controls how to proceed if the attributes of template are incompatible with the object x. Finally, in order to generate a template, as_ptype() creates an empty object with the appropriate attributes.
Usage
unclass_tbl(x)

reclass_tbl(x, template, stop_on_fail = TRUE)

as_ptype(x)

Arguments
x | Object to modify/query |
template | Object after which to model the object in question |
stop_on_fail | Logical flag indicating whether to consider failedobject validation as error |
Value
unclass_tbl(): a data.table

reclass_tbl(): either an id_tbl or a ts_tbl, depending on the type of the object passed as template

as_ptype(): an object of the same type as x, but with no data
Read and write utilities
Description
Support for reading from and writing to pipe separated values (.psv)files as used for the PhysioNet Sepsis Challenge.
Usage
write_psv(x, dir, na_rows = NULL)

read_psv(dir, col_spec = NULL, id_var = "stay_id", index_var = NULL)

Arguments
x | Object to write to files |
dir | Directory to write the (many) files to or read from |
na_rows | If |
col_spec | A column specification as created by |
id_var | Name of the id column (IDs are generated from file names) |
index_var | Optional name of index column (will be coerced to |
Details
Data for the PhysioNet Sepsis Challenge is distributed as pipe separated values (.psv) files, split into separate files per patient ID, containing time stamped rows with measured variables as columns. Files are named with patient IDs and do not contain any patient identifiers as data. Functions read_psv() and write_psv() can be used to read from and write to such a data format.
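A minimal base-R sketch of this layout (read_psv() additionally handles column specifications and index coercion; the file content and values below are invented):

```r
# one pipe-separated file per patient; the patient ID lives in the
# file name only, not in the data itself
tmp <- file.path(tempdir(), "p000001.psv")
writeLines(c("HR|O2Sat|ICULOS", "86|98|1", "90|97|2"), tmp)

dat <- read.delim(tmp, sep = "|")
dat$stay_id <- sub("\\.psv$", "", basename(tmp))  # recover ID from name
```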
Value
While write_psv() is called for side effects and returns NULL invisibly, read_psv() returns an object inheriting from id_tbl.
References
Reyna, M., Josef, C., Jeter, R., Shashikumar, S., Moody, B., Westover, M.B., Sharma, A., Nemati, S., & Clifford, G. (2019). Early Prediction ofSepsis from Clinical Data – the PhysioNet Computing in CardiologyChallenge 2019 (version 1.0.0). PhysioNet.https://doi.org/10.13026/v64v-d857.