Research organizations generate, manage, and use more and more knowledge resources, which can be highly heterogeneous in their origin, scope, and structure. Making this knowledge compliant with the F.A.I.R. (Findable, Accessible, Interoperable, Reusable) principles is critical for generating new insights from it. The aim of the TKCat (Tailored Knowledge Catalog) R package is to facilitate the management of such resources, which are frequently used alone or in combination in research environments.
In TKCat, knowledge resources are manipulated as modeled database (MDB) objects. These objects provide access to the data tables along with a general description of the resource and a detailed data model, generated with ReDaMoR, documenting the tables, their fields and their relationships. These MDBs are then gathered in catalogs that can be easily explored and shared. TKCat provides tools to easily subset, filter and combine MDBs and to create new catalogs suited to specific needs.
Currently, TKCat supports 3 different implementations of MDBs: in R memory (memoMDB), in files (fileMDB) and in ClickHouse (chMDB).
This document is divided into four main sections:
The first one describes how to build an MDB object, starting with a minimal example.
The second section shows how to interact with MDB objects to extract and combine information of interest.
The third section focuses on the use of the ClickHouse implementation of MDB (chMDB).
The fourth section corresponds to appendices providing technical information regarding ClickHouse-related admin tasks and the implementation of collections, which are used to identify and leverage potential relationships between different MDBs.
This section shows how to create an MDB object starting from a set of tables in three steps:
This example focuses on the Human Phenotype Ontology (HPO). The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human diseases (Köhler et al. 2019).
A subset of the HPO is provided within the ReDaMoR package. We can read some of the tables as follows:
library(readr)
hpo_data_dir <- system.file("examples/HPO-subset", package="ReDaMoR")

The HPO_hp table gathers human phenotype identifiers, names and descriptions:
HPO_hp <- readr::read_tsv(
  file.path(hpo_data_dir, "HPO_hp.txt")
)
HPO_hp
## # A tibble: 500 × 4
##    id      name                                              description   level
##    <chr>   <chr>                                             <chr>         <dbl>
##  1 0000002 Abnormality of body height                        Deviation fr…     3
##  2 0000009 Functional abnormality of the bladder             Dysfunction …     6
##  3 0000014 Abnormality of the bladder                        An abnormali…     5
##  4 0000017 Nocturia                                          Abnormally i…     7
##  5 0000019 Urinary hesitancy                                 Difficulty i…     7
##  6 0000021 Megacystis                                        Dilatation o…     8
##  7 0000022 Abnormality of male internal genitalia            An abnormali…     6
##  8 0000024 Prostatitis                                       The presence…     8
##  9 0000025 Functional abnormality of male internal genitalia <NA>              6
## 10 0000030 Testicular gonadoblastoma                         The presence…     9
## # ℹ 490 more rows

The HPO_diseases table gathers disease identifiers and labels from different disease databases.
HPO_diseases <- readr::read_tsv(
  file.path(hpo_data_dir, "HPO_diseases.txt")
)
HPO_diseases
## # A tibble: 1,903 × 3
##    db           id label
##    <chr>     <dbl> <chr>
##  1 DECIPHER     15 NF1-microdeletion syndrome
##  2 DECIPHER     45 Xq28 (MECP2) duplication
##  3 DECIPHER     65 ATR-16 syndrome
##  4 OMIM     100050 AARSKOG SYNDROME, AUTOSOMAL DOMINANT
##  5 OMIM     100650 ALDEHYDE DEHYDROGENASE 2 FAMILY
##  6 OMIM     101800 ACRODYSOSTOSIS 1, WITH OR WITHOUT HORMONE RESISTANCE; ACRDYS1
##  7 OMIM     102500 HAJDU-CHENEY SYNDROME; HJCYS
##  8 OMIM     102510 ACROPECTOROVERTEBRAL DYSPLASIA, F-FORM OF
##  9 OMIM     102700 SEVERE COMBINED IMMUNODEFICIENCY, AUTOSOMAL RECESSIVE, T CEL…
## 10 OMIM     102800 ADENOSINE TRIPHOSPHATASE DEFICIENCY, ANEMIA DUE TO
## # ℹ 1,893 more rows

The HPO_diseaseHP table indicates which phenotype is triggered by each disease.
HPO_diseaseHP <- readr::read_tsv(
  file.path(hpo_data_dir, "HPO_diseaseHP.txt")
)
HPO_diseaseHP
## # A tibble: 2,594 × 3
##    db           id hp
##    <chr>     <dbl> <chr>
##  1 ORPHA    140976 0000002
##  2 ORPHA       432 0000002
##  3 DECIPHER     45 0000009
##  4 OMIM     300076 0000009
##  5 ORPHA    100996 0000009
##  6 ORPHA    100997 0000009
##  7 ORPHA      2571 0000009
##  8 ORPHA    391487 0000009
##  9 ORPHA    488594 0000009
## 10 ORPHA     71211 0000009
## # ℹ 2,584 more rows

The ReDaMoR package can be used for drafting a data model from a set of tables:
mhpo_dm <- ReDaMoR::df_to_model(HPO_hp, HPO_diseases, HPO_diseaseHP)
if(igraph_available){
  mhpo_dm %>%
    ReDaMoR::auto_layout(lengthMultiplier=80) %>%
    plot()
}else{
  mhpo_dm %>%
    plot()
}

This data model is minimal: only the names of the tables, their fields and their types are documented. There is no additional constraint regarding the uniqueness or the completeness of the fields. Also, there is no information regarding the relationships between the different tables. The model_relational_data() function can be used to improve the documentation of the dataset according to what we know about it. This function opens a graphical interface for manipulating and modifying the data model (see the ReDaMoR documentation).
mhpo_dm <- ReDaMoR::model_relational_data(mhpo_dm)

Below is the model we get after completing it using the function above.

plot(mhpo_dm)

In this model, we can see that:
Moreover, some comments are added at the table and at the field level to give a better understanding of the data (shown when hovering the cursor over the tables).
The data model can be explicitly bound to the data in an MDB (Modeled DataBase) object as shown below. However, when trying to build the object with the tables we’ve read and the data model we have edited, we get the following error message.
mhpo_db <- memoMDB(
  dataTables=list(
    HPO_hp=HPO_hp, HPO_diseases=HPO_diseases, HPO_diseaseHP=HPO_diseaseHP
  ),
  dataModel=mhpo_dm,
  dbInfo=list(name="miniHPO")
)

FAILURE
FAILURE
FAILURE
FAILURE
Indeed, according to the edited model (not the very first one automatically created by ReDaMoR), the HPO_hp$level field should contain integer values and the HPO_diseases$id and HPO_diseaseHP$id fields should contain character values. The type of the data is among the data model features that are automatically checked when building an MDB object (along with uniqueness or NA values, for example).
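The kind of type check behind this error can be pictured with a small base R sketch. It is not the TKCat or ReDaMoR implementation: `check_types()`, the toy table and the `expected` vector are hypothetical names introduced only for illustration, and the sketch assumes a model that simply maps each field to an expected R class.

```r
# Hypothetical helper (not part of TKCat or ReDaMoR): compare the class of
# each column with the class expected by a simplified "model".
check_types <- function(tbl, expected) {
  observed <- vapply(
    tbl[names(expected)], function(x) class(x)[1], character(1)
  )
  data.frame(
    field = names(expected),
    expected = unname(expected),
    observed = unname(observed),
    ok = unname(observed == expected),
    stringsAsFactors = FALSE
  )
}

# Toy table: numbers created this way are doubles ("numeric"), not integers.
hp <- data.frame(
  id = c("0000002", "0000009"),
  level = c(3, 6),
  stringsAsFactors = FALSE
)
expected <- c(id = "character", level = "integer")

before <- check_types(hp, expected)
print(before)

# Coercing the column makes the table match the expected type.
hp$level <- as.integer(hp$level)
after <- check_types(hp, expected)
print(after)
```

In this toy case the `level` field fails the check before coercion and passes afterwards, which mirrors the two fixes discussed next.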
To avoid this error, we can either change the type of the columns of the data tables:
HPO_hp <- mutate(HPO_hp, level=as.integer(level))
HPO_diseases <- mutate(HPO_diseases, id=as.character(id))
HPO_diseaseHP <- mutate(HPO_diseaseHP, id=as.character(id))
mhpo_db <- memoMDB(
  dataTables=list(
    HPO_hp=HPO_hp, HPO_diseases=HPO_diseases, HPO_diseaseHP=HPO_diseaseHP
  ),
  dataModel=mhpo_dm,
  dbInfo=list(name="miniHPO")
)

Or we can use the data model to read the data in a fileMDB object:
f_mhpo_db <- read_fileMDB(
  path=hpo_data_dir,
  dbInfo=list(name="miniHPO"),
  dataModel=mhpo_dm
)
## miniHPO
## SUCCESS
## 
## Check configuration
## - Optional checks: 
## - Maximum number of records: 10

The read_fileMDB() function identifies the text files to read in path according to the dataModel. It uses the types documented in the data model to read the files. By default, the field delimiter is \t, but another one can be defined by writing a delim slot in the dbInfo parameter (e.g. dbInfo=list(name="miniHPO", delim="\t")).
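As a rough illustration of reading a delimited file with documented types, here is a base R sketch. It does not use TKCat: the file content, the `col_types` vector and the choice of `read.delim()` are assumptions made to mimic the behaviour described above (tab delimiter, column types taken from a model, only the first records read).

```r
# Write a tiny tab-delimited file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(
  c(
    "id\tname\tlevel",
    "0000002\tAbnormality of body height\t3",
    "0000009\tFunctional abnormality of the bladder\t6"
  ),
  tmp
)

# Column types as a data model might document them (assumed for this sketch).
col_types <- c(id = "character", name = "character", level = "integer")

# Read at most 10 records, forcing the documented types instead of guessing.
preview <- read.delim(
  tmp, sep = "\t", colClasses = col_types, nrows = 10,
  stringsAsFactors = FALSE
)
str(preview)
```

Forcing `id` to character keeps the leading zeros of the identifiers, which type guessing could otherwise drop.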
As shown in the message above, by default, read_fileMDB() does not perform optional checks (unique fields, not nullable fields, foreign keys) and it only checks data on the first 10 records. Also, the fileMDB data are not loaded in memory until requested by the user. The object is therefore smaller than the memoMDB object even though they gather the same information.
print(object.size(mhpo_db), units="Kb")
## 691.9 Kb
print(object.size(f_mhpo_db), units="Kb")
## 23.5 Kb
compare_MDB(former=mhpo_db, new=f_mhpo_db) %>%
  DT::datatable(
    rownames=FALSE,
    width="75%",
    options=list(dom="t", pageLength=nrow(.))
  )

In the table above we can see that several pieces of information are expected in an MDB object even if not mandatory (title, description, url, version, maintainer, timestamp). They can be provided in the dbInfo parameter of the MDB creator function (e.g. memoMDB()) or added afterward:
db_info(mhpo_db)$title <- "Very small extract of the human phenotype ontology"
db_info(mhpo_db)$description <- "For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted"
db_info(mhpo_db)$url <- "https://hpo.jax.org/"
db_info(mhpo_db)$version <- "0.1"
db_info(mhpo_db)$maintainer <- "Patrice Godard"
db_info(mhpo_db)$timestamp <- Sys.time()

All this information is displayed when printing the object:
mhpo_db
## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
## 
## No collection member
## 
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
## 
## Timestamp: 2025-06-05 06:05:11.909539
## 

In the HPO example, one table concerns human phenotypes (HPO_hp) and another human diseases (HPO_diseases). These concepts are general and referenced in many other knowledge or data resources (e.g. databases providing information about disease genetics). Therefore, formally documenting such concepts helps to identify how to connect the HPO example to other resources referencing the same or related concepts.
In TKCat, these central concepts are referred to as members of collections. Collections are pre-defined and members must be documented according to this definition. There are currently two collections provided within the TKCat package:
list_local_collections()
## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Additional collections can be defined by users according to their needs. Further information about the implementation of collections is provided in the appendix.
So far, there is no collection member documented in the HPO example described above, as indicated by the “No collection member” statement displayed when printing the object:
mhpo_db
## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
## 
## No collection member
## 
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
## 
## Timestamp: 2025-06-05 06:05:11.909539
## 

However, as just discussed, the HPO_hp table refers to human phenotypes and the HPO_diseases table to human diseases. These concepts correspond to conditions, and those tables can be documented as members of the Condition collection.
Condition members are documented by calling the add_collection_member() function on the MDB object. The two other main arguments are the name of the collection and the name of the table in the MDB object. The other arguments to be provided depend on the collection. For Condition members, three additional arguments must be provided:
condition: indicates the type of the condition (“Phenotype” or “Disease”)
source: a reference source of the condition identifier
identifier: a condition identifier

The functions get_local_collection() and show_collection_def() can be used together to identify valid arguments:
get_local_collection("Condition") %>%
  show_collection_def()
## Condition collection: Collection of condition concepts
## Arguments (non-mandatory arguments are between parentheses):
##    - condition:
##       + static: logical
##       + value: character
##    - source:
##       + static: logical
##       + value: character
##    - identifier:
##       + static: logical
##       + value: character

When calling add_collection_member(), these arguments must be provided as a list with 2 elements named “value” (a character) and “static” (a logical). If “static” is TRUE, “value” corresponds to the information shared by all the rows of the table. If “static” is FALSE, “value” indicates the name of the column which provides this information for each row.
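The “static”/“value” convention can be made concrete with a short base R sketch. This is only an illustration of the semantics described above, not TKCat code: `resolve_arg()` is a hypothetical helper, and the two rows are taken from the HPO_diseases table shown earlier, with identifiers stored as character.

```r
# Hypothetical helper (not a TKCat function): expand one collection argument
# into one value per row of the table.
resolve_arg <- function(tbl, arg) {
  stopifnot(is.character(arg$value), is.logical(arg$static))
  if (arg$static) {
    rep(arg$value, nrow(tbl))        # same value shared by all rows
  } else {
    as.character(tbl[[arg$value]])   # value read from the named column
  }
}

diseases <- data.frame(
  db = c("DECIPHER", "OMIM"),
  id = c("15", "100050"),
  stringsAsFactors = FALSE
)

scope <- data.frame(
  condition = resolve_arg(diseases, list(value = "Disease", static = TRUE)),
  source = resolve_arg(diseases, list(value = "db", static = FALSE)),
  identifier = resolve_arg(diseases, list(value = "id", static = FALSE)),
  stringsAsFactors = FALSE
)
print(scope)
```

Each row ends up with a fully specified scope: a condition type shared by the whole table, and a source and identifier read row by row.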
The example below shows how the HPO_hp table is documented as a member of the Condition collection.
mhpo_db$HPO_hp
## # A tibble: 500 × 4
##    id      name                                              description   level
##    <chr>   <chr>                                             <chr>         <int>
##  1 0000002 Abnormality of body height                        Deviation fr…     3
##  2 0000009 Functional abnormality of the bladder             Dysfunction …     6
##  3 0000014 Abnormality of the bladder                        An abnormali…     5
##  4 0000017 Nocturia                                          Abnormally i…     7
##  5 0000019 Urinary hesitancy                                 Difficulty i…     7
##  6 0000021 Megacystis                                        Dilatation o…     8
##  7 0000022 Abnormality of male internal genitalia            An abnormali…     6
##  8 0000024 Prostatitis                                       The presence…     8
##  9 0000025 Functional abnormality of male internal genitalia <NA>              6
## 10 0000030 Testicular gonadoblastoma                         The presence…     9
## # ℹ 490 more rows
mhpo_db <- add_collection_member(
  mhpo_db, collection="Condition", table="HPO_hp",
  condition=list(value="Phenotype", static=TRUE),
  source=list(value="HP", static=TRUE),
  identifier=list(value="id", static=FALSE)
)

All rows in this table correspond to a condition of type “Phenotype” (condition=list(value="Phenotype", static=TRUE)). The phenotype identifiers are all taken from the same source, “HP” (source=list(value="HP", static=TRUE)). The phenotype identifiers are provided in the “id” column of the table (identifier=list(value="id", static=FALSE)).
The example below shows how the HPO_diseases table is also documented as a member of the Condition collection. In this case, the source of the disease identifier can differ from one row to another and is provided in the “db” column (source=list(value="db", static=FALSE)).
mhpo_db <- add_collection_member(
  mhpo_db, collection="Condition", table="HPO_diseases",
  condition=list(value="Disease", static=TRUE),
  source=list(value="db", static=FALSE),
  identifier=list(value="id", static=FALSE)
)

Now, the existence of collection members is shown when printing the MDB object:
mhpo_db
## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
## 
## Collection members: 
##    - 2 Condition members
## 
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
## 
## Timestamp: 2025-06-05 06:05:11.909539
## 

The documented collection members of an MDB can be displayed as follows:
collection_members(mhpo_db)
## # A tibble: 6 × 9
##   collection cid                   resource   mid table field static value type
##   <chr>      <chr>                 <chr>    <int> <chr> <chr> <lgl>  <chr> <chr>
## 1 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… cond… TRUE   Phen… <NA>
## 2 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… sour… TRUE   HP    <NA>
## 3 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… iden… FALSE  id    <NA>
## 4 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… cond… TRUE   Dise… <NA>
## 5 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… sour… FALSE  db    <NA>
## 6 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… iden… FALSE  id    <NA>

The use of collection members to link or integrate different MDBs will be described later in this document.
Once an MDB has been created and documented, it can be written to a directory:
tmpDir <- tempdir()
as_fileMDB(mhpo_db, path=tmpDir, htmlModel=FALSE)

The structure of the created directory is the following:
##  miniHPO
##   ¦--DESCRIPTION.json
##   ¦--data
##   ¦   ¦--HPO_diseaseHP.txt.gz
##   ¦   ¦--HPO_diseases.txt.gz
##   ¦   °--HPO_hp.txt.gz
##   °--model
##       ¦--Collections
##       ¦   °--Condition-miniHPO_Condition_1.0.json
##       °--miniHPO.json

All the data are in the data folder, whereas the data model and collection members are written in json files in the model folder. The DESCRIPTION.json file gathers db information and information about how to read the data files (i.e. delim, na).
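To give an idea of what such a data file looks like from the outside, the sketch below writes and re-reads a gzip-compressed, tab-delimited table with base R only. It is an assumption-laden illustration: the `toyMDB` folder, the toy table and the `"NA"` missing-value string are made up, and the actual files written by as_fileMDB() may differ in details.

```r
# Assumed layout for this sketch: <tempdir>/toyMDB/data/HPO_hp.txt.gz
out_dir <- file.path(tempdir(), "toyMDB", "data")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

hp <- data.frame(
  id = c("0000002", "0000009"),
  level = c(3L, 6L),
  stringsAsFactors = FALSE
)

# Write the table as gzip-compressed, tab-delimited text.
gz_path <- file.path(out_dir, "HPO_hp.txt.gz")
con <- gzfile(gz_path, "w")
write.table(hp, con, sep = "\t", quote = FALSE, row.names = FALSE, na = "NA")
close(con)

# Read it back with explicit delimiter, NA string and column types.
hp_back <- read.delim(
  gzfile(gz_path), sep = "\t", na.strings = "NA",
  colClasses = c(id = "character", level = "integer"),
  stringsAsFactors = FALSE
)
print(hp_back)
```

The delimiter and NA string used here play the role of the delim and na entries mentioned above.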
This folder can be shared, and it is then easy to load all the data and the corresponding documentation back into R:
read_fileMDB(file.path(tmpDir, "miniHPO"))
## miniHPO
## SUCCESS
## 
## Check configuration
## - Optional checks: 
## - Maximum number of records: 10
## fileMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
## 
## Collection members: 
##    - 2 Condition members
## 
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
## 
## Timestamp: 2025-06-05 06:05:11
## 

Also, writing these data and related information in text files makes them convenient to share with people working in analytical environments other than R.
The previous section showed how to create and save an MDB object. This section describes how MDBs can be used, filtered and combined to efficiently leverage their content.
As a reminder, a modeled database (MDB) in TKCat gathers the following information:
To illustrate how MDBs can be used, some example data are provided within the ReDaMoR and the TKCat packages. The following paragraphs show how to load them in the R session.
A subset of the Human Phenotype Ontology (HPO) is provided within the ReDaMoR package. The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human diseases (Köhler et al. 2019). An MDB object based on files (see MDB implementations) can be read as shown below. As explained above, the data provided by the path parameter are documented with a model (dataModel parameter) and general information (dbInfo parameter).
file_hpo <- read_fileMDB(
  path=system.file("examples/HPO-subset", package="ReDaMoR"),
  dataModel=system.file("examples/HPO-model.json", package="ReDaMoR"),
  dbInfo=list(
    "name"="HPO",
    "title"="Data extracted from the HPO database",
    "description"=paste(
      "This is a very small subset of the HPO!",
      "Visit the reference URL for more information."
    ),
    "url"="http://human-phenotype-ontology.github.io/"
  )
)
## HPO
## SUCCESS
## 
## Check configuration
## - Optional checks: 
## - Maximum number of records: 10

The message displayed in the console indicates whether the data fit the data model. It relies on the ReDaMoR::confront_data() function and checks by default the first 10 rows of each file.
The data model can then be drawn.
plot(data_model(file_hpo))

The data model shows that this MDB contains the 3 tables taken into account in the minimal example. The additional tables mainly provide supplementary details regarding phenotypes and diseases. Still, the HPO_hp and the HPO_diseases tables are members of the Condition collection and can be documented as such, as explained above.
file_hpo <- file_hpo %>%
  add_collection_member(
    collection="Condition", table="HPO_hp",
    condition=list(value="Phenotype", static=TRUE),
    source=list(value="HP", static=TRUE),
    identifier=list(value="id", static=FALSE)
  ) %>%
  add_collection_member(
    collection="Condition", table="HPO_diseases",
    condition=list(value="Disease", static=TRUE),
    source=list(value="db", static=FALSE),
    identifier=list(value="id", static=FALSE)
  )

A subset of the ClinVar database is provided within this package. ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence (Landrum et al. 2018). This resource can be read as a fileMDB as shown above. However, in this case all the documenting information is included in the resource directory, making it easier to read, as explained above.
file_clinvar <- read_fileMDB(
  path=system.file("examples/ClinVar", package="TKCat")
)
## ClinVar
## SUCCESS
## 
## Check configuration
## - Optional checks: 
## - Maximum number of records: 10
file_clinvar
## fileMDB ClinVar (version 0.9, Patrice Godard <patrice.godard@ucb.com>): Data extracted from the ClinVar database
##    - 21 tables with 86 fields
## 
## Collection members: 
##    - 1 BE member
##    - 2 Condition members
## 
## ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information.
## (https://www.ncbi.nlm.nih.gov/clinvar/)
## 

Similarly, a self-documented subset of the CHEMBL database is also provided in the TKCat package. It can be read the same way.
file_chembl <- read_fileMDB(
  path=system.file("examples/CHEMBL", package="TKCat")
)
## CHEMBL
## SUCCESS
## 
## Check configuration
## - Optional checks: 
## - Maximum number of records: 10

CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties (Mendez et al. 2019).
file_chembl
## fileMDB CHEMBL (version 0.2, Liesbeth François <liesbeth.francois@ucb.com>): Data extracted from the CHEMBL database
##    - 10 tables with 61 fields
## 
## Collection members: 
##    - 1 BE member
##    - 1 Condition member
## 
## CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties. This is a very small subset of CHEMBL! Visit the reference URL for more information.
## (https://www.ebi.ac.uk/chembl/)
## 

There are 3 main implementations of MDBs:
fileMDB objects keep the data in files and load them only when requested by the user. This implementation is the first one used when reading an MDB, as demonstrated in the examples above.
memoMDB objects have all the data loaded in memory. These objects are very easy to use but can take time to load and can use a lot of memory.
chMDB objects get the data from a ClickHouse database providing a catalog of MDBs, as described in the dedicated section.
The different implementations can be converted into one another using the as_fileMDB(), as_memoMDB() and as_chMDB() functions.
memo_clinvar <- as_memoMDB(file_clinvar)
object.size(file_clinvar) %>% print(units="Kb")
## 155.2 Kb
object.size(memo_clinvar) %>% print(units="Kb")
## 760.5 Kb

A fourth implementation is metaMDB, which combines several MDBs glued together with relational tables (see the Merging with collections part).
Most of the functions described below work with any MDB implementation, and a few functions are specific to each implementation.
General information can be retrieved (and potentially updated) using the db_info() function.
db_info(file_clinvar)
## $name
## [1] "ClinVar"
## 
## $title
## [1] "Data extracted from the ClinVar database"
## 
## $description
## [1] "ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information."
## 
## $url
## [1] "https://www.ncbi.nlm.nih.gov/clinvar/"
## 
## $version
## [1] "0.9"
## 
## $maintainer
## [1] "Patrice Godard <patrice.godard@ucb.com>"
## 
## $timestamp
## [1] NA

As shown above, the data model of an MDB can be retrieved and plotted as follows.
plot(data_model(file_clinvar))

Table names can be listed with the names() function and potentially renamed with the names()<- or rename() functions (the tables have been renamed here to improve the readability of the following examples).
names(file_clinvar)
##  [1] "ClinVar_ReferenceClinVarAssertion" "ClinVar_rcvaVariant"
##  [3] "ClinVar_ClinVarAssertions"         "ClinVar_rcvaInhMode"
##  [5] "ClinVar_rcvaObservedIn"            "ClinVar_rcvaTraits"
##  [7] "ClinVar_clinSigOrder"              "ClinVar_revStatOrder"
##  [9] "ClinVar_variants"                  "ClinVar_cvaObservedIn"
## [11] "ClinVar_cvaSubmitters"             "ClinVar_traits"
## [13] "ClinVar_varEntrez"                 "ClinVar_varAttributes"
## [15] "ClinVar_varCytoLoc"                "ClinVar_varNames"
## [17] "ClinVar_varSeqLoc"                 "ClinVar_varXRef"
## [19] "ClinVar_traitCref"                 "ClinVar_traitNames"
## [21] "ClinVar_entrezNames"
file_clinvar <- file_clinvar %>%
  set_names(sub("ClinVar_", "", names(.)))
names(file_clinvar)
##  [1] "ReferenceClinVarAssertion" "rcvaVariant"
##  [3] "ClinVarAssertions"         "rcvaInhMode"
##  [5] "rcvaObservedIn"            "rcvaTraits"
##  [7] "clinSigOrder"              "revStatOrder"
##  [9] "variants"                  "cvaObservedIn"
## [11] "cvaSubmitters"             "traits"
## [13] "varEntrez"                 "varAttributes"
## [15] "varCytoLoc"                "varNames"
## [17] "varSeqLoc"                 "varXRef"
## [19] "traitCref"                 "traitNames"
## [21] "entrezNames"

The different collection members of an MDB are listed with the collection_members() function.
collection_members(file_clinvar)
## # A tibble: 10 × 9
##    collection cid                  resource   mid table field static value type
##    <chr>      <chr>                <chr>    <int> <chr> <chr> <lgl>  <chr> <chr>
##  1 Condition  ClinVar_conditions_… ClinVar      2 trai… cond… TRUE   Dise… <NA>
##  2 Condition  ClinVar_conditions_… ClinVar      2 trai… iden… FALSE  id    <NA>
##  3 Condition  ClinVar_conditions_… ClinVar      2 trai… sour… TRUE   Clin… <NA>
##  4 Condition  ClinVar_conditions_… ClinVar      1 trai… cond… TRUE   Dise… <NA>
##  5 Condition  ClinVar_conditions_… ClinVar      1 trai… iden… FALSE  id    <NA>
##  6 Condition  ClinVar_conditions_… ClinVar      1 trai… sour… FALSE  db    <NA>
##  7 BE         ClinVar_BE_1.0       ClinVar      1 entr… be    TRUE   Gene  <NA>
##  8 BE         ClinVar_BE_1.0       ClinVar      1 entr… iden… FALSE  entr… <NA>
##  9 BE         ClinVar_BE_1.0       ClinVar      1 entr… orga… TRUE   Homo… Scie…
## 10 BE         ClinVar_BE_1.0       ClinVar      1 entr… sour… TRUE   Entr… <NA>

The following functions are used to get the number of tables, the number of fields per table and the number of records.
length(file_clinvar) # Number of tables
## [1] 21
lengths(file_clinvar) # Number of fields per table
## ReferenceClinVarAssertion               rcvaVariant         ClinVarAssertions
##                         8                         2                         4
##               rcvaInhMode            rcvaObservedIn                rcvaTraits
##                         2                         6                         3
##              clinSigOrder              revStatOrder                  variants
##                         2                         2                         3
##             cvaObservedIn             cvaSubmitters                    traits
##                         4                         3                         2
##                 varEntrez             varAttributes                varCytoLoc
##                         3                         5                         2
##                  varNames                 varSeqLoc                   varXRef
##                         3                        18                         4
##                 traitCref                traitNames               entrezNames
##                         4                         3                         3
count_records(file_clinvar) # Number of records per table
## ReferenceClinVarAssertion               rcvaVariant         ClinVarAssertions
##                       166                       166                       409
##               rcvaInhMode            rcvaObservedIn                rcvaTraits
##                        16                       337                       166
##              clinSigOrder              revStatOrder                  variants
##                        11                         2                       138
##             cvaObservedIn             cvaSubmitters                    traits
##                       412                       416                        18
##                 varEntrez             varAttributes                varCytoLoc
##                       145                      2262                       138
##                  varNames                 varSeqLoc                   varXRef
##                       188                       280                       244
##                 traitCref                traitNames               entrezNames
##                        50                        44                        20

The count_records() function can take a lot of time when dealing with fileMDB objects if the data files are very large. In such cases, it could be more efficient to list the data file sizes instead.
data_file_size(file_clinvar, hr=TRUE)
## # A tibble: 21 × 3
##    table                     size   compressed
##    <chr>                     <chr>  <lgl>
##  1 ReferenceClinVarAssertion 4.6 KB TRUE
##  2 rcvaVariant               947 B  TRUE
##  3 ClinVarAssertions         4.2 KB TRUE
##  4 rcvaInhMode               152 B  TRUE
##  5 rcvaObservedIn            1.4 KB TRUE
##  6 rcvaTraits                788 B  TRUE
##  7 clinSigOrder              145 B  TRUE
##  8 revStatOrder              101 B  TRUE
##  9 variants                  2.1 KB TRUE
## 10 cvaObservedIn             1.8 KB TRUE
## # ℹ 11 more rows

There are several possible ways to pull data tables from MDBs. The following lines all return the same result, displayed below (only once).
data_tables(file_clinvar, "traitNames")[[1]]
file_clinvar[["traitNames"]]
file_clinvar$"traitNames"
file_clinvar %>% pull(traitNames)
## # A tibble: 44 × 3
##     t.id name                                                              type
##    <int> <chr>                                                             <chr>
##  1   912 Chudley-McCullough syndrome                                       Pref…
##  2   912 Deafness, autosomal recessive 82                                  Alte…
##  3   912 Deafness, bilateral sensorineural, and hydrocephalus due to fora… Alte…
##  4   912 Deafness, sensorineural, with partial agenesis of the corpus cal… Alte…
##  5  1352 CTSD-Related Neuronal Ceroid-Lipofuscinosis                       Alte…
##  6  1352 Ceroid lipofuscinosis neuronal Cathepsin D-deficient              Alte…
##  7  1352 Neuronal ceroid lipofuscinosis 10                                 Pref…
##  8  1352 Neuronal ceroid lipofuscinosis due to Cathepsin D deficiency      Alte…
##  9  1481 Diabetes mellitus, neonatal, with congenital hypothyroidism       Pref…
## 10  1481 NDH SYNDROME                                                      Alte…
## # ℹ 34 more rows

MDBs can also be subset and combined. The corresponding functions ensure that the data model is fulfilled by the data tables.
file_clinvar[1:3]
## fileMDB ClinVar (version 0.9, Patrice Godard <patrice.godard@ucb.com>): Data extracted from the ClinVar database
##    - 3 tables with 14 fields
## 
## No collection member
## 
## ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information.
## (https://www.ncbi.nlm.nih.gov/clinvar/)
## 
if(igraph_available){
  c(file_clinvar[1:3], file_hpo[c(1,5,7)]) %>%
    data_model() %>%
    auto_layout(force=TRUE) %>%
    plot()
}else{
  c(file_clinvar[1:3], file_hpo[c(1,5,7)]) %>%
    data_model() %>%
    plot()
}

The function c() concatenates the provided MDBs after checking that table names are not duplicated. It does not integrate the data with any relational table. This can be achieved by merging the MDBs as described in the Merging with collections section.
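The duplicate-name check performed before concatenation can be sketched on plain named lists of tables. This is a simplified base R analogy, not the c() method provided by TKCat: `combine_tables()` and the toy tables are hypothetical.

```r
# Hypothetical helper: concatenate named lists of tables, refusing
# duplicated table names (a simplified analogy, not TKCat's method).
combine_tables <- function(...) {
  sets <- list(...)
  all_names <- unlist(lapply(sets, names))
  dup <- unique(all_names[duplicated(all_names)])
  if (length(dup) > 0) {
    stop("Duplicated table names: ", paste(dup, collapse = ", "))
  }
  do.call(c, sets)
}

a <- list(variants = data.frame(id = 1:2), traits = data.frame(id = 3:4))
b <- list(HPO_hp = data.frame(id = c("0000002", "0000009")))

combined <- combine_tables(a, b)
print(names(combined))

# Combining a set with itself triggers the duplicate-name error.
dup_error <- tryCatch(
  combine_tables(a, a),
  error = function(e) conditionMessage(e)
)
print(dup_error)
```

As in the text above, the tables are only placed side by side: nothing in this sketch links the content of one set to the other.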
An MDB can be filtered by filtering one or several tables based on field values. The filtering is propagated to other tables using the embedded data model.
In the example below, the file_clinvar object is filtered in order to focus on a few genes with pathogenic variants. The table below compares the number of rows before (“ori”) and after (“filt”) filtering.
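The idea of propagating a filter along foreign keys can be sketched with three toy tables in base R. This is only a conceptual illustration under simplifying assumptions (a single chain of keys followed by hand); it does not reproduce how TKCat walks the data model. The gene identifiers come from the ClinVar example, whereas the link and variant rows are made up.

```r
genes <- data.frame(
  entrez = c(1509L, 5296L, 54576L),
  symbol = c("CTSD", "PIK3R2", "UGT1A8"),
  stringsAsFactors = FALSE
)
# Made-up link table (variant -> gene) and variant table.
var_entrez <- data.frame(
  variant = c(1L, 2L, 3L, 4L),
  entrez = c(1509L, 5296L, 5296L, 54576L)
)
variants <- data.frame(
  id = 1:5,
  label = paste0("variant_", 1:5),
  stringsAsFactors = FALSE
)

# 1. Filter the table of interest on a field value.
keep_genes <- genes[genes$symbol %in% c("PIK3R2", "UGT1A8"), ]
# 2. Propagate through the foreign key genes$entrez -> var_entrez$entrez.
keep_links <- var_entrez[var_entrez$entrez %in% keep_genes$entrez, ]
# 3. Propagate through the foreign key var_entrez$variant -> variants$id.
keep_variants <- variants[variants$id %in% keep_links$variant, ]

print(sapply(
  list(genes = keep_genes, var_entrez = keep_links, variants = keep_variants),
  nrow
))
```

Only the variants linked to the two selected genes remain; the variant without any gene link and the one linked to CTSD are dropped.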
filtered_clinvar <- file_clinvar %>%
  filter(
    entrezNames = symbol %in% c("PIK3R2", "UGT1A8")
  ) %>%
  slice(ReferenceClinVarAssertion=grep(
    "pathogen",
    .$ReferenceClinVarAssertion$clinicalSignificance,
    ignore.case=TRUE
  ))
left_join(
  dims(file_clinvar) %>% select(name, nrow),
  dims(filtered_clinvar) %>% select(name, nrow),
  by="name",
  suffix=c("_ori", "_filt")
)
## # A tibble: 21 × 3
##    name                      nrow_ori nrow_filt
##    <chr>                        <dbl>     <int>
##  1 ReferenceClinVarAssertion      166         4
##  2 rcvaVariant                    166         4
##  3 ClinVarAssertions              409        15
##  4 rcvaInhMode                     16         0
##  5 rcvaObservedIn                 337        10
##  6 rcvaTraits                     166         4
##  7 clinSigOrder                    11         3
##  8 revStatOrder                     2         1
##  9 variants                       138         3
## 10 cvaObservedIn                  412        15
## # ℹ 11 more rows

The object returned by filter() or slice() is a memoMDB: all the data are in memory.
Tables can be easily joined to get the diseases associated with the genes of interest in a single table, as shown below.
gene_traits <- filtered_clinvar %>%
  join_mdb_tables(
    "entrezNames", "varEntrez", "variants", "rcvaVariant",
    "ReferenceClinVarAssertion", "rcvaTraits", "traits"
  )
gene_traits$entrezNames %>%
  select(symbol, name, variants.type, variants.name, traitType, traits.name)
## # A tibble: 4 × 6
##   symbol name                  variants.type variants.name traitType traits.name
##   <chr>  <chr>                 <chr>         <chr>         <chr>     <chr>
## 1 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   Megalencep…
## 2 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   not provid…
## 3 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   not provid…
## 4 UGT1A8 UDP glucuronosyltran… Microsatelli… UGT1A1*28     Disease   Gilbert's …

Until now, we have seen how to use an individual MDB by exploring general information about it, extracting tables, and filtering and joining data. This part shows how to use collections to identify relationships between MDBs and to leverage these relationships to integrate them. Documenting collection members has been described above, and further information about the implementation of collections is provided in the appendix.
As explained above, some databases refer to the same concepts and could be integrated accordingly. However, they often use different vocabularies.
For example, both CHEMBL and ClinVar refer to biological entities (BE) for documenting drug targets or disease causal genes. CHEMBL refers to drug targets in the CHEMBL_component_sequence table using mainly Uniprot peptide identifiers from different species.
file_chembl$CHEMBL_component_sequence
## # A tibble: 35 × 5
##    component_id accession organism                  db_source db_version
##           <int> <chr>     <chr>                     <chr>     <chr>
##  1          259 P15260    Homo sapiens              Uniprot   2019_09
##  2          327 Q99062    Homo sapiens              Uniprot   2019_09
##  3          752 P35563    Rattus norvegicus         Uniprot   2019_09
##  4          917 P07339    Homo sapiens              Uniprot   2019_09
##  5         1807 Q54A96    Plasmodium falciparum     Uniprot   2019_09
##  6         2180 P67774    Bos taurus                Uniprot   2019_09
##  7         2398 P25098    Homo sapiens              Uniprot   2019_09
##  8         2541 Q8II92    Plasmodium falciparum 3D7 Uniprot   2019_09
##  9         3803 Q64346    Rattus norvegicus         Uniprot   2019_09
## 10         4395 O60502    Homo sapiens              Uniprot   2019_09
## # ℹ 25 more rows

ClinVar, on the other hand, refers to causal genes in the entrezNames table using human Entrez gene identifiers.
file_clinvar$entrezNames
## # A tibble: 20 × 3
##       entrez name                                                         symbol
##        <int> <chr>                                                        <chr>
##  1      1509 cathepsin D                                                  CTSD
##  2      1903 sphingosine-1-phosphate receptor 3                           S1PR3
##  3      3300 DnaJ heat shock protein family (Hsp40) member B2             DNAJB2
##  4      3423 iduronate 2-sulfatase                                        IDS
##  5      3910 laminin subunit alpha 4                                      LAMA4
##  6      5296 phosphoinositide-3-kinase regulatory subunit 2               PIK3R2
##  7      6748 signal sequence receptor subunit 4                           SSR4
##  8      7633 zinc finger protein 79                                       ZNF79
##  9     22906 trafficking kinesin protein 1                                TRAK1
## 10     23155 chloride channel CLIC like 1                                 CLCC1
## 11     26251 potassium voltage-gated channel modifier subfamily G member… KCNG2
## 12     29851 inducible T cell costimulator                                ICOS
## 13     54576 UDP glucuronosyltransferase family 1 member A8               UGT1A8
## 14     57684 zinc finger and BTB domain containing 26                     ZBTB26
## 15    115948 outer dynein arm docking complex subunit 3                   ODAD3
## 16    139716 GRB2 associated binding protein 3                            GAB3
## 17    169792 GLIS family zinc finger 3                                    GLIS3
## 18    407054 microRNA 98                                                  MIR98
## 19    441531 phosphoglycerate mutase family member 4                      PGAM4
## 20 105373557 serous ovarian cancer associated RNA                         SOCAR

Since peptides are coded by genes, there is a biological relationship between these two types of BE, and several tools exist to convert such BE identifiers from one scope to the other (e.g. BED (Godard and Eyll 2018), mygene (Wu, MacLeod, and Su 2012), biomaRt (Kinsella et al. 2011)).
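Conceptually, such a conversion amounts to joining the two tables through a cross-reference between scopes. The base R sketch below illustrates this with a one-row mapping table; it is not how TKCat or any of the tools cited above work internally. The `xref` table is an assumption written by hand for illustration (it pairs the Uniprot accession P07339 with Entrez gene 1509, cathepsin D) and should be obtained from a dedicated conversion tool in practice.

```r
# Two rows from the CHEMBL example (peptide scope).
targets <- data.frame(
  accession = c("P15260", "P07339"),
  organism = c("Homo sapiens", "Homo sapiens"),
  stringsAsFactors = FALSE
)
# Two rows from the ClinVar example (gene scope).
genes <- data.frame(
  entrez = c(1509L, 5296L),
  symbol = c("CTSD", "PIK3R2"),
  stringsAsFactors = FALSE
)
# Hand-written, illustrative cross-reference between the two scopes
# (assumption: in practice this comes from a conversion tool).
xref <- data.frame(
  accession = "P07339",
  entrez = 1509L,
  stringsAsFactors = FALSE
)

# Join peptides to genes through the cross-reference.
linked <- merge(merge(targets, xref, by = "accession"), genes, by = "entrez")
print(linked)
```

Only the drug target with a cross-reference to a gene present in the second table survives the join, which is the kind of relationship that collections are meant to make discoverable automatically.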
TKCat provides mechanisms to document these scopes in order to allow automatic conversions from and to any of them. Those concepts are called Collections in TKCat and they must be formally defined before any of their members can be documented. Two collection definitions are provided within the TKCat package and others can be imported with the import_local_collection() function.
list_local_collections()
## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Here are the definitions of the BE collection members provided by the CHEMBL_component_sequence and the entrezNames tables.
collection_members(file_chembl, "BE")
## # A tibble: 4 × 9
##   collection cid           resource   mid table         field static value type
##   <chr>      <chr>         <chr>    <int> <chr>         <chr> <lgl>  <chr> <chr>
## 1 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… be    TRUE   Pept… <NA>
## 2 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… iden… FALSE  acce… <NA>
## 3 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… sour… FALSE  db_s… <NA>
## 4 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… orga… FALSE  orga… Scie…

collection_members(file_clinvar, "BE")
## # A tibble: 4 × 9
##   collection cid            resource   mid table       field  static value type
##   <chr>      <chr>          <chr>    <int> <chr>       <chr>  <lgl>  <chr> <chr>
## 1 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames be     TRUE   Gene  <NA>
## 2 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames ident… FALSE  entr… <NA>
## 3 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames organ… TRUE   Homo… Scie…
## 4 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames source TRUE   Entr… <NA>

The collection column indicates the collection to which the table refers. The cid column indicates the version of the collection definition, which should correspond to the $id of the JSON schema. The resource column indicates the name of the resource and the mid column an identifier which is unique for each member of a collection in each resource. The field column indicates each part of the scope of the collection. In the case of BE, 4 fields should be documented:

be: the type of BE (e.g. "Gene" or "Peptide")
source: the source of the identifier (e.g. "Uniprot")
organism: the organism to which the identifier refers (e.g. "Homo sapiens")
identifier: the identifier itself
Each of these fields can be static or not. TRUE means that the value of this field is the same for all the records and is provided in the value column, whereas FALSE means that the value can differ between records and is read from the column whose name is given in the value column. The type column is only used for the organism field in the case of the BE collection and can take 2 values: “Scientific name” or “NCBI taxon identifier”. The definition of the pre-built BE collection members follows the terminology used in the BED package (Godard and Eyll 2018), but it can be adapted according to the solution chosen for converting BE identifiers from one scope to another.
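The difference between static and non-static fields can be illustrated with a small base R sketch. The helper resolve_member_fields() is hypothetical (it is not part of TKCat) and the toy table only reuses a few rows of CHEMBL_component_sequence; it simply shows how a member definition might be expanded into one scope per record.

```r
# Hypothetical helper (not a TKCat function): expand a collection member
# definition into one scope row per record of the table.
resolve_member_fields <- function(tbl, member) {
   out <- lapply(member, function(field) {
      if (field$static) {
         # Static field: the same value applies to every record.
         rep(field$value, nrow(tbl))
      } else {
         # Non-static field: `value` names the column holding the values.
         as.character(tbl[[field$value]])
      }
   })
   as.data.frame(out, stringsAsFactors = FALSE)
}

# Toy table reusing three rows of CHEMBL_component_sequence.
component_sequence <- data.frame(
   accession = c("P15260", "Q99062", "P35563"),
   organism = c("Homo sapiens", "Homo sapiens", "Rattus norvegicus"),
   db_source = c("Uniprot", "Uniprot", "Uniprot"),
   stringsAsFactors = FALSE
)

# Member definition mirroring the CHEMBL BE member.
be_member <- list(
   be = list(static = TRUE, value = "Peptide"),
   identifier = list(static = FALSE, value = "accession"),
   source = list(static = FALSE, value = "db_source"),
   organism = list(static = FALSE, value = "organism")
)

scopes <- resolve_member_fields(component_sequence, be_member)
print(scopes)
```

In this sketch the be field is repeated for every row because it is static, whereas identifier, source and organism are read from the columns named in value.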
Setting up the definition of such a scope is done using the add_collection_member() function as shown above in the minimal example and in the Reading HPO example.
The aim of collections is to identify potential bridges between MDBs. The get_shared_collections() function is used to list all the collections shared by two MDBs.
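get_shared_collections(filtered_clinvar, file_chembl)
## # A tibble: 3 × 5
##   collection table.x     mid.x table.y                   mid.y
##   <chr>      <chr>       <int> <chr>                     <int>
## 1 Condition  traits          2 CHEMBL_drug_indication        1
## 2 Condition  traitCref       1 CHEMBL_drug_indication        1
## 3 BE         entrezNames     1 CHEMBL_component_sequence     1

In this example, there are 3 different ways to merge the two MDBs filtered_clinvar and file_chembl:

Condition: the traits table of ClinVar and the CHEMBL_drug_indication table of CHEMBL
Condition: the traitCref table of ClinVar and the CHEMBL_drug_indication table of CHEMBL
BE: the entrezNames table of ClinVar and the CHEMBL_component_sequence table of CHEMBL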
The code below shows how to merge these two resources based on BE information. To achieve this task it relies on a function provided with TKCat along with the BE collection definition (to get the function: get_collection_mapper("BE")). This function uses the BED package (Godard and Eyll 2018), which must be installed and connected to a BED database in order to run the code below.
## Adapt the connection arguments to your own BED instance
try(BED::connectToBed())
bedCheck <- try(BED::checkBedConn())
if(!inherits(bedCheck, "try-error") && bedCheck){
   sel_coll <- get_shared_collections(file_clinvar, file_chembl) %>% 
      filter(collection=="BE")
   filtered_cv_chembl <- merge(
      x=file_clinvar,
      y=file_chembl,
      by=sel_coll,
      dmAutoLayout=igraph_available
   )
}

The returned object is a metaMDB gathering the original MDBs and a relational table between members of the same collection as defined by the by parameter.
Additional information about collections can be found below in the appendix.
If the collection column of the by parameter is NA, then the relational table is built by merging identical columns in table.x and table.y (no conversion occurs). For example, the file_hpo and file_clinvar MDBs could be merged according to conditions provided in the HPO_diseases and the traitCref tables respectively.
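To make the "no conversion" case concrete, here is a minimal base R sketch of how an association table can be derived from the columns that two tables have in common. It does not use TKCat: the two data frames are toy stand-ins (the label and trait columns and their values are purely illustrative), and the actual merge() method may proceed differently.

```r
# Toy stand-ins for the two tables; only `db` and `id` are shared.
hpo_diseases <- data.frame(
   db = c("OMIM", "OMIM", "DECIPHER"),
   id = c("100050", "100650", "15"),
   label = c("disease A", "disease B", "disease C"),   # illustrative values
   stringsAsFactors = FALSE
)
trait_cref <- data.frame(
   trait = c(1L, 2L, 3L),                              # hypothetical column
   db = c("OMIM", "OMIM", "DECIPHER"),
   id = c("100050", "102500", "15"),
   stringsAsFactors = FALSE
)

# Identical columns are used as the join key, without any conversion.
shared_cols <- intersect(names(hpo_diseases), names(trait_cref))
assoc <- merge(
   unique(hpo_diseases[shared_cols]),
   unique(trait_cref[shared_cols]),
   by = shared_cols
)
print(assoc)
```

Only the (db, id) pairs present in both tables end up in the association table; identifiers expressed in different scopes are not matched, which is the price of skipping conversion.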
get_shared_collections(file_hpo, file_clinvar)
## # A tibble: 4 × 5
##   collection table.x      mid.x table.y   mid.y
##   <chr>      <chr>        <int> <chr>     <int>
## 1 Condition  HPO_hp           1 traits        2
## 2 Condition  HPO_hp           1 traitCref     1
## 3 Condition  HPO_diseases     2 traits        2
## 4 Condition  HPO_diseases     2 traitCref     1

These conditions could be converted using a function provided with TKCat (get_collection_mapper("Condition")), which relies on the DODO package (François, Eyll, and Godard 2020). The two tables can also simply be concatenated without applying any conversion (losing the advantage of such a conversion).
sel_coll <- get_shared_collections(file_hpo, file_clinvar) %>% 
   filter(table.x=="HPO_diseases", table.y=="traitCref") %>% 
   mutate(collection=NA)
sel_coll
## # A tibble: 1 × 5
##   collection table.x      mid.x table.y   mid.y
##   <lgl>      <chr>        <int> <chr>     <int>
## 1 NA         HPO_diseases     2 traitCref     1

The merge() function gathers the two MDBs in one metaMDB and creates an association table based on the by argument. This association table (“HPO_diseases_traitCref”) is displayed in yellow in the data model of the created metaMDB as shown below.
hpo_clinvar <- merge(
   file_hpo, file_clinvar, by=sel_coll, dmAutoLayout=igraph_available
)
plot(data_model(hpo_clinvar))
hpo_clinvar$HPO_diseases_traitCref
## # A tibble: 1,950 × 2
##    db       id
##    <chr>    <chr>
##  1 DECIPHER 15
##  2 DECIPHER 45
##  3 DECIPHER 65
##  4 OMIM     100050
##  5 OMIM     100650
##  6 OMIM     101800
##  7 OMIM     102500
##  8 OMIM     102510
##  9 OMIM     102700
## 10 OMIM     102800
## # ℹ 1,940 more rows

MDBs can be gathered in a TKCat (Tailored Knowledge Catalog) object.
k <- TKCat(file_hpo, file_clinvar)

Gathering MDBs in such a catalog facilitates their exploration and their preparation for potential integration. Several functions are available to achieve this goal.
list_MDBs(k)                     # list all the MDBs in a TKCat object
## # A tibble: 2 × 7
##   name    title         description url   version maintainer timestamp
##   <chr>   <chr>         <chr>       <chr> <chr>   <chr>      <dttm>
## 1 HPO     Data extract… This is a … http… <NA>    <NA>       NA
## 2 ClinVar Data extract… ClinVar is… http… 0.9     Patrice G… NA

get_MDB(k, "HPO")                # get a specific MDB from the catalog
## fileMDB HPO: Data extracted from the HPO database
##    - 9 tables with 25 fields
## 
## Collection members: 
##    - 2 Condition members
## 
## This is a very small subset of the HPO! Visit the reference URL for more information.
## (http://human-phenotype-ontology.github.io/)
## 
##

search_MDB_tables(k, "disease")  # Search table about "disease"
## # A tibble: 3 × 3
##   resource name                comment
##   <chr>    <chr>               <chr>
## 1 HPO      HPO_diseases        Diseases
## 2 HPO      HPO_diseaseHP       HP presented by diseases
## 3 HPO      HPO_diseaseSynonyms Disease synonyms

search_MDB_fields(k, "disease")  # Search a field about "disease"
## # A tibble: 8 × 7
##   resource table               name    type      nullable unique comment
##   <chr>    <chr>               <chr>   <chr>     <lgl>    <lgl>  <chr>
## 1 HPO      HPO_diseases        db      character FALSE    FALSE  Disease databa…
## 2 HPO      HPO_diseases        id      character FALSE    FALSE  Disease ID
## 3 HPO      HPO_diseases        label   character FALSE    FALSE  Disease lable …
## 4 HPO      HPO_diseaseHP       db      character FALSE    FALSE  Disease databa…
## 5 HPO      HPO_diseaseHP       id      character FALSE    FALSE  Disease ID
## 6 HPO      HPO_diseaseSynonyms db      character FALSE    FALSE  Disease databa…
## 7 HPO      HPO_diseaseSynonyms id      character FALSE    FALSE  Disease ID
## 8 HPO      HPO_diseaseSynonyms synonym character FALSE    FALSE  Disease synonym

collection_members(k)            # Get collection members of the different MDBs
## # A tibble: 5 × 3
##   resource collection table
##   <chr>    <chr>      <chr>
## 1 HPO      Condition  HPO_hp
## 2 HPO      Condition  HPO_diseases
## 3 ClinVar  Condition  traits
## 4 ClinVar  Condition  traitCref
## 5 ClinVar  BE         entrezNames

c(k, TKCat(file_chembl))         # Merge 2 TKCat objects
## TKCat gathering 3 MDB objects

The function explore_MDBs() launches a shiny interface to explore MDBs in a TKCat object. This exploration interface can be easily deployed using an app.R file with content similar to the one below.
library(TKCat)
explore_MDBs(k, download=TRUE)

In this interface the users can explore the resources available in the catalog. They can browse the data model of each of them with some sample data. They can also search for information provided in resources, tables or fields. Finally, if the parameter download is set to TRUE, the users will also be able to download the data: either each table individually or an archive of the whole MDB.
A chTKCat object is a catalog of MDBs, like the TKCat object described above, but relying on a ClickHouse database. This part focuses on using and querying a chTKCat object. The installation and the initialization of a ClickHouse database ready for TKCat are described below in the appendix.
The connection to the ClickHouse TKCat database is achieved using the chTKCat() function.
k <- chTKCat(
   host="localhost",                     # default parameter
   port=9111L,                           # default parameter
   drv=ClickHouseHTTP::ClickHouseHTTP(), # default parameter
   user="default",                       # default parameter
   password=""                           # if not provided the
                                         # password is requested interactively 
)

By default, this function connects anonymously (“default” user without password) to the database, using the HTTP interface of ClickHouse thanks to the ClickHouseHTTP driver. If the database is configured appropriately (see appendix), connection can be achieved through HTTPS with or without SSL peer verification (see the manual of ClickHouseHTTP::ClickHouseHTTPDriver-class for further information). Also, the RClickhouse::clickhouse() driver from the RClickhouse package can be used (drv parameter of the chTKCat() function) to leverage the native TCP interface of ClickHouse, which has the strong advantage of a lower overhead. However, TLS wrapping is not supported yet by the RClickhouse package.
Once connected, this chTKCat object can be used as a TKCat object.
list_MDBs(k)                     # list all the MDBs in the catalog
## # A tibble: 24 × 12
##    name   title description url   version maintainer public populated timestamps
##    <chr>  <chr> <chr>       <chr> <chr>   <chr>      <lgl>  <lgl>     <lgl>
##  1 ChEMBL ChEM… ChEMBL is … http… 1.0.0   [Patrice … TRUE   TRUE      TRUE
##  2 Corte… Data… Clarivate … http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
##  3 DRE-B… Bulk… Re-interpr… http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  4 FCD-T… Bulk… Re-interpr… http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  5 GO     The … Because of… http… 1.0.0   [Patrice … TRUE   TRUE      TRUE
##  6 GTEx   Geno… The Adult … http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  7 Galac… Data… Biorelate'… http… 1.1.0   [Patrice … TRUE   TRUE      TRUE
##  8 Globa… Data… GlobalData… http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
##  9 HGNC   Anno… The HUGO G… http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
## 10 HPA    The … The Human … http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
## # ℹ 14 more rows
## # ℹ 3 more variables: timestamp <dttm>, access <fct>, total_size <dbl>

search_MDB_tables(k, "disease")  # Search table about "disease"
## # A tibble: 47 × 3
##    resource    name                   comment
##    <chr>       <chr>                  <chr>
##  1 ChEMBL      assay_classification   "Classification scheme for phenotypic ass…
##  2 Galactic    status                 "Cause-and-effect interactions can be bet…
##  3 HPA         Disease_involvement     <NA>
##  4 HPO         Disease_HP             "HP presented by diseases"
##  5 HPO         Disease_synonyms       "Disease synonyms"
##  6 HPO         Diseases               "Diseases"
##  7 brainSCOPE  CT_group_conditions    "Experimental condition (e.g.: to be comp…
##  8 brainSCOPE  Cell_type_conditions   "Experimental condition (e.g.: to be comp…
##  9 OpenTargets Associations_by_source "Disease target association by data sourc…
## 10 OpenTargets Associations_by_type   "Disease target association by data type"
## # ℹ 37 more rows

search_MDB_fields(k, "disease")  # Search a field about "disease"
## # A tibble: 124 × 7
##    resource              table               name  comment type  nullable unique
##    <chr>                 <chr>               <chr> <chr>   <chr> <lgl>    <lgl>
##  1 FCD-TLE-Bulk-RNA-2019 conditions          cond… "Disea… char… TRUE     TRUE
##  2 FCD-TLE-Bulk-RNA-2019 samples             cond… "Disea… char… TRUE     FALSE
##  3 HPA                   Disease_involvement dise… ""      char… FALSE    FALSE
##  4 DRE-Bulk-RNA-UMC-2024 conditions          cond… "Disea… char… TRUE     FALSE
##  5 DRE-Bulk-RNA-UMC-2024 samples             cond… "Disea… char… TRUE     FALSE
##  6 DRE-Bulk-RNA-UMC-2024 epilepsies_genetics dise… ""      char… FALSE    FALSE
##  7 DRE-Bulk-RNA-UMC-2024 epilepsies_targets  dise… ""      char… FALSE    FALSE
##  8 DRE-Bulk-RNA-UMC-2024 epilepsies_targets  dise… ""      nume… FALSE    FALSE
##  9 DRE-Bulk-RNA-UMC-2024 genes_epilepsies_a… dise… ""      char… FALSE    FALSE
## 10 DRE-Bulk-RNA-UMC-2024 epilepsies_genetics dise… ""      char… FALSE    FALSE
## # ℹ 114 more rows

collection_members(k)
## # A tibble: 43 × 3
##    resource              collection table
##    <chr>                 <chr>      <chr>
##  1 ChEMBL                BE         component_sequences
##  2 ChEMBL                Condition  drug_indication
##  3 Cortellis             BE         target_genes
##  4 DRE-Bulk-RNA-UMC-2024 BE         genes
##  5 FCD-TLE-Bulk-RNA-2019 BE         genes
##  6 GO                    BE         Unique_BEIDs
##  7 GTEx                  BE         genes
##  8 GTEx                  BE         transcripts
##  9 GlobalData            BE         target_genes
## 10 HGNC                  BE         Genes
## # ℹ 33 more rows

explore_MDBs(k)

Any MDB object can be imported in a TKCat ClickHouse instance as follows:
kw <- chTKCat(host="localhost", port=9111L, user="pgodard")
create_chMDB(kw, "HPO", public=TRUE)
ch_hpo <- as_chMDB(file_hpo, kw)

It is then accessible to anyone with relevant permissions on the ClickHouse database. Pushing data in a ClickHouse database works only if the user is allowed to write in the database.
The function get_MDB() returns a chMDB object that can be used as any MDB object. The data are located in the ClickHouse database and pulled on request.
ch_hpo <- get_MDB(k, "HPO")

To avoid pulling a whole table from ClickHouse (which can take time if the table is big), SQL queries can be made on the chMDB object as shown below.
get_query(
   ch_hpo,
   query="SELECT * from HPO_diseases WHERE lower(label) LIKE '%epilep%'"
)
## # A tibble: 292 × 3
##    db    id     label
##    <chr> <chr>  <chr>
##  1 OMIM  117100 Centralopathic epilepsy
##  2 OMIM  121201 Epilepsy, benign neonatal, 2
##  3 OMIM  132090 Epilepsy, benign occipital
##  4 OMIM  132300 Epilepsy, reading
##  5 OMIM  159600 Myoclonic epilepsy, Hartung type
##  6 OMIM  159950 Spinal muscular atrophy with progressive myoclonic epilepsy
##  7 OMIM  208700 Ataxia with myoclonic epilepsy and presenile dementia
##  8 OMIM  213000 Cerebellar hypoplasia/atrophy, epilepsy, and global development…
##  9 OMIM  226800 Epilepsy, photogenic, with spastic diplegia and mental retardat…
## 10 OMIM  226810 Celiac disease, epilepsy and cerebral calcification syndrome
## # ℹ 282 more rows

Besides the relational model, no additional constraints are applied to an MDB. This allows for high flexibility in the data that can be managed. However, in some cases, it could be useful to add further constraints to ensure that the data is compatible with specific analysis or integration workflows. In TKCat, this feature is supported by KMR (Knowledge Management Requirements). A KMR object is meant to be shared and centrally managed. MDBs intended to meet these requirements must contain technical tables referring to the corresponding KMR. When grouped in the same TKCat catalog, KMRs and MDBs form a coherent corpus of knowledge that can be leveraged consistently by KMR-tailored functions.
This set of features is described in the vignette Defining and using Requirements for Knowledge Management (KMR) in TKCat.
The ClickHouse docker container supporting TKCat, its initialization and its configuration procedures are implemented here: docker.
Update the Dockerfile to select the version of ClickHouse to use.
Customize and run the following script.
sh launch-tkcat-instance.sh

Specific attention should be paid to available ports: the TCP native port (but not TLS wrapping yet) is supported by the RClickhouse R package whereas HTTP and HTTPS ports are supported by the ClickHouseHTTP R package.
The data are stored in the TKCAT_HOME folder.
When no longer needed, stopping and removing the docker container can be achieved as exemplified below.
# In shell
docker stop test_tkcat
docker rm test_tkcat
docker volume prune -f
# Remove the folder with all the data: `$TKCAT_HOME`.
sudo rm -rf /mnt/data1/pgodard/Services-test/test_tkcat_2025.04.18

User management requires admin rights on the database.
k <- chTKCat(user="pgodard")
create_chTKCat_user(
   k, login="lfrancois", contact=NA, admin=FALSE, provider=TRUE
)

The function will ask for a password to be set for the new user. The admin parameter indicates whether the new user has admin rights on the whole chTKCat instance (default: FALSE). The provider parameter indicates whether the new user can create and populate new databases within the chTKCat instance (default: FALSE).
k <- chTKCat(user="pgodard")
change_chTKCat_password(k, "lfrancois")
update_chTKCat_user(k, login="lfrancois", contact="email", admin=FALSE)

A shiny application can be launched for updating user settings:
manage_chTKCat_users(k)

If this application is deployed, it can be made directly accessible from the explore_MDBs() Shiny application by providing the URL as the userManager parameter.
drop_chTKCat_user(k, login="lfrancois")

Before MDB data can be uploaded, the database should be created. This operation can only be achieved by data providers (see above).
create_chMDB(k, "CHEMBL", public=FALSE)

By default chMDBs are not public. This can be changed through the public parameter when creating the chMDB or by using the set_chMDB_access() function afterward.
set_chMDB_access(k, "CHEMBL", public=TRUE)

Then, users having access to the chMDB can be identified with or without admin rights on the chMDB. Admin rights allow the user to update the chMDB data.
add_chMDB_user(k, "CHEMBL", "lfrancois", admin=TRUE)
# remove_chMDB_user(k, "CHEMBL", "lfrancois")
list_chMDB_users(k, "CHEMBL")

Each chMDB can be populated individually using the as_chMDB() function. The code chunk below shows how to scan a directory for all the fileMDB it contains. The as_memoMDB() function loads all the data in memory and checks that all the model constraints are fulfilled (this step is optional). When the overwrite parameter of the as_chMDB() function is set to FALSE (default), the potential existing version is archived before being updated. When overwrite is set to TRUE, the potential existing version is overwritten without being archived.
lc <- scan_fileMDBs("fileMDB_directory")
## The commented line below allows the exploration of the data models in lc.
# explore_MDBs(lc)
## Names of the MDBs to upload (here: all those found in the directory)
toFeed <- list_MDBs(lc)$name
for(r in toFeed){
   message(r)
   lr <- as_memoMDB(lc[[r]])
   cr <- as_chMDB(lr, k, overwrite=FALSE)
}

Any admin user of a chMDB can delete the corresponding data.
empty_chMDB(k, "CHEMBL")

But only a system admin can drop the chMDB from the ClickHouse database.
drop_chMDB(k, "CHEMBL")

Details about collections are provided in the following appendix.
Collections need to be added to a chTKCat instance in order to support collection members of the different chMDB. They can be taken from the TKCat package environment, from a JSON file or directly from a JSON text variable. Additional functions are available to list and remove chTKCat collections.
add_chTKCat_collection(k, "BE")
list_chTKCat_collections(k)
remove_chTKCat_collection(k, "BE")

The default database stores information about the chTKCat instance, users and user access.
Modeled databases (MDB) are stored in dedicated databases in chTKCat. Their data model is provided in dedicated tables described below.
Some MDBs refer to the same concepts and can be integrated accordingly. However they often use different vocabularies or scopes. Collections are used to identify such concepts and to define a way to document formally the scope used by the different members of these collections. Thanks to this formal description, tools can be used to automatically combine MDBs referring to the same collection but using different scopes, as shown above.
This appendix describes how to create TKCat Collections, document collection members and create functions to support the merging of MDBs.
A collection is defined by a JSON document. This document should fulfill the requirements defined by the Collection-Schema.json. Two collections are available by default in the TKCat package.
list_local_collections()
## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Here is how the BE collection is defined.
get_local_collection("BE")
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "$id":"TKCat_BE_collection_1.0",
  "title": "BE collection",
  "type": "object",
  "description": "Collection of biological entity (BE) concepts",
  "properties": {
    "$schema": {"enum": ["TKCat_BE_collection_1.0"]},
    "$id": {"type": "string"},
    "collection": {"enum":["BE"]},
    "resource": {"type": "string"},
    "tables": {
      "type": "array",
      "minItems": 1,
      "items":{
        "type": "object",
        "properties":{
          "name": {"type": "string"},
          "fields": {
            "type": "object",
            "properties": {
              "be": {
                "type": "object",
                "properties": {
                  "static": {"type": "boolean"},
                  "value": {"type": "string"}
                },
                "required": ["static", "value"],
                "additionalProperties": false
              },
              "source": {
                "type": "object",
                "properties": {
                  "static": {"type": "boolean"},
                  "value": {"type": "string"}
                },
                "required": ["static", "value"],
                "additionalProperties": false
              },
              "organism": {
                "type": "object",
                "properties": {
                  "static": {"type": "boolean"},
                  "value": {"type": "string"},
                  "type": {"enum": ["Scientific name", "NCBI taxon identifier"]}
                },
                "required": ["static", "value", "type"],
                "additionalProperties": false
              },
              "identifier": {
                "type": "object",
                "properties": {
                  "static": {"type": "boolean"},
                  "value": {"type": "string"}
                },
                "required": ["static", "value"],
                "additionalProperties": false
              }
            },
            "required": ["be", "source", "identifier"],
            "additionalProperties": false
          }
        },
        "required": ["name", "fields"],
        "additionalProperties": false
      }
    }
  },
  "required": ["$schema", "$id", "collection", "resource", "tables"],
  "additionalProperties": false
}

A collection should refer to the "TKCat_collections_1.0" $schema. It should then have the following properties:
$id: the identifier of the collection
title: the title of the collection
type: alwaysobject
description: a short description of the collection
properties: the properties that should be provided by collection members. In this case:
$schema: should be the $id of the collection
$id: the identifier of the collection member: a string
collection: should be “BE”
resource: the name of the resource having collection members: a string
tables: an array of tables corresponding to collection members. Each item is a table with the following features:
name: the name of the table
fields: the required fields. According to the schema above, each of be, source, organism and identifier is an object with a static (boolean) and a value (string) property; organism additionally has a type, either "Scientific name" or "NCBI taxon identifier".

The main specifications defined in a JSON document can be displayed in an R session by calling the show_collection_def() function.
get_local_collection("BE") %>% 
   show_collection_def()
## BE collection: Collection of biological entity (BE) concepts
## Arguments (non-mandatory arguments are between parentheses):
##    - be:
##       + static: logical
##       + value: character
##    - source:
##       + static: logical
##       + value: character
##    - (organism):
##       + static: logical
##       + value: character
##       + type: character in 'Scientific name', 'NCBI taxon identifier'
##    - identifier:
##       + static: logical
##       + value: character

Documenting collection members of an MDB can be done by using the add_collection_member() function (as formerly described), or by writing a JSON file like the following one, which corresponds to the BE members of the CHEMBL MDB.
system.file(
   "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
   package="TKCat"
) %>% 
   readLines() %>% paste(collapse="\n")
{
  "$schema": "TKCat_BE_collection_1.0",
  "$id": "CHEMBL_BE_1.0",
  "collection": "BE",
  "resource": "CHEMBL",
  "tables": [
    {
      "name": "CHEMBL_component_sequence",
      "fields": {
        "be": {
          "static": true,
          "value": "Peptide"
        },
        "identifier": {
          "static": false,
          "value": "accession"
        },
        "source": {
          "static": false,
          "value": "db_source"
        },
        "organism": {
          "static": false,
          "value": "organism",
          "type": "Scientific name"
        }
      }
    }
  ]
}

The identification of collection members should fulfill the requirements defined by the collection JSON document, and therefore pass the following validation.
jsonvalidate::json_validate(
   json=system.file(
      "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
      package="TKCat"
   ),
   schema=get_local_collection("BE"),
   engine="ajv"
)
## [1] TRUE

This validation is done automatically when reading a fileMDB object or when setting collection members with the add_collection_member() function.
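To give an intuition of what this validation enforces for the fields of a BE member, the base R sketch below re-implements a few of the schema rules by hand. The check_be_fields() helper is hypothetical and deliberately simplified; it is not how TKCat or jsonvalidate perform the validation.

```r
# Hypothetical, simplified check of the `fields` part of a BE member.
# Returns a character vector of problems (empty when the fields are valid).
check_be_fields <- function(fields) {
   required <- c("be", "source", "identifier")
   allowed <- c(required, "organism")
   org_types <- c("Scientific name", "NCBI taxon identifier")
   problems <- character(0)
   missing_fields <- setdiff(required, names(fields))
   if (length(missing_fields) > 0) {
      problems <- c(problems, paste("missing field:", missing_fields))
   }
   extra_fields <- setdiff(names(fields), allowed)
   if (length(extra_fields) > 0) {
      problems <- c(problems, paste("unexpected field:", extra_fields))
   }
   for (n in intersect(names(fields), allowed)) {
      f <- fields[[n]]
      if (!is.logical(f$static) || length(f$static) != 1) {
         problems <- c(problems, paste0(n, ": 'static' must be TRUE or FALSE"))
      }
      if (!is.character(f$value) || length(f$value) != 1) {
         problems <- c(problems, paste0(n, ": 'value' must be a string"))
      }
      if (n == "organism" && !isTRUE(f$type %in% org_types)) {
         problems <- c(problems, "organism: 'type' is missing or not allowed")
      }
   }
   problems
}

# Fields of the CHEMBL BE member: expected to be valid.
valid_fields <- list(
   be = list(static = TRUE, value = "Peptide"),
   identifier = list(static = FALSE, value = "accession"),
   source = list(static = FALSE, value = "db_source"),
   organism = list(static = FALSE, value = "organism", type = "Scientific name")
)
# Same definition without identifier and without organism type.
invalid_fields <- valid_fields[c("be", "source", "organism")]
invalid_fields$organism$type <- NULL

check_be_fields(valid_fields)
check_be_fields(invalid_fields)
```

The first call returns an empty vector, while the second reports the missing identifier field and the missing organism type, the same kinds of violations the JSON schema is designed to reject.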
The merge.MDB() and the map_collection_members() functions rely on functions to map members of the same collection. When recorded (using the import_collection_mapper() function), these functions can be automatically identified by TKCat. Otherwise, or according to user needs, these functions can be provided using the funs (for merge.MDB()) or the fun (for map_collection_members()) parameters. Two mappers are pre-recorded in TKCat, one for the BE collection and one for the Condition collection. They can be retrieved with the get_collection_mapper() function.
get_collection_mapper("BE")
function (x, y, orthologs = FALSE, restricted = FALSE, ...) 
{
    if (!requireNamespace("BED")) {
        stop("The BED package is required")
    }
    if (!BED::checkBedConn()) {
        stop("You need to connect to a BED database using", 
            " the BED::connectToBed() function")
    }
    if (!"organism" %in% colnames(x)) {
        d <- x
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source == source)
            organism <- BED::guessIdScope(toadd$identifier, be = be, 
                source = source, tcLim = Inf) %>% attr("details") %>% 
                filter(be == !!be & source == !!source) %>% 
                pull(organism) %>% unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        x <- nd %>% mutate(organism_type = "Scientific name")
    }
    if (!"organism" %in% colnames(y)) {
        d <- y
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source == source)
            organism <- BED::guessIdScope(toadd$identifier, be = be, 
                source = source, tcLim = Inf) %>% attr("details") %>% 
                filter(be == !!be & source == !!source) %>% 
                pull(organism) %>% unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        y <- nd %>% mutate(organism_type = "Scientific name")
    }
    xscopes <- dplyr::distinct(x, be, source, organism, organism_type)
    yscopes <- dplyr::distinct(y, be, source, organism, organism_type)
    toRet <- NULL
    for (i in 1:nrow(xscopes)) {
        xscope <- xscopes[i, ]
        if (any(apply(xscope, 2, is.na))) {
            (next)()
        }
        xi <- dplyr::right_join(x, xscope, by = c("be", "source", 
            "organism", "organism_type"))
        xorg <- ifelse(xscope$organism_type == "NCBI taxon identifier", 
            BED::getOrgNames(xscope$organism) %>% 
                dplyr::filter(nameClass == "scientific name") %>% 
                dplyr::pull(name), xscope$organism)
        for (j in 1:nrow(yscopes)) {
            yscope <- yscopes[j, ]
            if (any(apply(yscope, 2, is.na))) {
                (next)()
            }
            yi <- dplyr::right_join(y, yscope, by = c("be", "source", 
                "organism", "organism_type"))
            yorg <- ifelse(yscope$organism_type == "NCBI taxon identifier", 
                BED::getOrgNames(yscope$organism) %>% 
                  dplyr::filter(nameClass == "scientific name") %>% 
                  dplyr::pull(name), yscope$organism)
            if (xorg == yorg || orthologs) {
                xy <- BED::convBeIds(ids = xi$identifier, from = xscope$be, 
                  from.source = xscope$source, from.org = xorg, 
                  to = yscope$be, to.source = yscope$source, 
                  to.org = yorg, restricted = restricted) %>% 
                  dplyr::as_tibble() %>% dplyr::select(from, to)
                if (restricted) {
                  xy <- dplyr::bind_rows(xy, BED::convBeIds(ids = yi$identifier, 
                    from = yscope$be, from.source = yscope$source, 
                    from.org = yorg, to = xscope$be, to.source = xscope$source, 
                    to.org = xorg, restricted = restricted) %>% 
                    dplyr::as_tibble() %>% dplyr::select(to = from, from = to))
                }
                xy <- xy %>% dplyr::rename(identifier_x = "from", 
                  identifier_y = "to") %>% dplyr::mutate(be_x = xscope$be, 
                  source_x = xscope$source, organism_x = xscope$organism, 
                  be_y = yscope$be, source_y = yscope$source, 
                  organism_y = yscope$organism)
                toRet <- dplyr::bind_rows(toRet, xy)
            }
        }
    }
    toRet <- dplyr::distinct(toRet)
    return(toRet)
}

A mapper function must have at least an x and a y parameter. Each of them should be a data.frame with all the field values corresponding to the fields defined in the collection. Additional parameters can be defined and will be forwarded using "...". This function should return a data frame with all the field values, suffixed with “_x” and “_y” accordingly.
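As an illustration of this contract, here is a minimal sketch of a custom mapper written in base R. The identity_mapper() function is hypothetical: it performs no conversion and only links members that share exactly the same scope and identifier. The source name used in the toy data is illustrative, and whether such a function is sufficient for a given call to merge.MDB() is an assumption that has not been verified here.

```r
# Hypothetical mapper: links records of x and y having the same be, source,
# organism and identifier. Extra arguments are accepted but ignored.
identity_mapper <- function(x, y, ...) {
   fields <- c("be", "source", "organism", "identifier")
   stopifnot(all(fields %in% names(x)), all(fields %in% names(y)))
   xy <- merge(unique(x[fields]), unique(y[fields]), by = fields)
   # Return every field twice, with the "_x" and "_y" suffixes.
   data.frame(
      be_x = xy$be, source_x = xy$source,
      organism_x = xy$organism, identifier_x = xy$identifier,
      be_y = xy$be, source_y = xy$source,
      organism_y = xy$organism, identifier_y = xy$identifier,
      stringsAsFactors = FALSE
   )
}

# Toy members using Entrez gene identifiers listed in entrezNames.
x <- data.frame(
   be = "Gene", source = "ExampleSource", organism = "Homo sapiens",
   identifier = c("1509", "1903"), stringsAsFactors = FALSE
)
y <- data.frame(
   be = "Gene", source = "ExampleSource", organism = "Homo sapiens",
   identifier = c("1903", "3300"), stringsAsFactors = FALSE
)
mapped <- identity_mapper(x, y)
print(mapped)
```

A real mapper would replace the merge() step with calls to a conversion service, as the pre-recorded BE mapper does with BED, while keeping the same input and output shapes.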
Most of the data formats and data types supported by the ReDaMoR and the TKCat packages are taken into account in the examples described in the main sections of this vignette. Nevertheless, one specific data format (matrix) and one specific data type (base64) are not exemplified. This appendix provides a short description of this format and this type.
ReDaMoR and TKCat support data frame and matrix objects. Data frame is by far the most used data format. However, matrices of values can be useful in some use cases. The example below shows how such a data format is modeled in ReDaMoR as a 3-column table: one column of type “row” corresponding to the row names of the matrix, one of type “column” corresponding to the column names of the matrix, and one of any other type (except “row”, “column”, or “base64”).
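Before looking at the ReDaMoR model, the correspondence between a matrix and its 3-column representation can be sketched in base R. This is only a conceptual illustration with made-up values; it does not reflect how ReDaMoR or TKCat store matrices internally.

```r
# Small matrix with named rows (genes) and columns (samples).
m <- matrix(
   1:6, nrow = 2,
   dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))
)

# Long, 3-column view: one line per (row, column) pair.
long <- data.frame(
   gene = rep(rownames(m), times = ncol(m)),   # plays the "row" role
   sample = rep(colnames(m), each = nrow(m)),  # plays the "column" role
   expression = as.vector(m),                  # the values themselves
   stringsAsFactors = FALSE
)
print(long)

# The matrix can be rebuilt from the long view.
back <- matrix(
   long$expression,
   nrow = length(unique(long$gene)),
   dimnames = list(unique(long$gene), unique(long$sample))
)
print(identical(m, back))
```

The row and column names are what the “row” and “column” field types document in the data model, the third field describing the values held by the matrix cells.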
d <- matrix(
   rnorm(40), nrow=10,
   dimnames=list(
      paste0("g", 1:10),
      paste0("s", 1:4)
   )
)
m <- ReDaMoR::df_to_model(d) %>% 
   ReDaMoR::rename_field("d", "row", "gene") %>% 
   update_field("d", "gene", comment="Gene identifier") %>% 
   ReDaMoR::rename_field("d", "column", "sample") %>% 
   update_field("d", "sample", comment="Sample identifier") %>% 
   ReDaMoR::rename_field("d", "value", "expression") %>% 
   update_field(
      "d", "expression", nullable=FALSE, comment="Gene expression value"
   )
md <- memoMDB(list(d=d), m, list(name="Matrix example"))
plot(data_model(md))

Whole documents can be stored in an MDB as “base64” character values. The example below shows how a document can be put in a table and the corresponding data model.
ch_config_files <- tibble(
   name=c("config.xml", "users.xml"),
   file=c(
      base64enc::base64encode(
         system.file("ClickHouse/config.xml", package="TKCat")
      ),
      base64enc::base64encode(
         system.file("ClickHouse/users.xml", package="TKCat")
      )
   )
)
m <- df_to_model(ch_config_files) %>% 
   update_field(
      ## The file name is plain text: only the content is base64 encoded
      "ch_config_files", "name", type="character",
      comment="Name of the config file", nullable=FALSE, unique=TRUE
   ) %>% 
   update_field(
      "ch_config_files", "file", type="base64",
      comment="Config file in base64 format", nullable=FALSE
   )
md <- memoMDB(
   list(ch_config_files=ch_config_files), m, list(name="base64 example")
)
plot(data_model(md))