Introduction to TKCat

Patrice Godard

June 05, 2025

1 Introduction

Research organizations generate, manage, and use more and more knowledge resources which can be highly heterogeneous in their origin, their scope, and their structure. Making this knowledge compliant with F.A.I.R. (Findable, Accessible, Interoperable, Reusable) principles is critical for facilitating the generation of new insights leveraging it. The aim of the TKCat (Tailored Knowledge Catalog) R package is to facilitate the management of such resources that are frequently used alone or in combination in research environments.

In TKCat, knowledge resources are manipulated as modeled database (MDB) objects. These objects provide access to the data tables along with a general description of the resource and a detailed data model generated with ReDaMoR documenting the tables, their fields and their relationships. These MDBs are then gathered in catalogs that can be easily explored and shared. TKCat provides tools to easily subset, filter and combine MDBs and create new catalogs suited for specific needs.

Currently, there are 3 different implementations of MDBs which are supported by TKCat: in R memory (memoMDB), in files (fileMDB) and in ClickHouse (chMDB).

This document is divided into four main sections:

The first one describes how to build an MDB object, starting with a minimal example
The second section shows how to interact with MDB objects to extract and combine information of interest
The third section focuses on the use of the ClickHouse implementation of MDB (chMDB)
The fourth section corresponds to appendices providing technical information regarding ClickHouse related admin tasks and the implementation of collections which are used to identify and leverage potential relationships between different MDBs.

2 Create an MDB: a minimal example

This section shows how to create an MDB object starting from a set of tables in three steps:

Create a data model
Create and validate a modeled database (MDB) by binding the data model to the dataset
Document concept collections that can be used to make bridges across different MDBs

This example focuses on the Human Phenotype Ontology (HPO). The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human diseases (Köhler et al. 2019).

2.1 Loading example data

A subset of the HPO is provided within the ReDaMoR package. We can read some of the tables as follows:

library(readr)
hpo_data_dir <- system.file("examples/HPO-subset", package="ReDaMoR")

The HPO_hp table gathers human phenotype identifiers, names and descriptions:

HPO_hp <- readr::read_tsv(
   file.path(hpo_data_dir, "HPO_hp.txt")
)
HPO_hp

## # A tibble: 500 × 4
##    id      name                                              description   level
##    <chr>   <chr>                                             <chr>         <dbl>
##  1 0000002 Abnormality of body height                        Deviation fr…     3
##  2 0000009 Functional abnormality of the bladder             Dysfunction …     6
##  3 0000014 Abnormality of the bladder                        An abnormali…     5
##  4 0000017 Nocturia                                          Abnormally i…     7
##  5 0000019 Urinary hesitancy                                 Difficulty i…     7
##  6 0000021 Megacystis                                        Dilatation o…     8
##  7 0000022 Abnormality of male internal genitalia            An abnormali…     6
##  8 0000024 Prostatitis                                       The presence…     8
##  9 0000025 Functional abnormality of male internal genitalia <NA>              6
## 10 0000030 Testicular gonadoblastoma                         The presence…     9
## # ℹ 490 more rows

The HPO_diseases table gathers disease identifiers and labels from different disease databases.

HPO_diseases <- readr::read_tsv(
   file.path(hpo_data_dir, "HPO_diseases.txt")
)
HPO_diseases

## # A tibble: 1,903 × 3
##    db           id label
##    <chr>     <dbl> <chr>
##  1 DECIPHER     15 NF1-microdeletion syndrome
##  2 DECIPHER     45 Xq28 (MECP2) duplication
##  3 DECIPHER     65 ATR-16 syndrome
##  4 OMIM     100050 AARSKOG SYNDROME, AUTOSOMAL DOMINANT
##  5 OMIM     100650 ALDEHYDE DEHYDROGENASE 2 FAMILY
##  6 OMIM     101800 ACRODYSOSTOSIS 1, WITH OR WITHOUT HORMONE RESISTANCE; ACRDYS1
##  7 OMIM     102500 HAJDU-CHENEY SYNDROME; HJCYS
##  8 OMIM     102510 ACROPECTOROVERTEBRAL DYSPLASIA, F-FORM OF
##  9 OMIM     102700 SEVERE COMBINED IMMUNODEFICIENCY, AUTOSOMAL RECESSIVE, T CEL…
## 10 OMIM     102800 ADENOSINE TRIPHOSPHATASE DEFICIENCY, ANEMIA DUE TO
## # ℹ 1,893 more rows

The HPO_diseaseHP table indicates which phenotypes are triggered by each disease.

HPO_diseaseHP <- readr::read_tsv(
   file.path(hpo_data_dir, "HPO_diseaseHP.txt")
)
HPO_diseaseHP

## # A tibble: 2,594 × 3
##    db           id hp
##    <chr>     <dbl> <chr>
##  1 ORPHA    140976 0000002
##  2 ORPHA       432 0000002
##  3 DECIPHER     45 0000009
##  4 OMIM     300076 0000009
##  5 ORPHA    100996 0000009
##  6 ORPHA    100997 0000009
##  7 ORPHA      2571 0000009
##  8 ORPHA    391487 0000009
##  9 ORPHA    488594 0000009
## 10 ORPHA     71211 0000009
## # ℹ 2,584 more rows

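2.2 Creating a data model with ReDaMoR

The ReDaMoR package can be used for drafting a data model from a set of tables:

mhpo_dm <- ReDaMoR::df_to_model(HPO_hp, HPO_diseases, HPO_diseaseHP)
# igraph_available is not defined in the code shown here; it is assumed to be
# set beforehand, e.g. with requireNamespace("igraph", quietly=TRUE)
if(igraph_available){
   mhpo_dm %>%
      ReDaMoR::auto_layout(lengthMultiplier=80) %>%
      plot()
}else{
   mhpo_dm %>%
      plot()
}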

This data model is minimal: only the name of the tables, their fields and their types are documented. There is no additional constraint regarding the uniqueness or the completeness of the fields. Also, there is no information regarding the relationships between the different tables. The model_relational_data() function can be used to improve the documentation of the dataset according to what we know about it. This function opens a graphical interface for manipulating and modifying the data model (see ReDaMoR documentation).

mhpo_dm <- ReDaMoR::model_relational_data(mhpo_dm)

Below is the model we get after completing it using the function above.

plot(mhpo_dm)

In this model, we can see that:

id is the primary key of the HPO_hp table, and therefore this field must be unique;
db/id form the primary key of the HPO_diseases table and must also be unique when taken together;
all the fields except description (in the HPO_hp table) are complete (they cannot be NA);
the HPO_diseaseHP table refers to the HPO_hp table using its hp field and to the HPO_diseases table using its db and id fields (such details are shown when putting the cursor over the edges).

Moreover, some comments are added at the table and at the field level to give a better understanding of the data (shown when putting the cursor over the tables).
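The constraints listed above can also be checked by hand. The sketch below is a minimal base R illustration on three toy tables (their content is made up for this example and only mimics the structure of the HPO subset). It is not how ReDaMoR or TKCat implement their own checks.

```r
# Toy tables mimicking the structure of the HPO example (values are made up)
hp <- data.frame(
   id=c("0000002", "0000009"),
   name=c("Phenotype A", "Phenotype B"),
   description=c("Some description", NA),
   level=c(3L, 6L),
   stringsAsFactors=FALSE
)
diseases <- data.frame(
   db=c("ORPHA", "OMIM"),
   id=c("432", "300076"),
   label=c("Disease A", "Disease B"),
   stringsAsFactors=FALSE
)
disease_hp <- data.frame(
   db=c("ORPHA", "OMIM"),
   id=c("432", "300076"),
   hp=c("0000002", "0000009"),
   stringsAsFactors=FALSE
)

# Primary keys: no duplicated value (or combination of values)
hp_key_ok <- !anyDuplicated(hp$id)
diseases_key_ok <- !anyDuplicated(diseases[, c("db", "id")])

# Completeness: every field except hp$description must be free of NA
hp_complete <- !any(is.na(hp[, setdiff(names(hp), "description")]))
diseases_complete <- !any(is.na(diseases))

# Foreign keys: every reference must exist in the parent table
fk_hp_ok <- all(disease_hp$hp %in% hp$id)
fk_diseases_ok <- all(
   paste(disease_hp$db, disease_hp$id) %in% paste(diseases$db, diseases$id)
)

c(
   hp_key_ok, diseases_key_ok, hp_complete, diseases_complete,
   fk_hp_ok, fk_diseases_ok
)
```

Each element of the final vector is TRUE only when the corresponding constraint holds, so a single FALSE points at the rule that the data break.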

2.3 Binding the model to the data in an MDB object

The data model can be explicitly bound to the data in an MDB (Modeled DataBase) object as shown below. However, when trying to build the object with the tables we’ve read and the data model we have edited, we get the following error message.

mhpo_db <- memoMDB(
   dataTables=list(
      HPO_hp=HPO_hp, HPO_diseases=HPO_diseases, HPO_diseaseHP=HPO_diseaseHP
   ),
   dataModel=mhpo_dm,
   dbInfo=list(name="miniHPO")
)

miniHPO

FAILURE

Check configuration

Optional checks: unique, not nullable, foreign keys
Maximum number of records: Inf

HPO_hp

FAILURE

Field issues or warnings

description: SUCCESS Missing values 117/500 = 23%
level: FAILURE Unexpected “numeric”

HPO_diseases

FAILURE

Field issues or warnings

id: FAILURE Unexpected “numeric”

HPO_diseaseHP

FAILURE

Field issues or warnings

id: FAILURE Unexpected “numeric”

Indeed, according to the edited model (not the very first one automatically created by ReDaMoR), the HPO_hp$level field should contain integer values and the HPO_diseases$id and HPO_diseaseHP$id fields should contain character values. The type of the data is among the data model features that are automatically checked when building an MDB object (along with uniqueness or NA values for example).
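The kind of type mismatch reported above can be reproduced with a few lines of base R. This is only a sketch: the `expected_types` vector is a hypothetical stand-in for the types declared in a data model, and the comparison is not the one performed internally by ReDaMoR.

```r
# Types declared in a (hypothetical) model for one table
expected_types <- c(id="character", name="character", level="integer")

# A table as it might come out of automatic type guessing
hp <- data.frame(
   id=c("0000002", "0000009"),
   name=c("Phenotype A", "Phenotype B"),
   level=c(3, 6),   # numeric (double), not integer
   stringsAsFactors=FALSE
)

# Compare the class of each column with the declared type
actual_types <- vapply(hp, function(x) class(x)[1], character(1))
mismatches <- names(expected_types)[
   actual_types[names(expected_types)] != expected_types
]
mismatches

# Converting the column removes the mismatch
hp$level <- as.integer(hp$level)
fixed_types <- vapply(hp, function(x) class(x)[1], character(1))
all(fixed_types[names(expected_types)] == expected_types)
```

Here only the `level` column is flagged, because its values were stored as doubles while the model expects integers.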

To avoid this error, we can either change the type of the columns of the data tables:

HPO_hp <- mutate(HPO_hp, level=as.integer(level))
HPO_diseases <- mutate(HPO_diseases, id=as.character(id))
HPO_diseaseHP <- mutate(HPO_diseaseHP, id=as.character(id))
mhpo_db <- memoMDB(
   dataTables=list(
      HPO_hp=HPO_hp, HPO_diseases=HPO_diseases, HPO_diseaseHP=HPO_diseaseHP
   ),
   dataModel=mhpo_dm,
   dbInfo=list(name="miniHPO")
)

Or we can use the data model to read the data in a fileMDB object:

f_mhpo_db <- read_fileMDB(
   path=hpo_data_dir,
   dbInfo=list(name="miniHPO"),
   dataModel=mhpo_dm
)

## miniHPO
## SUCCESS
##
## Check configuration
##    - Optional checks:
##    - Maximum number of records: 10

The read_fileMDB() function identifies the text files to read in path according to the dataModel. It uses the types documented in the data model to read the files. By default, the field delimiter is \t, but another can be defined by writing a delim slot in the dbInfo parameter (e.g. dbInfo=list(name="miniHPO", delim="\t")).
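Why reading files with declared types matters can be illustrated with base R alone. The sketch below uses read.delim() as a stand-in (it does not reproduce what read_fileMDB() does internally) and a small temporary file created for the occasion.

```r
# Write a small tab-delimited file with identifiers that look like numbers
tsv_path <- tempfile(fileext=".txt")
writeLines(
   c("id\tlevel", "0000002\t3", "0000009\t6"),
   tsv_path
)

# Letting the reader guess the types turns the identifiers into numbers
guessed <- read.delim(tsv_path, sep="\t")
guessed$id

# Declaring the types keeps the identifiers as they are in the file
typed <- read.delim(
   tsv_path, sep="\t",
   colClasses=c(id="character", level="integer")
)
typed$id
```

With guessed types the leading zeros of the identifiers are lost, whereas declaring the column as character keeps them intact.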

As shown in the message above, by default, read_fileMDB() does not perform optional checks (unique fields, not nullable fields, foreign keys) and it only checks data on the first 10 records. Also, the fileMDB data are not loaded in memory until requested by the user. The object is therefore smaller than the memoMDB object even though they gather the same information.

print(object.size(mhpo_db), units="Kb")

## 691.9 Kb

print(object.size(f_mhpo_db), units="Kb")

## 23.5 Kb
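The size difference comes from where the data live. The following base R sketch illustrates the general idea with two plain lists; the names `in_memory`, `on_file` and `get_data()` are hypothetical and the code does not reflect the actual structure of memoMDB or fileMDB objects.

```r
# A small tab-delimited file standing in for an MDB data file
tsv_path <- tempfile(fileext=".txt")
write.table(
   data.frame(id=sprintf("%07d", 1:5000), level=rep(1:10, 500)),
   file=tsv_path, sep="\t", quote=FALSE, row.names=FALSE
)
read_table <- function(path){
   read.delim(path, colClasses=c(id="character", level="integer"))
}

# "In memory" flavour: the data are read immediately and kept in the object
in_memory <- list(path=tsv_path, data=read_table(tsv_path))

# "On file" flavour: only the path is kept; the data are read on request
on_file <- list(path=tsv_path)
get_data <- function(x){
   if(is.null(x$data)) read_table(x$path) else x$data
}

# The object pointing to the file is smaller but gives access to the same data
smaller <- as.numeric(object.size(on_file)) < as.numeric(object.size(in_memory))
same_data <- identical(get_data(on_file), get_data(in_memory))
c(smaller, same_data)
```

The price of the smaller object is that the file has to be read each time the data are requested.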

compare_MDB(former=mhpo_db, new=f_mhpo_db) %>%
   DT::datatable(
      rownames=FALSE,
      width="75%",
      options=list(dom="t", pageLength=nrow(.))
   )

2.4 Adding information about an MDB

In the table above we can see that several pieces of information are expected in an MDB object even if not mandatory (title, description, url, version, maintainer, timestamp). They can be provided in the dbInfo parameter of the MDB creator function (e.g. memoMDB()) or added afterward:

title, description and url are used to give more details about the scope of the data and their origin.

db_info(mhpo_db)$title <- "Very small extract of the human phenotype ontology"
db_info(mhpo_db)$description <- "For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted"
db_info(mhpo_db)$url <- "https://hpo.jax.org/"

version and maintainer are related to db information and the data model whereas timestamp should be used to document the data themselves.

db_info(mhpo_db)$version <- "0.1"
db_info(mhpo_db)$maintainer <- "Patrice Godard"
db_info(mhpo_db)$timestamp <- Sys.time()

All this information is displayed when printing the object:

mhpo_db

## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
##
## No collection member
##
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
##
## Timestamp: 2025-06-05 06:05:11.909539
##

2.5 Documenting collection members

In the HPO example, one table regards human phenotypes (HPO_hp) and another human diseases (HPO_diseases). These concepts are general and referenced in many other knowledge or data resources (e.g. databases providing information about disease genetics). Therefore, formally documenting such concepts will help to identify how to connect the HPO example to other resources referencing the same or related concepts.

In TKCat, these central concepts are referred to as members of collections. Collections are pre-defined and members must be documented according to this definition. There are currently two collections provided within the TKCat package:

list_local_collections()

## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Additional collections can be defined by users according to their needs. Further information about collections implementation is provided in the appendix.

So far, there is no collection member documented in the HPO example described above, as indicated by the “No collection member” statement displayed when printing the object:

mhpo_db

## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
##
## No collection member
##
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
##
## Timestamp: 2025-06-05 06:05:11.909539
##

However, as just discussed, the HPO_hp table refers to human phenotypes and the HPO_diseases table to human diseases. These concepts correspond to conditions and those tables can be documented as members of the Condition collection.

Condition members are documented by calling the add_collection_member() function on the MDB object. The two other main arguments are the name of the collection and the name of the table in the MDB object. The other arguments to be provided depend on the collection. For Condition members, three additional arguments must be provided:

condition: indicates the type of the condition (“Phenotype” or “Disease”)
source: a reference source of the condition identifier
identifier: a condition identifier

The functions get_local_collection() and show_collection_def() can be used together to identify valid arguments:

get_local_collection("Condition") %>%
   show_collection_def()

## Condition collection: Collection of condition concepts
## Arguments (non-mandatory arguments are between parentheses):
##    - condition:
##       + static: logical
##       + value: character
##    - source:
##       + static: logical
##       + value: character
##    - identifier:
##       + static: logical
##       + value: character

When calling add_collection_member(), these arguments must be provided as a list with 2 elements named “value” (a character) and “static” (a logical). If “static” is TRUE, “value” corresponds to the information shared by all the rows of the table. If “static” is FALSE, “value” indicates the name of the column which provides this information for each row.
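The meaning of the “static” and “value” elements can be made concrete with a short base R sketch. The helper `resolve_member_field()` below is hypothetical (it is not part of TKCat); it only shows how such a specification could be expanded into one value per row of a table.

```r
# Hypothetical helper (not part of TKCat): expand one collection member
# argument into one value per row of the table
resolve_member_field <- function(table, arg){
   stopifnot(is.list(arg), is.logical(arg$static), is.character(arg$value))
   if(arg$static){
      rep(arg$value, nrow(table))        # same value for all the rows
   }else{
      as.character(table[[arg$value]])   # values taken from the named column
   }
}

diseases <- data.frame(
   db=c("DECIPHER", "OMIM"),
   id=c("15", "100050"),
   stringsAsFactors=FALSE
)
resolved <- data.frame(
   condition=resolve_member_field(
      diseases, list(value="Disease", static=TRUE)
   ),
   source=resolve_member_field(diseases, list(value="db", static=FALSE)),
   identifier=resolve_member_field(diseases, list(value="id", static=FALSE)),
   stringsAsFactors=FALSE
)
resolved
```

In this toy table, every row gets the same condition type, while the source and the identifier are read row by row from the `db` and `id` columns.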

The example below shows how the HPO_hp table is documented as a member of the Condition collection.

mhpo_db$HPO_hp

## # A tibble: 500 × 4
##    id      name                                              description   level
##    <chr>   <chr>                                             <chr>         <int>
##  1 0000002 Abnormality of body height                        Deviation fr…     3
##  2 0000009 Functional abnormality of the bladder             Dysfunction …     6
##  3 0000014 Abnormality of the bladder                        An abnormali…     5
##  4 0000017 Nocturia                                          Abnormally i…     7
##  5 0000019 Urinary hesitancy                                 Difficulty i…     7
##  6 0000021 Megacystis                                        Dilatation o…     8
##  7 0000022 Abnormality of male internal genitalia            An abnormali…     6
##  8 0000024 Prostatitis                                       The presence…     8
##  9 0000025 Functional abnormality of male internal genitalia <NA>              6
## 10 0000030 Testicular gonadoblastoma                         The presence…     9
## # ℹ 490 more rows

mhpo_db <- add_collection_member(
   mhpo_db, collection="Condition", table="HPO_hp",
   condition=list(value="Phenotype", static=TRUE),
   source=list(value="HP", static=TRUE),
   identifier=list(value="id", static=FALSE)
)

All rows in this table correspond to a condition of type “Phenotype” (condition=list(value="Phenotype", static=TRUE)). The phenotype identifiers are all taken from the same source, “HP” (source=list(value="HP", static=TRUE)). The phenotype identifiers are provided in the “id” column of the table (identifier=list(value="id", static=FALSE)).

The example below shows how the HPO_diseases table is also documented as a member of the Condition collection. In this case, the source of the disease identifier can differ from one row to the other and is provided in the “db” column (source=list(value="db", static=FALSE)).

mhpo_db <- add_collection_member(
   mhpo_db, collection="Condition", table="HPO_diseases",
   condition=list(value="Disease", static=TRUE),
   source=list(value="db", static=FALSE),
   identifier=list(value="id", static=FALSE)
)

Now, the existence of collection members is shown when printing the MDB object:

mhpo_db

## memoMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
##
## Collection members:
##    - 2 Condition members
##
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
##
## Timestamp: 2025-06-05 06:05:11.909539
##

And the documented collection members of an MDB can be displayed as follows:

collection_members(mhpo_db)

## # A tibble: 6 × 9
##   collection cid                   resource   mid table field static value type
##   <chr>      <chr>                 <chr>    <int> <chr> <chr> <lgl>  <chr> <chr>
## 1 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… cond… TRUE   Phen… <NA>
## 2 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… sour… TRUE   HP    <NA>
## 3 Condition  miniHPO_Condition_1.0 miniHPO      1 HPO_… iden… FALSE  id    <NA>
## 4 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… cond… TRUE   Dise… <NA>
## 5 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… sour… FALSE  db    <NA>
## 6 Condition  miniHPO_Condition_1.0 miniHPO      2 HPO_… iden… FALSE  id    <NA>

The use of collection members to link or integrate different MDBs will be described later in this document.

2.6 Writing an MDB in files

Once an MDB has been created and documented, it can be written to a directory:

tmpDir <- tempdir()
as_fileMDB(mhpo_db, path=tmpDir, htmlModel=FALSE)

The structure of the created directory is the following:

## miniHPO
##  ¦--DESCRIPTION.json
##  ¦--data
##  ¦   ¦--HPO_diseaseHP.txt.gz
##  ¦   ¦--HPO_diseases.txt.gz
##  ¦   °--HPO_hp.txt.gz
##  °--model
##      ¦--Collections
##      ¦   °--Condition-miniHPO_Condition_1.0.json
##      °--miniHPO.json

All the data are in the data folder whereas the data model and collection members are written in JSON files in the model folder. The DESCRIPTION.json file gathers db information and information about how to read the data files (i.e. delim, na).

This folder can be shared and it’s then easy to get all the data and the corresponding documentation from it back in R:

read_fileMDB(file.path(tmpDir, "miniHPO"))

## miniHPO
## SUCCESS
##
## Check configuration
##    - Optional checks:
##    - Maximum number of records: 10

## fileMDB miniHPO (version 0.1, Patrice Godard): Very small extract of the human phenotype ontology
##    - 3 tables with 10 fields
##
## Collection members:
##    - 2 Condition members
##
## For demonstrating ReDaMoR and TKCat capabilities, a very few information from the HPO (human phenotype ontology) has been extracted
## (https://hpo.jax.org/)
##
## Timestamp: 2025-06-05 06:05:11
##

Also, writing these data and related information in text files makes them convenient to share with people using analytical environments other than R.
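Because the data files are plain compressed text, they can be read without TKCat. The base R sketch below assumes a gzipped, tab-delimited file with a header line, as suggested by the directory structure shown above. It creates its own small file in a temporary folder (the "miniHPO-sketch" name is made up) and it ignores the reading options recorded in DESCRIPTION.json.

```r
# Mimic the layout of one data file (assumed: gzipped, tab-delimited, header)
data_dir <- file.path(tempdir(), "miniHPO-sketch", "data")
dir.create(data_dir, recursive=TRUE, showWarnings=FALSE)
gz_path <- file.path(data_dir, "HPO_diseases.txt.gz")
con <- gzfile(gz_path, "w")
writeLines(
   c("db\tid\tlabel", "DECIPHER\t15\tNF1-microdeletion syndrome"),
   con
)
close(con)

# Read it back with base R only; gzipped files are decompressed on the fly
diseases <- read.delim(
   gz_path, sep="\t", colClasses="character", quote=""
)
diseases
```

Any environment able to read gzipped delimited text can follow the same approach, using the JSON files of the model folder as documentation of the expected types and relationships.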

3 Leveraging MDB

The former section showed how to create and save an MDB object. This section describes how MDBs can be used, filtered and combined to efficiently leverage their content.

As a reminder, a modeled database (MDB) in TKCat gathers the following information:

General database information including a mandatory name and optionally the following fields: title, description, url, version and maintainer.
A ReDaMoR data model.
A list of tables corresponding to reference concepts shared by different MDBs. The way these concepts are identified is defined in specific documents called collections.
The data themselves organized according to the data model.

3.1 Loading example data

To illustrate how MDBs can be used, some example data are provided within the ReDaMoR and the TKCat packages. The following paragraphs show how to load them in the R session.

3.1.1 HPO

A subset of the Human Phenotype Ontology (HPO) is provided within the ReDaMoR package. The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human diseases (Köhler et al. 2019). An MDB object based on files (see MDB implementations) can be read as shown below. As explained above, the data provided by the path parameter are documented with a model (dataModel parameter) and general information (dbInfo parameter).

file_hpo <- read_fileMDB(
   path=system.file("examples/HPO-subset", package="ReDaMoR"),
   dataModel=system.file("examples/HPO-model.json", package="ReDaMoR"),
   dbInfo=list(
      "name"="HPO",
      "title"="Data extracted from the HPO database",
      "description"=paste(
         "This is a very small subset of the HPO!",
         "Visit the reference URL for more information."
      ),
      "url"="http://human-phenotype-ontology.github.io/"
   )
)

## HPO
## SUCCESS
##
## Check configuration
##    - Optional checks:
##    - Maximum number of records: 10

The message displayed in the console indicates whether the data fit the data model. It relies on the ReDaMoR::confront_data() function and checks by default the first 10 rows of each file.

The data model can then be drawn.

plot(data_model(file_hpo))

The data model shows that this MDB contains the 3 tables taken into account in the minimal example. The additional tables mainly provide supplementary details regarding phenotypes and diseases. Still, the HPO_hp and the HPO_diseases tables are members of the Condition collection and can be documented as such, as explained above.

file_hpo <- file_hpo %>%
   add_collection_member(
      collection="Condition", table="HPO_hp",
      condition=list(value="Phenotype", static=TRUE),
      source=list(value="HP", static=TRUE),
      identifier=list(value="id", static=FALSE)
   ) %>%
   add_collection_member(
      collection="Condition", table="HPO_diseases",
      condition=list(value="Disease", static=TRUE),
      source=list(value="db", static=FALSE),
      identifier=list(value="id", static=FALSE)
   )

3.1.2 ClinVar

A subset of the ClinVar database is provided within this package. ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence (Landrum et al. 2018). This resource can be read as a fileMDB as shown above. However, in this case all the documenting information is included in the resource directory, making it easier to read, as explained above.

file_clinvar <- read_fileMDB(
   path=system.file("examples/ClinVar", package="TKCat")
)

## ClinVar
## SUCCESS
##
## Check configuration
##    - Optional checks:
##    - Maximum number of records: 10

file_clinvar

## fileMDB ClinVar (version 0.9, Patrice Godard <patrice.godard@ucb.com>): Data extracted from the ClinVar database
##    - 21 tables with 86 fields
##
## Collection members:
##    - 1 BE member
##    - 2 Condition members
##
## ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information.
## (https://www.ncbi.nlm.nih.gov/clinvar/)
##
##

3.1.3 CHEMBL

Similarly, a self-documented subset of the CHEMBL database is also provided in the TKCat package. It can be read the same way.

file_chembl <- read_fileMDB(
   path=system.file("examples/CHEMBL", package="TKCat")
)

## CHEMBL
## SUCCESS
##
## Check configuration
##    - Optional checks:
##    - Maximum number of records: 10

CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties (Mendez et al. 2019).

file_chembl

## fileMDB CHEMBL (version 0.2, Liesbeth François <liesbeth.francois@ucb.com>): Data extracted from the CHEMBL database
##    - 10 tables with 61 fields
##
## Collection members:
##    - 1 BE member
##    - 1 Condition member
##
## CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties. This is a very small subset of CHEMBL! Visit the reference URL for more information.
## (https://www.ebi.ac.uk/chembl/)
##
##

3.2 MDB implementations

There are 3 main implementations of MDBs:

fileMDB objects keep the data in files and load them only when requested by the user. This implementation is the first one used when reading an MDB, as demonstrated in the examples above.
memoMDB objects have all the data loaded in memory. These objects are very easy to use but can take time to load and can use a lot of memory.
chMDB objects get the data from a ClickHouse database providing a catalog of MDBs as described in the dedicated section.

The different implementations can be converted to each other using the as_fileMDB(), as_memoMDB() and as_chMDB() functions.

memo_clinvar <- as_memoMDB(file_clinvar)
object.size(file_clinvar) %>% print(units="Kb")

## 155.2 Kb

object.size(memo_clinvar) %>% print(units="Kb")

## 760.5 Kb

A fourth implementation is metaMDB which combines several MDBs glued together with relational tables (see the Merging with collections part).

Most of the functions described below work with any MDB implementation, and a few functions are specific to each implementation.

3.3 Exploring information

General information can be retrieved (and potentially updated) using the db_info() function.

db_info(file_clinvar)

## $name
## [1] "ClinVar"
##
## $title
## [1] "Data extracted from the ClinVar database"
##
## $description
## [1] "ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information."
##
## $url
## [1] "https://www.ncbi.nlm.nih.gov/clinvar/"
##
## $version
## [1] "0.9"
##
## $maintainer
## [1] "Patrice Godard <patrice.godard@ucb.com>"
##
## $timestamp
## [1] NA

As shown above, the data model of an MDB can be retrieved and plotted the following way.

plot(data_model(file_clinvar))

Table names can be listed with the names() function and potentially renamed with the names()<- or rename() functions (the tables have been renamed here to improve the readability of the following examples).

names(file_clinvar)

##  [1] "ClinVar_ReferenceClinVarAssertion" "ClinVar_rcvaVariant"
##  [3] "ClinVar_ClinVarAssertions"         "ClinVar_rcvaInhMode"
##  [5] "ClinVar_rcvaObservedIn"            "ClinVar_rcvaTraits"
##  [7] "ClinVar_clinSigOrder"              "ClinVar_revStatOrder"
##  [9] "ClinVar_variants"                  "ClinVar_cvaObservedIn"
## [11] "ClinVar_cvaSubmitters"             "ClinVar_traits"
## [13] "ClinVar_varEntrez"                 "ClinVar_varAttributes"
## [15] "ClinVar_varCytoLoc"                "ClinVar_varNames"
## [17] "ClinVar_varSeqLoc"                 "ClinVar_varXRef"
## [19] "ClinVar_traitCref"                 "ClinVar_traitNames"
## [21] "ClinVar_entrezNames"

file_clinvar <- file_clinvar %>%
   set_names(sub("ClinVar_", "", names(.)))
names(file_clinvar)

##  [1] "ReferenceClinVarAssertion" "rcvaVariant"
##  [3] "ClinVarAssertions"         "rcvaInhMode"
##  [5] "rcvaObservedIn"            "rcvaTraits"
##  [7] "clinSigOrder"              "revStatOrder"
##  [9] "variants"                  "cvaObservedIn"
## [11] "cvaSubmitters"             "traits"
## [13] "varEntrez"                 "varAttributes"
## [15] "varCytoLoc"                "varNames"
## [17] "varSeqLoc"                 "varXRef"
## [19] "traitCref"                 "traitNames"
## [21] "entrezNames"
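The renaming above relies on the base R sub() function applied to the vector of table names. The short sketch below reproduces the idea on a plain character vector (three of the names listed above) and shows an anchored variant of the pattern, which is a suggestion and not what the original code uses.

```r
table_names <- c("ClinVar_variants", "ClinVar_traits", "ClinVar_rcvaTraits")

# Unanchored pattern, as in the example above: the first match is removed
# wherever it occurs in the name
unanchored <- sub("ClinVar_", "", table_names)

# Anchored pattern: only a leading "ClinVar_" prefix is removed
short_names <- sub("^ClinVar_", "", table_names)
short_names
```

Both patterns give the same result here since the prefix only occurs at the start of each name; anchoring simply guards against names containing the same string elsewhere.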

The different collection members of an MDB are listed with the collection_members() function.

collection_members(file_clinvar)

## # A tibble: 10 × 9
##    collection cid                  resource   mid table field static value type
##    <chr>      <chr>                <chr>    <int> <chr> <chr> <lgl>  <chr> <chr>
##  1 Condition  ClinVar_conditions_… ClinVar      2 trai… cond… TRUE   Dise… <NA>
##  2 Condition  ClinVar_conditions_… ClinVar      2 trai… iden… FALSE  id    <NA>
##  3 Condition  ClinVar_conditions_… ClinVar      2 trai… sour… TRUE   Clin… <NA>
##  4 Condition  ClinVar_conditions_… ClinVar      1 trai… cond… TRUE   Dise… <NA>
##  5 Condition  ClinVar_conditions_… ClinVar      1 trai… iden… FALSE  id    <NA>
##  6 Condition  ClinVar_conditions_… ClinVar      1 trai… sour… FALSE  db    <NA>
##  7 BE         ClinVar_BE_1.0       ClinVar      1 entr… be    TRUE   Gene  <NA>
##  8 BE         ClinVar_BE_1.0       ClinVar      1 entr… iden… FALSE  entr… <NA>
##  9 BE         ClinVar_BE_1.0       ClinVar      1 entr… orga… TRUE   Homo… Scie…
## 10 BE         ClinVar_BE_1.0       ClinVar      1 entr… sour… TRUE   Entr… <NA>

The following functions are used to get the number of tables, the number of fields per table and the number of records.

length(file_clinvar)        # Number of tables

## [1] 21

lengths(file_clinvar)       # Number of fields per table

## ReferenceClinVarAssertion               rcvaVariant         ClinVarAssertions ##                         8                         2                         4 ##               rcvaInhMode            rcvaObservedIn                rcvaTraits ##                         2                         6                         3 ##              clinSigOrder              revStatOrder                  variants ##                         2                         2                         3 ##             cvaObservedIn             cvaSubmitters                    traits ##                         4                         3                         2 ##                 varEntrez             varAttributes                varCytoLoc ##                         3                         5                         2 ##                  varNames                 varSeqLoc                   varXRef ##                         3                        18                         4 ##                 traitCref                traitNames               entrezNames ##                         4                         3                         3

count_records(file_clinvar) # Number of records per table

## ReferenceClinVarAssertion               rcvaVariant         ClinVarAssertions
##                       166                       166                       409
##               rcvaInhMode            rcvaObservedIn                rcvaTraits
##                        16                       337                       166
##              clinSigOrder              revStatOrder                  variants
##                        11                         2                       138
##             cvaObservedIn             cvaSubmitters                    traits
##                       412                       416                        18
##                 varEntrez             varAttributes                varCytoLoc
##                       145                      2262                       138
##                  varNames                 varSeqLoc                   varXRef
##                       188                       280                       244
##                 traitCref                traitNames               entrezNames
##                        50                        44                        20

The count_records() function can take a lot of time when dealing with fileMDB objects if the data files are very large. In such a case it could be more efficient to list data file sizes instead.

data_file_size(file_clinvar, hr=TRUE)

## # A tibble: 21 × 3
##    table                     size   compressed
##    <chr>                     <chr>  <lgl>
##  1 ReferenceClinVarAssertion 4.6 KB TRUE
##  2 rcvaVariant               947 B  TRUE
##  3 ClinVarAssertions         4.2 KB TRUE
##  4 rcvaInhMode               152 B  TRUE
##  5 rcvaObservedIn            1.4 KB TRUE
##  6 rcvaTraits                788 B  TRUE
##  7 clinSigOrder              145 B  TRUE
##  8 revStatOrder              101 B  TRUE
##  9 variants                  2.1 KB TRUE
## 10 cvaObservedIn             1.8 KB TRUE
## # ℹ 11 more rows

3.4 Pulling, subsetting and combining

There are several possible ways to pull data tables from MDBs. The following lines return the same result displayed below (only once).

data_tables(file_clinvar, "traitNames")[[1]]
file_clinvar[["traitNames"]]
file_clinvar$"traitNames"
file_clinvar %>% pull(traitNames)

## # A tibble: 44 × 3
##     t.id name                                                              type
##    <int> <chr>                                                             <chr>
##  1   912 Chudley-McCullough syndrome                                       Pref…
##  2   912 Deafness, autosomal recessive 82                                  Alte…
##  3   912 Deafness, bilateral sensorineural, and hydrocephalus due to fora… Alte…
##  4   912 Deafness, sensorineural, with partial agenesis of the corpus cal… Alte…
##  5  1352 CTSD-Related Neuronal Ceroid-Lipofuscinosis                       Alte…
##  6  1352 Ceroid lipofuscinosis neuronal Cathepsin D-deficient              Alte…
##  7  1352 Neuronal ceroid lipofuscinosis 10                                 Pref…
##  8  1352 Neuronal ceroid lipofuscinosis due to Cathepsin D deficiency      Alte…
##  9  1481 Diabetes mellitus, neonatal, with congenital hypothyroidism       Pref…
## 10  1481 NDH SYNDROME                                                      Alte…
## # ℹ 34 more rows

MDBs can also be subset and combined. The corresponding functions ensure that the data model is fulfilled by the data tables.

file_clinvar[1:3]

## fileMDB ClinVar (version 0.9, Patrice Godard <patrice.godard@ucb.com>): Data extracted from the ClinVar database
##    - 3 tables with 14 fields
##
## No collection member
##
## ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. This is a very small subset of ClinVar! Visit the reference URL for more information.
## (https://www.ncbi.nlm.nih.gov/clinvar/)

if(igraph_available){
   c(file_clinvar[1:3], file_hpo[c(1,5,7)]) %>%
      data_model() %>% auto_layout(force=TRUE) %>% plot()
}else{
   c(file_clinvar[1:3], file_hpo[c(1,5,7)]) %>%
      data_model() %>% plot()
}

The function c() concatenates the provided MDBs after checking that table names are not duplicated. It does not integrate the data with any relational table. This can be achieved by merging the MDBs as described in the Merging with collections section.
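The duplicate check described above can be pictured with a minimal base R sketch working on plain named lists of tables. This is not the TKCat implementation: the helper combine_tables() and the toy tables are hypothetical and only illustrate the idea of refusing to concatenate when table names collide.

```r
# Hypothetical helper: concatenate two named lists of tables,
# refusing to proceed when a table name appears in both
combine_tables <- function(x, y) {
   dup <- intersect(names(x), names(y))
   if (length(dup) > 0) {
      stop("Duplicated table names: ", paste(dup, collapse = ", "))
   }
   c(x, y)
}

# Toy tables (hypothetical content)
a <- list(variants = data.frame(id = 1:2), traits = data.frame(id = 1:3))
b <- list(HPO_hp = data.frame(id = 1:4))

# Distinct names: the two lists are simply concatenated
names(combine_tables(a, b))

# Shared names: an error is raised instead of silently overwriting
res <- try(combine_tables(a, a), silent = TRUE)
inherits(res, "try-error")
```

As in TKCat, the concatenation in this sketch adds no relational table between the two sets of tables.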

3.5 Filtering and joining

An MDB can be filtered by filtering one or several tables based on field values. The filtering is propagated to other tables using the embedded data model.

In the example below, the file_clinvar object is filtered in order to focus on a few genes with pathogenic variants. The table below compares the number of rows before (“ori”) and after (“filt”) filtering.

filtered_clinvar <- file_clinvar %>%
   filter(
      entrezNames = symbol %in% c("PIK3R2", "UGT1A8")
   ) %>%
   slice(ReferenceClinVarAssertion=grep(
      "pathogen",
      .$ReferenceClinVarAssertion$clinicalSignificance,
      ignore.case=TRUE
   ))
left_join(
   dims(file_clinvar) %>% select(name, nrow),
   dims(filtered_clinvar) %>% select(name, nrow),
   by="name",
   suffix=c("_ori", "_filt")
)

## # A tibble: 21 × 3
##    name                      nrow_ori nrow_filt
##    <chr>                        <dbl>     <int>
##  1 ReferenceClinVarAssertion      166         4
##  2 rcvaVariant                    166         4
##  3 ClinVarAssertions              409        15
##  4 rcvaInhMode                     16         0
##  5 rcvaObservedIn                 337        10
##  6 rcvaTraits                     166         4
##  7 clinSigOrder                    11         3
##  8 revStatOrder                     2         1
##  9 variants                       138         3
## 10 cvaObservedIn                  412        15
## # ℹ 11 more rows

The object returned by filter() or slice() is a memoMDB: all the data are in memory.
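To give an intuition of how a filter on one table can be propagated to another one through a foreign key, here is a minimal base R sketch. It is not the TKCat implementation: the toy tables genes and variants and the helper propagate_filter() are hypothetical, and a real data model may involve many tables and several propagation steps.

```r
# Hypothetical toy tables linked by a foreign key:
# variants$entrez refers to genes$entrez
genes <- data.frame(
   entrez = c(5296L, 54576L, 1509L),
   symbol = c("PIK3R2", "UGT1A8", "CTSD"),
   stringsAsFactors = FALSE
)
variants <- data.frame(
   id = 1:5,
   entrez = c(5296L, 5296L, 54576L, 1509L, 1509L)
)

# Hypothetical helper: keep only the child rows whose key
# is still present in the (filtered) parent table
propagate_filter <- function(parent, child, key) {
   child[child[[key]] %in% parent[[key]], , drop = FALSE]
}

# Filter the parent table, then propagate to the child table
filt_genes <- genes[genes$symbol %in% c("PIK3R2", "UGT1A8"), , drop = FALSE]
filt_variants <- propagate_filter(filt_genes, variants, "entrez")

# Compare the number of rows before and after filtering
data.frame(
   name = c("genes", "variants"),
   nrow_ori = c(nrow(genes), nrow(variants)),
   nrow_filt = c(nrow(filt_genes), nrow(filt_variants))
)
```

In this sketch only one foreign key is followed, in one direction; it is meant to illustrate the principle, not the exact propagation rules applied by TKCat.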

Tables can be easily joined to get diseases associated to the genes of interest in a single table as shown below.

gene_traits <- filtered_clinvar %>%
   join_mdb_tables(
      "entrezNames", "varEntrez", "variants", "rcvaVariant",
      "ReferenceClinVarAssertion", "rcvaTraits", "traits"
   )
gene_traits$entrezNames %>%
   select(symbol, name, variants.type, variants.name, traitType, traits.name)

## # A tibble: 4 × 6
##   symbol name                  variants.type variants.name traitType traits.name
##   <chr>  <chr>                 <chr>         <chr>         <chr>     <chr>
## 1 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   Megalencep…
## 2 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   not provid…
## 3 PIK3R2 phosphoinositide-3-k… single nucle… NM_005027.4(… Disease   not provid…
## 4 UGT1A8 UDP glucuronosyltran… Microsatelli… UGT1A1*28     Disease   Gilbert's …

3.6 Merging MDBs with collections

Until now, we have seen how to use an individual MDB by exploring general information about it, extracting tables, filtering and joining data. This part shows how to use collections to identify relationships between MDBs and to leverage these relationships to integrate them. Documenting collection members has been described above and further information about the implementation of collections is provided in the appendix.

3.6.1 Collections and collection members

As explained above, some databases refer to the same concepts and could be integrated accordingly. However they often use different vocabularies.

For example, both CHEMBL and ClinVar refer to biological entities (BE) for documenting drug targets or disease causal genes. CHEMBL refers to drug targets in the CHEMBL_component_sequence table using mainly Uniprot peptide identifiers from different species.

file_chembl$CHEMBL_component_sequence

## # A tibble: 35 × 5
##    component_id accession organism                  db_source db_version
##           <int> <chr>     <chr>                     <chr>     <chr>
##  1          259 P15260    Homo sapiens              Uniprot   2019_09
##  2          327 Q99062    Homo sapiens              Uniprot   2019_09
##  3          752 P35563    Rattus norvegicus         Uniprot   2019_09
##  4          917 P07339    Homo sapiens              Uniprot   2019_09
##  5         1807 Q54A96    Plasmodium falciparum     Uniprot   2019_09
##  6         2180 P67774    Bos taurus                Uniprot   2019_09
##  7         2398 P25098    Homo sapiens              Uniprot   2019_09
##  8         2541 Q8II92    Plasmodium falciparum 3D7 Uniprot   2019_09
##  9         3803 Q64346    Rattus norvegicus         Uniprot   2019_09
## 10         4395 O60502    Homo sapiens              Uniprot   2019_09
## # ℹ 25 more rows

Whereas ClinVar refers to causal genes in the entrezNames table using human Entrez gene identifiers.

file_clinvar$entrezNames

## # A tibble: 20 × 3
##       entrez name                                                         symbol
##        <int> <chr>                                                        <chr>
##  1      1509 cathepsin D                                                  CTSD
##  2      1903 sphingosine-1-phosphate receptor 3                           S1PR3
##  3      3300 DnaJ heat shock protein family (Hsp40) member B2             DNAJB2
##  4      3423 iduronate 2-sulfatase                                        IDS
##  5      3910 laminin subunit alpha 4                                      LAMA4
##  6      5296 phosphoinositide-3-kinase regulatory subunit 2               PIK3R2
##  7      6748 signal sequence receptor subunit 4                           SSR4
##  8      7633 zinc finger protein 79                                       ZNF79
##  9     22906 trafficking kinesin protein 1                                TRAK1
## 10     23155 chloride channel CLIC like 1                                 CLCC1
## 11     26251 potassium voltage-gated channel modifier subfamily G member… KCNG2
## 12     29851 inducible T cell costimulator                                ICOS
## 13     54576 UDP glucuronosyltransferase family 1 member A8               UGT1A8
## 14     57684 zinc finger and BTB domain containing 26                     ZBTB26
## 15    115948 outer dynein arm docking complex subunit 3                   ODAD3
## 16    139716 GRB2 associated binding protein 3                            GAB3
## 17    169792 GLIS family zinc finger 3                                    GLIS3
## 18    407054 microRNA 98                                                  MIR98
## 19    441531 phosphoglycerate mutase family member 4                      PGAM4
## 20 105373557 serous ovarian cancer associated RNA                         SOCAR

Since peptides are coded by genes, there is a biological relationship between these two types of BE, and several tools exist to convert such BE identifiers from one scope to the other (e.g. BED (Godard and Eyll 2018), mygene (Wu, MacLeod, and Su 2012), biomaRt (Kinsella et al. 2011)).

TKCat provides mechanisms to document these scopes in order to allow automatic conversions from and to any of them. Those concepts are called Collections in TKCat and they should be formally defined before being able to document any of their members. Two collection definitions are provided within the TKCat package and others can be imported with the import_local_collection() function.

list_local_collections()

## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Here are the definitions of the BE collection members provided by the CHEMBL_component_sequence and the entrezNames tables.

collection_members(file_chembl, "BE")

## # A tibble: 4 × 9
##   collection cid           resource   mid table         field static value type
##   <chr>      <chr>         <chr>    <int> <chr>         <chr> <lgl>  <chr> <chr>
## 1 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… be    TRUE   Pept… <NA>
## 2 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… iden… FALSE  acce… <NA>
## 3 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… sour… FALSE  db_s… <NA>
## 4 BE         CHEMBL_BE_1.0 CHEMBL       1 CHEMBL_compo… orga… FALSE  orga… Scie…

collection_members(file_clinvar, "BE")

## # A tibble: 4 × 9
##   collection cid            resource   mid table       field  static value type
##   <chr>      <chr>          <chr>    <int> <chr>       <chr>  <lgl>  <chr> <chr>
## 1 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames be     TRUE   Gene  <NA>
## 2 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames ident… FALSE  entr… <NA>
## 3 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames organ… TRUE   Homo… Scie…
## 4 BE         ClinVar_BE_1.0 ClinVar      1 entrezNames source TRUE   Entr… <NA>

The collection column indicates the collection to which the table refers. The cid column indicates the version of the collection definition, which should correspond to the $id of the JSON schema. The resource column indicates the name of the resource and the mid column an identifier which is unique for each member of a collection in each resource. The field column indicates each part of the scope of the collection. In the case of BE, 4 fields should be documented:

be: the type of BE (e.g. Gene or Peptide)
source: the source of the identifier (e.g. EntrezGene or Uniprot)
organism: the organism to which the identifier refers (e.g. Homo sapiens)
identifier: the identifier itself.

Each of these fields can be static or not. TRUE means that the value of this field is the same for all the records and is provided in the value column, whereas FALSE means that the value can be different for each record and is provided in the column whose name is given in the value column. The type column is only used for the organism field in the case of the BE collection and can take 2 values: “Scientific name” or “NCBI taxon identifier”. The definition of the pre-built BE collection members follows the terminology used in the BED package (Godard and Eyll 2018). But it can be adapted according to the solution chosen for converting BE identifiers from one scope to another.
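The difference between static and non-static fields can be illustrated with a small base R sketch that resolves the scope of each record from a member definition. This is only an illustration under stated assumptions: the toy table tab, its gene_id column, the member definition and the helper resolve_scope() are all hypothetical and are not part of TKCat.

```r
# Hypothetical data table and collection member definition (toy values)
tab <- data.frame(
   gene_id = c("1509", "5296"),
   stringsAsFactors = FALSE
)
member <- data.frame(
   field  = c("be", "source", "organism", "identifier"),
   static = c(TRUE, TRUE, TRUE, FALSE),
   value  = c("Gene", "EntrezGene", "Homo sapiens", "gene_id"),
   stringsAsFactors = FALSE
)

# Hypothetical helper: build one column per field of the scope
resolve_scope <- function(tab, member) {
   res <- lapply(seq_len(nrow(member)), function(i) {
      if (member$static[i]) {
         # Static field: the same value for all the records
         rep(member$value[i], nrow(tab))
      } else {
         # Non-static field: values are read from the column
         # whose name is given in the value column
         tab[[member$value[i]]]
      }
   })
   names(res) <- member$field
   as.data.frame(res, stringsAsFactors = FALSE)
}

scope <- resolve_scope(tab, member)
scope
```

Each row of scope then carries the complete scope (be, source, organism, identifier) of the corresponding record, which is the kind of information a conversion tool needs.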

Setting up the definition of such a scope is done using the add_collection_member() function as shown above in the minimal example and in the Reading HPO example.

3.6.2 Shared collections and merging

The aim of collections is to identify potential bridges between MDBs. The get_shared_collections() function is used to list all the collections shared by two MDBs.

get_shared_collections(filtered_clinvar, file_chembl)

## # A tibble: 3 × 5
##   collection table.x     mid.x table.y                   mid.y
##   <chr>      <chr>       <int> <chr>                     <int>
## 1 Condition  traits          2 CHEMBL_drug_indication        1
## 2 Condition  traitCref       1 CHEMBL_drug_indication        1
## 3 BE         entrezNames     1 CHEMBL_component_sequence     1

In this example, there are 3 different ways to merge the two MDBs filtered_clinvar and file_chembl:

Based on conditions provided respectively in the traits and in the CHEMBL_drug_indication tables
Based on conditions provided respectively in the traitCref and in the CHEMBL_drug_indication tables
Based on BE provided respectively in the entrezNames and in the CHEMBL_component_sequence tables
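Conceptually, listing shared collections amounts to pairing the collection members of the two MDBs that belong to the same collection. The base R sketch below reproduces the structure of the table shown above from two toy member tables (values taken from that output). It is an illustration only, not the code of get_shared_collections().

```r
# Collection members of each MDB (collection, table and mid only)
mx <- data.frame(
   collection = c("Condition", "Condition", "BE"),
   table = c("traits", "traitCref", "entrezNames"),
   mid = c(2L, 1L, 1L),
   stringsAsFactors = FALSE
)
my <- data.frame(
   collection = c("Condition", "BE"),
   table = c("CHEMBL_drug_indication", "CHEMBL_component_sequence"),
   mid = c(1L, 1L),
   stringsAsFactors = FALSE
)

# Pair the members referring to the same collection
shared <- merge(mx, my, by = "collection", suffixes = c(".x", ".y"))
shared
```

Each row of shared is one candidate bridge between the two resources; selecting one or several rows is what the by parameter of merge() expects in the examples of this section.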

The code below shows how to merge these two resources based on BE information. To achieve this task it relies on a function provided with TKCat along with the BE collection definition (to get the function: get_collection_mapper("BE")). This function uses the BED package (Godard and Eyll 2018), which must be installed and connected to a BED database in order to run the code below.

try(BED::connectToBed(a))

## Error in eval(expr, envir) : object 'a' not found

bedCheck <- try(BED::checkBedConn())
if(!inherits(bedCheck, "try-error") && bedCheck){
   sel_coll <- get_shared_collections(file_clinvar, file_chembl) %>%
      filter(collection=="BE")
   filtered_cv_chembl <- merge(
      x=file_clinvar,
      y=file_chembl,
      by=sel_coll,
      dmAutoLayout=igraph_available
   )
}

The returned object is a metaMDB gathering the original MDBs and a relational table between members of the same collection as defined by the by parameter.

Additional information about collections can be found below in the appendix.

3.6.3 Merging without collection

If the collection column of the by parameter is NA, then the relational table is built by merging identical columns in table.x and table.y (no conversion occurs). For example, the file_hpo and file_clinvar MDBs could be merged according to conditions provided in the HPO_diseases and the traitCref tables respectively.

get_shared_collections(file_hpo, file_clinvar)

## # A tibble: 4 × 5
##   collection table.x      mid.x table.y   mid.y
##   <chr>      <chr>        <int> <chr>     <int>
## 1 Condition  HPO_hp           1 traits        2
## 2 Condition  HPO_hp           1 traitCref     1
## 3 Condition  HPO_diseases     2 traits        2
## 4 Condition  HPO_diseases     2 traitCref     1

These conditions could be converted using a function provided with TKCat (get_collection_mapper("Condition")), which relies on the DODO package (François, Eyll, and Godard 2020). The two tables can also be simply concatenated without applying any conversion (obviously losing the advantage of such a conversion).

sel_coll <- get_shared_collections(file_hpo, file_clinvar) %>%
   filter(table.x=="HPO_diseases", table.y=="traitCref") %>%
   mutate(collection=NA)
sel_coll

## # A tibble: 1 × 5
##   collection table.x      mid.x table.y   mid.y
##   <lgl>      <chr>        <int> <chr>     <int>
## 1 NA         HPO_diseases     2 traitCref     1

The merge() function gathers the two MDBs in one metaMDB and creates an association table based on the by argument. This association table (“HPO_diseases_traitCref”) is displayed in yellow in the data model of the created metaMDB as shown below.

hpo_clinvar <- merge(
   file_hpo, file_clinvar, by=sel_coll, dmAutoLayout=igraph_available
)
plot(data_model(hpo_clinvar))

hpo_clinvar$HPO_diseases_traitCref

## # A tibble: 1,950 × 2
##    db       id
##    <chr>    <chr>
##  1 DECIPHER 15
##  2 DECIPHER 45
##  3 DECIPHER 65
##  4 OMIM     100050
##  5 OMIM     100650
##  6 OMIM     101800
##  7 OMIM     102500
##  8 OMIM     102510
##  9 OMIM     102700
## 10 OMIM     102800
## # ℹ 1,940 more rows
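One plausible reading of "merging identical columns" is sketched below in base R: the association table is made of the columns whose names are shared by the two tables, without any identifier conversion. The toy tables tx and ty and their content are hypothetical, and the exact rule applied by merge() on MDBs (for instance regarding values found in only one of the tables) may differ from this sketch.

```r
# Hypothetical toy tables sharing the "db" and "id" columns
tx <- data.frame(
   db = c("OMIM", "OMIM", "DECIPHER"),
   id = c("100050", "100650", "15"),
   label = c("label A", "label B", "label C"),
   stringsAsFactors = FALSE
)
ty <- data.frame(
   db = c("OMIM", "OMIM", "MedGen"),
   id = c("100050", "999999", "C000"),
   t.id = c(1L, 2L, 3L),
   stringsAsFactors = FALSE
)

# Columns with identical names in both tables
shared <- intersect(colnames(tx), colnames(ty))

# Association table: combinations of shared columns found in both tables
# (assumption: an inner join; no identifier conversion is applied)
assoc <- unique(merge(
   tx[, shared, drop = FALSE],
   ty[, shared, drop = FALSE],
   by = shared
))
assoc
```

Because no conversion occurs, identifiers that denote the same condition in different vocabularies are not matched, which is the advantage lost when the collection is set to NA.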

4 A centralized catalog of MDB in ClickHouse (chTKCat)

4.1 Local TKCat

MDBs can be gathered in a TKCat (Tailored Knowledge Catalog) object.

k <- TKCat(file_hpo, file_clinvar)

Gathering MDBs in such a catalog facilitates their exploration and their preparation for potential integration. Several functions are available to achieve this goal.

list_MDBs(k)                     # list all the MDBs in a TKCat object

## # A tibble: 2 × 7
##   name    title         description url   version maintainer timestamp
##   <chr>   <chr>         <chr>       <chr> <chr>   <chr>      <dttm>
## 1 HPO     Data extract… This is a … http… <NA>    <NA>       NA
## 2 ClinVar Data extract… ClinVar is… http… 0.9     Patrice G… NA

get_MDB(k, "HPO")                # get a specific MDB from the catalog

## fileMDB HPO: Data extracted from the HPO database
##    - 9 tables with 25 fields
##
## Collection members:
##    - 2 Condition members
##
## This is a very small subset of the HPO! Visit the reference URL for more information.
## (http://human-phenotype-ontology.github.io/)

search_MDB_tables(k, "disease")  # Search table about "disease"

## # A tibble: 3 × 3
##   resource name                comment
##   <chr>    <chr>               <chr>
## 1 HPO      HPO_diseases        Diseases
## 2 HPO      HPO_diseaseHP       HP presented by diseases
## 3 HPO      HPO_diseaseSynonyms Disease synonyms

search_MDB_fields(k, "disease")  # Search a field about "disease"

## # A tibble: 8 × 7
##   resource table               name    type      nullable unique comment
##   <chr>    <chr>               <chr>   <chr>     <lgl>    <lgl>  <chr>
## 1 HPO      HPO_diseases        db      character FALSE    FALSE  Disease databa…
## 2 HPO      HPO_diseases        id      character FALSE    FALSE  Disease ID
## 3 HPO      HPO_diseases        label   character FALSE    FALSE  Disease lable …
## 4 HPO      HPO_diseaseHP       db      character FALSE    FALSE  Disease databa…
## 5 HPO      HPO_diseaseHP       id      character FALSE    FALSE  Disease ID
## 6 HPO      HPO_diseaseSynonyms db      character FALSE    FALSE  Disease databa…
## 7 HPO      HPO_diseaseSynonyms id      character FALSE    FALSE  Disease ID
## 8 HPO      HPO_diseaseSynonyms synonym character FALSE    FALSE  Disease synonym

collection_members(k)            # Get collection members of the different MDBs

## # A tibble: 5 × 3
##   resource collection table
##   <chr>    <chr>      <chr>
## 1 HPO      Condition  HPO_hp
## 2 HPO      Condition  HPO_diseases
## 3 ClinVar  Condition  traits
## 4 ClinVar  Condition  traitCref
## 5 ClinVar  BE         entrezNames

c(k, TKCat(file_chembl))         # Merge 2 TKCat objects

## TKCat gathering 3 MDB objects

The function explore_MDBs() launches a shiny interface to explore MDBs in a TKCat object. This exploration interface can be easily deployed using an app.R file with content similar to the one below.

library(TKCat)
explore_MDBs(k, download=TRUE)

In this interface the users can explore the resources available in the catalog. They can browse the data model of each of them with some sample data. They can also search for information provided in resources, tables or fields. Finally, if the parameter download is set to TRUE, the users will also be able to download the data: either each table individually or an archive of the whole MDB.

4.2 chTKCat

A chTKCat object is a catalog of MDBs, like the TKCat object described above, but relying on a ClickHouse database. This part focuses on using and querying a chTKCat object. The installation and the initialization of a ClickHouse database ready for TKCat are described below in the appendix.

The connection to the ClickHouse TKCat database is achieved using the chTKCat() function.

k <- chTKCat(
   host="localhost",                     # default parameter
   port=9111L,                           # default parameter
   drv=ClickHouseHTTP::ClickHouseHTTP(), # default parameter
   user="default",                       # default parameter
   password=""                           # if not provided the
                                         # password is requested interactively
)

By default, this function connects anonymously (“default” user without password) to the database, using the HTTP interface of ClickHouse thanks to the ClickHouseHTTP driver. If the database is configured appropriately (see appendix), connection can be achieved through HTTPS with or without SSL peer verification (see the manual of ClickHouseHTTP::ClickHouseHTTPDriver-class for further information). Also, the RClickhouse::clickhouse() driver from the RClickhouse package can be used (drv parameter of the chTKCat() function) to leverage the native TCP interface of ClickHouse, which has the strong advantage of having less overhead. But TLS wrapping is not supported yet by the RClickhouse package.

Once connected, this chTKCat object can be used as a TKCat object.

list_MDBs(k)                     # list all the MDBs in the catalog

## # A tibble: 24 × 12
##    name   title description url   version maintainer public populated timestamps
##    <chr>  <chr> <chr>       <chr> <chr>   <chr>      <lgl>  <lgl>     <lgl>
##  1 ChEMBL ChEM… ChEMBL is … http… 1.0.0   [Patrice … TRUE   TRUE      TRUE
##  2 Corte… Data… Clarivate … http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
##  3 DRE-B… Bulk… Re-interpr… http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  4 FCD-T… Bulk… Re-interpr… http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  5 GO     The … Because of… http… 1.0.0   [Patrice … TRUE   TRUE      TRUE
##  6 GTEx   Geno… The Adult … http… 0.01    [Patrice … TRUE   TRUE      TRUE
##  7 Galac… Data… Biorelate'… http… 1.1.0   [Patrice … TRUE   TRUE      TRUE
##  8 Globa… Data… GlobalData… http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
##  9 HGNC   Anno… The HUGO G… http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
## 10 HPA    The … The Human … http… 0.0.1   [Patrice … TRUE   TRUE      TRUE
## # ℹ 14 more rows
## # ℹ 3 more variables: timestamp <dttm>, access <fct>, total_size <dbl>

search_MDB_tables(k, "disease")  # Search table about "disease"

## # A tibble: 47 × 3
##    resource    name                   comment
##    <chr>       <chr>                  <chr>
##  1 ChEMBL      assay_classification   "Classification scheme for phenotypic ass…
##  2 Galactic    status                 "Cause-and-effect interactions can be bet…
##  3 HPA         Disease_involvement     <NA>
##  4 HPO         Disease_HP             "HP presented by diseases"
##  5 HPO         Disease_synonyms       "Disease synonyms"
##  6 HPO         Diseases               "Diseases"
##  7 brainSCOPE  CT_group_conditions    "Experimental condition (e.g.: to be comp…
##  8 brainSCOPE  Cell_type_conditions   "Experimental condition (e.g.: to be comp…
##  9 OpenTargets Associations_by_source "Disease target association by data sourc…
## 10 OpenTargets Associations_by_type   "Disease target association by data type"
## # ℹ 37 more rows

search_MDB_fields(k, "disease")  # Search a field about "disease"

## # A tibble: 124 × 7
##    resource              table               name  comment type  nullable unique
##    <chr>                 <chr>               <chr> <chr>   <chr> <lgl>    <lgl>
##  1 FCD-TLE-Bulk-RNA-2019 conditions          cond… "Disea… char… TRUE     TRUE
##  2 FCD-TLE-Bulk-RNA-2019 samples             cond… "Disea… char… TRUE     FALSE
##  3 HPA                   Disease_involvement dise… ""      char… FALSE    FALSE
##  4 DRE-Bulk-RNA-UMC-2024 conditions          cond… "Disea… char… TRUE     FALSE
##  5 DRE-Bulk-RNA-UMC-2024 samples             cond… "Disea… char… TRUE     FALSE
##  6 DRE-Bulk-RNA-UMC-2024 epilepsies_genetics dise… ""      char… FALSE    FALSE
##  7 DRE-Bulk-RNA-UMC-2024 epilepsies_targets  dise… ""      char… FALSE    FALSE
##  8 DRE-Bulk-RNA-UMC-2024 epilepsies_targets  dise… ""      nume… FALSE    FALSE
##  9 DRE-Bulk-RNA-UMC-2024 genes_epilepsies_a… dise… ""      char… FALSE    FALSE
## 10 DRE-Bulk-RNA-UMC-2024 epilepsies_genetics dise… ""      char… FALSE    FALSE
## # ℹ 114 more rows

collection_members(k)

## # A tibble: 43 × 3
##    resource              collection table
##    <chr>                 <chr>      <chr>
##  1 ChEMBL                BE         component_sequences
##  2 ChEMBL                Condition  drug_indication
##  3 Cortellis             BE         target_genes
##  4 DRE-Bulk-RNA-UMC-2024 BE         genes
##  5 FCD-TLE-Bulk-RNA-2019 BE         genes
##  6 GO                    BE         Unique_BEIDs
##  7 GTEx                  BE         genes
##  8 GTEx                  BE         transcripts
##  9 GlobalData            BE         target_genes
## 10 HGNC                  BE         Genes
## # ℹ 33 more rows

explore_MDBs(k)

4.3 Pushing an MDB in a chTKCat instance

Any MDB object can be imported in a TKCat ClickHouse instance as follows:

kw <- chTKCat(host="localhost", port=9111L, user="pgodard")
create_chMDB(kw, "HPO", public=TRUE)
ch_hpo <- as_chMDB(file_hpo, kw)

It is then accessible to anyone with relevant permissions on the ClickHouse database. Pushing data in a ClickHouse database works only if the user is allowed to write in the database.

4.4 Specific operations on chMDB objects

The function get_MDB() returns a chMDB object that can be used as any MDB object. The data are located in the ClickHouse database and pulled on request.

ch_hpo <- get_MDB(k, "HPO")

To avoid pulling a whole table from ClickHouse (which can take time if the table is big), SQL queries can be made on the chMDB object as shown below.

get_query(
   ch_hpo,
   query="SELECT * from HPO_diseases WHERE lower(label) LIKE '%epilep%'"
)

## # A tibble: 292 × 3
##    db    id     label
##    <chr> <chr>  <chr>
##  1 OMIM  117100 Centralopathic epilepsy
##  2 OMIM  121201 Epilepsy, benign neonatal, 2
##  3 OMIM  132090 Epilepsy, benign occipital
##  4 OMIM  132300 Epilepsy, reading
##  5 OMIM  159600 Myoclonic epilepsy, Hartung type
##  6 OMIM  159950 Spinal muscular atrophy with progressive myoclonic epilepsy
##  7 OMIM  208700 Ataxia with myoclonic epilepsy and presenile dementia
##  8 OMIM  213000 Cerebellar hypoplasia/atrophy, epilepsy, and global development…
##  9 OMIM  226800 Epilepsy, photogenic, with spastic diplegia and mental retardat…
## 10 OMIM  226810 Celiac disease, epilepsy and cerebral calcification syndrome
## # ℹ 282 more rows

5 Defining and using Requirements for Knowledge Management (KMR)

Beside the relational model, no additional constraints are applied to an MDB. This allows for high flexibility in the data that can be managed. However, in some cases, it could be useful to add further constraints to ensure that the data is compatible with specific analysis or integration workflows. In TKCat, this feature is supported by KMR (Knowledge Management Requirements). A KMR object is meant to be shared and centrally managed. MDBs intended to meet these requirements must contain technical tables referring to the corresponding KMR. When grouped in the same TKCat catalog, KMRs and MDBs form a coherent corpus of knowledge that can be leveraged consistently by KMR-tailored functions.

This set of features is described in the vignette Defining and using Requirements for Knowledge Management (KMR) in TKCat.

6 Appendices

6.1 chTKCat operations

6.1.1 Instantiating the ClickHouse database

6.1.1.1 Install ClickHouse, initialize and configure the TKCat instance

The ClickHouse docker container supporting TKCat, its initialization and its configuration procedures are implemented here: docker.

Update the Dockerfile to select the version of ClickHouse to use.
Customize and run the following script.

sh launch-tkcat-instance.sh

Specific attention should be paid to available ports: the TCP native port (but not TLS wrapping yet) is supported by the RClickhouse R package whereas HTTP and HTTPS ports are supported by the ClickHouseHTTP R package.

The data are stored in the TKCAT_HOME folder.

6.1.1.2 Cleaning and removing a TKCat instance

When no longer needed, stopping and removing the docker container can be achieved as exemplified below.

# In shell
docker stop test_tkcat
docker rm test_tkcat
docker volume prune -f
# Remove the folder with all the data: `$TKCAT_HOME`.
sudo rm -rf /mnt/data1/pgodard/Services-test/test_tkcat_2025.04.18

6.1.2 User management

User management requires admin rights on the database.

6.1.2.1 Creation

k <- chTKCat(user="pgodard")
create_chTKCat_user(
   k, login="lfrancois", contact=NA, admin=FALSE, provider=TRUE
)

The function will require setting up a password for the new user. The admin parameter indicates if the new user has admin rights on the whole chTKCat instance (default: FALSE). The provider parameter indicates if the new user can create and populate new databases within the chTKCat instance (default: FALSE).

6.1.2.2 Update

k <- chTKCat(user="pgodard")
change_chTKCat_password(k, "lfrancois")
update_chTKCat_user(k, contact="email", admin=FALSE)

A shiny application can be launched for updating user settings:

manage_chTKCat_users(k)

If this application is deployed, it can be made directly accessible from the explore_MDBs() Shiny application by providing the URL as the userManager parameter.

6.1.2.3 Drop

drop_chTKCat_user(k, login="lfrancois")

6.1.3 chMDB management

6.1.3.1 chMDB Creation

Before MDB data can be uploaded, the database should be created. This operation can only be achieved by data providers (see above).

create_chMDB(k, "CHEMBL", public=FALSE)

By default, chMDBs are not public. This can be changed through the public parameter when creating the chMDB or by using the set_chMDB_access() function afterward.

set_chMDB_access(k, "CHEMBL", public=TRUE)

Then, users having access to the chMDB can be identified with or without admin rights on the chMDB. Admin rights allow the user to update the chMDB data.

add_chMDB_user(k, "CHEMBL", "lfrancois", admin=TRUE)
# remove_chMDB_user(k, "CHEMBL", "lfrancois")
list_chMDB_users(k, "CHEMBL")

6.1.3.2 Populating chMDB

Each chMDB can be populated individually using the as_chMDB() function. The code chunk below shows how to scan a directory for all the fileMDB it contains. The as_memoMDB() function loads all the data in memory and checks that all the model constraints are fulfilled (this step is optional). When the overwrite parameter of the as_chMDB() function is set to FALSE (default), the potential existing version is archived before being updated. When overwrite is set to TRUE, the potential existing version is overwritten without being archived.

lc <- scan_fileMDBs("fileMDB_directory")## The commented line below allows the exploration of the data models in lc.# explore_MDBs(lc)for(r in toFeed){   message(r)   lr <- as_memoMDB(lc[[r]])   cr <- as_chMDB(lr, k, overwrite=FALSE)}

6.1.3.3 Deleting a chMDB

Any admin user of a chMDB can delete the corresponding data.

empty_chMDB(k, "CHEMBL")

But only a system admin can drop the chMDB from the ClickHouse database.

drop_chMDB(k, "CHEMBL")

6.1.4 Collection management

Details about collections are provided in the following appendix.

Collections need to be added to a chTKCat instance in order to support collection members of the different chMDBs. They can be taken from the TKCat package environment, from a JSON file or directly from a JSON text variable. Additional functions are available to list and remove chTKCat collections.

add_chTKCat_collection(k, "BE")
list_chTKCat_collections(k)
remove_chTKCat_collection(k, "BE")

6.1.5 Implementation

6.1.5.1 Data models

6.1.5.1.1 Default database

The default database stores information about the chTKCat instance, users and user access.

6.1.5.1.2 Modeled databases

Modeled databases (MDB) are stored in dedicated databases in chTKCat. Their data model is provided in dedicated tables described below.

6.2 TKCat collections

Some MDBs refer to the same concepts and can be integrated accordingly. However, they often use different vocabularies or scopes. Collections are used to identify such concepts and to define a way to formally document the scope used by the different members of these collections. Thanks to this formal description, tools can be used to automatically combine MDBs referring to the same collection but using different scopes, as shown above.

This appendix describes how to create TKCat Collections, document collection members and create functions to support the merging of MDBs.

6.2.1 Creating a collection

A collection is defined by a JSON document. This document should fulfill the requirements defined by the Collection-Schema.json. Two collections are available by default in the TKCat package.

list_local_collections()

## # A tibble: 2 × 2
##   title     description
##   <chr>     <chr>
## 1 BE        Collection of biological entity (BE) concepts
## 2 Condition Collection of condition concepts

Here is how the BE collection is defined.

get_local_collection("BE")

{
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$id": "TKCat_BE_collection_1.0",
    "title": "BE collection",
    "type": "object",
    "description": "Collection of biological entity (BE) concepts",
    "properties": {
        "$schema": {"enum": ["TKCat_BE_collection_1.0"]},
        "$id": {"type": "string"},
        "collection": {"enum": ["BE"]},
        "resource": {"type": "string"},
        "tables": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "fields": {
                        "type": "object",
                        "properties": {
                            "be": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            },
                            "source": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            },
                            "organism": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"},
                                    "type": {"enum": ["Scientific name", "NCBI taxon identifier"]}
                                },
                                "required": ["static", "value", "type"],
                                "additionalProperties": false
                            },
                            "identifier": {
                                "type": "object",
                                "properties": {
                                    "static": {"type": "boolean"},
                                    "value": {"type": "string"}
                                },
                                "required": ["static", "value"],
                                "additionalProperties": false
                            }
                        },
                        "required": ["be", "source", "identifier"],
                        "additionalProperties": false
                    }
                },
                "required": ["name", "fields"],
                "additionalProperties": false
            }
        }
    },
    "required": ["$schema", "$id", "collection", "resource", "tables"],
    "additionalProperties": false
}

A collection should refer to the "TKCat_collections_1.0" $schema. It should then have the following properties:

$id: the identifier of the collection
title: the title of the collection
type: always object
description: a short description of the collection
properties: the properties that should be provided by collection members. In this case:
- $schema: should be the $id of the collection
- $id: the identifier of the collection member: a string
- collection: should be “BE”
- resource: the name of the resource having collection members: a string
- tables: an array of tables corresponding to collection members. Each item is a table with the following features:
  - name: the name of the table
  - fields: the required fields
    - be: if static is true, then value corresponds to the be value valid for all the records. If not, value corresponds to the table column with the be value for each record.
    - source: if static is true, then value corresponds to the source value valid for all the records. If not, value corresponds to the table column with the source value for each record.
    - organism (optional): if static is true, then value corresponds to the organism value valid for all the records. If not, value corresponds to the table column with the organism value for each record. type indicates how organisms are identified: "Scientific name" or "NCBI taxon identifier".
    - identifier: if static is true, then value corresponds to the identifier value valid for all the records. If not, value corresponds to the table column with the identifier value for each record.
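The static/value convention described in the list above can be pictured with a small base R sketch. The helper resolve_member_field() is hypothetical (it is not part of TKCat) and the table content is toy data; it only illustrates how a field definition is read against a table.

```r
# Hypothetical helper (not a TKCat function): returns one value per record
# for a field definition of the form list(static=..., value=...).
resolve_member_field <- function(tbl, spec) {
   if (isTRUE(spec$static)) {
      # Static: the same value applies to every record of the table
      rep(spec$value, nrow(tbl))
   } else {
      # Not static: the value is the name of the column holding the values
      stopifnot(spec$value %in% colnames(tbl))
      tbl[[spec$value]]
   }
}

# Toy table with made-up identifiers
tbl <- data.frame(
   accession = c("A0001", "A0002"),
   db_source = c("SRC1", "SRC1"),
   stringsAsFactors = FALSE
)

be <- resolve_member_field(tbl, list(static = TRUE, value = "Peptide"))
identifier <- resolve_member_field(tbl, list(static = FALSE, value = "accession"))
```

Here be holds "Peptide" for every record, whereas identifier holds the content of the accession column.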

The main specifications defined in a JSON document can be displayed in the R session by calling the show_collection_def() function.

get_local_collection("BE") %>%
   show_collection_def()

## BE collection: Collection of biological entity (BE) concepts
## Arguments (non-mandatory arguments are between parentheses):
##    - be:
##       + static: logical
##       + value: character
##    - source:
##       + static: logical
##       + value: character
##    - (organism):
##       + static: logical
##       + value: character
##       + type: character in 'Scientific name', 'NCBI taxon identifier'
##    - identifier:
##       + static: logical
##       + value: character

6.2.2 Documenting collection members

Documenting collection members of an MDB can be done by using the add_collection_member() function (as formerly described), or by writing a JSON file like the following one, which corresponds to the BE members of the CHEMBL MDB.

system.file(
   "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
   package="TKCat"
) %>%
   readLines() %>% paste(collapse="\n")

{
  "$schema": "TKCat_BE_collection_1.0",
  "$id": "CHEMBL_BE_1.0",
  "collection": "BE",
  "resource": "CHEMBL",
  "tables": [
    {
      "name": "CHEMBL_component_sequence",
      "fields": {
        "be": {
          "static": true,
          "value": "Peptide"
        },
        "identifier": {
          "static": false,
          "value": "accession"
        },
        "source": {
          "static": false,
          "value": "db_source"
        },
        "organism": {
          "static": false,
          "value": "organism",
          "type": "Scientific name"
        }
      }
    }
  ]
}

The identification of collection members should fulfill the requirements defined by the collection JSON document, and therefore pass the following validation.

jsonvalidate::json_validate(
   json=system.file(
      "examples/CHEMBL/model/Collections/BE-CHEMBL_BE_1.0.json",
      package="TKCat"
   ),
   schema=get_local_collection("BE"),
   engine="ajv"
)

## [1] TRUE

This validation is done automatically when reading a fileMDB object or when setting collection members with the add_collection_member() function.
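As an alternative to writing such a JSON file by hand, a collection member document can be assembled as an R list and serialized. The sketch below assumes the jsonlite package is installed; the resource, table and column names are placeholders and do not refer to an existing MDB.

```r
library(jsonlite)

# Placeholder resource: all names below are illustrative assumptions
member <- list(
   "$schema" = "TKCat_BE_collection_1.0",
   "$id" = "MyResource_BE_1.0",
   collection = "BE",
   resource = "MyResource",
   tables = list(
      list(
         name = "MyResource_genes",
         fields = list(
            be = list(static = TRUE, value = "Gene"),
            source = list(static = TRUE, value = "Ens_gene"),
            identifier = list(static = FALSE, value = "gene_id")
         )
      )
   )
)

# auto_unbox=TRUE writes length-one vectors as JSON scalars, as the
# collection schema expects strings and booleans rather than arrays
member_json <- jsonlite::toJSON(member, auto_unbox = TRUE, pretty = TRUE)

# Reading the text back gives the same nested structure
parsed <- jsonlite::fromJSON(member_json, simplifyVector = FALSE)
```

The resulting text could then be checked against the collection definition with jsonvalidate::json_validate(), in the same way as the file-based validation shown in this section.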

6.2.3 Collection mapper functions

The merge.MDB() and the map_collection_members() functions rely on functions to map members of the same collection. When recorded (using the import_collection_mapper() function), these functions can be automatically identified by TKCat. Otherwise, or according to user needs, these functions can be provided using the funs (for merge.MDB()) or the fun (for map_collection_members()) parameters. Two mappers are pre-recorded in TKCat, one for the BE collection and one for the Condition collection. They can be retrieved with the get_collection_mapper() function.

get_collection_mapper("BE")

function (x, y, orthologs = FALSE, restricted = FALSE, ...) {
    if (!requireNamespace("BED")) {
        stop("The BED package is required")
    }
    if (!BED::checkBedConn()) {
        stop("You need to connect to a BED database using", " the BED::connectToBed() function")
    }
    if (!"organism" %in% colnames(x)) {
        d <- x
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source ==
                source)
            organism <- BED::guessIdScope(toadd$identifier, be = be,
                source = source, tcLim = Inf) %>% attr("details") %>%
                filter(be == !!be & source == !!source) %>% pull(organism) %>%
                unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        x <- nd %>% mutate(organism_type = "Scientific name")
    }
    if (!"organism" %in% colnames(y)) {
        d <- y
        scopes <- dplyr::distinct(d, be, source)
        nd <- c()
        for (i in 1:nrow(scopes)) {
            be <- scopes$be[i]
            source <- scopes$source[i]
            toadd <- d %>% dplyr::filter(be == be, source ==
                source)
            organism <- BED::guessIdScope(toadd$identifier, be = be,
                source = source, tcLim = Inf) %>% attr("details") %>%
                filter(be == !!be & source == !!source) %>% pull(organism) %>%
                unique()
            toadd <- merge(toadd, tibble(organism = organism))
            nd <- bind_rows(nd, toadd)
        }
        y <- nd %>% mutate(organism_type = "Scientific name")
    }
    xscopes <- dplyr::distinct(x, be, source, organism, organism_type)
    yscopes <- dplyr::distinct(y, be, source, organism, organism_type)
    toRet <- NULL
    for (i in 1:nrow(xscopes)) {
        xscope <- xscopes[i, ]
        if (any(apply(xscope, 2, is.na))) {
            (next)()
        }
        xi <- dplyr::right_join(x, xscope, by = c("be", "source",
            "organism", "organism_type"))
        xorg <- ifelse(xscope$organism_type == "NCBI taxon identifier",
            BED::getOrgNames(xscope$organism) %>% dplyr::filter(nameClass ==
                "scientific name") %>% dplyr::pull(name), xscope$organism)
        for (j in 1:nrow(yscopes)) {
            yscope <- yscopes[j, ]
            if (any(apply(yscope, 2, is.na))) {
                (next)()
            }
            yi <- dplyr::right_join(y, yscope, by = c("be", "source",
                "organism", "organism_type"))
            yorg <- ifelse(yscope$organism_type == "NCBI taxon identifier",
                BED::getOrgNames(yscope$organism) %>% dplyr::filter(nameClass ==
                  "scientific name") %>% dplyr::pull(name), yscope$organism)
            if (xorg == yorg || orthologs) {
                xy <- BED::convBeIds(ids = xi$identifier, from = xscope$be,
                  from.source = xscope$source, from.org = xorg,
                  to = yscope$be, to.source = yscope$source,
                  to.org = yorg, restricted = restricted) %>%
                  dplyr::as_tibble() %>% dplyr::select(from,
                  to)
                if (restricted) {
                  xy <- dplyr::bind_rows(xy, BED::convBeIds(ids = yi$identifier,
                    from = yscope$be, from.source = yscope$source,
                    from.org = yorg, to = xscope$be, to.source = xscope$source,
                    to.org = xorg, restricted = restricted) %>%
                    dplyr::as_tibble() %>% dplyr::select(to = from,
                    from = to))
                }
                xy <- xy %>% dplyr::rename(identifier_x = "from",
                  identifier_y = "to") %>% dplyr::mutate(be_x = xscope$be,
                  source_x = xscope$source, organism_x = xscope$organism,
                  be_y = yscope$be, source_y = yscope$source,
                  organism_y = yscope$organism)
                toRet <- dplyr::bind_rows(toRet, xy)
            }
        }
    }
    toRet <- dplyr::distinct(toRet)
    return(toRet)
}

A mapper function must have at least an x and a y parameter. Each of them should be a data.frame with all the field values corresponding to the fields defined in the collection. Additional parameters can be defined and will be forwarded using "...". This function should return a data frame with all the field values followed by the “_x” and “_y” suffixes accordingly.
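The contract described above can be illustrated with a much simpler mapper than the pre-recorded one. The function below is a hypothetical sketch in base R, not a TKCat function: it assumes members carrying only the be, source and identifier fields and maps records whose three values are strictly identical, without any identifier conversion.

```r
# Hypothetical mapper: exact matching on be, source and identifier.
# ignore_case is an illustrative additional parameter.
exact_match_mapper <- function(x, y, ignore_case = FALSE, ...) {
   fields <- c("be", "source", "identifier")
   stopifnot(all(fields %in% colnames(x)), all(fields %in% colnames(y)))
   x <- unique(x[, fields, drop = FALSE])
   y <- unique(y[, fields, drop = FALSE])
   # Build a single matching key from the three fields
   key <- function(d) {
      k <- paste(d$be, d$source, d$identifier, sep = "\r")
      if (ignore_case) toupper(k) else k
   }
   x$key <- key(x)
   y$key <- key(y)
   # merge() appends the "_x" and "_y" suffixes to the shared column names
   toRet <- merge(x, y, by = "key", suffixes = c("_x", "_y"))
   toRet$key <- NULL
   toRet
}

# Toy data with made-up identifiers
x <- data.frame(
   be = "Gene", source = "Ens_gene", identifier = c("ID1", "ID2"),
   stringsAsFactors = FALSE
)
y <- data.frame(
   be = "Gene", source = "Ens_gene", identifier = c("ID2", "ID3"),
   stringsAsFactors = FALSE
)
res <- exact_match_mapper(x, y)
```

Only "ID2" is shared by the two toy tables, so res contains a single row. Such a function could presumably be passed through the fun or funs parameters mentioned above; this has not been verified here.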

6.3 Remarks about supported data format and data types

Most of the data formats and data types supported by the ReDaMoR and the TKCat packages are taken into account in the examples described in the main sections of this vignette. Nevertheless, one specific data format (matrix) and one specific data type (base64) are not exemplified. This appendix provides a short description of this format and this type.

6.3.1 Matrices of values

ReDaMoR and TKCat support data frame and matrix objects. Data frame is by far the most used data format. However, matrices of values can be useful in some use cases. The example below shows how such a data format is modeled in ReDaMoR as a 3-column table: one of type “row” corresponding to the row names of the matrix, one of type “column” corresponding to the column names of the matrix, and one of any type (except “row”, “column”, or “base64”).

d <- matrix(
   rnorm(40), nrow=10,
   dimnames=list(
      paste0("g", 1:10),
      paste0("s", 1:4)
   )
)
m <- ReDaMoR::df_to_model(d) %>%
   ReDaMoR::rename_field("d", "row", "gene") %>%
   update_field("d", "gene", comment="Gene identifier") %>%
   ReDaMoR::rename_field("d", "column", "sample") %>%
   update_field("d", "sample", comment="Sample identifier") %>%
   ReDaMoR::rename_field("d", "value", "expression") %>%
   update_field(
      "d", "expression", nullable=FALSE, comment="Gene expression value"
   )
md <- memoMDB(list(d=d), m, list(name="Matrix example"))
plot(data_model(md))
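To picture what the 3-column model stands for, the base R sketch below unfolds a small matrix into a long table with one row per cell and folds it back. It is only a conceptual illustration and makes no claim about how ReDaMoR or TKCat handle matrices internally; the object names are illustrative.

```r
# Small toy matrix with row and column names
d <- matrix(
   seq_len(6), nrow = 3,
   dimnames = list(paste0("g", 1:3), paste0("s", 1:2))
)

# One record per cell: row name, column name and value.
# as.vector() reads the matrix column by column, hence times/each below.
long <- data.frame(
   gene = rep(rownames(d), times = ncol(d)),
   sample = rep(colnames(d), each = nrow(d)),
   expression = as.vector(d),
   stringsAsFactors = FALSE
)

# Folding the long table back gives the original matrix
back <- matrix(
   long$expression,
   nrow = length(unique(long$gene)),
   dimnames = list(unique(long$gene), unique(long$sample))
)
```

The long table has nrow(d) * ncol(d) records, and the round trip shows that the three columns carry the same information as the matrix.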

6.3.2 Documents stored as base64 values

Whole documents can be stored in MDB as “base64” character values. The example below shows how a document can be put in a table and the corresponding data model.

ch_config_files <- tibble(
   name=c("config.xml", "users.xml"),
   file=c(
      base64enc::base64encode(
         system.file("ClickHouse/config.xml", package="TKCat")
      ),
      base64enc::base64encode(
         system.file("ClickHouse/users.xml", package="TKCat")
      )
   )
)
m <- df_to_model(ch_config_files) %>%
   ## The name column holds plain file names: it is typed as character
   update_field(
      "ch_config_files", "name",
      type="character", comment="Name of the config file",
      nullable=FALSE, unique=TRUE
   ) %>%
   ## The file column holds the encoded documents
   update_field(
      "ch_config_files", "file",
      type="base64", comment="Config file in base64 format",
      nullable=FALSE
   )
md <- memoMDB(
   list(ch_config_files=ch_config_files), m, list(name="base64 example")
)
plot(data_model(md))
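A document stored this way can be restored by decoding the character value. The sketch below assumes the base64enc package is installed and uses a temporary toy file rather than the ClickHouse configuration files; the file content is made up for the illustration.

```r
library(base64enc)

# Toy document written to a temporary file
original <- tempfile(fileext = ".xml")
writeLines(c("<config>", "  <port>9000</port>", "</config>"), original)

# Encoding a file path returns the file content as a base64 character string
encoded <- base64enc::base64encode(original)

# Decoding returns a raw vector that can be written back to disk
restored <- tempfile(fileext = ".xml")
writeBin(base64enc::base64decode(encoded), restored)

same_content <- identical(readLines(restored), readLines(original))
```

same_content is TRUE when the round trip preserved the document.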

References

François, Liesbeth, Jonathan van Eyll, and Patrice Godard. 2020. “Dictionary of Disease Ontologies (DODO): A Graph Database to Facilitate Access and Interaction with Disease and Phenotype Ontologies.” F1000Research 9 (August): 942. https://doi.org/10.12688/f1000research.25144.1.

Godard, Patrice, and Jonathan van Eyll. 2018. “BED: A Biological Entity Dictionary Based on a Graph Data Model.” F1000Research 7: 195. https://doi.org/10.12688/f1000research.13925.3.

Kinsella, R. J., A. Kahari, S. Haider, J. Zamora, G. Proctor, G. Spudich, J. Almeida-King, et al. 2011. “Ensembl BioMarts: A Hub for Data Retrieval Across Taxonomic Space.” Database 2011 (0): bar030–30. https://doi.org/10.1093/database/bar030.

Köhler, Sebastian, Leigh Carmody, Nicole Vasilevsky, Julius O B Jacobsen, Daniel Danis, Jean-Philippe Gourdine, Michael Gargano, et al. 2019. “Expansion of the Human Phenotype Ontology (HPO) Knowledge Base and Resources.” Nucleic Acids Research 47 (D1): D1018–27. https://doi.org/10.1093/nar/gky1105.

Landrum, Melissa J., Jennifer M. Lee, Mark Benson, Garth R. Brown, Chen Chao, Shanmuga Chitipiralla, Baoshan Gu, et al. 2018. “ClinVar: Improving Access to Variant Interpretations and Supporting Evidence.” Nucleic Acids Research 46 (D1): D1062–67. https://doi.org/10.1093/nar/gkx1153.

Mendez, David, Anna Gaulton, A Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, et al. 2019. “ChEMBL: Towards Direct Deposition of Bioassay Data.” Nucleic Acids Research 47 (D1): D930–40. https://doi.org/10.1093/nar/gky1075.

Wu, Chunlei, Ian MacLeod, and Andrew I. Su. 2012. “BioGPS and MyGene.info: Organizing Online, Gene-Centric Information.” Nucleic Acids Research 41 (D1): D561–65. https://doi.org/10.1093/nar/gks1114.
