Movatterモバイル変換


[0]ホーム

URL:


Title:Prioritizing Cancer Driver Genes Using Genomics Data
Version:0.4.1
Maintainer:Ege Ulgen <egeulgen@gmail.com>
Description:Cancer genomes contain large numbers of somatic alterations but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most of current approaches either identify driver genes based on mutational recurrence or using estimated scores predicting the functional consequences of mutations. 'driveR' is a tool for personalized or batch analysis of genomic data for driver gene prioritization by combining genomic information and prior biological knowledge. As features, 'driveR' uses coding impact metaprediction scores, non-coding impact scores, somatic copy number alteration scores, hotspot gene/double-hit gene condition, 'phenolyzer' gene scores and memberships to cancer-related KEGG pathways. It uses these features to estimate cancer-type-specific probability for each gene of being a cancer driver using the related task of a multi-task learning classification model. The method is described in detail in Ulgen E, Sezerman OU. 2021. driveR: driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics <doi:10.1186/s12859-021-04203-7>.
License:MIT + file LICENSE
Encoding:UTF-8
LazyData:true
RoxygenNote:7.2.3
URL:https://egeulgen.github.io/driveR/,https://github.com/egeulgen/driveR/
BugReports:https://github.com/egeulgen/driveR/issues
Imports:caret, randomForest, GenomicRanges, GenomeInfoDb,GenomicFeatures, TxDb.Hsapiens.UCSC.hg19.knownGene,TxDb.Hsapiens.UCSC.hg38.knownGene, S4Vectors, org.Hs.eg.db,rlang,
Depends:R (≥ 4.0)
Suggests:testthat, covr, knitr, rmarkdown
VignetteBuilder:knitr
NeedsCompilation:no
Packaged:2023-08-19 13:44:30 UTC; egeulgen
Author:Ege UlgenORCID iD [aut, cre, cph]
Repository:CRAN
Date/Publication:2023-08-19 14:02:36 UTC

driveR: An R Package for Prioritizing Cancer Driver Genes Using Genomics Data

Description

Cancer genomes contain large numbers of somatic alterations but few genesdrive tumor development. Identifying cancer driver genes is criticalfor precision oncology. Most of current approaches either identify drivergenes based on mutational recurrence or using estimated scores predictingthe functional consequences of mutations.

Details

driveR is a tool for personalized or batch analysis of genomic data fordriver gene prioritization by combining genomic information and priorbiological knowledge. As features, driveR uses coding impact metapredictionscores, non-coding impact scores, somatic copy number alteration scores,hotspot gene/double-hit gene condition, 'phenolyzer' gene scores andmemberships to cancer-related KEGG pathways. It uses these features toestimate cancer-type-specific probabilities for each gene of being a cancerdriver using the related task of a multi-task learning classification model.

Author(s)

Maintainer: Ege Ulgenegeulgen@gmail.com (ORCID) [copyright holder]

See Also

predict_coding_impact for metaprediction of impact ofcoding variants.create_features_df for creating the features table tobe used to prioritize cancer driver genes.Seeprioritize_driver_genes for prioritizing cancer driver genes


KEGG "Pathways in cancer"-related Pathways - Descriptions

Description

A data frame containing descriptions for KEGG "Pathways in cancer"(hsa05200)-related pathways.Generated on Nov 17, 2020.

Usage

KEGG_cancer_pathways_descriptions

Format

A data frame with 21 rows and 2 variables:

id

KEGG pathway ID

description

KEGG pathway description


MTL Sub-model Descriptions

Description

A data frame containing descriptions for all sub-models of the MTL model.

Usage

MTL_submodel_descriptions

Format

A data frame with 21 rows and 2 variables:

short_name

short name for the cancer type

description

description of the cancer type


Create SCNA Score Data Frame

Description

Create SCNA Score Data Frame

Usage

create_SCNA_score_df(  gene_SCNA_df,  build = "GRCh37",  log2_ratio_threshold = 0.25,  MCR_overlap_threshold = 25)

Arguments

gene_SCNA_df

data frame of gene-level SCNAs (output ofcreate_gene_level_scna_df)

build

genome build for the SCNA segments data frame (default = "GRCh37")

log2_ratio_threshold

the log2ratio threshold for keeping high-confidence SCNA events (default = 0.25)

MCR_overlap_threshold

the percentage threshold for the overlap betweena gene and an MCR region (default = 25). This means that if only a geneoverlaps an MCR region more than this threshold, the gene is assigned theSCNA density of the MCR

Details

The function first aggregates SCNA log2 ratioon gene-level (by keeping the ratio with the maximal |log2|ratio over all the SCNA segments overlapping a gene). Next, it identifies theminimal common regions (MCRs) that the genes overlap and finally assigns theSCNA density (SCNA/Mb) values as proxy SCNA scores.

Value

data frame of SCNA proxy scores containing 2 columns:

gene_symbol

HGNC gene symbol

SCNA_density

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located.


Create Data Frame of Features for Driver Gene Prioritization

Description

Create Data Frame of Features for Driver Gene Prioritization

Usage

create_features_df(  annovar_csv_path,  scna_df,  phenolyzer_annotated_gene_list_path,  batch_analysis = FALSE,  prep_phenolyzer_input = FALSE,  build = "GRCh37",  log2_ratio_threshold = 0.25,  gene_overlap_threshold = 25,  MCR_overlap_threshold = 25,  hotspot_threshold = 5L,  log2_hom_loss_threshold = -1,  verbose = TRUE,  na.string = ".")

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

scna_df

the SCNA segments data frame. Must contain:

chr

chromosome the segment is located in

start

start position of the segment

end

end position of the segment

log2ratio
log2

ratio of the segment

phenolyzer_annotated_gene_list_path

path to 'phenolyzer'"annotated_gene_list" file

batch_analysis

boolean to indicate whether to perform batch analysis(TRUE, default) or personalized analysis (FALSE). IfTRUE,a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNAtable.

prep_phenolyzer_input

boolean to indicate whether or not to createa vector of genes for use as the input of 'phenolyzer' (default =FALSE).IfTRUE, the features data frame is not created and instead the vectorof gene symbols (union of all genes for which scores are available) isreturned.

build

genome build for the SCNA segments data frame (default = "GRCh37")

log2_ratio_threshold

the log2ratio threshold for keeping high-confidence SCNA events (default = 0.25)

gene_overlap_threshold

the percentage threshold for the overlap betweena segment and a transcript (default = 25). This means that if only a segmentoverlaps a transcript more than this threshold, the transcript is assignedthe segment's SCNA event.

MCR_overlap_threshold

the percentage threshold for the overlap betweena gene and an MCR region (default = 25). This means that if only a geneoverlaps an MCR region more than this threshold, the gene is assigned theSCNA density of the MCR

hotspot_threshold

to determine hotspot genes, the (integer) thresholdfor the minimum number of cases with certain mutation in COSMIC (default = 5)

log2_hom_loss_threshold

to determine double-hit events, thelog2 threshold for identifyinghomozygous loss events (default = -1).

verbose

boolean controlling verbosity (default =TRUE)

na.string

string that was used to indicate when a score is not availableduring annotation with ANNOVAR (default = ".")

Value

Ifprep_phenolyzer_input=FALSE (default), a data frame offeatures for prioritizing cancer driver genes (gene_symbol asthe first column and 26 other columns containing features). Ifprep_phenolyzer_input=TRUE, the functions returns a vector gene symbols(union of all gene symbols for which scores are available) to be used as theinput for performing 'phenolyzer' analysis.

The features data frame contains the following columns:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway

See Also

prioritize_driver_genes for prioritizing cancer driver genes

Examples

path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",                                package = "driveR")path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",                                   package = "driveR")features_df <- create_features_df(annovar_csv_path = path2annovar_csv,                                  scna_df = example_scna_table,                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)

Create Gene-level SCNA Data Frame

Description

Create Gene-level SCNA Data Frame

Usage

create_gene_level_scna_df(  scna_df,  build = "GRCh37",  gene_overlap_threshold = 25)

Arguments

scna_df

the SCNA segments data frame. Must contain:

chr

chromosome the segment is located in

start

start position of the segment

end

end position of the segment

log2ratio
log2

ratio of the segment

build

genome build for the SCNA segments data frame (default = "GRCh37")

gene_overlap_threshold

the percentage threshold for the overlap betweena segment and a transcript (default = 25). This means that if only a segmentoverlaps a transcript more than this threshold, the transcript is assignedthe segment's SCNA event.

Value

data frame of gene-level SCNA events, i.e. table of genes overlappedby SCNA segments.


Create Non-coding Impact Score Data Frame

Description

Create Non-coding Impact Score Data Frame

Usage

create_noncoding_impact_score_df(annovar_csv_path, na.string = ".")

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

na.string

string that was used to indicate when a score is not availableduring annotation with ANNOVAR (default = ".")

Value

data frame of meta-prediction scores containing 2 columns:

gene_symbol

HGNC gene symbol

CADD_phred

PHRED-scaled CADD score


Determine Double-Hit Genes

Description

Determine Double-Hit Genes

Usage

determine_double_hit_genes(  annovar_csv_path,  gene_SCNA_df,  log2_hom_loss_threshold = -1,  batch_analysis = FALSE)

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

gene_SCNA_df

data frame of gene-level SCNAs (output ofcreate_gene_level_scna_df)

log2_hom_loss_threshold

to determine double-hit events, thelog2 threshold for identifyinghomozygous loss events (default = -1).

batch_analysis

boolean to indicate whether to perform batch analysis(TRUE, default) or personalized analysis (FALSE). IfTRUE,a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNAtable.

Value

vector of gene symbols that are subject to double-hit event(s), i.e.non-synonymous mutation + homozygous copy-number loss


Determine Hotspot Containing Genes

Description

Determine Hotspot Containing Genes

Usage

determine_hotspot_genes(annovar_csv_path, hotspot_threshold = 5L)

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

hotspot_threshold

to determine hotspot genes, the (integer) thresholdfor the minimum number of cases with certain mutation in COSMIC (default = 5)

Value

vector of gene symbols of genes containing hotspot mutation(s)


Example Cohort-level Features Table for Driver Prioritization

Description

The example dataset containing features for prioritizing cancer driver genes for 10 randomlyselected samples from TCGA's LAML (Acute Myeloid Leukemia) cohort

Usage

example_cohort_features_table

Format

A data frame with 349 rows and 27 variables:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway

See Also

KEGG_cancer_pathways_descriptions for descriptions ofKEGG "Pathways in cancer"-related pathways.


Example Cohort-level Somatic Copy Number Alteration Table

Description

A data set containing the somatic copy number alteration data for 10 randomlyselected samples from TCGA's LAML (Acute Myeloid Leukemia) cohort

Usage

example_cohort_scna_table

Format

A data frame with 126147 rows and 5 variables:

chr

chromosome the segment is located in

start

start position of the segment

end

end position of the segment

log2ratio
log2

ratio ofthe segment

tumor_id

ID for the tumor containing the SCNA segment

Source

https://dcc.icgc.org/releases/release_28


Example Features Table for Driver Prioritization

Description

The example dataset containing features for prioritizing cancer driver genes forthe lung adenocarcinoma patient studied in Imielinski M, Greulich H, Kaplan B, et al.Oncogenic and sorafenib-sensitive ARAF mutations in lung adenocarcinoma.J Clin Invest. 2014;124(4):1582-6.

Usage

example_features_table

Format

A data frame with 4901 rows and 27 variables:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway

See Also

KEGG_cancer_pathways_descriptions for descriptions ofKEGG "Pathways in cancer"-related pathways.


Example Somatic Copy Number Alteration Table

Description

A data set containing the somatic copy number alteration data for the lungadenocarcinoma patient studied in Imielinski M, Greulich H, Kaplan B, et al.Oncogenic and sorafenib-sensitive ARAF mutations in lung adenocarcinoma.J Clin Invest. 2014;124(4):1582-6.

Usage

example_scna_table

Format

A data frame with 3160 rows and 4 variables:

chr

chromosome the segment is located in

start

start position of the segment

end

end position of the segment

log2ratio
log2

ratio ofthe segment

Source

https://pubmed.ncbi.nlm.nih.gov/24569458/


Create Coding Impact Meta-prediction Score Data Frame

Description

Create Coding Impact Meta-prediction Score Data Frame

Usage

predict_coding_impact(  annovar_csv_path,  keep_highest_score = TRUE,  keep_single_symbol = TRUE,  na.string = ".")

Arguments

annovar_csv_path

path to 'ANNOVAR' csv output file

keep_highest_score

boolean to indicate whether to keep only the maximalimpact score per gene (default =TRUE). IfFALSE, all scoresper each gene are returned

keep_single_symbol

in ANNOVAR outputs, a variant may be annotated asexonic in multiple genes. This boolean argument controls whether or not tokeep only the first encountered symbol for a variant (default =TRUE)

na.string

string that was used to indicate when a score is not availableduring annotation with ANNOVAR (default = ".")

Value

data frame of meta-prediction scores containing 2 columns:

gene_symbol

HGNC gene symbol

metaprediction_score

metapredictor impact score

Examples

path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",                                package = "driveR")metapred_df <- predict_coding_impact(path2annovar_csv)

Prioritize Cancer Driver Genes

Description

Prioritize Cancer Driver Genes

Usage

prioritize_driver_genes(features_df, cancer_type)

Arguments

features_df

the features data frame for all genes, containing the following columns:

gene_symbol

HGNC gene symbol

metaprediction_score

the maximum metapredictor (coding) impact score for the gene

noncoding_score

the maximum non-coding PHRED-scaled CADD score for the gene

scna_score

SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located

hotspot_double_hit

boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)

phenolyzer_score

'phenolyzer' score for the gene

hsa03320

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04010

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04020

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04024

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04060

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04066

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04110

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04115

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04150

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04151

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04210

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04310

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04330

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04340

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04350

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04370

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04510

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04512

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04520

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04630

boolean indicating whether or not the gene takes part in this KEGG pathway

hsa04915

boolean indicating whether or not the gene takes part in this KEGG pathway

cancer_type

short name of the cancer type. All available cancer typesare listed inMTL_submodel_descriptions

Value

data frame with 3 columns:

gene_symbol

HGNC gene symbol

driverness_prob

estimated probability for each gene infeatures_df of being acancer driver. The probabilities are calculated using the selected (viacancer_type) cancer type's sub-model.

prediction

prediction based on the cancer-type-specific threshold (either "driver" or "non-driver")

See Also

create_features_df for creating the features table.

Examples

drivers_df <- prioritize_driver_genes(example_features_table, "LUAD")

Tumor type specific probability thresholds

Description

Driver gene probability thresholds for all 21 cancer types (submodels).

Usage

specific_thresholds

Format

vector with 21 elements


[8]ページ先頭

©2009-2025 Movatter.jp