ELTEbioinformatics/muleaPublic

forked fromBiocodings/MulEA

NotificationsYou must be signed in to change notification settings
Fork1
Star28

mulea - an R package for enrichment analysis using various ontologies and empirical false discovery rate

doi.org/10.1186/s12859-024-05948-7

28 stars 4 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 399 Commits
.github/workflows		.github/workflows
R		R
inst		inst
man		man
src		src
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CRAN-SUBMISSION		CRAN-SUBMISSION
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
codemeta.json		codemeta.json
cran-comments.md		cran-comments.md
references.bib		references.bib

Repository files navigation

`mulea` - an R Package for Enrichment Analysis Using Multiple Ontologies and Empirical False Discovery Rate

Introduction

Themulea R package (Turek et al. 2024) is a comprehensive tool forfunctional enrichment analysis. It provides two different approaches:

For unranked sets of elements, such as significantly up- ordown-regulated genes,mulea employs the set-basedoverrepresentation analysis (ORA).
Alternatively, if the data consists of ranked elements, forinstance, genes ordered byp-value or log fold-change calculatedby the differential expression analysis,mulea offers thegeneset enrichment (GSEA) approach.

For the overrepresentation analysis,mulea employs a progressiveempirical false discovery rate (eFDR) method, specifically designedfor interconnected biological data, to accurately identify significantterms within diverse ontologies.

mulea expands beyond traditional tools by incorporating awide rangeof ontologies, encompassing Gene Ontology, pathways, regulatoryelements, genomic locations, and protein domains for 27 model organisms,covering 22 ontology types from 16 databases and various identifiersresulting in 879 files available at theELTEbioinformatics/GMT_files_for_muleaGitHub repository and through themuleaDataExperimentData Bioconductor package.

Installation

Install the dependencyfgsea BioConductor package:

# Installing the BiocManager package if neededif (!require("BiocManager",quietly=TRUE))    install.packages("BiocManager")# Installing the fgsea package with the BiocManagerBiocManager::install("fgsea")

To installmulea fromCRAN:

install.packages("mulea")

To install the development version ofmulea from GitHub:

# Installing the devtools package if neededif (!require("devtools",quietly=TRUE))    install.packages("devtools")# Installing the mulea package from GitHubdevtools::install_github("https://github.com/ELTEbioinformatics/mulea")

Usage

First, load themulea anddplyr libraries. Thedplyrlibrary is not essential but is used here to facilitate datamanipulation and inspection.

library(mulea)library(tidyverse)

Importing the Ontology

This section demonstrates how to import the desired ontology, such astranscription factors and their target genes downloaded from the database, into a dataframe suitable for enrichment analysis. We present multiple methods forimporting the ontology. Ensure that the identifier type (e.g.,UniProt protein ID,Entrez ID, Gene Symbol,Ensembl gene ID)matches between the ontology and the elements you wish to investigate.

Alternative 1: Importing the Ontology from a GMT File

TheGMT (Gene MatrixTransposed)format contains collections of genes or proteins associated withspecific ontology terms in a tab-delimited text file. The GMT file canbe read into R as a data frame using theread_gmt function from themulea package. Each term is represented by a single row in both theGMT file and the data frame. Each row includes three types of elements:

Ontology identifier (“ontology_id”): This uniquely identifieseach term within the file or data frame.
Ontology name or description (“ontology_name”): This providesa user-friendly label or textual description for each term.
Associated gene or protein identifiers: These are listed in the“list_of_values” column, with identifiers separated by spaces, andbelong to each term.¹

A)`mulea` GMT File

Alongside with themulea package we provide ontologies collected from16 publicly available databases, in a standardised GMT format for 27model organisms, from Bacteria to human. These files are available attheELTEbioinformatics/GMT_files_for_muleaGitHub repository.

To read a downloaded GMT file locally:

# Reading the mulea GMT file locallytf_ontology<- read_gmt("Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")

Alternatively, one can read it directly from the GitHub repository:

# Reading the GMT file from the GitHub repositorytf_ontology<- read_gmt("https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Escherichia_coli_83333/Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")

B) Enrichr GMT File

mulea is compatible withGMTfiles provided with theEnricher R package (Kuleshov et al. 2016). Download and read such aGMT file (e. g. TRRUST_Transcription_Factors_2019.txt) locally.Note that this ontology is not suitable for analyzing the Escherichiacoli differential expression data described in the sectionTheDifferential Expression Dataset toAnalyse.

# Reading the Enrichr GMT file locallytf_enrichr_ontology<- read_gmt("TRRUST_Transcription_Factors_2019.txt")# The ontology_name is empty, therefore we need to fill it with the ontology_idtf_enrichr_ontology$ontology_name<-tf_enrichr_ontology$ontology_id

C) MsigDB GMT File

mulea is compatible with the MsigDB (Subramanian et al. 2005)GMTfiles. Download and read sucha GMT file (e. g. c3.tft.v2023.2.Hs.symbols.gmt) locally.Notethat this ontology is not suitable for analyzing the Escherichia colidifferential expression data described in the sectionThe DifferentialExpression Dataset toAnalyse.

# Reading the MsigDB GMT file locallytf_msigdb_ontology<- read_gmt("c3.tft.v2023.2.Hs.symbols.gmt")

Alternative 2: Importing the Ontology with the`muleaData` Package

Alternatively, you can retrieve the ontology using themuleaDataExperimentData Bioconductor package:

# Installing the ExperimentHub package from BioconductorBiocManager::install("ExperimentHub")# Calling the ExperimentHub library.library(ExperimentHub)# Downloading the metadata from ExperimentHub.eh<- ExperimentHub()# Creating the muleaData variable.muleaData<- query(eh,"muleaData")# Looking for the ExperimentalHub ID of the ontology.EHID<- mcols(muleaData) %>%   as.data.frame() %>%dplyr::filter(title=="Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.rds") %>%   rownames()# Reading the ontology from the muleaData package.tf_ontology<-muleaData[[EHID]]# Change the headertf_ontology<-tf_ontology %>%   rename(ontology_id="ontologyId",ontology_name="ontologyName",list_of_values="listOfValues")

Filtering the Ontology

Enrichment analysis results can sometimes be skewed by overly specificor broad entries.mulea allows you to customise the size of ontologyentries – the number of genes or proteins belonging to a term – ensuringyour analysis aligns with your desired scope.

Let’s exclude ontology entries with less than 3 or more than 400 genesymbols.

# Filtering the ontologytf_ontology_filtered<- filter_ontology(gmt=tf_ontology,min_nr_of_elements=3,max_nr_of_elements=400)

Saving the Ontology as a GMT file

You can save the ontology as a GMT file using thewrite_gmtfunction.

# Saving the ontology to GMT filewrite_gmt(gmt=tf_ontology_filtered,file="Filtered.gmt")

Converting a List to an Ontology Object

Themulea package provides thelist_to_gmt function to convert alist of gene sets into an ontology data frame. The following exampledemonstrates how to use this function:

# Creating a list of gene setsontology_list<-list(gene_set1= c("gene1","gene2","gene3"),gene_set2= c("gene4","gene5","gene6"))# Converting the list to a ontology (GMT) objectnew_ontology_df<- list_to_gmt(ontology_list)

The Differential Expression Dataset to Analyse

For further steps we will analyse a dataset from a microarray experiment(GSE55662)in the NCBI Gene Expression Omnibus. The study byMéhi et al. (2014) investigated antibiotic resistance evolution inEscherichia coli. Gene expression changes were compared betweenciprofloxacin antibiotic-treatedEscherichia coli bacteria andnon-treated controls.

The expression levels of these groups were compared with theGEO2R tool:

Non-treated control samples (2 replicates):WT_noCPR_1,WT_noCPR_2
Ciprofloxacin-treated samples (2 replicates):WT_CPR_1,WT_CPR_2

To see how the dataset were prepared go to theFormatting the Resultsof a Differential ExpressionAnalysissection.

The Unordered Set-Based Overrepresentation Analysis (ORA)

Themulea package implements a set-based enrichment analysis approachusing thehypergeometric test, which is analogous to the one-tailedFisher’s exact test. This method identifies statistically significantoverrepresentation of elements from a target set (e.g., significantlyup- or downregulated genes) within a background set (e.g., all genesthat were investigated in the experiment). Therefore, a predefinedthreshold value, such as 0.05 for the correctedp-value or 2-foldchange, should be used in the preceding analysis.

The overrepresentation analysis is implemented in theora functionwhich requires three inputs:

Ontologydata frame: Fits the investigated taxa and theapplied gene or protein identifier type, such as GO, pathway,transcription factor regulation, microRNA regulation, geneexpression data, genomic location data, or protein domain content.
Target set: A vector of elements to investigate, containinggenes or proteins of interest, such as significantly overexpressedgenes in the experiment.
Background set: A vector of background elements representing thebroader context, often including all genes investigated in thestudy.

Reading the Target and the Background Sets from Text Files

Let’s read the text files containing the identifiers (gene symbols) ofthe target and the background gene set directly from the GitHub website.To see how these files were prepared, refer to the section onFormatting the Results of a Differential ExpressionAnalysis.

# Taget settarget_set<- readLines("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/target_set.txt")# Background setbackground_set<- readLines("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/background_set.txt")

Performing the OverRepresentation Analysis

To perform the analysis, we will first establish a model using theorafunction. This model defines the parameters for the enrichment analysis.We then execute the test itself using therun_test function. It isimportant to note that for this example, we will employ 10,000permutations for theempirical false discovery rate (eFDR), which isthe recommended minimum, to ensure robust correction for multipletesting.

# Creating the ORA model using the GMT variableora_model<- ora(gmt=tf_ontology_filtered,# Test set variableelement_names=target_set,# Background set variablebackground_element_names=background_set,# p-value adjustment methodp_value_adjustment_method="eFDR",# Number of permutationsnumber_of_permutations=10000,# Number of processor threads to usenthreads=2,# Setting a random seed for reproducibilityrandom_seed=1)# Running the ORAora_results<- run_test(ora_model)

Examining the ORA Result

Theora_results data frame summarises the enrichment analysis, listingenriched ontology entries – in our case transcription factors –alongside their associatedp-values andeFDR values.

We can now determine the number of transcription factors classified as“enriched” based on these statistical measures (eFDR < 0.05).

ora_results %>%# Rows where the eFDR < 0.05  filter(eFDR<0.05) %>%# Number of such rows  nrow()#> [1] 10

Inspect the significant results:

ora_results %>%# Arrange the rows by the eFDR values  arrange(eFDR) %>%# Rows where the eFDR < 0.05  filter(eFDR<0.05)

ontology_id	ontology_name	nr_common_with_tested_elements	nr_common_with_background_elements	p_value	eFDR
FNR	FNR	26	259	0.0000003	0.0000000
LexA	LexA	14	53	0.0000000	0.0000000
SoxS	SoxS	7	37	0.0001615	0.0036667
Rob	Rob	5	21	0.0004717	0.0051200
DnaA	DnaA	4	13	0.0006281	0.0052000
FadR	FadR	5	20	0.0003692	0.0056000
NsrR	NsrR	8	64	0.0010478	0.0073714
ArcA	ArcA	12	148	0.0032001	0.0197500
IHF	IHF	14	205	0.0070758	0.0458600
MarA	MarA	5	37	0.0066068	0.0483111

Visualising the ORA Result

To gain a comprehensive understanding of the enriched transcriptionfactors,mulea offers diverse visualisation tools, includinglollipop charts, bar plots, networks, and heatmaps. These visualisationseffectively reveal patterns and relationships among the enrichedfactors.

Initialising the visualisation with thereshape_results function:

# Reshapeing the ORA results for visualisationora_reshaped_results<- reshape_results(model=ora_model,model_results=ora_results,# Choosing which column to use for the#     indication of significancep_value_type_colname="eFDR")

Visualising the Spread ofeFDR Values: Lollipop Plot

Lollipop charts provide a graphical representation of the distributionof enriched transcription factors. They-axis displays thetranscription factors, while thex-axis represents their correspondingeFDR values. The dots are coloured based on theireFDR values. Thisvisualisation helps us examine the spread ofeFDRs and identifyfactors exceeding the commonly used significance threshold of 0.05.

plot_lollipop(reshaped_results=ora_reshaped_results,# Column containing the names we wish to plotontology_id_colname="ontology_id",# Upper threshold for the value indicating the significancep_value_max_threshold=0.05,# Column that indicates the significance valuesp_value_type_colname="eFDR")

Visualising the Spread ofeFDR Values: Bar Plot

Bar charts offer a graphical representation similar to lollipop plots.They-axis displays the enriched ontology categories (e.g.,transcription factors), while thex-axis represents theircorrespondingeFDR values. The bars are coloured based on theireFDRvalues, aiding in examining the spread ofeFDRs and identifyingfactors exceeding the significance threshold of 0.05.

plot_barplot(reshaped_results=ora_reshaped_results,# Column containing the names we wish to plotontology_id_colname="ontology_id",# Upper threshold for the value indicating the significancep_value_max_threshold=0.05,# Column that indicates the significance valuesp_value_type_colname="eFDR")

Visualising the Associations: Graph Plot

This function generates a network visualisation of the enriched ontologycategories (e.g., transcription factors). Each node represents aneriched ontology category, coloured based on itseFDR value. An edgeis drawn between two nodes if they share at least one common genebelonging to the target set, indicating co-regulation. The thickness ofthe edge reflects the number of shared genes.

plot_graph(reshaped_results=ora_reshaped_results,# Column containing the names we wish to plotontology_id_colname="ontology_id",# Upper threshold for the value indicating the significancep_value_max_threshold=0.05,# Column that indicates the significance valuesp_value_type_colname="eFDR")

Visualising the Associations: Heatmap

The heatmap displays the genes associated with the enriched ontologycategories (e.g., transcription factors). Each row represents acategory, coloured based on itseFDR value. Each column represents agene from the target set belonging to the enriched ontology category,indicating potential regulation by one or more enriched transcriptionfactors.

plot_heatmap(reshaped_results=ora_reshaped_results,# Column containing the names we wish to plotontology_id_colname="ontology_id",# Column that indicates the significance valuesp_value_type_colname="eFDR")

Comparing the significant results when applying the eFDR to the Benjamini-Hochberg and the Bonferroni corrections

Theora function allows you to choose between different methods forcalculating theFDR and adjusting thep-values:eFDR, and allmethod options from thestats::p.adjust documentation (holm,hochberg, hommel, bonferroni, BH, BY, and fdr). The following codesnippet demonstrates how to perform the analysis using theBenjamini-Hochberg andBonferroni corrections:

# Creating the ORA model using the Benjamini-Hochberg p-value correction methodBH_ora_model<- ora(gmt=tf_ontology_filtered,# Test set variableelement_names=target_set,# Background set variablebackground_element_names=background_set,# p-value adjustment methodp_value_adjustment_method="BH")# Running the ORABH_results<- run_test(BH_ora_model)# Creating the ORA model using the Bonferroni p-value correction methodBonferroni_ora_model<- ora(gmt=tf_ontology_filtered,# Test set variableelement_names=target_set,# Background set variablebackground_element_names=background_set,# p-value adjustment methodp_value_adjustment_method="bonferroni")# Running the ORABonferroni_results<- run_test(Bonferroni_ora_model)

To compare the significant results (using the conventional < 0.05threshold) of theeFDR,Benjamini-Hochberg, andBonferronicorrections, we can merge and filter the result tables:

# Merging the Benjamini-Hochberg and eFDR resultsmerged_results<-BH_results %>%# Renaming the column  rename(BH_adjusted_p_value=adjusted_p_value) %>%# Selecting the necessary columns  select(ontology_id,BH_adjusted_p_value) %>%# Joining with the eFDR results  left_join(ora_results,.,by="ontology_id") %>%# Converting the data.frame to a tibble  tibble()# Merging the Bonferroni results with the merged resultsmerged_results<-Bonferroni_results %>%# Renaming the column  rename(Bonferroni_adjusted_p_value=adjusted_p_value) %>%# Selecting the necessary columns  select(ontology_id,Bonferroni_adjusted_p_value) %>%# Joining with the eFDR results  left_join(merged_results,.,by="ontology_id") %>%# Arranging by the p-value  arrange(p_value)# filter the p-value < 0.05 resultsmerged_results_filtered<-merged_results %>%   filter(p_value<0.05) %>%# remove the unnecessary columns  select(-ontology_id,-nr_common_with_tested_elements,-nr_common_with_background_elements)

ontology_name	p_value	eFDR	BH_adjusted_p_value	Bonferroni_adjusted_p_value
LexA	0.0000000	0.0000000	0.0000001	0.0000001
FNR	0.0000003	0.0000000	0.0000208	0.0000416
SoxS	0.0001615	0.0036667	0.0082880	0.0248641
FadR	0.0003692	0.0056000	0.0142127	0.0568507
Rob	0.0004717	0.0051200	0.0145296	0.0726479
DnaA	0.0006281	0.0052000	0.0161218	0.0967306
NsrR	0.0010478	0.0073714	0.0230517	0.1613622
ArcA	0.0032001	0.0197500	0.0616014	0.4928114
MarA	0.0066068	0.0483111	0.1089670	1.0000000
IHF	0.0070758	0.0458600	0.1089670	1.0000000
NarL	0.0096065	0.0534000	0.1276532	1.0000000
NikR	0.0099470	0.0615833	0.1276532	1.0000000
OxyR	0.0174505	0.0786923	0.2067212	1.0000000
ExuR	0.0261046	0.1051867	0.2680073	1.0000000
UxuR	0.0261046	0.1051867	0.2680073	1.0000000
NrdR	0.0328500	0.1232750	0.3161817	1.0000000
IscR	0.0376038	0.1249412	0.3406459	1.0000000
Nac	0.0419701	0.1487556	0.3590774	1.0000000
Fis	0.0457307	0.1433053	0.3706596	1.0000000

A comparison of the significant results revealed that conventionalp-value corrections (Benjamini-Hochberg and Bonferroni) tend to beoverly conservative, leading to a reduction in the number of significanttranscription factors compared to theeFDR. As illustrated in thebelow figure, by applying theeFDR we were able to identify 10significant transcription factors, while with the Benjamini-Hochberg andBonferroni corrections only 7 and 3, respectively. This suggests thattheeFDR may be a more suitable approach for controlling falsepositives in this context.

Gene Set Enrichment Analysis (GSEA)

To perform enrichment analysis using ranked lists, you need to providean ordered list of elements, such as genes or proteins. This ranking istypically based on the results of your prior analysis, using metricslikep-values,z-scores, fold-changes, or others. Crucially, theranked list should include all elements involved in your analysis. Forexample, in a differential expression study, it should encompass allgenes that were measured.

mulea utilises the Kolmogorov-Smirnov approach with a permutation test(developed by Subramanian et al. (2005)) to calculate gene setenrichment analyses. This functionality is implemented through theintegration of thefgseaBioconductor package (created by Korotkevich et al. (2021)).

GSEA requires input data about the genes analysed in our experiment.This data can be formatted in two ways:

Data frame: This format should include all genes investigatedand their respective log fold change values (or other values forordering the genes) obtained from the differential expressionanalysis.
Two vectors: Alternatively, you can provide two separatevectors. One vector should contain the gene symbols (or IDs), andthe other should hold the corresponding log fold change values (orother values for ordering the genes) for each gene.

Reading the Tab Delimited File Containing the Ordered Set

Let’s read the TSV file containing the identifiers (gene symbols) andthe log fold change values of the investigated set directly from theGitHub website. For details on how this file was prepared, please referto theFormatting the Results of a Differential ExpressionAnalysissection.

# Reading the tsv containing the ordered setordered_set<- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/ordered_set.tsv")

Performing the Gene Set Enrichment Analysis

To perform the analysis, we will first establish a model using thegsea function. This model defines the parameters for the enrichmentanalysis. Subsequently, we will execute the test itself using therun_test function. We will employ 10,000 permutations for the falsediscovery rate, to ensure robust correction for multiple testing.

# Creating the GSEA model using the GMT variablegsea_model<- gsea(gmt=tf_ontology_filtered,# Names of elements to testelement_names=ordered_set$Gene.symbol,# LogFC-s of elements to testelement_scores=ordered_set$logFC,# Consider elements having positive logFC values onlyelement_score_type="pos",# Number of permutationsnumber_of_permutations=10000)# Running the GSEAgsea_results<- run_test(gsea_model)

Examining the GSEA Results

Thegsea_results data frame summarises the enrichment analysis,listing enriched ontology entries – in our case transcription factors –alongside their associatedp-values and adjustedp-value values.

We can now determine the number of transcription factors classified as“enriched” based on these statistical measures (adjustedp-value <0.05).

gsea_results %>%# rows where the adjusted_p_value < 0.05  filter(adjusted_p_value<0.05) %>%# the number of such rows  nrow()#> [1] 10

Inspect the significant results:

gsea_results %>%# arrange the rows by the adjusted_p_value values  arrange(adjusted_p_value) %>%# rows where the adjusted_p_value < 0.05  filter(adjusted_p_value<0.05)

ontology_id	ontology_name	nr_common_with_tested_elements	p_value	adjusted_p_value
LexA	LexA	53	0.0000000	0.0000047
FNR	FNR	259	0.0000660	0.0050484
ArcA	ArcA	148	0.0003076	0.0079598
GlaR	GlaR	3	0.0002188	0.0079598
ModE	ModE	45	0.0003122	0.0079598
SoxS	SoxS	37	0.0002848	0.0079598
DnaA	DnaA	13	0.0010217	0.0223306
PaaX	PaaX	14	0.0017028	0.0325652
PspF	PspF	7	0.0023494	0.0399397
FadR	FadR	20	0.0028304	0.0433046

Visualising the GSEA Results

Initializing the visualisation with thereshape_results function:

# Reshaping the GSEA results for visualisationgsea_reshaped_results<- reshape_results(model=gsea_model,model_results=gsea_results,# choosing which column to use for the# indication of significancep_value_type_colname="adjusted_p_value")

Visualising Relationships: Graph Plot

This function generates a network visualisation of the enriched ontologycategories (e.g., transcription factors). Each node represents acategory and is coloured based on its significance level. A connection(edge) is drawn between two nodes if they share at least one common genebelonging to theranked list, meaning that both transcriptionfactors regulate the expression of the same target gene. The thicknessof the edge reflects the number of shared genes.

plot_graph(reshaped_results=gsea_reshaped_results,# the column containing the names we wish to plotontology_id_colname="ontology_id",# upper threshold for the value indicating the significancep_value_max_threshold=0.05,# column that indicates the significance valuesp_value_type_colname="adjusted_p_value")

Other plot types such as lollipop plots, bar plots, and heatmaps canalso be used to investigate the GSEA results.

Formatting the Results of a Differential Expression Analysis

Understanding the Differential Expression Results Table

This section aims to elucidate the structure and essential components ofthe provided DE results table. It offers guidance to users oninterpreting the data effectively for subsequent analysis withmulea.

Let’s read the differential expression result file namedGSE55662.table_wt_non_vs_cipro.tsvlocated in theinst/extdata/folder directly from the GitHub website.

# Importing necessary libraries and reading the DE results tablegeo2r_result_tab<- read_tsv("https://raw.githubusercontent.com/ELTEbioinformatics/mulea/master/inst/extdata/GSE55662.table_wt_non_vs_cipro.tsv")

Let’s delve into thegeo2r_result_tab data frame by examining itsinitial rows:

# Printing the first few rows of the data framegeo2r_result_tab %>%    head(3)

ID	adj.P.Val	P.Value	t	B	logFC	Gene.symbol	Gene.title
1765336_s_at	0.0186	2.4e-06	21.5	4.95769	3.70	gnsB	Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts)
1760422_s_at	0.0186	3.8e-06	19.6	4.68510	3.14	NA	NA
1764904_s_at	0.0186	5.7e-06	18.2	4.43751	2.54	sulA///sulA///sulA///ECs1042	SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor

Data Preparation:

Preparing the data frame appropriately for enrichment analysis iscrucial. This involves specific steps tailored to the microarrayexperiment type. Here, we undertake the following transformations:

Gene Symbol Extraction: We isolate the primary gene symbol fromtheGene.symbol column, eliminating any extraneous information.
Handling Missing Values: Rows with missing gene symbols (NA) areexcluded.
Sorting by Fold Change: The data frame is sorted by log-foldchange (logFC) in descending order, prioritizing genes with the mostsignificant expression alterations.

# Formatting the data framegeo2r_result_tab<-geo2r_result_tab %>%# Extracting the primary gene symbol and removing extraneous information  mutate(Gene.symbol= str_remove(string=Gene.symbol,pattern="\\/.*")) %>%# Filtering out rows with NA gene symbols  filter(!is.na(Gene.symbol)) %>%# Sorting by logFC  arrange(desc(logFC))

Before proceeding with enrichment analysis, let’s examine the initialrows of the formattedgeo2r_result_tab data frame:

# Printing the first few rows of the formatted data framegeo2r_result_tab %>%    head(3)

ID	adj.P.Val	P.Value	t	B	logFC	Gene.symbol	Gene.title
1765336_s_at	0.0186	2.40e-06	21.5	4.95769	3.70	gnsB	Qin prophage; multicopy suppressor of secG(Cs) and fabA6(Ts)
1764904_s_at	0.0186	5.70e-06	18.2	4.43751	2.54	sulA	SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor///SOS cell division inhibitor
1761763_s_at	0.0186	1.54e-05	15.0	3.73568	2.16	recN	recombination and repair protein///recombination and repair protein///recombination and repair protein///recombination and repair protein

Following these formatting steps, the data frame is primed for furtheranalysis.

Preparing Input Data for the ORA

Target Set

A vector containing the gene symbols of significantly overexpressed(adjustedp-value < 0.05) genes with greater than 2 fold-change(logFC > 1).

target_set<-geo2r_result_tab %>%# Filtering for adjusted p-value < 0.05 and logFC > 1  filter(adj.P.Val<0.05&logFC>1) %>%# Selecting the Gene.symbol column  select(Gene.symbol) %>%# Converting the tibble to a vector  pull() %>%# Removing duplicates  unique()

The first 10 elements of the target set:

target_set %>%   head(10)#>  [1] "gnsB"    "sulA"    "recN"    "c4435"   "dinI"    "c2757"   "c1431"#>  [8] "gabP"    "recA"    "ECs5456"

The number of genes in the target set:

target_set %>%   length()#> [1] 241

Background Set

A vector containing the gene symbols of all genes were included in thedifferential expression analysis.

background_set<-geo2r_result_tab %>%# Selecting the Gene.symbol column  select(Gene.symbol) %>%# Converting the tibble to a vector  pull() %>%# Removing duplicates  unique()

The number of genes in the background set:

background_set %>%   length()#> [1] 7381

Save the target and the background set vectors to text file:

# Save taget set to text filetarget_set %>%   writeLines("target_set.txt")# Save background set to text filebackground_set %>%   writeLines("inst/extdata/background_set.txt")

Preparing Input Data for the GSEA

# If there are duplicated Gene.symbols keep the first one onlyordered_set<-geo2r_result_tab %>%# Grouping by Gene.symbol to be able to filter  group_by(Gene.symbol) %>%# Keeping the first row for each Gene.symbol from rows with the same#     Gene.symbol  filter(row_number()==1) %>%# Ungrouping  ungroup() %>%# Arranging by logFC in descending order  arrange(desc(logFC)) %>%  select(Gene.symbol,logFC)

The number of gene symbols in theordered_set vector:

ordered_set %>%   nrow()#> [1] 7381

Save the ordered set data frame to tab delimited file:

# Save ordered set to text fileordered_set %>%   write_tsv("ordered_set.tsv")

Session Info

sessionInfo()#> R version 4.4.2 (2024-10-31)#> Platform: x86_64-pc-linux-gnu#> Running under: Ubuntu 22.04.5 LTS#>#> Matrix products: default#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0#>#> locale:#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C#>  [3] LC_TIME=hu_HU.UTF-8        LC_COLLATE=en_US.UTF-8#>  [5] LC_MONETARY=hu_HU.UTF-8    LC_MESSAGES=en_US.UTF-8#>  [7] LC_PAPER=hu_HU.UTF-8       LC_NAME=C#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C#> [11] LC_MEASUREMENT=hu_HU.UTF-8 LC_IDENTIFICATION=C#>#> time zone: Europe/Budapest#> tzcode source: system (glibc)#>#> attached base packages:#> [1] stats     graphics  grDevices utils     datasets  methods   base#>#> other attached packages:#>  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4#>  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1#>  [9] ggplot2_3.5.1   tidyverse_2.0.0 mulea_1.0.2#>#> loaded via a namespace (and not attached):#>  [1] fastmatch_1.1-4     gtable_0.3.5        xfun_0.47#>  [4] ggrepel_0.9.6       lattice_0.22-6      tzdb_0.4.0#>  [7] vctrs_0.6.5         tools_4.4.2         generics_0.1.3#> [10] curl_5.2.2          parallel_4.4.2      fansi_1.0.6#> [13] highr_0.11          pkgconfig_2.0.3     Matrix_1.7-0#> [16] data.table_1.16.0   lifecycle_1.0.4     compiler_4.4.2#> [19] farver_2.1.2        munsell_0.5.1       ggforce_0.4.2#> [22] fgsea_1.30.0        graphlayouts_1.2.0  codetools_0.2-20#> [25] htmltools_0.5.8.1   yaml_2.3.10         crayon_1.5.3#> [28] pillar_1.9.0        MASS_7.3-61         BiocParallel_1.38.0#> [31] cachem_1.1.0        viridis_0.6.5       tidyselect_1.2.1#> [34] digest_0.6.37       stringi_1.8.4       labeling_0.4.3#> [37] cowplot_1.1.3       polyclip_1.10-7     fastmap_1.2.0#> [40] grid_4.4.2          colorspace_2.1-1    cli_3.6.3#> [43] magrittr_2.0.3      ggraph_2.2.1        tidygraph_1.3.1#> [46] utf8_1.2.4          withr_3.0.1         scales_1.3.0#> [49] bit64_4.0.5         timechange_0.3.0    rmarkdown_2.28#> [52] bit_4.0.5           igraph_2.0.3        gridExtra_2.3#> [55] hms_1.1.3           memoise_2.0.1       evaluate_0.24.0#> [58] knitr_1.48          viridisLite_0.4.2   rlang_1.1.4#> [61] Rcpp_1.0.13         glue_1.7.0          tweenr_2.0.3#> [64] vroom_1.6.5         rstudioapi_0.16.0   R6_2.5.1#> [67] plyr_1.8.9

How to Cite the`mulea` Package?

To cite packagemulea in publications use:

Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, ErvinTasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, andEszter Ari. 2024. “mulea: An R Package for Enrichment Analysis UsingMultiple Ontologies and Empirical False Discovery Rate.”BMCBioinformatics 25 (1): 334.https://doi.org/10.1186/s12859-024-05948-7.

References

Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, MaximN. Artyomov, and Alexey Sergushichev. 2021. “Fast Gene Set EnrichmentAnalysis.”bioRxiv, February.https://doi.org/10.1101/060012.

Kuleshov, Maxim V., Matthew R. Jones, Andrew D. Rouillard, Nicolas F.Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev, et al. 2016.“Enrichr: A Comprehensive Gene Set Enrichment Analysis Web Server 2016Update.”Nucleic Acids Research 44 (W1): W90–97.https://doi.org/10.1093/nar/gkw377.

Méhi, Orsolya, Balázs Bogos, Bálint Csörgő, Ferenc Pál, Ákos Nyerges,Balázs Papp, and Csaba Pál. 2014. “Perturbation of Iron HomeostasisPromotes the Evolution of Antibiotic Resistance.”Molecular Biology andEvolution 31 (10): 2793–2804.https://doi.org/10.1093/molbev/msu223.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee,Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005.“Gene Set Enrichment Analysis: A Knowledge-Based Approach forInterpreting Genome-Wide Expression Profiles.”Proceedings of theNational Academy of Sciences 102 (43): 15545–50.https://doi.org/10.1073/pnas.0506580102.

Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, ErvinTasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, andEszter Ari. 2024. “mulea: An R Package forEnrichment Analysis Using Multiple Ontologies and Empirical FalseDiscovery Rate.”BMC Bioinformatics 25 (1): 334.https://doi.org/10.1186/s12859-024-05948-7.

Footnotes

The format of the actually used ontology slightly deviates fromstandard GMT files. Intf_ontology, both theontology_id andontology_name columns containgene symbols of the transcriptionfactors, unlike other ontologies such as GO, where these columnshold specific identifiers and corresponding names.↩

About

mulea - an R package for enrichment analysis using various ontologies and empirical false discovery rate

doi.org/10.1186/s12859-024-05948-7

Releases4

mulea v1.1.1 Latest

Nov 19, 2024

+ 3 releases

Packages

No packages published

Languages

R89.8%
C++7.6%
TeX2.6%

Movatterモバイル変換

ELTEbioinformatics/mulea