Movatterモバイル変換


[0]ホーム

URL:


netgsa: Network-based Gene SetAnalysis

Jing Ma, Michael Hellstern

2025-07-30

Introduction

In this vignette, we demonstrate the NetGSA workflow using a subsetof the breast cancer data example downloaded from the Cancer GenomeAtlas(TCGA 2012). In particular, weillustrate how to incorporate existing network information (e.g. curatededges from KEGG) to improve the power of pathway enrichment analysiswith NetGSA. Details of the method are avaialble inMa, Shojaie, and Michailidis (2016).

netgsa provides simple functions that automaticallysearch known gene databases usinggraphite for networkinformation and integrate seamlessly withNetGSA pathwayenrichment andigraph and Cytoscape plotting. In thisvignette, we demonstrate the NetGSA workflow explaining proper datatypes and usage of thenetgsa package.

Data: Breast cancer study (2012)

Our example data set comes from a breast cancer study(TCGA 2012), which consists of gene expressiondata from 520 subjects including 117 estrogen-receptor-negative (ER-)and 403 estrogen-receptor-positive (ER+). The goal is to generatediagnostic pathway signatures that could distinguish patients withdifferent ER statuses by comparing gene expression data from the twoclasses. These signatures could provide additional clinical benefit indiagnosing breast cancer.

Step 1: Data set-up

NetGSA works directly with the expression data matrix. When loadingthe data, it is important to check the distribution of raw sequencingreads and perform log transformation if necessary. Data in this examplewere already log transformed. Rows of the data matrix correspond togenes and columns to subjects. Genes in this data matrix were labeledwith Entrez IDs, same as those used in KEGG pathways.

library(netgsa)library(graphite)library(data.table)data("breastcancer2012_subset")ls()
## [1] "edgelist"     "group"        "nonedgelist"  "pathways"     "pathways_mat"## [6] "x"

The objects loaded which are necessary for this vignette are:

Data matrix,x

The data matrix,x must have rownames that are named as"GENE_ID:GENE_VALUE". Sincenetgsa allows theuser to search for edges in multiple databases ingraphite,each of which may use different identifiers (e.g. KEGG, Reactome,Biocarta, etc.),netgsa needs to know the identifier for agiven gene so we can convert it properly. For example, if you have twogenes “ENTREZID: 7186” and “ENTREZID: 329” and want to search forpotential edges between these two within Reactome,netgsamust first convert these genes to their “UNIPROT” IDs and use those tosearch for edges.

netgsa uses theAnnotationDbi package toconvert genes. The valid list of"GENE_ID"’s is:

AnnotationDbi::keytypes(org.Hs.eg.db::org.Hs.eg.db)
##
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    ## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         ## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        ## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      ## [26] "UNIPROT"

An example of what the rownames for the data matrix,x,should look like is:

head(rownames(x))
## [1] "ENTREZID:127550" "ENTREZID:53947"  "ENTREZID:65985"  "ENTREZID:51166" ## [5] "ENTREZID:15"     "ENTREZID:60496"

The columns of the data matrix do not need to be named, but it isuseful for keeping track of groups.

Group vector

Thegroup object is a vector (numeric or character) andmust be the same length asncol(x). Each element of thegroup tells us which group each column ofxcorresponds to. For example, ifgroup[1] = "Positive" thissays that the first column ofx corresponds to the"Positive" condition.

We can find out the ER status by looking at the group labels.

table(group)
## group##   1   2 ## 403 117

Edgelist / Non-edgelist

The edgelist and non-edgelist are strings representing file locationsand are read in usingdata.table’sfread()command. These are where users can specify known edges/non-edges oftheir own. Each observation is assumed to be a directed edge (foredgelist) or a directed non-edge (for non-edgelist). They both must have4 columns in the following order. The columns do not necessarily need tobe named properly, they simply must be in this specific order:

  • 1st column - Source gene (base_gene_src), e.g. “7534””
  • 2nd column - Gene identifier of the source gene (base_id_src),e.g. “ENTREZID”
  • 3rd column - Destination gene (base_gene_dest), e.g. “8607”
  • 4th column - Gene identifier of the destination gene (base_id_dest)e.g. “UNIPROT”

Note it is assumed that each edge/non-edge is directed so if you wantan undirected edge/non-edge you should put in two observations asin:

##    base_gene_src base_id_src base_gene_dest base_id_dest##           <char>      <char>         <char>       <char>## 1:          7534    ENTREZID           8607     ENTREZID## 2:          8607    ENTREZID           7534     ENTREZID

Pathway matrix

Lastly, as documented in?NetGSA we need an indicatormatrix of pathways. The rows of this matrix should correspond topathways and the columns to the genes included in the analysis. Thedimension is thennpath xp. Values in this matrixshould be 1 for genes that are in a given pathway and 0 otherwise.

We consider pathways from KEGG(Kanehisa andGoto 2000). KEGG pathways can be accessed in R using thegraphite package.

paths<- graphite::pathways('hsapiens','kegg')paths[[1]]
## "Glycolysis / Gluconeogenesis" pathway## Native ID       = hsa:00010## Database        = KEGG## Species         = hsapiens## Number of nodes = 93## Number of edges = 1128## Retrieved on    = 22-10-2024## URL             = http://www.kegg.jp/kegg-bin/show_pathway?org_name=hsa&mapno=00010
head(nodes(paths[[1]]))
## [1] "ENTREZID:10327" "ENTREZID:124"   "ENTREZID:125"   "ENTREZID:126"  ## [5] "ENTREZID:127"   "ENTREZID:128"

For this vignette, we’ll usepathways_mat:

pathways_mat[1:5,7, drop=FALSE]
##                                                      ENTREZID:18## Adipocytokine signaling pathway                                0## Adrenergic signaling in cardiomyocytes                         0## AGE-RAGE signaling pathway in diabetic complications           0## Alanine, aspartate and glutamate metabolism                    1## AMPK signaling pathway                                         0

The 1 shown indicates that"ENTREZID: 18" belongs to theAlanine, aspartate and glutamate metabolism pathway.

In summary, the data that we’ll need to run our pathway enrichmentanalysis is:

  • the data matrix (x), with rows for genes and columnsfor samples,
  • the group labels (group) for ER status,
  • the path to the file consisting of known edges(edgelist) to be read in withfread(),
  • the path to the file consisting of (a subset of) the non-edges(nonedgelist) to be read in withfread(),
  • pathway identifiers in matrix form (pathways_mat).

Step 2: Pathway enrichment analysis

Now that we have our data set-up properly we can begin with thepathway enrichment analysis.netgsa has three mainfunctions. The first isprepareAdjMat which searches foredges in public databases and uses this information to estimate theadjacency matrices needed forNetGSA. The second isNetGSA which performs network-based gene set analysis. Wewill go into further details about the usage of these functionsbelow.

prepareAdjMat

After having the data set-up, the first step in pathway enrichmentanalysis withnetgsa is to estimate the adjacency matrices.Remember, the rownames of the data matrixX must be namedas"GENE_ID:GENE_VALUE" as in"ENTREZID:7534".Thegroup vector is defined as above.

Thedatabases argument can be either (1) the result ofobtainEdgeList or (2) a character vector defining thedatabases to search. These are essentially the same thing, the onlydifference is that for the character vector method,obtainEdgeList is called insideprepareAdjMatand cannot be saved. SinceprepareAdjMat queries thegraphite databases when it is called andgraphite databases can change overtime, this may not bedesirable for reproducibility. Using theobtainEdgeListmethod, one can save the edgelist to ensure the same network informationis used across iterations or in the future. In both methods, one mustspecify the databases to search. The options are the databases for homospaiens available ingraphite or NDEx (only for developmentversion on Github). The graphite databases are:

## [1] "kegg"         "panther"      "pathbank"     "pharmgkb"     "reactome"    ## [6] "smpdb"        "wikipathways"

Note thedatabases argument is case sensitive so makesure to pass"reactome" and not"Reactome".

Thecluster argument controls whether or not clusteringis used when estimating the adjacency matrix. The default behavior is touse clustering ifp > 2,500. However, the user canoverride this behavior by setting thecluster argument.prepareAdjMat chooses the best clustering method from 6possible methods in theigraph package. More details areprovided in?prepareAdjMat.

User specified edge and non-edge files are specified with thefile_e andfile_ne arguments respectively. Formore information on the assumptions and how network information isincorporated from the database edgelists, the user edges, and the usernon-edges see the Details section andfile_e andfile_ne parameters in?prepareAdjMat.

It is also important to note thatprepareAdjMat willautomatically choose the correct network estimation technique based onwhether or not the graph is directed so no additional work is needed todetermine undirected vs directed graphs. This is documented in?prepareAdjMat.

Now let’s get to some example code. Suppose we wanted to estimate thenetwork for our example data using our known edges/non-edges andsearching for edges in Reactome, KEGG, and BioCarta. Let’s also useclustering to speed up computation. We could do this simply by:

database_search<-obtainEdgeList(rownames(x),c("kegg","reactome"))network_info<-prepareAdjMat(x, group, database_search,cluster =TRUE,file_e ="edgelist.txt",file_ne ="nonedgelist.txt")

The main value of interest isnetwork_info[["Adj"]], alist of lists. The top level list indicates the condition while the nextlevel is a list of adjacency matrices (one for each cluster). Ifcluster = FALSE there is an element for each connectedcomponent in the network. Also note that in this example, the objectdatabase_search was created. If desired, this can be savedand used later to ensure the same network information was used. However,the results may not be exactly the same. For example, some of theclustering algorithms are not deterministic so while the networkinformation may be the same the clusters may be different resulting indifferent adjacency matrices. The adjacency matrix for the firstcondition and first cluster can be accessed with:

network_info[["Adj"]][[1]][[1]][7:9,7:9]
##             ENTREZID:18 ENTREZID:27 ENTREZID:28## ENTREZID:18           0           0           0## ENTREZID:27           0           0           0## ENTREZID:28           0           0           0

The total number of clusters can be identified with:

length(network_info[["Adj"]][[1]])
## [1] 16

Note we cannot control cluster size which is why we try severalmethods and implement rules on which to choose. For more information see?prepareAdjMat.

NetGSA

Now that the adjacency matrices are assembled we can perform pathwayinference usingNetGSA. The returned value fromprepareAdjMat will always be in the correct formatregardless of whether or not clustering is used, so one should always beable to passnetwork_info[["Adj"]] directly toNetGSA. The default is to use restricted Haseman-Elstonregression (REHE) with sampling to estimate the variance parameters.This allows for significant computational speed up and little to no lossin power. One can also use REML for variance parameter estimation butthis can be quite slow for problems with moderate or large dimension.For explanatory purposes, we explicitly write out the function:

pathway_tests_rehe<-NetGSA(network_info[["Adj"]], x, group, pathways_mat,lklMethod ="REHE",sampling =TRUE,sample_n =0.25,sample_p =0.25)

Thesample_n andsample_p parameterscontrol the ratio for subsampling observations (columns of data matrixx) and variables (rows of data matrixx)respectively. More information on REHE and sampling can be found in the?NetGSA Details section.

Note that inference using REML can take several hours depending onthe complexity of the problem, while inference using REHE can takeminutes. Roughly, REML becomes quite slow for p > 2,000. In thisexample, withsample_n = 0.25,sample_p = 0.25only took about 2 minutes on a 2 CPU personal computer with 16 GB ofRAM.

The main inference results are stored inpathway_tests[["results"]] and contain the pathway names aswell as their significance (p-values and q-values are reported) and teststatistics.

Step 3: Visualization

netgsa provides several options for visualizing theresults fromNetGSA. Cytoscape or igraph is used to displaythe main results depending on whether the user has the Cytoscape appopen.

Cytoscape

If the user has Cytoscape open on their computer, callingplot.NetGSA will create several plots:

  1. Cytoscape plots - the first placeplot.NetGSA generates plots is within Cytoscape. A nestednetwork is created where the main network (Pathway Network) displayspathways as nodes. Edges in this network represent edges between genescontained within those pathways. In this network, theresults object fromNetGSA() is loaded as thenode data and the number of edges between genes of separate pathways areloaded as edge data in a variable called “weight”. Node color is mappedto the test statistic. Negative values are orange, values around 0 arewhite, and positive values are blue. Node border color is mapped toq-values going from red (0) to white (1).

    In addition to the main network, a network is generated for eachpathway. These are the gene networks underlying each pathway and arereferred to as the within pathway networks. The within pathway networksare linked to the nodes in the Pathway Network using Cytoscape’s nestednetwork format. An example will make this more clear. Calling:

    plot.NetGSA(pathway_tests_rehe)

    The Pathway Network might look something like:

    The most interesting part of the Cytoscape nested network frameworkis the ability to view the within pathway networks. Suppose you werelooking at the Pathway Network and noticed theErbB signalingnetwork was highly significant and you wanted to investigate it bylooking at the underlying gene network. To do this, one could simplyright click on theErbB signaling pathway node in the PathwayNetwork, and go to Nested Networks -> Go To Nested Network.Alternatively, the nested networks have the same name as the pathwaythey represent so one could simply navigate to theErbB signalingpathway using theNetwork tab in theControlPanel on the left side of the Cytoscape GUI. To save time, thenested networks are not formatted. However, we can use theformatPathways function to format one or multiple using thedefault style.

    formatPathways(pathway_tests_rehe,"ErbB signaling pathway")

    TheformatPathways function colors gene nodes based onFDR adjusted q-value. For more information see?formatPathways.

    To export images from Cytoscape one can either use the GUI or theRCy3::exportImage() function.

  2. Cytoscape legend generated in R - Legends aredifficult to incorporate on the plot within Cytoscape so we generate alegend in R. Alternatively it is also possible to generate an imagedirectly in Cytoscape. The Cytoscape manual page has more information onthat:http://manual.cytoscape.org/en/stable/Styles.html.

  3. Igraph plot - Lastly, for users that prefer R,plot.NetGSA will also generate a plot usingigraph with the same layout and node color/node bordercolor mapping as in Cytoscape.

    The Cytoscape plotting section is wrapped up with a few notes oncustomization. Whileplot.NetGSA has a default layout, acustom layout can be provided to bothplot.NetGSA andformatPathways using thegraph_layoutparameter. For Cytoscape, this parameter is passed as a string toRCy3::layoutNetwork so it may be helpful to read thatdocumentation. Alternatively, one does not even need to use thisparameter. They can set up a custom layout using the Cytoscape GUI orcan even use theRCy3::layoutNetwork function directly. Forexample, callinglayoutNetwork("degree-circle") will changethe layout of the current network selected in Cytoscape.

    # Format the "Neurotrophin signaling pathway" using the "degree-circle" layoutformatPathways(pathway_tests_rehe,"Neurotrophin signaling pathway",graph_layout ="degree-circle")

    Lastly, for users who want even more customization, theresults object returned fromNetGSA() and theedge weight between pathways are loaded into Cytoscape. So to size edgesbased on edge weight one could:

    RCy3::setCurrentNetwork("Pathway Network")edge_weights<- RCy3::getTableColumns(table ="edge",columns ="weight")RCy3::setEdgeLineWidthMapping("weight",c(min(edge_weights),max(edge_weights)),c(1,5))

Igraph

If the user does not have Cytoscape open, a 3-D igraph plot iscreated usingigraph::rglplot. Due to the nature of theigraph::rglplot function we only map test statistics tonode color. We are not able to map colors to both the node and nodeborder.

plot(pathway_tests_rehe)

Again, a custom layout can be specified with thegraph_layout parameter. As documented in?plot.NetGSA this must be a function that takes oneparameter which is passed to theigraph::rglplot function.For example to layout a graph randomly we can do:

layout_fun<-function(ig) igraph::layout_randomly(ig)plot(pathway_tests_rehe,graph_layout = layout_fun)

While igraph does not provide a similar nested network format asCytoscape, we can use thezoomPathway function to look atthe gene interactions within a pathway:

zoomPathway(pathway_tests_rehe,"ErbB signaling pathway")

References

Kanehisa, Minoru, and Susumu Goto. 2000.“KEGG: Kyoto Encyclopediaof Genes and Genomes.”Nucleic Acids Research 28 (1):27–30.
Ma, Jing, Ali Shojaie, and George Michailidis. 2016.“Network-Based Pathway Enrichment Analysis with Incomplete NetworkInformation.”Bioinformatics 32 (20): 3165–74.
TCGA. 2012.“Comprehensive Molecular Portraits of Human BreastTumours.”Nature 490 (7418): 61.

[8]ページ先頭

©2009-2025 Movatter.jp