Associate Editor: Alfonso Valencia
Motivation: Network reconstruction of biological entities is very important for understanding biological processes and the organizational principles of biological systems. This work focuses on integrating the literature with microarray gene-expression data, and a combined literature mining and microarray analysis (LMMA) approach is developed to construct gene networks of a specific biological system.
Results: In the LMMA approach, a global network is first constructed using the literature-based co-occurrence method. It is then refined using microarray data through a multivariate selection procedure. An application of LMMA to angiogenesis is presented. Our results show that the LMMA-based network is more reliable than the co-occurrence-based network when evaluated at the multiple levels of KEGG Gene, KEGG Orthology and pathway.
Availability: The LMMA program is available upon request.
Contact: [email protected]
Supplementary Information: Supplementary data are available at Bioinformatics online.
Reconstructing networks of biological entities such as genes, transcription factors, proteins, compounds and other regulatory molecules is very important for understanding biological processes and the organizational principles of biological systems (Barabasi and Oltvai, 2004). Rapid progress in the biomedical domain has produced an enormous amount of biomedical literature. Along with the rapid growth of biomedical research, literature mining (LM) has become a promising direction for knowledge discovery. Various techniques have been developed that make it possible to reveal putative biological networks hidden in the huge collection of individual publications (Shatkay and Feldman, 2003). Among them, the co-occurrence (co-citation) approach (Stapley and Benoit, 2000; Jenssen et al., 2001) is the simplest and most comprehensive in implementation. It can also be easily adapted to find associations between biological entities, such as gene relations (Jenssen et al., 2001) and chemical compound–gene relations (Zhu et al., 2005). In our earlier study (Zhang and Li, 2004), a subject-oriented literature mining technique was developed to extract subject-specific knowledge by incorporating prior knowledge from biologists. Such an approach can retrieve information contained in abundant literature regardless of individual experimental conditions. However, the literature-derived network is relatively crude and redundant. The co-occurrence approach lacks a realistic analysis of the various types of relations, since the literature is a collection of results from diverse investigations. Moreover, networks constructed from the literature are usually not specific to a certain biological process and may unavoidably include overlapping relations, resulting in large and densely connected networks that lack significant biological meaning.
Network reconstruction from high-throughput microarray data has been another active area over the past decade (van Someren et al., 2002; de Jong, 2002). Microarray technology, which documents large-scale gene expression profiles, allows the states of a specific biological system to be characterized and provides a powerful platform for assessing global gene regulation and gene function. So far, a number of methods are available for reconstructing gene networks from microarray data, such as deterministic Boolean networks (Liang et al., 1998) and ordinary differential equations (Zak et al., 2003). However, it is difficult to build a reliable network from a small number of array samples owing to the non-uniform distribution of gene expression levels among thousands of genes. Such a technique is also insufficient for detailed biological investigation when prior knowledge is absent (Le Phillip et al., 2004).
Both literature-based and microarray-based approaches share the common goal of identifying the hidden networks of biological entities. Integrating the experimental data and the literature knowledge in an iterative fashion seems to be an effective way of modeling biological networks (Le Phillip et al., 2004). Various approaches have been developed to identify gene clusters and accompanying literature topics (Küffner et al., 2005), and to model biological processes such as neuro-endocrine-immune interactions (Wu and Li, 2005). In this article, we propose a novel approach to reconstruct gene networks by combining literature mining and microarray analysis (LMMA), where a global network is first derived using the literature-based co-occurrence method and then refined using microarray data. The LMMA approach is applied to build an angiogenesis network. The network and its corresponding biological meaning are evaluated at the multiple levels of KEGG Gene, KEGG Orthology and pathway. The results show that the LMMA-based network is more reliable and manageable, with more significant biological content, than the LM-based network.
The first step in the LMMA approach is to derive a co-occurrence dataset through literature mining. To find co-citations, a pool of articles and a dictionary containing gene symbols and their synonyms are required. In the LMMA approach, the literature information is mainly obtained from the National Library of Medicine's PubMed database. In a general LM approach, subject-specific interactions cannot be highlighted since all interactions tend to have similar co-citation numbers (Zhang and Li, 2004). We therefore prepare the candidate articles and/or the dictionary of biological entities in two stages to incorporate prior knowledge. First, using a keyword referring to a certain subject, we select only the PubMed records that contain terms of that biological subject. Next, an authoritative, standard or specific glossary is employed to provide a context for building topic-related gene networks. In the present work, we use the HUGO (Human Genome Organisation) glossary, which contains ~20 000 non-redundant gene symbols, for literature mining.
We perform LM under the assumption, shared with many existing LM systems, that when two genes are co-cited in the same text unit there is a potential biological relationship between them (Stapley and Benoit, 2000; Jenssen et al., 2001). Using the sentence as the text unit has been found to give a good trade-off between precision and recall (Ding et al., 2002). Accordingly, as described in our previous work (Zhang and Li, 2004), we regard two HUGO gene symbols as co-related if they are co-cited in the same sentence. In the HUGO glossary, one gene corresponds to a unique symbol (a one-to-one short mnemonic representation of the gene name) with several aliases. We treat every <alias, symbol> pair as a <key, value> pair and store it in a hash table, so that co-occurrences of aliases are counted as co-occurrences of the corresponding symbols. Matching is case-sensitive and only complete words are considered. Next, we count the co-occurrence number of all symbol pairs to form an LM-based network for a specific subject.
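The sentence-level co-citation counting described above can be sketched as follows. This is only an illustration under stated assumptions: the toy glossary, the example sentences, the tokenizing regular expression and the function names `build_alias_table` and `cooccurrence_counts` are hypothetical and are not taken from the LMMA program.

```python
import re
from collections import Counter
from itertools import combinations


def build_alias_table(glossary):
    """Map every alias (and the symbol itself) to its official symbol."""
    table = {}
    for symbol, aliases in glossary.items():
        table[symbol] = symbol
        for alias in aliases:
            table[alias] = symbol
    return table


def cooccurrence_counts(sentences, alias_table):
    """Count, per symbol pair, the sentences in which both symbols are cited."""
    counts = Counter()
    for sentence in sentences:
        # Complete words only; the lookup below is case-sensitive.
        tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
        symbols = {alias_table[t] for t in tokens if t in alias_table}
        for pair in combinations(sorted(symbols), 2):
            counts[pair] += 1
    return counts


# Hypothetical glossary excerpt and sentences, for illustration only.
glossary = {"VEGF": ["VEGFA", "VPF"], "KDR": ["FLK1", "VEGFR2"], "EGF": []}
sentences = [
    "VPF binds FLK1 on endothelial cells.",
    "VEGF and KDR expression increased, whereas EGF did not change.",
    "The egf pathway was not examined.",
]
counts = cooccurrence_counts(sentences, build_alias_table(glossary))
```

In this toy input the alias pair VPF/FLK1 is counted as a VEGF/KDR co-occurrence, and the lowercase "egf" in the last sentence is ignored because matching is case-sensitive.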
Microarray datasets related to a biological process are collected from experiments or from public repositories such as SMD (Stanford Microarray Database), which stores a large volume of raw and normalized data from public microarray experiments (Sherlock et al., 2001). The downloaded microarray data are pre-processed following SMD procedures. In the Gene Filtering Options step, we select ‘center data for each array by mean’. Meanwhile, a K-nearest neighbors method (Troyanskaya et al., 2001) is used to estimate the missing values in the microarray datasets. Briefly, for a missing value $x_{ij}$ of gene $x_i$ in an observation (i.e. a microarray experiment) $j$, a Pearson correlation analysis is employed to find the K other genes whose expression profiles are most similar to that of $x_i$. The missing value is then estimated as the weighted mean of the corresponding K genes.
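The missing-value step can be illustrated with a minimal sketch. Two choices here are assumptions made for the example and are not stated in the text: the Pearson correlation itself is used as the weight in the weighted mean, and only positively correlated genes are kept. The names `pearson` and `knn_impute` and the toy matrix are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation over the positions where both profiles have values."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 3:
        return 0.0
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mu) * (b - mv) for a, b in pairs)
    su = math.sqrt(sum((a - mu) ** 2 for a, _ in pairs))
    sv = math.sqrt(sum((b - mv) ** 2 for _, b in pairs))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def knn_impute(data, i, j, k=2):
    """Estimate the missing value data[i][j] from the k most similar genes."""
    candidates = []
    for g, row in enumerate(data):
        if g != i and row[j] is not None:
            candidates.append((pearson(data[i], row), row[j]))
    # Keep the k genes with the highest correlation (assumed: positive only).
    top = [(w, x) for w, x in sorted(candidates, reverse=True)[:k] if w > 0]
    if not top:
        return None
    # Correlation-weighted mean of the neighbors' values (assumed weighting).
    return sum(w * x for w, x in top) / sum(w for w, _ in top)


# Rows are genes, columns are microarray experiments; None marks a missing value.
data = [
    [1.0, 2.0, 3.0, None],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 2.0, 1.0, 0.0],
]
estimate = knn_impute(data, i=0, j=3, k=2)
```

Here the two perfectly correlated genes contribute equally, so the estimate is the mean of their values in the fourth experiment; the anti-correlated gene is not selected.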
An LMMA-based network is constructed by recombining all the refined sub-networks after the multivariate selection. For a specific interaction between $x_i$ and $x_j$, there are two regression coefficients: one obtained by regressing $x_i$ on $x_j$ and the other by regressing $x_j$ on $x_i$. The one with the smaller P-value is used in LMMA. Note that the directionality of the LMMA network is currently not considered.
Third, we employ a leave-one-out cross-validation (LOOCV) approach (Lachenbruch and Mickey, 1968) to evaluate the goodness of fit of both the LM and LMMA networks. In LOOCV, when observation (i.e. microarray experiment) $j$ is omitted for gene $i$ and its neighbors, gene $1(i)$, gene $2(i)$, …, gene $l(i)$, a new linear network is constructed from the remaining observations, $x_{-j}^{(i)}$ for gene $i$ together with the matching observations of its neighbors. The omitted $x_j^{(i)}$ is then recovered from the corresponding observations of the neighboring genes.
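A minimal sketch of the leave-one-out procedure for one unit-network is given below. It assumes an ordinary least-squares linear model without an intercept (plausible for mean-centered data, but an assumption here), solved through the normal equations; the function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(X, y):
    """Least-squares coefficients from the normal equations (no intercept)."""
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * t for row, t in zip(X, y)) for a in range(p)]
    return solve(xtx, xty)


def loocv_predictions(X, y):
    """Predict each observation of the target gene from a model fitted without it."""
    preds = []
    for j in range(len(y)):
        beta = fit_linear(X[:j] + X[j + 1:], y[:j] + y[j + 1:])
        preds.append(sum(b * v for b, v in zip(beta, X[j])))
    return preds


def mse(y, preds):
    """Mean squared error between observed and cross-validated values."""
    return sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y)


# Rows are experiments, columns are the neighbor genes of the target gene.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
y = [2 * a - b for a, b in X]  # toy target that is exactly linear in its neighbors
preds = loocv_predictions(X, y)
error = mse(y, preds)
```

Because the toy target is an exact linear combination of its neighbors, the cross-validated error is numerically zero; with real expression data the error reflects how well the retained neighbors explain the gene.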
Pathway information is essential for successful quantitative modeling of biological systems (Cary et al., 2005). A well-known pathway database that provides information on metabolic, regulatory and disease pathways is KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa and Goto, 2000). The relationships recorded in the KEGG database are organized around the concept of KEGG Orthology (KO), a classification of orthologous genes that links directly to known pathways defined by KEGG. The KO dataset is a single complex flat file containing entries for all of the KO functional terms (the leaf nodes at the fourth level of the KO hierarchy). For more details about KO, refer to Mao et al. (2005).
In order to gain further insight into the biological meaning of our networks, we map the LM- and LMMA-based networks to the KEGG pathway database. First, we extract the KO hierarchy and the known associations between genes and their corresponding KO functional terms from the KO dataset. Second, we extract all the annotated genes from the KEGG Genes (KG) dataset. Both the KO hierarchical and the KG hierarchical relations are employed as benchmarks to validate the interactions in the networks. Here, a true positive (TP) is an entry that is identified in our networks and also occurs in the dataset; a false positive (FP) is an entry that is identified in our networks but does not occur in the dataset; a true negative (TN) is an entry that is neither identified in our networks nor present in the dataset; and a false negative (FN) is an entry that is not identified in our networks but occurs in the dataset. We consider only KO/KG connections as entries when defining TP, FP, TN and FN. The precision, $p$, and the recall, $r$, of a network are defined as $p = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ and $r = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$, respectively. To assess the effect of LMMA on the precision of the predicted relations, Fisher's exact test, run with online software, is used to calculate the exact P-value for comparing the proportions of TP (FP) between the LMMA and LM networks.
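The evaluation measures can be sketched in a few lines. The precision example uses the KG counts for the LM network (TP = 237, FP = 1048) and for LMMA-EC at Thp = 0.150 (TP = 98, FP = 349) from Table 3. The Fisher routine is a generic two-sided exact test on a 2 x 2 table, written with the standard library only; the online software used in this work is not specified, so this sketch is not expected to reproduce the reported P-values, and the counts passed to it below are a toy example.

```python
from math import comb


def precision(tp, fp):
    """Fraction of predicted connections that occur in the benchmark."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of benchmark connections that are predicted."""
    return tp / (tp + fn)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


p_lm = precision(237, 1048)      # KG benchmark, LM network (Table 3)
p_lmma_ec = precision(98, 349)   # KG benchmark, LMMA-EC at Thp = 0.150 (Table 3)
p_toy = fisher_exact_two_sided(3, 0, 0, 3)  # toy table, not data from this study
```

With the Table 3 counts, the precision against KG rises from about 0.184 for LM to about 0.219 for LMMA-EC at this threshold. Recall is not evaluated here because the FN counts are not listed in Table 3.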
Moreover, we group the nodes and connections in the LMMA network according to KEGG pathway definitions. Here, the Fisher's Exact Test for KEGG pathway identification described in DAVID (the Database for Annotation, Visualization and Integrated Discovery) (Dennis et al., 2003) is employed. We perform the KEGG pathway extraction for the LMMA network by statistically evaluating the gene enrichment in the network against random chance. Fisher's Exact Test is adopted to determine whether the proportion of genes of the LMMA network in a KEGG pathway is significantly higher than that for the human genomic background genes.
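The enrichment question above can be illustrated with the upper tail of the hypergeometric distribution, which is the one-sided form of Fisher's exact test. This is a sketch only: DAVID's own scoring may differ in detail, and the function name `enrichment_p_value`, its parameter names and the toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k): chance that at least k of n network genes fall in a pathway
    containing K of the N background genes, under random sampling."""
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers: 3 network genes, all in a 3-gene pathway, 10 background genes.
p_enriched = enrichment_p_value(k=3, n=3, K=3, N=10)
```

A small P-value indicates that the pathway is represented in the network more often than expected from the genomic background.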
Angiogenesis is the process of generating new capillary blood vessels, and a key issue for various disorders especially for a variety of solid tumors, vascular and rheumatoid diseases (Folkman, 1995). Few other processes have such a significant impact as angiogenesis on so many people worldwide. So far, the underlying biological rules of angiogenesis remain unclear. It is therefore critical to understand the molecular basis and biological pathways of angiogenesis (Carmeliet, 2003).
We reconstructed angiogenesis-oriented networks using both the LM and LMMA approaches. First, we collect all the angiogenesis-related PubMed abstracts (up to July 24, 2005) using ‘angiogenesis’ as a keyword. A total of 23 497 ‘angiogenesis’-related PubMed abstracts are indexed automatically. By matching the HUGO glossary against this abstract pool, we obtain 1929 angiogenesis-related genes. A total of 9514 co-citations among these genes are extracted to construct the co-occurrence-based angiogenesis network. We construct the LM-based network with a co-occurrence number of at least 1, which yields the network with the maximum number of gene interactions.
Next, we select the gene expression profiles of endothelial cells (EC) and solid tumors (ST) from SMD. It is believed that EC are responsible for the generation of blood vessels and that ST account for the majority of angiogenesis-dependent diseases (Carmeliet, 2003; Folkman, 1995). The EC microarray dataset contains 44 639 genes and 53 experiments, while the ST microarray dataset contains 39 726 genes and 119 experiments. The largest connected gene network in LM whose genes are identified in the EC microarray dataset is called the LM-EC network (1257 genes and 6761 connections). Similarly, the largest connected gene network in LM whose genes are identified in the ST microarray dataset is called the LM-ST network (1258 genes and 6884 connections). Accordingly, two LMMA-based angiogenesis networks, LMMA-EC and LMMA-ST, are built (Table 1). Using the common genes as the baseline, we compare the LM-EC and LM-ST networks with their corresponding LMMA-EC and LMMA-ST networks, respectively.
Parameter | LM-EC | LMMA-EC | LM-ST | LMMA-ST
---|---|---|---|---
Common nodes^a | 1257 | 1031 | 1258 | 1162
Connections^a | 6761 | 2848 | 6884 | 3935
Average path length^a | 2.9810 | 3.6101 | 2.9741 | 3.3487
Average degree^b | 5.3738 | 2.2777 | 5.4722 | 3.1375
SSE^c | 522.3206 | 380.1941 | 520.2295 | 479.0745
SSmse^c | 0.0669 | 0.057 | 0.0614 | 0.0589
Microarray size | 1257 × 53 | 1257 × 53 | 1258 × 119 | 1258 × 119
^a In the largest connected sub-network.
^b In the whole network.
^c All nodes except for the isolated ones.
Table 1 lists the network parameters for the LM- and LMMA-based angiogenesis networks. It shows that redundant connections are eliminated after multivariate selection. The LMMA-EC and LMMA-ST networks have far fewer connections than the predominant sub-networks of LM-EC and LM-ST, respectively. The elimination of connections results in a marked decrease in the average degree of genes, a smaller reduction in node number and, according to Table 1, a modest increase in average path length. Moreover, as shown in Figure 1a and b, when compared with the LM-random filtering networks derived from the permutation test, the LMMA network has not only a significantly larger cluster size (P < 0.0001, by Kolmogorov–Smirnov test) but also a smaller path length in the largest cluster (P < 0.001 by t-test). The results suggest that the LMMA network is more stable and integrative than the LM-random filtering networks. Similar performance is observed with the LMMA-ST network (Supplementary Fig. S1). Thus, LMMA seems to maintain the backbone of the LM-based angiogenesis network.
(a) Comparison of the cluster sizes between the LMMA-EC network and the LM-EC random filtering networks (P < 0.0001, by Kolmogorov–Smirnov test). Other clusters have <10 nodes (data not shown). (b) Comparison of the normalized average path length in the largest cluster between the LMMA-EC and the LM-EC random filtering networks (P < 0.001 by t-test). (c) Relationship between the number of nodes and the degree of nodes in the whole LM angiogenesis network, the LMMA-EC network and the LMMA-ST network (Thp = 0.150). The degree distributions of the three networks follow a power law, suggesting a scale-free topology.
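The comparison with random filtering can be sketched as follows. The exact permutation scheme is not detailed in this section, so the sketch assumes that each random network keeps the same number of connections as the refined network, drawn uniformly at random from the LM connections; the function names, the seed and the toy edge list are hypothetical.

```python
import random
from collections import defaultdict


def largest_component_size(edges):
    """Number of nodes in the largest connected cluster of an undirected network."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def random_filter_sizes(edges, n_keep, n_trials=1000, seed=0):
    """Largest-cluster sizes of networks that keep n_keep random connections."""
    rng = random.Random(seed)
    return [largest_component_size(rng.sample(edges, n_keep)) for _ in range(n_trials)]


# Toy LM network; a refined network keeping 2 of its 3 connections is assumed.
lm_edges = [("a", "b"), ("b", "c"), ("d", "e")]
null_sizes = random_filter_sizes(lm_edges, n_keep=2, n_trials=50)
```

The resulting list is the null distribution against which the largest cluster of the refined network would be compared, for example with the Kolmogorov–Smirnov test mentioned in the caption.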
Figure 1c shows the relationship between the number of nodes and the degree of nodes in both the LM- and LMMA-based angiogenesis networks. The profiles follow a power-law distribution, indicating that the topological properties of both networks are scale-free (Jeong et al., 2000; Song et al., 2005). Recent studies (Han et al., 2004; Ozier et al., 2003) show that centrally located, highly connected hub nodes in a scale-free network dominate network operation.
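As a rough illustration of how a power-law profile can be checked, the sketch below tabulates the degree distribution of a network and fits a straight line to it on log-log axes. A log-log least-squares fit is a simple heuristic chosen for this example, not necessarily the procedure used for Figure 1c, and the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(node count) against log(degree)."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(distribution[k]) for k in distribution]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy distribution in which the node count is exactly 1024 * degree ** -2.
slope = loglog_slope({1: 1024, 2: 256, 4: 64, 8: 16})
star = degree_distribution([("hub", "x"), ("hub", "y"), ("hub", "z")])
```

A clearly negative, roughly constant slope across the degree range is what a power-law profile looks like on log-log axes; the slope estimates the negative of the exponent.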
The top 15 hub genes in both the LM-based and LMMA-based angiogenesis networks are listed in Table 2. Vascular endothelial growth factor (VEGF) is identified in both the LM and LMMA networks as the hub gene with the highest degree. VEGF is known to be a multi-functional cytokine that plays an important role in vasculogenesis (Mukhopadhyay and Datta, 2004). The activation of endothelial cells by VEGF sets in motion a series of steps towards the creation of new blood vessels (Folkman, 1995).
The top 15 hub genes identified in LM-based and LMMA-based angiogenesis networks (Thp = 0.150)
Gene | Degree (LM-EC; LM-ST)^a | Degree (LMMA-EC) | Degree (LMMA-ST) | P-value^b (EC) | P-value^b (ST)
---|---|---|---|---|---
VEGF | 554 | 51 | 117 | 0 | 0
NUDT6 | 211 | 51 | 25 | 0 | 0
KDR | 182 | 51 | 117 | 0 | 3.59e-06
SIAT7B | 156 | 51 | 44 | 1.19e-07 | 0
TNF | 149 | 51 | 46 | 0 | 0
IL8 | 148 | 26 | 27 | 0 | 0
MVD | 126 | 19 | 28 | 0 | 0
CD34 | 111 | 51 | 22 | 1.19e-07 | 0
EGF | 104 | 32 | 40 | 1.35e-13 | 0
IL6 | 97 | 31 | 24 | 0 | 0
CDH17 | 96 | 30 | 27 | 0 | 0
HIF1A | 93 | 21 | 38 | 1.65e-12 | 0
SOS1 | 87 | 14 | 25 | 1.54e-11 | 0
CCM1 | 83 | 51 | 14 | 6.92e-06 | 0
PSME3 | 78 | 18 | 34 | 0 | 0
^a The degrees of these hub genes in the LM-EC and LM-ST networks are the same.
^b P-values are calculated from the F-test for the unit-network of each gene (Equation 3).
Table 2 lists the P-values, derived from the F-test, for the unit-networks of the 15 hub genes. We calculate the P-values for the different networks. The results show that the LMMA-based angiogenesis network is more reliable than the LM-based one. Figure 2 illustrates the LOOCV gene expression values for the unit-networks of VEGF, EGF, TNF and IL6, respectively. The MSE values of the LMMA unit-networks are smaller than those of the LM unit-networks, indicating that the LMMA network fits the angiogenesis microarray data better. Meanwhile, Table 1 lists the SSE and SSmse scores resulting from the LM- and LMMA-based networks. The reduced errors in LMMA again suggest an improvement in the LMMA-based networks.
Gene expression values derived from the leave one out cross validation approach for four hub genes VEGF, EGF, TNF and IL6 in both LM-EC and LMMA-EC networks. A total of 53 experiments in EC microarray dataset are tested.
Figure 3a and b shows the precision and the recall rates of both the LM- and LMMA-based angiogenesis networks at different values of the threshold Thp. The LMMA-based network exhibits higher precision and lower recall than the LM-based one. On the other hand, the recall of the LMMA-based network increases gradually as the threshold increases. We select a suitable threshold, Thp = 0.150, for the LMMA-based EC and ST networks.
Comparison of (a) precision and (b) recall in the LM, LMMA-EC and LMMA-ST angiogenesis networks at different thresholds. Here LM represents LM-EC and LM-ST, since the genes in LM-EC and LM-ST are identical when mapped to KEGG. The X axis denotes the P-value thresholds calculated from the F-test in the statistical multivariate selection step. Both the precision and the recall rates are calculated against KEGG.
Both the LM-EC and LM-ST networks have the same 474 genes, corresponding to 355 KO entities covered by the KEGG database. When the LM-based network is refined by LMMA, the proportion of TP increases significantly, while the proportion of FP decreases accordingly. Table 3 shows the statistical results for the LM- and LMMA-based angiogenesis networks, which demonstrate that the LMMA approach significantly reduces the false positive relations.
The true positive (TP), false positive (FP) and the statistical P-values of the TP/FP ratio (by Fisher Exact Test) between the LM and LMMA networks^a
KEGG | Network | Measure / Thp | 0.025 | 0.050 | 0.075 | 0.100 | 0.125 | 0.150 | 0.175 | 0.200
---|---|---|---|---|---|---|---|---|---|---
KG | LM | TP | 237 | 237 | 237 | 237 | 237 | 237 | 237 | 237
KG | LM | FP | 1048 | 1048 | 1048 | 1048 | 1048 | 1048 | 1048 | 1048
KG | LMMA-EC | TP | 39 | 49 | 56 | 76 | 83 | 98 | 108 | 111
KG | LMMA-EC | FP | 121 | 175 | 241 | 267 | 303 | 349 | 417 | 458
KG | TP/FP (LMMA-EC versus LM) | P-value | 0.017004 | 0.034679 | 0.064785 | 0.01837 | 0.02372 | 0.01526 | 0.03012 | 0.04408
KG | LMMA-ST | TP | 71 | 83 | 93 | 101 | 111 | 130 | 137 | 135
KG | LMMA-ST | FP | 223 | 300 | 350 | 392 | 436 | 471 | 499 | 513
KG | TP/FP (LMMA-ST versus LM) | P-value | 0.0056928 | 0.021672 | 0.02762 | 0.032833 | 0.033552 | 0.013196 | 0.013274 | 0.021953
KO | LM | TP | 139 | 139 | 139 | 139 | 139 | 139 | 139 | 139
KO | LM | FP | 170 | 170 | 170 | 170 | 170 | 170 | 170 | 170
KO | LMMA-EC | TP | 29 | 33 | 37 | 52 | 57 | 70 | 74 | 77
KO | LMMA-EC | FP | 19 | 34 | 40 | 44 | 46 | 55 | 60 | 63
KO | TP/FP (LMMA-EC versus LM) | P-value | 0.017279 | 0.087676 | 0.090294 | 0.027146 | 0.017372 | 0.0097982 | 0.011667 | 0.011803
KO | LMMA-ST | TP | 43 | 54 | 62 | 67 | 73 | 86 | 90 | 87
KO | LMMA-ST | FP | 38 | 46 | 52 | 64 | 69 | 80 | 82 | 87
KO | TP/FP (LMMA-ST versus LM) | P-value | 0.042857 | 0.026912 | 0.020102 | 0.041336 | 0.036214 | 0.028095 | 0.023083 | 0.04316
^a Here LM represents LM-EC and LM-ST, since the genes in LM-EC and LM-ST are identical when mapped to the KEGG database. KG = KEGG Gene; KO = KEGG Orthology.
The statistical significance of pathways in the LMMA-based angiogenesis networks is derived from Fisher's Exact Test. The results are shown in Table 4 and illustrated graphically by an example, the EGF (epidermal growth factor) unit-network, in Figure 4 (see the Discussion for more detail). Although many co-occurrence relations are eliminated from the LM-based network, the main pathway information, such as the focal adhesion pathway and the TGF-beta, MAPK, calcium and Wnt signaling pathways, is observed in the LMMA-based network with significant P-values. Thus, pathways in the LMMA-based network are significantly enriched.
An EGF (epidermal growth factor) unit-network derived from the co-occurrence literature mining and the LMMA approaches, respectively. A total of 21 genes co-cited with EGF in LM are removed by LMMA. On manually revisiting the PubMed records, these 21 genes are found to be in false relations with EGF resulting from homonymic mismatches and confused lexical orders (in the blue pane), unknown relations (in the purple pane) and isolated relations (in the yellow pane). The Neato program in the Graphviz software (AT&T) is used to visualize the constructed network.
KEGG pathways with significant P-values in the LMMA-based angiogenesis networks (Thp = 0.150)^a
Pathway | LMMA-EC (KG) | LMMA-EC (KO) | LMMA-ST (KG) | LMMA-ST (KO)
---|---|---|---|---
Focal adhesion pathway | 0.00087 | 1.09e-07 | 0.00084 | 2.84e-08
MAPK signaling pathway | 0.02825 | 0.013338 | 0.015779 | 0.00910
Adherens junction | 2.14e-22 | 1.31e-13 | 3.08e-24 | 2.19e-14
TGF-beta signaling pathway | 0.00010 | 0.00540 | 8.76e-06 | 0.00585
Insulin signaling pathway | 1.27e-06 | 0.00264 | 1.33e-07 | 0.00225
Calcium signaling pathway | 0.00011 | 0.00373 | 1.86e-08 | 6.66e-05
Wnt signaling pathway | - | 0.03010 | - | 0.00548
Regulation of actin cytoskeleton | 0.01100 | - | 0.00020 | -
Cytokine-cytokine receptor interaction | 5.13e-09 | - | 9.52e-16 | -
Apoptosis | 0.00127 | - | 0.03001 | -
Cell cycle | - | 0.04594 | - | 0.02220
a P-values are calculated from Fisher Exact Test. KG = KEGG Gene. KO = KEGG Orthology.
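As a rough illustration of how a pathway P-value of the kind listed above could be obtained, the following sketch computes a one-sided Fisher exact test as the upper tail of the hypergeometric distribution. The function name, the layout of the contingency table and the toy counts are assumptions made for illustration; the text does not state how the 2x2 table was built or whether a one- or two-sided test was used.

```python
from math import comb


def enrichment_p_value(network_in_pathway, network_size,
                       background_in_pathway, background_size):
    """One-sided Fisher exact test for pathway enrichment (illustrative).

    Returns P(X >= network_in_pathway), where X is the number of pathway
    members found when drawing `network_size` items without replacement
    from a background of `background_size` items, of which
    `background_in_pathway` belong to the pathway.
    """
    k, n = network_in_pathway, network_size
    K, N = background_in_pathway, background_size
    largest_overlap = min(n, K)
    # Sum the hypergeometric probabilities of all outcomes at least as
    # extreme as the observed overlap (exact integer arithmetic).
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, largest_overlap + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Hypothetical counts: 3 of 3 network items fall in a pathway that
    # holds 3 of the 10 background items.
    print(enrichment_p_value(3, 3, 3, 10))
```

A small P-value indicates that the pathway is represented in the network more often than expected by chance under this background; the choice of background (all candidate genes, all KEGG-annotated genes, and so on) is a modeling decision that this sketch leaves to the caller.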
A high false positive rate is a well-known problem in most high-throughput methods for detecting molecular interactions (von Mering et al., 2002). In this work, we developed an LMMA approach to construct networks based on both existing knowledge (literature) and experimental information (microarray). This approach performs multivariate analysis to refine the literature-derived holistic network using subject-oriented gene expression profiles. When analyzing the hidden network buried in microarray datasets, two considerations make it necessary to construct the LM-based network beforehand. First, it is not advisable to construct the network directly from thousands of candidate variables if prior knowledge about the network is not available. Second, the number of variables should not exceed the number of observations (i.e. microarray experiments); otherwise the results will be falsely optimized. Thus, a certain number of arrays is required in LMMA for multivariate selection.
As an application, PubMed literature and microarray datasets from both the EC and the ST are selected to reconstruct the LMMA networks for angiogenesis. Compared with LM-random filtering, the LMMA approach yields a larger cluster size and a smaller average path length, while preserving topological properties similar to those of the LM-based network. This indicates that LMMA can eliminate redundant relations while maintaining the backbone of the LM-based network.
Angiogenesis networks constructed by LM and LMMA are tested for accuracy on confident sets of interactions. Both precision and recall rates are calculated against KEGG, a commonly used benchmark. We show that LMMA significantly improves the precision rate compared with LM alone. On the other hand, as Bork et al. (2004) reported, the choice of benchmark set is still a difficult problem because the agreement among different benchmark sets is surprisingly poor. For example, less than half of all pairs in the KEGG benchmark set are present in the Gene Ontology biological process benchmark set (Bork et al., 2004). Moreover, it is commonly known that co-occurrence in the literature often describes or reflects more general relationships between genes. Some of these may be implicit and/or so novel that they have not yet reached the status of common knowledge or accepted fact often required for inclusion in databases such as KEGG. These two aspects may explain why both the LM and LMMA approaches resulted in a low recall rate (Fig. 3) when calculated against KEGG. Even so, we still show that integration with microarray data can significantly increase the reliability of gene co-occurrence networks extracted from the literature.
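The precision and recall rates discussed above can be sketched as a comparison between two sets of undirected gene pairs. This is a minimal illustration under assumptions: the function names are hypothetical, the toy pairs are not taken from the study, and the exact way benchmark pairs were derived from KEGG is not reproduced here.

```python
def normalize_pairs(pairs):
    """Treat relations as undirected and drop self-pairs."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def precision_recall(predicted_pairs, benchmark_pairs):
    """Return (precision, recall) of predicted pairs against a benchmark.

    precision = shared pairs / predicted pairs
    recall    = shared pairs / benchmark pairs
    """
    predicted = normalize_pairs(predicted_pairs)
    benchmark = normalize_pairs(benchmark_pairs)
    shared = len(predicted & benchmark)
    precision = shared / len(predicted) if predicted else 0.0
    recall = shared / len(benchmark) if benchmark else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration only.
    network = [("EGF", "EGFR"), ("EGFR", "EGF"), ("EGF", "SC")]
    benchmark = [("EGFR", "EGF"), ("VEGFA", "KDR")]
    print(precision_recall(network, benchmark))
```

Under this definition, removing a false pair from the network raises precision without changing recall, which matches the kind of improvement reported for LMMA over LM alone.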
To demonstrate how LMMA reduces the false positive rate and improves the precision, we select EGF (epidermal growth factor), a key player in angiogenesis, as an example. As shown in Figure 4, LMMA removes a total of 21 genes falsely related to EGF from the LM-based EGF unit-network. First, LMMA deletes five mismatched genes in LM: SC, SF, AA and PC are abbreviations of stem cells, scatter factor, arachidonic acid or anaplastic astrocytoma, and prostate cancer, respectively; IL8RA (interleukin 8 receptor, alpha) is misinterpreted from IL8 and EGF receptor owing to their lexical order. Second, LMMA removes eight genes with unknown relations (few co-citations) to EGF in LM: CCR6, FGF16, MAP3K8 and EGF are co-cited in only one PubMed sentence, recorded in a gene expression experiment (Gerritsen et al., 2003); the same holds for IL11, IL10, IL3, IL4 and CCR2. Third, LMMA removes eight genes that seldom co-occur with EGF even when their aliases are used: NRG2, Scube1, NPY6R, ZNF78L2, IFI44, RNU106, AXPC1 and ANGPTL6. Thus, our results indicate that common errors that lead to false relations in LM can be effectively removed by the LMMA approach.
Moreover, 11 KEGG pathways are statistically significant in the LMMA-based angiogenesis networks. See Table 4 for the detailed P-value of each pathway calculated by Fisher's exact test. Among them, the focal adhesion pathway, the adherens junction pathway and the regulation of actin cytoskeleton pathway contribute to complex processes such as endothelial cell migration, morphogenesis and angiogenesis (Bix et al., 2004). TGF-beta regulates angiogenesis by affecting proliferation, differentiation and migration of endothelial cells (Lomnytska et al., 2004). The insulin signaling pathway is implicated in cellular mitogenesis, angiogenesis, tumor cell survival and tumorigenesis (Cohen et al., 2005). Many Wnt proteins act through a canonical beta-catenin signaling pathway (Masckauchan et al., 2005) and are able to control diverse biological processes, such as cell differentiation, proliferation (Masckauchan et al., 2005) and vasculature (Goodwin and D'Amore, 2002). Among the intracellular kinases implicated in angiogenesis, p38 MAPK has been shown to transduce signals critical for vascular remodeling and maturation (Zhu et al., 2003). Ca(2+) signaling is involved in virtually all cellular processes (Munaron et al., 2004). In addition, a variety of stimulatory cytokines, such as tumor necrosis factor (TNF)-alpha, interleukin (IL)-1, -6 and interferon (IFN)-gamma, and growth factors can promote the development of functional and structural vascular changes (Kofler et al., 2005). Therefore, pathway information in the LMMA-based angiogenesis network suggests that multiple pathway interactions boost the activity of either EC or ST, which is in accordance with recent reports (Mukhopadhyay and Datta, 2004; McCarty, 2004). Since multiple pathways are dysfunctional in angiogenesis-related disorders such as cancers, a multifocal signal modulation therapy has recently been proposed (McCarty, 2004). The LMMA network will be helpful for analyzing the interactions of multiple pathways in such complex biological processes.
As for the usability of LMMA, the system can be applied to any biological topic for which the related literature and microarray data are available. Note that to construct an LMMA network, the number of candidate variables (genes) should be kept to a suitable size, and the accuracy of the LMMA approach increases with the number of candidate variables within a certain range. The LMMA-based angiogenesis network summarizes large amounts of angiogenesis-related literature and high-throughput microarray data. The LMMA approach enables researchers not only to keep up to date with the relevant literature on specialized biological topics, but also to make sense of the relevant large-scale microarray datasets. It also serves as a useful tool for constructing specific biological networks and for experimental design. Thus, LMMA provides a valuable computer representation of the known angiogenesis-related pathways, as well as of the interactions among multiple pathways. Such a representation will enable a systemic view of angiogenesis in the context of complex gene interactions, which is also helpful for studying the regulation of various complex biological, physiological and pathological systems. In the 'omics' field, the LMMA approach can be further explored to study protein–protein and other interactions.
The authors would like to express their great appreciation to B. Li (Boston University, USA), X. G. Zhang and C. Zhang in their lab for helpful discussions and comments. The authors would like to acknowledge the financial support from FANEDD (No. 200366), the Key Project of Chinese MOE (No. 104009) and the Basic Research Foundation of TNList.
Conflict of Interest: none declared.