Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
Results: We show that PhyloCSF's classification performance in 12-speciesDrosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.
Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available athttp://compbio.mit.edu/PhyloCSF
Contact: [email protected];[email protected]
High-throughput transcriptome sequencing (mRNA-Seq) is yielding precise structures for novel transcripts in many species, including mammals (Guttmanet al., 2010). Accurate computational methods are needed to classify these transcripts and the corresponding genomic exons as protein coding or non-coding, even if the transcript models are incomplete or if they only reveal novel exons of already-known genes. In addition to classifying novel transcript models, such methods also have applications in evaluating and revising existing gene annotations (Butleret al., 2009;Clampet al., 2007;Kelliset al., 2003;Linet al., 2007;Pruittet al., 2009), and as input features forde novo gene structure predictors (Alioto and Guigó, 2009;Brent, 2008). We have previously (Linet al., 2008) compared numerous methods for determining whether an exon-length nucleotide sequence is likely to be protein coding or non-coding, including single-sequence metrics that analyze the genome of interest only and comparative genomics metrics that use alignments of orthologous regions in the genomes of related species.
Among the comparative methods benchmarked in our previous study, one of our original contributions was the Codon Substitution Frequencies (CSF) metric, which assigns a score to each codon substitution observed in the input alignment based on the relative frequency of that substitution in known coding and non-coding regions. We showed that CSF is highly effective, performing competitively with a phylogenetic modeling approach with much less computational expense, and indeed we have applied it successfully in flies (Linet al., 2007;Starket al., 2007), fungi (Butleret al., 2009) and mammals (Clampet al., 2007;Guttmanet al., 2009;2010). However, as discussed in our previous work, CSF has certain drawbacks arising from itsad hoc scheme for combining evidence from multiple species. For example, it makes only partial use of the evidence available in a multispecies alignment, and it produces a score lacking a precise theoretical interpretation, meaningful only relative to its empirical distributions in known coding and non-coding regions.
This article introduces a rigorous reformulation of CSF, which frames the evaluation of a given alignment as a statistical model comparison problem, choosing between phylogenetic models estimated from known coding and non-coding regions as the best explanation for the alignment. This new ‘PhyloCSF’ method fully leverages multiple alignments in a phylogenetic framework, produces meaningful likelihood ratios as its output and rests upon the sweeping theoretical foundation for statistical model comparison. Benchmarking on the classification datasets from our original study, we show that PhyloCSF outperforms all the other methods we had previously considered.
PhyloCSF is applicable for assessing the coding potential of transcript models or individual exons in an assembled genome that can be aligned to one or more informant genomes at appropriate phylogenetic distances. To estimate parameters in the underlying statistical models, the approach also requires that the genome of interest, or one of the informant genomes, have existing coding gene annotations of reasonably good quality. We describe several initial applications of the method in such settings, which illustrate how it can contribute to new genome annotation strategies based on mRNA-Seq.
PhyloCSF is based on the well-established theoretical framework for statistical phylogenetic model comparison. In this context, phylogenetic models are generative probabilistic models that produce alignments of molecular sequences, based on a prior distribution over a common ancestral sequence, the topology and branch lengths of a phylogenetic tree relating the descendants, and a substitution process along each branch giving the rates (per unit branch length) at which each character changes to any other. In phylogenetic model comparison, we wish to choose between two competing models as the better explanation for a given alignment. A standard approach is to decide based on thelikelihood ratio between the two models, which quantifies how much more probable the alignment is under one model than under the other. This general approach has been used to explore many different aspects of the evolution of protein-coding genes, as recently reviewed inAnisimova and Kosiol (2008) andDelportet al. (2008).
To distinguish coding and non-coding regions, we design one phylogenetic model to represent the evolution of codons in protein-coding genes, and another to represent the evolution of nucleotide triplet sites in non-coding regions. These models may have one or more parameters θ that adjust them to the genomic region of interest, e.g. the neutral substitution rate or G+C content. To analyze a given alignmentA of extant sequences, we first determine the probability of the alignment under the maximum likelihood estimate (MLE) of the parameters for the coding model,pC=maxθC Pr(A|Coding,θC). We similarly estimate the alignment's probability under the non-coding model,pN=maxθN Pr(A|Noncoding,θN). Finally, we decide if the alignment is more likely to represent a protein-coding region or a non-coding region based on the log-likelihood ratio. The cutoff can be chosen to achieve a certain level of statistical significance, based on known asymptotic convergence properties of the log-likelihood ratio statistic (Otaet al., 2000;Vuong, 1989;Whelan and Goldman, 1999), or it can be chosen empirically based on classification performance in a test set; we use the latter strategy in this work.
A standard method for detecting purifying selection on protein-coding sequences is to test for evidence that non-synonymous substitutions occur at a slower rate than synonymous substitutions. In the widely used PAML implementation of this test (Yang, 2007;Yang and Nielsen, 1998), the vector of codon frequencies π and the ratio of transition to transversion rates κ are the only parameters used to determine all triplet substitution rates in the background/non-coding model, while the coding model additionally supposes that non-synonymous codon substitution rates are reduced relative to synonymous rates by a scale factor ω (also calleddN/dS). PAML takes the phylogenetic tree topology as input, and estimates the branch lengths, π, κ and ω for each alignment. The log-likelihood ratio between the coding and non-coding models can then be obtained from PAML's output. (For detecting purifying selection, the log-likelihood ratio is set to zero if the estimated ω≥1.)
Our previous work (Linet al., 2008) showed this to be one of the best comparative methods for distinguishing coding and non-coding regions, outperforming our CSF metric according to standard classification error measures. Notably however, thedN/dS test performed worse than CSF for short regions (≤180 nt). This is not surprising since PAML was designed for evolutionary analysis of complete open-reading frames (ORFs), not short exon-length regions, which probably provide too little information to reliably estimate both the branch lengths and codon frequencies in addition to the two rate parameters.
PhyloCSF differs from the standarddN/dS test in two main ways. First, it takes advantage of recent advances in phylogenetic codon models that enable much more detailed representations of coding and non-coding sequence evolution. Specifically, while thedN/dS test uses only a few parameters to model the rates of all possible codon substitutions (Yang and Nielsen, 1998), PhyloCSF usesempirical codon models (ECMs) based on several thousand parameters modeling these rates (Kosiolet al., 2007)—one ECM estimated from alignments of many known coding regions and another ECM from non-coding regions. By comparing these two rich evolutionary models, PhyloCSF can observe many additional informative features of a given alignment compared with thedN/dS test. For example, the coding ECM captures not only the decreased overall rate of non-synonymous substitutions, but also the different rates of specific non-synonymous substitutions reflecting the chemical properties of the amino acids. [Earlier codon modeling approaches also incorporate amino acid distances, e.g.Goldman and Yang (1994), but to our knowledge, these are not widely used for distinguishing coding and non-coding regions.] Also, our ECMs explicitly model the extreme difference in the rates of nonsense substitutions (giving rise to stop codons) in coding and non-coding regions.
Second, PhyloCSF also takes advantage of genome-wide training data to provide prior information about the branch lengths in the phylogenetic tree and the codon frequencies, rather than attempting to reestimate thesea priori in each individual alignment. PhyloCSF assumes a fixed tree ‘shape’ based on the genome-wide MLEs of the branch lengths, and estimates only two scale factors ρC and ρN, applied uniformly to all the branch lengths in the coding and non-coding models, respectively, for each individual region analyzed. This allows PhyloCSF to accomodate some region-specific rate variation and reduces its sensitivity to the absolute degree of conservation—without needing to estimate many parameters for each alignment, which may be difficult for short regions.
In summary (Fig. 1), PhyloCSF relies on two ECMs fit to genome-wide training data, which include estimates for the branch lengths, codon frequencies and codon substitution rates for known coding and non-coding regions. To evaluate a given nucleotide sequence alignment, PhyloCSF (i) determines the MLE of the scale factor ρ on the branch lengths for each of these models, (ii) computes the likelihood of each model (the probability of the alignment under the model) using the MLE of the scale factor, and (iii) reports the log-likelihood ratio between the coding and non-coding models.
PhyloCSF method overview. (A) PhyloCSF uses phylogenetic codon models estimated from genome-wide training data based on known coding and non-coding regions. These models include a phylogenetic tree and codon substitution rate matricesQC andQN for coding and non-coding regions, respectively, shown here for 12Drosophila species.QC captures the characteristic evolutionary signatures of codon substitutions in conserved coding regions, whileQN captures the typical evolutionary rates of triplet sites in non-coding regions. (B) PhyloCSF applied to a short region from the first exon of theD.melanogaster homeobox geneDfd. The alignment of this region shows only synonymous substitutions compared with the inferred ancestral sequence (green). Using the maximum likelihood estimate of a scale factor ρ applied to the assumed branch lengths, the alignment has higher probability under the coding model than the non-coding model, resulting in a positive log-likelihood ratio Λ. (C) PhyloCSF applied to a conserved region within aDfd intron. In contrast to the exonic alignment, this region shows many non-synonymous substitutions (red), nonsense substitutions (blue, purple) and frameshifts (orange). The alignment has lower probability under the coding model, resulting in a negative score.
PhyloCSF's trained parameters include a phylogenetic tree (with branch lengths) and two 64×64 codon rate matricesQC andQN representing coding and non-coding sequence evolution, respectively, as reversible, homogeneous, continuous-time Markov processes. To evaluate a given alignment, we first evaluate the likelihood of the coding model as follows. We define an alignment-specific parameter ρC that operates as a scale factor applied to all of the branch lengths in the predefined tree. Given a setting of ρC, the substitution probability matrix along any branch with lengtht is given byP=exp(tρCQC). We can then compute the probability of the full alignment usingFelsenstein (2004)'s algorithm—assuming independence of the codon sites, using the equilibrium frequencies implicit inQC as the prior distribution over the common ancestral sequence, and marginalizing out any gapped or ambiguous codons. We numerically maximize this probability over ρC to obtain the likelihood of the coding modelpC. We then evaluate the likelihood of the non-coding modelpN in the same way, usingQN and an independent scale factor ρN, and report the log-likelihood ratio as the result.
To estimate the phylogenetic tree and the empirical rate matricesQC andQN for the species of interest, we rely on sequence alignments of many known coding and random non-coding regions. Given this genome-wide training data, we estimate the parameters for the coding and non-coding models by maximum likelihood, using an expectation–maximization approach. The E-step is carried out as previously described (Holmes and Rubin, 2002;Siepel and Haussler, 2004). In each M-step, we update the ECM exchangeability parameters using a spectral approximation method (Arvestad and Bruno, 1997) and the branch lengths by numerical optimization of the expected log-likelihood function (Siepel and Haussler, 2004). Meanwhile, the codon/triplet frequencies are fixed to their empirical averages in the training examples, and we assume a fixed species tree topology.
We used the datasets and benchmarks from our previous study (Linet al., 2008) to evaluate our new method. Briefly, the datasets consist of known protein-coding regions and randomly selected non-coding regions (~50 000 total regions) in the genome of the fruitflyDrosophila melanogaster, aligned with 11 otherDrosophila species using MULTIZ (Blanchetteet al., 2004;Drosophila 12 Genomes Consortium, 2007;Starket al., 2007). The lengths of the regions in both the coding and non-coding sets match the length distribution of fly coding exons. Consistent with our previous work, we trained and applied PhyloCSF on this dataset using 4-fold cross-validation to ensure that any observed performance differences are not due to overfitting. We assessed the results by examining ROC curves and computing the minimum average error (MAE), the average false positive and false negative rates at the cutoff that minimizes this average. To compare the power of the methods for short exons specifically, we additionally computed these benchmarks only for the 37% of examples from 30 to 180 nt in length (a range including three-quarters of mammalian coding exons).
These benchmarks show that PhyloCSF outperforms the other comparative methods we previously benchmarked (Fig. 2, left column), effectively dominating them all at good sensitivity/specificity tradeoffs. PhyloCSF's overall MAE is 19% lower than that of the Reading Frame Conservation metric, 15% lower than that of our older CSF method and 8% lower than thedN/dS test's. PhyloCSF also clearly outperforms the other methods for short exons (Fig. 2, bottom row), with an MAE 11% lower than thedN/dS test's. Our previous study (Linet al., 2008) showed that these comparative methods in turn outperform single-species metrics based on sequence composition [e.g. interpolated Markov models (Delcheret al., 1999) and theZ curve discriminant (Gao and Zhang, 2004)].
PhyloCSF performance benchmarks. ROC plots and error measures for several methods to distinguish known protein-coding and randomly selected non-coding regions inD.melanogaster. The top row of plots shows the results for our full dataset of ~50 000 regions matching the fly exon length distribution, while the bottom row of plots is based on the 37% of these regions between 30 and 180 nt in length. The left-hand plots show the performance of the methods applied to multiple alignments of 12 fly genomes, while the right-hand plots use pairwise alignments betweenD.melanogaster andD.ananassae. PhyloCSF effectively dominates the other methods.
We also compared PhyloCSF to the other methods using only pairwise alignments betweenD.melanogaster andD.ananassae, which we previously showed to be the best single informant for this purpose. PhyloCSF also dominates other methods in these pairwise alignments (Fig. 2, right column), with MAE 24% lower than the next-best method's for all aligned regions (dN/dS test) and 21% lower for the short regions (CSF). Some of PhyloCSF's greater relative advantage in the pairwise case arises from its ability to produce an informative score even for regions lacking alignment withD.ananassae, based on the composition of theD.melanogaster region and the codon frequencies included in PhyloCSF's generative ECMs.
Overall, these benchmarks show that PhyloCSF provides superior power to distinguish fly coding and non-coding regions based on either multispecies or pairwise genome alignments.
Like most codon modeling approaches, PhyloCSF assumes statistical independence of each codon site from all the others in the alignment (conditional on the model parameters). We next studied the extent to which this assumption is violated in the real alignments in our dataset, and sought to use this to further improve our ability to distinguish the coding and non-coding regions.
One simple way of assessing the dependence of codon sites in a set of alignments is to measure the relationship between the PhyloCSF log-likelihood ratio, Λ, and the number of sites in each alignment,n. For alignments consisting ofn truly i.i.d. sites generated from either the coding or non-coding ECM, it can be shown by central limit arguments that the (Cox, 1961;1962;Vuong, 1989;White, 1982). On the other hand, SD(Λ)∝n under complete dependence of the sites in each alignment—that is, where each alignment consists of a single randomly sampled site, repeatedn times. We applied a numerical procedure to find MLEs of coefficientsA andB in SD(Λ)~AnB based on the real alignments in our dataset. For coding regions, we foundBC=0.84 and for non-coding regions, we foundBN=0.76. Compared with the expected values ofB=0.5 under complete independence andB=1 under complete dependence, these results suggest considerable site-to-site dependencies in the real alignments—stronger in coding regions than non-coding regions.
The differing length-dependent dispersions of Λ for coding and non-coding regions suggest that a better classification rule could be obtained by interpreting the log-likelihood ratio for each example in the context of the alignment's length. This contrasts with the suggestion of the ‘law of likelihood’ that, under the model's independence assumptions, any information pertinent to the decision would already be captured in the likelihood ratio (Hacking, 1974).
We calculated Ψ for each of the examples in ourDrosophila dataset (with 2-fold cross-validation) and found that it indeed provides a superior overall discriminant between the coding and non-coding regions (Fig. 3), with an MAE 17% lower than that of Λ and 24% lower than thedN/dS test's. While Ψ somewhat distorts the formal inferential meaning captured by Λ, this approach exploits information from site-to-site dependencies in a principled way, and at negligible additional computational cost—in contrast to attempts to explicitly capture such dependencies in the generative codon models (Anisimova and Kosiol, 2008;Delportet al., 2008).
Exploiting non-independence of codon sites. The PhyloCSF log-likelihood ratio Λ is transformed based on alignment length into a new log-likelihood ratio score Ψ (see main text). Ψ provides a superior discriminant in the full 12 fly dataset.
To facilitate the use of PhyloCSF by the community, we provide an implementation that evaluates input sequence alignments in Multi-FASTA format and reports the resulting log-likelihood ratios in units of decibans. We also provide the ECMs and other parameter settings for several phylogenies.
The software provides two additional noteworthy features. First, it can evaluate all ORFs above a given length, in three or six reading frames, within each alignment. Thus, it can delimit likely protein-coding ORFs within transcript models that include untranslated regions. Second, the software also provides a simplified method, similar to the aforementioneddN/dS test, that does not require extensive training data from known coding and non-coding regions. While considerably less accurate than the full ECM comparison presented here, this method can be used in settings where genome alignments are available but high-quality existing gene annotations are lacking, as may increasingly be the case outside of well-studied phylogenies such as mammals and flies.
The Objective Caml source code and executables for GNU/Linux and Mac OS X are available at:http://compbio.mit.edu/PhyloCSF
We now briefly discuss several initial applications of PhyloCSF in three different phylogenies. While biological results of each of these applications are presented elsewhere in greater detail, together they illustrate PhyloCSF's usefulness in practice.
Recent efforts to reconstruct the transcriptome of the fission yeastSchizosaccharomyces pombe based on mRNA-Seq experiments suggested numerous novel transcript models. We generated whole-genome alignments ofS.pombe withS.octosporus,S.japonicus andS.cryophilus, and identified 89 novel protein-coding genes that are highly conserved among these species, even though they do not show primary sequence similarity to known proteins. The mRNA-Seq reconstruction also suggested several hundred revisions to the exon–intron structures of existing gene annotations. PhyloCSF showed that the revised transcript models strongly tend to have higher coding potential than the originals, confirming the quality of the proposed revisions (Rhindet al., in press).
The modENCODE project undertook a comprehensive effort to map the transcriptome ofD.melanogaster, identifying 1938 novel transcribed loci in this well-studied genome. Using PhyloCSF with the 12-species whole-genome alignments, we identified 138 of these likely to represent novel coding genes (The modENCODE Consortium, 2010). We also used PhyloCSF in concert with the modENCODE data to identify and characterize 283 knownD.melanogaster genes that show high coding potential immediately downstream of their stop codon, suggesting a surprisingly widespread mechanism of stop codon readthrough (Jungreis,J.et al., submitted for publication). PhyloCSF's ability to systematically resolve small regions was especially important in this application, as many of these putative readthrough regions are quite short (mean 67 codons).
Lastly, we have also used PhyloCSF in the human and mouse genomes, using alignments of 29 placental mammals (Linblad-Toh,K.et al., submitted for publication). We applied PhyloCSF to transcript models reconstructed from mRNA-Seq on 16 human tissues to identify several candidate novel coding genes (Fig. 4), despite the extensive decade-long efforts to annotate the human genome. We also used CSF and PhyloCSF to evaluate coding potential in novel mouse transcripts, helping to identify thousands of longnon-coding RNAs with diverse functional roles (Guttmanet al., 2010) (Hunget al., in press).
A novel human coding gene found using mRNA-Seq and PhyloCSF. Transcriptome reconstruction by Scripture (Guttmanet al., 2010) based on brain mRNA-Seq data provided by Illumina, Inc. produced two alternative transcript models lying antisense to an intron ofGTF2E2, a known protein-coding gene. PhyloCSF identified a 95-codon ORF in the third exon of this transcript, highly conserved across placental mammals. The color schematic illustrates the genome alignment of 29 placental mammals for this ORF, indicating conservation (white), synonymous and conservative codon substitutions (green), other non-synonymous codon substitutions (red), stop codons (blue/magenta/yellow) and frame-shifted regions (orange). Despite its unmistakable protein-coding evolutionary signatures, the ORF's translation shows no sequence similarity to known proteins.
We have introduced PhyloCSF, a comparative genomics method for distinguishing protein-coding and non-coding regions, and shown that it outperforms previous methods. In addition to its superior discriminatory power, PhyloCSF is far more theoretically attractive than our older CSF and otherad hoc metrics, relying on a formal statistical comparison of phylogenetic codon models. However, we note that PhyloCSF and CSF produce highly correlated scores (Pearson coefficient 0.95 in our dataset), and the new method is much more computationally demanding.
As our initial applications illustrate, PhyloCSF can provide an important building block in future computational strategies for genome annotation based on mRNA-Seq. Other pieces of the puzzle include methods to reconstruct transcript models based on short reads, complementary metrics of coding potential based on primary sequence composition or indel patterns, database search tools to identify similarity to known proteins and non-coding RNAs andde novo gene structure predictors that may be able to identify lowly or rarely expressed genes. Major challenges remain in integrating these methods into coherent pipelines, harmonizing the results with existing genome annotation databases, and coping with the uneven coverage and relatively high error rate of current high-throughput sequencing technologies (Ozsolak and Milos, 2011).
The authors thank Matthew D. Rasmussen and Manuel Garber for helpful comments and discussions.
Funding: National Institutes of Health (U54 HG004555-01); National Science Foundation (DBI 0644282).
Conflict of Interest: none declared.
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide
This PDF is available to Subscribers Only
View Article Abstract & Purchase OptionsFor full access to this pdf, sign in to an existing account, or purchase an annual subscription.