Description

These tracks show evolutionary protein-coding potential as determined by PhyloCSF (1) to help identify conserved, functional, protein-coding regions of genomes. PhyloCSF examines evolutionary signatures characteristic of alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and nonsense substitutions (CSF = Codon Substitution Frequencies). PhyloCSF provides more information than conservation of the amino acid sequence alone, because it distinguishes among the different codons that code for the same amino acid. One of PhyloCSF's main current applications is to help distinguish protein-coding from non-coding RNAs among novel transcript models obtained from high-throughput transcriptome sequencing. More information on PhyloCSF can be found on the PhyloCSF wiki.

The Raw PhyloCSF tracks show the PhyloCSF score for each codon in each of 6 frames. Regions in which most codons have a score greater than 0 are likely to be protein-coding in that frame. No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power).

The Smoothed PhyloCSF tracks show the scores smoothed using an HMM. No score is shown when the relative branch length is less than 0.1 (see PhyloCSF Power).

The PhyloCSF Regions tracks show the regions that are protein-coding in the most-likely-path through the HMM. They are something like predicted exons, except that splice sites, start codons, and stop codons are not considered so the boundaries are approximate. The gray scale is an indication of the maximum log-odds that any codon in the region is coding according to the HMM.
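As a rough illustration of how regions could be read off a most-likely path, the sketch below groups consecutive coding codons into intervals and records the maximum log-odds in each, which is the quantity the gray scale reflects. The function name, the boolean path representation, and the codon-index coordinates are assumptions made for this example, not the format used to build the tracks.

```python
from itertools import groupby

def coding_regions(is_coding, log_odds):
    """Group consecutive coding codons into (start, end, max_log_odds) tuples.

    is_coding: one boolean per codon along a single frame (True = coding state).
    log_odds:  one HMM log-odds value per codon, same length as is_coding.
    Intervals are half-open codon indices [start, end).
    """
    regions = []
    idx = 0
    for flag, run in groupby(is_coding):
        n = len(list(run))
        if flag:
            regions.append((idx, idx + n, max(log_odds[idx:idx + n])))
        idx += n
    return regions

# Two coding runs: codons 1-2 and codon 4.
example = coding_regions([False, True, True, False, True], [-1, 2, 5, -3, 1])
```

Because start codons, stop codons, and splice sites are not considered, intervals produced this way are approximate in the same sense as the track's region boundaries.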

The PhyloCSF Power track shows the branch length score at each codon, i.e., the ratio of the branch length of the species present in the local alignment to the total branch length of all species in the full genome alignment. It is an indication of the statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to get a meaningful score at these codons. Codons with branch length score greater than 0.1 but much less than 1 should be considered less certain.
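The ratio described above can be sketched on a toy tree. The example assumes that "the branch length of the species present" means the total length of the minimal subtree connecting those species; the tree, its branch lengths, and the function names are hypothetical and are not the 29-mammal tree or PhyloCSF's own code.

```python
# child -> (parent, branch length). Hypothetical toy tree for illustration only.
TREE = {
    "human": ("primates", 0.1),
    "chimp": ("primates", 0.1),
    "mouse": ("rodents", 0.3),
    "rat": ("rodents", 0.3),
    "primates": ("root", 0.2),
    "rodents": ("root", 0.2),
}

def leaves_below(tree, node):
    """Return the set of leaf names at or below the given node."""
    children = [c for c, (parent, _) in tree.items() if parent == node]
    if not children:
        return {node}
    return set().union(*(leaves_below(tree, c) for c in children))

def branch_length_score(tree, present):
    """Ratio of the branch length spanned by the present species to the total.

    Assumption: a branch counts if present species lie on both sides of it,
    i.e. it belongs to the minimal subtree connecting the present species.
    """
    present = set(present)
    total = sum(length for _, length in tree.values())
    spanned = 0.0
    for node, (_, length) in tree.items():
        below = leaves_below(tree, node) & present
        if below and present - below:
            spanned += length
    return spanned / total

score_all = branch_length_score(TREE, ["human", "chimp", "mouse", "rat"])
score_two = branch_length_score(TREE, ["human", "mouse"])
```

Under this definition, a codon aligned in every species has a score of 1, and a codon for which only closely related species align has a score near 0, which matches the interpretation of the track as a measure of statistical power.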

The Splice Prediction tracks show canonical splice predictions for each strand. This can be useful for predicting where novel exons start and end. The bars are on the first or last base of the hypothetical intron, i.e., the G of GT or AG. Green bars indicate splice donors and red bars indicate acceptors, so an intron would extend from a green bar to a red bar. Taller bars are more likely to be true splice sites.
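To make the bar placement concrete, the sketch below lists where bars could fall for plus-strand candidates in a short sequence: the G of each GT (first base of a hypothetical intron) and the G of each AG (last base of a hypothetical intron). It only locates canonical dinucleotides and does not compute the maximum entropy scores that determine bar height; the function name and the 1-based offset are assumptions for this example.

```python
def canonical_splice_positions(seq, offset=1):
    """Return (donor_positions, acceptor_positions) on the given strand.

    Donor position:    coordinate of the G in GT (first base of the intron).
    Acceptor position: coordinate of the G in AG (last base of the intron).
    offset is the coordinate of the first base of seq (1-based by default).
    """
    seq = seq.upper()
    donors = [i + offset for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 1 + offset for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

donors, acceptors = canonical_splice_positions("CAGGTAAGTTTTCAGGC")
```

A candidate intron then runs from a donor position to a later acceptor position, inclusive of both G bases.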

The PhyloCSF Novel tracks show regions that could be protein-coding but are not currently annotated as protein-coding exons or pseudogenes in a specified gene set. Green regions are on the plus strand and red regions are on the minus strand. These regions have been ranked using an SVM with ones most likely to be real novel coding or pseudogene regions having low rank (darker shading), and ones that are more likely to be non-coding false positives having high rank (lighter shading). The rank is listed next to the region.

The color-coded alignment of a novel region can be examined using CodAlignView (3) as follows: select the region in the browser to view its properties; copy the Position string and note the Strand; open CodAlignView (see reference 3 for the address); paste the position string into the Intervals control; set the strand; choose an appropriate Alignment Set; and click Redisplay Alignment.

Caveats

Around 10% of annotated protein-coding regions get a score less than 0. This can happen for various reasons. For example, the region could be coding in the reference species but not in other species, or the alignment might not represent a true orthology relationship between the species.
Protein-coding regions will often have a positive score on the reverse strand in the frame in which the third codon positions match up (the "antisense" frame), though the score will usually be higher on the correct strand.
Pseudogenes will often get positive scores even though they are not protein-coding. This can happen for a variety of reasons. For example, the aligner might align a duplicated pseudogene to the orthologs of its parent. Also, since PhyloCSF measures coding potential on the whole species tree, a unitary pseudogene that was coding in the common ancestor but is no longer protein-coding in the reference species will often still have a positive score.

Methods

PhyloCSF was run with the "fixed" strategy on every codon in every frame on each strand in the assembly. The frame of a codon is defined as the remainder mod 3 of the lowest nominal base of the codon, counting the first base of a chromosome as 1.
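The frame definition above amounts to a single modulo operation. A minimal sketch, with a hypothetical function name:

```python
def codon_frame(lowest_base):
    """Frame of a codon, given the 1-based coordinate of its lowest-numbered base."""
    if lowest_base < 1:
        raise ValueError("chromosome coordinates are 1-based")
    return lowest_base % 3

# Codons whose lowest bases are 1, 2, 3, and 4 fall in frames 1, 2, 0, and 1.
frames = [codon_frame(pos) for pos in (1, 2, 3, 4)]
```

Note that under this convention the codon starting at the first base of a chromosome is in frame 1, not frame 0.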

The alignments and PhyloCSF parameters for the various assemblies are as follows:

Assembly	PhyloCSF Parameters	Alignments
hg19	29 mammals	29-mammals subset of the 46-vertebrates hg19 alignment
mm10	29 mammals	29-mammals subset of the 60-vertebrates mm10 alignment
galGal4	29 mammals	49 sauropsids alignment (unpublished)

The hg38 scores for most codons were obtained by finding the corresponding codon in hg19 using liftOver and using the score of that codon. Codons that did not lift over from hg19 were scored using a 29-mammal subset of the 100-vertebrates hg38 alignment, consisting of the species Human, Chimp, Rhesus, Bushbaby, Mouse, Rat, Squirrel, Rabbit, Pika, Alpaca, Dolphin, Cow, Horse, Cat, Dog, Microbat, Megabat, Hedgehog, Shrew, Elephant, Tenrec, Armadillo, Chinese_tree_shrew, Guinea_pig, Marmoset, Star_nosed_mole, Manatee, Brush_tailed_rat, and Chinese_hamster. The resulting score was then divided by 0.97 to make it as comparable as possible to the lifted-over hg19 scores.
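The per-codon choice between the two hg38 score sources can be summarized as follows. This is only a sketch of the logic described above; the function name and the use of None for "no score available" are assumptions for this example.

```python
HG38_SCALE = 0.97  # divisor stated in the text above

def hg38_codon_score(lifted_hg19_score, direct_hg38_score):
    """Pick the score for one hg38 codon.

    lifted_hg19_score: score of the corresponding hg19 codon, or None if the
                       codon did not lift over.
    direct_hg38_score: score computed on the hg38 alignment, or None.
    """
    if lifted_hg19_score is not None:
        return lifted_hg19_score
    if direct_hg38_score is None:
        return None
    return direct_hg38_score / HG38_SCALE

lifted = hg38_codon_score(5.0, 9.0)       # lifted-over score is used as-is
rescaled = hg38_codon_score(None, 9.7)    # direct score is divided by 0.97
```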

The scores were smoothed using a Hidden Markov Model (HMM) with 4 states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities for the coding and non-coding models is computed from the PhyloCSF score, since the score represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (each can only transition to coding) to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture of three exponential distributions, computed using Expectation Maximization.
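The structure of such a model can be sketched with a small Viterbi decoder. Everything numeric here is hypothetical: the transition probabilities, mixture weights, and prior are placeholders, not the values fitted by Expectation Maximization, and the conversion factor assumes scores are expressed in decibans. Because only the ratio of coding to non-coding emission probabilities is known, the sketch assigns the log-likelihood ratio to the coding state and 0 to each non-coding state, which leaves the most likely path unchanged.

```python
import math

STATES = ["C", "N1", "N2", "N3"]  # one coding state, three non-coding states

def make_log_transitions(exit_coding, weights, exit_noncoding):
    """Build log transition probabilities.

    From C: stay, or enter N_i with probability exit_coding * weights[i].
    From N_i: stay, or move to C with probability exit_noncoding[i].
    Non-coding states never transition to each other.
    """
    neg_inf = float("-inf")
    t = {s: {r: neg_inf for r in STATES} for s in STATES}
    t["C"]["C"] = math.log(1.0 - exit_coding)
    for name, w, q in zip(STATES[1:], weights, exit_noncoding):
        t["C"][name] = math.log(exit_coding * w)
        t[name][name] = math.log(1.0 - q)
        t[name]["C"] = math.log(q)
    return t

def viterbi(scores, log_trans, log_prior, llr_per_unit=math.log(10) / 10):
    """Return the most likely state path for a list of per-codon scores."""
    def emit(state, score):
        # Emission log-probability relative to the non-coding model.
        return score * llr_per_unit if state == "C" else 0.0

    best = {s: log_prior[s] + emit(s, scores[0]) for s in STATES}
    back = []
    for score in scores[1:]:
        prev, best, ptr = best, {}, {}
        for s in STATES:
            r = max(STATES, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[r] + log_trans[r][s] + emit(s, score)
            ptr[s] = r
        back.append(ptr)
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters and a toy score profile: low, high, low.
trans = make_log_transitions(0.02, (0.5, 0.3, 0.2), (0.05, 0.005, 0.0005))
prior = {"C": math.log(0.1), "N1": math.log(0.3),
         "N2": math.log(0.3), "N3": math.log(0.3)}
scores = [-20.0] * 5 + [30.0] * 6 + [-20.0] * 5
path = viterbi(scores, trans, prior)
```

The three non-coding states differ only in how quickly they return to coding, so together they model short, medium, and long gaps between same-frame exons as a mixture of three geometric (discrete exponential) length distributions.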

The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites. PhyloCSF+0 shows the log-odds that codons in frame 0 on the '+' strand are in the coding state according to the HMM, and similarly for strand '-' and frames 1 and 2.
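The relationship between the HMM's coding probability and the displayed log-odds can be written out directly. The text does not state the base of the logarithm, so the natural logarithm used below is an assumption, as is the function name.

```python
import math

def log_odds(p_coding):
    """Log-odds that a codon is in the coding state, given its probability."""
    if not 0.0 < p_coding < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p_coding / (1.0 - p_coding))

even = log_odds(0.5)     # equal odds give a log-odds of 0
likely = log_odds(0.9)   # positive: more likely coding than not
unlikely = log_odds(0.1) # negative, and symmetric with the 0.9 case
```

A value above 0 therefore means the HMM considers the codon more likely coding than non-coding in that frame and strand.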

Splice sites were predicted using the maximum entropy method (2), trained on human splice sites.

The PhyloCSF Novel tracks were created as follows. All regions from the PhyloCSF Regions tracks were compared to protein-coding and pseudogene annotations from the specified gene set. Regions contained in annotated pseudogene regions, or in protein-coding regions in the same frame or the antisense frame, were eliminated. If part of a region was contained in an annotated region, the region was trimmed to the unannotated portion. Regions less than eight codons long were eliminated. Regions that were more likely to be antisense to a novel region than to be novel themselves were identified using a four-feature SVM and excluded. The features were the PhyloCSF score, the difference between the PhyloCSF scores on the two strands, the length of the region, and the relative branch length of the species in the local alignment of the region. Ranks were assigned by a second SVM that uses the same four features but was trained to distinguish regions that are contained in coding annotations from ones that are not.
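The trimming and length filter in this procedure can be sketched with simple interval arithmetic. The example uses half-open base coordinates, ignores frame and strand matching, and leaves out both SVM steps; the function names and coordinates are hypothetical.

```python
MIN_CODONS = 8  # regions shorter than this are eliminated

def subtract(region, annotations):
    """Return the parts of a half-open region (start, end) not covered by annotations."""
    pieces = [region]
    for a_start, a_end in annotations:
        next_pieces = []
        for start, end in pieces:
            if a_end <= start or a_start >= end:
                next_pieces.append((start, end))  # no overlap, keep unchanged
                continue
            if start < a_start:
                next_pieces.append((start, a_start))  # unannotated left part
            if a_end < end:
                next_pieces.append((a_end, end))      # unannotated right part
        pieces = next_pieces
    return pieces

def novel_candidates(regions, annotations):
    """Trim regions to their unannotated portions and drop the short ones."""
    kept = []
    for region in regions:
        for start, end in subtract(region, annotations):
            if (end - start) // 3 >= MIN_CODONS:
                kept.append((start, end))
    return kept

regions = [(100, 160), (200, 290), (300, 330)]
annotations = [(90, 170), (260, 400)]
candidates = novel_candidates(regions, annotations)
```

In this toy input the first and third regions lie entirely within annotations and are dropped, while the second is trimmed to its 20-codon unannotated portion and kept.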

Credits

Questions should be directed to Irwin Jungreis.

References

(1) Lin MF, Jungreis I, and Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).

(2) Yeo G and Burge CB (2004). Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology 11(2-3):377-394.

(3) Jungreis I, Lin MF, Chan CS, Kellis M (2016). CodAlignView: The Codon Alignment Viewer [Internet]. Available from: http://data.broadinstitute.org/compbio1/cav.php.