Notes on creation of the tracks in http://www.broadinstitute.org/compbio1/PhyloCSFtracks/galGal4/20160701:

PhyloCSF was run with the "fixed" strategy using the 49birds parameters on every codon in every frame on each strand of the galGal4 chicken assembly. The frame of a codon is defined as the remainder mod 3 of the lowest-coordinate base of the codon, counting the first base of a chromosome as 1. Alignments were extracted using a pre-publication version of the 49 sauropsids alignment.

The scores were smoothed using a hidden Markov model (HMM) with four states, one representing coding regions and three representing non-coding regions. The emission of each codon is its PhyloCSF score. The ratio of the emission probabilities under the coding and non-coding models is computed from the PhyloCSF score, since that score represents the log-likelihood ratio of the alignment under the coding and non-coding models. The three non-coding states have the same emission probabilities but different transition probabilities (they can transition only to the coding state, not to one another), to better capture the multimodal distribution of gaps between same-frame coding exons. These transition probabilities represent the best approximation of this gap distribution as a mixture of three exponential distributions, computed using expectation maximization. The HMM defines a probability that each codon is coding, based on the PhyloCSF scores of that codon and nearby codons on the same strand in the same frame, without taking into account start codons, stop codons, or potential splice sites.

PhyloCSF+0.bw shows the log-odds that codons in frame 0 on the '+' strand are in the coding state according to the HMM, and similarly for strand '-' and frames 1 and 2.

PhyloCSF+0Regions.bb, etc., show the regions that are coding in the most-likely path through the HMM. They are something like predicted exons, except that they do not take splice sites, start codons, or stop codons into account, so the boundaries are approximate.
The gray scale indicates the maximum log-odds that any codon in the region is coding according to the HMM.

PhyloCSFpower.bw shows the branch length score at each codon, i.e., the ratio of the branch length of the species present in the local alignment to the total branch length of the 29-mammals tree. It is an indication of the statistical power available to PhyloCSF. Codons with branch length score less than 0.1 have been excluded altogether (from all tracks) because PhyloCSF does not have sufficient power to produce a meaningful score at these codons. Scores at codons with branch length score greater than 0.1 but much less than 1 should be considered less certain.

The PhyloCSF Novel Regions track shows SVM-prioritized PhyloCSF regions, excluding those already annotated. This track is similar to the six PhyloCSF regions tracks, which show the coding-state intervals in the most-likely path through the PhyloCSF HMM, except:
- Portions of PhyloCSF regions that overlap annotated coding regions (in Ensembl Galgal4.82) in the same frame or antisense frame, or that overlap annotated pseudogenes in any frame, have been excluded.
- Short regions (<= 8 codons) have been excluded.
- Regions are colored according to strand: + is green, - is red.
- All six frames are combined into a single track.
- Regions are shaded according to the SVM ranking: regions with low ranks, i.e., those most likely to be real novel coding regions, are darker. (The old PhyloCSF regions are shaded according to PhyloCSF score alone, whereas the SVM takes into account other things like branch length and region length.)
- The rank is displayed next to the region.

The splice prediction tracks show the maximum-entropy splice predictions (Yeo and Burge, 2004) for the entire genome (canonical splice sites only). There is one track for each strand. The bars are on the first or last base of the imagined intron (i.e., the G of GT or AG).
Green bars indicate splice donors and red bars indicate acceptors, so an intron would extend from a green bar to a red bar, just as an ORF extends from a green ATG to a red stop codon in the sequence track. The height of the bar indicates the splice prediction score, ranging from 0 to 15. AGs and GTs with negative splice prediction scores are not shown.
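
The coordinate conventions in these notes (the frame of a codon, and the base on which a splice bar is drawn) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the tracks: the function names are hypothetical, the file names for the '-' strand and for frames 1 and 2 are inferred from the PhyloCSF+0.bw pattern, and the splice helper only lists candidate GT/AG positions on the '+' strand without computing any maximum-entropy score.

```python
def codon_frame(lowest_base: int) -> int:
    """Frame of a codon, given the 1-based coordinate of its lowest base.

    Follows the definition in the notes: the remainder mod 3 of the
    lowest base of the codon, with the first base of a chromosome as 1.
    """
    if lowest_base < 1:
        raise ValueError("coordinates are 1-based")
    return lowest_base % 3


def score_track_name(strand: str, lowest_base: int) -> str:
    """Name of the score track covering this codon.

    Assumption: the '-' strand and frames 1 and 2 follow the same
    naming pattern as PhyloCSF+0.bw.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return f"PhyloCSF{strand}{codon_frame(lowest_base)}.bw"


def candidate_splice_bars(seq: str):
    """1-based positions of candidate splice bars on the '+' strand.

    A donor bar sits on the G of GT (first base of the imagined intron);
    an acceptor bar sits on the G of AG (last base of the imagined intron).
    No scoring is done here, so this lists every GT and AG, whereas the
    real tracks omit sites with a negative splice prediction score.
    """
    seq = seq.upper()
    donors = [i + 1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors
```

For example, a codon occupying bases 3-5 of a chromosome has frame 0, and one occupying bases 4-6 has frame 1. On the '-' strand the frame is still taken from the lowest coordinate of the codon, which is why the helper takes only that coordinate and not the strand.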