The tab-delimited spreadsheets in this folder list the "PhyloCSF Candidate Coding Regions" (PCCRs) for various species and annotation sets, with additional information about each PCCR to help determine which of them are novel coding regions. PCCRs are genomic intervals that show evolutionary evidence of protein-coding potential but are not already annotated as protein-coding or pseudogenic. They are ranked according to how likely they are to be true protein-coding regions. Manual curation of the 1000 top-ranked PCCRs, plus a targeted search of a few hundred lower-ranked PCCRs, resulted in the addition of 144 protein-coding genes and hundreds of pseudogenes and coding exons in previously annotated genes to human GENCODE versions 24-28, more than half of which had not been previously discovered [1]. Analyses of PCCR lists in other species are likely to find many novel coding regions in those species as well.
PCCRs are created in three steps: (1) whole-genome multispecies alignments are used to compute PhyloCSF scores [2] for every codon in the genome in each of the six reading frames; (2) a Hidden Markov Model finds candidate coding intervals, called "PhyloCSF Regions"; and (3) intervals that are already annotated as coding or pseudogenic, or that are antisense to such annotations, are excluded, as are very small intervals. The remaining intervals are then sorted using a Support Vector Machine (SVM) whose features are the PhyloCSF score, the difference between the PhyloCSF scores on the two strands, the length of the interval, and the phylogenetic branch length of the local alignment of the interval.
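The exclusion step can be pictured as interval subtraction. The following is only a minimal sketch, not the actual pipeline code: the function name `trim_region`, the `(start, end)` tuple representation, and the `min_len` cutoff are all assumptions made for illustration, and the sketch ignores reading frame and codon boundaries. It removes the annotated portions from a candidate interval and drops leftover pieces that are too small.

```python
def trim_region(region, annotations, min_len=9):
    """Return the parts of `region` that no annotation interval covers.

    `region` and each entry of `annotations` are (start, end) pairs using
    1-based inclusive coordinates on a single chromosome. Annotations from
    both strands should be passed in, since antisense overlaps are also
    excluded. `min_len` is a hypothetical size cutoff in nucleotides; the
    real threshold is not stated in this README.
    """
    start, end = region
    pieces = []
    cursor = start  # first position not yet known to be covered
    for a_start, a_end in sorted(annotations):
        if a_end < cursor:
            continue  # annotation lies entirely before the uncovered part
        if a_start > end:
            break  # sorted by start, so no later annotation can overlap
        if a_start > cursor:
            pieces.append((cursor, a_start - 1))
        cursor = max(cursor, a_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    # Drop leftover pieces shorter than the cutoff.
    return [(s, e) for s, e in pieces if e - s + 1 >= min_len]


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    print(trim_region((100, 200), [(150, 170)]))
    print(trim_region((100, 200), [(90, 195)]))
```

In this picture, a region that loses part of its length to an annotation corresponds to an "Extension" PCCR, and a region returned unchanged corresponds to a "NoOverlap" PCCR.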
The following is a description of the spreadsheet fields. Some of the spreadsheets have slightly different fields, mostly for historical reasons.
Field | Description |
Rank | SVM rank. PCCRs with lower ranks (smaller numbers) are more likely to be true protein-coding regions. |
ID | Unique integer identifier for each PCCR. |
Name | PCCR name of the form: AnnoChrom:Start-End+ or AnnoChrom:Start-End- depending on strand. |
RegType | "Extension" if PCCR is the non-overlapping part of a PhyloCSF Region that overlaps an annotation, otherwise "NoOverlap". |
AnnoChrom | Chromosome as it appears in the annotation GFF/GTF file, which can differ from the name in the alignment MAF file (e.g., 2L versus chr2L). |
Start | Nominally lowest 1-based chromosomal coordinate of nucleotides in interval. |
End | Nominally highest 1-based chromosomal coordinate of nucleotides in interval. |
Strand | "+" or "-". |
ScorePerCodon | PhyloCSF score divided by number of codons in the interval. Scores were computed using PhyloCSF's "mle" option, except that the oldest files, GENCODE23 and GENCODEm5, used the "fixed" option. |
PhyloCSFPsi | Length-adjusted PhyloCSF score (see [2]). |
NumCodons | Length of interval, in codons. |
Bls | Phylogenetic branch length of species present in the local alignment of the interval, divided by branch length of the whole species tree. |
AntiScorePerCodon | PhyloCSF score of the interval in the antisense frame (frame on opposite strand that shares the 3rd codon position) divided by number of codons. |
ScoreDiff | ScorePerCodon - AntiScorePerCodon. |
OverlapTrs | List of protein-coding or pseudogene transcript annotations that overlap the interval in any of the 6 reading frames. For protein-coding, only includes the CDS portion. |
Parent | For "Extension" PCCRs, the PhyloCSF Region that was trimmed to form this one, otherwise "NA". |
CodAlignView | Excel hyperlink to show the color-coded alignment of the interval in CodAlignView [3]. |
UCSCview | Excel hyperlink to show the interval in the UCSC genome browser. |
SvmScore | SVM score used to sort PCCRs. PCCRs with higher scores are more likely to be true protein-coding regions. |
SvmAntisenseScore | SVM score used to distinguish PhyloCSF Regions more likely to be novel coding regions on their strand than on the opposite strand. |
TrypticAAs | Theoretical translation of the PCCR, trimmed to tryptic boundaries. Useful when validating PCCRs using mass spectrometry. |
OutOfOrfCodons | The number of codons after or before a stop codon, whichever is less. If this number is large it indicates the presence of a stop codon in the midst of an ancestral coding region, which typically means this is a pseudogene, though it could also be stop codon readthrough. |
Cluster | Nearby PCCRs were grouped into clusters. This field is an index into the list of clusters. |
ClusterStart | Lowest Start of any PCCR in this cluster. |
ClusterEnd | Highest End of any PCCR in this cluster. |
BestInCluster | "Best" for the PCCR with the lowest (that is, best) rank in the cluster, otherwise blank. |
ClusterBestRank | Lowest rank of any PCCR in the cluster. |
ClusterRanks | Ranks of all PCCRs in the cluster. |
ClusterUCSCview | Excel hyperlink to show the cluster in the UCSC genome browser. |
RToverlap | "RToverlap" if the PCCR overlaps a previously predicted stop codon readthrough region in the same frame, otherwise blank. |
GC | GC content of the nucleotides in the PCCR. |
CpGratio | Number of CpGs in the PCCR divided by number expected from the numbers of Gs and Cs. |
CodingCollisionFrac_{AlignmentSet} | Coding Collision Fraction is the fraction of the alignment of the interval that is shared with the alignment of some annotated coding region. Nonzero values indicate paralogy. Values much larger than zero imply that the PhyloCSF signal could be due to the alignment of the annotated coding region rather than of this interval, which suggests this is a pseudogene. The AlignmentSet indicates which alignment has the collision. For example, a human region might not have a collision in the 58-mammals alignment, but have one in the 100-vertebrates alignment. |
CodingCollisionTr_{AlignmentSet} | The annotated transcript whose alignment collides with that of the PCCR. |
*Mle_{AlignmentSet} | Fields with names of this form hold scores and related values recomputed using PhyloCSF's "mle" option on the specified alignment set, which can be different from the alignment set used to find the PCCRs in the first place. |
AllOverlapTrs | List of transcripts overlapping or adjacent to the PCCR. Only includes one transcript per type. |
OverlapTypes | The type of overlap for each transcript in AllOverlapTrs. For each transcript, the reported type is the first one in the following list that it satisfies: Extension5p - 5' extension of a coding region in the same frame; Extension3p - 3' extension of a coding region in the same frame; ExtensionMid - extension of a coding region at end of an interior exon; CDS - overlaps CDS (in alternate frame); UTR5 - overlaps 5'-UTR; UTR3 - overlaps 3'-UTR; NC - overlaps exon of non-coding transcript; CDSintron - overlaps an intron within a CDS; UTR5intron - overlaps an intron within 5'-UTR; UTR3intron - overlaps an intron within 3'-UTR; NCintron - overlaps an intron of a non-coding transcript; AntiCDS, etc. - overlaps CDS, etc., on other strand (none for Extension types). |
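As an illustration of how the fields above fit together, here is a minimal sketch in Python that reads a PCCR table, parses the Name field, keeps the best-ranked PCCR of each cluster, and checks that ScoreDiff equals ScorePerCodon minus AntiScorePerCodon. The helper names (`parse_pccr_name`, `load_pccrs`, `best_per_cluster`, `score_diff_matches`) are hypothetical, the sample rows are invented and contain only a few of the columns, and the tolerance is an assumption that allows for rounding in the files.

```python
import csv
import io
import math
import re

# Name has the form AnnoChrom:Start-End followed by the strand character.
NAME_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_pccr_name(name):
    """Split a Name value into (chrom, start, end, strand)."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a PCCR name: {name!r}")
    return m["chrom"], int(m["start"]), int(m["end"]), m["strand"]


def load_pccrs(handle):
    """Read a tab-delimited PCCR table into a list of dicts keyed by field name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def best_per_cluster(rows, max_rank):
    """Keep rows marked "Best" in their cluster with Rank <= max_rank, sorted by Rank."""
    keep = [r for r in rows
            if r["BestInCluster"] == "Best" and int(r["Rank"]) <= max_rank]
    return sorted(keep, key=lambda r: int(r["Rank"]))


def score_diff_matches(row, tol=1e-6):
    """Check ScoreDiff == ScorePerCodon - AntiScorePerCodon within a tolerance."""
    expected = float(row["ScorePerCodon"]) - float(row["AntiScorePerCodon"])
    return math.isclose(float(row["ScoreDiff"]), expected, abs_tol=tol)


# Invented rows for illustration only; real files have many more columns.
HEADER = ["Rank", "Name", "BestInCluster", "ScorePerCodon",
          "AntiScorePerCodon", "ScoreDiff"]
SAMPLE_ROWS = [
    ["1", "chr1:1000-1089+", "Best", "5.0", "-2.5", "7.5"],
    ["2", "chr1:1120-1149+", "", "3.0", "1.0", "2.0"],
    ["3", "2L:500-559-", "Best", "4.0", "-1.0", "5.0"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [HEADER] + SAMPLE_ROWS) + "\n"

rows = load_pccrs(io.StringIO(SAMPLE_TSV))
for row in best_per_cluster(rows, max_rank=1000):
    print(row["Rank"], parse_pccr_name(row["Name"]), score_diff_matches(row))
```

To use this on one of the spreadsheets, pass `load_pccrs` a file opened with `open(path, newline="")` instead of the in-memory sample. Because some spreadsheets have slightly different fields, check the header row before relying on a particular column name.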
[1] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research gr-246462. doi: 10.1101/gr.246462.118.
[2] Lin MF, Jungreis I, Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).
[3] Jungreis I, Lin MF, Chan CS, Kellis M. CodAlignView: The Codon Alignment Viewer.