Index of /compbio1/PhyloCSF_Candidate_Coding_Regions

Name  Last modified  Size
PCCRs.H_sapiens.hg38.GENCODE23.txt.gz  2019-09-05 15:42  28M
PCCRs.M_musculus.mm10.GENCODEm20.txt.gz  2019-12-06 10:26  26M
PCCRs.M_musculus.mm10.GENCODEm25.txt.gz  2020-06-07 22:40  24M
PCCRs.M_musculus.mm10.GENCODEm24.txt.gz  2020-01-30 20:05  23M
PCCRs.M_musculus.mm10.GENCODEm23.txt.gz  2019-12-06 10:26  22M
PCCRs.M_musculus.mm10.GENCODEm32.txt.gz  2023-03-03 14:26  22M
PCCRs.M_musculus.mm10.GENCODEm19.txt.gz  2019-03-22 22:14  21M
PCCRs.G_gallus.galGal6.Ensembl97.txt.gz  2019-09-13 15:12  19M
PCCRs.M_musculus.mm39.GENCODEm30.txt.gz  2022-08-06 06:11  18M
PCCRs.M_musculus.mm39.GENCODEm28.txt.gz  2021-12-10 19:27  18M
PCCRs.M_musculus.mm39.GENCODEm29.txt.gz  2022-04-30 05:00  18M
PCCRs.M_musculus.mm39.GENCODEm32.txt.gz  2023-03-27 23:23  18M
PCCRs.H_sapiens.hg38.GENCODE27.txt.gz  2019-03-22 22:14  17M
PCCRs.M_musculus.mm39.GENCODEm31.txt.gz  2022-10-28 18:10  17M
PCCRs.M_musculus.mm39.GENCODEm27.txt.gz  2021-05-07 20:12  17M
PCCRs.M_musculus.mm39.GENCODEm26.txt.gz  2021-03-04 12:28  17M
PCCRs.H_sapiens.hg38.GENCODE32.txt.gz  2019-12-05 11:20  17M
PCCRs.H_sapiens.hg38.GENCODE35.txt.gz  2020-08-26 11:11  17M
PCCRs.H_sapiens.hg38.GENCODE30.txt.gz  2019-04-09 17:27  17M
PCCRs.H_sapiens.hg38.GENCODE37.txt.gz  2021-02-17 06:15  16M
PCCRs.H_sapiens.hg38.GENCODE34.txt.gz  2020-06-08 10:13  16M
PCCRs.H_sapiens.hg38.GENCODE41.txt.gz  2022-08-06 03:42  16M
PCCRs.H_sapiens.hg38.GENCODE36.txt.gz  2021-02-16 15:08  16M
PCCRs.H_sapiens.hg38.GENCODE40.txt.gz  2022-04-30 02:22  16M
PCCRs.H_sapiens.hg38.GENCODE31.txt.gz  2019-07-09 23:13  16M
PCCRs.H_sapiens.hg38.GENCODE38.txt.gz  2021-05-07 14:16  16M
PCCRs.H_sapiens.hg38.GENCODE33.txt.gz  2020-01-30 12:51  16M
PCCRs.H_sapiens.hg38.GENCODE43.txt.gz  2023-02-22 17:07  16M
PCCRs.M_musculus.mm39.GENCODEm33.txt.gz  2023-07-28 14:47  16M
PCCRs.H_sapiens.hg38.GENCODE39.txt.gz  2021-12-10 10:13  16M
PCCRs.H_sapiens.hg38.GENCODE46.txt.gz  2024-05-28 14:49  15M
PCCRs.H_sapiens.hg38.GENCODE42.txt.gz  2022-10-29 10:10  15M
PCCRs.M_musculus.mm39.GENCODEm35.txt.gz  2024-05-28 17:46  15M
PCCRs.M_musculus.mm39.GENCODEm34.txt.gz  2024-01-28 03:28  15M
PCCRs.H_sapiens.hg38.GENCODE28.txt.gz  2019-03-22 22:14  15M
PCCRs.G_gallus.galGal4.Ensembl82.txt.gz  2019-09-05 15:36  15M
PCCRs.H_sapiens.hg38.GENCODE29.txt.gz  2019-03-22 22:14  15M
PCCRs.H_sapiens.hg38.GENCODE45.txt.gz  2024-01-26 13:50  14M
PCCRs.H_sapiens.hg38.GENCODE44.txt.gz  2023-07-28 12:12  14M
PCCRs.M_musculus.mm10.GENCODEm5.txt.gz  2019-09-05 15:45  11M
PCCRs.A_gambiae.AgamP4.Vectorbase4.8.txt.gz  2019-03-22 22:14  5.9M
PCCRs.D_melanogaster.dm6.Flybase6.15.txt.gz  2019-03-22 22:14  2.7M
PCCRs.C_elegans.ce11.WS259.txt.gz  2019-03-22 22:14  1.9M

PhyloCSF Candidate Coding Regions

The tab-delimited spreadsheets in this folder contain the lists of "PhyloCSF Candidate Coding Regions" (PCCRs) for various species and annotation sets, with additional information about each PCCR to facilitate investigating which of them constitute novel coding regions. PCCRs are genomic intervals that show evolutionary evidence of protein-coding potential but are not already annotated as protein-coding or pseudogenic, ranked according to how likely they are to be true protein-coding regions. Manual curation of the 1000 top-ranked PCCRs, plus a targeted search of a few hundred lower-ranked PCCRs, resulted in the addition of 144 protein-coding genes and hundreds of pseudogenes and coding exons in previously annotated genes to human GENCODE versions 24-28, more than half of which had not been previously discovered [1]. Analyses of PCCR lists in other species are likely to find many novel coding regions in those species as well.

PCCRs are created by using whole-genome multispecies alignments to compute PhyloCSF scores [2] for every codon in the whole genome in each of the six reading frames; using a Hidden Markov Model to find candidate coding intervals, "PhyloCSF Regions"; and excluding intervals that are already annotated as coding or pseudogenic, or antisense to such annotations, and very small intervals. They are then sorted using a Support Vector Machine (SVM) that uses as features the PhyloCSF score, the difference between PhyloCSF scores on the two strands, the length of the interval, and the phylogenetic branch length of the local alignment of the interval.
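The exclusion step described above can be illustrated with a deliberately simplified sketch. It is not the actual pipeline: the `Interval` class, the function names, and the `min_codons` threshold are all hypothetical (the minimum size used for real PCCRs is not stated here), and the sketch discards any candidate that touches an annotation, whereas the real procedure keeps the non-overlapping part of such a region as an "Extension" PCCR (see the RegType field below).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def overlaps(a, b):
    """True if the two intervals share at least one base, ignoring strand."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def filter_candidates(regions, annotated, min_codons):
    """Drop regions that are too short or that touch an annotation on either strand."""
    kept = []
    for region in regions:
        num_codons = (region.end - region.start + 1) // 3
        if num_codons < min_codons:
            continue  # hypothetical size cutoff for "very small intervals"
        # Strand is ignored in overlaps(), so antisense overlaps are excluded too.
        if any(overlaps(region, anno) for anno in annotated):
            continue
        kept.append(region)
    return kept
```

A linear scan over all annotations is used here only for clarity; for a whole genome, an interval tree or a sorted sweep would be the more practical choice.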

The following is a description of the spreadsheet fields. Some of the spreadsheets have slightly different fields, mostly for historical reasons.

Field: Description
Rank: SVM rank. Lower ranks are more likely to be true protein-coding regions.
ID: Unique integer identifier for each PCCR.
Name: PCCR name of the form AnnoChrom:Start-End+ or AnnoChrom:Start-End-, depending on strand.
RegType: "Extension" if PCCR is the non-overlapping part of a PhyloCSF Region that overlaps an annotation, otherwise "NoOverlap".
AnnoChrom: Chromosome as it appears in the annotation GFF/GTF file, which can differ from the name in the alignment MAF file (e.g., 2L versus chr2L).
Start: Nominally lowest 1-based chromosomal coordinate of nucleotides in interval.
End: Nominally highest 1-based chromosomal coordinate of nucleotides in interval.
Strand: "+" or "-".
ScorePerCodon: PhyloCSF score divided by number of codons in the interval. Scores were computed using PhyloCSF's "mle" option, except that the oldest files, GENCODE23 and GENCODEm5, used the "fixed" option.
PhyloCSFPsi: Length-adjusted PhyloCSF score (see [2]).
NumCodons: Length of interval, in codons.
Bls: Phylogenetic branch length of species present in the local alignment of the interval, divided by branch length of the whole species tree.
AntiScorePerCodon: PhyloCSF score of the interval in the antisense frame (frame on opposite strand that shares the 3rd codon position) divided by number of codons.
ScoreDiff: ScorePerCodon - AntiScorePerCodon.
OverlapTrs: List of protein-coding or pseudogene transcript annotations that overlap the interval in any of the 6 reading frames. For protein-coding transcripts, only the CDS portion is considered.
Parent: For "Extension" PCCRs, the PhyloCSF Region that was trimmed to form this one, otherwise "NA".
CodAlignView: Excel hyperlink to show the color-coded alignment of the interval in CodAlignView [3].
UCSCview: Excel hyperlink to show the interval in the UCSC genome browser.
SvmScore: SVM score used to sort PCCRs. PCCRs with higher scores are more likely to be true protein-coding regions.
SvmAntisenseScore: SVM score used to distinguish PhyloCSF Regions more likely to be novel coding regions on their strand than on the opposite strand.
TrypticAAs: Theoretical translation of the PCCR, trimmed to tryptic boundaries. Useful when validating PCCRs using mass spectrometry.
OutOfOrfCodons: The number of codons after or before a stop codon, whichever is less. A large value indicates the presence of a stop codon in the midst of an ancestral coding region, which typically means this is a pseudogene, though it could also be stop codon readthrough.
Cluster: Nearby PCCRs were grouped into clusters. This field is an index into the list of clusters.
ClusterStart: Lowest Start of any PCCR in this cluster.
ClusterEnd: Highest End of any PCCR in this cluster.
BestInCluster: "Best" for the PCCR with the lowest (best) rank in the cluster, otherwise blank.
ClusterBestRank: Lowest rank of any PCCR in the cluster.
ClusterRanks: Ranks of all PCCRs in the cluster.
ClusterUCSCview: Excel hyperlink to show the cluster in the UCSC genome browser.
RToverlap: "RToverlap" if PCCR overlaps a previously predicted stop codon readthrough region in the same frame, otherwise blank.
GC: GC content of the nucleotides in the PCCR.
CpGratio: Number of CpGs in the PCCR divided by number expected from the numbers of Gs and Cs.
CodingCollisionFrac_{AlignmentSet}: Coding Collision Fraction is the fraction of the alignment of the interval that is shared with the alignment of some annotated coding region. Nonzero values indicate paralogy. Values much larger than zero imply that the PhyloCSF signal could be due to the alignment of the annotated coding region rather than of this interval, which suggests this is a pseudogene. The AlignmentSet indicates which alignment has the collision. For example, a human region might not have a collision in the 58-mammals alignment, but have one in the 100-vertebrates alignment.
CodingCollisionTr_{AlignmentSet}: The annotated transcript whose alignment collides with that of the PCCR.
*Mle_{AlignmentSet}: Fields like this indicate scores etc. recomputed using PhyloCSF's "mle" option using a specified alignment set, which can be different from the alignment set used to find the PCCRs in the first place.
AllOverlapTrs: List of transcripts overlapping or adjacent to the PCCR. Only includes one transcript per type.
OverlapTypes: The type of overlap for each transcript in AllOverlapTrs. For each transcript, the reported type is the first one in the following list that it satisfies: Extension5p - 5' extension of a coding region in the same frame; Extension3p - 3' extension of a coding region in the same frame; ExtensionMid - extension of a coding region at end of an interior exon; CDS - overlaps CDS (in alternate frame); UTR5 - overlaps 5'-UTR; UTR3 - overlaps 3'-UTR; NC - overlaps exon of non-coding transcript; CDSintron - overlaps an intron within a CDS; UTR5intron - overlaps an intron within 5'-UTR; UTR3intron - overlaps an intron within 3'-UTR; NCintron - overlaps an intron of a non-coding transcript; AntiCDS, etc. - overlaps CDS, etc., on other strand (none for Extension types).
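
As a usage illustration, the following minimal Python sketch reads one of the gzipped spreadsheets and keeps the top-ranked representative of each cluster. It assumes that each file starts with a header row whose column names match the field names above and that Rank is stored as an integer; neither is guaranteed, since some spreadsheets have slightly different fields. The function names are hypothetical and not part of any PhyloCSF tool.

```python
import csv
import gzip
import re

# Matches the documented Name format: AnnoChrom:Start-End followed by the strand.
NAME_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_name(name):
    """Split a PCCR Name into (chrom, start, end, strand)."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized PCCR name: {name!r}")
    return m["chrom"], int(m["start"]), int(m["end"]), m["strand"]


def read_pccrs(path):
    """Yield one dict per PCCR from a gzipped tab-delimited file (header row assumed)."""
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE keeps quotes inside Excel hyperlink fields as literal text.
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def top_cluster_representatives(rows, max_rank):
    """Keep rows flagged "Best" in their cluster whose Rank is at most max_rank."""
    picked = []
    for row in rows:
        # .get() tolerates files that lack the BestInCluster column entirely.
        if row.get("BestInCluster") != "Best":
            continue
        if int(row["Rank"]) <= max_rank:
            picked.append(row)
    return sorted(picked, key=lambda r: int(r["Rank"]))
```

For example, `top_cluster_representatives(read_pccrs("PCCRs.H_sapiens.hg38.GENCODE46.txt.gz"), 1000)` would return the cluster representatives among the 1000 top-ranked PCCRs of a locally downloaded copy of that file, provided the assumptions above hold for it.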

References

[1] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research gr-246462. doi: 10.1101/gr.246462.118.

[2] Lin MF, Jungreis I, Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).

[3] Jungreis I, Lin MF, Chan CS, Kellis M. CodAlignView: The Codon Alignment Viewer.