Index of /compbio1/Novel_PhyloCSF_Regions

Name	Last modified	Size

PCCRs.A_gambiae.AgamP4.Vectorbase4.8.txt.gz	2019-03-22 22:14	5.9M
PCCRs.C_elegans.ce11.WS259.txt.gz	2019-03-22 22:14	1.9M
PCCRs.D_melanogaster.dm6.Flybase6.15.txt.gz	2019-03-22 22:14	2.7M
PCCRs.H_sapiens.hg38.GENCODE27.txt.gz	2019-03-22 22:14	17M
PCCRs.H_sapiens.hg38.GENCODE28.txt.gz	2019-03-22 22:14	15M
PCCRs.H_sapiens.hg38.GENCODE29.txt.gz	2019-03-22 22:14	15M
PCCRs.M_musculus.mm10.GENCODEm19.txt.gz	2019-03-22 22:14	21M
PCCRs.H_sapiens.hg38.GENCODE30.txt.gz	2019-04-09 17:27	17M
PCCRs.H_sapiens.hg38.GENCODE31.txt.gz	2019-07-09 23:13	16M
PCCRs.G_gallus.galGal4.Ensembl82.txt.gz	2019-09-05 15:36	15M
PCCRs.H_sapiens.hg38.GENCODE23.txt.gz	2019-09-05 15:42	28M
PCCRs.M_musculus.mm10.GENCODEm5.txt.gz	2019-09-05 15:45	11M
PCCRs.G_gallus.galGal6.Ensembl97.txt.gz	2019-09-13 15:12	19M
PCCRs.H_sapiens.hg38.GENCODE32.txt.gz	2019-12-05 11:20	17M
PCCRs.M_musculus.mm10.GENCODEm20.txt.gz	2019-12-06 10:26	26M
PCCRs.M_musculus.mm10.GENCODEm23.txt.gz	2019-12-06 10:26	22M
PCCRs.H_sapiens.hg38.GENCODE33.txt.gz	2020-01-30 12:51	16M
PCCRs.M_musculus.mm10.GENCODEm24.txt.gz	2020-01-30 20:05	23M
PCCRs.M_musculus.mm10.GENCODEm25.txt.gz	2020-06-07 22:40	24M
PCCRs.H_sapiens.hg38.GENCODE34.txt.gz	2020-06-08 10:13	16M
PCCRs.H_sapiens.hg38.GENCODE35.txt.gz	2020-08-26 11:11	17M
PCCRs.H_sapiens.hg38.GENCODE36.txt.gz	2021-02-16 15:08	16M
PCCRs.H_sapiens.hg38.GENCODE37.txt.gz	2021-02-17 06:15	16M
PCCRs.M_musculus.mm39.GENCODEm26.txt.gz	2021-03-04 12:28	17M
PCCRs.H_sapiens.hg38.GENCODE38.txt.gz	2021-05-07 14:16	16M
PCCRs.M_musculus.mm39.GENCODEm27.txt.gz	2021-05-07 20:12	17M
PCCRs.H_sapiens.hg38.GENCODE39.txt.gz	2021-12-10 10:13	16M
PCCRs.M_musculus.mm39.GENCODEm28.txt.gz	2021-12-10 19:27	18M
PCCRs.H_sapiens.hg38.GENCODE40.txt.gz	2022-04-30 02:22	16M
PCCRs.M_musculus.mm39.GENCODEm29.txt.gz	2022-04-30 05:00	18M
PCCRs.H_sapiens.hg38.GENCODE41.txt.gz	2022-08-06 03:42	16M
PCCRs.M_musculus.mm39.GENCODEm30.txt.gz	2022-08-06 06:11	18M
PCCRs.M_musculus.mm39.GENCODEm31.txt.gz	2022-10-28 18:10	17M
PCCRs.H_sapiens.hg38.GENCODE42.txt.gz	2022-10-29 10:10	15M
PCCRs.H_sapiens.hg38.GENCODE43.txt.gz	2023-02-22 17:07	16M
PCCRs.M_musculus.mm10.GENCODEm32.txt.gz	2023-03-03 14:26	22M
PCCRs.M_musculus.mm39.GENCODEm32.txt.gz	2023-03-27 23:23	18M
PCCRs.H_sapiens.hg38.GENCODE44.txt.gz	2023-07-28 12:12	14M
PCCRs.M_musculus.mm39.GENCODEm33.txt.gz	2023-07-28 14:47	16M
PCCRs.H_sapiens.hg38.GENCODE45.txt.gz	2024-01-26 13:50	14M
PCCRs.M_musculus.mm39.GENCODEm34.txt.gz	2024-01-28 03:28	15M
PCCRs.H_sapiens.hg38.GENCODE46.txt.gz	2024-05-28 14:49	15M
PCCRs.M_musculus.mm39.GENCODEm35.txt.gz	2024-05-28 17:46	15M
PCCRs.H_sapiens.hg38.GENCODE47.txt.gz	2025-02-10 16:24	16M
PCCRs.M_musculus.mm39.GENCODEm36.txt.gz	2025-02-10 16:45	16M

PhyloCSF Candidate Coding Regions

The tab-delimited spreadsheets in this folder list the "PhyloCSF Candidate Coding Regions" (PCCRs) for various species and annotation sets, together with additional information about each PCCR to help determine which of them are novel coding regions. PCCRs are genomic intervals that show evolutionary evidence of protein-coding potential but are not already annotated as protein-coding or pseudogenic. They are ranked according to how likely they are to be true protein-coding regions. Manual curation of the 1000 top-ranked PCCRs, plus a targeted search of a few hundred lower-ranked PCCRs, resulted in the addition of 144 protein-coding genes and hundreds of pseudogenes and coding exons in previously annotated genes to human GENCODE versions 24-28; more than half of these additions had not been previously discovered [1]. Analyses of PCCR lists in other species are likely to find many novel coding regions in those species as well.
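As a starting point for working with these files, the following minimal sketch reads one of the gzipped spreadsheets and keeps only the top-ranked rows. It rests on several assumptions that are not stated above: that file names follow the pattern PCCRs.<Species>.<Assembly>.<AnnotationSet>.txt.gz (inferred from the directory listing), that the first line of each file is a header row containing the field names, and that Rank holds integers. The function names parse_filename and read_top_pccrs are hypothetical helpers, not part of any distributed tool.

```python
import csv
import gzip
import re

# Assumed naming pattern, inferred from the directory listing:
# PCCRs.<Species>.<Assembly>.<AnnotationSet>.txt.gz
FILENAME_RE = re.compile(
    r"^PCCRs\.(?P<species>[^.]+)\.(?P<assembly>[^.]+)\.(?P<annotation>.+)\.txt\.gz$"
)


def parse_filename(filename):
    """Split a PCCR file name into species, assembly, and annotation set."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.groupdict()


def read_top_pccrs(path, max_rank=1000):
    """Yield rows (as dicts keyed by header name) whose Rank is <= max_rank.

    Assumes the first line is a header and that Rank is an integer column.
    """
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE keeps the double quotes inside Excel hyperlink fields literal.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if int(row["Rank"]) <= max_rank:
                yield row
```

For example, parse_filename("PCCRs.D_melanogaster.dm6.Flybase6.15.txt.gz") separates the species, the assembly (dm6), and the annotation set (Flybase6.15). Because some spreadsheets have slightly different fields, it is worth checking that a column exists before relying on it.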

PCCRs are created in three steps. First, whole-genome multispecies alignments are used to compute PhyloCSF scores [2] for every codon in the genome in each of the six reading frames. Second, a Hidden Markov Model is used to find candidate coding intervals, called "PhyloCSF Regions". Third, intervals that are already annotated as coding or pseudogenic, or that are antisense to such annotations, are excluded, as are very small intervals. The remaining intervals are then sorted using a Support Vector Machine (SVM) whose features are the PhyloCSF score, the difference between the PhyloCSF scores on the two strands, the length of the interval, and the phylogenetic branch length of the local alignment of the interval.
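The exclusion step can be illustrated with a simplified sketch that subtracts annotated intervals from a candidate region and discards pieces that are too short. This is not the actual pipeline: the function subtract_annotations and the min_len parameter are hypothetical, reading frames and codon boundaries are ignored, and the real size threshold is not given here. Strand is ignored on purpose, since overlaps antisense to an annotation are excluded as well.

```python
def subtract_annotations(region, annotations, min_len=1):
    """Return the parts of `region` not covered by any annotated interval.

    region: (start, end), 1-based inclusive coordinates.
    annotations: iterable of (start, end), 1-based inclusive, same chromosome.
    min_len: hypothetical minimum length (in nucleotides) for a piece to be kept.
    """
    start, end = region
    pieces = []
    cursor = start  # first position not yet known to be covered or emitted
    for a_start, a_end in sorted(annotations):
        if a_end < cursor:
            continue  # annotation lies entirely before the uncovered part
        if a_start > end:
            break  # annotation (and all later ones) lies after the region
        if a_start > cursor:
            pieces.append((cursor, a_start - 1))
        cursor = max(cursor, a_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    # Drop very small leftover intervals.
    return [(s, e) for s, e in pieces if e - s + 1 >= min_len]
```

A region with no overlapping annotation comes back unchanged, which loosely corresponds to the "NoOverlap" case in the field descriptions, while a region trimmed by an annotation loosely corresponds to the "Extension" case.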

The following is a description of the spreadsheet fields. Some of the spreadsheets have slightly different fields, mostly for historical reasons.

Field	Description
Rank	SVM rank. PCCRs with lower ranks are more likely to be true protein-coding regions.
ID	Unique integer identifier for each PCCR.
Name	PCCR name of the form: AnnoChrom:Start-End+ or AnnoChrom:Start-End- depending on strand.
RegType	"Extension" if PCCR is the non-overlapping part of a PhyloCSF Region that overlaps an annotation, otherwise "NoOverlap".
AnnoChrom	Chromosome as it appears in the annotation GFF/GTF file, which can differ from the name in the alignment MAF file (e.g., 2L versus chr2L).
Start	Nominally lowest 1-based chromosomal coordinate of nucleotides in interval.
End	Nominally highest 1-based chromosomal coordinate of nucleotides in interval.
Strand	"+" or "-".
ScorePerCodon	PhyloCSF score divided by number of codons in the interval. Scores were computed using PhyloCSF's "mle" option, except that the oldest files, GENCODE23 and GENCODEm5, used the "fixed" option.
PhyloCSFPsi	Length-adjusted PhyloCSF score (see [2]).
NumCodons	Length of interval, in codons.
Bls	Phylogenetic branch length of species present in the local alignment of the interval, divided by branch length of the whole species tree.
AntiScorePerCodon	PhyloCSF score of the interval in the antisense frame (frame on opposite strand that shares the 3rd codon position) divided by number of codons.
ScoreDiff	ScorePerCodon - AntiScorePerCodon.
OverlapTrs	List of protein-coding or pseudogene transcript annotations that overlap the interval in any of the 6 reading frames. For protein-coding, only includes the CDS portion.
Parent	For "Extension" PCCRs, the PhyloCSF Region that was trimmed to form this one, otherwise "NA".
CodAlignView	Excel hyperlink to show the color-coded alignment of the interval in CodAlignView [3].
UCSCview	Excel hyperlink to show the interval in the UCSC genome browser.
SvmScore	SVM score used to sort PCCRs. PCCRs with higher scores are more likely to be true protein-coding regions.
SvmAntisenseScore	SVM score used to distinguish PhyloCSF Regions more likely to be novel coding regions on their strand than on the opposite strand.
TrypticAAs	Theoretical translation of the PCCR, trimmed to tryptic boundaries. Useful when validating PCCRs using mass spectrometry.
OutOfOrfCodons	The number of codons after or before a stop codon, whichever is less. If this number is large it indicates the presence of a stop codon in the midst of an ancestral coding region, which typically means this is a pseudogene, though it could also be stop codon readthrough.
Cluster	Nearby PCCRs were grouped into clusters. This field is an index into the list of clusters.
ClusterStart	Lowest Start of any PCCR in this cluster.
ClusterEnd	Highest End of any PCCR in this cluster.
BestInCluster	"Best" for the PCCR with the lowest Rank in the cluster (that is, the one most likely to be a true protein-coding region), otherwise blank.
ClusterBestRank	Lowest rank of any PCCR in the cluster.
ClusterRanks	Ranks of all PCCRs in the cluster.
ClusterUCSCview	Excel hyperlink to show the cluster in the UCSC genome browser.
RToverlap	"RToverlap" if PCCR overlaps a previously-predicted stop codon readthrough region, in the same frame, otherwise blank.
GC	GC content of the nucleotides in the PCCR.
CpGratio	Number of CpGs in the PCCR divided by number expected from the numbers of Gs and Cs.
CodingCollisionFrac_{AlignmentSet}	Coding Collision Fraction is the fraction of the alignment of the interval that is shared with the alignment of some annotated coding region. Nonzero values indicate paralogy. Values much larger than zero imply that the PhyloCSF signal could be due to the alignment of the annotated coding region rather than of this interval, which suggests this is a pseudogene. The AlignmentSet indicates which alignment has the collision. For example, a human region might not have a collision in the 58-mammals alignment, but have one in the 100-vertebrates alignment.
CodingCollisionTr_{AlignmentSet}	The annotated transcript whose alignment collides with that of the PCCR.
*Mle_{AlignmentSet}	Fields like this indicate scores etc. recomputed using PhyloCSF's "mle" option using a specified alignment set, which can be different from the alignment set used to find the PCCRs in the first place.
AllOverlapTrs	List of transcripts overlapping or adjacent to the PCCR. Only includes one transcript per type.
OverlapTypes	The type of overlap for each transcript in AllOverlapTrs. For each transcript, the reported type is the first one in the following list that it satisfies: Extension5p - 5' extension of a coding region in the same frame; Extension3p - 3' extension of a coding region in the same frame; ExtensionMid - extension of a coding region at end of an interior exon; CDS - overlaps CDS (in alternate frame); UTR5 - overlaps 5'-UTR; UTR3 - overlaps 3'-UTR; NC - overlaps exon of non-coding transcript; CDSintron - overlaps an intron within a CDS; UTR5intron - overlaps an intron within 5'-UTR; UTR3intron - overlaps an intron within 3'-UTR; NCintron - overlaps an intron of a non-coding transcript; AntiCDS, etc. - overlaps CDS, etc., on other strand (none for Extension types).
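A few of the fields above can be reproduced or converted with short helpers, sketched here under stated assumptions. parse_name follows the Name format given above. to_bed assumes that Start and End are 1-based and inclusive and converts them to the 0-based, half-open convention of BED files. cpg_ratio assumes the commonly used observed/expected formula, in which the expected CpG count is (number of Cs x number of Gs) / sequence length; the exact formula used to fill the CpGratio field is not specified here. All three function names are hypothetical.

```python
import re

# Name format from the field table: AnnoChrom:Start-End followed by + or -.
NAME_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_name(name):
    """Split a PCCR Name into (chrom, start, end, strand)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected PCCR name: {name!r}")
    return match["chrom"], int(match["start"]), int(match["end"]), match["strand"]


def to_bed(chrom, start, end, strand, name="."):
    """Convert 1-based inclusive coordinates to a 6-column BED tuple.

    BED uses 0-based, half-open intervals, so only the start shifts by one.
    """
    return (chrom, start - 1, end, name, 0, strand)


def cpg_ratio(seq):
    """Observed/expected CpG ratio (assumed formula, see the note above).

    Returns None when the expected count is zero (no Cs, no Gs, or empty input).
    """
    seq = seq.upper()
    expected = seq.count("C") * seq.count("G") / len(seq) if seq else 0.0
    if expected == 0:
        return None
    return seq.count("CG") / expected
```

Note that AnnoChrom can differ from the chromosome name used in the alignment (for example 2L versus chr2L), so a BED file built this way may need its chromosome names adjusted before it is loaded alongside other tracks.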

References

[1] Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright J, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary J, Frankish A, Kellis M (2019). Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Research gr-246462. doi: 10.1101/gr.246462.118.

[2] Lin MF, Jungreis I, Kellis M (2011). PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions. Bioinformatics 27:i275-i282 (ISMB/ECCB 2011).

[3] Jungreis I, Lin MF, Chan CS, Kellis M. CodAlignView: The Codon Alignment Viewer.