Gene Finding Methods


Overview of RNA-seq Based Annotation and Gene Finding

This document provides a general description of the RNAseq-based automated genome annotation for eukaryotic genomes. Gene annotation at Broad Institute is a multi-step process, as described below.

Evidence Collection

  • RNAseq data generation and processing:

    We generate strand-specific, paired-end RNAseq Illumina reads from several different growth conditions. We then use Blat [Kent, 2002] to align the short reads to the target genome in strand-specific manner to partition the reads by genomic loci. For each genomic locus, we use Inchworm package [Haas 2011] to assemble the reads into EST-like transcripts. These transcripts are then used as one of the input for the EVM gene finder [Haas, 2008] and the PASA annotation pipeline at the Broad Institute [Haas et. al, Nucleic Acids Res, 2003, 31, 5654-5666]. The RNAseq/PASA pipeline generates Inchworm alignment and assembly and high-quality PASA reference ORFs, which are then used for gene prediction.

  • Blast evidence:

    We run tblastn sequence homology search against the UniRef90 protein database. Top 25 tblastn hits with e-values better than 1e-10 are collected.

  • PFAM/TIGRFAM domains:

    We run transcript Hmmer searches using PFAM/TIGRFAM library to find PFAM/TIGRFAM domains on six-frame translations of transcripts.

Computational Prediction of Non-Coding RNA Features

Ribosomal RNAs (rRNAs) are identified with RNAmmer (Lagesen et al., Nucleic Acids Res., 2007). The tRNA features are identified using tRNAScan (Lowe&Eddy, Nucleic Acids Res, 1997, 25, 955-964). Other common RNA features are identified with RFAM (Griffiths-Jones et al. Nucleic Acids Res. 2005, 33, D121-D124).

Prediction of Protein-Coding Genes

We first run Genemark-ES on the genome assembly to generate GeneMark gene models and then use RNAseq/PASA data to select a set of high-confidence GeneMark gene models as the training set for other ab initio gene predictors, including Augustus [Stanke, 2003], GeneId [Guigo, 1998], GlimmerHMM [Majoros, 2004] and SNAP [Korf, 2004]. We also use tblastn to search the genome assembly against the UniRef90 protein dataset and use the blast hits to construct GeneWise [Birney, 2004] gene models. Our initial gene set is generated by running EVM with ab initio gene models (from Augustus, Genemark-ES, GeneId, GlimmerHMM, SNAP), Genewise, PASA assembly, and PASA reference ORFs as input. We then use PASA to update the initial EVM gene set to adjust splice junctions. We check the updated gene set for symptomatic issues, overlap with tRNA and rRNA genes and missing genes with respect to PASA reference ORFs, and make manual adjustment when needed. We then filter the gene set to remove spurious gene models from repeat and low-complexity sequences. This includes genes with repeat-like names, genes with > 40% CDS overlapping TransposonPSI hits [Haas, URL], genes with overlap to RepeatScout [Price, 2005 ] or known repeat PFAM/TIGRFAM hits and with no non-repeat PFAM/TIGRFAM domains, genes overlapping blast hits to an in-house database of known repeat proteins (but not non-repeat PFAM hits), short hypothetical genes (<300bp) whose top two most abundant residues account for >50% of all the amino acid residues in the protein, and genes with CDS alignment to >= 10 different genomic loci at >= 90% identity. After repeat-filtering, we use PASA pipeline to finalize the set by updating the current gene models to adjust splice junctions, add UTRs, and include alternatively spliced forms.

Assigning Gene Product Names

We first assign gene product names via blast hits to SwissProt database with manually curated gene product names, using a set of strigent criteria (>=70% protein sequence identity, >=70% coverage of both the query and the database hit sequence, and length difference <=10%). For the remaining genes with unassigned product names, we then use HMMER equivalogs from TIGRFAM and PFAM hits to assign the name of the HMMER hits, if the hit score is above the trusted cutoff value. The rest of the genes are assigned the name ?hypothetical protein? according to the current GenBank guidelines.

Gene Numbering

Every annotated gene is given a Locus Number of the form xxxG_#####, which is guaranteed to identify a gene uniquely, both at the Broad Institute web site and in GenBank.