User’s Guide for

 CodAlignView: a tool for exploring signatures of protein-coding evolution in an alignment

https://data.broadinstitute.org/compbio1/cav.php


Introduction

 

CodAlignView displays genomic alignments with codonization and color coding to help distinguish conserved protein-coding regions from non-coding regions. The nucleotide alignments are divided into codons in a reference species and substitutions are colored based on whether the substitution is synonymous or non-synonymous, with the latter divided into conservative and radical amino acid substitutions. Colors also indicate frame shifts, insertions or deletions, and stop codons.  Generally, substitutions in coding regions will mostly be green, indicating a prevalence of synonymous and conservative substitutions, whereas non-coding regions have more red and orange, indicating radical substitutions and frame shifts, as well as colors indicating in-frame stop codons.

 

Splice-site predictions may be displayed to help detect cryptic splice sites at the ends of a novel coding exon.

 

If you use CodAlignView, please cite "CodAlignView: a tool for exploring signatures of protein-coding evolution in an alignment", I Jungreis, M Lin, M Kellis, in preparation. 

 

Contact iljungr@csail.mit.edu.

Alignment Display and Coloring

 

The following legend shows the color scheme.

 

GAC No Change
GAT Synonymous
GAA Conservative
GGG Radical
TAA Ochre Stop Codon
TAG Amber Stop Codon
TGA Opal Stop Codon
ATG In-frame ATG
GA- Indel
GAC Frame-shifted
<6 Splice Prediction
() Intron
... No alignment
XXX Inferred bases

 

Notes:

      An amino acid substitution is considered conservative if it has a BLOSUM62 score greater than 0.

      A codon is colored with a stop codon or ATG color if it contains all or part of a stop codon or ATG in the predicted reading frame of that species, which might be different from the frame of the reference species. If a stop codon or ATG spans two codon columns, both will be colored. If a codon contains both a stop codon and an ATG, it gets the stop codon color.

      A codon or partial codon is colored with the indel color if it has a different number of bases from the reference species and does not contain an in-frame stop codon or ATG.

      A codon is considered frame-shifted if it has exactly three bases and its first base is out of frame.

      Periods in a sequence indicate regions for which there is no alignment to that species, whereas dashes indicate deletions relative to the reference species.

      Xs indicate unknown bases inferred from coordinates.

      Vertical bars ("|") indicate "jumps". See the definition below under "Hide Jumps".
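
The substitution classes in the legend can be illustrated with a minimal Python sketch. It is not CodAlignView's implementation: it only handles complete, gap-free codons compared in the same frame, the function name classify_substitution is hypothetical, and the BLOSUM62 excerpt contains just the two amino acid pairs needed for the legend's example codons (a real classifier would need the full matrix).

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Tiny excerpt of BLOSUM62, for illustration only (assumption: full matrix omitted).
BLOSUM62_EXCERPT = {frozenset("DE"): 2, frozenset("DG"): -1}


def classify_substitution(ref_codon, other_codon):
    """Classify a complete, gap-free codon relative to the reference codon."""
    if other_codon == ref_codon:
        return "No Change"
    ref_aa, other_aa = CODON_TABLE[ref_codon], CODON_TABLE[other_codon]
    if other_aa == "*":
        return "Stop Codon"
    if other_aa == ref_aa:
        return "Synonymous"
    # Conservative means a BLOSUM62 score greater than 0, as stated in the notes.
    score = BLOSUM62_EXCERPT[frozenset((ref_aa, other_aa))]
    return "Conservative" if score > 0 else "Radical"


for codon in ("GAC", "GAT", "GAA", "GGG", "TAA"):
    print(codon, classify_substitution("GAC", codon))
```

With GAC as the reference codon, this reproduces the first rows of the legend: GAT is synonymous, GAA is conservative, and GGG is radical.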

URL arguments

 

CodAlignView may be accessed at https://data.broadinstitute.org/compbio1/cav.php. Parameters may be specified through arguments in the URL string or by using a Graphical User Interface (GUI). The former allows a convenient workflow for viewing many regions by programmatically including hyperlinks in a spreadsheet, while the latter provides a more convenient way to edit parameters once the initial region has been specified.

 

To control CodAlignView using the URL string, append a question mark after cav.php followed by an ampersand-separated list of parameters.

Example:
https://data.broadinstitute.org/compbio1/cav.php?alnset=mm9&intervals=chr8:27251550-27251561&prologue=90&hideInserts
or
https://data.broadinstitute.org/compbio1/cav.php?a=mm9&i=chr8:27251550-27251561&p=90&h

 

Most parameters require a value after an equal sign. Parameters are not case sensitive. Parameter names in both long and abbreviated form are listed below in square brackets after the corresponding GUI control.
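
When generating many links programmatically, the URL can be assembled as in the following Python sketch. The helper name build_cav_url is hypothetical. The sketch leaves the punctuation used in region strings unescaped, as in the example above; whether the server requires any further escaping is not stated in this guide and is an assumption here.

```python
from urllib.parse import quote

BASE_URL = "https://data.broadinstitute.org/compbio1/cav.php"


def build_cav_url(params, flags=()):
    """Join name=value parameters and valueless flags into a CodAlignView URL."""
    # Keep the punctuation used by region strings readable; escape anything else.
    parts = [f"{name}={quote(str(value), safe=':;,+-')}" for name, value in params.items()]
    parts.extend(flags)  # flags such as hideInserts take no value
    return BASE_URL + "?" + "&".join(parts)


url = build_cav_url(
    {"alnset": "mm9", "intervals": "chr8:27251550-27251561", "prologue": 90},
    flags=["hideInserts"],
)
print(url)
```

The printed URL matches the first example above. Calling the helper once per row of a table of regions gives one link per region.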

Graphical User Interface

 

Header

 

+: Expand the control palette and show the legend.

 

-: Collapse the control palette and hide the legend.

 

Region

 

Intervals: [&intervals=REGION_STRING,&i=REGION_STRING] Specify the coordinates of one or more genomic intervals in the reference species representing the region to be displayed, typically exons or portions of exons in a transcript. This parameter is mandatory. The region string can be in either of two formats:

The UCSC-like format consists of one or more intervals separated by a plus sign: chrom:start1-end1+chrom:start2-end2+... with start1 ≤ end1 < start2 ≤ end2... (even if the region is on the minus strand). All segments must be on the same chromosome. Coordinates follow the convention of GFF/GTF files and the UCSC genome browser, namely, intervals go from the first base of the region to the last base, inclusive, and coordinates are 1-based, i.e., the 1st base of the chromosome is position 1. Coordinates may include commas, which are ignored.

Example: chrX:87343-87372+chrX:87,390-87,395

The BED-like format consists of 4 fields separated by semicolons: chrom;chromStart;blockSizes;blockStarts where each field is as defined in the BED format and coordinates are 0-based.

Example: chrX;87342;30,6;0,47
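
The two examples above describe the same region. The following Python sketch shows the coordinate arithmetic that relates the two formats; the function name ucsc_to_bed_like is hypothetical and the code is an illustration, not part of CodAlignView.

```python
def ucsc_to_bed_like(region):
    """Convert 'chrom:start-end+chrom:start-end...' to 'chrom;chromStart;blockSizes;blockStarts'."""
    intervals = []
    for part in region.split("+"):
        chrom, coords = part.split(":")
        # Commas inside coordinates are ignored.
        start, end = (int(x) for x in coords.replace(",", "").split("-"))
        intervals.append((chrom, start, end))
    chroms = {chrom for chrom, _, _ in intervals}
    if len(chroms) != 1:
        raise ValueError("all intervals must be on the same chromosome")
    chrom_start = intervals[0][1] - 1  # 1-based inclusive start -> 0-based start
    sizes = [end - start + 1 for _, start, end in intervals]
    starts = [start - 1 - chrom_start for _, start, _ in intervals]
    return ";".join([chroms.pop(), str(chrom_start),
                     ",".join(map(str, sizes)), ",".join(map(str, starts))])


print(ucsc_to_bed_like("chrX:87343-87372+chrX:87,390-87,395"))
```

This prints chrX;87342;30,6;0,47, the BED-like example given above.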

 

+/-: [&strand=STRAND,&s=STRAND] Strand in the reference species.

 

Trim Interval: [&trimInterval=TRIM_INTERVAL,&ti=TRIM_INTERVAL] Trim the region defined by Intervals to the specified subset. This allows defining the exon structure of a whole transcript in Intervals and then easily looking at different portions of the transcript while maintaining the exon structure. The interval is of the form LOW-HIGH where LOW and HIGH are chromosome coordinates with LOW <= HIGH (even on the minus strand). Either one may be empty, in which case the interval will extend to that end of the region defined by Intervals.
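
The effect of trimming can be sketched in Python as follows. This is an illustration of the behavior described above, not CodAlignView's code; trim_intervals is a hypothetical name, and intervals are assumed to be sorted (start, end) pairs with 1-based inclusive coordinates.

```python
def trim_intervals(intervals, trim):
    """Clip sorted (start, end) intervals (1-based, inclusive) to 'LOW-HIGH'."""
    low_text, high_text = trim.split("-")
    # An empty LOW or HIGH extends to that end of the region.
    low = int(low_text) if low_text else intervals[0][0]
    high = int(high_text) if high_text else intervals[-1][1]
    if low > high:
        raise ValueError("LOW must not exceed HIGH")
    trimmed = []
    for start, end in intervals:
        new_start, new_end = max(start, low), min(end, high)
        if new_start <= new_end:  # drop intervals that fall entirely outside
            trimmed.append((new_start, new_end))
    return trimmed


exons = [(87343, 87372), (87390, 87395)]
print(trim_intervals(exons, "87360-87392"))
print(trim_intervals(exons, "-87350"))
```

The exon structure is kept: only the portions of each interval inside LOW-HIGH remain.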

 

Trim Transcript Interval: [&trimTrInterval=TRIM_TR_INTERVAL,&v=TRIM_TR_INTERVAL] Like Trim Interval except the interval is of the form START-END, with START <= END. START and END are relative positions within the region string with 1 being the first base of the region relative to its strand.
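
The relationship between transcript-relative positions and chromosome coordinates can be sketched in Python as below. The function name transcript_to_genomic is hypothetical, and the code assumes intervals are given as sorted (start, end) pairs with 1-based inclusive coordinates.

```python
def transcript_to_genomic(intervals, strand, position):
    """Map a 1-based position within the region to a chromosome coordinate."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    # Walk the intervals in transcription order: ascending on '+', descending on '-'.
    ordered = intervals if strand == "+" else list(reversed(intervals))
    remaining = position
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1 if strand == "+" else end - remaining + 1
        remaining -= length
    raise ValueError("position lies outside the region")


exons = [(87343, 87372), (87390, 87395)]
print(transcript_to_genomic(exons, "+", 31))  # first base of the second interval
print(transcript_to_genomic(exons, "-", 1))   # last chromosome coordinate of the region
```

On the minus strand, position 1 is the highest chromosome coordinate of the region, since positions are counted relative to the strand.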

 

Prologue: [&prologue=NUM_BASES,&p=NUM_BASES] Show specified number of bases before the start of the region in gray italics. Prologue will be excluded when determining the coding frame.

 

Epilogue: [&epilogue=NUM_BASES,&e=NUM_BASES] Show specified number of bases after the end of the region in gray italics. Epilogue will be excluded when determining the coding frame.

 

Max Codons: [&maxCodons=NUM_CODONS,&m=NUM_CODONS] Limit the length of the region, to avoid accidentally requesting a huge region that wastes resources.

 

Species

 

Alignment Set: [&alnset=ALNSET,&a=ALNSET] Specify which alignments to use. The dropdown shows a list of tags, each specifying one of the available multispecies alignments.

The tags generally begin with the name of the reference assembly, usually followed by the number of species in the original alignment, sometimes followed by a word specifying a subset of those species. For example, hg19_46_primate is the primate subset of the 46-way vertebrate alignment using the hg19 human genome assembly.

Sometimes an alignment available in CodAlignView is not shown in the list, for example, if it is awaiting publication. In this case, select Other from the dropdown list and type the alignment set tag in the text box; alternatively, set alnset in the URL string.

The Alignment Sets button provides more details on each alignment.

 

Ancestor: [&ancestor=OPTION,&n=OPTION] Strategy to infer ancestral sequence. Codon coloring is relative to the ancestral sequence if present.

Options:

      None: No ancestral sequence is shown. Codon coloring is relative to the reference species.

      Infer/InferWithIndels find the most likely base at each position using an evolutionary model.

      Plurality/PluralityWithIndels take the most frequent base in each column.

The Infer and Plurality methods force the ancestor to have gaps exactly where the reference species does. The WithIndels strategies do not make this restriction; while that is theoretically more satisfying, imperfections in the ancestor inference method can lead to frame-shift artifacts. The Infer options use a species tree, which is not available for some alignments.

Ancestor inference is the slowest step in computing the alignment. For long regions, consider using ancestor strategy None or Plurality instead of Infer.
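
The idea behind the Plurality strategy can be sketched in a few lines of Python. This is a simplified illustration rather than CodAlignView's implementation: plurality_ancestor is a hypothetical name, the first row is assumed to be the reference species, and ties are resolved in favor of the base seen first in the column, which is an assumption of this sketch.

```python
from collections import Counter


def plurality_ancestor(rows):
    """rows[0] is the reference; all rows are equal-length aligned strings."""
    ancestor = []
    for column in zip(*rows):
        if column[0] == "-":
            ancestor.append("-")  # gap exactly where the reference has a gap
            continue
        # Ignore deletions ("-") and unaligned positions (".") when counting.
        counts = Counter(base for base in column if base not in "-.")
        ancestor.append(counts.most_common(1)[0][0])
    return "".join(ancestor)


alignment = ["ATG-AC", "ACGTAT", "ACG-AC"]
print(plurality_ancestor(alignment))
```

Here the second column is C in two of three species, so the inferred ancestor differs from the reference at that position, and the column where the reference has a gap stays a gap.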

 

Species: [&speciesOpt=OPTION,&so=OPTION and &speciesList=LIST,&sl=LIST] Restrict the set of species displayed. Where used, LIST is a comma-separated list of species, named as displayed in the alignment, e.g., Mouse,Rat.

Options:

      Aligned: show all aligned species.

      All: show all species in the alignment set, even ones with no aligned bases in the region.

      Only: show only the species listed in LIST.

      Exclude: show only the species that are not in LIST.

 

Species Limit: [&speciesLimit=NUM_SPECIES,&sm=NUM_SPECIES] Limit the number of species displayed to NUM_SPECIES by removing species (other than the first) whose sequences are similar to that of another (kept) species.

 

One Line Per Species: [&oneSeqPerSpec,&os] If the alignment includes more than one sequence for a species, show only one of them, chosen heuristically (for example, the one most similar to the reference sequence).

 

Frame

 

Codon Position: [&codonPos=POS,&cp=POS] Codonize the alignment assuming that the first base of the region (not including prologue) in reference species is in specified codon position (0, 1, or 2).
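
As a rough illustration of codonization, the Python sketch below splits a gap-free reference sequence into codons given the codon position of its first base. The function name codonize is hypothetical, and the sketch assumes that position 0 denotes the first base of a codon.

```python
def codonize(sequence, codon_pos=0):
    """Split a gap-free sequence into codons, given the codon position of its first base."""
    if codon_pos not in (0, 1, 2):
        raise ValueError("codon position must be 0, 1, or 2")
    first_len = (3 - codon_pos) % 3  # bases needed to finish a partial leading codon
    chunks = [sequence[:first_len]] if first_len else []
    chunks += [sequence[i:i + 3] for i in range(first_len, len(sequence), 3)]
    return [chunk for chunk in chunks if chunk]


print(codonize("ATGGACTAA", 0))
print(codonize("TGGACTAA", 1))
print(codonize("GGACTAA", 2))
```

In all three calls the complete codons GAC and TAA line up; only the leading partial codon changes with the codon position.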

 

Justify Frame: [&justifyFrameMethod=OPTION,&j=OPTION] Use the specified method to determine the codon positions in species other than the reference by justifying reading frames.

Options:

      none: treat each codon by itself, ignoring earlier frame shifts.

      start: line up all species at the start of the region (not including prologue).

      end: line up all species at the end of the region (not including epilogue).

      best: use the frame that best matches reference species.

      exonBest: use the frame for each exon that best matches the reference species within that exon.

 

Species Codon Position: [&speciesCodonPos=STRING,&scp=STRING] Override the codon position determined by Justify Frame for individual species. Ignored if Justify Frame is none. The format is speciesName:FRAME,speciesName:FRAME... where each FRAME is 0, 1, or 2.

For example, Mouse:0,Rat:2.
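
When generating this parameter from a script, it can help to validate the string first. The following Python sketch parses the format described above; parse_species_codon_pos is a hypothetical helper, not part of CodAlignView.

```python
def parse_species_codon_pos(text):
    """Parse 'speciesName:FRAME,speciesName:FRAME...' into a dict of frames."""
    frames = {}
    for item in text.split(","):
        name, frame = item.split(":")
        if frame not in ("0", "1", "2"):
            raise ValueError(f"frame for {name} must be 0, 1, or 2, got {frame!r}")
        frames[name] = int(frame)
    return frames


print(parse_species_codon_pos("Mouse:0,Rat:2"))
```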

 

Display

 

Wrap: [&wrap=NUM_CODONS,&w=NUM_CODONS] Wrap the alignment onto a new block of lines after specified number of codons. If empty or 0, wrapping is turned off.

 

Footer: [&footer=STR,&f=STR] Show specified string at the bottom of the page.

 

Title: [&title=STR,&tl=STR] Show specified string in window title. Default is "CodAlignView" plus footer, if present.

 

First AA Num: [&firstAAnum=AA_NUMBER,&fa=AA_NUMBER] Number amino acids starting with this number. Amino acids whose number is negative or zero will not be displayed.

 

Max AA Num: [&maxAAnum=AA_NUMBER,&ma=AA_NUMBER] Don't display amino acids with numbers higher than this (only meaningful if First AA Num is specified).

 

Splice Site Predictions: [&spliceSites=NAME,&c=NAME] Show splice site predictions based on a model trained on splice sites in the specified species (Yeo G, Burge CB.  J Comput Biol 11:377-394 2004). Predicted splice sites are shown as <S for splice donors and S> for splice acceptors, under the first two or last two bases of the implied intron. S is a single hexadecimal digit score indicating level of confidence. Although the < and > suggest bracketing a hypothetical intron, they are computed independently with no assurance that they pair up.

The Splice Models button lists available training species.

Notes:

      Only canonical GT/AG splice sites are predicted.

      Splice site predictions do not use comparative information, and only use the sequence of the reference species.

      No predictions are made for splice sites near the ends of the sequence displayed. In particular, no predictions are made for donor sites within 3 bases of the start or 4 bases of the end, or for acceptor sites within 18 bases of the start or 3 bases of the end.

 

Skip Codons: [&skipCodons=RANGES,&sco=RANGES] Don't display codons whose indices are in specified ranges, and put three dots instead. Ranges are specified as a comma-separated list of intervals, each of which is of the form START-END, where START and END are codon indices. If First AA Num is specified, codon indices refer to amino acid numbers, otherwise they are indices with the first complete codon after the prologue having index 1, so codons in the prologue have negative or zero indices.

Example: 5-10,-10--5
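
Because indices can be negative, the hyphen serves both as a minus sign and as the range separator. The Python sketch below shows one way to read such a string unambiguously; parse_codon_ranges is a hypothetical helper, not part of CodAlignView.

```python
import re

# Each range is START-END, where either number may carry a leading minus sign.
RANGE_PATTERN = re.compile(r"(-?\d+)-(-?\d+)")


def parse_codon_ranges(text):
    """Parse a comma-separated list of START-END ranges into (start, end) tuples."""
    ranges = []
    for item in text.split(","):
        match = RANGE_PATTERN.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"not a START-END range: {item!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


print(parse_codon_ranges("5-10,-10--5"))
```

For the example above, this yields the ranges 5 to 10 and -10 to -5.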

 

Hide Inserts: [&hideInserts,&h] Remove columns containing insertions relative to the reference species. This makes the display cleaner, but use with caution because it can hide relevant information.

 

Hide Jumps: [&hideJumps,&u] Neighboring bases in the reference assembly are sometimes aligned to non-neighboring bases in another species (a "jump"), without a corresponding reference assembly gap in the multiple alignment file. This could hide a frameshift. If we can infer from the coordinates in the multiple alignment file that the two bases are on the same chromosome and strand, in the correct order, and no more than 50 nt apart, we add Xs for the inferred intervening bases, and adjust the reading frame accordingly. Otherwise, we add a vertical bar ("|") in the species with a jump. Any columns added in these two cases can be hidden using Hide Inserts and Hide Jumps, respectively.

 

Show Coordinates: [&showCoords,&sc] Show the coordinates of the bases of each species at the end of each line. Also add a line with the ones digit of the coordinate of each base of the reference species.

 

Highlight AAs: [&highlightAAs=POSITIONS,&ha=POSITIONS] Highlight specified sequences of amino acids. Positions are specified as a comma-separated list of individual positions, hyphen-separated intervals, or amino acid sequences. If First AA Num is specified, positions refer to amino acid numbers; otherwise they are indices with the first complete codon after the prologue having index 1, so amino acids in the prologue have negative or zero indices. If an amino acid sequence is specified, it refers to the first occurrence excluding the prologue, unless it is preceded by a minus sign, in which case it is the first occurrence including the prologue.

Example: 3,7-9,DELSNL,-DELSNL

 

Mark Positions: [&markPos=SEQNAME:"STRING"@POSITION,... or &mp=...] Display a line of text between the amino acids and the reference sequence, labeled with the specified sequence name, showing the specified strings starting at the specified positions.

Position can be either an integer (possibly negative), which refers to the specified nucleotide position among the non-gap bases of the reference species, with the first nucleotide after the prologue being 0, or a nucleotide sequence, which refers to the first occurrence of that sequence in the reference species, excluding the prologue. Position can also be the keyword other, in which case STRING should be a single character that will be put at all unmarked positions.

Example: BasePairing:"((.((((("@23,").)))).))"@AGGCG,"."@other

 

Info

 

User's Guide: [&help] Display this user’s guide.

 

Alignment Sets: [&alnsets] Display information about available alignment sets.

 

Splice Models: [&spliceSiteSets] Display information about species that have been used to train splice predictions.

 

Legend: [&legend] Display the codon coloring legend as a standalone page, for possible inclusion in papers or figures that include CodAlignView images.

 

Misc.

 

Fasta Out: [&fastaOut,&fo] Show the alignment as text in FASTA format without any codonization or color coding, so it can be copied and pasted or saved as a FASTA file. With the showCoords option, include the coordinates in the FASTA header for each species. With the hideInserts option, bases inserted relative to the reference species will not be included, but their coordinates will still be included. Use the browser Back button to get back to the controls.

 

No Palette: [&noControls,&nc] Show the alignment without the control palette (not even the collapsed palette). This can be useful for creating graphics to include in a paper or presentation. Use the browser Back button to get back to the control palette.

 

➡Antisense: [&antisense,&as] Show the same region on the opposite strand, adjusting the codon positions so that the wobble position is preserved. In addition to the strand, Trim Transcript Interval, Prologue, Epilogue, Codon Position, Justify Frame, and Species Codon Position are updated as appropriate. Since any Highlight AAs and Mark Positions set for the original strand will no longer make sense on the opposite strand, but will still be useful upon returning to the original strand, they are marked with the string ":anti:" and ignored.

The alignment of a non-coding region can sometimes have the characteristic appearance of a coding region if it is antisense to a true coding region. This is because synonymous substitutions are highly correlated with substitutions in the wobble position and because indels that preserve the frame on one strand also preserve it on the other. This can often be detected by viewing both the region and the antisense region; the one that is non-coding will often have embedded stop codons in one or more species and will tend to have more non-synonymous substitutions.

 

Redisplay Alignment: Redisplay the alignment with new parameter values. Equivalent to pressing Enter in one of the text entry fields.