Eagle v2.4.1 User Manual

Po-Ru Loh

November 18, 2018

Contents

1 Overview

The Eagle software estimates haplotype phase [1] either within a genotyped cohort or using a phased reference panel. Eagle2 is now the default phasing method used by the Sanger and Michigan imputation servers and uses a new, very fast HMM-based algorithm that improves speed and accuracy over existing methods via two key ideas: a new data structure based on the positional Burrows-Wheeler transform [2] and a rapid search algorithm that explores only the most relevant paths through the HMM. Compared to the Eagle1 algorithm, Eagle2 has similar speed but much greater accuracy at sample sizes <50,000; as such, we have made the Eagle2 algorithm the default option. (The Eagle1 algorithm can be accessed via the --v1 flag.) Eagle v2.3+ supports phasing sequence data with or without a reference and also supports phasing chrX.

The Eagle2 algorithm is described in ref. [3]:

Loh P-R, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR, Durbin R, and Price AL. Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics (2016).

The original Eagle1 algorithm is described in ref. [4]:

Loh P-R, Palamara PF, and Price AL. Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics (2016).

2 Download and installation

You can download the latest version of the Eagle software at:

http://data.broadinstitute.org/alkesgroup/Eagle/downloads/

Previous versions are also available in the old/ subdirectory.

For developers, we are maintaining the Eagle source on github:

http://github.com/poruloh/Eagle

2.1 Change log

2.2 Installation

The Eagle_vX.X.tar.gz download package contains a standalone (i.e., statically linked) 64-bit Linux executable, eagle, which we have tested on several Linux systems. We recommend using this static executable because it is well-optimized and no further installation is required.

If you wish to compile your own version of the Eagle software from the source code (in the src/ subdirectory), you will need to ensure that compiler requirements and library dependencies are fulfilled, and you will need to make appropriate modifications to the Makefile:

For reference, the provided eagle executable was created on the Harvard Medical School “Orchestra 2” research computing cluster using GCC 4.8.5 and was linked against the HTSlib 1.9 library (distributed under the MIT/Expat license), the Boost 1.58.0 library (distributed under the Boost license and patched to support bgzip), and the OpenBLAS 0.2.19 library (distributed under the BSD license).

2.3 Running Eagle

To run the eagle executable, simply invoke ./eagle on the Linux command line (within the Eagle install directory) with parameters in the format --optionName=optionValue. The example/ subdirectory contains example phasing runs with and without a reference panel.

A minimal Eagle invocation for phasing a single chromosome looks like:

./eagle --bfile=plink --geneticMapFile=USE_BIM --outPrefix=phased

assuming the PLINK bim file contains genetic map coordinates. (To interpolate genetic coordinates from a reference genetic map, see detailed usage info below.)

2.4 Help

To get a list of options, run:

./eagle -h

3 Computing requirements

3.1 Operating system

We have only compiled and tested Eagle on Linux computing environments; however, the source code is available if you wish to try compiling Eagle for a different OS.

3.2 Memory

3.3 Running time

3.3.1 Multi-threading

On multi-core machines, running time can be reduced (by a factor close to the number of available cores) using the --numThreads option.

4 Phasing without a reference panel

4.1 Input

4.1.1 Genotypes

When phasing without a reference panel, Eagle takes biallelic genotype input either in PLINK [5] binary format (bed/bim/fam) or in VCF/BCF format. For conversion and manipulation of PLINK files, we highly recommend the PLINK2 software [6]. If all genotypes are contained in a single bed/bim/fam file triple with the same file prefix, you may simply use the command line option --bfile=prefix. Eagle phases only one chromosome at a time; if the PLINK files or VCF/BCF file contain data from multiple chromosomes, the chromosome to phase must be specified using the --chrom option.

X chomosome genotypes have the feature that males are hemizygous (i.e., haploid) except in the pseudo-autosomal regions PAR1 and PAR2 at the tips of chrX.

If you are working with an organism that does not have 23 chromosomes, you will need to set --chromX to the number of chromosomes.

4.1.2 Reference genetic maps

In the event that your PLINK bim file does not contain genetic coordinates (in units of Morgans), we have included reference maps (downloaded from HapMap and reformatted) that you can use to interpolate genetic map coordinates from SNP base pair positions. To use a reference map, use the option

--geneticMapFile=tables/genetic_map_hg##_withX.txt.gz

selecting the build (hg17, hg18, or hg19) corresponding to the base pair coordinates of your bim file. (Note: As of v2.3, we have augmented these maps with chrX, and we have also added a linear map genetic_map_1cMperMb.txt for use with non-human data.) You may use the --geneticMapFile option even if your PLINK bim file does contain genetic coordinates; in this case, the genetic coordinates in the bim file will be ignored, and interpolated coordinates will be used instead.

VCF/BCF format does not include genetic coordinates, so the --geneticMapFile option is mandatory with VCF/BCF input.

4.1.3 Missing data treatment

Missing genotypes are automatically imputed during the phasing process; the phased haplotypes that Eagle outputs contain best-guess imputed (haploid) genotypes. If you wish to leave missing genotypes unimputed, you may do so using the --noImpMissing flag if the input is in VCF/BCF format.

4.1.4 Genotype QC

When phasing PLINK data, Eagle automatically filters SNPs and individuals with missing rates exceeding thresholds of 0.1; filtered SNPs and individuals are not included in the output. These thresholds may be modified using --maxMissingPerSnp and --maxMissingPerIndiv. Eagle does not perform filtering based on minor allele frequency or deviation from Hardy-Weinberg equilibrium; we recommend using PLINK2 to perform such filtering if desired. Eagle also does not perform any filtering when phasing VCF/BCF data.

4.1.5 Individual and SNP filters

When phasing PLINK data, individuals to remove from the analysis may be specified in one or more --remove files listing FID and IIDs (one individual per line). Similarly, SNPs to exclude from the analysis may be specified in one ore more --exclude files listing SNP IDs (typically rs numbers). These options are not supported for VCF/BCF input.

For either PLINK or VCF/BCF input data, analysis may be restricted to a sub-chromosomal region by using the --bpStart and --bpEnd options. (For example, setting --bpStart=50e6 and --bpEnd=100e6 restricts analysis to positions between 50Mb and 100Mb.)

4.1.6 Optional algorithmic parameters

The main parameter that adjusts speed and accuracy of the Eagle2 algorithm is --Kpbwt, which determines the number of conditioning haplotypes Eagle2 uses when phasing each sample. By default, --Kpbwt=10000, which we believe provides a good balance of speed and accuracy for most applications. However, increasing --Kpbwt achieves a small increase in accuracy at the expense of additional computation (and analogously for decreasing --Kpbwt). Additional algorithmic options (e.g., --pbwtIters) can be listed by running ./eagle -h.

4.2 Output

When phasing PLINK data without a reference panel, Eagle outputs phased haplotypes in gzipped Oxford HAPS/SAMPLE format (used by SHAPEIT2 [7]; see the SHAPEIT2 documentation for more information). Output file paths are obtained by appending .haps.gz and .sample to the path prefix specified with the --outPrefix option.

When phasing VCF/BCF data without a reference panel, Eagle outputs phased haplotypes in VCF/BCF format (compressed or uncompressed); the output format can be specified using the --vcfOutFormat flag (e.g., --vcfOutFormat=z for .vcf.gz output as in bcftools). Output file paths are obtained by appending .vcf, .vcf.gz, or .bcf to the path prefix specified with the --outPrefix option.

Eagle writes logging output to stdout and stderr as analysis proceeds; we recommend saving this output. If you wish to save this output while simultaneously viewing it on the command line, you may do so using

./eagle [... list of options ...] 2>&1 | tee output.log

5 Phasing with a reference panel

NOTE: If the data set you wish to phase contains more than twice as many samples as the largest reference panel available to you, then using a reference panel is unlikely to give much of a boost in phasing accuracy.

5.1 Input

5.1.1 Genotypes

When phasing with a reference panel, Eagle takes input for both the target and reference genotypes in tabix-indexed VCF/BCF format (compressed or uncompressed) using the --vcfTarget and --vcfRef options; the file format is automatically detected. We recommmend using BCF format, as VCF parsing time can be substantial for large data sets. For conversion and manipulation of VCF/BCF files, we highly recommend the bcftools software.

If your target VCF/BCF file contains data from multiple chromosomes, you must specify one chromosome to phase using the --chrom option.

Chromosome names must match between the target and reference VCF/BCF files.

Eagle restricts analysis to sites that are contained in both the target and reference (with matching CHR, POS, and ALT fields) and are biallelic in the target VCF/BCF. Sites for which the REF and ALT alleles are swapped in the target VCF/BCF relative to the reference are dropped by default. This requirement can be relaxed via the --allowRefAltSwap flag, which causes REF/ALT swaps to be tolerated (and automatically flipped) for true SNPs. For indels, REF/ALT swaps are always dropped due to the possibility of different indels appearing to be the same. (Thanks to Giulio Genovese for pointing out this subtlety.)

Eagle drops multi-allelic sites from the phased output. If your data contains multi-allelic sites that you wish to phase, you can do so by splitting multi-allelic sites into biallelics using bcftools (in both the target and reference data). However, note that encoding a multi-allelic site (e.g., REF=A, ALT=C,G) using a pair of biallelic markers (e.g., REF=A, ALT=C and REF=A, ALT=G) can occasionally result in nonsensical phase (e.g., if a CG het is phased with both ALT alleles assigned to the same haplotype). This issue may eventually be solved by incorporating a phase constraint between the biallelic markers, but for now, you will need to check your output for this type of error (if relevant to downstream analyses).

5.1.2 Reference genetic maps

Since VCF/BCF files do not contain genetic coordinates, you must specify the path to the reference map provided with Eagle:

--geneticMapFile=tables/genetic_map_hg##.txt.gz

selecting the build (hg17, hg18, hg19, or hg38) corresponding to the base pair coordinates in your input data.

5.1.3 Missing data treatment

By default, missing genotypes in the target VCF/BCF are automatically imputed during the phasing process; the phased haplotypes that Eagle outputs contain best-guess imputed (haploid) genotypes. Missing imputation can be turned off using the --noImpMissing option, in which case these genotypes will be left as ./. (or . for haploid chrX genotypes) in the output VCF/BCF.

Because missing genotypes sometimes have incorrectly-coded ploidy, Eagle tries to guess the right ploidy for missing chrX genotypes by default. This behavior can be changed with the --keepMissingPloidyX option, which forces Eagle to produce output with exactly the same ploidy as the input.

5.1.4 Individual and SNP filters

When phasing using a reference, the --vcfExclude flag can be used to specify a list of variants (in VCF/BCF format) to exclude. The --maxMissingPerSnp, --maxMissingPerIndiv, and --remove flags are not supported when phasing using a reference; however, analysis may still be restricted to a sub-chromosomal region by using the --bpStart and --bpEnd options. Additionally, a flanking region can be specified using the --bpFlanking option; this region will be used during phasing but discarded in the output. (For example, setting --bpStart=50e6, --bpEnd=100e6, and --bpFlanking=1e6 analyzes positions between 49Mb and 101Mb and then outputs positions between 50Mb and 100Mb.)

5.1.5 Optional algorithmic parameters

As above, the main parameter that adjusts speed and accuracy of the Eagle2 algorithm is --Kpbwt; see above for details.

5.2 Output

When phasing with a reference panel, Eagle outputs phased haplotypes in VCF/BCF format (compressed or uncompressed); the output format can be specified using the --vcfOutFormat flag. Output file paths are obtained by appending .vcf, .vcf.gz, or .bcf to the path prefix specified with the --outPrefix option.

As above, we recommend saving Eagle’s logging output (written to stdout and stderr); you may save this output while simultaneously viewing it on the command line using

./eagle [... list of options ...] 2>&1 | tee output.log

At the end of reference-based phasing analysis, Eagle writes a list of estimated average phase confidences (one estimate per sample) to stdout; these confidences can be used for downstream QC. Phase confidences of -nan indicate samples with nothing to phase (i.e., 0 hets or 1 het, which tend to arise when phasing male X chromosomes or in runs of homozygosity).

5.3 Reference panels from the 1000 Genomes Project

If you wish to perform phasing using the publicly available 1000 Genomes Phase 3 haplotypes as a reference panel, please see below for how to download and prepare the data.

5.3.1 GRCh37/hg19

The 1000 Genomes Phase 3 data can be downloaded in hg19 coordinates from the FTP site below:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

The data is divided into one vcf.gz file per chromosome. Note that if you plan to run Eagle on multiple target genotype files, you should first use bcftools to convert the data into BCF format (which will greatly reduce input processing time).

The files provided by the 1000 Genomes Project contain multi-allelic sites, which Eagle ignores by default. If you wish to phase these sites, you will need to split them using bcftools. The following pipeline (supplied by Giulio Genovese) does so (plain-text version here):

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \  
  gzip -d > human_g1k_v37.fasta  
samtools faidx human_g1k_v37.fasta  
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr{{1..22}.phase3_shapeit2_mvncall_integrated_v5a,X.phase3_shapeit2_mvncall_integrated_v1b,Y.phase3_integrated_v2a}.20130502.genotypes.vcf.gz{,.tbi}  
 
for chr in {1..22} X Y; do  
  bcftools view --no-version -Ou -c 2 ALL.chr${chr}.phase3⋆integrated_v[125][ab].20130502.genotypes.vcf.gz | \  
  bcftools norm --no-version -Ou -m -any | \  
  bcftools norm --no-version -Ob -o ALL.chr${chr}.phase3_integrated.20130502.genotypes.bcf -d none -f human_g1k_v37.fasta && \  
  bcftools index -f ALL.chr${chr}.phase3_integrated.20130502.genotypes.bcf  
done

5.3.2 GRCh38/hg38

The 1000 Genomes Phase 3 has also been lifted over to GRCh38, but you will need to do some additional processing to remove multi-allelic variants and duplicate variants before running Eagle. The following pipeline (by Giulio Genovese) normalizes variants using the version of the GRCh38 fasta reference recommended by Heng Li (plain-text version here):

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \  
  gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna  
samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna  
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr{{1..22},X,Y}_GRCh38.genotypes.20170504.vcf.gz{,.tbi}  
 
for chr in {1..22} X Y; do  
  (bcftools view --no-version -h ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz | \  
    grep -v "^##contig=<ID=[GNh]" | sed 's/^##contig=<ID=MT/##contig=<ID=chrM/;s/^##contig=<ID=\([0-9XY]\)/##contig=<ID=chr\1/'; \  
  bcftools view --no-version -H -c 2 ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz | sed 's/^/chr/') | \  
  bcftools norm --no-version -Ou -m -any | \  
  bcftools norm --no-version -Ob -o ALL.chr${chr}_GRCh38.genotypes.20170504.bcf -d none -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna && \  
  bcftools index -f ALL.chr${chr}_GRCh38.genotypes.20170504.bcf  
done

A few notes:

6 Contact

If you have comments or questions about the Eagle software, please contact:

Po-Ru Loh, poruloh@broadinstitute.org

7 License

Eagle is free software under the GNU General Public License v3.0 (GPLv3).

References

1.   Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics 12, 703–714 (2011).

2.   Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014).

3.   Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics 48, 1443–1448 (2016).

4.   Loh, P.-R., Palamara, P. F. & Price, A. L. Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics 48, 811–816 (2016).

5.   Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–575 (2007).

6.   Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 1–16 (2015).

7.   Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10, 5–6 (2013).