Eagle v2.4.1 User Manual

Po-Ru Loh

November 18, 2018

1 Overview
2 Download and installation
- 2.1 Change log
- 2.2 Installation
- 2.3 Running Eagle
- 2.4 Help
3 Computing requirements
- 3.1 Operating system
- 3.2 Memory
- 3.3 Running time
  - 3.3.1 Multi-threading
4 Phasing without a reference panel
- 4.1 Input
  - 4.1.1 Genotypes
  - 4.1.2 Reference genetic maps
  - 4.1.3 Missing data treatment
  - 4.1.4 Genotype QC
  - 4.1.5 Individual and SNP filters
  - 4.1.6 Optional algorithmic parameters
- 4.2 Output
5 Phasing with a reference panel
- 5.1 Input
  - 5.1.1 Genotypes
  - 5.1.2 Reference genetic maps
  - 5.1.3 Missing data treatment
  - 5.1.4 Individual and SNP filters
  - 5.1.5 Optional algorithmic parameters
- 5.2 Output
- 5.3 Reference panels from the 1000 Genomes Project
  - 5.3.1 GRCh37/hg19
  - 5.3.2 GRCh38/hg38
6 Contact
7 License

1 Overview

The Eagle software estimates haplotype phase [1] either within a genotyped cohort or using a phased reference panel. Eagle2 is now the default phasing method used by the Sanger and Michigan imputation servers and uses a new, very fast HMM-based algorithm that improves speed and accuracy over existing methods via two key ideas: a new data structure based on the positional Burrows-Wheeler transform [2] and a rapid search algorithm that explores only the most relevant paths through the HMM. Compared to the Eagle1 algorithm, Eagle2 has similar speed but much greater accuracy at sample sizes <50,000; as such, we have made the Eagle2 algorithm the default option. (The Eagle1 algorithm can be accessed via the --v1 flag.) Eagle v2.3+ supports phasing sequence data with or without a reference and also supports phasing chrX.

The Eagle2 algorithm is described in ref. [3]:

Loh P-R, Danecek P, Palamara PF, Fuchsberger C, Reshef YA, Finucane HK, Schoenherr S, Forer L, McCarthy S, Abecasis GR, Durbin R, and Price AL. Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics (2016).

The original Eagle1 algorithm is described in ref. [4]:

Loh P-R, Palamara PF, and Price AL. Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics (2016).

2 Download and installation

You can download the latest version of the Eagle software at:

http://data.broadinstitute.org/alkesgroup/Eagle/downloads/

Previous versions are also available in the old/ subdirectory.

For developers, we are maintaining the Eagle source on github:

http://github.com/poruloh/Eagle

2.1 Change log

Version 2.4.1 (November 18, 2018):
- Allowed --noImpMissing whenever input is provided in VCF format. (Previously, --noImpMissing was only allowed in reference mode when using 1 PBWT iteration.)
- Fixed bug causing --bpStart and --bpEnd to be ignored when --chrom was not specified. (Now --chrom is required with --bpStart/--bpEnd.)
- Added support for multithreaded HTSlib I/O (thanks to David Heise).
- Improved error-checking.
Version 2.4 (December 13, 2017):
- Fixed bug that had allowed target/ref allele flips: allele-matching was only being checked when --allowRefAltSwap was specified. Also enforced requirement that --allowRefAltSwap only apply to SNPs (not indels).
- Fixed bug causing a stack overflow for samples with very large numbers of missing genotypes (thanks to Laurent Francioli).
- Added --pbwtOnly option that skips non-PBWT iterations in non-reference mode (automatically set when N>200K).
- Added --keepMissingPloidyX option that keeps ploidy of missing VCF genotypes ('.' or './.') in chrX. (Because missing genotypes sometimes have incorrectly-coded ploidy, Eagle tries to guess the right ploidy by default.)
- Added --vcfExclude option for ignoring a list of variants when phasing in ref-mode.
- Allowed ‘chr’ prefixes in VCF chromosome names (to support hg38).
- Improved error-checking of genetic maps.
- Suppressed warnings about individuals with low heterozygosity when phasing chrX.
Version 2.3.5 (August 2, 2017): Fixed minor bug causing Eagle to erroneously terminate with failed assertion in rare cases.
Version 2.3.4 (June 5, 2017): Fixed bug causing incorrect imputation of missing genotypes prior to the first heterozygous site in each sample.
Version 2.3.3 (May 25, 2017):
- Fixed bug causing a crash when encountering all-haploid target sites in chrX.
- Fixed bug causing a crash when encountering sites polymorphic in only one of the reference or target (thanks to Petr Danecek).
Version 2.3.2 (March 23, 2017):
- Modified target/ref syncing to keep markers biallelic in the reference panel but monomorphic in the target data (via a change in htslib-1.4; thanks to Petr Danecek).
- Changed default VCF output format back to .vcf.gz (reverting accidental change in v2.3.1).
- Added genetic map for hg38 (lifted over from hg19).
- Added --outputUnphased flag to enable output of target-only sites (which are left unphased) when phasing using a reference.
Version 2.3.1 (March 3, 2017):
- Fixed bug causing rare crashes on samples with no hets.
- Fixed minor bugs in input reading.
- Improved error-checking and error-reporting.
Version 2.3 (July 22, 2016):
- Added support for phasing chrX (and added chrX genetic map).
- Optimized performance when phasing sequence data without a reference.
- Disallowed REF/ALT swaps for indels even with --allowRefAltSwap.
- Improved error-checking.
Version 2.2 (July 15, 2016):
- Added support for VCF/BCF input in non-reference mode.
- Reduced RAM when phasing sequence data in non-ref mode.
- Fixed bug causing crashes on samples with no hets.
- Fixed bug causing crashes on target hets monomorphic in the reference.
- Fixed other minor bugs. (Thanks to Stephan Ripke, Giulio Genovese, Christian Fuchsberger, and Sebastian Schoenherr for reporting bugs and assisting with debugging.)
Version 2.1 (July 6, 2016):
- Replaced in-sample imputation algorithm; new algorithm fills in missing genotypes during phasing (improving overall efficiency on sequence data).
- Added experimental --usePS parameter for specifying FORMAT:PS phase set constraints (--usePS=1 for soft constraint; --usePS=2 for harder constraint). [Note: This implementation appears to be buggy and will probably need to be fixed.]
- Fixed minor error in recombination probability formula.
- Fixed minor bug (incorrect memset call) in first step of non-reference-based phasing algorithm.
- Fixed bug in input-checking (incorrect exit upon finding ”.” genotypes in target VCF).
Version 2.0.5 (May 26, 2016): Fixed bug in genetic map interpolation when using the --chrom flag.
Version 2.0.4 (May 24, 2016): Added error check for 0 Mb / 0 cM region span; added error check for memory allocation failures.
Version 2.0.3 (May 19, 2016): Added error check for 0 or 1 SNPs in intersection of target and ref.
Version 2.0.2 (May 19, 2016): Added --allowRefAltSwap flag to allow swapping of REF/ALT in target vs. ref VCF.
Version 2.0.1 (May 9, 2016): Fixed bug in region handling (thanks to Petr Danecek); added command line to VCF header.
Version 2.0 (May 3, 2016): Initial release of Eagle2.
Version 1.0 (September 23, 2015): Initial release of Eagle1.

2.2 Installation

The Eagle_vX.X.tar.gz download package contains a standalone (i.e., statically linked) 64-bit Linux executable, eagle, which we have tested on several Linux systems. We recommend using this static executable because it is well-optimized and no further installation is required.

If you wish to compile your own version of the Eagle software from the source code (in the src/ subdirectory), you will need to ensure that compiler requirements and library dependencies are fulfilled, and you will need to make appropriate modifications to the Makefile:

Compiler requirements: Support for the unordered_set and unordered_map C++11 containers is required. (Recent versions of GCC support these features.)
Library dependencies:
- HTSlib C library for sequencing data. Eagle uses the HTSlib API (v1.4+) to read and write VCF/BCF files when performing reference-based phasing.
- Boost C++ libraries. The Boost program_options and iostreams libraries need to be installed after downloading and unzipping Boost.
- BLAS numerical library. Eagle uses the sgemm matrix multiplication routine for a very small part of its computation. Linking against any BLAS implementation (e.g., OpenBLAS, ATLAS) will suffice.
Makefile: Paths to libraries need to be modified appropriately.

For reference, the provided eagle executable was created on the Harvard Medical School “Orchestra 2” research computing cluster using GCC 4.8.5 and was linked against the HTSlib 1.9 library (distributed under the MIT/Expat license), the Boost 1.58.0 library (distributed under the Boost license and patched to support bgzip), and the OpenBLAS 0.2.19 library (distributed under the BSD license).

2.3 Running Eagle

To run the eagle executable, simply invoke ./eagle on the Linux command line (within the Eagle install directory) with parameters in the format --optionName=optionValue. The example/ subdirectory contains example phasing runs with and without a reference panel.

A minimal Eagle invocation for phasing a single chromosome looks like:

./eagle --bfile=plink --geneticMapFile=USE_BIM --outPrefix=phased

assuming the PLINK bim file contains genetic map coordinates. (To interpolate genetic coordinates from a reference genetic map, see detailed usage info below.)

2.4 Help

To get a list of options, run:

./eagle -h

3 Computing requirements

3.1 Operating system

We have only compiled and tested Eagle on Linux computing environments; however, the source code is available if you wish to try compiling Eagle for a different OS.

3.2 Memory

When phasing without a reference panel, Eagle’s memory use scales linearly with the number of samples (N) and the number of SNPs (M). For our tests on N=150K UK Biobank samples, Eagle required ≈1 GB RAM per 1,000 SNPs.
When phasing with a reference panel, Eagle requires ≈1.5 bytes per genotype in the merged target and reference.

3.3 Running time

When phasing without a reference panel, Eagle’s running time includes both O(MN) and O(MN²) terms; for the Eagle1 algorithm, these terms contributed roughly equally at N=150K, but for the Eagle2 algorithm, the quadratic term has a much smaller constant factor, such that the running time exhibits close-to-linear scaling for N up to 150K. Phasing N=150K UK Biobank samples on chromosome 1 required ≈2 days on a 16-core 2.60 GHz Intel Xeon E5-2650 v2 processor.
When phasing with a reference panel, Eagle requires ≈1 minute to phase each target sample genome-wide.

3.3.1 Multi-threading

On multi-core machines, running time can be reduced (by a factor close to the number of available cores) using the --numThreads option.

4 Phasing without a reference panel

4.1 Input

4.1.1 Genotypes

When phasing without a reference panel, Eagle takes biallelic genotype input either in PLINK [5] binary format (bed/bim/fam) or in VCF/BCF format. For conversion and manipulation of PLINK files, we highly recommend the PLINK2 software [6]. If all genotypes are contained in a single bed/bim/fam file triple with the same file prefix, you may simply use the command line option --bfile=prefix. Eagle phases only one chromosome at a time; if the PLINK files or VCF/BCF file contain data from multiple chromosomes, the chromosome to phase must be specified using the --chrom option.

X chomosome genotypes have the feature that males are hemizygous (i.e., haploid) except in the pseudo-autosomal regions PAR1 and PAR2 at the tips of chrX.

PLINK format represents the non-PAR part of chrX (chrom 23=X) by setting all non-missing male genotypes to homozygous and represents PAR1 and PAR2 (chrom 25=XY) with regular diploid genotypes. If you wish to phase only the non-PAR region of chrX, you may do so by giving Eagle chrom 23 input. If you wish to phase all of chrX, you may do so by merging the non-PAR and PAR data into a single chrom 23 file (containing male hets) using PLINK’s --merge-x feature. Note that phasing accuracy is much lower in PAR1 and PAR2 than elsewhere in the genome because these regions recombine an order of magnitude faster.
VCF/BCF format natively represents heterogeneity in ploidy, so no special handling is required for chrX.

If you are working with an organism that does not have 23 chromosomes, you will need to set --chromX to the number of chromosomes.

4.1.2 Reference genetic maps

In the event that your PLINK bim file does not contain genetic coordinates (in units of Morgans), we have included reference maps (downloaded from HapMap and reformatted) that you can use to interpolate genetic map coordinates from SNP base pair positions. To use a reference map, use the option

--geneticMapFile=tables/genetic_map_hg##_withX.txt.gz

selecting the build (hg17, hg18, or hg19) corresponding to the base pair coordinates of your bim file. (Note: As of v2.3, we have augmented these maps with chrX, and we have also added a linear map genetic_map_1cMperMb.txt for use with non-human data.) You may use the --geneticMapFile option even if your PLINK bim file does contain genetic coordinates; in this case, the genetic coordinates in the bim file will be ignored, and interpolated coordinates will be used instead.

VCF/BCF format does not include genetic coordinates, so the --geneticMapFile option is mandatory with VCF/BCF input.

4.1.3 Missing data treatment

Missing genotypes are automatically imputed during the phasing process; the phased haplotypes that Eagle outputs contain best-guess imputed (haploid) genotypes. If you wish to leave missing genotypes unimputed, you may do so using the --noImpMissing flag if the input is in VCF/BCF format.

4.1.4 Genotype QC

When phasing PLINK data, Eagle automatically filters SNPs and individuals with missing rates exceeding thresholds of 0.1; filtered SNPs and individuals are not included in the output. These thresholds may be modified using --maxMissingPerSnp and --maxMissingPerIndiv. Eagle does not perform filtering based on minor allele frequency or deviation from Hardy-Weinberg equilibrium; we recommend using PLINK2 to perform such filtering if desired. Eagle also does not perform any filtering when phasing VCF/BCF data.

4.1.5 Individual and SNP filters

When phasing PLINK data, individuals to remove from the analysis may be specified in one or more --remove files listing FID and IIDs (one individual per line). Similarly, SNPs to exclude from the analysis may be specified in one ore more --exclude files listing SNP IDs (typically rs numbers). These options are not supported for VCF/BCF input.

For either PLINK or VCF/BCF input data, analysis may be restricted to a sub-chromosomal region by using the --bpStart and --bpEnd options. (For example, setting --bpStart=50e6 and --bpEnd=100e6 restricts analysis to positions between 50Mb and 100Mb.)

4.1.6 Optional algorithmic parameters

The main parameter that adjusts speed and accuracy of the Eagle2 algorithm is --Kpbwt, which determines the number of conditioning haplotypes Eagle2 uses when phasing each sample. By default, --Kpbwt=10000, which we believe provides a good balance of speed and accuracy for most applications. However, increasing --Kpbwt achieves a small increase in accuracy at the expense of additional computation (and analogously for decreasing --Kpbwt). Additional algorithmic options (e.g., --pbwtIters) can be listed by running ./eagle -h.

4.2 Output

When phasing PLINK data without a reference panel, Eagle outputs phased haplotypes in gzipped Oxford HAPS/SAMPLE format (used by SHAPEIT2 [7]; see the SHAPEIT2 documentation for more information). Output file paths are obtained by appending .haps.gz and .sample to the path prefix specified with the --outPrefix option.

When phasing VCF/BCF data without a reference panel, Eagle outputs phased haplotypes in VCF/BCF format (compressed or uncompressed); the output format can be specified using the --vcfOutFormat flag (e.g., --vcfOutFormat=z for .vcf.gz output as in bcftools). Output file paths are obtained by appending .vcf, .vcf.gz, or .bcf to the path prefix specified with the --outPrefix option.

Eagle writes logging output to stdout and stderr as analysis proceeds; we recommend saving this output. If you wish to save this output while simultaneously viewing it on the command line, you may do so using

./eagle [... list of options ...] 2>&1 | tee output.log

5 Phasing with a reference panel

NOTE: If the data set you wish to phase contains more than twice as many samples as the largest reference panel available to you, then using a reference panel is unlikely to give much of a boost in phasing accuracy.

5.1 Input

5.1.1 Genotypes

When phasing with a reference panel, Eagle takes input for both the target and reference genotypes in tabix-indexed VCF/BCF format (compressed or uncompressed) using the --vcfTarget and --vcfRef options; the file format is automatically detected. We recommmend using BCF format, as VCF parsing time can be substantial for large data sets. For conversion and manipulation of VCF/BCF files, we highly recommend the bcftools software.

If your target VCF/BCF file contains data from multiple chromosomes, you must specify one chromosome to phase using the --chrom option.

Chromosome names must match between the target and reference VCF/BCF files.

Eagle restricts analysis to sites that are contained in both the target and reference (with matching CHR, POS, and ALT fields) and are biallelic in the target VCF/BCF. Sites for which the REF and ALT alleles are swapped in the target VCF/BCF relative to the reference are dropped by default. This requirement can be relaxed via the --allowRefAltSwap flag, which causes REF/ALT swaps to be tolerated (and automatically flipped) for true SNPs. For indels, REF/ALT swaps are always dropped due to the possibility of different indels appearing to be the same. (Thanks to Giulio Genovese for pointing out this subtlety.)

Eagle drops multi-allelic sites from the phased output. If your data contains multi-allelic sites that you wish to phase, you can do so by splitting multi-allelic sites into biallelics using bcftools (in both the target and reference data). However, note that encoding a multi-allelic site (e.g., REF=A, ALT=C,G) using a pair of biallelic markers (e.g., REF=A, ALT=C and REF=A, ALT=G) can occasionally result in nonsensical phase (e.g., if a CG het is phased with both ALT alleles assigned to the same haplotype). This issue may eventually be solved by incorporating a phase constraint between the biallelic markers, but for now, you will need to check your output for this type of error (if relevant to downstream analyses).

5.1.2 Reference genetic maps

Since VCF/BCF files do not contain genetic coordinates, you must specify the path to the reference map provided with Eagle:

--geneticMapFile=tables/genetic_map_hg##.txt.gz

selecting the build (hg17, hg18, hg19, or hg38) corresponding to the base pair coordinates in your input data.

5.1.3 Missing data treatment

By default, missing genotypes in the target VCF/BCF are automatically imputed during the phasing process; the phased haplotypes that Eagle outputs contain best-guess imputed (haploid) genotypes. Missing imputation can be turned off using the --noImpMissing option, in which case these genotypes will be left as ./. (or . for haploid chrX genotypes) in the output VCF/BCF.

Because missing genotypes sometimes have incorrectly-coded ploidy, Eagle tries to guess the right ploidy for missing chrX genotypes by default. This behavior can be changed with the --keepMissingPloidyX option, which forces Eagle to produce output with exactly the same ploidy as the input.

5.1.4 Individual and SNP filters

When phasing using a reference, the --vcfExclude flag can be used to specify a list of variants (in VCF/BCF format) to exclude. The --maxMissingPerSnp, --maxMissingPerIndiv, and --remove flags are not supported when phasing using a reference; however, analysis may still be restricted to a sub-chromosomal region by using the --bpStart and --bpEnd options. Additionally, a flanking region can be specified using the --bpFlanking option; this region will be used during phasing but discarded in the output. (For example, setting --bpStart=50e6, --bpEnd=100e6, and --bpFlanking=1e6 analyzes positions between 49Mb and 101Mb and then outputs positions between 50Mb and 100Mb.)

5.1.5 Optional algorithmic parameters

As above, the main parameter that adjusts speed and accuracy of the Eagle2 algorithm is --Kpbwt; see above for details.

5.2 Output

When phasing with a reference panel, Eagle outputs phased haplotypes in VCF/BCF format (compressed or uncompressed); the output format can be specified using the --vcfOutFormat flag. Output file paths are obtained by appending .vcf, .vcf.gz, or .bcf to the path prefix specified with the --outPrefix option.

As above, we recommend saving Eagle’s logging output (written to stdout and stderr); you may save this output while simultaneously viewing it on the command line using

./eagle [... list of options ...] 2>&1 | tee output.log

At the end of reference-based phasing analysis, Eagle writes a list of estimated average phase confidences (one estimate per sample) to stdout; these confidences can be used for downstream QC. Phase confidences of -nan indicate samples with nothing to phase (i.e., 0 hets or 1 het, which tend to arise when phasing male X chromosomes or in runs of homozygosity).

5.3 Reference panels from the 1000 Genomes Project

If you wish to perform phasing using the publicly available 1000 Genomes Phase 3 haplotypes as a reference panel, please see below for how to download and prepare the data.

5.3.1 GRCh37/hg19

The 1000 Genomes Phase 3 data can be downloaded in hg19 coordinates from the FTP site below:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

The data is divided into one vcf.gz file per chromosome. Note that if you plan to run Eagle on multiple target genotype files, you should first use bcftools to convert the data into BCF format (which will greatly reduce input processing time).

The files provided by the 1000 Genomes Project contain multi-allelic sites, which Eagle ignores by default. If you wish to phase these sites, you will need to split them using bcftools. The following pipeline (supplied by Giulio Genovese) does so (plain-text version here):

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | \
  gzip -d > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr{{1..22}.phase3_shapeit2_mvncall_integrated_v5a,X.phase3_shapeit2_mvncall_integrated_v1b,Y.phase3_integrated_v2a}.20130502.genotypes.vcf.gz{,.tbi}

for chr in {1..22} X Y; do
  bcftools view --no-version -Ou -c 2 ALL.chr${chr}.phase3⋆integrated_v[125][ab].20130502.genotypes.vcf.gz | \
  bcftools norm --no-version -Ou -m -any | \
  bcftools norm --no-version -Ob -o ALL.chr${chr}.phase3_integrated.20130502.genotypes.bcf -d none -f human_g1k_v37.fasta && \
  bcftools index -f ALL.chr${chr}.phase3_integrated.20130502.genotypes.bcf
done

5.3.2 GRCh38/hg38

The 1000 Genomes Phase 3 has also been lifted over to GRCh38, but you will need to do some additional processing to remove multi-allelic variants and duplicate variants before running Eagle. The following pipeline (by Giulio Genovese) normalizes variants using the version of the GRCh38 fasta reference recommended by Heng Li (plain-text version here):

wget -O- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | \
  gzip -d > GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr{{1..22},X,Y}_GRCh38.genotypes.20170504.vcf.gz{,.tbi}

for chr in {1..22} X Y; do
  (bcftools view --no-version -h ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz | \
    grep -v "^##contig=<ID=[GNh]" | sed 's/^##contig=<ID=MT/##contig=<ID=chrM/;s/^##contig=<ID=$[0-9XY]$/##contig=<ID=chr\1/'; \
  bcftools view --no-version -H -c 2 ALL.chr${chr}_GRCh38.genotypes.20170504.vcf.gz | sed 's/^/chr/') | \
  bcftools norm --no-version -Ou -m -any | \
  bcftools norm --no-version -Ob -o ALL.chr${chr}_GRCh38.genotypes.20170504.bcf -d none -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna && \
  bcftools index -f ALL.chr${chr}_GRCh38.genotypes.20170504.bcf
done

A few notes:

The above code converts 1000 Genomes contig names from {{1..22},X,Y,MT} to chr{{1..22},X,Y,M}.
Singletons are removed using bcftools view -c 2
The pipeline requires a recent version of bcftools (≥1.7) that supports the option bcftools norm -d none to remove duplicate variants.

6 Contact

If you have comments or questions about the Eagle software, please contact:

Po-Ru Loh, poruloh@broadinstitute.org

7 License

Eagle is free software under the GNU General Public License v3.0 (GPLv3).

References

1. Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics 12, 703–714 (2011).

2. Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014).

3. Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics 48, 1443–1448 (2016).

4. Loh, P.-R., Palamara, P. F. & Price, A. L. Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics 48, 811–816 (2016).

5. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81, 559–575 (2007).

6. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 1–16 (2015).

7. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nature Methods 10, 5–6 (2013).