SNPsnap


Check out any new features or bug reports on the "What's New?" page.

Table of Contents

Methodology
Webserver

Methodology

Aim

The SNPsnap webserver enables SNP-based enrichment analysis by providing matched sets of SNPs that can be used to calibrate background expectations. Specifically, SNPsnap efficiently identifies sets of randomly drawn SNPs that are matched to a set of query SNPs based on minor allele frequency, gene density, distance to nearest gene and number of SNPs in linkage disequilibrium (LD buddies).

Data

SNPsnap uses 1000 Genomes Project Phase 3 variants from the three different ancestral cohorts. (March 2015: SNPsnap was updated from 1000G Phase 1 to Phase 3 variants). SNPsnap uses 1000G Project's definition of the super populations for European and East Asian (see below), but defines West Africa as a subset of the 1000G African samples. Super populations are defined using the 1000G panel file.
Specifically, our database contains biallelic, uniquely mapped SNPs derived from preprocessing the Phase 3 genotype data. SNPsnap's database only holds common variants (>1% MAF). SNPsnap contains all types of variants listed by the 1000 Genomes Project: single nucleotide variants (SNPs), indels and larger structural variants (as assigned with esv-numbers).
SNPsnap contains SNPs located on chromosomes 1-22 and the X-chromosome (March 2015: X chromosome included in SNPsnap).

Technical note: we preprocess all 1000G variants before building SNPsnap's database using the following QC criteria:
  1. Remove SNPs flagged by PLINK as merge conflicts when merging per chromosome 1000G genotypes. See more information at PLINK 1.9 Merge failures.
  2. Remove SNPs with duplicate rsIDs.
  3. Run PLINK variant filtering:
    plink --bfile [prefix to population specific genotypes] --maf 0.01 --hwe 10e-6 --geno 0.1 --make-bed --out [SNPsnap QC PLINK files]
    See PLINK 1.9's documentation for details on the maf, hwe and geno options.
  4. Remove SNPs with duplicate chromosomal coordinates.
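
The duplicate-removal steps (2 and 4) can be illustrated with a minimal sketch. It assumes that every copy of a duplicated rsID or coordinate is dropped, which the text above does not specify, and it uses hypothetical in-memory records rather than real PLINK files. Steps 1 and 3 (PLINK merging and filtering) are not shown.

```python
from collections import Counter

def drop_duplicates(variants, key):
    """Drop every variant whose key occurs more than once.

    Assumption: all copies of a duplicate are removed; the source
    does not say whether one copy is kept.
    """
    counts = Counter(key(v) for v in variants)
    return [v for v in variants if counts[key(v)] == 1]

# Hypothetical records: (chromosome, rsID, base-pair position)
variants = [
    ("1", "rs1", 100),
    ("1", "rs2", 200),
    ("1", "rs2", 250),  # duplicate rsID
    ("2", "rs3", 300),
    ("2", "rs4", 300),  # duplicate chromosomal coordinate
]

# Step 2: remove SNPs with duplicate rsIDs
step2 = drop_duplicates(variants, key=lambda v: v[1])
# Step 4: remove SNPs with duplicate chromosomal coordinates
step4 = drop_duplicates(step2, key=lambda v: (v[0], v[2]))
```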

SNPsnap Super Population Definitions

SNPsnap's super population definitions are based on the following 1000G cohorts:

SNPsnap Database Summary

After several QC steps on the 1000G data, SNPsnap's SNP database is built. The table below summarizes the number of SNPs in the database:

Super Population    Number of SNPs in Database
EUR                 9,535,060
EAS                 8,433,735
WAFR                16,191,783

SNPsnap Gene Sets

SNPsnap uses genes from the GENCODE consortium downloaded via Ensembl GRCh37 Biomart (Homo sapiens genes, GRCh37.p13). (March 2015: Gene set updated to GENCODE genes). SNPsnap uses all genes within the GENCODE gene set to define the Distance to Nearest Gene and Gene Density. Please note that not all of these genes are coding genes. The SNPsnap GENCODE gene set contains 57737 genes, of which 20314 are protein coding genes. SNPsnap provides the annotation feature dist_nearest_gene_snpsnap_protein_coding as the distance to the nearest protein coding gene. Finally, SNPsnap uses Ensembl's mapping to HGNC symbols whenever an HGNC symbol is provided.

Genetic Properties used to Match SNPs

The distributions of the genetic properties of the SNPs in the current version of SNPsnap's database are left-skewed (see the histograms below). *The plots below are based on the older version of SNPsnap using 1000 Genomes Phase 1 variants from European ancestry.

Algorithm

The SNPsnap algorithm identifies matched SNPs as follows:
Step 1
In five uniformly spaced increments, increase the allowable deviation for each of the genetic properties, ending with the prespecified maximum allowable deviation. For each increment, identify matching SNPs, defined as SNPs with genetic properties within the allowable deviations.
Step 2
If there are at least as many matching SNPs as requested, sample without replacement the requested number of SNPs from the matching SNPs and proceed to step 5.
Step 3
If the number of matching SNPs is smaller than the number of requested SNPs, increment the allowable deviation and return to step 1; if the maximum allowable deviation has been reached, proceed to step 4.
Step 4
Sample with replacement from the matched SNPs identified in step 1.
Step 5
Proceed to next query SNP.
Incrementing the allowable deviation of the genetic properties ensures that the best-matching SNPs are always used.
This makes it possible to set a large allowable deviation and still get the best-matching SNPs.
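
The steps above can be sketched for a single query SNP as follows. This is an illustrative sketch, not SNPsnap's actual implementation: the function name, the data layout, and the treatment of deviations as fractions of the query SNP's own value are all assumptions.

```python
import random

def match_snp(query, candidates, max_dev, n_requested, n_steps=5, rng=None):
    """Return n_requested matched SNP ids for one query SNP.

    query:      dict mapping property name -> value for the query SNP
    candidates: dict mapping SNP id -> dict of property values
    max_dev:    dict mapping property name -> maximum allowable
                relative deviation (assumption: 0.5 means +/-50%)
    """
    if rng is None:
        rng = random.Random()
    matches = []
    for step in range(1, n_steps + 1):
        # Step 1: uniformly spaced increments ending at the maximum deviation
        frac = step / n_steps
        matches = [
            snp_id
            for snp_id, props in candidates.items()
            if all(
                abs(props[p] - query[p]) <= frac * max_dev[p] * abs(query[p])
                for p in max_dev
            )
        ]
        # Step 2: enough matches, so sample without replacement
        if len(matches) >= n_requested:
            return rng.sample(matches, n_requested)
        # Step 3: otherwise widen the allowable deviation and try again
    if not matches:
        return []
    # Step 4: maximum deviation reached, so sample with replacement
    return rng.choices(matches, k=n_requested)
```

Because the loop returns at the first increment that yields enough matches, a generous maximum deviation only comes into play for query SNPs that are hard to match.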

Webserver

User input

Chromosomal coordinates of SNPs e.g. 3:20145787 for a SNP on chromosome 3 at bp 20,145,787. SNPsnap uses numeric codes 1-22 for autosomal chromosomes. The X chromosome is assigned the numeric code of 23 (following PLINK's encoding).
SNPsnap also accepts rs-numbers as assigned by the 1000 Genomes Project. Please note that not all variants in the 1000 Genomes Project have been assigned an rs-number; such variants can only be identified by their chromosomal coordinates.
We recommend using chromosomal identifiers for easier downstream processing of SNPsnap's output. SNPsnap is also speed-optimized for chromosomal coordinate identifiers as input.
SNPsnap uses genome build GRCh37/hg19 for SNP coordinates.
Technical note: all input SNPs are internally mapped to chromosomal coordinate identifiers (chr:pos). The mapping can be inspected in the file input_snps_identifer_mapping.txt.
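
As an illustration of the identifier conventions above, the following sketch normalizes a user's own SNP list before submission. It is a hypothetical helper, not part of SNPsnap: the function name is made up, and the conversion of "X" to the numeric code 23 is a convenience assumed here, not documented SNPsnap behaviour.

```python
import re

COORD = re.compile(r"^([0-9]{1,2}|[Xx]):([0-9]+)$")
RSID = re.compile(r"^rs[0-9]+$")

def normalize_snp_id(raw):
    """Return a chr:pos or rs-number identifier, or None if not recognised."""
    text = raw.strip()
    if RSID.match(text):
        return text
    m = COORD.match(text)
    if m is None:
        return None
    chrom, pos = m.groups()
    # The X chromosome uses numeric code 23 (PLINK's encoding)
    code = 23 if chrom in ("X", "x") else int(chrom)
    if not 1 <= code <= 23:
        return None
    return f"{code}:{int(pos)}"
```

Identifiers for which the helper returns None are the kind of entries one would expect to find in input_snps_excluded.txt after submission.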

Input format

SNPsnap accepts the SNP list input format shown below, either as a text file or pasted into the browser. Please note that SNPs that are not formatted correctly, SNPs that do not exist in the database and SNPs excluded from matching will be written to the file input_snps_excluded.txt. Please inspect this file to make sure all your SNPs are formatted correctly.

SNPsnap score

SNPsnap reports two scores that serve as guidelines for selecting proper matching settings. Note that the Match-size score is only relevant to consider if the Insufficient-matches score indicates many insufficient matches.
For each score, the score value and an ordinal scoring variable are reported to the user. The table below illustrates an example of the reported SNPsnap score.

SNPsnap score          Value    Rating
Insufficient-matches   34.50%   (rating image)
Match-size             91.83%   (rating image)
SNPsnap uses genome-wide significant loci from 63 traits and diseases from the GWAS Catalog (Hindorff et al., 2009) as a reference for the scoring system. The ordinal scoring variable (“very poor”, “poor”, “ok”, “good”, “very good”) corresponds to the scoring quintiles derived from SNPsnap scores of the 63 phenotypes. For example, an 'Insufficient-matches' score of 'very good' ranks your SNPsnap query among the top 20% of scores observed for the 63 GWAS Catalog phenotypes. Note that the scoring metrics are only valid for matching settings similar to the default SNPsnap matching settings.

SNPsnap matching bias

Matching bias is defined, for each genetic property, as the ratio of the mean value among the input SNPs to the mean value among the matched SNPs. SNPsnap reports these values as part of the matching result to provide a guideline for selecting the proper matching settings.
The figure below shows the matching bias for each of the genetic properties as a function of the number of requested matched SNPs (using the default matching criteria). The matching bias may be explained by the left skewness of the distribution of the genetic properties. A large allowable deviation of the genetic properties and a large number of requested SNPs are the primary causes of matching bias. Bias of the matched SNPs may be reduced or eliminated by lowering one or both of these parameters. We recommend reducing any matching bias because it may hamper downstream genetic enrichment analysis. *The plots below are based on the older version of SNPsnap using 1000 Genomes Phase 1 variants from European ancestry.
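
The ratio described above can be computed as in the sketch below. It assumes the ratio is taken as input mean over matched mean, and the function name and record layout are hypothetical. A value of 1.0 indicates no bias for that property.

```python
from statistics import mean

def matching_bias(input_snps, matched_snps, properties):
    """Per-property ratio of the input-SNP mean to the matched-SNP mean.

    input_snps, matched_snps: lists of dicts mapping property -> value.
    Raises ZeroDivisionError if a matched-SNP mean is zero.
    """
    return {
        p: mean(s[p] for s in input_snps) / mean(s[p] for s in matched_snps)
        for p in properties
    }

# Hypothetical values, for illustration only
input_snps = [{"gene_count": 2}, {"gene_count": 4}]
matched_snps = [{"gene_count": 1}, {"gene_count": 2}]
bias = matching_bias(input_snps, matched_snps, ["gene_count"])
```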

Clumping

The input SNPs to SNPsnap should be independent if enrichment analysis is the goal of downstream investigations. Failing to input independent SNPs will lead to unintended bias in the matched SNPs, which may cause an improper background distribution. SNPsnap lets users check the independence of input SNPs by selecting Report input loci independence. SNPs can be clumped based on LD and physical distance thresholds. SNPsnap's clumping function is a wrapper around the greedy clumping algorithm in PLINK 1.9 (Chang et al., GigaScience, 2015). See PLINK 1.9's documentation for details.
SNPsnap will report one of the following messages on the results page, depending on the outcome of the clumping:

Technical note: SNPsnap writes a temporary .assoc file (a two-column file with the field headers "SNP" and "P") for the input SNPs. The values in the "P" column are set to a fixed value for all input SNPs. Since all the p-values are identical, the clumping is based on the input order of the SNPs, which makes the choice of index SNPs rather arbitrary. We use the following command:
plink --bfile [prefix to population specific genotypes] --clump [tmp assoc file] --clump-r2 [user_r2] --clump-kb [user_kb]
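
A rough sketch of how such a temporary .assoc file and the corresponding command line could be assembled is shown below. The helper names, the fixed p-value, the genotype prefix and the r2/kb thresholds are hypothetical placeholders; only the PLINK flags are taken from the command above. The sketch builds the command but does not run PLINK.

```python
import os
import tempfile

FIXED_P = 1e-5  # hypothetical; the text only says the value is fixed

def write_assoc(snp_ids, path):
    """Write a two-column file with the headers SNP and P."""
    with open(path, "w") as fh:
        fh.write("SNP\tP\n")
        for snp_id in snp_ids:
            fh.write(f"{snp_id}\t{FIXED_P}\n")

def clump_command(bfile_prefix, assoc_path, r2, kb):
    """Build the PLINK argument list using the flags quoted above."""
    return [
        "plink", "--bfile", bfile_prefix,
        "--clump", assoc_path,
        "--clump-r2", str(r2),
        "--clump-kb", str(kb),
    ]

with tempfile.TemporaryDirectory() as tmp:
    assoc_path = os.path.join(tmp, "input.assoc")
    write_assoc(["3:20145787", "23:5000"], assoc_path)
    with open(assoc_path) as fh:
        assoc_lines = fh.read().splitlines()

# Hypothetical genotype prefix and thresholds
cmd = clump_command("EUR_genotypes", assoc_path, r2=0.5, kb=250)
```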

Output files

All files are tab-delimited, use the .txt file extension and can be imported directly into Excel.

SNP Annotations - input_snps_annotated.txt

SNPsnap currently supports the following annotations of SNPs in the file input_snps_annotated.txt:

# Column Name Column Description

1 snpID Chromosomal coordinates of the SNP
2 rsID Rs-number of the SNP (note that in some cases the rs-number is identical to the chromosomal coordinate)
3 freq_bin Frequency bin of the SNP using 1-2, 2-3, ..., 49-50% strata
4 snp_maf Minor Allele Frequency (MAF) for the SNP
5 gene_count The number of genes in the locus (gene density)
6 dist_nearest_gene_snpsnap Distance to the start site of nearest gene the SNP is located within. If this distance is not defined ("inf"), the distance to the start site of nearest gene is used. That is, dist_nearest_gene_snpsnap is equal to either dist_nearest_gene or dist_nearest_gene_located_within. Distance is in base pairs
7 dist_nearest_gene_snpsnap_protein_coding Same as dist_nearest_gene_snpsnap but distance to nearest protein coding gene start site.
8 dist_nearest_gene Distance to the start site of nearest gene. Distance is in base pairs
9 dist_nearest_gene_located_within Distance to the start site of nearest gene the SNP is located within. Distance is in base pairs. If the SNP is not located within any genes the distance is "inf"
10 loci_upstream Upstream locus boundary
11 loci_downstream Downstream locus boundary
12 ID_nearest_gene_snpsnap Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_snpsnap
13 ID_nearest_gene_snpsnap_protein_coding Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_snpsnap_protein_coding
14 ID_nearest_gene Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene
15 ID_nearest_gene_located_within Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_located_within. Note that this column may be empty if dist_nearest_gene_located_within has the value "inf".
16 HGNC_nearest_gene_snpsnap HGNC Symbol for the gene used in the calculation of dist_nearest_gene_snpsnap. Note that this column may be empty if Ensembl does not provide a mapping from ENSG_ID to HGNC Symbol.
17 HGNC_nearest_gene_snpsnap_protein_coding HGNC Symbol for the gene used in the calculation of dist_nearest_gene_snpsnap_protein_coding. Note that this column may be empty if Ensembl does not provide a mapping from ENSG_ID to HGNC Symbol.
18 flag_snp_within_gene Value indicates whether the SNP is located within a gene. Value can be True or False.
19 flag_snp_within_gene_protein_coding Value indicates whether the SNP is located within a protein coding gene. Value can be True or False.
20 ID_genes_in_matched_locus Ensembl Gene ID of genes overlapping with the locus boundaries. Note that this column may be empty if gene_count has the value "0".
21 friends_ld01 Number of LD buddies using cutoff r² > 0.1
22 friends_ld02 Number of LD buddies using cutoff r² > 0.2
23 friends_ld03 Number of LD buddies using cutoff r² > 0.3
24 friends_ld04 Number of LD buddies using cutoff r² > 0.4
25 friends_ld05 Number of LD buddies using cutoff r² > 0.5
26 friends_ld06 Number of LD buddies using cutoff r² > 0.6
27 friends_ld07 Number of LD buddies using cutoff r² > 0.7
28 friends_ld08 Number of LD buddies using cutoff r² > 0.8
29 friends_ld09 Number of LD buddies using cutoff r² > 0.9
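
For downstream processing outside Excel, the file can be read with standard tab-delimited tooling. The sketch below assumes the file starts with a header row carrying the column names listed above, which is not stated explicitly here. The sample content is made up and shows only a subset of the columns.

```python
import csv
import io

def read_annotations(handle):
    """Parse a tab-delimited annotation file into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["snp_maf"] = float(row["snp_maf"])
        row["gene_count"] = int(row["gene_count"])
        # float() parses the string "inf" as floating-point infinity
        row["dist_nearest_gene_located_within"] = float(
            row["dist_nearest_gene_located_within"]
        )
        rows.append(row)
    return rows

# Made-up sample content with a subset of the documented columns
sample = (
    "snpID\tsnp_maf\tgene_count\tdist_nearest_gene_located_within\n"
    "3:20145787\t0.25\t2\tinf\n"
)
rows = read_annotations(io.StringIO(sample))
```

For a real file, pass an open file handle, for example open("input_snps_annotated.txt", newline=""), instead of the in-memory sample.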

SNP Annotations - matched_snps_annotated.txt

The matched_snps_annotated.txt file contains the same columns as described for input_snps_annotated.txt, but the annotations describe the matched SNPs.
The file contains the following two additional leading columns:

# Column Name Column Description
1 set Index of the matched SNP set. The number of set indices is equal to the number of requested matched SNPs
2 input_snp The chromosomal coordinates of the input SNP that was matched
3...31 ... Same as for input_snps_annotated.txt, but the annotation is for the matched SNP
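
To work with one matched set at a time, the rows can be grouped by the set column. The sketch below assumes rows have already been parsed into dicts keyed by the column names above, with snpID holding the matched SNP. The function name and example values are hypothetical.

```python
from collections import defaultdict

def group_by_set(rows):
    """Map set index -> {input SNP: matched SNP}."""
    sets = defaultdict(dict)
    for row in rows:
        sets[int(row["set"])][row["input_snp"]] = row["snpID"]
    return dict(sets)

# Made-up rows: two input SNPs, two matched sets
matched_rows = [
    {"set": "1", "input_snp": "3:20145787", "snpID": "5:1000"},
    {"set": "1", "input_snp": "7:42", "snpID": "9:2000"},
    {"set": "2", "input_snp": "3:20145787", "snpID": "11:3000"},
    {"set": "2", "input_snp": "7:42", "snpID": "2:4000"},
]
grouped = group_by_set(matched_rows)
```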