The SNPsnap webserver enables SNP-based enrichment analysis by providing matched sets of SNPs that can be used to calibrate background expectations. Specifically, SNPsnap efficiently identifies sets of randomly drawn SNPs that are matched to a set of query SNPs based on
- Minor allele frequency
- Number of SNPs in linkage disequilibrium (LD buddies)
- Distance to nearest gene
- Gene density
SNPsnap uses 1000 Genomes Project Phase 3 variants from three different ancestral cohorts (March 2015: SNPsnap was updated from 1000G Phase 1 to Phase 3 variants).
SNPsnap uses the 1000G Project's definition of the European and East Asian super populations (see below), but defines West Africa as a subset of the 1000G African samples.
Super populations are defined using the 1000G panel file.
Specifically, our database contains biallelic, uniquely mapped SNPs derived from preprocessing the Phase 3 genotype data.
SNPsnap's database only holds common variants (>1% MAF).
SNPsnap contains all types of variants listed by the 1000 Genomes Project: single nucleotide variants (SNPs), indels and larger structural variants (as assigned with esv-numbers).
SNPsnap contains SNPs located on chromosomes 1-22 and the X chromosome (March 2015: X chromosome included in SNPsnap).
Technical note: we preprocess all 1000G variants before building SNPsnap's database using the following QC criteria:
- Remove SNPs flagged by PLINK as merge conflicts when merging per-chromosome 1000G genotypes. See PLINK 1.9's documentation on merge failures for more information.
- Remove SNPs with duplicate rsIDs.
- Run PLINK variant filtering:
plink --bfile [prefix to population specific genotypes] --maf 0.01 --hwe 10e-6 --geno 0.1 --make-bed --out [SNPsnap QC PLINK files]
See PLINK 1.9's documentation for details on the maf, hwe and geno options used in this command.
- Remove SNPs with duplicate chromosomal coordinates.
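The two duplicate-removal steps in the list above can be illustrated with a small sketch. This is not SNPsnap's preprocessing code: the function name `drop_duplicates`, the tuple layout, and the choice to drop every copy of a duplicated variant (rather than keep the first one) are assumptions made for illustration. The PLINK filtering step that sits between the two removals in the real pipeline is omitted here.

```python
from collections import Counter


def drop_duplicates(variants):
    """Drop every variant whose rsID or chr:pos occurs more than once.

    `variants` is a list of (chrom, pos, rsid) tuples, e.g. parsed from a
    PLINK .bim file. Dropping all copies of a duplicate (rather than keeping
    the first) is an assumption of this sketch.
    """
    # Pass 1: remove variants sharing an rsID.
    id_counts = Counter(rsid for _, _, rsid in variants)
    unique_ids = [v for v in variants if id_counts[v[2]] == 1]

    # Pass 2: remove variants sharing a chromosomal coordinate.
    pos_counts = Counter((chrom, pos) for chrom, pos, _ in unique_ids)
    return [v for v in unique_ids if pos_counts[(v[0], v[1])] == 1]
```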
SNPsnap Super Population Definitions
SNPsnap's super population definitions are based on the following 1000G cohorts:
- European (EUR)
  - British in England and Scotland (GBR)
  - Finnish in Finland (FIN)
  - Iberian populations in Spain (IBS)
  - Toscani in Italy (TSI)
  - Utah residents with Northern and Western European ancestry (CEU)
- East Asian (EAS)
  - Chinese Dai in Xishuangbanna, China (CDX)
  - Han Chinese in Beijing, China (CHB)
  - Japanese in Tokyo, Japan (JPT)
  - Kinh in Ho Chi Minh City, Vietnam (KHV)
  - Southern Han Chinese, China (CHS)
- West Africa (WAFR)
  - Esan in Nigeria (ESN)
  - Yoruba in Ibadan, Nigeria (YRI)
  - Gambian in Western Division, The Gambia (GWD)
  - Mende in Sierra Leone (MSL)
SNPsnap Database Summary
After several QC steps on the 1000G data, SNPsnap's SNP database is built. The table below summarizes the number of SNPs in the database:
| Super Population | Number of SNPs in Database |
| --- | --- |
| EUR | 9,535,060 |
| EAS | 8,433,735 |
| WAFR | 16,191,783 |
SNPsnap Gene Sets
SNPsnap uses genes from the GENCODE consortium downloaded via Ensembl GRCh37 Biomart (Homo sapiens genes, GRCh37.p13) (March 2015: gene set updated to GENCODE genes).
SNPsnap uses all genes in the GENCODE gene set to define the Distance to Nearest Gene and Gene Density. Please note that not all of these genes are protein coding.
The SNPsnap GENCODE gene set contains 57,737 genes, of which 20,314 are protein coding. SNPsnap provides the annotation feature dist_nearest_gene_snpsnap_protein_coding as the distance to the nearest protein coding gene.
Finally, SNPsnap uses Ensembl's mapping to HGNC symbols whenever an HGNC symbol is provided.
- Minor allele frequency: SNPs are partitioned into minor allele frequency bins of 1-2%, 2-3%, ..., 49-50%.
- LD buddies: the number of "buddy" (or "proxy") SNPs in LD at various thresholds. SNPsnap currently offers LD buddy counts for the thresholds r² > 0.1, 0.2, ..., 0.9.
- Distance to nearest gene: the distance to the nearest 5’ start site using GENCODE gene coordinates. If the SNP is within a gene, the distance to that gene’s start site is used.
- Gene density: the number of genes in loci around the SNP, using LD (r² > 0.1, 0.2, ..., 0.9) and physical distance (100, 200, ..., 1000 kb) to define loci.
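As an illustration of the frequency binning described above, the following sketch assigns a MAF to its 1%-wide bin. The function name `maf_bin` is hypothetical, and the boundary handling (bins closed on the left, with exactly 50% placed in the 49-50% bin) is an assumption, since the boundary convention is not stated here.

```python
import math


def maf_bin(maf):
    """Return the (lower, upper) percent bounds of the 1%-wide MAF bin.

    `maf` is a fraction, e.g. 0.137 for 13.7%. Treating bins as closed on
    the left, and placing exactly 50% in the 49-50 bin, is an assumption.
    """
    if not 0.01 <= maf <= 0.5:
        raise ValueError("MAF outside the 1-50% range covered by the bins")
    # Round before flooring to avoid artifacts such as 0.29 * 100 = 28.999...
    lower = min(int(math.floor(round(maf * 100, 9))), 49)
    return lower, lower + 1
```

For example, `maf_bin(0.137)` falls in the 13-14% bin.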
The distributions of the genetic properties of the SNPs in the current version of SNPsnap's database are left-skewed (see the histograms below).
*The below plots are based on the older version of SNPsnap using 1000 Genomes Phase 1 variants from European ancestry.
The SNPsnap algorithm identifies matched SNPs as follows:
- Step 1: In five uniformly spaced increments, increase the allowable deviation for each of the genetic properties, ending with the prespecified maximum allowable deviation. For each increment, identify matching SNPs, defined as SNPs with genetic properties within the allowable deviations.
- Step 2: If there are at least as many matching SNPs as requested, sample without replacement the requested number of SNPs from the matching SNPs and proceed to step 5.
- Step 3: If the number of matching SNPs is less than the number of requested SNPs, increment the allowable deviation and return to step 1; if the maximum allowable deviation has been reached, proceed to step 4.
- Step 4: Sample with replacement from the matched SNPs identified in step 1.
- Step 5: Proceed to the next query SNP.
Incrementing the allowable deviation of the genetic properties ensures that the best matching SNPs are always used.
This makes it possible to set a large allowable deviation and still get the best matching SNPs.
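The incremental matching steps above can be sketched for a single query SNP as follows. This is a minimal illustration, not SNPsnap's implementation: the function name `match_snp`, the dictionary-based data layout, and the interpretation of the allowable deviation as a fraction of the query SNP's property value are all assumptions.

```python
import random


def match_snp(query, candidates, max_dev, n_requested, n_increments=5, rng=random):
    """Return matched SNP IDs for one query SNP.

    query:      dict mapping property name -> value for the query SNP
    candidates: dict mapping SNP ID -> dict of property values
    max_dev:    dict mapping property name -> maximum allowable deviation,
                expressed as a fraction of the query value (an assumption)
    """
    matches = []
    for step in range(1, n_increments + 1):
        scale = step / n_increments  # uniformly spaced increments
        matches = [
            snp_id
            for snp_id, props in candidates.items()
            if all(
                abs(props[p] - query[p]) <= scale * max_dev[p] * abs(query[p])
                for p in max_dev
            )
        ]
        if len(matches) >= n_requested:
            # Enough matches at this deviation: sample without replacement.
            return rng.sample(matches, n_requested)
    if not matches:
        return []
    # Maximum deviation reached with too few matches: sample with replacement.
    return [rng.choice(matches) for _ in range(n_requested)]
```

Because the loop returns at the first increment that yields enough matches, tightly matching SNPs are preferred even when a generous maximum deviation is configured.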
SNPsnap accepts chromosomal coordinates as SNP identifiers, e.g. 3:20145787 for a SNP on chromosome 3 at bp 20,145,787.
SNPsnap uses numeric codes 1-22 for autosomal chromosomes. The X chromosome is assigned the numeric code of 23 (following PLINK's encoding).
SNPsnap also accepts rs-numbers as assigned by the 1000 Genomes Project. Please note that not all variants in the 1000 Genomes Project have been assigned an rs-number; such variants can only be identified by their chromosomal coordinates.
We recommend using chromosomal coordinate identifiers for easier downstream processing of SNPsnap's output. SNPsnap is also speed-optimized for chromosomal coordinate inputs.
SNPsnap uses genome build GRCh37/hg19 for SNP coordinates.
Technical note: all input SNPs are internally mapped to chromosomal coordinate identifiers (chr:pos). The mapping can be inspected in the file input_snps_identifer_mapping.txt.
Input format
SNPsnap accepts a SNP list in the format below, either uploaded as a text file or pasted into the browser:
- One SNP per line
- SNP identifiers can consist of chromosomal coordinates, rs-numbers or a mixture of the two.
Please note that SNPs that are not formatted correctly, SNPs that do not exist in the database and SNPs excluded from matching will be written to the file input_snps_excluded.txt. Please inspect this file to make sure all your SNPs are formatted correctly.
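Before uploading, it can be useful to check a SNP list locally against the identifier formats described above. The sketch below is a hypothetical pre-check, not SNPsnap's own parser: the function name, the regular expressions, and the decision to accept only numeric chromosome codes 1-23 are assumptions.

```python
import re

# chr:pos with a numeric chromosome code, e.g. 3:20145787 (23 = X)
COORD_RE = re.compile(r"^(\d{1,2}):(\d+)$")
# rs-number, e.g. rs12345
RS_RE = re.compile(r"^rs\d+$")


def classify_snp_identifier(line):
    """Classify one input line as 'coord', 'rsid' or 'invalid'."""
    text = line.strip()
    m = COORD_RE.match(text)
    if m and 1 <= int(m.group(1)) <= 23:
        return "coord", (int(m.group(1)), int(m.group(2)))
    if RS_RE.match(text):
        return "rsid", text
    return "invalid", text
```

Lines classified as `invalid` here would be candidates for ending up in input_snps_excluded.txt.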
SNPsnap reports two scores that serve as guidelines for selecting proper matching settings.
- Insufficient-matches: the percentage of input SNPs for which SNPsnap is not able to identify the required number of matched SNPs
- Match-size: the median number of SNPs matched, expressed as a percentage, for the subset of input SNPs with insufficient matches
Note that the Match-size score is only relevant to consider if the Insufficient-matches score indicates many insufficient matches.
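To make the two scores concrete, the sketch below computes them from the number of matched SNPs found per input SNP. It is an illustration only: the function name `snpsnap_scores` is hypothetical, and expressing Match-size relative to the requested number of SNPs is an assumption about how the percentage is formed.

```python
from statistics import median


def snpsnap_scores(match_counts, n_requested):
    """Return (insufficient_matches_pct, match_size_pct).

    match_counts: number of matched SNPs identified for each input SNP.
    match_size_pct is None when no input SNP has insufficient matches.
    Dividing by n_requested is an assumption of this sketch.
    """
    insufficient = [c for c in match_counts if c < n_requested]
    insufficient_pct = 100.0 * len(insufficient) / len(match_counts)
    if not insufficient:
        return insufficient_pct, None
    match_size_pct = 100.0 * median(insufficient) / n_requested
    return insufficient_pct, match_size_pct
```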
For each score, the score value and an ordinal rating are reported to the user. The table below shows an example of the reported SNPsnap scores.
| SNPsnap score | Value | Rating |
| --- | --- | --- |
| Insufficient-matches | 34.50% | |
| Match-size | 91.83% | |
SNPsnap uses genome-wide significant loci from 63 traits and diseases from the GWAS Catalog (Hindorff et al., 2009) as a reference for the scoring system.
The ordinal scoring variable ("very poor", "poor", "ok", "good", "very good") corresponds to the scoring quintiles derived from the SNPsnap scores of the 63 phenotypes.
For example, an 'Insufficient-matches' rating of 'very good' ranks your SNPsnap query among the top 20% of scores observed for the 63 GWAS Catalog phenotypes.
Note that the scoring metrics will only be valid for similar matching settings as the default SNPsnap matching settings.
Matching bias is defined, for each genetic property, as the ratio of the mean value among the input SNPs to the mean value among the matched SNPs. SNPsnap reports these values as part of the matching result to provide a guideline for selecting the proper matching settings.
The figure below shows the matching bias for each of the genetic properties as a function of the number of requested matched SNPs (using the default matching criteria). The matching bias may be explained by the left skewness of the distribution of the genetic properties.
Large allowable deviation of the genetic properties and a large number of requested SNPs are the primary causes for the matching bias. Bias of the matched SNPs may be reduced or eliminated by lowering one or both of these parameters.
We recommend reducing any matching bias because it may hamper downstream genetic enrichment analysis.
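The matching bias described above can be computed per genetic property with a few lines of code. This sketch assumes the ratio is taken as input mean over matched mean; the function name `matching_bias`, the property names, and the record layout are hypothetical.

```python
from statistics import mean


def matching_bias(input_snps, matched_snps, properties):
    """Return {property: mean(input) / mean(matched)} for each property.

    input_snps and matched_snps are lists of dicts mapping property names
    to numeric values. A ratio close to 1 indicates little matching bias.
    """
    return {
        p: mean(s[p] for s in input_snps) / mean(s[p] for s in matched_snps)
        for p in properties
    }
```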
*The below plots are based on the older version of SNPsnap using 1000 Genomes Phase 1 variants from European ancestry.
The input SNPs to SNPsnap should be independent if enrichment analysis is the goal of downstream investigations. Failing to input independent SNPs will lead to unintended bias in the matched SNPs, which may cause an improper background distribution.
SNPsnap lets users check the independence of input SNPs by selecting Report input loci independence. SNPs can be clumped based on LD and physical distance thresholds.
SNPsnap's clumping function is a wrapper around the PLINK 1.9 (Chang et al., GigaScience, 2015) greedy algorithm for clumping SNPs. See PLINK 1.9's documentation for details.
SNPsnap will report one of the following messages on the results page, depending on the outcome of the clumping:
- Your input SNPs are independent
- Your input SNPs are not independent
Technical note:
SNPsnap writes a temporary .assoc file (two column file with the field headers "SNP" and "P") for the input SNPs.
The values in the "P" column are set to a fixed value for all input SNPs.
Since all the p-values have the same value, the clumping is based on the input order of the SNPs, which makes the index SNPs rather arbitrary. We use the following command:
plink --bfile [prefix to population specific genotypes] --clump [tmp assoc file] --clump-r2 [user_r2] --clump-kb [user_kb]
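The technical note above can be sketched as two small helpers: one that formats the temporary .assoc content and one that assembles the PLINK command line. The helper names, the tab separator, and the fixed p-value of 0.0001 are assumptions made for illustration; only the PLINK flags shown in the command above are used.

```python
import shlex


def format_assoc(snp_ids, p_value=0.0001):
    """Return .assoc file content with "SNP" and "P" columns.

    Every SNP gets the same p-value (the value itself is an assumption),
    so clumping order follows the input order.
    """
    lines = ["SNP\tP"]
    lines.extend(f"{snp}\t{p_value}" for snp in snp_ids)
    return "\n".join(lines) + "\n"


def clump_command(bfile_prefix, assoc_path, r2, kb):
    """Build the PLINK clumping command as a shell-safe string."""
    args = [
        "plink",
        "--bfile", bfile_prefix,
        "--clump", assoc_path,
        "--clump-r2", str(r2),
        "--clump-kb", str(kb),
    ]
    return " ".join(shlex.quote(a) for a in args)
```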
- matched_snps.txt
Primary output of SNPsnap; a matrix with dimension [N_input_SNPs x (Requested number of SNPs + 1)]. The first column lists the input SNPs. Each of the following columns contains a matched set of SNPs.
- input_snps_excluded.txt
File listing the input SNPs excluded from analysis. There may be two reasons for an input SNP being excluded: 1) the SNP does not exist in the SNPsnap database; 2) the SNP maps to the HLA region and the Exclude HLA SNPs option is enabled. The first column in the file is the input SNP identifier; the second column is the reason for exclusion.
- input_snps_annotated.txt (if Annotate input SNPs selected)
File containing annotation of input SNPs (see SNP Annotations for details).
- input_snps_clumped.txt (if Report input loci independence selected)
File listing the clumping of the input SNPs. The first column is the index SNP identifier of the clumped locus; the second column is the count of clumped SNPs; the third column is a comma-separated list of clumped SNPs.
- input_snps_identifer_mapping.txt
File useful for mapping to/from chromosomal coordinates and rs-number identifiers. Note that some SNPs do not have rs-numbers. In such cases the chromosomal coordinate identifier is used instead.
- matched_snps_annotated.txt (if Annotate matched SNPs selected)
File containing annotation of matched SNPs (see SNP Annotations for details).
- snpsnap_summary.txt
A summary of the input parameters to SNPsnap and the SNPsnap scores.
All files are tab delimited with the .txt file extension and can be imported directly into Excel.
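Because the output files are tab delimited, they can also be read with a few lines of code. The sketch below loads a file shaped like matched_snps.txt into a dictionary keyed by input SNP. Whether the real file starts with a header row is an assumption (controlled by `has_header`), and the column names in the example data are made up.

```python
import csv
import io


def read_matched_snps(handle, has_header=True):
    """Map each input SNP (first column) to its list of matched SNPs."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row, if present (assumption)
    return {row[0]: row[1:] for row in reader if row}


# Example with in-memory data; for a real file use open("matched_snps.txt")
example = io.StringIO("snp\tset_1\tset_2\n1:100\t2:5\t3:7\n")
matched = read_matched_snps(example)
```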
SNPsnap currently supports the following annotations of SNPs in the file input_snps_annotated.txt:
| # | Column Name | Column Description |
| --- | --- | --- |
| 1 | snpID | Chromosomal coordinates of the SNP |
| 2 | rsID | Rs-number of the SNP (note that in some cases the rs-number is identical to the chromosomal coordinate) |
| 3 | freq_bin | Frequency bin of the SNP using 1-2, 2-3, ..., 49-50% strata |
| 4 | snp_maf | Minor Allele Frequency (MAF) for the SNP |
| 5 | gene_count | The number of genes in the locus (gene density) |
| 6 | dist_nearest_gene_snpsnap | Distance to the start site of the nearest gene the SNP is located within. If this distance is not defined ("inf"), the distance to the start site of the nearest gene is used. That is, dist_nearest_gene_snpsnap is equal to either dist_nearest_gene or dist_nearest_gene_located_within. Distance is in base pairs |
| 7 | dist_nearest_gene_snpsnap_protein_coding | Same as dist_nearest_gene_snpsnap but distance to the nearest protein coding gene start site |
| 8 | dist_nearest_gene | Distance to the start site of the nearest gene. Distance is in base pairs |
| 9 | dist_nearest_gene_located_within | Distance to the start site of the nearest gene the SNP is located within. Distance is in base pairs. If the SNP is not located within any gene, the distance is "inf" |
| 10 | loci_upstream | Upstream locus boundary |
| 11 | loci_downstream | Downstream locus boundary |
| 12 | ID_nearest_gene_snpsnap | Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_snpsnap |
| 13 | ID_nearest_gene_snpsnap_protein_coding | Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_snpsnap_protein_coding |
| 14 | ID_nearest_gene | Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene |
| 15 | ID_nearest_gene_located_within | Ensembl Gene ID for the gene used in the calculation of dist_nearest_gene_located_within. Note that this column may be empty if dist_nearest_gene_located_within has the value "inf" |
| 16 | HGNC_nearest_gene_snpsnap | HGNC Symbol for the gene used in the calculation of dist_nearest_gene_snpsnap. Note that this column may be empty if Ensembl does not provide a mapping from ENSG_ID to HGNC Symbol |
| 17 | HGNC_nearest_gene_snpsnap_protein_coding | HGNC Symbol for the gene used in the calculation of dist_nearest_gene_snpsnap_protein_coding. Note that this column may be empty if Ensembl does not provide a mapping from ENSG_ID to HGNC Symbol |
| 18 | flag_snp_within_gene | Indicates whether the SNP is located within a gene. Value can be True or False |
| 19 | flag_snp_within_gene_protein_coding | Indicates whether the SNP is located within a protein coding gene. Value can be True or False |
| 20 | ID_genes_in_matched_locus | Ensembl Gene IDs of genes overlapping the locus boundaries. Note that this column may be empty if gene_count has the value "0" |
| 21 | friends_ld01 | Number of LD buddies using cutoff r² > 0.1 |
| 22 | friends_ld02 | Number of LD buddies using cutoff r² > 0.2 |
| 23 | friends_ld03 | Number of LD buddies using cutoff r² > 0.3 |
| 24 | friends_ld04 | Number of LD buddies using cutoff r² > 0.4 |
| 25 | friends_ld05 | Number of LD buddies using cutoff r² > 0.5 |
| 26 | friends_ld06 | Number of LD buddies using cutoff r² > 0.6 |
| 27 | friends_ld07 | Number of LD buddies using cutoff r² > 0.7 |
| 28 | friends_ld08 | Number of LD buddies using cutoff r² > 0.8 |
| 29 | friends_ld09 | Number of LD buddies using cutoff r² > 0.9 |
The matched_snps_annotated.txt file contains the same columns as described for input_snps_annotated.txt, but annotating the matched SNP.
The file contains the following two additional leading columns:
| # | Column Name | Column Description |
| --- | --- | --- |
| 1 | set | Index of the matched SNP group. The number of set indices is equal to the number of requested matched SNPs |
| 2 | input_snp | The chromosomal coordinates of the input SNP that was matched |
| 3...31 | ... | ... same as for input_snps_annotated.txt, but the annotation is for the matched SNP |