Associate Editor: Martin Bishop
Motivation: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers.
Results: We devised an improved algorithm for tagSNP selection using the pairwiser2 criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data.
Availability: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at
Contact: [email protected]
With the rapid improvement of high-throughput genotyping technologies, genome-wide association studies are emerging as a promising approach to detect genetic variants that contribute to human diseases. Initially, genome-wide association studies will focus on single nucleotide polymorphisms (SNPs) because of their high abundance in the human genome, their low mutation rates and their accessibility to high-throughput genotyping (Collinset al., 1997). There are more than 10 million verified SNPs in dbSNP (build 124) (Sachidanandamet al., 2001), but typing all available SNP markers is inefficient and not necessary since many will provide redundant information due to linkage disequilibrium (LD). A better strategy is to select a subset of representative SNPs (tagging SNPs or tagSNPs) and to remove the rest from consideration (Johnsonet al., 2001;Cardon and Abecasis, 2003). The objective is to have little information overlap among the selected SNPs while retaining much of the signal contained in the original set.
The selection of tagSNPs has become a very active research topic and many strategies have been proposed (Patilet al., 2001;Zhanget al., 2002;Gabrielet al., 2002;Johnsonet al., 2001;Menget al., 2003;Sebastianiet al., 2003;Avi-Itzhaket al., 2003;Ke and Cardon, 2003;Goldsteinet al., 2003;Stram, 2003;Hampeet al., 2003;Chapmanet al., 2003;Lin and Altman, 2004;Halldórssonet al., 2004;Rinaldoet al., 2005). Recently,Zhang and Jin (2003) andCarlsonet al. (2004) introduced methods based on the LD measurer2. These methods search for a small set of SNPs that are in strong LD (measured through pairwiser2) with other SNPs that are not selected for genotyping. Pairwiser2 is an attractive criterion for tagSNP selection since it is closely related to statistical power for case–control association studies, where a directly associated SNP is replaced with an indirectly associated tagSNP (Pritchard and Przeworski, 2001).
In this manuscript, we describe efficient algorithms for tagSNP selection based on pairwise LD measurer2. The algorithms were implemented in a computer program named FESTA (fragmented exhaustive search for tagging SNPs). Essentially, we replace a greedy search, where markers are added sequentially to the tagSNP set, with an exhaustive search where all marker combinations are evaluated. To achieve this, we arrange the genome into precincts of markers in high LD, such that markers in different precincts show only low pairwise disequilibrium. TagSNP selection can then be performed within each precinct independently, greatly reducing computation cost. In most settings, our method is guaranteed to find the optimal tagSNP set(s) defined by ther2 criterion. For a small proportion of precincts where exhaustive search is computationally too expensive to carry out, an efficient greedy-exhaustive hybrid search algorithm is described. Using data from the HapMap project (The International HapMap Consortium, 2003), we show that the majority of these precincts contain relatively small numbers of SNPs, especially when a stringentr2 criterion is used. Our algorithm readily identifies equivalent tagSNP sets, so that additional selection criteria can be incorporated. Other useful extensions are also discussed in this manuscript, such as the inclusion/exclusion of certain SNPs and double coverage, which can increase robustness of tagSNP sets against sporadic genotyping failures or errors.
Consider a set which containsM bi-allelic SNP markersa1, a2,…,aM. Further assume that all these markers have minor allele frequency (MAF) above a certain threshold (0.05 was used in this study). First, two-SNP haplotype frequencies were estimated (Hill, 1974), and then the pairwise LD measurer2 (also referred to as ‘Δ2') (Devlin and Risch, 1995) was calculated for each pair of markers using the inferred haplotype frequencies (Hill and Robertson, 1968). Two markersai andaj are said to be in strong LD if ther2 between them is greater than a pre-specified threshold valuer0, denoted asr2(ai,aj) ≥r0 (r0 = 0.5 or 0.8 in in this study). Both are considered tagSNPs for each other; in thatai can be used as a surrogate foraj, or vice versa.
Our aim is to a find tagSNP set, denoted byT, a subset of such that, ∃aj ∈T that satisfiesr2(ai,aj) ≥r0. In our presentation, we introduce two intermediate SNP sets,P andQ.P is called thecandidate set which contains all the markers that are eligible to be chosen as tagSNPs andQ is named thetarget set which contains all the markers that are yet to be tagged, i.e. no marker inQ is in LD with any tagSNP inT. For each markeram inP, letC(am) = {a :a∈Q andr2(a,am) ≥r0} represent the subset ofQ which contains markers that are in strong LD witham, and let |C(am)| be the number of the elements in the setC(am). Typically, the candidate setP is the complement of the tagSNP setT, andP =Q. One exception occurs when some SNPs are excluded as tagSNPs because they cannot be easily genotyped, but they still should be tagged by other markers if possible. In this case, the candidate set is a subset of target set. We describe several different algorithms for updatingP,Q andT starting with a greedy approach (Carlsonet al., 2004). We then outline successive refinements and extensions of a partition and exhaustive search algorithm, designed to handle various scenarios encountered when planning association studies.
The detailed algorithm is as follows (Carlsonet al., 2004).
Algorithm 1 (greedy approach):In Step 4, by removing associated markers from consideration, the coverage overlap among tagSNPs is greatly reduced. Although it is simple to implement, the greedy procedure may miss more efficient solutions.Figure 1 gives a simple example, where markersA andB each tag half of all markers and together can tag all the markers. However, markerC is connected to more than half of all markers, and it is the first marker selected by the greedy algorithm. In this example, the greedy algorithm produced a set with three tagSNPs, despite the fact that the optimal solution contains onlyA andB.
SetT = ∅ andP =Q =S;
For each markeram inP, calculate |C(am)|;
For every markeram where |C(am)| = 0, addam intoT, and remove it fromQ;
Find the marker inP that has the highest |C(am)| value, denoted asamax, and addamax intoT, removing it and all connected SNPs, i.e.C(am) fromQ;
Repeat Steps 2–4 untilQ = ∅.
An example when the greedy approach does not provide the smallest tagSNP set.
An exhaustive search guarantees the minimum tagSNP set. Therefore, theoretically, the exhaustive search solves the tagSNP selection problem. But in practice, genome-wide tagSNP selection requires consideration of hundreds of thousands of SNP markers. For problem of this scale, exhaustive searches cannot be directly applied due to prohibitive computation costs.
Since appreciable LD only occurs within clusters of nearby markers along chromosomes, a practical solution is to first decompose the set of markers into disjoint precincts, such that markers in different precincts are never in strong LD. Then, selecting tagSNPs using ther2 criterion in the whole set is equivalent to selecting tagSNPs in each precinct and then combining all the tagSNPs together. Here the concept of precinct is defined based on pairwise LD measure. It is therefore closely related to haplotype blocks (Reichet al., 2001;Patilet al., 2001;Dalyet al., 2001;Jeffreyset al., 2001;Gabrielet al., 2002;Dawsonet al., 2002), which are regions where historical recombination events are rare. The main difference is that the precincts of markers in high LD are determined purely on genetic distance. Unlike haplotype block, markers within each precinct may not be consecutive markers sitting next to each other.
Partitioning the markers into precincts can be achieved using standard algorithms in graph theory. We applied the Breadth First Search (BFS) algorithm (Cormenet al., 2001). Starting from any node (a marker) in a new precinct, this algorithm adds all neighboring nodes (markers in LD) and all neighbors of the newly added nodes to the precinct, until there are no neighbors to be added to the precinct. This process is restarted from different nodes until all the nodes are assigned a precinct.
After the partitioning step, we perform the tagSNP selection within each precinct. Starting withK = 1, allK-marker combinations are searched to see if they cover all markers within this precinct. If not,K is increased by one and the search is repeated until a tagSNP set is found or a pre-specified search limit is reached.
When evaluating allK-marker combinations, the computation cost required for an exhaustive search might be too great in some precincts. In such cases, we propose a hybrid solution which reduces the computation cost and retains a good chance of finding optimal tagSNP sets. For each precincti withNi markers (Here on, all parameters with subscripti indicate parameters within thei-th precincts, such asKi,Ji,Pi,Qi,Ti andNi.), we decide whether an exhaustive search is feasible by comparing the computation cost required for evaluating allK-marker combinations within a precinct,, with a computation cost limitL specified a priori, determined based on available computing resources. Larger limits allow a more comprehensive search, which may result in fewer tagSNPs being selected, but require additional computational effort. In this study, we set this limit at 1 million. When this limit is exceeded, we apply the following hybrid algorithm. Specify such that it is the largestK possible that satisfies, whereL0 is a pre-specified computation cost limit (less thanL, set at 10 000 in studies conducted here). Subsequently, for each-marker combinations, denoted as, assume that these markers have already been selected, removeam together with all the markers in from candidate setPi and target setQi,, i.e. then apply the greedy approach to identify a subset ofPi that is able to coverQi, which contains the remaining untagged markers. The tagSNP set obtained in the reduced set plus the previous markers together form a complete tagSNP set for thei-th precinct. The detailed algorithm is as follows:
Algorithm 2 (FESTA: greedy-exhaustive hybrid search):FESTA executes either a ‘pure exhaustive search’ or a ‘greedy-exhaustive hybrid search’ in each precinct depending on the computational cost. Exhaustive search is first attempted, and if the computation cost becomes too high, the hybrid algorithm is used as a fallback. Typically, only a small proportion of the precincts require the ‘greedy-exhaustive hybrid search’.
Apply the Breadth First Search to decompose the entire set of markers into precincts such that high LD can only be observed within precincts., and;
Within each precinct, setK = 1,
If, move to b, otherwise, conduct an exhaustive search over all possibleK-marker combinations. Both the candidate setPi and the target setQi is. If no combination ofK SNPs can cover the entire precinct, setK =K + 1, and repeat this step;
Find such that and. For every-marker combination in, denoted as, let,, and apply the greedy approach to identify a subset ofPi that is able to cover the remaining untagged markersQi. Among all the resulting tagSNP sets, we choose the smallest.
Record all minimum tagSNP sets that cover the precinct. These form the complete minimum tagSNP sets, whereJi is the total number of such minimum tagSNP sets.
Any combination of tagSNP sets identified from all disjoint precincts forms a tagSNP set for the whole set. Suppose the size of the minimum tagSNP set(s) in each precinct isKi, then the overall size of such minimum tagSNP sets is, and the total number of such minimum tagSNP sets is.
So far, both the greedy approach and our FESTA algorithm focus on finding a tagSNP set such that each SNP is either a tagSNP itself or is in LD with at least one of the tagSNPs. This is a criterion aimed at minimizing the number of tagSNPs selected. In reality, random genotyping failure or genotyping error on these tagSNPs can result in loss of power to identify the true signal. To be more robust against such adverse events, we evaluated a more stringent criterion requiring that every untyped SNP be in LD with at least two tagSNPs.
Our FESTA algorithm can be extended to find tagSNP sets that will have double coverage on the SNP markers considered. As always, an exhaustive search is able to find such tagSNP sets when the marker set considered is not too large. When exhaustive search is not feasible, the same greedy-exhaustive hybrid search strategy can be applied. The detailed FESTA double coverage algorithm can be found in the Supplementary Online Material. Note that in practice, it may be useful to consider double coverage only for large precincts, where the cost of losing an SNP to genotyping failure might be higher.
Our algorithm readily allows users to force certain SNP markers to be included in or excluded from the tagSNP set. There are several scenarios where such functionality is important. First, in candidate gene studies, previous knowledge may be available as to which SNPs are functionally important. These might include non-synonymous coding region SNPs (cSNPs) as well as SNPs located in regulatory regions. Second, in genome-wide studies, one might carry out multiple rounds of genotyping and tagSNP selection. In such cases, additional tagSNPs could be selected at each round to cover the markers not tagged by tagSNPs successfully genotyped in the previous round. In other settings, it may be useful to exclude certain SNPs from consideration as tags. For example, some SNP markers may be difficult to genotype using a particular platform.
When there are mandatory markerst1,t2,…,tr, to be included, put these markers into the tagSNP setT and remove them from the candidate set, e.g.P becomes. The target setQ becomes. If there are SNPsu1,u2,…,us that need to be excluded from the tagSNP set, remove them from the candidate setP, the target setQ is unchanged.
Within a densely typed SNP set, redundant tagSNPs are common, which results in multiple tagSNP sets of the same size. All of these sets are equal in the sense of minimizing the number of tagSNPs. In order to choose one set for genotyping, additional criteria can be entertained. Here are examples of such additional criteria:In Criteria 1 and 2, we try to identify tagSNP sets whose members have the strongest connections with those untagged SNPs. Since pairwiser2 between disease locus and marker loci is closely related to statistical power of detecting association, this strategy should result in increased power on average and in the worst case, respectively. These criteria are recommended in regular association study designs. The purpose of using Criterion 3 is to find a tagSNP set whose members are as independent as possible which minimizes overlap between covered SNPs of different tagSNPs and potentially increases the chance of linking to untyped SNPs. This strategy is particularly useful if one suspects the actual disease locus is not among the marker loci genotyped. To evaluate the potential of uncovering the disease-causing mutations in association studies among tagSNP sets identified by the aforementioned criteria, we conducted some empirical evaluations, summarized in the Results section.
Maximize averager2 between tagSNPs and untagged SNPs they represent;
Maximize the lowestr2 between tagSNPs and the untagged SNPs they connect to;
Minimize the averager2 among all pairs of tagSNPs within a precinct.
Other types of criteria may be of even greater interest in practice. For example, in many genotyping technologies, some SNPs are harder to genotype than others due to the characteristics of surrounding genome sequence. We can use this information to select tagSNPs that are likely to have a high success rate and to avoid SNPs that are prone to genotyping failure.
In order to illustrate our proposed piecewise exhaustive search strategy, compare it with the greedy approach and explore the various characteristics of the tagSNP sets selected by our method, we applied both methods to two sets of data, the entire Chromosome 2 and five ENCODE regions (ENr112, ENr131, ENr113, ENm010 and ENm013) genotyped by the HapMap project (release 16c, June 2005). All three populations: CEU (European), YRI (Yoruban) and JPT + CHB (Japanese and Chinese) were studied. The first is in the context of a genome-wide association study and the second is similar to the situation of a candidate region study.
We have applied the greedy algorithm and FESTA to Chromosome 2 using HapMap Phase 1 genotype data (release 16c, June 2005).Table 1 (r2 threshold of 0.5) and Table S1 (r2 threshold of 0.8) summarize the results. FESTA produces less tagSNPs compared with the greedy approach in all three populations. When compared across populations, the YRI samples have about twice the amount of tagSNPs as the CEU or the JPT + CHB samples. The JPT + CHB samples have slightly less tagSNPs identified than the CEU samples. Withr2 threshold 0.5, the percentages of tagSNPs identified by our new algorithm are 21.6% in CEU, 39.3% in YRI and 20.9% in JPT + CHB samples, respectively.
Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct
CEU | YRI | JPT + CHB | |
---|---|---|---|
No. of SNPs | 64 801 | 69 630 | 57 810 |
r2 ≥ 0.5 | |||
No. of precincts | 11 786 | 24 752 | 10 248 |
No. of tagSNPs (Greedy) | 14 384 | 27 804 | 12 454 |
No. of tagSNPs (FESTA) | 13 983 | 27 379 | 12 108 |
No. of tagSNPs (FESTA, double cover) | 23 644 | 41 668 | 20 644 |
r2 ≥ 0.8 | |||
No. of precincts | 23 426 | 41 079 | 20 178 |
No. of tagSNPs (Greedy) | 24 300 | 41 729 | 21 044 |
No. of tagSNPs (FESTA) | 24 176 | 41 664 | 20 963 |
No. of tagSNPs (FESTA, double cover) | 35 824 | 54 101 | 31 463 |
CEU | YRI | JPT + CHB | |
---|---|---|---|
No. of SNPs | 64 801 | 69 630 | 57 810 |
r2 ≥ 0.5 | |||
No. of precincts | 11 786 | 24 752 | 10 248 |
No. of tagSNPs (Greedy) | 14 384 | 27 804 | 12 454 |
No. of tagSNPs (FESTA) | 13 983 | 27 379 | 12 108 |
No. of tagSNPs (FESTA, double cover) | 23 644 | 41 668 | 20 644 |
r2 ≥ 0.8 | |||
No. of precincts | 23 426 | 41 079 | 20 178 |
No. of tagSNPs (Greedy) | 24 300 | 41 729 | 21 044 |
No. of tagSNPs (FESTA) | 24 176 | 41 664 | 20 963 |
No. of tagSNPs (FESTA, double cover) | 35 824 | 54 101 | 31 463 |
Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct
CEU | YRI | JPT + CHB | |
---|---|---|---|
No. of SNPs | 64 801 | 69 630 | 57 810 |
r2 ≥ 0.5 | |||
No. of precincts | 11 786 | 24 752 | 10 248 |
No. of tagSNPs (Greedy) | 14 384 | 27 804 | 12 454 |
No. of tagSNPs (FESTA) | 13 983 | 27 379 | 12 108 |
No. of tagSNPs (FESTA, double cover) | 23 644 | 41 668 | 20 644 |
r2 ≥ 0.8 | |||
No. of precincts | 23 426 | 41 079 | 20 178 |
No. of tagSNPs (Greedy) | 24 300 | 41 729 | 21 044 |
No. of tagSNPs (FESTA) | 24 176 | 41 664 | 20 963 |
No. of tagSNPs (FESTA, double cover) | 35 824 | 54 101 | 31 463 |
CEU | YRI | JPT + CHB | |
---|---|---|---|
No. of SNPs | 64 801 | 69 630 | 57 810 |
r2 ≥ 0.5 | |||
No. of precincts | 11 786 | 24 752 | 10 248 |
No. of tagSNPs (Greedy) | 14 384 | 27 804 | 12 454 |
No. of tagSNPs (FESTA) | 13 983 | 27 379 | 12 108 |
No. of tagSNPs (FESTA, double cover) | 23 644 | 41 668 | 20 644 |
r2 ≥ 0.8 | |||
No. of precincts | 23 426 | 41 079 | 20 178 |
No. of tagSNPs (Greedy) | 24 300 | 41 729 | 21 044 |
No. of tagSNPs (FESTA) | 24 176 | 41 664 | 20 963 |
No. of tagSNPs (FESTA, double cover) | 35 824 | 54 101 | 31 463 |
The size of the tagSNP set is optimal for precincts where the greedy approach indicates that one or two tagSNPs are enough to cover all the SNPs in it. Improvements over the greedy approach is only possible for the remaining precincts. In the CEU samples, there are 599 of such precincts, in which the greedy approach identified 2423 tagSNPs, and FESTA identified 2022, a 16.5% reduction. When ther2 threshold is 0.8, 154 precincts require more than two tagSNPs, as identified by the greedy approach. Among them, the greedy approach and FESTA identified 526 and 402 tagSNPs, respectively. The reduction rate is 23.6%. All the detailed results are summarized inTable 2 (r2 threshold of 0.5) and S1 (r2 threshold of 0.8). When double coverage is required, 69.1 and 45.9% more tagSNPs are needed withr2 thresholds of 0.5 and 0.8, respectively. Similar results were obtained from the YRI and JPT + CHB samples.
Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr2 threshold of 0.5)
CEU | YRI | JPT+CHB | ||||
---|---|---|---|---|---|---|
Greedy | FESTA | Greedy | FESTA | Greedy | FESTA | |
Singletona | 5241 | 5241 | 15 079 | 15 079 | 4416 | 4416 |
1 | 5172 | 5172 | 8096 | 8096 | 4660 | 4660 |
2 | 774 | 911 | 924 | 1070 | 634 | 770 |
3 | 318 | 278 | 355 | 291 | 312 | 250 |
4 | 144 | 99 | 127 | 100 | 113 | 90 |
5 | 59 | 42 | 73 | 53 | 60 | 30 |
6 | 27 | 18 | 36 | 28 | 16 | 15 |
7 | 17 | 17 | 21 | 10 | 14 | 6 |
8 | 16 | 4 | 10 | 8 | 11 | 6 |
9 | 11 | 1 | 6 | 3 | 4 | 4 |
10+ | 7 | 3 | 25 | 14 | 8 | 1 |
Total | 14 384 | 13 983 | 27 804 | 27 379 | 12 454 | 12 108 |
CEU | YRI | JPT+CHB | ||||
---|---|---|---|---|---|---|
Greedy | FESTA | Greedy | FESTA | Greedy | FESTA | |
Singletona | 5241 | 5241 | 15 079 | 15 079 | 4416 | 4416 |
1 | 5172 | 5172 | 8096 | 8096 | 4660 | 4660 |
2 | 774 | 911 | 924 | 1070 | 634 | 770 |
3 | 318 | 278 | 355 | 291 | 312 | 250 |
4 | 144 | 99 | 127 | 100 | 113 | 90 |
5 | 59 | 42 | 73 | 53 | 60 | 30 |
6 | 27 | 18 | 36 | 28 | 16 | 15 |
7 | 17 | 17 | 21 | 10 | 14 | 6 |
8 | 16 | 4 | 10 | 8 | 11 | 6 |
9 | 11 | 1 | 6 | 3 | 4 | 4 |
10+ | 7 | 3 | 25 | 14 | 8 | 1 |
Total | 14 384 | 13 983 | 27 804 | 27 379 | 12 454 | 12 108 |
aSingleton means the precinct only contains one SNP. In other words, singleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than a threshold) with any other SNP in the entire set. Such an SNP, by definition, is one of the tagSNPs.
Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr2 threshold of 0.5)
CEU | YRI | JPT+CHB | ||||
---|---|---|---|---|---|---|
Greedy | FESTA | Greedy | FESTA | Greedy | FESTA | |
Singletona | 5241 | 5241 | 15 079 | 15 079 | 4416 | 4416 |
1 | 5172 | 5172 | 8096 | 8096 | 4660 | 4660 |
2 | 774 | 911 | 924 | 1070 | 634 | 770 |
3 | 318 | 278 | 355 | 291 | 312 | 250 |
4 | 144 | 99 | 127 | 100 | 113 | 90 |
5 | 59 | 42 | 73 | 53 | 60 | 30 |
6 | 27 | 18 | 36 | 28 | 16 | 15 |
7 | 17 | 17 | 21 | 10 | 14 | 6 |
8 | 16 | 4 | 10 | 8 | 11 | 6 |
9 | 11 | 1 | 6 | 3 | 4 | 4 |
10+ | 7 | 3 | 25 | 14 | 8 | 1 |
Total | 14 384 | 13 983 | 27 804 | 27 379 | 12 454 | 12 108 |
CEU | YRI | JPT+CHB | ||||
---|---|---|---|---|---|---|
Greedy | FESTA | Greedy | FESTA | Greedy | FESTA | |
Singletona | 5241 | 5241 | 15 079 | 15 079 | 4416 | 4416 |
1 | 5172 | 5172 | 8096 | 8096 | 4660 | 4660 |
2 | 774 | 911 | 924 | 1070 | 634 | 770 |
3 | 318 | 278 | 355 | 291 | 312 | 250 |
4 | 144 | 99 | 127 | 100 | 113 | 90 |
5 | 59 | 42 | 73 | 53 | 60 | 30 |
6 | 27 | 18 | 36 | 28 | 16 | 15 |
7 | 17 | 17 | 21 | 10 | 14 | 6 |
8 | 16 | 4 | 10 | 8 | 11 | 6 |
9 | 11 | 1 | 6 | 3 | 4 | 4 |
10+ | 7 | 3 | 25 | 14 | 8 | 1 |
Total | 14 384 | 13 983 | 27 804 | 27 379 | 12 454 | 12 108 |
aSingleton means the precinct only contains one SNP. In other words, singleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than a threshold) with any other SNP in the entire set. Such an SNP, by definition, is one of the tagSNPs.
Among all the non-singleton precincts in the CEU samples (6545 forr2 threshold of 0.5 and 10196 forr2 threshold of 0.8), most require only a small number of tagSNPs, so that the exhaustive search can be applied directly. Withr2 threshold of 0.5, the greedy-exhaustive hybrid approach was required for only 98 precincts or 1.5% of all precincts (11 precincts (0.1%) withr2 thershold of 0.8).
A very dense SNP map was recently released by the HapMap project on the ENCODE regions. We used five such regions (ENr112, ENr131, ENr113, ENm010 and ENm013) to evaluate the performance of our algorithm. Each ENCODE regions is 500 kb in length, for the CEU samples, the average number of SNPs in these regions is 832 (ranges from 551 to 1126), corresponding to an SNP density about 1 SNP per 601 bps (1 SNP per 907 bps to 1 SNP per 444 bps for individual regions). The detailed results were summarized inTable 3. Detailed results for the YRI and JPT + CHB samples can be found in Supplementary Tables S2 and S3.
Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
No. of SNPs | 863 | 988 | 1061 | 539 | 708 |
r2 ≥ 0.5 | |||||
No. of precincts | 55 | 78 | 43 | 44 | 26 |
No. of singletonsa | 23 | 31 | 16 | 16 | 11 |
No. of tagSNPs (Greedy) | 81 | 110 | 71 | 66 | 41 |
No. of tagSNPs (FESTA) | 75 | 105 | 67 | 61 | 38 |
No. of tagSNPs (FESTA double cover) | 128 | 183 | 123 | 109 | 67 |
r2 ≥ 0.8 | |||||
No. of precincts | 134 | 184 | 131 | 125 | 72 |
No. of singletonsa | 63 | 81 | 62 | 61 | 25 |
No. of tagSNPs (Greedy) | 152 | 197 | 142 | 131 | 83 |
No. of tagSNPs (FESTA) | 146 | 193 | 141 | 130 | 81 |
No. of tagSNPs (FESTA, double cover) | 237 | 311 | 229 | 204 | 139 |
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
No. of SNPs | 863 | 988 | 1061 | 539 | 708 |
r2 ≥ 0.5 | |||||
No. of precincts | 55 | 78 | 43 | 44 | 26 |
No. of singletonsa | 23 | 31 | 16 | 16 | 11 |
No. of tagSNPs (Greedy) | 81 | 110 | 71 | 66 | 41 |
No. of tagSNPs (FESTA) | 75 | 105 | 67 | 61 | 38 |
No. of tagSNPs (FESTA double cover) | 128 | 183 | 123 | 109 | 67 |
r2 ≥ 0.8 | |||||
No. of precincts | 134 | 184 | 131 | 125 | 72 |
No. of singletonsa | 63 | 81 | 62 | 61 | 25 |
No. of tagSNPs (Greedy) | 152 | 197 | 142 | 131 | 83 |
No. of tagSNPs (FESTA) | 146 | 193 | 141 | 130 | 81 |
No. of tagSNPs (FESTA, double cover) | 237 | 311 | 229 | 204 | 139 |
aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.
Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
No. of SNPs | 863 | 988 | 1061 | 539 | 708 |
r2 ≥ 0.5 | |||||
No. of precincts | 55 | 78 | 43 | 44 | 26 |
No. of singletonsa | 23 | 31 | 16 | 16 | 11 |
No. of tagSNPs (Greedy) | 81 | 110 | 71 | 66 | 41 |
No. of tagSNPs (FESTA) | 75 | 105 | 67 | 61 | 38 |
No. of tagSNPs (FESTA double cover) | 128 | 183 | 123 | 109 | 67 |
r2 ≥ 0.8 | |||||
No. of precincts | 134 | 184 | 131 | 125 | 72 |
No. of singletonsa | 63 | 81 | 62 | 61 | 25 |
No. of tagSNPs (Greedy) | 152 | 197 | 142 | 131 | 83 |
No. of tagSNPs (FESTA) | 146 | 193 | 141 | 130 | 81 |
No. of tagSNPs (FESTA, double cover) | 237 | 311 | 229 | 204 | 139 |
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
No. of SNPs | 863 | 988 | 1061 | 539 | 708 |
r2 ≥ 0.5 | |||||
No. of precincts | 55 | 78 | 43 | 44 | 26 |
No. of singletonsa | 23 | 31 | 16 | 16 | 11 |
No. of tagSNPs (Greedy) | 81 | 110 | 71 | 66 | 41 |
No. of tagSNPs (FESTA) | 75 | 105 | 67 | 61 | 38 |
No. of tagSNPs (FESTA double cover) | 128 | 183 | 123 | 109 | 67 |
r2 ≥ 0.8 | |||||
No. of precincts | 134 | 184 | 131 | 125 | 72 |
No. of singletonsa | 63 | 81 | 62 | 61 | 25 |
No. of tagSNPs (Greedy) | 152 | 197 | 142 | 131 | 83 |
No. of tagSNPs (FESTA) | 146 | 193 | 141 | 130 | 81 |
No. of tagSNPs (FESTA, double cover) | 237 | 311 | 229 | 204 | 139 |
aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.
In this set of densely typed SNPs, using our method withr2 threshold of 0.5, the average percentage of tagSNPs required to cover each of the five regions is 8.3% of all markers (ranges from 5.4 to 11.3%). For double coverage, on average, 76.7% more tagSNPs are required (ranges from 70.7 to 83.6%). With a more stringentr2 threshold of 0.8, the average percentage of tagSNPs required increased to 16.6% of all markers (ranges from 11.4 to 24.1%). To double cover these regions, on average, 62.9% more tagSNPs are required (ranges from 56.9 to 71.6%). For those precincts where improvement over greedy search is possible, using FESTA, the reduction rate is 17.9 and 23.0% on average for the five ENCODE regions withr2 thresholds of 0.5 and 0.8, respectively. Applying our method to YRI and JPT + CHB samples reveals similar trends (data not shown).
With the rapid advance of genotyping technologies, progressively denser SNP maps will become available. As more refined association studies are carried out, it will be useful to select new tagSNPs to ‘fill holes’ in the initial sparse maps. With a good picking strategy for the first round of tagging, this staged approach should result in only a small-to-moderate increase in the total number of tagSNPs compared to a one-stage strategy.
To evaluate this strategy, we constructed an artificial sparse SNP map for each of the five ENCODE regions (using the CEU samples only). Specifically, we selected one in every five consecutive SNP markers. The density of this sparse map is about 1 SNP per 3kb, close to the density of the phase I HapMap. Then, three different tagSNP sets are identified using the three criteria described previously, denoted byTi,i = 1, 2, 3. Finally, we applied our approach to the full ENCODE SNP set, using each of these tagSNP sets as a seed, so as to search for additional tagSNPs to cover the previously ‘hidden’ SNP markers. The effectiveness of these tagSNP sets is evaluated by comparing the number of new tagSNPs needed to cover the ‘newly found’ SNPs. In addition to the three criteria, we also compared three other tagSNP selection strategies:Z random SNPs, assumeZ is the number of tagSNPs for the sparse map; a picket fence strategy withZ equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points); or using all original SNPs as tagSNPs. The results are summarized inTable 4 (r2 threshold of 0.5) and Table S4 (r2 threshold of 0.8) in the Supplementary Online Material. From there, one can see that when ther2 threshold is 0.5, 14.4% more tagSNPs (range from 7.0 to 20.9%) are needed to fill holes in the original map and that number is only 5.4% (range from 3.8 to 7.0%) whenr2 threshold is 0.8. The three tagSNP sets require fewer tagSNPs to cover the holes, compared with tagSNPs picked using a picket fence strategy (31.6% difference forr2 threshold of 0.5 and 21.6% difference forr2 threshold of 0.8) or picked at random (33.8% difference forr2 threshold of 0.5 and 21.0% difference forr2 threshold of 0.8).
Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr2 threshold of 0.5)
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
SNPs in dense mapa | 863 | 988 | 1061 | 539 | 708 |
SNPs in sparse mapb | 173 | 198 | 213 | 108 | 142 |
One-stage picking | |||||
TagSNPs in dense map | 75 | 105 | 67 | 61 | 38 |
Two-stage picking | |||||
Max averager2 b/tags and non-tagsc | 85 | 114 | 81 | 72 | 40 |
Min lowestr2/tags and non-tagsd | 85 | 115 | 80 | 75 | 40 |
Min averager2 among tagse | 85 | 117 | 82 | 74 | 42 |
Other strategies | |||||
Random pickingf | 103.2 | 137.7 | 91.4 | 71.0 | 52.0 |
Picket fenceg | 103 | 136 | 94 | 78 | 52 |
Use all sparseh | 200 | 241 | 239 | 134 | 153 |
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
SNPs in dense mapa | 863 | 988 | 1061 | 539 | 708 |
SNPs in sparse mapb | 173 | 198 | 213 | 108 | 142 |
One-stage picking | |||||
TagSNPs in dense map | 75 | 105 | 67 | 61 | 38 |
Two-stage picking | |||||
Max averager2 b/tags and non-tagsc | 85 | 114 | 81 | 72 | 40 |
Min lowestr2/tags and non-tagsd | 85 | 115 | 80 | 75 | 40 |
Min averager2 among tagse | 85 | 117 | 82 | 74 | 42 |
Other strategies | |||||
Random pickingf | 103.2 | 137.7 | 91.4 | 71.0 | 52.0 |
Picket fenceg | 103 | 136 | 94 | 78 | 52 |
Use all sparseh | 200 | 241 | 239 | 134 | 153 |
aDense map means the densely typed SNP sets obtained in the ENCODE region from the HapMap websiteAuthor Webpage (CEU samples, release 16c, June 2005). All tagSNPs in this table were identified using our new algorithm.
bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.
c–eTotal number of tagSNPs needed to cover the SNPs in the dense map using the tagSNPs identified using different criteria on the sparse map as seeds. Criterion 1, maximize the averager2 between tagSNP and SNPs that it connected to; Criterion 2, minimize the averager2 between tagSNP and SNPs that it connected to; Criterion 3, minimize the averager2 among all tagSNPs.
fTotal number of tagSNPs needed to cover the SNPs in the dense map usingT random SNPs in the sparse map as seeds.T is the number of tagSNPs identified by our algorithm on the sparse map. The number is obtained by repeating this procedure 100 times and taking the average.
gTotal number of tagSNPs needed to cover the SNPs in the dense map usingT equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points) in the sparse map as seeds.
hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.
Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr2 threshold of 0.5)
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
SNPs in dense mapa | 863 | 988 | 1061 | 539 | 708 |
SNPs in sparse mapb | 173 | 198 | 213 | 108 | 142 |
One-stage picking | |||||
TagSNPs in dense map | 75 | 105 | 67 | 61 | 38 |
Two-stage picking | |||||
Max averager2 b/tags and non-tagsc | 85 | 114 | 81 | 72 | 40 |
Min lowestr2/tags and non-tagsd | 85 | 115 | 80 | 75 | 40 |
Min averager2 among tagse | 85 | 117 | 82 | 74 | 42 |
Other strategies | |||||
Random pickingf | 103.2 | 137.7 | 91.4 | 71.0 | 52.0 |
Picket fenceg | 103 | 136 | 94 | 78 | 52 |
Use all sparseh | 200 | 241 | 239 | 134 | 153 |
Region | ENr112 | ENr131 | ENr113 | ENm010 | ENm013 |
---|---|---|---|---|---|
SNPs in dense mapa | 863 | 988 | 1061 | 539 | 708 |
SNPs in sparse mapb | 173 | 198 | 213 | 108 | 142 |
One-stage picking | |||||
TagSNPs in dense map | 75 | 105 | 67 | 61 | 38 |
Two-stage picking | |||||
Max averager2 b/tags and non-tagsc | 85 | 114 | 81 | 72 | 40 |
Min lowestr2/tags and non-tagsd | 85 | 115 | 80 | 75 | 40 |
Min averager2 among tagse | 85 | 117 | 82 | 74 | 42 |
Other strategies | |||||
Random pickingf | 103.2 | 137.7 | 91.4 | 71.0 | 52.0 |
Picket fenceg | 103 | 136 | 94 | 78 | 52 |
Use all sparseh | 200 | 241 | 239 | 134 | 153 |
aDense map means the densely typed SNP sets obtained in the ENCODE region from the HapMap websiteAuthor Webpage (CEU samples, release 16c, June 2005). All tagSNPs in this table were identified using our new algorithm.
bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.
c–eTotal number of tagSNPs needed to cover the SNPs in the dense map using the tagSNPs identified using different criteria on the sparse map as seeds. Criterion 1, maximize the averager2 between tagSNP and SNPs that it connected to; Criterion 2, minimize the averager2 between tagSNP and SNPs that it connected to; Criterion 3, minimize the averager2 among all tagSNPs.
fTotal number of tagSNPs needed to cover the SNPs in the dense map usingT random SNPs in the sparse map as seeds.T is the number of tagSNPs identified by our algorithm on the sparse map. The number is obtained by repeating this procedure 100 times and taking the average.
gTotal number of tagSNPs needed to cover the SNPs in the dense map usingT equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points) in the sparse map as seeds.
hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.
In this manuscript, we developed an efficient computational framework for tagSNP selection using the pairwiser2 criterion. Our algorithm is able to identify smaller tagSNP sets than the greedy approach (Carlsonet al., 2004). Although the improvement is modest, our algorithm always outperforms the greedy approach in terms of the tagSNP size under exactly the same pairwise LD criterion. Using both chromosome-wide data and densely typed ENCODE region data from the HapMap Project, we illustrated the utility of our approach and showed savings increase in more densely typed regions and inside large LD ‘blocks’. Computational time required by FESTA is quite reasonable and can be tailored to available computing resources as needed. Under the default setting, withr2 threshold of 0.5, FESTA takes ∼1–15 min to run on the five ENCODE regions, and ∼120 min on entire Chromosome 2 (withr2 threshold of 0.8, ∼0.1–1.5 min on the five ENCODE regions, and ∼24 min on Chromosome 2) using a 2.8 GHz Pentium class computer server. Another important advance is the ability of our method to identify multiple equivalent tagSNP sets and to use additional criteria to choose an optimal tagSNP set for typing. This feature offers flexibility in picking tagSNPs which is desirable when designing real association studies.
The key improvement of FESTA over the greedy approach is the ‘precinct partitioning’ step which enables the exhaustive search to be carried out very rapidly in most of the partitioned precincts. This is similar in spirit to the idea of ‘partition-ligation’ algorithm proposed byNiuet al. (2002) for haplotype inference.
Many of the existing tagSNP picking algorithms aim to capture haplotype diversity using the reduced set of markers (called haplotype tagging SNPs, htSNPs) such as BEST (Sebastianiet al., 2003). They work well when a small number of common haplotypes exist (typically true in the vicinity of a candidate gene) but these approaches often require the knowledge of complete haplotype phase and the boundary of the haplotype blocks. On the other hand, tagSNP selection usingr2 criteria does not require knowledge of block boundaries and can easily be applied to cover the whole chromosome. Recently, multiple-marker tagging strategies (Stram, 2005; P.I. de Bakker, 2005,Author Webpage) in which multiple tagSNPs can be used to represent each untagged SNPs have been proposed. While these methods further reduce the number of tagSNPs selected, this ‘aggressive’ approach may be sensitive to random genotyping failures.
Our approach is amenable to further computational improvements. For example, parallel programming could be used to search for tagSNPs in separate precincts, further speeding up the computation.
FESTA is freely available and can be downloaded atAuthor Webpage
We are grateful to Drs Mike Boehnke, Randy Pruim and the three anonymous reviewers for critical comments on an early version of this manuscript. This work is partially supported by NIH RO1-HG002651-01 to G.A.
Conflict of Interest: none declared.
Associate Editor: Martin Bishop
Month: | Total Views: |
---|---|
December 2016 | 1 |
January 2017 | 2 |
March 2017 | 12 |
April 2017 | 1 |
May 2017 | 4 |
June 2017 | 3 |
July 2017 | 8 |
August 2017 | 3 |
September 2017 | 2 |
October 2017 | 1 |
November 2017 | 8 |
December 2017 | 20 |
January 2018 | 5 |
February 2018 | 11 |
March 2018 | 26 |
April 2018 | 12 |
May 2018 | 5 |
June 2018 | 17 |
July 2018 | 14 |
August 2018 | 11 |
September 2018 | 3 |
October 2018 | 3 |
November 2018 | 8 |
December 2018 | 10 |
January 2019 | 9 |
February 2019 | 36 |
March 2019 | 8 |
April 2019 | 25 |
May 2019 | 8 |
June 2019 | 10 |
July 2019 | 10 |
August 2019 | 16 |
September 2019 | 9 |
October 2019 | 11 |
November 2019 | 11 |
December 2019 | 9 |
January 2020 | 21 |
February 2020 | 6 |
March 2020 | 16 |
April 2020 | 8 |
May 2020 | 8 |
June 2020 | 6 |
July 2020 | 7 |
August 2020 | 5 |
September 2020 | 4 |
October 2020 | 12 |
November 2020 | 9 |
December 2020 | 3 |
January 2021 | 9 |
February 2021 | 5 |
March 2021 | 17 |
April 2021 | 4 |
May 2021 | 15 |
June 2021 | 9 |
July 2021 | 11 |
August 2021 | 10 |
September 2021 | 6 |
October 2021 | 11 |
November 2021 | 6 |
December 2021 | 6 |
January 2022 | 6 |
February 2022 | 10 |
March 2022 | 9 |
April 2022 | 6 |
May 2022 | 4 |
June 2022 | 7 |
July 2022 | 6 |
August 2022 | 14 |
September 2022 | 13 |
October 2022 | 24 |
November 2022 | 8 |
December 2022 | 8 |
January 2023 | 4 |
February 2023 | 9 |
March 2023 | 3 |
April 2023 | 9 |
May 2023 | 3 |
June 2023 | 3 |
July 2023 | 7 |
August 2023 | 12 |
September 2023 | 6 |
October 2023 | 5 |
November 2023 | 11 |
December 2023 | 8 |
January 2024 | 23 |
February 2024 | 22 |
March 2024 | 10 |
April 2024 | 15 |
May 2024 | 15 |
June 2024 | 11 |
July 2024 | 22 |
August 2024 | 20 |
September 2024 | 9 |
October 2024 | 13 |
November 2024 | 9 |
December 2024 | 4 |
January 2025 | 13 |
February 2025 | 6 |
March 2025 | 18 |
April 2025 | 1 |
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide
This PDF is available to Subscribers Only
View Article Abstract & Purchase OptionsFor full access to this pdf, sign in to an existing account, or purchase an annual subscription.