Motivation: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers.

Results: We devised an improved algorithm for tagSNP selection using the pairwiser² criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data.

Availability: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at

Contact: [email protected]

Supplementary information:

INTRODUCTION

With the rapid improvement of high-throughput genotyping technologies, genome-wide association studies are emerging as a promising approach to detect genetic variants that contribute to human diseases. Initially, genome-wide association studies will focus on single nucleotide polymorphisms (SNPs) because of their high abundance in the human genome, their low mutation rates and their accessibility to high-throughput genotyping (Collinset al., 1997). There are more than 10 million verified SNPs in dbSNP (build 124) (Sachidanandamet al., 2001), but typing all available SNP markers is inefficient and not necessary since many will provide redundant information due to linkage disequilibrium (LD). A better strategy is to select a subset of representative SNPs (tagging SNPs or tagSNPs) and to remove the rest from consideration (Johnsonet al., 2001;Cardon and Abecasis, 2003). The objective is to have little information overlap among the selected SNPs while retaining much of the signal contained in the original set.

The selection of tagSNPs has become a very active research topic and many strategies have been proposed (Patilet al., 2001;Zhanget al., 2002;Gabrielet al., 2002;Johnsonet al., 2001;Menget al., 2003;Sebastianiet al., 2003;Avi-Itzhaket al., 2003;Ke and Cardon, 2003;Goldsteinet al., 2003;Stram, 2003;Hampeet al., 2003;Chapmanet al., 2003;Lin and Altman, 2004;Halldórssonet al., 2004;Rinaldoet al., 2005). Recently,Zhang and Jin (2003) andCarlsonet al. (2004) introduced methods based on the LD measurer². These methods search for a small set of SNPs that are in strong LD (measured through pairwiser²) with other SNPs that are not selected for genotyping. Pairwiser² is an attractive criterion for tagSNP selection since it is closely related to statistical power for case–control association studies, where a directly associated SNP is replaced with an indirectly associated tagSNP (Pritchard and Przeworski, 2001).

In this manuscript, we describe efficient algorithms for tagSNP selection based on pairwise LD measurer². The algorithms were implemented in a computer program named FESTA (fragmented exhaustive search for tagging SNPs). Essentially, we replace a greedy search, where markers are added sequentially to the tagSNP set, with an exhaustive search where all marker combinations are evaluated. To achieve this, we arrange the genome into precincts of markers in high LD, such that markers in different precincts show only low pairwise disequilibrium. TagSNP selection can then be performed within each precinct independently, greatly reducing computation cost. In most settings, our method is guaranteed to find the optimal tagSNP set(s) defined by ther² criterion. For a small proportion of precincts where exhaustive search is computationally too expensive to carry out, an efficient greedy-exhaustive hybrid search algorithm is described. Using data from the HapMap project (The International HapMap Consortium, 2003), we show that the majority of these precincts contain relatively small numbers of SNPs, especially when a stringentr² criterion is used. Our algorithm readily identifies equivalent tagSNP sets, so that additional selection criteria can be incorporated. Other useful extensions are also discussed in this manuscript, such as the inclusion/exclusion of certain SNPs and double coverage, which can increase robustness of tagSNP sets against sporadic genotyping failures or errors.

METHODS

Consider a set $S$ which containsM bi-allelic SNP markersa₁, a₂,…,a_M. Further assume that all these markers have minor allele frequency (MAF) above a certain threshold (0.05 was used in this study). First, two-SNP haplotype frequencies were estimated (Hill, 1974), and then the pairwise LD measurer² (also referred to as ‘Δ²') (Devlin and Risch, 1995) was calculated for each pair of markers using the inferred haplotype frequencies (Hill and Robertson, 1968). Two markersa_i anda_j are said to be in strong LD if ther² between them is greater than a pre-specified threshold valuer₀, denoted asr²(a_i,a_j) ≥r₀ (r₀ = 0.5 or 0.8 in in this study). Both are considered tagSNPs for each other; in thata_i can be used as a surrogate fora_j, or vice versa.

Our aim is to a find tagSNP set, denoted byT, a subset of $S$ such that $\forall a_{i} \in S \ T$ ⁠, ∃a_j ∈T that satisfiesr²(a_i,a_j) ≥r₀. In our presentation, we introduce two intermediate SNP sets,P andQ.P is called thecandidate set which contains all the markers that are eligible to be chosen as tagSNPs andQ is named thetarget set which contains all the markers that are yet to be tagged, i.e. no marker inQ is in LD with any tagSNP inT. For each markera_m inP, letC(a_m) = {a :a∈Q andr²(a,a_m) ≥r₀} represent the subset ofQ which contains markers that are in strong LD witha_m, and let |C(a_m)| be the number of the elements in the setC(a_m). Typically, the candidate setP is the complement of the tagSNP setT, $P = S \ T$ andP =Q. One exception occurs when some SNPs are excluded as tagSNPs because they cannot be easily genotyped, but they still should be tagged by other markers if possible. In this case, the candidate set is a subset of target set. We describe several different algorithms for updatingP,Q andT starting with a greedy approach (Carlsonet al., 2004). We then outline successive refinements and extensions of a partition and exhaustive search algorithm, designed to handle various scenarios encountered when planning association studies.

Greedy approach

The detailed algorithm is as follows (Carlsonet al., 2004).

Algorithm 1 (greedy approach):In Step 4, by removing associated markers from consideration, the coverage overlap among tagSNPs is greatly reduced. Although it is simple to implement, the greedy procedure may miss more efficient solutions.Figure 1 gives a simple example, where markersA andB each tag half of all markers and together can tag all the markers. However, markerC is connected to more than half of all markers, and it is the first marker selected by the greedy algorithm. In this example, the greedy algorithm produced a set with three tagSNPs, despite the fact that the optimal solution contains onlyA andB.

SetT = ∅ andP =Q =S;
For each markera_m inP, calculate |C(a_m)|;
For every markera_m where |C(a_m)| = 0, adda_m intoT, and remove it fromQ;
Find the marker inP that has the highest |C(a_m)| value, denoted asa_max, and adda_max intoT, removing it and all connected SNPs, i.e.C(a_m) fromQ;
Repeat Steps 2–4 untilQ = ∅.

Fig. 1

An example when the greedy approach does not provide the smallest tagSNP set.

Open in new tab Download slide

FESTA

An exhaustive search guarantees the minimum tagSNP set. Therefore, theoretically, the exhaustive search solves the tagSNP selection problem. But in practice, genome-wide tagSNP selection requires consideration of hundreds of thousands of SNP markers. For problem of this scale, exhaustive searches cannot be directly applied due to prohibitive computation costs.

Since appreciable LD only occurs within clusters of nearby markers along chromosomes, a practical solution is to first decompose the set of markers into disjoint precincts, such that markers in different precincts are never in strong LD. Then, selecting tagSNPs using ther² criterion in the whole set is equivalent to selecting tagSNPs in each precinct and then combining all the tagSNPs together. Here the concept of precinct is defined based on pairwise LD measure. It is therefore closely related to haplotype blocks (Reichet al., 2001;Patilet al., 2001;Dalyet al., 2001;Jeffreyset al., 2001;Gabrielet al., 2002;Dawsonet al., 2002), which are regions where historical recombination events are rare. The main difference is that the precincts of markers in high LD are determined purely on genetic distance. Unlike haplotype block, markers within each precinct may not be consecutive markers sitting next to each other.

Partitioning the markers into precincts can be achieved using standard algorithms in graph theory. We applied the Breadth First Search (BFS) algorithm (Cormenet al., 2001). Starting from any node (a marker) in a new precinct, this algorithm adds all neighboring nodes (markers in LD) and all neighbors of the newly added nodes to the precinct, until there are no neighbors to be added to the precinct. This process is restarted from different nodes until all the nodes are assigned a precinct.

After the partitioning step, we perform the tagSNP selection within each precinct. Starting withK = 1, allK-marker combinations are searched to see if they cover all markers within this precinct. If not,K is increased by one and the search is repeated until a tagSNP set is found or a pre-specified search limit is reached.

When evaluating allK-marker combinations, the computation cost required for an exhaustive search might be too great in some precincts. In such cases, we propose a hybrid solution which reduces the computation cost and retains a good chance of finding optimal tagSNP sets. For each precincti withN_i markers (Here on, all parameters with subscripti indicate parameters within thei-th precincts, such asK_i,J_i,P_i,Q_i,T_i andN_i.), we decide whether an exhaustive search is feasible by comparing the computation cost required for evaluating allK-marker combinations within a precinct, $(\begin{matrix} N_{i} \\ K \end{matrix})$ ⁠, with a computation cost limitL specified a priori, determined based on available computing resources. Larger limits allow a more comprehensive search, which may result in fewer tagSNPs being selected, but require additional computational effort. In this study, we set this limit at 1 million. When this limit is exceeded, we apply the following hybrid algorithm. Specify $K_{i}^{*}$ such that it is the largestK possible that satisfies $(\begin{matrix} N_{i} \\ K \end{matrix}) \leq L_{0}$ ⁠, whereL₀ is a pre-specified computation cost limit (less thanL, set at 10 000 in studies conducted here). Subsequently, for each $K_{i}^{*}$ -marker combinations, denoted as ${a_{1}, …, a_{K_{i}^{*}}}$ ⁠, assume that these markers have already been selected, removea_m together with all the markers in $C (a_{m})$ from candidate setP_i and target setQ_i, $m = 1, …, K_{i}^{*}$ ⁠, i.e. $P_{i} = Q_{i} = S_{i} \ \cup_{m = 1}^{K_{i}^{*}} ({a_{m}} \cup C (a_{m}))$ then apply the greedy approach to identify a subset ofP_i that is able to coverQ_i, which contains the remaining untagged markers. The tagSNP set obtained in the reduced set plus the previous $K_{i}^{*}$ markers together form a complete tagSNP set for thei-th precinct. The detailed algorithm is as follows:

Algorithm 2 (FESTA: greedy-exhaustive hybrid search):FESTA executes either a ‘pure exhaustive search’ or a ‘greedy-exhaustive hybrid search’ in each precinct depending on the computational cost. Exhaustive search is first attempted, and if the computation cost becomes too high, the hybrid algorithm is used as a fallback. Typically, only a small proportion of the precincts require the ‘greedy-exhaustive hybrid search’.

Apply the Breadth First Search to decompose the entire set of markers into precincts such that high LD can only be observed within precincts. $S = \cup_{i = 1}^{n} S_{i}$ ⁠, and $S_{i} \cap S_{j} = \emptyset for all i \neq j$ ⁠;
Within each precinct $S_{i}$ ⁠, setK = 1,
- If $(\begin{matrix} N_{i} \\ K \end{matrix}) \geq L$ ⁠, move to b, otherwise, conduct an exhaustive search over all possibleK-marker combinations. Both the candidate setP_i and the target setQ_i is $S_{i}$ ⁠. If no combination ofK SNPs can cover the entire precinct, setK =K + 1, and repeat this step;
- Find $K_{i}^{*}$ such that $(\begin{matrix} N_{i} \\ K_{i}^{*} \end{matrix}) \leq L_{0}$ and $(\begin{matrix} N_{i} \\ K_{i}^{*} + 1 \end{matrix}) > L_{0}$ ⁠. For every $K_{i}^{*}$ -marker combination in $S_{i}$ ⁠, denoted as ${a_{1}, …, a_{K_{i}^{*}}}$ ⁠, let $T_{i} = \cup_{m = 1}^{K_{i}^{*}} {a_{m}}$ ⁠, $P_{i} = Q_{i} = S_{i} \ \cup_{m = 1}^{K_{i}^{*}} ({a_{m}} \cup C (a_{m}))$ ⁠, and apply the greedy approach to identify a subset ofP_i that is able to cover the remaining untagged markersQ_i. Among all the resulting tagSNP sets, we choose the smallest.
Record all minimum tagSNP sets that cover the precinct. These form the complete minimum tagSNP sets ${T_{i}^{j} : j = 1, …, J_{i}}$ ⁠, whereJ_i is the total number of such minimum tagSNP sets.
Any combination of tagSNP sets identified from all disjoint precincts forms a tagSNP set for the whole set $S$ ⁠. Suppose the size of the minimum tagSNP set(s) in each precinct isK_i, then the overall size of such minimum tagSNP sets is $\sum_{i = 1}^{n} K_{i}$ ⁠, and the total number of such minimum tagSNP sets is $\prod_{i = 1}^{n} J_{i}$ ⁠.

FESTA double coverage

So far, both the greedy approach and our FESTA algorithm focus on finding a tagSNP set such that each SNP is either a tagSNP itself or is in LD with at least one of the tagSNPs. This is a criterion aimed at minimizing the number of tagSNPs selected. In reality, random genotyping failure or genotyping error on these tagSNPs can result in loss of power to identify the true signal. To be more robust against such adverse events, we evaluated a more stringent criterion requiring that every untyped SNP be in LD with at least two tagSNPs.

Our FESTA algorithm can be extended to find tagSNP sets that will have double coverage on the SNP markers considered. As always, an exhaustive search is able to find such tagSNP sets when the marker set considered is not too large. When exhaustive search is not feasible, the same greedy-exhaustive hybrid search strategy can be applied. The detailed FESTA double coverage algorithm can be found in the Supplementary Online Material. Note that in practice, it may be useful to consider double coverage only for large precincts, where the cost of losing an SNP to genotyping failure might be higher.

Further tagSNP selection considerations

Mandatory tagSNP markers

Our algorithm readily allows users to force certain SNP markers to be included in or excluded from the tagSNP set. There are several scenarios where such functionality is important. First, in candidate gene studies, previous knowledge may be available as to which SNPs are functionally important. These might include non-synonymous coding region SNPs (cSNPs) as well as SNPs located in regulatory regions. Second, in genome-wide studies, one might carry out multiple rounds of genotyping and tagSNP selection. In such cases, additional tagSNPs could be selected at each round to cover the markers not tagged by tagSNPs successfully genotyped in the previous round. In other settings, it may be useful to exclude certain SNPs from consideration as tags. For example, some SNP markers may be difficult to genotype using a particular platform.

When there are mandatory markerst₁,t₂,…,t_r, to be included, put these markers into the tagSNP setT and remove them from the candidate set, e.g.P becomes $P \ \cup_{i = 1}^{r} {t_{i}}$ ⁠. The target setQ becomes $Q \ \cup_{i = 1}^{r} ({t_{i}} \cup C (t_{i}))$ ⁠. If there are SNPsu₁,u₂,…,u_s that need to be excluded from the tagSNP set, remove them from the candidate setP, the target setQ is unchanged.

Choosing between alternative solutions

Within a densely typed SNP set, redundant tagSNPs are common, which results in multiple tagSNP sets of the same size. All of these sets are equal in the sense of minimizing the number of tagSNPs. In order to choose one set for genotyping, additional criteria can be entertained. Here are examples of such additional criteria:In Criteria 1 and 2, we try to identify tagSNP sets whose members have the strongest connections with those untagged SNPs. Since pairwiser² between disease locus and marker loci is closely related to statistical power of detecting association, this strategy should result in increased power on average and in the worst case, respectively. These criteria are recommended in regular association study designs. The purpose of using Criterion 3 is to find a tagSNP set whose members are as independent as possible which minimizes overlap between covered SNPs of different tagSNPs and potentially increases the chance of linking to untyped SNPs. This strategy is particularly useful if one suspects the actual disease locus is not among the marker loci genotyped. To evaluate the potential of uncovering the disease-causing mutations in association studies among tagSNP sets identified by the aforementioned criteria, we conducted some empirical evaluations, summarized in the Results section.

Maximize averager² between tagSNPs and untagged SNPs they represent;
Maximize the lowestr² between tagSNPs and the untagged SNPs they connect to;
Minimize the averager² among all pairs of tagSNPs within a precinct.

Other types of criteria may be of even greater interest in practice. For example, in many genotyping technologies, some SNPs are harder to genotype than others due to the characteristics of surrounding genome sequence. We can use this information to select tagSNPs that are likely to have a high success rate and to avoid SNPs that are prone to genotyping failure.

RESULTS

In order to illustrate our proposed piecewise exhaustive search strategy, compare it with the greedy approach and explore the various characteristics of the tagSNP sets selected by our method, we applied both methods to two sets of data, the entire Chromosome 2 and five ENCODE regions (ENr112, ENr131, ENr113, ENm010 and ENm013) genotyped by the HapMap project (release 16c, June 2005). All three populations: CEU (European), YRI (Yoruban) and JPT + CHB (Japanese and Chinese) were studied. The first is in the context of a genome-wide association study and the second is similar to the situation of a candidate region study.

Chromosome-wide tagging

We have applied the greedy algorithm and FESTA to Chromosome 2 using HapMap Phase 1 genotype data (release 16c, June 2005).Table 1 (r² threshold of 0.5) and Table S1 (r² threshold of 0.8) summarize the results. FESTA produces less tagSNPs compared with the greedy approach in all three populations. When compared across populations, the YRI samples have about twice the amount of tagSNPs as the CEU or the JPT + CHB samples. The JPT + CHB samples have slightly less tagSNPs identified than the CEU samples. Withr² threshold 0.5, the percentages of tagSNPs identified by our new algorithm are 21.6% in CEU, 39.3% in YRI and 20.9% in JPT + CHB samples, respectively.

Table 1

Open in new tab

Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct

	CEU	YRI	JPT + CHB
No. of SNPs	64 801	69 630	57 810
r² ≥ 0.5
No. of precincts	11 786	24 752	10 248
No. of tagSNPs (Greedy)	14 384	27 804	12 454
No. of tagSNPs (FESTA)	13 983	27 379	12 108
No. of tagSNPs (FESTA, double cover)	23 644	41 668	20 644
r² ≥ 0.8
No. of precincts	23 426	41 079	20 178
No. of tagSNPs (Greedy)	24 300	41 729	21 044
No. of tagSNPs (FESTA)	24 176	41 664	20 963
No. of tagSNPs (FESTA, double cover)	35 824	54 101	31 463

	CEU	YRI	JPT + CHB
No. of SNPs	64 801	69 630	57 810
r² ≥ 0.5
No. of precincts	11 786	24 752	10 248
No. of tagSNPs (Greedy)	14 384	27 804	12 454
No. of tagSNPs (FESTA)	13 983	27 379	12 108
No. of tagSNPs (FESTA, double cover)	23 644	41 668	20 644
r² ≥ 0.8
No. of precincts	23 426	41 079	20 178
No. of tagSNPs (Greedy)	24 300	41 729	21 044
No. of tagSNPs (FESTA)	24 176	41 664	20 963
No. of tagSNPs (FESTA, double cover)	35 824	54 101	31 463

Table 1

Open in new tab

Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct

	CEU	YRI	JPT + CHB
No. of SNPs	64 801	69 630	57 810
r² ≥ 0.5
No. of precincts	11 786	24 752	10 248
No. of tagSNPs (Greedy)	14 384	27 804	12 454
No. of tagSNPs (FESTA)	13 983	27 379	12 108
No. of tagSNPs (FESTA, double cover)	23 644	41 668	20 644
r² ≥ 0.8
No. of precincts	23 426	41 079	20 178
No. of tagSNPs (Greedy)	24 300	41 729	21 044
No. of tagSNPs (FESTA)	24 176	41 664	20 963
No. of tagSNPs (FESTA, double cover)	35 824	54 101	31 463

	CEU	YRI	JPT + CHB
No. of SNPs	64 801	69 630	57 810
r² ≥ 0.5
No. of precincts	11 786	24 752	10 248
No. of tagSNPs (Greedy)	14 384	27 804	12 454
No. of tagSNPs (FESTA)	13 983	27 379	12 108
No. of tagSNPs (FESTA, double cover)	23 644	41 668	20 644
r² ≥ 0.8
No. of precincts	23 426	41 079	20 178
No. of tagSNPs (Greedy)	24 300	41 729	21 044
No. of tagSNPs (FESTA)	24 176	41 664	20 963
No. of tagSNPs (FESTA, double cover)	35 824	54 101	31 463

The size of the tagSNP set is optimal for precincts where the greedy approach indicates that one or two tagSNPs are enough to cover all the SNPs in it. Improvements over the greedy approach is only possible for the remaining precincts. In the CEU samples, there are 599 of such precincts, in which the greedy approach identified 2423 tagSNPs, and FESTA identified 2022, a 16.5% reduction. When ther² threshold is 0.8, 154 precincts require more than two tagSNPs, as identified by the greedy approach. Among them, the greedy approach and FESTA identified 526 and 402 tagSNPs, respectively. The reduction rate is 23.6%. All the detailed results are summarized inTable 2 (r² threshold of 0.5) and S1 (r² threshold of 0.8). When double coverage is required, 69.1 and 45.9% more tagSNPs are needed withr² thresholds of 0.5 and 0.8, respectively. Similar results were obtained from the YRI and JPT + CHB samples.

Table 2

Open in new tab

Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr² threshold of 0.5)

	CEU		YRI		JPT+CHB
	Greedy	FESTA	Greedy	FESTA	Greedy	FESTA
Singletona	5241	5241	15 079	15 079	4416	4416
1	5172	5172	8096	8096	4660	4660
2	774	911	924	1070	634	770
3	318	278	355	291	312	250
4	144	99	127	100	113	90
5	59	42	73	53	60	30
6	27	18	36	28	16	15
7	17	17	21	10	14	6
8	16	4	10	8	11	6
9	11	1	6	3	4	4
10+	7	3	25	14	8	1
Total	14 384	13 983	27 804	27 379	12 454	12 108

	CEU		YRI		JPT+CHB
	Greedy	FESTA	Greedy	FESTA	Greedy	FESTA
Singletona	5241	5241	15 079	15 079	4416	4416
1	5172	5172	8096	8096	4660	4660
2	774	911	924	1070	634	770
3	318	278	355	291	312	250
4	144	99	127	100	113	90
5	59	42	73	53	60	30
6	27	18	36	28	16	15
7	17	17	21	10	14	6
8	16	4	10	8	11	6
9	11	1	6	3	4	4
10+	7	3	25	14	8	1
Total	14 384	13 983	27 804	27 379	12 454	12 108

^aSingleton means the precinct only contains one SNP. In other words, singleton refers to an SNP marker that is not in LD (pairwise LD measurer² greater than a threshold) with any other SNP in the entire set. Such an SNP, by definition, is one of the tagSNPs.

Table 2

Open in new tab

Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr² threshold of 0.5)

	CEU		YRI		JPT+CHB
	Greedy	FESTA	Greedy	FESTA	Greedy	FESTA
Singletona	5241	5241	15 079	15 079	4416	4416
1	5172	5172	8096	8096	4660	4660
2	774	911	924	1070	634	770
3	318	278	355	291	312	250
4	144	99	127	100	113	90
5	59	42	73	53	60	30
6	27	18	36	28	16	15
7	17	17	21	10	14	6
8	16	4	10	8	11	6
9	11	1	6	3	4	4
10+	7	3	25	14	8	1
Total	14 384	13 983	27 804	27 379	12 454	12 108

	CEU		YRI		JPT+CHB
	Greedy	FESTA	Greedy	FESTA	Greedy	FESTA
Singletona	5241	5241	15 079	15 079	4416	4416
1	5172	5172	8096	8096	4660	4660
2	774	911	924	1070	634	770
3	318	278	355	291	312	250
4	144	99	127	100	113	90
5	59	42	73	53	60	30
6	27	18	36	28	16	15
7	17	17	21	10	14	6
8	16	4	10	8	11	6
9	11	1	6	3	4	4
10+	7	3	25	14	8	1
Total	14 384	13 983	27 804	27 379	12 454	12 108

Among all the non-singleton precincts in the CEU samples (6545 forr² threshold of 0.5 and 10196 forr² threshold of 0.8), most require only a small number of tagSNPs, so that the exhaustive search can be applied directly. Withr² threshold of 0.5, the greedy-exhaustive hybrid approach was required for only 98 precincts or 1.5% of all precincts (11 precincts (0.1%) withr² thershold of 0.8).

Densely typed region

A very dense SNP map was recently released by the HapMap project on the ENCODE regions. We used five such regions (ENr112, ENr131, ENr113, ENm010 and ENm013) to evaluate the performance of our algorithm. Each ENCODE regions is 500 kb in length, for the CEU samples, the average number of SNPs in these regions is 832 (ranges from 551 to 1126), corresponding to an SNP density about 1 SNP per 601 bps (1 SNP per 907 bps to 1 SNP per 444 bps for individual regions). The detailed results were summarized inTable 3. Detailed results for the YRI and JPT + CHB samples can be found in Supplementary Tables S2 and S3.

Table 3

Open in new tab

Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)

Region	ENr112	ENr131	ENr113	ENm010	ENm013
No. of SNPs	863	988	1061	539	708
r² ≥ 0.5
No. of precincts	55	78	43	44	26
No. of singletonsa	23	31	16	16	11
No. of tagSNPs (Greedy)	81	110	71	66	41
No. of tagSNPs (FESTA)	75	105	67	61	38
No. of tagSNPs (FESTA double cover)	128	183	123	109	67
r² ≥ 0.8
No. of precincts	134	184	131	125	72
No. of singletonsa	63	81	62	61	25
No. of tagSNPs (Greedy)	152	197	142	131	83
No. of tagSNPs (FESTA)	146	193	141	130	81
No. of tagSNPs (FESTA, double cover)	237	311	229	204	139

Region	ENr112	ENr131	ENr113	ENm010	ENm013
No. of SNPs	863	988	1061	539	708
r² ≥ 0.5
No. of precincts	55	78	43	44	26
No. of singletonsa	23	31	16	16	11
No. of tagSNPs (Greedy)	81	110	71	66	41
No. of tagSNPs (FESTA)	75	105	67	61	38
No. of tagSNPs (FESTA double cover)	128	183	123	109	67
r² ≥ 0.8
No. of precincts	134	184	131	125	72
No. of singletonsa	63	81	62	61	25
No. of tagSNPs (Greedy)	152	197	142	131	83
No. of tagSNPs (FESTA)	146	193	141	130	81
No. of tagSNPs (FESTA, double cover)	237	311	229	204	139

^aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer² greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.

Table 3

Open in new tab

Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)

Region	ENr112	ENr131	ENr113	ENm010	ENm013
No. of SNPs	863	988	1061	539	708
r² ≥ 0.5
No. of precincts	55	78	43	44	26
No. of singletonsa	23	31	16	16	11
No. of tagSNPs (Greedy)	81	110	71	66	41
No. of tagSNPs (FESTA)	75	105	67	61	38
No. of tagSNPs (FESTA double cover)	128	183	123	109	67
r² ≥ 0.8
No. of precincts	134	184	131	125	72
No. of singletonsa	63	81	62	61	25
No. of tagSNPs (Greedy)	152	197	142	131	83
No. of tagSNPs (FESTA)	146	193	141	130	81
No. of tagSNPs (FESTA, double cover)	237	311	229	204	139

Region	ENr112	ENr131	ENr113	ENm010	ENm013
No. of SNPs	863	988	1061	539	708
r² ≥ 0.5
No. of precincts	55	78	43	44	26
No. of singletonsa	23	31	16	16	11
No. of tagSNPs (Greedy)	81	110	71	66	41
No. of tagSNPs (FESTA)	75	105	67	61	38
No. of tagSNPs (FESTA double cover)	128	183	123	109	67
r² ≥ 0.8
No. of precincts	134	184	131	125	72
No. of singletonsa	63	81	62	61	25
No. of tagSNPs (Greedy)	152	197	142	131	83
No. of tagSNPs (FESTA)	146	193	141	130	81
No. of tagSNPs (FESTA, double cover)	237	311	229	204	139

^aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer² greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.

In this set of densely typed SNPs, using our method withr² threshold of 0.5, the average percentage of tagSNPs required to cover each of the five regions is 8.3% of all markers (ranges from 5.4 to 11.3%). For double coverage, on average, 76.7% more tagSNPs are required (ranges from 70.7 to 83.6%). With a more stringentr² threshold of 0.8, the average percentage of tagSNPs required increased to 16.6% of all markers (ranges from 11.4 to 24.1%). To double cover these regions, on average, 62.9% more tagSNPs are required (ranges from 56.9 to 71.6%). For those precincts where improvement over greedy search is possible, using FESTA, the reduction rate is 17.9 and 23.0% on average for the five ENCODE regions withr² thresholds of 0.5 and 0.8, respectively. Applying our method to YRI and JPT + CHB samples reveals similar trends (data not shown).

Additional TagSNPs for denser SNP map

With the rapid advance of genotyping technologies, progressively denser SNP maps will become available. As more refined association studies are carried out, it will be useful to select new tagSNPs to ‘fill holes’ in the initial sparse maps. With a good picking strategy for the first round of tagging, this staged approach should result in only a small-to-moderate increase in the total number of tagSNPs compared to a one-stage strategy.

To evaluate this strategy, we constructed an artificial sparse SNP map for each of the five ENCODE regions (using the CEU samples only). Specifically, we selected one in every five consecutive SNP markers. The density of this sparse map is about 1 SNP per 3kb, close to the density of the phase I HapMap. Then, three different tagSNP sets are identified using the three criteria described previously, denoted byT_i,i = 1, 2, 3. Finally, we applied our approach to the full ENCODE SNP set, using each of these tagSNP sets as a seed, so as to search for additional tagSNPs to cover the previously ‘hidden’ SNP markers. The effectiveness of these tagSNP sets is evaluated by comparing the number of new tagSNPs needed to cover the ‘newly found’ SNPs. In addition to the three criteria, we also compared three other tagSNP selection strategies:Z random SNPs, assumeZ is the number of tagSNPs for the sparse map; a picket fence strategy withZ equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points); or using all original SNPs as tagSNPs. The results are summarized inTable 4 (r² threshold of 0.5) and Table S4 (r² threshold of 0.8) in the Supplementary Online Material. From there, one can see that when ther² threshold is 0.5, 14.4% more tagSNPs (range from 7.0 to 20.9%) are needed to fill holes in the original map and that number is only 5.4% (range from 3.8 to 7.0%) whenr² threshold is 0.8. The three tagSNP sets require fewer tagSNPs to cover the holes, compared with tagSNPs picked using a picket fence strategy (31.6% difference forr² threshold of 0.5 and 21.6% difference forr² threshold of 0.8) or picked at random (33.8% difference forr² threshold of 0.5 and 21.0% difference forr² threshold of 0.8).

Table 4

Open in new tab

Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr² threshold of 0.5)

Region	ENr112	ENr131	ENr113	ENm010	ENm013
SNPs in dense mapa	863	988	1061	539	708
SNPs in sparse mapb	173	198	213	108	142
One-stage picking
TagSNPs in dense map	75	105	67	61	38
Two-stage picking
Max averager² b/tags and non-tagsc	85	114	81	72	40
Min lowestr²/tags and non-tagsd	85	115	80	75	40
Min averager² among tagse	85	117	82	74	42
Other strategies
Random pickingf	103.2	137.7	91.4	71.0	52.0
Picket fenceg	103	136	94	78	52
Use all sparseh	200	241	239	134	153

Region	ENr112	ENr131	ENr113	ENm010	ENm013
SNPs in dense mapa	863	988	1061	539	708
SNPs in sparse mapb	173	198	213	108	142
One-stage picking
TagSNPs in dense map	75	105	67	61	38
Two-stage picking
Max averager² b/tags and non-tagsc	85	114	81	72	40
Min lowestr²/tags and non-tagsd	85	115	80	75	40
Min averager² among tagse	85	117	82	74	42
Other strategies
Random pickingf	103.2	137.7	91.4	71.0	52.0
Picket fenceg	103	136	94	78	52
Use all sparseh	200	241	239	134	153

^aDense map means the densely typed SNP sets obtained in the ENCODE region from the HapMap websiteAuthor Webpage (CEU samples, release 16c, June 2005). All tagSNPs in this table were identified using our new algorithm.

^bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.

^c–eTotal number of tagSNPs needed to cover the SNPs in the dense map using the tagSNPs identified using different criteria on the sparse map as seeds. Criterion 1, maximize the averager² between tagSNP and SNPs that it connected to; Criterion 2, minimize the averager² between tagSNP and SNPs that it connected to; Criterion 3, minimize the averager² among all tagSNPs.

^fTotal number of tagSNPs needed to cover the SNPs in the dense map usingT random SNPs in the sparse map as seeds.T is the number of tagSNPs identified by our algorithm on the sparse map. The number is obtained by repeating this procedure 100 times and taking the average.

^gTotal number of tagSNPs needed to cover the SNPs in the dense map usingT equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points) in the sparse map as seeds.

^hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.

Table 4

Open in new tab

Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr² threshold of 0.5)

Region	ENr112	ENr131	ENr113	ENm010	ENm013
SNPs in dense mapa	863	988	1061	539	708
SNPs in sparse mapb	173	198	213	108	142
One-stage picking
TagSNPs in dense map	75	105	67	61	38
Two-stage picking
Max averager² b/tags and non-tagsc	85	114	81	72	40
Min lowestr²/tags and non-tagsd	85	115	80	75	40
Min averager² among tagse	85	117	82	74	42
Other strategies
Random pickingf	103.2	137.7	91.4	71.0	52.0
Picket fenceg	103	136	94	78	52
Use all sparseh	200	241	239	134	153

Region	ENr112	ENr131	ENr113	ENm010	ENm013
SNPs in dense mapa	863	988	1061	539	708
SNPs in sparse mapb	173	198	213	108	142
One-stage picking
TagSNPs in dense map	75	105	67	61	38
Two-stage picking
Max averager² b/tags and non-tagsc	85	114	81	72	40
Min lowestr²/tags and non-tagsd	85	115	80	75	40
Min averager² among tagse	85	117	82	74	42
Other strategies
Random pickingf	103.2	137.7	91.4	71.0	52.0
Picket fenceg	103	136	94	78	52
Use all sparseh	200	241	239	134	153

^bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.

^hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.

DISCUSSION

In this manuscript, we developed an efficient computational framework for tagSNP selection using the pairwiser² criterion. Our algorithm is able to identify smaller tagSNP sets than the greedy approach (Carlsonet al., 2004). Although the improvement is modest, our algorithm always outperforms the greedy approach in terms of the tagSNP size under exactly the same pairwise LD criterion. Using both chromosome-wide data and densely typed ENCODE region data from the HapMap Project, we illustrated the utility of our approach and showed savings increase in more densely typed regions and inside large LD ‘blocks’. Computational time required by FESTA is quite reasonable and can be tailored to available computing resources as needed. Under the default setting, withr² threshold of 0.5, FESTA takes ∼1–15 min to run on the five ENCODE regions, and ∼120 min on entire Chromosome 2 (withr² threshold of 0.8, ∼0.1–1.5 min on the five ENCODE regions, and ∼24 min on Chromosome 2) using a 2.8 GHz Pentium class computer server. Another important advance is the ability of our method to identify multiple equivalent tagSNP sets and to use additional criteria to choose an optimal tagSNP set for typing. This feature offers flexibility in picking tagSNPs which is desirable when designing real association studies.

The key improvement of FESTA over the greedy approach is the ‘precinct partitioning’ step which enables the exhaustive search to be carried out very rapidly in most of the partitioned precincts. This is similar in spirit to the idea of ‘partition-ligation’ algorithm proposed byNiuet al. (2002) for haplotype inference.

Many of the existing tagSNP picking algorithms aim to capture haplotype diversity using the reduced set of markers (called haplotype tagging SNPs, htSNPs) such as BEST (Sebastianiet al., 2003). They work well when a small number of common haplotypes exist (typically true in the vicinity of a candidate gene) but these approaches often require the knowledge of complete haplotype phase and the boundary of the haplotype blocks. On the other hand, tagSNP selection usingr² criteria does not require knowledge of block boundaries and can easily be applied to cover the whole chromosome. Recently, multiple-marker tagging strategies (Stram, 2005; P.I. de Bakker, 2005,Author Webpage) in which multiple tagSNPs can be used to represent each untagged SNPs have been proposed. While these methods further reduce the number of tagSNPs selected, this ‘aggressive’ approach may be sensitive to random genotyping failures.

Our approach is amenable to further computational improvements. For example, parallel programming could be used to search for tagSNPs in separate precincts, further speeding up the computation.

FESTA is freely available and can be downloaded atAuthor Webpage

We are grateful to Drs Mike Boehnke, Randy Pruim and the three anonymous reviewers for critical comments on an early version of this manuscript. This work is partially supported by NIH RO1-HG002651-01 to G.A.

Conflict of Interest: none declared.

REFERENCES

Avi-Itzhak

H.I.

et al.,

Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity

Pac. Symp. Biocomput.

2003

(pg.

466

477

)

Google Scholar

OpenURL Placeholder Text

WorldCat

Cardon

L.R.

Abecasis

G.R.

Using haplotype blocks to map human complex trait loci

Trends Genet.

2003

, vol.

(pg.

135

140

)

Carlson

C.S.

et al.,

Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium

Am. J. Hum. Genet.

2004

, vol.

(pg.

106

120

)

Chapman

J.M.

et al.,

Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power

Hum. Hered.

2003

, vol.

pg.

1831

Collins

F.S.

et al.,

Variations on a theme: cataloging human DNA sequence variation

Science

1997

, vol.

278

(pg.

1580

1581

)

Cormen

T.H.

et al.,

Introduction to algorithms

2001

2nd edition

Cambridge

MIT Press

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

COPAC

Daly

M.J.

et al.,

High-resolution haplotype structure in the human genome

Nat. Genet.

2001

, vol.

(pg.

229

232

)

Dawson

et al.,

A first generation slinkage disequilibrium map of human chromosome 22

Nature

2002

, vol.

418

(pg.

544

548

)

Devlin

Risch

A comparison of linkage disequilibrium measures for fine-scale mapping

Genomics

1995

, vol.

(pg.

311

322

)

Gabriel

S.B.

et al.,

The structure of haplotype blocks in the human genome

Science

2002

, vol.

296

(pg.

2225

2229

)

Goldstein

D.B.

et al.,

Genome scans and candidate gene approaches in the study of common diseases and variable drug responses

Trends Genet.

2003

, vol.

(pg.

615

622

)

Hampe

et al.,

Entropy-based SNP selection for genetic association studies

Hum Genet.

2003

, vol.

114

(pg.

)

Hill

W.G.

Estimation of linkage disequilibrium in randomly mating populations

Heredity

1974

, vol.

(pg.

229

239

)

Hill

W.G.

Robertson

The effects of inbreeding at loci with heterozygote advantage

Genetics

1968

, vol.

(pg.

615

628

)

Halldórsson

B.V.

et al.,

Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies

Genome Res.

2004

, vol.

(pg.

1633

1640

)

Johnson

G.C.

et al.,

Haplotype tagging for the identification of common disease genes

Nat. Genet.

2001

, vol.

(pg.

233

237

)

Jeffreys

A.J.

et al.,

Intensely punctuate meiotic recombination in the class II region of the major of histocompatibility complex

Nat. Genet.

2001

, vol.

(pg.

217

222

)

Cardon

L.R.

Efficient selective screening of haplotype tag SNPs

Bioinformatics

2003

, vol.

(pg.

287

288

)

Lin

Altman

R.B.

Finding haplotype tagging SNPs by use of principal components analysis

Am. J. Hum. Genet.

2004

, vol.

(pg.

850

861

)

Meng

et al.,

Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes

Am. J. Hum. Genet.

2003

, vol.

(pg.

115

130

)

Niu

et al.,

Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms

Am. J. Hum. Genet.

2002

, vol.

(pg.

157

169

)

Patil

et al.,

Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21

Science

2001

, vol.

294

(pg.

1719

1723

)

Pritchard

J.K.

Przeworski

Linkage disequilibrium in humans: models and data

Am. J. Hum. Genet.

2001

, vol.

(pg.

)

Reich

D.E.

et al.,

Linkage disequilibrium in the human genome

Nature

2001

, vol.

411

(pg.

199

204

)

Rinaldo

et al.,

Characterization of multilocus linkage disequilibrium

Genet. Epidemiol.

2005

, vol.

(pg.

193

206

)

Sachidanandam

et al.

International SNP Map Working Group

A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms

Nature

2001

, vol.

409

(pg.

928

933

)

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Sebastiani

et al.,

Minimal haplotype tagging

Proc. Natl Acad. Sci. USA

2003

, vol.

100

(pg.

9900

9905

)

Stram

D.O.

et al.,

Choosing haplotype-tagging SNPs based on unphased genotype data using preliminary sample of unrelated subjects with an example from the multiethic cohort study

Hum. Hered.

2003

, vol.

(pg.

)

Stram

D.O.

Software for tag single nucleotide polymorphism selection

Hum. Genomics

2005

, vol.

(pg.

144

151

)

The International HapMap Consortium

The International HapMap Project

Nature

2003

, vol.

426

(pg.

789

796

)

Zhang

et al.,

A dynamic programming algorithm for haplotype partitioning

Proc. Natl Acad. Sci. USA

2002

, vol.

(pg.

7335

7339

)

Zhang

Jin

HaploBlockFinder: haplotype block analysis

Bioinformatics

2003

, vol.

(pg.

1300

1301

)

Author notes

Associate Editor: Martin Bishop

Issue Section:

Original Papers>Genetics and population analysis

Download all slides

Citations

Views

972

Altmetric

More metrics information

Metrics

Total Views972

696Pageviews

276PDF Downloads

Since 12/1/2016

Month:	Total Views:
December 2016	1
January 2017	2
March 2017	12
April 2017	1
May 2017	4
June 2017	3
July 2017	8
August 2017	3
September 2017	2
October 2017	1
November 2017	8
December 2017	20
January 2018	5
February 2018	11
March 2018	26
April 2018	12
May 2018	5
June 2018	17
July 2018	14
August 2018	11
September 2018	3
October 2018	3
November 2018	8
December 2018	10
January 2019	9
February 2019	36
March 2019	8
April 2019	25
May 2019	8
June 2019	10
July 2019	10
August 2019	16
September 2019	9
October 2019	11
November 2019	11
December 2019	9
January 2020	21
February 2020	6
March 2020	16
April 2020	8
May 2020	8
June 2020	6
July 2020	7
August 2020	5
September 2020	4
October 2020	12
November 2020	9
December 2020	3
January 2021	9
February 2021	5
March 2021	17
April 2021	4
May 2021	15
June 2021	9
July 2021	11
August 2021	10
September 2021	6
October 2021	11
November 2021	6
December 2021	6
January 2022	6
February 2022	10
March 2022	9
April 2022	6
May 2022	4
June 2022	7
July 2022	6
August 2022	14
September 2022	13
October 2022	24
November 2022	8
December 2022	8
January 2023	4
February 2023	9
March 2023	3
April 2023	9
May 2023	3
June 2023	3
July 2023	7
August 2023	12
September 2023	6
October 2023	5
November 2023	11
December 2023	8
January 2024	23
February 2024	22
March 2024	10
April 2024	15
May 2024	15
June 2024	11
July 2024	22
August 2024	20
September 2024	9
October 2024	13
November 2024	9
December 2024	4
January 2025	13
February 2025	6
March 2025	18
April 2025	1

Citations

40Web of Science

Altmetrics

Email alerts

New journal issues

New journal articles

Citing articles via

Web of Science (40)

Google Scholar

Gradient matching accelerates mixed-effects inference for biochemical networks

Predicting Explainable Dementia Types with LLM-aided Feature Engineering

ROICellTrack: A deep learning framework for integrating cellular imaging modalities in subcellular spatial transcriptomic profiling of tumor tissues

Rethinking GWAS: how lessons from genetic screens and artificial intelligence could reveal biological mechanisms

Topology-Driven Negative Sampling Enhances Generalizability in Protein-Protein Interaction Prediction

Looking for your next opportunity?

Having trouble contacting the network. Please try again in a moment or two.

Online ISSN 1367-4811

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

Cookie policy
Privacy policy
Legal notice

Movatterモバイル変換

Article Contents

An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria

Cite

Abstract

INTRODUCTION

METHODS

Greedy approach

FESTA

FESTA double coverage

Further tagSNP selection considerations

Mandatory tagSNP markers

Choosing between alternative solutions

RESULTS

Chromosome-wide tagging

Densely typed region

Additional TagSNPs for denser SNP map

DISCUSSION

REFERENCES

Author notes

Citations

Views

Altmetric

Email alerts

New journal issues alert

Sign in

Personal account

Journal article activity alert

Sign in

Personal account

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only