Movatterモバイル変換


[0]ホーム

URL:


Skip to Main Content
Advertisement
Oxford Academic
Search
Bioinformatics
International Society for Computational Biology
Close
Search
Journal Article

An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria

,
Zhaohui S. Qin*
1  1  
Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan
 
1420 Washington Heights, Ann Arbor, MI 48109-2029, USA
*To whom correspondence should be addressed.
Search for other works by this author on:
,
Shyam Gopalakrishnan
1  1  
Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan
 
1420 Washington Heights, Ann Arbor, MI 48109-2029, USA
2  2  
Department of Electrical Engineering and Computer Science, University of Michigan
 
Ann Arbor, MI 48109-2122, USA
Search for other works by this author on:
Gonçalo R. Abecasis
1  1  
Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan
 
1420 Washington Heights, Ann Arbor, MI 48109-2029, USA
Search for other works by this author on:

Associate Editor: Martin Bishop

Author Notes
Bioinformatics, Volume 22, Issue 2, January 2006, Pages 220–225,https://doi.org/10.1093/bioinformatics/bti762
Published:
03 November 2005
Article history
Received:
06 August 2005
Revision received:
09 October 2005
Accepted:
02 November 2005
Published:
03 November 2005
Search
Close
Search

Abstract

Motivation: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers.

Results: We devised an improved algorithm for tagSNP selection using the pairwiser2 criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data.

Availability: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at

Contact:  [email protected]

Supplementary information:  

INTRODUCTION

With the rapid improvement of high-throughput genotyping technologies, genome-wide association studies are emerging as a promising approach to detect genetic variants that contribute to human diseases. Initially, genome-wide association studies will focus on single nucleotide polymorphisms (SNPs) because of their high abundance in the human genome, their low mutation rates and their accessibility to high-throughput genotyping (Collinset al., 1997). There are more than 10 million verified SNPs in dbSNP (build 124) (Sachidanandamet al., 2001), but typing all available SNP markers is inefficient and not necessary since many will provide redundant information due to linkage disequilibrium (LD). A better strategy is to select a subset of representative SNPs (tagging SNPs or tagSNPs) and to remove the rest from consideration (Johnsonet al., 2001;Cardon and Abecasis, 2003). The objective is to have little information overlap among the selected SNPs while retaining much of the signal contained in the original set.

The selection of tagSNPs has become a very active research topic and many strategies have been proposed (Patilet al., 2001;Zhanget al., 2002;Gabrielet al., 2002;Johnsonet al., 2001;Menget al., 2003;Sebastianiet al., 2003;Avi-Itzhaket al., 2003;Ke and Cardon, 2003;Goldsteinet al., 2003;Stram, 2003;Hampeet al., 2003;Chapmanet al., 2003;Lin and Altman, 2004;Halldórssonet al., 2004;Rinaldoet al., 2005). Recently,Zhang and Jin (2003) andCarlsonet al. (2004) introduced methods based on the LD measurer2. These methods search for a small set of SNPs that are in strong LD (measured through pairwiser2) with other SNPs that are not selected for genotyping. Pairwiser2 is an attractive criterion for tagSNP selection since it is closely related to statistical power for case–control association studies, where a directly associated SNP is replaced with an indirectly associated tagSNP (Pritchard and Przeworski, 2001).

In this manuscript, we describe efficient algorithms for tagSNP selection based on pairwise LD measurer2. The algorithms were implemented in a computer program named FESTA (fragmented exhaustive search for tagging SNPs). Essentially, we replace a greedy search, where markers are added sequentially to the tagSNP set, with an exhaustive search where all marker combinations are evaluated. To achieve this, we arrange the genome into precincts of markers in high LD, such that markers in different precincts show only low pairwise disequilibrium. TagSNP selection can then be performed within each precinct independently, greatly reducing computation cost. In most settings, our method is guaranteed to find the optimal tagSNP set(s) defined by ther2 criterion. For a small proportion of precincts where exhaustive search is computationally too expensive to carry out, an efficient greedy-exhaustive hybrid search algorithm is described. Using data from the HapMap project (The International HapMap Consortium, 2003), we show that the majority of these precincts contain relatively small numbers of SNPs, especially when a stringentr2 criterion is used. Our algorithm readily identifies equivalent tagSNP sets, so that additional selection criteria can be incorporated. Other useful extensions are also discussed in this manuscript, such as the inclusion/exclusion of certain SNPs and double coverage, which can increase robustness of tagSNP sets against sporadic genotyping failures or errors.

METHODS

Consider a setS which containsM bi-allelic SNP markersa1, a2,…,aM. Further assume that all these markers have minor allele frequency (MAF) above a certain threshold (0.05 was used in this study). First, two-SNP haplotype frequencies were estimated (Hill, 1974), and then the pairwise LD measurer2 (also referred to as ‘Δ2') (Devlin and Risch, 1995) was calculated for each pair of markers using the inferred haplotype frequencies (Hill and Robertson, 1968). Two markersai andaj are said to be in strong LD if ther2 between them is greater than a pre-specified threshold valuer0, denoted asr2(ai,aj) ≥r0 (r0 = 0.5 or 0.8 in in this study). Both are considered tagSNPs for each other; in thatai can be used as a surrogate foraj, or vice versa.

Our aim is to a find tagSNP set, denoted byT, a subset ofS such thataiS\T, ∃ajT that satisfiesr2(ai,aj) ≥r0. In our presentation, we introduce two intermediate SNP sets,P andQ.P is called thecandidate set which contains all the markers that are eligible to be chosen as tagSNPs andQ is named thetarget set which contains all the markers that are yet to be tagged, i.e. no marker inQ is in LD with any tagSNP inT. For each markeram inP, letC(am) = {a :aQ andr2(a,am) ≥r0} represent the subset ofQ which contains markers that are in strong LD witham, and let |C(am)| be the number of the elements in the setC(am). Typically, the candidate setP is the complement of the tagSNP setT,P=S\T andP =Q. One exception occurs when some SNPs are excluded as tagSNPs because they cannot be easily genotyped, but they still should be tagged by other markers if possible. In this case, the candidate set is a subset of target set. We describe several different algorithms for updatingP,Q andT starting with a greedy approach (Carlsonet al., 2004). We then outline successive refinements and extensions of a partition and exhaustive search algorithm, designed to handle various scenarios encountered when planning association studies.

Greedy approach

The detailed algorithm is as follows (Carlsonet al., 2004).

Algorithm 1 (greedy approach):In Step 4, by removing associated markers from consideration, the coverage overlap among tagSNPs is greatly reduced. Although it is simple to implement, the greedy procedure may miss more efficient solutions.Figure 1 gives a simple example, where markersA andB each tag half of all markers and together can tag all the markers. However, markerC is connected to more than half of all markers, and it is the first marker selected by the greedy algorithm. In this example, the greedy algorithm produced a set with three tagSNPs, despite the fact that the optimal solution contains onlyA andB.

  1. SetT = ∅ andP =Q =S;

  2. For each markeram inP, calculate |C(am)|;

  3. For every markeram where |C(am)| = 0, addam intoT, and remove it fromQ;

  4. Find the marker inP that has the highest |C(am)| value, denoted asamax, and addamax intoT, removing it and all connected SNPs, i.e.C(am) fromQ;

  5. Repeat Steps 2–4 untilQ = ∅.

An example when the greedy approach does not provide the smallest tagSNP set.
Fig. 1

An example when the greedy approach does not provide the smallest tagSNP set.

FESTA

An exhaustive search guarantees the minimum tagSNP set. Therefore, theoretically, the exhaustive search solves the tagSNP selection problem. But in practice, genome-wide tagSNP selection requires consideration of hundreds of thousands of SNP markers. For problem of this scale, exhaustive searches cannot be directly applied due to prohibitive computation costs.

Since appreciable LD only occurs within clusters of nearby markers along chromosomes, a practical solution is to first decompose the set of markers into disjoint precincts, such that markers in different precincts are never in strong LD. Then, selecting tagSNPs using ther2 criterion in the whole set is equivalent to selecting tagSNPs in each precinct and then combining all the tagSNPs together. Here the concept of precinct is defined based on pairwise LD measure. It is therefore closely related to haplotype blocks (Reichet al., 2001;Patilet al., 2001;Dalyet al., 2001;Jeffreyset al., 2001;Gabrielet al., 2002;Dawsonet al., 2002), which are regions where historical recombination events are rare. The main difference is that the precincts of markers in high LD are determined purely on genetic distance. Unlike haplotype block, markers within each precinct may not be consecutive markers sitting next to each other.

Partitioning the markers into precincts can be achieved using standard algorithms in graph theory. We applied the Breadth First Search (BFS) algorithm (Cormenet al., 2001). Starting from any node (a marker) in a new precinct, this algorithm adds all neighboring nodes (markers in LD) and all neighbors of the newly added nodes to the precinct, until there are no neighbors to be added to the precinct. This process is restarted from different nodes until all the nodes are assigned a precinct.

After the partitioning step, we perform the tagSNP selection within each precinct. Starting withK = 1, allK-marker combinations are searched to see if they cover all markers within this precinct. If not,K is increased by one and the search is repeated until a tagSNP set is found or a pre-specified search limit is reached.

When evaluating allK-marker combinations, the computation cost required for an exhaustive search might be too great in some precincts. In such cases, we propose a hybrid solution which reduces the computation cost and retains a good chance of finding optimal tagSNP sets. For each precincti withNi markers (Here on, all parameters with subscripti indicate parameters within thei-th precincts, such asKi,Ji,Pi,Qi,Ti andNi.), we decide whether an exhaustive search is feasible by comparing the computation cost required for evaluating allK-marker combinations within a precinct,(NiK), with a computation cost limitL specified a priori, determined based on available computing resources. Larger limits allow a more comprehensive search, which may result in fewer tagSNPs being selected, but require additional computational effort. In this study, we set this limit at 1 million. When this limit is exceeded, we apply the following hybrid algorithm. SpecifyKi* such that it is the largestK possible that satisfies(NiK)L0, whereL0 is a pre-specified computation cost limit (less thanL, set at 10 000 in studies conducted here). Subsequently, for eachKi*-marker combinations, denoted as{a1,,aKi*}, assume that these markers have already been selected, removeam together with all the markers inC(am) from candidate setPi and target setQi,m=1,,Ki*, i.e.Pi=Qi=Si\m=1Ki*({am}C(am)) then apply the greedy approach to identify a subset ofPi that is able to coverQi, which contains the remaining untagged markers. The tagSNP set obtained in the reduced set plus the previousKi* markers together form a complete tagSNP set for thei-th precinct. The detailed algorithm is as follows:

Algorithm 2 (FESTA: greedy-exhaustive hybrid search):FESTA executes either a ‘pure exhaustive search’ or a ‘greedy-exhaustive hybrid search’ in each precinct depending on the computational cost. Exhaustive search is first attempted, and if the computation cost becomes too high, the hybrid algorithm is used as a fallback. Typically, only a small proportion of the precincts require the ‘greedy-exhaustive hybrid search’.

  1. Apply the Breadth First Search to decompose the entire set of markers into precincts such that high LD can only be observed within precincts.S=i=1nSi, andSiSj= for all ij;

  2. Within each precinctSi, setK = 1,

    • If(NiK)L, move to b, otherwise, conduct an exhaustive search over all possibleK-marker combinations. Both the candidate setPi and the target setQi isSi. If no combination ofK SNPs can cover the entire precinct, setK =K + 1, and repeat this step;

    • FindKi* such that(NiKi*)L0 and(NiKi*+1)>L0. For everyKi*-marker combination inSi, denoted as{a1,,aKi*}, letTi=m=1Ki*{am},Pi=Qi=Si\m=1Ki*({am}C(am)), and apply the greedy approach to identify a subset ofPi that is able to cover the remaining untagged markersQi. Among all the resulting tagSNP sets, we choose the smallest.

  3. Record all minimum tagSNP sets that cover the precinct. These form the complete minimum tagSNP sets{Tij:j=1,,Ji}, whereJi is the total number of such minimum tagSNP sets.

  4. Any combination of tagSNP sets identified from all disjoint precincts forms a tagSNP set for the whole setS. Suppose the size of the minimum tagSNP set(s) in each precinct isKi, then the overall size of such minimum tagSNP sets isi=1nKi, and the total number of such minimum tagSNP sets isi=1nJi.

FESTA double coverage

So far, both the greedy approach and our FESTA algorithm focus on finding a tagSNP set such that each SNP is either a tagSNP itself or is in LD with at least one of the tagSNPs. This is a criterion aimed at minimizing the number of tagSNPs selected. In reality, random genotyping failure or genotyping error on these tagSNPs can result in loss of power to identify the true signal. To be more robust against such adverse events, we evaluated a more stringent criterion requiring that every untyped SNP be in LD with at least two tagSNPs.

Our FESTA algorithm can be extended to find tagSNP sets that will have double coverage on the SNP markers considered. As always, an exhaustive search is able to find such tagSNP sets when the marker set considered is not too large. When exhaustive search is not feasible, the same greedy-exhaustive hybrid search strategy can be applied. The detailed FESTA double coverage algorithm can be found in the Supplementary Online Material. Note that in practice, it may be useful to consider double coverage only for large precincts, where the cost of losing an SNP to genotyping failure might be higher.

Further tagSNP selection considerations

Mandatory tagSNP markers

Our algorithm readily allows users to force certain SNP markers to be included in or excluded from the tagSNP set. There are several scenarios where such functionality is important. First, in candidate gene studies, previous knowledge may be available as to which SNPs are functionally important. These might include non-synonymous coding region SNPs (cSNPs) as well as SNPs located in regulatory regions. Second, in genome-wide studies, one might carry out multiple rounds of genotyping and tagSNP selection. In such cases, additional tagSNPs could be selected at each round to cover the markers not tagged by tagSNPs successfully genotyped in the previous round. In other settings, it may be useful to exclude certain SNPs from consideration as tags. For example, some SNP markers may be difficult to genotype using a particular platform.

When there are mandatory markerst1,t2,…,tr, to be included, put these markers into the tagSNP setT and remove them from the candidate set, e.g.P becomesP\i=1r{ti}. The target setQ becomesQ\i=1r({ti}C(ti)). If there are SNPsu1,u2,…,us that need to be excluded from the tagSNP set, remove them from the candidate setP, the target setQ is unchanged.

Choosing between alternative solutions

Within a densely typed SNP set, redundant tagSNPs are common, which results in multiple tagSNP sets of the same size. All of these sets are equal in the sense of minimizing the number of tagSNPs. In order to choose one set for genotyping, additional criteria can be entertained. Here are examples of such additional criteria:In Criteria 1 and 2, we try to identify tagSNP sets whose members have the strongest connections with those untagged SNPs. Since pairwiser2 between disease locus and marker loci is closely related to statistical power of detecting association, this strategy should result in increased power on average and in the worst case, respectively. These criteria are recommended in regular association study designs. The purpose of using Criterion 3 is to find a tagSNP set whose members are as independent as possible which minimizes overlap between covered SNPs of different tagSNPs and potentially increases the chance of linking to untyped SNPs. This strategy is particularly useful if one suspects the actual disease locus is not among the marker loci genotyped. To evaluate the potential of uncovering the disease-causing mutations in association studies among tagSNP sets identified by the aforementioned criteria, we conducted some empirical evaluations, summarized in the Results section.

  1. Maximize averager2 between tagSNPs and untagged SNPs they represent;

  2. Maximize the lowestr2 between tagSNPs and the untagged SNPs they connect to;

  3. Minimize the averager2 among all pairs of tagSNPs within a precinct.

Other types of criteria may be of even greater interest in practice. For example, in many genotyping technologies, some SNPs are harder to genotype than others due to the characteristics of surrounding genome sequence. We can use this information to select tagSNPs that are likely to have a high success rate and to avoid SNPs that are prone to genotyping failure.

RESULTS

In order to illustrate our proposed piecewise exhaustive search strategy, compare it with the greedy approach and explore the various characteristics of the tagSNP sets selected by our method, we applied both methods to two sets of data, the entire Chromosome 2 and five ENCODE regions (ENr112, ENr131, ENr113, ENm010 and ENm013) genotyped by the HapMap project (release 16c, June 2005). All three populations: CEU (European), YRI (Yoruban) and JPT + CHB (Japanese and Chinese) were studied. The first is in the context of a genome-wide association study and the second is similar to the situation of a candidate region study.

Chromosome-wide tagging

We have applied the greedy algorithm and FESTA to Chromosome 2 using HapMap Phase 1 genotype data (release 16c, June 2005).Table 1 (r2 threshold of 0.5) and Table S1 (r2 threshold of 0.8) summarize the results. FESTA produces less tagSNPs compared with the greedy approach in all three populations. When compared across populations, the YRI samples have about twice the amount of tagSNPs as the CEU or the JPT + CHB samples. The JPT + CHB samples have slightly less tagSNPs identified than the CEU samples. Withr2 threshold 0.5, the percentages of tagSNPs identified by our new algorithm are 21.6% in CEU, 39.3% in YRI and 20.9% in JPT + CHB samples, respectively.

Table 1

Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct

CEUYRIJPT + CHB
No. of SNPs64 80169 63057 810
    r2 ≥ 0.5
        No. of precincts11 78624 75210 248
        No. of tagSNPs (Greedy)14 38427 80412 454
        No. of tagSNPs (FESTA)13 98327 37912 108
        No. of tagSNPs (FESTA, double cover)23 64441 66820 644
    r2 ≥ 0.8
        No. of precincts23 42641 07920 178
        No. of tagSNPs (Greedy)24 30041 72921 044
        No. of tagSNPs (FESTA)24 17641 66420 963
        No. of tagSNPs (FESTA, double cover)35 82454 10131 463
CEUYRIJPT + CHB
No. of SNPs64 80169 63057 810
    r2 ≥ 0.5
        No. of precincts11 78624 75210 248
        No. of tagSNPs (Greedy)14 38427 80412 454
        No. of tagSNPs (FESTA)13 98327 37912 108
        No. of tagSNPs (FESTA, double cover)23 64441 66820 644
    r2 ≥ 0.8
        No. of precincts23 42641 07920 178
        No. of tagSNPs (Greedy)24 30041 72921 044
        No. of tagSNPs (FESTA)24 17641 66420 963
        No. of tagSNPs (FESTA, double cover)35 82454 10131 463
Table 1

Summary of Chromosome 2: size of disjoint precincts and number of SNPs and tagSNPs in each precinct

CEUYRIJPT + CHB
No. of SNPs64 80169 63057 810
    r2 ≥ 0.5
        No. of precincts11 78624 75210 248
        No. of tagSNPs (Greedy)14 38427 80412 454
        No. of tagSNPs (FESTA)13 98327 37912 108
        No. of tagSNPs (FESTA, double cover)23 64441 66820 644
    r2 ≥ 0.8
        No. of precincts23 42641 07920 178
        No. of tagSNPs (Greedy)24 30041 72921 044
        No. of tagSNPs (FESTA)24 17641 66420 963
        No. of tagSNPs (FESTA, double cover)35 82454 10131 463
CEUYRIJPT + CHB
No. of SNPs64 80169 63057 810
    r2 ≥ 0.5
        No. of precincts11 78624 75210 248
        No. of tagSNPs (Greedy)14 38427 80412 454
        No. of tagSNPs (FESTA)13 98327 37912 108
        No. of tagSNPs (FESTA, double cover)23 64441 66820 644
    r2 ≥ 0.8
        No. of precincts23 42641 07920 178
        No. of tagSNPs (Greedy)24 30041 72921 044
        No. of tagSNPs (FESTA)24 17641 66420 963
        No. of tagSNPs (FESTA, double cover)35 82454 10131 463

The size of the tagSNP set is optimal for precincts where the greedy approach indicates that one or two tagSNPs are enough to cover all the SNPs in it. Improvements over the greedy approach is only possible for the remaining precincts. In the CEU samples, there are 599 of such precincts, in which the greedy approach identified 2423 tagSNPs, and FESTA identified 2022, a 16.5% reduction. When ther2 threshold is 0.8, 154 precincts require more than two tagSNPs, as identified by the greedy approach. Among them, the greedy approach and FESTA identified 526 and 402 tagSNPs, respectively. The reduction rate is 23.6%. All the detailed results are summarized inTable 2 (r2 threshold of 0.5) and S1 (r2 threshold of 0.8). When double coverage is required, 69.1 and 45.9% more tagSNPs are needed withr2 thresholds of 0.5 and 0.8, respectively. Similar results were obtained from the YRI and JPT + CHB samples.

Table 2

Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr2 threshold of 0.5)

CEUYRIJPT+CHB
GreedyFESTAGreedyFESTAGreedyFESTA
Singletona5241524115 07915 07944164416
1517251728096809646604660
27749119241070634770
3318278355291312250
41449912710011390
5594273536030
6271836281615
717172110146
8164108116
91116344
10+73251481
Total14 38413 98327 80427 37912 45412 108
CEUYRIJPT+CHB
GreedyFESTAGreedyFESTAGreedyFESTA
Singletona5241524115 07915 07944164416
1517251728096809646604660
27749119241070634770
3318278355291312250
41449912710011390
5594273536030
6271836281615
717172110146
8164108116
91116344
10+73251481
Total14 38413 98327 80427 37912 45412 108

aSingleton means the precinct only contains one SNP. In other words, singleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than a threshold) with any other SNP in the entire set. Such an SNP, by definition, is one of the tagSNPs.

Table 2

Distributions of the size of the tagSNP sets using the greedy approach and the FESTA algorithm (withr2 threshold of 0.5)

CEUYRIJPT+CHB
GreedyFESTAGreedyFESTAGreedyFESTA
Singletona5241524115 07915 07944164416
1517251728096809646604660
27749119241070634770
3318278355291312250
41449912710011390
5594273536030
6271836281615
717172110146
8164108116
91116344
10+73251481
Total14 38413 98327 80427 37912 45412 108
CEUYRIJPT+CHB
GreedyFESTAGreedyFESTAGreedyFESTA
Singletona5241524115 07915 07944164416
1517251728096809646604660
27749119241070634770
3318278355291312250
41449912710011390
5594273536030
6271836281615
717172110146
8164108116
91116344
10+73251481
Total14 38413 98327 80427 37912 45412 108

aSingleton means the precinct only contains one SNP. In other words, singleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than a threshold) with any other SNP in the entire set. Such an SNP, by definition, is one of the tagSNPs.

Among all the non-singleton precincts in the CEU samples (6545 forr2 threshold of 0.5 and 10196 forr2 threshold of 0.8), most require only a small number of tagSNPs, so that the exhaustive search can be applied directly. Withr2 threshold of 0.5, the greedy-exhaustive hybrid approach was required for only 98 precincts or 1.5% of all precincts (11 precincts (0.1%) withr2 thershold of 0.8).

Densely typed region

A very dense SNP map was recently released by the HapMap project on the ENCODE regions. We used five such regions (ENr112, ENr131, ENr113, ENm010 and ENm013) to evaluate the performance of our algorithm. Each ENCODE regions is 500 kb in length, for the CEU samples, the average number of SNPs in these regions is 832 (ranges from 551 to 1126), corresponding to an SNP density about 1 SNP per 601 bps (1 SNP per 907 bps to 1 SNP per 444 bps for individual regions). The detailed results were summarized inTable 3. Detailed results for the YRI and JPT + CHB samples can be found in Supplementary Tables S2 and S3.

Table 3

Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)

RegionENr112ENr131ENr113ENm010ENm013
No. of SNPs8639881061539708
    r2 ≥ 0.5
        No. of precincts5578434426
        No. of singletonsa2331161611
        No. of tagSNPs (Greedy)81110716641
        No. of tagSNPs (FESTA)75105676138
        No. of tagSNPs (FESTA double cover)12818312310967
    r2 ≥ 0.8
        No. of precincts13418413112572
        No. of singletonsa6381626125
        No. of tagSNPs (Greedy)15219714213183
        No. of tagSNPs (FESTA)14619314113081
        No. of tagSNPs (FESTA, double cover)237311229204139
RegionENr112ENr131ENr113ENm010ENm013
No. of SNPs8639881061539708
    r2 ≥ 0.5
        No. of precincts5578434426
        No. of singletonsa2331161611
        No. of tagSNPs (Greedy)81110716641
        No. of tagSNPs (FESTA)75105676138
        No. of tagSNPs (FESTA double cover)12818312310967
    r2 ≥ 0.8
        No. of precincts13418413112572
        No. of singletonsa6381626125
        No. of tagSNPs (Greedy)15219714213183
        No. of tagSNPs (FESTA)14619314113081
        No. of tagSNPs (FESTA, double cover)237311229204139

aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.

Table 3

Summary of TagSNPs identified by the greedy approach, the FESTA and FESTA double coverage algorithms in the five ENCODE regions (CEU samples)

RegionENr112ENr131ENr113ENm010ENm013
No. of SNPs8639881061539708
    r2 ≥ 0.5
        No. of precincts5578434426
        No. of singletonsa2331161611
        No. of tagSNPs (Greedy)81110716641
        No. of tagSNPs (FESTA)75105676138
        No. of tagSNPs (FESTA double cover)12818312310967
    r2 ≥ 0.8
        No. of precincts13418413112572
        No. of singletonsa6381626125
        No. of tagSNPs (Greedy)15219714213183
        No. of tagSNPs (FESTA)14619314113081
        No. of tagSNPs (FESTA, double cover)237311229204139
RegionENr112ENr131ENr113ENm010ENm013
No. of SNPs8639881061539708
    r2 ≥ 0.5
        No. of precincts5578434426
        No. of singletonsa2331161611
        No. of tagSNPs (Greedy)81110716641
        No. of tagSNPs (FESTA)75105676138
        No. of tagSNPs (FESTA double cover)12818312310967
    r2 ≥ 0.8
        No. of precincts13418413112572
        No. of singletonsa6381626125
        No. of tagSNPs (Greedy)15219714213183
        No. of tagSNPs (FESTA)14619314113081
        No. of tagSNPs (FESTA, double cover)237311229204139

aSingleton refers to an SNP marker that is not in LD (pairwise LD measurer2 greater than the threshold) with any other marker in the entire set. Such a marker, by definition, is one of the tagSNPs.

In this set of densely typed SNPs, using our method withr2 threshold of 0.5, the average percentage of tagSNPs required to cover each of the five regions is 8.3% of all markers (ranges from 5.4 to 11.3%). For double coverage, on average, 76.7% more tagSNPs are required (ranges from 70.7 to 83.6%). With a more stringentr2 threshold of 0.8, the average percentage of tagSNPs required increased to 16.6% of all markers (ranges from 11.4 to 24.1%). To double cover these regions, on average, 62.9% more tagSNPs are required (ranges from 56.9 to 71.6%). For those precincts where improvement over greedy search is possible, using FESTA, the reduction rate is 17.9 and 23.0% on average for the five ENCODE regions withr2 thresholds of 0.5 and 0.8, respectively. Applying our method to YRI and JPT + CHB samples reveals similar trends (data not shown).

Additional TagSNPs for denser SNP map

With the rapid advance of genotyping technologies, progressively denser SNP maps will become available. As more refined association studies are carried out, it will be useful to select new tagSNPs to ‘fill holes’ in the initial sparse maps. With a good picking strategy for the first round of tagging, this staged approach should result in only a small-to-moderate increase in the total number of tagSNPs compared to a one-stage strategy.

To evaluate this strategy, we constructed an artificial sparse SNP map for each of the five ENCODE regions (using the CEU samples only). Specifically, we selected one in every five consecutive SNP markers. The density of this sparse map is about 1 SNP per 3kb, close to the density of the phase I HapMap. Then, three different tagSNP sets are identified using the three criteria described previously, denoted byTi,i = 1, 2, 3. Finally, we applied our approach to the full ENCODE SNP set, using each of these tagSNP sets as a seed, so as to search for additional tagSNPs to cover the previously ‘hidden’ SNP markers. The effectiveness of these tagSNP sets is evaluated by comparing the number of new tagSNPs needed to cover the ‘newly found’ SNPs. In addition to the three criteria, we also compared three other tagSNP selection strategies:Z random SNPs, assumeZ is the number of tagSNPs for the sparse map; a picket fence strategy withZ equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points); or using all original SNPs as tagSNPs. The results are summarized inTable 4 (r2 threshold of 0.5) and Table S4 (r2 threshold of 0.8) in the Supplementary Online Material. From there, one can see that when ther2 threshold is 0.5, 14.4% more tagSNPs (range from 7.0 to 20.9%) are needed to fill holes in the original map and that number is only 5.4% (range from 3.8 to 7.0%) whenr2 threshold is 0.8. The three tagSNP sets require fewer tagSNPs to cover the holes, compared with tagSNPs picked using a picket fence strategy (31.6% difference forr2 threshold of 0.5 and 21.6% difference forr2 threshold of 0.8) or picked at random (33.8% difference forr2 threshold of 0.5 and 21.0% difference forr2 threshold of 0.8).

Table 4

Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr2 threshold of 0.5)

RegionENr112ENr131ENr113ENm010ENm013
SNPs in dense mapa8639881061539708
SNPs in sparse mapb173198213108142
One-stage picking
    TagSNPs in dense map75105676138
Two-stage picking
    Max averager2 b/tags and non-tagsc85114817240
    Min lowestr2/tags and non-tagsd85115807540
    Min averager2 among tagse85117827442
Other strategies
    Random pickingf103.2137.791.471.052.0
    Picket fenceg103136947852
    Use all sparseh200241239134153
RegionENr112ENr131ENr113ENm010ENm013
SNPs in dense mapa8639881061539708
SNPs in sparse mapb173198213108142
One-stage picking
    TagSNPs in dense map75105676138
Two-stage picking
    Max averager2 b/tags and non-tagsc85114817240
    Min lowestr2/tags and non-tagsd85115807540
    Min averager2 among tagse85117827442
Other strategies
    Random pickingf103.2137.791.471.052.0
    Picket fenceg103136947852
    Use all sparseh200241239134153

aDense map means the densely typed SNP sets obtained in the ENCODE region from the HapMap websiteAuthor Webpage (CEU samples, release 16c, June 2005). All tagSNPs in this table were identified using our new algorithm.

bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.

c–eTotal number of tagSNPs needed to cover the SNPs in the dense map using the tagSNPs identified using different criteria on the sparse map as seeds. Criterion 1, maximize the averager2 between tagSNP and SNPs that it connected to; Criterion 2, minimize the averager2 between tagSNP and SNPs that it connected to; Criterion 3, minimize the averager2 among all tagSNPs.

fTotal number of tagSNPs needed to cover the SNPs in the dense map usingT random SNPs in the sparse map as seeds.T is the number of tagSNPs identified by our algorithm on the sparse map. The number is obtained by repeating this procedure 100 times and taking the average.

gTotal number of tagSNPs needed to cover the SNPs in the dense map usingT equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points) in the sparse map as seeds.

hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.

Table 4

Performance comparison of tagSNP sets selected by three different criteria in terms of coverage on denser SNP maps (CEU samples, withr2 threshold of 0.5)

RegionENr112ENr131ENr113ENm010ENm013
SNPs in dense mapa8639881061539708
SNPs in sparse mapb173198213108142
One-stage picking
    TagSNPs in dense map75105676138
Two-stage picking
    Max averager2 b/tags and non-tagsc85114817240
    Min lowestr2/tags and non-tagsd85115807540
    Min averager2 among tagse85117827442
Other strategies
    Random pickingf103.2137.791.471.052.0
    Picket fenceg103136947852
    Use all sparseh200241239134153
RegionENr112ENr131ENr113ENm010ENm013
SNPs in dense mapa8639881061539708
SNPs in sparse mapb173198213108142
One-stage picking
    TagSNPs in dense map75105676138
Two-stage picking
    Max averager2 b/tags and non-tagsc85114817240
    Min lowestr2/tags and non-tagsd85115807540
    Min averager2 among tagse85117827442
Other strategies
    Random pickingf103.2137.791.471.052.0
    Picket fenceg103136947852
    Use all sparseh200241239134153

aDense map means the densely typed SNP sets obtained in the ENCODE region from the HapMap websiteAuthor Webpage (CEU samples, release 16c, June 2005). All tagSNPs in this table were identified using our new algorithm.

bSparse map means the SNP sets obtained by selecting the first SNP in every five consecutive SNPs in the dense maps.

c–eTotal number of tagSNPs needed to cover the SNPs in the dense map using the tagSNPs identified using different criteria on the sparse map as seeds. Criterion 1, maximize the averager2 between tagSNP and SNPs that it connected to; Criterion 2, minimize the averager2 between tagSNP and SNPs that it connected to; Criterion 3, minimize the averager2 among all tagSNPs.

fTotal number of tagSNPs needed to cover the SNPs in the dense map usingT random SNPs in the sparse map as seeds.T is the number of tagSNPs identified by our algorithm on the sparse map. The number is obtained by repeating this procedure 100 times and taking the average.

gTotal number of tagSNPs needed to cover the SNPs in the dense map usingT equally spaced SNPs (where we place equally spaced grid points along the interval and then select markers that are closest to these grid points) in the sparse map as seeds.

hTotal number of tagSNPs needed to cover the SNPs in the dense map using all the SNPs in the sparse map as seeds.

DISCUSSION

In this manuscript, we developed an efficient computational framework for tagSNP selection using the pairwiser2 criterion. Our algorithm is able to identify smaller tagSNP sets than the greedy approach (Carlsonet al., 2004). Although the improvement is modest, our algorithm always outperforms the greedy approach in terms of the tagSNP size under exactly the same pairwise LD criterion. Using both chromosome-wide data and densely typed ENCODE region data from the HapMap Project, we illustrated the utility of our approach and showed savings increase in more densely typed regions and inside large LD ‘blocks’. Computational time required by FESTA is quite reasonable and can be tailored to available computing resources as needed. Under the default setting, withr2 threshold of 0.5, FESTA takes ∼1–15 min to run on the five ENCODE regions, and ∼120 min on entire Chromosome 2 (withr2 threshold of 0.8, ∼0.1–1.5 min on the five ENCODE regions, and ∼24 min on Chromosome 2) using a 2.8 GHz Pentium class computer server. Another important advance is the ability of our method to identify multiple equivalent tagSNP sets and to use additional criteria to choose an optimal tagSNP set for typing. This feature offers flexibility in picking tagSNPs which is desirable when designing real association studies.

The key improvement of FESTA over the greedy approach is the ‘precinct partitioning’ step which enables the exhaustive search to be carried out very rapidly in most of the partitioned precincts. This is similar in spirit to the idea of ‘partition-ligation’ algorithm proposed byNiuet al. (2002) for haplotype inference.

Many of the existing tagSNP picking algorithms aim to capture haplotype diversity using the reduced set of markers (called haplotype tagging SNPs, htSNPs) such as BEST (Sebastianiet al., 2003). They work well when a small number of common haplotypes exist (typically true in the vicinity of a candidate gene) but these approaches often require the knowledge of complete haplotype phase and the boundary of the haplotype blocks. On the other hand, tagSNP selection usingr2 criteria does not require knowledge of block boundaries and can easily be applied to cover the whole chromosome. Recently, multiple-marker tagging strategies (Stram, 2005; P.I. de Bakker, 2005,Author Webpage) in which multiple tagSNPs can be used to represent each untagged SNPs have been proposed. While these methods further reduce the number of tagSNPs selected, this ‘aggressive’ approach may be sensitive to random genotyping failures.

Our approach is amenable to further computational improvements. For example, parallel programming could be used to search for tagSNPs in separate precincts, further speeding up the computation.

FESTA is freely available and can be downloaded atAuthor Webpage

We are grateful to Drs Mike Boehnke, Randy Pruim and the three anonymous reviewers for critical comments on an early version of this manuscript. This work is partially supported by NIH RO1-HG002651-01 to G.A.

Conflict of Interest: none declared.

REFERENCES

Avi-Itzhak
H.I.
et al.
,
Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity
Pac. Symp. Biocomput.
,
2003
(pg.
466
-
477
)
Cardon
L.R.
Abecasis
G.R.
,
Using haplotype blocks to map human complex trait loci
Trends Genet.
,
2003
, vol.
19
(pg.
135
-
140
)
Carlson
C.S.
et al.
,
Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium
Am. J. Hum. Genet.
,
2004
, vol.
74
(pg.
106
-
120
)
Chapman
J.M.
et al.
,
Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power
Hum. Hered.
,
2003
, vol.
56
pg.
1831
Collins
F.S.
et al.
,
Variations on a theme: cataloging human DNA sequence variation
Science
,
1997
, vol.
278
(pg.
1580
-
1581
)
Cormen
T.H.
et al.
Introduction to algorithms
,
2001
2nd edition
Cambridge
MIT Press
Daly
M.J.
et al.
,
High-resolution haplotype structure in the human genome
Nat. Genet.
,
2001
, vol.
29
(pg.
229
-
232
)
Dawson
E.
et al.
,
A first generation slinkage disequilibrium map of human chromosome 22
Nature
,
2002
, vol.
418
(pg.
544
-
548
)
Devlin
B.
Risch
N.
,
A comparison of linkage disequilibrium measures for fine-scale mapping
Genomics
,
1995
, vol.
29
(pg.
311
-
322
)
Gabriel
S.B.
et al.
,
The structure of haplotype blocks in the human genome
Science
,
2002
, vol.
296
(pg.
2225
-
2229
)
Goldstein
D.B.
et al.
,
Genome scans and candidate gene approaches in the study of common diseases and variable drug responses
Trends Genet.
,
2003
, vol.
19
(pg.
615
-
622
)
Hampe
J.
et al.
,
Entropy-based SNP selection for genetic association studies
Hum Genet.
,
2003
, vol.
114
(pg.
36
-
43
)
Hill
W.G.
,
Estimation of linkage disequilibrium in randomly mating populations
Heredity
,
1974
, vol.
33
(pg.
229
-
239
)
Hill
W.G.
Robertson
A.
,
The effects of inbreeding at loci with heterozygote advantage
Genetics
,
1968
, vol.
60
(pg.
615
-
628
)
Halldórsson
B.V.
et al.
,
Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies
Genome Res.
,
2004
, vol.
14
(pg.
1633
-
1640
)
Johnson
G.C.
et al.
,
Haplotype tagging for the identification of common disease genes
Nat. Genet.
,
2001
, vol.
29
(pg.
233
-
237
)
Jeffreys
A.J.
et al.
,
Intensely punctuate meiotic recombination in the class II region of the major of histocompatibility complex
Nat. Genet.
,
2001
, vol.
29
(pg.
217
-
222
)
Ke
X.
Cardon
L.R.
,
Efficient selective screening of haplotype tag SNPs
Bioinformatics
,
2003
, vol.
19
(pg.
287
-
288
)
Lin
Z.
Altman
R.B.
,
Finding haplotype tagging SNPs by use of principal components analysis
Am. J. Hum. Genet.
,
2004
, vol.
75
(pg.
850
-
861
)
Meng
Z.
et al.
,
Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes
Am. J. Hum. Genet.
,
2003
, vol.
73
(pg.
115
-
130
)
Niu
T.
et al.
,
Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms
Am. J. Hum. Genet.
,
2002
, vol.
70
(pg.
157
-
169
)
Patil
N.
et al.
,
Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21
Science
,
2001
, vol.
294
(pg.
1719
-
1723
)
Pritchard
J.K.
Przeworski
M.
,
Linkage disequilibrium in humans: models and data
Am. J. Hum. Genet.
,
2001
, vol.
69
(pg.
1
-
14
)
Reich
D.E.
et al.
,
Linkage disequilibrium in the human genome
Nature
,
2001
, vol.
411
(pg.
199
-
204
)
Rinaldo
A.
et al.
,
Characterization of multilocus linkage disequilibrium
Genet. Epidemiol.
,
2005
, vol.
28
(pg.
193
-
206
)
Sachidanandam
R.
et al.
International SNP Map Working Group
,
A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms
Nature
,
2001
, vol.
409
(pg.
928
-
933
)
Sebastiani
P.
et al.
,
Minimal haplotype tagging
Proc. Natl Acad. Sci. USA
,
2003
, vol.
100
(pg.
9900
-
9905
)
Stram
D.O.
et al.
,
Choosing haplotype-tagging SNPs based on unphased genotype data using preliminary sample of unrelated subjects with an example from the multiethic cohort study
Hum. Hered.
,
2003
, vol.
55
(pg.
27
-
36
)
Stram
D.O.
,
Software for tag single nucleotide polymorphism selection
Hum. Genomics
,
2005
, vol.
2
(pg.
144
-
151
)
The International HapMap Consortium
,
The International HapMap Project
Nature
,
2003
, vol.
426
(pg.
789
-
796
)
Zhang
K.
et al.
,
A dynamic programming algorithm for haplotype partitioning
Proc. Natl Acad. Sci. USA
,
2002
, vol.
99
(pg.
7335
-
7339
)
Zhang
K.
Jin
L.
,
HaploBlockFinder: haplotype block analysis
Bioinformatics
,
2003
, vol.
19
(pg.
1300
-
1301
)

Author notes

Associate Editor: Martin Bishop

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email:[email protected]
Advertisement

Citations

Views

972

Altmetric

Metrics
Total Views972
696Pageviews
276PDF Downloads
Since 12/1/2016
Month:Total Views:
December 20161
January 20172
March 201712
April 20171
May 20174
June 20173
July 20178
August 20173
September 20172
October 20171
November 20178
December 201720
January 20185
February 201811
March 201826
April 201812
May 20185
June 201817
July 201814
August 201811
September 20183
October 20183
November 20188
December 201810
January 20199
February 201936
March 20198
April 201925
May 20198
June 201910
July 201910
August 201916
September 20199
October 201911
November 201911
December 20199
January 202021
February 20206
March 202016
April 20208
May 20208
June 20206
July 20207
August 20205
September 20204
October 202012
November 20209
December 20203
January 20219
February 20215
March 202117
April 20214
May 202115
June 20219
July 202111
August 202110
September 20216
October 202111
November 20216
December 20216
January 20226
February 202210
March 20229
April 20226
May 20224
June 20227
July 20226
August 202214
September 202213
October 202224
November 20228
December 20228
January 20234
February 20239
March 20233
April 20239
May 20233
June 20233
July 20237
August 202312
September 20236
October 20235
November 202311
December 20238
January 202423
February 202422
March 202410
April 202415
May 202415
June 202411
July 202422
August 202420
September 20249
October 202413
November 20249
December 20244
January 202513
February 20256
March 202518
April 20251
Citations
Powered by Dimensions
40Web of Science
Altmetrics
×

Email alerts

New journal issues alert

To set up an email alert, pleasesign in to your personal account, orregister

Sign in

Personal account

  • Sign in with email/username & password
  • Get email alerts
  • Save searches
  • Purchase content
  • Activate your purchase/trial code
  • Add your ORCID iD

Journal article activity alert

To set up an email alert, pleasesign in to your personal account, orregister

Sign in

Personal account

  • Sign in with email/username & password
  • Get email alerts
  • Save searches
  • Purchase content
  • Activate your purchase/trial code
  • Add your ORCID iD
Having trouble contacting the network. Please try again in a moment or two.
Oxford University Press
Journals Career Network
Advertisement
Advertisement
Advertisement
Bioinformatics
  • Online ISSN 1367-4811
  • Copyright © 2025 Oxford University Press
Close
Close
This Feature Is Available To Subscribers Only

Sign In orCreate an Account

Close

This PDF is available to Subscribers Only

View Article Abstract & Purchase Options

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

Close

[8]ページ先頭

©2009-2025 Movatter.jp