IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from
B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]
IMPUTE2 also includes features that were introduced in other publications, which you can find here.
The figure below shows the most common scenario in which imputation is used: unobserved genotypes (red question marks) in a set of study individuals are imputed (or predicted) using a set of reference haplotypes and genotypes from a SNP chip.
IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.
You should begin by downloading the program from here. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.
Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on
When you have learned the basic functionality of the program, you can use several features of this website to prepare your own analysis:
We have just released IMPUTE v2.3.2. This version is a very minor update that adds columns to the info files reporting the two alleles at each imputed variant.
We have just released IMPUTE v2.3.1. This version fixes a bug in panel-merging functionality that caused variants seen in one of two reference panels to be imputed with a fixed allele (non-ref allele in Panel 0, ref allele in Panel 1). If you have used these options, we recommend re-running your imputation.
In addition, we have released a new version of the 1000 Genomes Phase 1 haplotypes
We have released a new version of the 1000 Genomes Phase 1 haplotypes
We have just released IMPUTE v2.3.0, which includes a number of new features and minor bug fixes. One valuable new function is a simple and robust approach for merging reference panels; for example, it is easy to combine 1,000 Genomes haplotypes with population-specific sequence data to capture the strength of both reference sets. We have also written detailed documentation for the concordance tables printed at the end of most IMPUTE2 runs.
We recently published an article called "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing" in Nature Genetics. This paper describes a strategy ("pre-phasing") for efficient genotype imputation with large reference panels. By reducing the computational burden of imputation, pre-phasing makes imputation-based studies feasible for groups with limited computing power, and it also makes it easier to re-impute existing GWAS datasets as more informative reference panels become available. You can learn more about pre-phasing with IMPUTE2 here.
In March 2012, the 1,000 Genomes Project released a powerful reference panel known as "Phase I version 3". In August 2012, we modified this panel by excluding variants with only one copy of the minor allele (singletons) across all 1,092 individuals. Singleton variants are difficult to impute, yet they make up ~20% of all variants in the reference panel; removing them makes imputation faster without hurting the power for association mapping. You can download either the original reference panel or the modified version (which is labeled "macGT1" for "minor allele count greater than one") here.
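The minor-allele-count filter described above can be illustrated with a short sketch. This is not the script used to build the macGT1 files; it assumes haplotypes are held in memory as one row of 0/1 alleles per variant, and the function names are hypothetical.

```python
def minor_allele_count(hap_row):
    """Count copies of the less common allele in one row of 0/1 alleles."""
    ones = sum(hap_row)
    return min(ones, len(hap_row) - ones)


def filter_mac_gt1(hap_rows):
    """Keep only variants whose minor allele count is greater than one."""
    return [row for row in hap_rows if minor_allele_count(row) > 1]


# Toy panel: four variants (rows) observed on four haplotypes (columns).
haps = [
    [0, 0, 0, 1],  # singleton: one copy of allele 1
    [0, 1, 1, 0],  # minor allele count of 2, so it is kept
    [1, 1, 1, 0],  # singleton: one copy of allele 0
    [0, 0, 0, 0],  # monomorphic: minor allele count of 0
]
kept = filter_mac_gt1(haps)
print(kept)
```

Note that a strict "greater than one" rule also drops monomorphic sites, as the last toy row shows.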
We published an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. This paper describes our strategy for achieving high accuracy with ancestrally diverse reference panels, especially at low-frequency variants and in admixed study cohorts: we supply a cosmopolitan set of reference haplotypes to IMPUTE2, which can automatically find the most useful ones for each study individual with the help of the tuning parameter
IMPUTE2's pre-phasing approach now works with phased haplotypes from SHAPEIT, a highly accurate phasing algorithm that can handle mixtures of unrelateds, duos, and trios. Details are available here. We highly recommend using SHAPEIT to infer the haplotypes underlying your study genotypes, then passing these to IMPUTE2 for imputation as shown in the second step of this example.
IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.
Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to our mail list.
The latest software release is v2.3.2. We support only the most recent version.
Platform | File |
---|---|
Linux (x86_64) Static Executable | impute_v2.3.2_x86_64_static.tgz |
Linux (x86_64) Dynamic Executable | impute_v2.3.2_x86_64_dynamic.tgz |
Mac OSX Intel | impute_v2.3.2_MacOSX_Intel.tgz |
Windows MS-DOS (Intel) | impute_v2.3.1_Windows.tgz (coming soon) |
Solaris 5.10 | impute_v2.3.2_Solaris5.10.tar.gz (coming soon) |
To unpack the files on a Linux computer, use a command like this:
tar -zxvf impute_v2.X.Y_i386.tgz
(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files here.
IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset.
Reference panel | Genome build | Release date | Notes |
---|---|---|---|
1000 Genomes Phase 3 | b37 | October 2014 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | June 2014 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | Dec 2013 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | Sep 2013 | |
1000 Genomes Phase I integrated variant set | b37 | Mar 2012 | Includes chrX; updated 24 Aug 2012 |
1000 Genomes Phase I (interim) | b37 | Jun 2011 | Includes chrX; updated 19 Apr 2012 |
1000 Genomes (2010 interim) | b37 | Dec 2010 | |
1000 Genomes Pilot + HapMap 3 | b36 | Jun 2010 / Feb 2009 | |
1000 Genomes Pilot | b36 | Jun 2010 | |
HapMap 3 (release #2) | b36 | Feb 2009 | Includes chrX |
HapMap 2 (release #24) | b36 | Oct 2008 | |
HapMap 2 (release #22) | b36 | Jan 2008 | |
HapMap 2 (release #21) | b35 | Jul 2006 | |
Human genetic variation resources, like those produced by HapMap 3 and the 1,000 Genomes Project, capture a broad cross-section of human genetic diversity: detailed variation data have now been collected from a variety of sampling locations in Africa, Asia, Europe, and the Americas. Large sequencing projects are actively expanding these datasets to include additional populations and deeper sampling within populations. These public databases provide powerful reference panels for genotype imputation studies.
In this context, one important question is how to choose a reference panel that will produce high imputation accuracy in a population of interest. The answer is seldom obvious because human populations have experienced complex demographic histories with many migration and mixture events. Consequently, it can be hard to decide which reference haplotypes should be used in a particular study.
We have proposed a simple and universal solution to this problem: we provide all available reference haplotypes to IMPUTE2, then let the software choose a "custom" reference panel for each individual to be imputed. There are several advantages to this approach:
There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:
As explained above, we believe that the best way to use IMPUTE2 with modern reference panels is to provide all available haplotypes to the program and let it choose which ones to use. Here, we explain how this approach works.
IMPUTE2 does not use population labels or other genome-wide measures of relatedness between individuals, either for the reference haplotypes or the individuals being imputed. Instead, it looks for reference haplotypes that share high sequence identity with the haplotypes of a particular study individual. These haplotypes constitute a "custom" reference panel that can be used to impute missing genotypes in the individual of interest.
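The idea of picking reference haplotypes by sequence identity can be illustrated with a toy sketch. This is a deliberate simplification and not IMPUTE2's actual selection procedure: it ranks whole reference haplotypes by Hamming distance to one study haplotype at shared sites and keeps the closest `k`. The function names and the parameter `k` are hypothetical, and each haplotype is stored here as one list of 0/1 alleles across SNPs.

```python
def hamming_distance(hap_a, hap_b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))


def choose_custom_panel(study_hap, reference_haps, k):
    """Return indices of the k reference haplotypes closest to study_hap.

    Ties are broken by original order because sorted() is stable.
    """
    ranked = sorted(
        range(len(reference_haps)),
        key=lambda i: hamming_distance(study_hap, reference_haps[i]),
    )
    return ranked[:k]


study_hap = [0, 1, 1, 0, 1]
reference_haps = [
    [0, 1, 1, 0, 1],  # identical to the study haplotype
    [1, 0, 0, 1, 0],  # differs at every site
    [0, 1, 1, 0, 0],  # differs at one site
    [0, 0, 1, 0, 1],  # differs at one site
]
panel = choose_custom_panel(study_hap, reference_haps, k=2)
print(panel)
```

No population labels appear anywhere in the sketch: only allele sharing with the study haplotype determines which reference haplotypes are kept.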
This process is largely insensitive to the ancestral composition of the reference panel: as long as the panel contains haplotypes that share segments of recent common ancestry with individuals in a study, IMPUTE2 can find the shared segments and use them to impute missing alleles. Consequently, it can also include other kinds of haplotypes:
Expert users will note that the model underlying IMPUTE2 is formally designed to represent genetic variation in a single population. This might imply that the method would have trouble using reference panels that include populations with different linkage disequilibrium patterns, nucleotide diversity levels, and allele frequency spectra. However, we have found that IMPUTE2 is extremely adaptable: it can find segments of shared ancestry in multi-population reference panels despite its simple model of human populations, and it is largely robust to changes in its model parameters. Imputation accuracy might theoretically be improved by more detailed modeling of population relationships (for example, the population labels that IMPUTE2 ignores might sometimes be informative), but we believe that our approach captures most of the potential accuracy in an efficient way.
We published our work supporting these ideas in an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. Please cite this paper and the original IMPUTE2 paper when using IMPUTE2 with multi-population reference panels like those from the 1,000 Genomes Project.
This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.
Run type | Description |
---|---|
Imputation with one phased reference panel | Basic scenario in which most people will use IMPUTE2. |
Imputation with one phased reference panel (pre-phasing) | As above, but with pre-phasing functionality to speed up the analysis. |
Imputation with one phased reference panel (chromosome X) | Basic imputation scenario applied to human chromosome X, which requires special program options. |
Imputation with one phased reference panel (plus variant filtering) | Basic imputation scenario with flexible filtering of reference panel variants. |
Imputation with one unphased reference panel | Basic imputation scenario adapted to unphased reference genotypes. |
Imputation with two phased reference panels | Extended functionality for imputing from multiple reference panels defined on different sets of variants. |
Imputation with two phased reference panels (merge reference panels) | Merge reference panels defined on different sets of variants and use combined panel for imputation. |
Imputation with one phased and one unphased reference panel | Specialized method for combining reference panels of different types. |
Imputation with one phased and one unphased reference panel, with additional options | As above, but illustrating a variety of options that can be used to customize the behavior of IMPUTE2. |
Phasing | Methodology for inferring haplotypes from unphased genotypes. |
Phasing with a reference panel | Phasing analysis aided by reference haplotypes. |
All of the data files in the example commands below are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "
Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit Enter to start the run.)
Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.
Most of the examples below include the string "
This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
This is the most common genotype imputation scenario: we want to use a panel of reference haplotypes to impute SNPs that were not typed in a study. Here, we show how to perform this task via
The following commands show how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
./impute2 \
./impute2 \
./Example/chrX/example.chrX.map \
./Example/chrX/example.chrX.study.sample \
./impute2 \
This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis on chromosome X, which requires special treatment due to the hemizygosity of males. (This example and the files in our download packages focus on the non-pseudoautosomal part of chromosome X.)
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
Among human chromosomes, chromosome X is unique in that it is diploid (two copies) in females but hemizygous (one copy) in males. To deal with chromosome X data, IMPUTE2 requires that you use the flag and make some small changes to the input file formats.
ID_1 ID_2 missing
0 0 0
INDIV1 INDIV1 0.0
INDIV2 INDIV2 0.0
INDIV3 INDIV3 0.0
0 1 1 - 0 -
0 0 1 - 1 -
1 0 0 - 1 -
1 1 0 - 1 -
0 0 1 - 0 -
0 1 1 0
0 0 1 1
1 0 0 1
1 1 0 1
0 0 1 0
This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis after flexibly removing a subset of sites from the reference panel.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while accounting for the phase uncertainty. To use an unphased reference panel, simply replace the
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
It is sometimes helpful to use multiple reference panels to impute genotypes in a single study. For example, we previously recommended combining reference haplotypes from the 1,000 Genomes Pilot Project and HapMap 3: the first set provided extensive coverage of polymorphisms in the genome, while the second set provided greater sample size at a subset of SNPs. We no longer recommend that you use this hybrid reference panel because the 1,000 Genomes Project has generated even richer reference sets (which you can download here), but some investigators may have additional reference data that could be used in this way.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
Many investigators have access to multiple reference panels that could inform their imputation analyses. For example, they might want to supplement the 1,000 Genomes haplotypes (which can be downloaded here) with dedicated sequencing data from a study population.
If you have two panels that have been phased and put into IMPUTE2's reference format (legend/haplotype file pairs), you can ask the program to merge them internally and impute your study genotypes by entering the following command, which uses example data that come with the program download:
./impute2 \
Sometimes it is useful to combine a phased reference panel with an unphased reference panel when imputing genotypes in a study. For example,
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
Here we perform the same basic analysis as in this example, but we use a number of additional options to modify the behavior of IMPUTE2.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.
Here, we extend a basic phasing analysis to incorporate a phased reference panel. Population-based phasing methods work by pooling linkage disequilibrium information across individuals, so adding a panel of high-quality haplotypes can improve phasing accuracy.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.
These links explain the command-line arguments that can be used to control IMPUTE2.
Option type | Description |
---|---|
Required arguments | The program will not run if these are not supplied. |
Input file options | A list of possible input files, with formatting requirements. |
Output file options | Naming conventions and options for controlling format of output files. |
Basic options | Options for controlling how the program processes input data. |
Strand alignment options | Options for aligning allele coding across data files. |
Filtering options | Options for controlling the filters that get applied to input data. |
MCMC options | Options for controlling the MCMC algorithm. |
Pre-phasing options | Options that facilitate pre-phasing and subsequent imputation. |
Panel merging options | Options for merging a pair of reference panels. |
Chromosome X options | Options for analyzing chromosome X data. |
Expert options | Options to be used by experts only. |
This table shows the input arguments that you must supply in order for IMPUTE2 to run. The program will not do anything useful unless you also supply other input options and/or data files.
Flag | Default | Description |
---|---|---|
-g <file> | none | File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the |
-m <file> | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our |
-int <lower> <upper> | none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the |
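Because an analysis interval is required and regions larger than 7 Mb need an extra flag, whole-chromosome analyses are usually broken into chunks. The sketch below generates non-overlapping `<lower> <upper>` pairs that could be passed to `-int`, one pair per run. The helper name and the 5 Mb default chunk size are assumptions chosen for illustration, not program defaults.

```python
def make_intervals(start, end, chunk_size=5_000_000):
    """Split [start, end] (base pairs, inclusive) into consecutive chunks.

    Each chunk spans at most chunk_size base pairs; the last may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


# Print the -int arguments for a hypothetical 12 Mb region.
for lower, upper in make_intervals(1, 12_000_000):
    print(f"-int {lower} {upper}")
```

With a 5 Mb chunk and the default 250 kb buffer on each side, every run stays below the 7 Mb limit.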
This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems).
Flag | Default | Description |
---|---|---|
-g <file> | none | File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the |
-m <file> | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our |
-h <file 1> <file 2> | none | File of known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each In IMPUTE2, it is possible to specify two |
-l <file 1> <file 2> | none | Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). We provide legend files for data from the HapMap Project and the 1,000 Genomes Project in our When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order, i.e., the file with more SNPs comes first. |
-g_ref <file> | none | File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file). |
-known_haps_g <file> | none | File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's If your study dataset is fully phased, you can replace the The |
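Since SNPs are identified internally by base pair position, it can be worth checking a legend file for duplicate positions before a run. The sketch below parses the four-column legend layout described in the table (rsID, position, allele 0, allele 1, with one header line) and reports repeated positions. The function names and the rsIDs in the toy input are made up for illustration.

```python
from collections import Counter


def parse_legend(text):
    """Parse whitespace-delimited legend text into (rsid, position, a0, a1) tuples.

    The first non-blank line is treated as the header and skipped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    for line in lines[1:]:
        rsid, position, a0, a1 = line.split()[:4]
        records.append((rsid, int(position), a0, a1))
    return records


def duplicate_positions(records):
    """Return the sorted base pair positions that occur more than once."""
    counts = Counter(position for _, position, _, _ in records)
    return sorted(pos for pos, n in counts.items() if n > 1)


legend_text = """rsID position a0 a1
rs1 1000 A G
rs2 2000 C T
rs3 2000 C A
"""
records = parse_legend(legend_text)
print(duplicate_positions(records))
```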
The options in this table control the format and naming conventions of output files printed by IMPUTE2.
Flag | Default | Description |
---|---|---|
Name of main output file. Follows the same format as the | ||
Name of SNP-wise information file with one line per SNP and a single header line at the beginning. This file always contains the following columns (header tags shown in parentheses): 1. SNP identifier from 2. rsID (rs_id) 3. base pair position (position) 4. expected frequency of allele coded '1' in the -o file (exp_freq_a1) 5. measure of the observed statistical information associated with the allele frequency estimate (info) [details] 6. average certainty of best-guess genotypes (certainty) 7. internal "type" assigned to SNP (type) Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX. IMPUTE2 assigns every SNP an internal "type" which reflects the combination of input datasets that include data for that SNP; here, X gives the type, which takes values in For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes and the best-guess imputed genotypes, where the input genotypes at that SNP have been masked internally and then imputed as if the SNP were of type X; similarly, r2_typeX is the squared correlation between input and masked/imputed genotypes at a SNP. The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns are useful for post-hoc quality control; we will soon explain how we use them in our section on Best Practices for Imputation. | ||
Name of log file that records a summary of the screen output. | ||
Name of file that records warnings generated by IMPUTE2. | ||
"Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., " | ||
Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large. | ||
3 | Specifies the number of decimal places to use for reporting genotype probabilities in the main output file. | |
Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file. | ||
Suppresses printing of per-sample quality control metrics file. The default is to print a file named " | ||
IMPUTE2 always implicitly phases the study genotypes ( In addition to this "best-guess" haplotype file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named " As illustrated by our example commands, it is possible to use the | ||
"Predict Genotyped SNPs": Tells the program to replace the input genotypes from the | ||
Unlike This is an appealing option that will "fill in" sporadically missing genotypes in your input data. However, it is possible that this could cause subtle problems in downstream association testing. We therefore suggest that you use caution when applying this option. |
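The concordance and squared-correlation columns described in the table above compare input genotypes with genotypes that were masked and then imputed. The sketch below shows one plain reading of those two quantities for a single SNP, with genotypes coded as 0, 1, or 2 copies of an allele. It is an assumption-laden illustration, not IMPUTE2's internal calculation, and the function names are hypothetical.

```python
def concordance(input_genos, imputed_genos):
    """Fraction of individuals whose imputed best-guess genotype matches the input."""
    matches = sum(a == b for a, b in zip(input_genos, imputed_genos))
    return matches / len(input_genos)


def squared_correlation(x, y):
    """Squared Pearson correlation; returns None if either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return (cov * cov) / (var_x * var_y)


# Toy data for one SNP in six individuals; one genotype is imputed incorrectly.
input_genos = [0, 1, 2, 1, 0, 2]
imputed_genos = [0, 1, 2, 0, 0, 2]
print(concordance(input_genos, imputed_genos))
print(squared_correlation(input_genos, imputed_genos))
```

Concordance can look high at rare variants simply because most individuals are homozygous for the common allele, which is one reason a correlation-based measure is reported alongside it.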
IMPUTE2 reports an information metric in the fifth column of its
Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).
Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
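Post-imputation filtering on the info metric can be sketched as follows. The example locates the `info` and `rs_id` columns by their header tags and keeps SNPs at or above a chosen cutoff. The 0.3 cutoff is only one of the values mentioned above, the function name is hypothetical, and the first header tag (`col1`) and all data values in the toy input are placeholders.

```python
def filter_by_info(info_text, threshold):
    """Return rs_id values for SNPs whose info metric is >= threshold."""
    lines = [ln for ln in info_text.splitlines() if ln.strip()]
    header = lines[0].split()
    info_col = header.index("info")
    rs_col = header.index("rs_id")
    kept = []
    for line in lines[1:]:
        fields = line.split()
        if float(fields[info_col]) >= threshold:
            kept.append(fields[rs_col])
    return kept


# Toy info file; "col1" is a placeholder tag for the first column.
info_text = """col1 rs_id position exp_freq_a1 info certainty type
--- rsA 1000 0.250 0.950 0.990 0
--- rsB 2000 0.010 0.210 0.850 0
--- rsC 3000 0.000 -1 1.000 0
"""
print(filter_by_info(info_text, 0.3))
```

SNPs with an undefined metric (reported as -1) fall below any positive cutoff and are dropped by the same comparison.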
We define our info metric and compare it against other metrics in a
These options control some basic processing that the program does to prepare input data for inference.
Flag | Default | Description |
---|---|---|
none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the | |
250 kb | Length of buffer region to include on each side of the analysis interval specified by the -int option. SNPs in the buffer regions inform the inference but do not appear in output files (unless you activate the Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. Larger buffers may improve accuracy for low-frequency variants (since such variants tend to reside on long haplotype backgrounds) at the cost of longer running times. | |
Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here. | ||
Tells the program to include SNPs from the | ||
20000 | "Effective size" of the population (commonly denoted asNe in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates thatIMPUTE2 uses to guide its model of linkage disequilibrium patterns. When most imputation runs were conducted with reference panels from HapMap Phase 2, we suggested values of11418 for imputation from HapMap CEU,17469 for YRI, and14269 for CHB+JPT. Modern imputation analyses typically involve reference panels with greater ancestral diversity, which can make it hard to determine the "ideal" | |
0.9 | Threshold for calling genotypes in the-g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing. NOTE: This threshold applies only toinput genotypes. If you want to apply a calling threshold toIMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing softwareSNPTEST for better suggestions. | |
# of indiv in | Number of individuals from the-g file to include in the analysis. For example, to impute only the first five individuals, set | |
Print detailed output about the progress of imputation. By default, IMPUTE2 prints only the number of the current MCMC iteration when performing imputation, but this flag tells it to print more detailed updates. |
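As a rough illustration of the genotype-calling rule described in the table above (threshold 0.9 by default), the sketch below applies that rule to one individual at one SNP. This is not code from IMPUTE2: the function name `call_genotype` is hypothetical, and we assume the three input probabilities are listed in the order homozygote, heterozygote, homozygote.

```python
from typing import Optional, Sequence


def call_genotype(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Return the index of the most probable genotype, or None for a missing call.

    Assumption: probs holds three genotype probabilities for one individual
    at one SNP (hypothetical ordering: hom, het, hom).
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Keep the call only if the top probability strictly exceeds the threshold;
    # otherwise the genotype is treated as missing.
    return best if probs[best] > threshold else None


if __name__ == "__main__":
    print(call_genotype([0.02, 0.95, 0.03]))  # 1
    print(call_genotype([0.50, 0.45, 0.05]))  # None
```

Note that this rule is described only for input genotypes; as stated above, hard-calling imputed output in this way is usually not a good idea.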
In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you.
NOTE: IMPUTE2 will automatically align the strand between panels whenever it can do so unambiguously; e.g., flipping A/C in Panel 2 to match G/T in the reference. The options below pertain to variants where this is not possible, e.g. because an A/T SNP cannot be aligned by label alone.
NOTE: We currently assume that all phased reference files have already been aligned to the '+' strand of the human genome reference sequence, which is true of
Flag | Default | Description |
---|---|---|
none | File showing the strand orientation of the SNP allele codings in the The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). We provide model strand files in the Example/ directory that comes with the software download. | |
none | Same as | |
Activates the program's internal strand alignment procedure for the NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to align the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others. NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The best way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured. | ||
Similar to NOTE: Just as -align_by_maf_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over aligning the strand by MAF. NOTE: As with -align_by_maf_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%. |
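To make the distinction between unambiguous and ambiguous strand alignment concrete, here is a minimal sketch of alignment "by label alone". It is an illustration under our own assumptions, not the program's internal procedure: the function names are hypothetical, alleles are single nucleotides, and allele pairs are compared as unordered sets.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(alleles):
    """Return the alleles as read from the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(alleles):
    """A/T and C/G SNPs carry the same labels on both strands."""
    return set(flip(alleles)) == set(alleles)


def align_to_reference(study, ref):
    """Express the study alleles on the reference strand.

    Returns None when the labels alone cannot settle the strand
    (A/T or C/G SNPs) or when the two allele pairs are incompatible.
    """
    if is_strand_ambiguous(study):
        return None
    if set(study) == set(ref):
        return tuple(study)          # already on the reference strand
    flipped = flip(study)
    if set(flipped) == set(ref):
        return flipped               # the opposite strand matches
    return None


if __name__ == "__main__":
    print(align_to_reference(("A", "C"), ("G", "T")))  # ('T', 'G')
    print(align_to_reference(("A", "T"), ("A", "T")))  # None
```

The second call returns None because an A/T SNP looks the same on either strand; this is the case where a strand file or a frequency-based guess is needed.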
The options in this table affect the way that the program filters the input data. Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.
Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref.
Flag | Default | Description |
---|---|---|
none | This option provides flexible variant filtering in the reference panel via "filter rules", which are based on annotation columns in a To filter variants based on the numeric annotation values in the Otherwise, the command-line environment may interpret symbols like and as linux redirection operators. There should be no white space within the single quotes. You can develop annotations yourself and add them to the For an illustration of using | |
none | List of SNPs to exclude from the | |
none | Same as | |
Specifies that SNPs excluded from the study dataset via the | ||
none | List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the reference SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the This option does not have any effect on SNPs in the | |
none | File of sample IDs for the individuals in the NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the | |
none | Same as | |
none | List of samples to exclude from the NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option. | |
none | Same as |
IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.
Flag | Default | Description |
---|---|---|
30 | Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly, although increasing | |
10 | Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for unphased individuals during each of the first | |
80 | Number of haplotypes (in the reference or study data) to use as templates when phasing observed genotypes. Increasing this value will lead to higher accuracy at the cost of longer running times, which scale quadratically with | |
500 | Number of reference haplotypes to use as templates when imputing missing genotypes. If this value is less than the total number of haplotypes in your reference panel, IMPUTE2 will choose a "custom" set of If all of your reference haplotypes have similar ancestry to the subjects in your study, each haplotype is potentially useful for imputation, so the best accuracy can be achieved by setting Conversely, we now recommend running IMPUTE2 with large reference panels containing haplotypes of diverse ancestry. (For more details, see here.) In this context, our rule of thumb suggests setting As of software version 2.3.0, | |
You can greatly speed up your imputation through a process called
Flag | Default | Description |
---|---|---|
Tells IMPUTE2 to phase the genotypes in the | ||
Tells IMPUTE2 to perform imputation with pre-phased GWAS haplotypes, which must be supplied via a |
These options allow IMPUTE2 to efficiently combine two reference panels typed on partially overlapping sets of variants.
Flag | Default | Description |
---|---|---|
Tells the program to combine information across two reference panels using the approach described here. | ||
none | Activates | |
none | Activates |
NOTE: If you want IMPUTE2 to print a merged reference panel with buffer regions included, you should use one of the last two options together with the
NOTE: You can see an example run that uses
These options facilitate the analysis of genotype data from human chromosome X.
Flag | Default | Description |
---|---|---|
Specifies that this is an analysis of chromosome X data. This flag changes the model parameters by automatically reducing the value by 25%, and it allows the file to include a mixture of diploid females and hemizygous males. When using the option, it is essential to provide a file with a column named , since this tells the program which individuals are males and which are females. More details on the file formats for chromosome X analysis are available here, and you can see an example run here. | ||
Specifies that the current dataset comes from a pseudoautosomal region (PAR) of chromosome X, where both males and females are diploid. When used together with , this flag will reduce by 25% but otherwise run the analysis in the same way as on the autosomes. |
The options in this table are meant for experts only. Don't use them unless you know what you are doing!
Flag | Default | Description |
---|---|---|
random | Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option. | |
Turns warnings off, so that the | ||
Turns on the "hole-filling" function, which allows SNPs that are typed in the | ||
Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. Such SNPs will be retained in the output, but they will not be used for inference. |
IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.
Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.
When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). If two or more variants have the same position (perhaps because one is a SNP and one is an overlapping INDEL), then these variants are matched across panels based on their allele labels.
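The matching rule just described can be sketched as follows. This is a simplified illustration rather than the program's actual implementation: the function name and the `(position, allele0, allele1)` tuple layout are assumptions, and the allele comparison here is an exact label match with no strand flipping.

```python
def match_variants(reference, study):
    """Pair variants across two panels, returning (ref_index, study_index) tuples.

    Variants are matched by position; where a position holds more than one
    variant in either panel, the allele labels decide the match.
    """
    ref_by_pos, study_by_pos = {}, {}
    for i, variant in enumerate(reference):
        ref_by_pos.setdefault(variant[0], []).append(i)
    for j, variant in enumerate(study):
        study_by_pos.setdefault(variant[0], []).append(j)

    pairs = []
    for pos in sorted(ref_by_pos.keys() & study_by_pos.keys()):
        ref_idx, study_idx = ref_by_pos[pos], study_by_pos[pos]
        if len(ref_idx) == 1 and len(study_idx) == 1:
            pairs.append((ref_idx[0], study_idx[0]))   # position alone suffices
            continue
        for j in study_idx:                            # shared position: use alleles
            for i in ref_idx:
                if tuple(reference[i][1:]) == tuple(study[j][1:]):
                    pairs.append((i, j))
                    break
    return pairs


if __name__ == "__main__":
    reference = [(100, "A", "G"), (200, "C", "T"), (200, "CT", "C"), (300, "G", "A")]
    study = [(100, "A", "G"), (200, "CT", "C"), (400, "T", "C")]
    print(match_variants(reference, study))  # [(0, 0), (2, 1)]
```

Because rsIDs play no part in the match, two datasets mapped to different genome assemblies would pair the wrong variants or none at all.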
It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".
Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please send a message to our mail list.
It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.
Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.
Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy can increase accuracy at low-frequency variants, and it avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population. You can read our paper on this strategy here, learn about practical ways of applying it here, and download state-of-the-art reference haplotypes here.
If you have collected a custom reference panel for your study population (say, exome-wide or genome-wide sequencing data), you can combine it with the 1,000 Genomes data to maximize accuracy and genomic coverage at the same time. To learn how IMPUTE2 does this, see here.
It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a few mechanisms to help with this process:
It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.
We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.
Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:
Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.
Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.
The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy since the estimation uncertainty in the study haplotypes is ignored, but this allows for very fast imputation. This speedup is especially important because modern reference collections (such as those from the 1,000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.
For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. Pre-phasing is implemented through three program options:
If you use this functionality in your study, please remember to cite
In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.
We therefore recommend using the program on regions of
The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into
Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:
cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2
Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
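If you prefer to script the chunking and the final concatenation, the following sketch shows one way to do both. It makes several assumptions that are ours rather than the program's: the 5 Mb chunk size is only a placeholder, the intervals are treated as inclusive base-pair ranges, and the helper names and the `chr16_chunkN.impute2` naming scheme simply mirror the example above.

```python
import shutil


def chunk_intervals(start, end, chunk_size=5_000_000):
    """Split the inclusive range [start, end] into consecutive, non-overlapping intervals."""
    if chunk_size <= 0 or end < start:
        raise ValueError("need chunk_size > 0 and start <= end")
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))  # one pair of interval bounds per run
        lower = upper + 1
    return intervals


def chunk_file_names(prefix, n_chunks):
    """Build output names in numeric chunk order (chunk2 before chunk10)."""
    return [f"{prefix}_chunk{i}.impute2" for i in range(1, n_chunks + 1)]


def concatenate_chunks(paths, out_path):
    """Append the chunk files to out_path in the order given, like `cat`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    for lower, upper in chunk_intervals(1, 12_000_000):
        print(lower, upper)
```

Generating the file names from chunk numbers, instead of sorting a directory listing, keeps the combined file in positional order even when there are ten or more chunks.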
Modern genotyping and sequencing technologies are generating a variety of reference datasets that can be used for genotype imputation in association studies. Combining reference panels from different populations can often improve imputation accuracy (e.g., see Howie et al. 2011), but it is not clear how best to merge panels that are genotyped at different sets of variants.
Howie et al. 2009 proposed a solution for the special case where one reference panel contains a subset of the variants in another reference panel. We previously released a combined 1,000 Genomes + HapMap 3 panel that takes advantage of this framework, and it was also used in the WTCCC2 studies.
Many association studies are now using the latest 1,000 Genomes data to drive their genotype imputation, but they may also have sequenced additional individuals from the population being studied. It makes sense to combine these resources in order to use all available reference information, but in this case each reference panel will contain many variants that are not found in the other; that is, the "hierarchical" variant framework of Howie et al. 2009 no longer applies.
With this in mind, we have devised a new strategy for combining reference panels created by different sequencing or genotyping studies.
There are many possible ways to merge two reference panels. We are exploring several of these options, but we decided to start with the simple approach depicted in the figure below. The top panel of this figure shows two reference panels and a GWAS cohort; you can think of the rows as individuals and the columns as positions along the genome. Each vertical line represents a genotyped variant in a given panel, and each reference panel includes variants that are not found in the other.
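To make the layout of the two reference panels concrete, the short sketch below sorts variant positions into those shared by both panels and those found in only one. It illustrates the bookkeeping only, not the merging method itself; the function name is hypothetical, and variants are identified by position alone for simplicity.

```python
def partition_variants(panel0_positions, panel1_positions):
    """Classify variant positions as shared or specific to one reference panel."""
    p0, p1 = set(panel0_positions), set(panel1_positions)
    return {
        "shared": sorted(p0 & p1),        # typed in both reference panels
        "panel0_only": sorted(p0 - p1),   # untyped in panel 1
        "panel1_only": sorted(p1 - p0),   # untyped in panel 0
    }


if __name__ == "__main__":
    groups = partition_variants([100, 200, 300, 500], [200, 400, 500, 600])
    print(groups["shared"])       # [200, 500]
    print(groups["panel0_only"])  # [100, 300]
    print(groups["panel1_only"])  # [400, 600]
```

The panel-specific groups are the variants that are missing from the other panel, which is why the hierarchical framework mentioned earlier does not apply here.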
We impute the untyped variants in this figure in three steps:
This process can be performed with IMPUTE2 (version 2.3 and later) in a streamlined way: all you have to do is add the
The
For finer control of the merging step, you can supply two values to
The order in which you supply the reference panels on the command line should not affect the accuracy of imputation from the merged panel: inside the program, the calculations are completely symmetric. One practical limitation is that only the first legend file in an IMPUTE2 command is allowed to have more than four columns. The 1,000 Genomes legend files we distribute typically have more than four columns, so if you are using these files it makes sense to provide the 1,000 Genomes panel before your other panel on the command line.
By default, IMPUTE2 does not print the merged reference panel (the outcome of Steps 1 and 2 above); the merging is done internally, and the output shows only the imputed genotypes for the study cohort. If you want the program to output the merged panel, you can replace
If you want to merge two reference panels without imputing into a study dataset (i.e., to skip Step 3 above), you should use one of these two options and omit the study data (
Normally, these options print the merged reference panel within the region specified by the
Our approach for merging reference panels has not yet been published outside this website. We have tested the method on realistic datasets, and it has performed well in all of our analyses. We are actively working to document our work on this approach and to compare it with other strategies; we aim to report the results of these experiments and the details of our methodology as soon as possible.
In the meantime, we are happy to answer thoughtful questions and to hear about your experiences with this new functionality. If you would like to send comments, please do so through our
Every run of IMPUTE2 produces a concordance table, except under certain settings that are not commonly used. A concordance table shows the results of an internal cross-validation that the program performs automatically. For this analysis, IMPUTE2 masks the genotypes of one variant at a time in the study data (Panel 2), then imputes the masked genotypes with information from the reference data and nearby study variants. The imputed genotypes are then compared with the original genotypes to evaluate the quality of the imputation. The results are summarized in a table like the one below:
If you are interested in the results of this experiment at a given variant, you can find this information in the
Only variants with input data from a
The genotype probabilities from imputation are used somewhat differently. In the first three columns of the table, we assign each imputed genotype to a bin (
In the last three columns of the table, we again bin the imputed genotypes based on their maximum posterior probabilities, but this time the binning is cumulative: the bin at the bottom of the table includes only genotypes that were confidently imputed (max prob >= 0.9), while each bin above includes all genotypes that pass a more lenient certainty threshold. These thresholds are shown in the fourth column (
We can learn a couple of things from this kind of analysis. First, the results can alert us to problems in the imputation: if the concordance between imputed and input genotypes is abnormally low, it may indicate that something went wrong in the analysis or input files. A useful summary statistic is the number in the upper righthand corner of the table, which gives the overall concordance from the cross-validation. This number should typically be around 95%; it may be lower in certain populations or regions of the genome, but if it is much lower then you may need to double-check the analysis. If you are worried about your results, please send a message with details of your analysis (including a
Concordance tables can also be used to predict the general quality of imputed genotypes at SNPs where we do not know the true genotypes. SNPs on GWAS microarrays tend to be easier to impute than untyped SNPs of the same frequency, so the cross-validation results may be somewhat optimistic, but they are often useful for relative comparisons (say, between different parameter settings of IMPUTE2).
Finally, the per-variant results of the cross-validation in the output
If you provide two reference panels to IMPUTE2, the program will perform the cross-validation in two different ways. First it will use only a single reference panel (Panel 0) to mimic Type 0 SNPs, and then it will use both reference panels together (Panels 0 and 1) to mimic Type 1 SNPs. In this case, IMPUTE2 will print two concordance tables, one for each type of reference SNP. Note that the same masked study genotypes are used to evaluate accuracy in both cases; the only difference is how much reference data we allow the program to see when imputing the masked genotypes.
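The binning behind a single concordance table, as described earlier in this section, can be sketched as follows. This is a hedged illustration and not the program's code: we assume ten bins of width 0.1, genotypes coded as 0, 1, or 2, and a best-guess call equal to the genotype with the maximum posterior probability. The function name and the row layout are hypothetical, and the masking and re-imputation steps are not shown.

```python
def concordance_table(records, n_bins=10):
    """Summarize masked-genotype results into interval and cumulative bins.

    records: iterable of (true_genotype, probs), where probs are the three
    imputed genotype probabilities for one masked genotype.
    Returns (interval_rows, cumulative_rows).
    """
    counts = [0] * n_bins
    correct = [0] * n_bins
    for truth, probs in records:
        best = max(range(len(probs)), key=lambda i: probs[i])
        b = min(int(probs[best] * n_bins), n_bins - 1)  # bin by max probability
        counts[b] += 1
        correct[b] += int(best == truth)

    def pct(num, den):
        return 100.0 * num / den if den else None

    # Interval rows: (lower, upper, n_genotypes, pct_concordance)
    interval_rows = [
        (b / n_bins, (b + 1) / n_bins, counts[b], pct(correct[b], counts[b]))
        for b in range(n_bins)
    ]
    # Cumulative rows: (threshold, n_called, pct_called, pct_concordance)
    total = sum(counts)
    cumulative_rows = []
    for b in range(n_bins):
        called, right = sum(counts[b:]), sum(correct[b:])
        cumulative_rows.append((b / n_bins, called, pct(called, total), pct(right, called)))
    return interval_rows, cumulative_rows


if __name__ == "__main__":
    records = [
        (0, [0.95, 0.04, 0.01]),
        (1, [0.05, 0.92, 0.03]),
        (2, [0.10, 0.15, 0.75]),
        (1, [0.55, 0.40, 0.05]),
    ]
    intervals, cumulative = concordance_table(records)
    print(cumulative[0])  # (0.0, 4, 100.0, 75.0)
    print(cumulative[9])  # (0.9, 2, 50.0, 100.0)
```

In this layout the first cumulative row (threshold 0.0) includes every masked genotype, so its concordance corresponds to the overall figure discussed above.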
The concordance table is printed at the end of an IMPUTE2 run. One copy is printed to STDOUT, and another copy is printed in the
The following scripts are designed to help with various parts of an IMPUTE2 analysis. We provide them in the hope that they will be useful, but they may not work for every dataset due to inconsistencies in file formats, assumptions, etc. If you want to use one of these scripts, we suggest that you first read through the code to understand how it works.
All of these scripts are released under the GNU General Public License. Each script will print a list of command line options if you run it with no arguments.
vcf2impute_legend_haps.pl | Convert a phased VCF file into reference panel format: one legend file and one haplotypes file. |
vcf2impute_gen.pl | Convert a phased or unphased VCF file into genotype file format (.gen). |
Our FAQ has moved to this Google document.
[1] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]
[2] B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]
[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511 [Restricted Access PDF] [Supplementary Material]
[4] B. Howie, J. Marchini, and M. Stephens (2011) Genotype imputation with thousands of genomes. G3: Genes, Genomics, Genetics 1(6): 457-470 [Open Access Article] [Supplementary Material]
[5] B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8): 955-959 [Restricted Access PDF]
The following people developed the methodology and software for IMPUTE2:
If you have a question about IMPUTE2, please send a message to our mailing list:
http://www.jiscmail.ac.uk/OXSTATGEN
You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.
If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.