
IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:

B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]

IMPUTE2 also includes features that were introduced in other publications, which you can find here.

The figure below shows the most common scenario in which imputation is used: unobserved genotypes (red question marks) in a set of study individuals are imputed (or predicted) using a set of reference haplotypes and genotypes from a SNP chip.

Getting Started

IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.

You should begin by downloading the program from here. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.

Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Examples shows how to use the most common IMPUTE2 functions. We suggest that you work through these examples and try to understand what the elements of each command are doing. If you don't understand something or would like to know if the program can perform a function that isn't listed, you can read our FAQ or submit a question to our mail list.

When you have learned the basic functionality of the program, you can use several features of this website to prepare your own analysis:

Learn about best practices for imputation.
Download reference data that you can use to impute genotypes in your study.
Look through a complete list of program options.

What's New?

New release (23 December 2014)

We have just released IMPUTE v2.3.2. This version is a minor update that adds columns to the info files reporting the two alleles at each imputed variant.

New release (16 June 2014)

We have just released IMPUTE v2.3.1. This version fixes a bug in panel-merging functionality that caused variants seen in one of two reference panels to be imputed with a fixed allele (non-ref allele in Panel 0, ref allele in Panel 1). If you have used these options, then we would recommend re-running your imputation.

In addition, we have released a new version of the 1000 Genomes Phase 1 haplotypes.

New release (9 Dec 2013)

We have released a new version of the 1000 Genomes Phase 1 haplotypes.

New release (16 Sept 2013)

We have released a new version of the 1000 Genomes Phase 1 haplotypes.

New software release (04 Jan 2013)

We have just released IMPUTE v2.3.0, which includes a number of new features and minor bug fixes. One valuable new function is a simple and robust approach for merging reference panels; for example, it is easy to combine 1,000 Genomes haplotypes with population-specific sequence data to capture the strength of both reference sets. We have also written detailed documentation for the concordance tables printed at the end of most IMPUTE2 runs.

Paper on "pre-phasing" study genotypes for faster imputation

We recently published an article called "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing" in Nature Genetics. This paper describes a strategy ("pre-phasing") for efficient genotype imputation with large reference panels. By reducing the computational burden of imputation, pre-phasing makes imputation-based studies feasible for groups with limited computing power, and it also makes it easier to re-impute existing GWAS datasets as more informative reference panels become available. You can learn more about pre-phasing with IMPUTE2 here.

Latest 1,000 Genomes Phase I reference panel

In March 2012, the 1,000 Genomes Project released a powerful reference panel known as "Phase I version 3". In August 2012, we modified this panel by excluding variants with only one copy of the minor allele (singletons) across all 1,092 individuals. Singleton variants are difficult to impute, yet they make up ~20% of all variants in the reference panel; removing them makes imputation faster without hurting the power for association mapping. You can download either the original reference panel or the modified version (which is labeled "macGT1" for "minor allele count greater than one") here.

Paper on imputation strategies for ancestrally diverse reference panels

We published an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. This paper describes our strategy for achieving high accuracy with ancestrally diverse reference panels, especially at low-frequency variants and in admixed study cohorts: we supply a cosmopolitan set of reference haplotypes to IMPUTE2, which can automatically find the most useful ones for each study individual with the help of the tuning parameter -k_hap. You can read more about the results that support this strategy in the article, and we provide practical suggestions for applying it here.

Pre-phasing with SHAPEIT

IMPUTE2's pre-phasing approach now works with phased haplotypes from SHAPEIT, a highly accurate phasing algorithm that can handle mixtures of unrelateds, duos, and trios. Details are available here. We highly recommend using SHAPEIT to infer the haplotypes underlying your study genotypes, then passing these to IMPUTE2 for imputation as shown in the second step of this example.

Download IMPUTE2

IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to our mail list.

The latest software release is v2.3.2. We support only the most recent version.

Platform	File
Linux (x86_64) Static Executable	impute_v2.3.2_x86_64_static.tgz
Linux (x86_64) Dynamic Executable	impute_v2.3.2_x86_64_dynamic.tgz
Mac OSX Intel	impute_v2.3.2_MacOSX_Intel.tgz
Windows MS-DOS (Intel)	impute_v2.3.1_Windows.tgz (coming soon)
Solaris 5.10	impute_v2.3.2_Solaris5.10.tar.gz (coming soon)

To unpack the files on a Linux computer, use a command like this:

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files here.
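
As a small illustration of the naming rule described above, shell parameter expansion can derive the directory name from the archive name. The archive name below is one of the entries in the download table; substitute whichever package you actually downloaded. The commands are only printed here, not executed.

```shell
# Archive name taken from the download table (substitute your own download).
pkg="impute_v2.3.2_x86_64_static.tgz"

# Strip the '.tgz' suffix to get the directory that unpacking creates.
dir="${pkg%.tgz}"

# Print the two commands you would type next.
echo "tar -zxvf $pkg"
echo "cd $dir/"
```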

Download Reference Data

IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset.

Link to download page	NCBI build	Haplotype release date	Release status
1000 Genomes Phase 3	b37	October 2014
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2)	b37	June 2014
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2)	b37	Dec 2013
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2)	b37	Sep 2013
1000 Genomes Phase I integrated variant set	b37	Mar 2012	Includes chrX; updated 24 Aug 2012
1000 Genomes Phase I (interim)	b37	Jun 2011	Includes chrX; updated 19 Apr 2012
1000 Genomes (2010 interim)	b37	Dec 2010
1000 Genomes Pilot + HapMap 3	b36	Jun 2010 / Feb 2009
1000 Genomes Pilot	b36	Jun 2010
HapMap 3 (release #2)	b36	Feb 2009	Includes chrX
HapMap 2 (release #24)	b36	Oct 2008
HapMap 2 (release #22)	b36	Jan 2008
HapMap 2 (release #21)	b35	Jul 2006

Using Multi-Population Reference Panels

Overview

Human genetic variation resources, like those produced by HapMap 3 and the 1,000 Genomes Project, capture a broad cross-section of human genetic diversity: detailed variation data have now been collected from a variety of sampling locations in Africa, Asia, Europe, and the Americas. Large sequencing projects are actively expanding these datasets to include additional populations and deeper sampling within populations. These public databases provide powerful reference panels for genotype imputation studies.

In this context, one important question is how to choose a reference panel that will produce high imputation accuracy in a population of interest. The answer is seldom obvious because human populations have experienced complex demographic histories with many migration and mixture events. Consequently, it can be hard to decide which reference haplotypes should be used in a particular study.

We have proposed a simple and universal solution to this problem: we provide all available reference haplotypes to IMPUTE2, then let the software choose a "custom" reference panel for each individual to be imputed. There are several advantages to this approach:

Good results can be obtained in any study population by tuning a single software parameter (-k_hap) with a simple rule of thumb; see below for more details.
Our group and others have used this approach to successfully impute populations ranging from homogeneous isolates to recent and complex admixtures.
This is because individuals from "diverged" populations may still share genomic segments of recent common ancestry, and IMPUTE2 can use this haplotype sharing to improve accuracy. At the same time, the software can ignore haplotypes that are not helpful.

The benefits of using inclusive reference panels are greatest at low-frequency variants (MAF < 5%), since these variants may be poorly represented in a reference panel from the population of interest (due to sampling effects) but well-represented in a panel from a different population (e.g., due to genetic drift).
You might worry that using all available reference haplotypes would greatly increase the computational burden of imputation, but IMPUTE2 uses an approximation that limits the cost of adding reference haplotypes while maintaining (or improving) accuracy.

Practical suggestions

There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:

-k_hap: This parameter determines how many of the reference haplotypes will be used in the "custom" reference panel for each study individual. The default value is 500, which is a good starting point for modern reference datasets.

For example, suppose you were imputing a Spanish dataset from a reference panel containing 400 Western European haplotypes and 400 African American haplotypes. In this case, you could achieve high accuracy by leaving -k_hap at the default value of 500 since, in any part of the genome, the expected number of reference haplotypes with European ancestry is roughly 400 + 0.2 * (400) = 480. (This calculation assumes that, on average, African American haplotypes have 20% European ancestry.)

Imputation accuracy is not very sensitive to -k_hap, which is why this rule of thumb usually provides good results without requiring detailed parameter tuning. If you want advice on the best value for your dataset, please send a message to our mail list.
-Ne: This parameter controls the effective population size in the population-genetic model used by IMPUTE2. Different human populations have different effective sizes (as estimated from genetic diversity levels), so it is not obvious how to choose a single -Ne value when using a multi-population reference panel.

Fortunately, we have found that IMPUTE2 achieves high accuracy across a wide range of -Ne values, with slightly higher accuracy at large values. We suggest setting -Ne to 20000, regardless of the study population being imputed or the composition of the reference panel. This will become the default value in our next software release (v2.1.3), but for now you should set it manually.
-int: This command-line option specifies the boundaries of the region to be imputed on the current chromosome, using two numbers. For example, "-int 1 5e6" tells IMPUTE2 to analyze physical positions 1-5,000,000.

The imputation interval should not be too large because this weakens IMPUTE2's approximation for choosing custom reference panels, which is based on an assumption of limited recombination in the region being analyzed. In theory, it might be desirable to tailor the interval size to the population being imputed (e.g., to use shorter intervals in African populations), but in practice, we have found that the exact size of the interval has little effect on imputation accuracy as long as the interval is relatively small (say, < 10 Mb).
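
The rule of thumb for -k_hap described above amounts to summing, over groups of reference haplotypes, the haplotype count weighted by the expected fraction of ancestry shared with the study population. The following is a minimal sketch of that arithmetic; the helper name expected_haps is ours (it is not part of IMPUTE2), and the ancestry fractions are the assumptions stated in the Spanish example.

```shell
# expected_haps N1 F1 [N2 F2 ...]   (hypothetical helper, not part of IMPUTE2)
# Sums N_i * F_i, where N_i is a haplotype count and F_i is the assumed
# fraction of ancestry shared with the study population.
expected_haps() {
  awk 'BEGIN {
    total = 0
    for (i = 1; i < ARGC; i += 2) total += ARGV[i] * ARGV[i + 1]
    printf "%.0f\n", total
  }' "$@"
}

# 400 Western European haplotypes (fraction 1.0) plus 400 African American
# haplotypes (assumed 20% European ancestry on average).
expected_haps 400 1.0 400 0.2
```

The result (480) is below the default -k_hap value of 500, which is why the default is adequate in that example.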

How does it work?

As explained above, we believe that the best way to use IMPUTE2 with modern reference panels is to provide all available haplotypes to the program and let it choose which ones to use. Here, we explain how this approach works.

IMPUTE2 does not use population labels or other genome-wide measures of relatedness between individuals, either for the reference haplotypes or the individuals being imputed. Instead, it looks for reference haplotypes that share high sequence identity with the haplotypes of a particular study individual. These haplotypes constitute a "custom" reference panel that can be used to impute missing genotypes in the individual of interest.

This process is largely insensitive to the ancestral composition of the reference panel: as long as the panel contains haplotypes that share segments of recent common ancestry with individuals in a study, IMPUTE2 can find the shared segments and use them to impute missing alleles. Consequently, the panel can also include other kinds of haplotypes:

If two or more distinct populations have mixed within the past few hundred years, the resulting admixed population may contain some haplotype segments that are closely related to a population of interest and other segments that are highly diverged. IMPUTE2 can identify the useful segments while ignoring the diverged segments, thereby achieving accurate imputation.
Even if a set of reference haplotypes comes from a different population than the one you want to impute, it may still provide segments of recent ancestry that can help the imputation. The prevalence of such segments is a complicated function of reference panel size and population history, but in our experience there is often a surprising amount of ancestry sharing between genetically distinct populations.
Reference haplotypes that are highly diverged from your study population are unlikely to be useful for imputation, but such haplotypes are easily identified and ignored by IMPUTE2. In other words, highly diverged reference haplotypes neither help nor hurt imputation accuracy. This is important because the distinction between "moderately" and "highly" diverged populations is not always clear; since it does not hurt to include unhelpful reference haplotypes, we can err on the side of including too many in order to capture more of the moderately diverged ones that improve imputation accuracy.

Expert users will note that the model underlying IMPUTE2 is formally designed to represent genetic variation in a single population. This might imply that the method would have trouble using reference panels that include populations with different linkage disequilibrium patterns, nucleotide diversity levels, and allele frequency spectra. However, we have found that IMPUTE2 is extremely adaptable: it can find segments of shared ancestry in multi-population reference panels despite its simple model of human populations, and it is largely robust to changes in its model parameters. Imputation accuracy might theoretically be improved by more detailed modeling of population relationships (for example, the population labels that IMPUTE2 ignores might sometimes be informative), but we believe that our approach captures most of the potential accuracy in an efficient way.

Published results

We published our work supporting these ideas in an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. Please cite this paper and the original IMPUTE2 paper when using IMPUTE2 with multi-population reference panels like those from the 1,000 Genomes Project.

Examples

This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.

Run type	Description
Imputation with one phased reference panel	Basic scenario in which most people will use IMPUTE2.
Imputation with one phased reference panel (pre-phasing)	As above, but with pre-phasing functionality to speed up the analysis.
Imputation with one phased reference panel (chromosome X)	Basic imputation scenario applied to human chromosome X, which requires special program options.
Imputation with one phased reference panel (plus variant filtering)	Basic imputation scenario with flexible filtering of reference panel variants.
Imputation with one unphased reference panel	Basic imputation scenario adapted to unphased reference genotypes.
Imputation with two phased reference panels	Extended functionality for imputing from multiple reference panels defined on different sets of variants.
Imputation with two phased reference panels (merge reference panels)	Merge reference panels defined on different sets of variants and use combined panel for imputation.
Imputation with one phased and one unphased reference panel	Specialized method for combining reference panels of different types.
Imputation with one phased and one unphased reference panel, with additional options	As above, but illustrating a variety of options that can be used to customize the behavior of IMPUTE2.
Phasing	Methodology for inferring haplotypes from unphased genotypes.
Phasing with a reference panel	Phasing analysis aided by reference haplotypes.

How to use example commands

All of the data files in the example commands below are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.

Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit Enter to start the run.)

Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
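
To make the equivalence concrete, here is a minimal sketch that passes the same arguments to a shell function twice, once split across lines with '\' and once on a single line. The helper count_args is hypothetical and simply reports how many arguments it received; the argument values are borrowed from the examples on this page.

```shell
# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Arguments split over several lines with '\' ...
multi=$(count_args \
  -int 20.4e6 20.5e6 \
  -Ne 20000)

# ... and the same arguments on a single line.
single=$(count_args -int 20.4e6 20.5e6 -Ne 20000)

# Both forms deliver the same five arguments.
echo "$multi $single"
```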

You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.

Most of the examples below include the string "-int 20.4e6 20.5e6", which tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome. IMPUTE2 assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome. Applying the program to a much larger region (say, a whole chromosome or the whole genome) requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see here.
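
One way to generate the -int values for such a set of jobs is sketched below. This is only an illustration under our own assumptions: the helper name chunk_intervals, the 5 Mb window size, and the region length are all made up for the example, and the sketch prints the interval arguments rather than launching any jobs.

```shell
# chunk_intervals START END SIZE   (hypothetical helper, not part of IMPUTE2)
# Prints one "-int lower upper" pair per line, covering START..END in
# consecutive windows of at most SIZE bases.
chunk_intervals() {
  lower=$1
  end=$2
  size=$3
  while [ "$lower" -le "$end" ]; do
    upper=$((lower + size - 1))
    if [ "$upper" -gt "$end" ]; then
      upper=$end
    fi
    printf '%s %s %s\n' -int "$lower" "$upper"
    lower=$((upper + 1))
  done
}

# Assumed region of positions 1..12,000,000, split into 5 Mb windows.
chunk_intervals 1 12000000 5000000
```

Each printed line can then be appended to an otherwise identical IMPUTE2 command, with a distinct output file name per window.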

Imputation with one phased reference panel

This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

Comments

Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files). Strand alignment can be handled either before running IMPUTE2 or during a run with the options described here.
This command invokes the standard MCMC algorithm used by IMPUTE2, which usually provides accurate results in a reasonable amount of time. Another way to run this kind of analysis is to use our pre-phasing approach, which decreases the running time by orders of magnitude at the cost of a small drop in imputation accuracy. To see how to run this example with pre-phasing, click here.

Imputation with one phased reference panel (pre-phasing)

This is the most common genotype imputation scenario: we want to use a panel of reference haplotypes to impute SNPs that were not typed in a study. Here, we show how to perform this task via pre-phasing, which is an approach that speeds up the imputation process by splitting it into two steps: (i) statistically phase the study genotypes; (ii) impute from the reference panel into the estimated study haplotypes.

The following commands show how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

# Step 1: pre-phasing
./impute2 \
 -prephase_g \
 -m ./Example/example.chr22.map \
 -g ./Example/example.chr22.study.gens \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.prephasing.impute2

# Step 2: imputation into the pre-phased haplotypes
./impute2 \
 -use_prephased_g \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -known_haps_g ./Example/example.chr22.prephasing.impute2_haps \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

Comments

Pre-phasing is a useful technique for speeding up an imputation run, but it is even more useful if you want to impute a single study dataset from different reference panels (e.g., successive updates to the reference haplotypes released by the 1,000 Genomes Project). In that situation, you can perform the pre-phasing step just once and save the estimated haplotypes; you can then use the same study haplotypes to perform the imputation step with each new reference panel.
If you are using IMPUTE2 for both the pre-phasing and subsequent imputation, it is important to use the same values of the -int parameter in both steps.
The -prephase_g flag activates a couple of features that are necessary for pre-phasing. First, it tells the program to estimate and print phased haplotypes at SNPs included in the -g file; the haplotypes will be written to a file named "[-o]_haps", where [-o] is the name supplied for the main output file. These haplotypes will include SNPs in the buffer regions that flank the main region specified via -int. Extending the haplotypes into the buffer regions helps prevent edge effects in downstream imputation runs.
It is possible to include a reference panel in the pre-phasing step, and this may improve the phasing quality. See here for an example of this kind of analysis (note that the linked example is missing the -prephase_g flag). To expedite the pre-phasing in this scenario, the program will not impute reference-only variants when -prephase_g is active, although you can override this behavior with the -os option.
You can use the -strand_g option in either the pre-phasing or downstream imputation step. Strand alignment is not usually necessary when you just want to phase a dataset, but it is important when that dataset will be combined with a reference panel in a downstream analysis, as in this case.
Note that the file supplied to the -known_haps_g argument in the imputation step is the estimated haplotypes file from the pre-phasing step ("[-o]_haps"). Also note that the -use_prephased_g flag must be provided when imputing into pre-phased haplotypes.
The option in Step 2 above produces a file containing the haplotypes at the imputed and genotyped sites. In this example, the file would be called
Pre-phasing based imputation on chromosome X is also possible. The only things you need to do differently are to supply a sample file for your data using the -sample_g option (so IMPUTE2 knows which individuals are male and which are female), use the -chrX flag, and ensure that your male haploid genotypes are encoded according to the described file format. If you use SHAPEIT2 for the phasing step, that software also has options to phase chromosome X. Below is an example of how to carry out chromosome X pre-phasing based imputation:

# Step 1: pre-phasing on chromosome X
./impute2 \
 -chrX \
 -prephase_g \
 -m ./Example/chrX/example.chrX.map \
 -g ./Example/chrX/example.chrX.study.gen \
 -sample_g ./Example/chrX/example.chrX.study.sample \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.prephasing.impute2

# Step 2: imputation into the pre-phased haplotypes
./impute2 \
 -chrX \
 -use_prephased_g \
 -m ./Example/chrX/example.chrX.map \
 -h ./Example/chrX/example.chrX.reference.hap \
 -l ./Example/chrX/example.chrX.reference.legend \
 -known_haps_g ./Example/chrX/example.chrX.prephasing.impute2_haps \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.one.phased.impute2
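
As noted in the comments above, the pre-phasing step only needs to be run once when the same study dataset is imputed from several reference panels. The sketch below prints (but does not run) one imputation command per panel, reusing the same pre-phased haplotypes file. The helper name print_imputation_cmd and the panel paths are hypothetical; strand options are omitted for brevity, and each panel is assumed to provide matching .haps and .legend files.

```shell
# Haplotypes written by the pre-phasing step (the "[-o]_haps" file).
study_haps="./Example/example.chr22.prephasing.impute2_haps"

# print_imputation_cmd PREFIX   (hypothetical helper)
# Prints an imputation command for one reference panel, where PREFIX is an
# assumed path stem with PREFIX.haps and PREFIX.legend files.
print_imputation_cmd() {
  printf '%s\n' "./impute2 -use_prephased_g -m ./Example/example.chr22.map -h $1.haps -l $1.legend -known_haps_g $study_haps -int 20.4e6 20.5e6 -Ne 20000 -o $1.imputed.impute2"
}

# Two made-up panel releases; the pre-phased haplotypes are reused for both.
for panel in ./panels/release_A ./panels/release_B; do
  print_imputation_cmd "$panel"
done
```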

Imputation with one phased reference panel (chromosome X)

This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis on chromosome X, which requires special treatment due to the hemizygosity of males. (This example and the files in our download packages focus on the non-pseudoautosomal part of chromosome X.)

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -chrX \
 -m ./Example/chrX/example.chrX.map \
 -h ./Example/chrX/example.chrX.reference.hap \
 -l ./Example/chrX/example.chrX.reference.legend \
 -g ./Example/chrX/example.chrX.study.gen \
 -sample_g ./Example/chrX/example.chrX.study.sample \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.one.phased.impute2

Comments

The -chrX flag is essential because it tells IMPUTE2 to expect the special file formatting conventions used for chromosome X data.
Whenever you analyze data on chromosome X, you must also provide a -sample_g file so that the program knows which individuals are males and which are females. You can learn about the specific requirements of this file here.
There is no need to use a different -Ne value on chromosome X than you would on the autosomes; the -chrX flag tells IMPUTE2 to automatically reduce the value by 25%, which changes the parameters of the haplotype copying model.
Like the input files, the IMPUTE2 output files from chromosome X analyses should be interpreted according to these conventions.
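
To illustrate the 25% reduction mentioned above, the following snippet works through the arithmetic for the -Ne value used in the example command. It only shows the calculation; the adjustment is applied by the program itself when -chrX is given, so you should not pre-scale -Ne on the command line.

```shell
# -Ne value supplied on the command line in the example.
ne=20000

# A 25% reduction leaves 75% of the supplied value.
ne_chrx=$((ne * 75 / 100))

echo "$ne_chrx"
```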

File formats for chromosome X

Among human chromosomes, chromosome X is unique in that it is diploid (two copies) in females but hemizygous (one copy) in males. To deal with chromosome X data, IMPUTE2 requires that you use the -chrX flag and make some small changes to the input file formats.

Genotype file (-g): As in a standard genotype file, each study individual should have three columns (genotype probabilities) per SNP. For females, these have the standard interpretation that columns 1, 2, and 3 represent P(G=0), P(G=1), and P(G=2), respectively, where G=1 is the heterozygous state. Males have only two possible genotypes on chromosome X, and we encode these in columns 1 and 3; column 2, which corresponds to P(G=1), should always be zero in this setting, and non-zero values in this column will automatically be truncated to zero for males when the -chrX flag is active.
Sample file (-sample_g): In order for the input genotype convention explained above to work, IMPUTE2 needs to know which study individuals are males and which are females. This is accomplished by adding an extra column named sex to the -sample_g file, which is required when using the -chrX flag. This column should be coded as type D (discrete covariate), where males are indicated by 1s and females are indicated by 2s. Here is an example snippet where the first individual is female and the second and third individuals are male:
ID_1 ID_2 missing sex
0 0 0 D
INDIV1 INDIV1 0.0 2
INDIV2 INDIV2 0.0 1
INDIV3 INDIV3 0.0 1
Haplotype file (-h): It does not usually matter which reference individuals are male or female when their genotypes have already been phased. However, it may sometimes be convenient to create a haplotype file with two columns per individual, so IMPUTE2 allows the presence of dummy columns made of '-' characters to represent the non-existent second haplotypes of males on chromosome X. For example, here is a small haplotypes file with 5 SNPs (one per row) typed in a female (columns 1-2) and two males (columns 3-4 and 5-6):
0 1 1 - 0 -
0 0 1 - 1 -
1 0 0 - 1 -
1 1 0 - 1 -
0 0 1 - 0 -
The dummy columns are optional; the following would be an equally valid format for the same file:
0 1 1 0
0 0 1 1
1 0 0 1
1 1 0 1
0 0 1 0
Output files (-o): The main output file will follow the same convention as the genotype file described above: each individual has three entries per SNP, but the middle entry is set to zero for males. When IMPUTE2 produces haplotype output files for chromosome X, both males and females will have two columns per individual, although the second column for each male will be filled with dummy values of '-'.
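
The two equivalent haplotype layouts shown above differ only in the dummy entries for males. As a small illustration, the following sketch converts the first layout into the second by deleting every dummy entry with sed; it assumes the dummy character is '-' and that columns are separated by single spaces, as in the example file.

```shell
# Haplotypes file with dummy '-' columns for the two males (from the example).
with_dummies='0 1 1 - 0 -
0 0 1 - 1 -
1 0 0 - 1 -
1 1 0 - 1 -
0 0 1 - 0 -'

# Remove every " -" dummy entry to obtain the compact layout.
compact=$(printf '%s\n' "$with_dummies" | sed 's/ -//g')

printf '%s\n' "$compact"
```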

Imputation with one phased reference panel (plus variant filtering)

This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis after flexibly removing a subset of sites from the reference panel.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
\
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.annot.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2

Comments

The main novelty here is the use of the -filt_rules_l option. This option works by defining "filtering rules" that combine annotation categories with comparison operators (such as < and ==) and values. Each annotation string is present on the first line of the -l file and is followed by a column of numeric or character values (one for each site in the reference panel) that determine whether a given site should be filtered from the reference set. In this example, the filtering rules tell IMPUTE2 to ignore reference variants with minor allele frequency less than 1% in a European panel, variants with minor allele frequency less than 5% in an African panel, and sites that carry a specified annotation value. (Filtering rules are always applied in 'OR' fashion.)
You can make your own filtering rules by adding numeric or character annotation columns to a reference legend (-l) file, or you can use the annotations that we provide in some of our reference panel download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1,000 Genomes Phase 1 integrated variant reference panel.
Our main motivation in creating the -filt_rules_l option was to provide a fast and easy way of reducing the computational burden of large, sequence-based reference panels. A principled way to do this is to remove the reference SNPs that are expected to provide the least power in an imputation-based association analysis. We suggest that the rarest SNPs in a dataset fall into this category, both because there is generally less power to detect these under many study designs and because such SNPs are often harder to impute, which further diminishes the real power for detection. So, one reasonable approach is to use a minor allele frequency filtering rule for MAF annotations from a population like the one being studied.
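
Before committing to a set of rules, it can help to preview which sites they would remove. The following is a minimal sketch that mimics the 'OR' logic described above with awk; it is not how IMPUTE2 applies the rules internally. The column names eur.maf and afr.maf, the thresholds, and the toy legend rows are assumptions made for illustration, so substitute the header strings that actually appear in your -l file.

```shell
#!/bin/sh
# Toy legend file; the annotation column names (eur.maf, afr.maf) are
# hypothetical and stand in for whatever header strings your -l file uses.
legend=$(mktemp)
cat > "$legend" <<'EOF'
rsID position a0 a1 eur.maf afr.maf
rs1 20400100 A G 0.200 0.300
rs2 20400200 C T 0.005 0.300
rs3 20400300 G A 0.200 0.010
rs4 20400400 T C 0.001 0.002
EOF

# Print the header plus the sites that survive when a site is dropped if
# ANY rule matches (OR logic): eur.maf < 0.01 OR afr.maf < 0.05.
kept_sites() {
    awk -v eur_min=0.01 -v afr_min=0.05 '
        NR == 1 {
            # Locate annotation columns by their header strings.
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        !($(col["eur.maf"]) < eur_min || $(col["afr.maf"]) < afr_min)
    ' "$1"
}

kept_sites "$legend"
```

With the toy data, only rs1 passes both thresholds, so the output is the header line followed by the rs1 row.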

Imputation with one unphased reference panel

It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while accounting for the phase uncertainty. To use an unphased reference panel, simply replace the -h and -l files with a -g_ref file.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.unphased.impute2

Comments

As with any imputation analysis, the allele codings of all panels must be aligned to a common reference. In this example, we assume that both the -g_ref and -g files include SNPs that are not aligned to the '+' strand of the human genome reference sequence, so we use the -strand_g_ref and -strand_g options to bring them into alignment.
This procedure is not recommended for unphased reference panels that have high SNP density, such as those that result from resequencing studies of population samples. In that situation, there may be statistical convergence issues that could decrease the imputation quality. If you need advice on how to use that kind of reference dataset, please send a message to our mail list.

Imputation with two phased reference panels

It is sometimes helpful to use multiple reference panels to impute genotypes in a single study. For example, we previously recommended combining reference haplotypes from the 1,000 Genomes Pilot Project and HapMap 3: the first set provided extensive coverage of polymorphisms in the genome, while the second set provided greater sample size at a subset of SNPs. We no longer recommend that you use this hybrid reference panel because the 1,000 Genomes Project has generated even richer reference sets (which you can download here), but some investigators may have additional reference data that could be used in this way.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.two.phased.impute2

Comments

This is a somewhat complicated scenario, and some restrictions are necessary to make sure the statistical machinery will produce good results. Ideally, one reference panel should contain a subset of the SNPs typed in the other reference panel, and the study dataset should contain a subset of the SNPs typed in both reference panels. If your dataset deviates substantially from these conditions, you may obtain sub-optimal imputation accuracy. Please send a message to our mail list if you want advice on whether this scheme will work with your data.
Assuming the conditions described above are nearly satisfied, the reference panel with a larger number of SNPs should always come first on the command line. In this example, there are more SNPs in the 1,000 Genomes ("1kG") panel than in the HapMap 3 ("hm3") panel, so the 1,000 Genomes files are listed first after the -h and -l arguments.
Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files; assumed to be aligned to the '+' strand of the human genome reference sequence). The allele codings must be aligned across all datasets, either before running IMPUTE2 or during a run with the options described here.

Imputation with two phased reference panels (merge reference panels)

Many investigators have access to multiple reference panels that could inform their imputation analyses. For example, they might want to supplement the 1,000 Genomes haplotypes (which can be downloaded here) with dedicated sequencing data from a study population.

If you have two panels that have been phased and put into IMPUTE2's reference format (legend/haplotype file pairs), you can ask the program to merge them internally and impute your study genotypes by entering the following command, which uses example data that come with the program download:

./impute2 \
 -merge_ref_panels \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.two.phased.impute2

Comments

For details on how the reference panel merging works, please read the documentation.
This approach also works with pre-phased study haplotypes. To use pre-phased study data in this example, you would replace the -g file with a -known_haps_g file and add the -use_prephased_g flag to your IMPUTE2 command.
If you want to print the merged, phased panel in IMPUTE2 reference format (one -l file and one -h file), you should add the -merge_ref_panels_output_ref flag.
If you want to print the merged, unphased panel in IMPUTE2 genotype format (one -g file), you should add the -merge_ref_panels_output_gen flag.
If you simply want to merge two reference panels without imputing missing genotypes in a study dataset, you should add the -merge_ref_panels_output_ref or -merge_ref_panels_output_gen flag and omit the study genotypes (-g or -known_haps_g file) from your IMPUTE2 command.

Imputation with one phased and one unphased reference panel

Sometimes it is useful to combine a phased reference panel with an unphased reference panel when imputing genotypes in a study. For example, Howie et al. (2009) considered a hybrid reference panel that included phased haplotypes from HapMap and unphased genotypes from population controls typed on multiple SNP chips (they referred to this configuration as "Scenario B"). By using the genetic information in both panels simultaneously, IMPUTE2 can achieve a better combination of accuracy and coverage than it would with either panel alone.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.one.unphased.impute2

Comments

This is a somewhat complicated scenario, and some restrictions are necessary to make sure the statistical machinery will produce good results. Ideally, the study data (-g file) should contain a subset of the SNPs in the unphased reference panel (-g_ref file), which should in turn contain a subset of the SNPs in the phased reference panel (-h and -l files). If your dataset deviates substantially from these conditions, you may obtain sub-optimal imputation accuracy. Please send a message to our mail list if you want advice on whether this scheme will work with your dataset.
Here we have used the -strand_g and -strand_g_ref options to provide strand files to the program. These files tell IMPUTE2 how to align the allele coding of the study genotypes (-g file) and the unphased reference genotypes (-g_ref file) with the coding of the phased reference haplotypes (-h and -l files; assumed to be aligned to the '+' strand of the human genome reference sequence). The allele codings must be aligned across all datasets, either before running IMPUTE2 or during a run with the options described here.
Additional options must be invoked if you want to include the -g_ref panel in your association tests (e.g., as part of your control set). This process requires a fair amount of imputation expertise, and we prefer to advise people about it on an individual basis. If you are interested in using this approach, please send a message to our mail list.

Imputation with one phased and one unphased reference panel, with additional options

Here we perform the same basic analysis as in this example, but we use a number of additional options to modify the behavior of IMPUTE2.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -exclude_snps_g_ref ./Example/example.chr22.reference.snp.exclusions \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -align_by_maf_g \
 -sample_g ./Example/example.study.samples \
 -exclude_samples_g ./Example/example.study.sample.exclusions \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -k 100 \
 -burnin 5 \
 -iter 20 \
 -pgs \
 -no_sample_qc_info \
 -o_gz \
 -o ./Example/example.chr22.complicated.impute2

Comments

These comments will focus on the specialized options used in the example above; for comments on this general imputation scenario, see here.
The -exclude_snps_g_ref option specifies a few SNPs to remove from the -g_ref file, using different types of SNP IDs. These might be SNPs that failed QC testing, for example.
The -align_by_maf_g option tells the program to use minor allele frequencies to align the allele coding of A/T and C/G SNPs between the -g file and the -l file. However, the -strand_g option takes precedence over -align_by_maf_g, and in this case all of the genotyped SNPs have explicit alignments in the strand file, so the -align_by_maf_g flag has no effect.
This run includes both a -sample_g file and an -exclude_samples_g file. The sample file tells IMPUTE2 which samples in the -g file are which, and the exclusions file tells it the IDs of samples that should be removed from the analysis. These might be individuals who showed systematic data quality problems on a genome-wide SNP chip, for example.
Here we have increased -k from its default value of 80 to 100. This will increase the imputation accuracy, but it will also increase IMPUTE2's running time. In this example we have tried to offset the increased running time by decreasing the -burnin value from 10 (default) to 5 and the -iter value from 30 (default) to 20.
The -pgs flag tells the program to "predict genotyped SNPs"; that is, to replace the original study genotypes with LD-based imputed genotypes in the output file.
The -no_sample_qc_info flag suppresses the output file that shows quality control metrics for each individual in the -g file.
The -o_gz flag specifies that the main output file should be compressed by the gzip algorithm; this is useful if you are running jobs that produce large output files.

Phasing

Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -phase \
 -m ./Example/example.chr22.map \
 -g ./Example/example.chr22.study.gens \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.phasing.impute2

Comments

The -o file is always reserved for imputation output, so the phased haplotypes in this example get printed to a file named ./Example/example.chr22.phasing.impute2_haps, where the _haps suffix is added automatically. The format of this output file is explained here.
No strand alignment is needed in this example since we are using only one data panel.
In our experience this phasing procedure works well for SNP chip data, but it may have statistical convergence issues in datasets with high marker density, such as those that result from resequencing studies of population samples. If you would like to phase that kind of dataset, please send a message to our mail list for suggestions about how to improve the quality of inference.

We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.

Phasing with a reference panel

Here, we extend a basic phasing analysis to incorporate a phased reference panel. Population-based phasing methods work by pooling linkage disequilibrium information across individuals, so adding a panel of high-quality haplotypes can improve phasing accuracy.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -phase \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.phasing.impute2

Comments

The -o file is always reserved for imputation output, so the phased haplotypes in this example get printed to a file named ./Example/example.chr22.phasing.impute2_haps, where the _haps suffix is added automatically. The format of this output file is explained here.
The reference panel in this example includes SNPs that are not present in the -g file. IMPUTE2 can simultaneously impute the untyped SNPs and phase the typed SNPs in that file, but it will not phase the untyped SNPs; the main output file (./Example/example.chr22.phasing.impute2) will include estimated genotypes for all study + reference SNPs, but the phased haplotype output file (./Example/example.chr22.phasing.impute2_haps) will include only the SNPs from the -g file. We decided not to have the program produce haplotypes at reference-panel-only SNPs because the computation needed to provide good estimates is much greater than that needed to phase just the input genotypes or to impute the untyped SNPs without phasing them. If you really want to try phasing the untyped SNPs as well, please send a message to our mail list.
If you don't care about imputing the reference-panel-only SNPs into your study data (i.e., you just want to phase the original genotypes), you can substantially speed up the inference by adding "-os 2" to the command line. This tells the program to "output SNPs of type 2", which are ones with input data in both the reference and study panels. By implicitly telling the program not to output other kinds of SNPs (e.g., those typed only in the reference panel), you allow it to avoid wasting calculations that won't contribute to the final output.
Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files). The allele codings must be aligned across all datasets, either before running IMPUTE2 or during a run with the options described here.
In our experience this phasing procedure works well for SNP chip data, but it may have statistical convergence issues in datasets with high marker density, such as those that result from resequencing studies of population samples. If you would like to phase that kind of dataset, please send a message to our mail list for suggestions about how to improve the quality of inference.

Program Options

These links explain the command-line arguments that can be used to control IMPUTE2.

Option type	Description
Required arguments	The program will not run if these are not supplied.
Input file options	A list of possible input files, with formatting requirements.
Output file options	Naming conventions and options for controlling format of output files.
Basic options	Options for controlling how the program processes input data.
Strand alignment options	Options for aligning allele coding across data files.
Filtering options	Options for controlling the filters that get applied to input data.
MCMC options	Options for controlling the MCMC algorithm.
Pre-phasing options	Options that facilitate pre-phasing and subsequent imputation.
Panel merging options	Options for merging a pair of reference panels.
Chromosome X options	Options for analyzing chromosome X data.
Expert options	Options to be used by experts only.

Required arguments

This table shows the input arguments that you must supply in order for IMPUTE2 to run. However, the program will not do anything useful unless you also supply other input options and/or data files.

Flag	Default	Description
-g <file>	none	File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option.
-m <file>	none	Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our reference panel download packages come with appropriate recombination map files.
-int <lower> <upper>	none	Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag.
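
When splitting a chromosome into chunks, the -int boundaries can be generated rather than typed by hand. The sketch below prints one interval per line; the function name print_chunks and the 5 Mb chunk length are illustrative assumptions (chosen only to stay under the 7 Mb limit mentioned above), not program defaults or an official recommendation.

```shell
#!/bin/sh
# Print one "-int <lower> <upper>" pair per line for consecutive,
# non-overlapping chunks.
# Arguments: first base pair, last base pair, chunk length in bp.
print_chunks() {
    awk -v first="$1" -v last="$2" -v size="$3" 'BEGIN {
        for (lo = first; lo <= last; lo += size) {
            hi = lo + size - 1
            if (hi > last) hi = last   # final chunk may be shorter
            printf "-int %d %d\n", lo, hi
        }
    }'
}

# Hypothetical 12 Mb region split into 5 Mb chunks.
print_chunks 1 12000000 5000000
# prints:
#   -int 1 5000000
#   -int 5000001 10000000
#   -int 10000001 12000000
```

Each printed line could then be appended to an otherwise fixed impute2 command, with a distinct -o file name per chunk.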

Input file options

This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems).

Flag	Default	Description
-g <file>	none	File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option.
-m <file>	none	Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our reference panel download packages come with appropriate recombination map files.
-h <file 1> <file 2>	none	File of known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be provided with a corresponding legend file. We provide formatted haplotypes from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages. In IMPUTE2, it is possible to specify two -h files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names.
-l <file 1> <file 2>	none	Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). We provide legend files for data from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages. When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order; i.e., the file with more SNPs comes first.
-g_ref <file>	none	File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file).
-known_haps_g <file>	none	File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's -phase option: five header columns (as in the -g file) followed by two columns (haplotypes) per individual. Allowed values in the haplotype columns are 0, 1, and ?. If your study dataset is fully phased, you can replace the -g file with a -known_haps_g file. This will cause IMPUTE2 to perform haploid imputation, although it will still report diploid imputation probabilities in the main output file. If any genotypes are missing, they can be marked as '? ?' (two question marks separated by one space) in the input file. (The program does not allow just one allele from a diploid genotype to be missing.) If the reference panels are also phased, IMPUTE2 will perform a single, fast imputation step rather than its standard MCMC module; this is how the program imputes into pre-phased GWAS haplotypes. The -known_haps_g file can also be used to specify study genotypes that are "partially" phased, in the sense that some genotypes are phased relative to a fixed reference point while others are not. We anticipate that this will be most useful when trying to phase resequencing data onto a scaffold of known haplotypes. To mark a known genotype as unphased, place an asterisk immediately after each allele, with no space between the allele (0 or 1) and the asterisk (*); e.g., "0* 1*" for a heterozygous genotype of unknown phase.
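
Because each -h file must pair with a legend file that has one header line and one row per SNP, a quick consistency check before a long run can catch mismatched files. This is a minimal sketch under those stated format rules; the function name check_panel and the toy file contents are made up for illustration, and the check is not something IMPUTE2 itself provides in this form.

```shell
#!/bin/sh
# Toy reference panel: 3 SNPs, 4 haplotypes. File contents are invented.
hap=$(mktemp); leg=$(mktemp)
cat > "$hap" <<'EOF'
0 1 1 0
0 0 1 1
1 0 0 1
EOF
cat > "$leg" <<'EOF'
rsID position a0 a1
rs1 20400100 A G
rs2 20400200 C T
rs3 20400300 G A
EOF

# Succeed if the pair looks consistent: the legend has one header line
# plus one row per haplotype row, and every haplotype entry is 0 or 1.
check_panel() {
    n_hap=$(awk 'END { print NR }' "$1")
    n_leg=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$2")
    [ "$n_hap" -eq "$n_leg" ] || { echo "row count mismatch: $n_hap vs $n_leg"; return 1; }
    awk '{ for (i = 1; i <= NF; i++) if ($i != "0" && $i != "1") exit 1 }' "$1" \
        || { echo "non-0/1 allele found"; return 1; }
    echo "ok: $n_hap SNPs"
}

check_panel "$hap" "$leg"
# prints: ok: 3 SNPs
```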

Output file options

The options in this table control the format and naming conventions of output files printed by IMPUTE2.

Flag	Default	Description
-o <file>	./test.impute2	Name of main output file. Follows the same format as the -g file.
-i <file>	[-o]_info	Name of SNP-wise information file with one line per SNP and a single header line at the beginning. This file always contains the following columns (header tags shown in parentheses): 1. SNP identifier from -g file (snp_id) 2. rsID (rs_id) 3. base pair position (position) 4. expected frequency of allele coded '1' in the -o file (exp_freq_a1) 5. measure of the observed statistical information associated with the allele frequency estimate (info) [details] 6. average certainty of best-guess genotypes (certainty) 7. internal "type" assigned to SNP (type) Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX. IMPUTE2 assigns every SNP an internal "type" which reflects the combination of input datasets that include data for that SNP; here, X gives the type, which takes values in {0,1,2}. You can learn how the program determines SNP types here. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes and the best-guess imputed genotypes, where the input genotypes at that SNP have been masked internally and then imputed as if the SNP were of type X; similarly, r2_typeX is the squared correlation between input and masked/imputed genotypes at a SNP. The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns are useful for post-hoc quality control; we will soon explain how we use them in our section on Best Practices for Imputation.
-r <file>	[-o]_summary	Name of log file that records a summary of the screen output.
-w <file>	[-o]_warnings	Name of file that records warnings generated by IMPUTE2.
-os <int> <int> ...	0 1 2 3	"Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3".
-o_gz		Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large.
-outdp <int>	3	Specifies the number of decimal places to use for reporting genotype probabilities in the main output file.
-no_snp_qc_info		Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file.
-no_sample_qc_info		Suppresses printing of per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample".
-phase		IMPUTE2 always implicitly phases the study genotypes (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program also prints a separate haplotype file named "[-o]_haps". This file contains the same five header columns as the standard output, along with two columns (haplotypes) per individual, in the same order they appear in the main output. In addition to this "best-guess" haplotype file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named "[-o]_haps_confidence". In this file, homozygotes are represented by * characters and heterozygotes are represented by numbers between 0.5 and 1.0; this is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (upstream) is correct. By convention, the first heterozygous SNP in each individual for a given analysis region is assigned a phasing certainty of 1.0. As illustrated by our example commands, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis.
-pgs		"Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes in the -o file (applies to Type 2 SNPs only).
-pgs_miss		Unlike -pgs, which replaces all input genotypes with imputed genotypes, this option tells the program to replace only the missing genotypes at typed SNPs. That is, any input genotype whose maximum probability exceeds the -call_thresh will simply be reprinted in the -o file, whereas input genotypes that fall below the calling threshold will be imputed in the output. This is an appealing option that will "fill in" sporadically missing genotypes in your input data. However, it is possible that this could cause subtle problems in downstream association testing. We therefore suggest that you use caution when applying this option.

Details about 'info' metric

IMPUTE2 reports an information metric in the fifth column of its -i file. This metric is similar to the r-squared metrics reported by other programs like MaCH and Beagle. Although each of these metrics is defined differently, they tend to be correlated.

Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).

Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
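
As one way to apply such a cutoff, the sketch below keeps the SNPs in an -i file whose info value is at least a chosen threshold. The function name filter_by_info, the 0.3 cutoff, and the toy rows are assumptions for illustration; the header tags are the ones listed in the output options table, and the info column is located by name rather than by position.

```shell
#!/bin/sh
# Toy -i file using the documented header tags; the rows are invented.
info_file=$(mktemp)
cat > "$info_file" <<'EOF'
snp_id rs_id position exp_freq_a1 info certainty type
--- rs1 20400100 0.120 0.950 0.990 0
--- rs2 20400200 0.030 0.250 0.800 0
snp3 rs3 20400300 0.400 1.000 1.000 2
--- rs4 20400400 0.010 -1 0.500 0
EOF

# Print the header plus every SNP whose info value is at least $2.
# SNPs with undefined info (-1) are dropped by any non-negative cutoff.
filter_by_info() {
    awk -v min="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "info") c = i; print; next }
        $c >= min
    ' "$1"
}

filter_by_info "$info_file" 0.3
```

With the toy data and a cutoff of 0.3, the rs1 and rs3 rows are retained along with the header.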

We define our info metric and compare it against other metrics in a review paper that we recently published. If you have questions, please read that material first, then send a message to our mail list if anything is still unclear.

Basic options

These options control some basic processing that the program does to prepare input data for inference.

Flag	Default	Description
-int <lower> <upper>	none	Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g.,-int 5420000 10420000) or in exponential notation (e.g.,-int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in thesection on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the-allow_large_regions flag.
-buffer <int>	250 kb	Length of buffer region () to include on each side of the analysis interval specified by the-int option. SNPs in the buffer regions inform the inference but do not appear in output files (unless you activate the-include_buffer_in_output flag). Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. Larger buffers may improve accuracy for low-frequency variants (since such variants tend to reside on long haplotype backgrounds) at the cost of longer running times.
-allow_large_regions		Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here.
-include_buffer_in_output		Tells the program to include SNPs from the -buffer region in all output files. The main reason for using this option is to preserve the buffer information for downstream imputation, e.g. when pre-phasing a GWAS dataset.
-Ne <int>	20000	"Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to guide its model of linkage disequilibrium patterns. When most imputation runs were conducted with reference panels from HapMap Phase 2, we suggested values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT. Modern imputation analyses typically involve reference panels with greater ancestral diversity, which can make it hard to determine the "ideal" -Ne value for a particular study. Fortunately, we have found that imputation accuracy is highly robust to different -Ne values; within each of several human populations, we have obtained nearly identical accuracy levels for values between 10000 and 25000.
-call_thresh <float>	0.9	Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing. NOTE: This threshold applies only to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions.
-nind <int>	# of indiv in -g file	Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs.
-verbose		Print detailed output about the progress of imputation. By default, IMPUTE2 prints only the number of the current MCMC iteration when performing imputation, but this flag tells it to print more detailed updates.
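
The calling rule described for -call_thresh can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the table, not IMPUTE2's internal code; the function name and the use of None for a missing genotype are assumptions made for this example, and "exceeds" is read here as strictly greater than the threshold.

```python
from typing import Optional, Sequence


def call_genotype(probs: Sequence[float], call_thresh: float = 0.9) -> Optional[int]:
    """Return the index (0, 1 or 2) of the most probable genotype.

    Returns None (treated as missing) when the largest probability does
    not exceed the threshold. ASSUMPTION: the comparison is strict.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] > call_thresh else None


if __name__ == "__main__":
    print(call_genotype([0.02, 0.95, 0.03]))  # confident heterozygote call
    print(call_genotype([0.50, 0.40, 0.10]))  # too uncertain, treated as missing
```

As the table notes, this kind of hard calling applies to input genotypes only; it is usually not a good way to treat imputation output.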

Strand alignment options

In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you.

NOTE: IMPUTE2 will automatically align the strand between panels whenever it can do so unambiguously; e.g., flipping A/C in Panel 2 to match G/T in the reference. The options below pertain to variants where this is not possible, e.g. because an A/T SNP cannot be aligned by label alone.

NOTE: We currently assume that all phased reference files have already been aligned to the '+' strand of the human genome reference sequence, which is true of the files that we distribute; hence, the options here pertain only to study genotype files (like the -g and -known_haps_g files) and unphased reference files (i.e., a -g_ref file).

Flag	Default	Description
-strand_g <file>	none	File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). We provide model strand files in the Example/ directory that comes with the software download.
-strand_g_ref <file>	none	Same as -strand_g, but applies to the -g_ref file.
-align_by_maf_g		Activates the program's internal strand alignment procedure for the -g file (AKA Panel 2; for details about the panel nomenclature used here, see the overview). The strand is aligned to the alleles in reference Panel 0, if present, otherwise to reference Panel 1. This option pertains only to A/T and C/G SNPs, which it aligns such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele. NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to align the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others. NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The best way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured.
-align_by_maf_g_ref		Similar to -align_by_maf_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if Panel 0 was not provided (i.e., if you did not supply -l and -h files). NOTE: Just as -align_by_maf_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over aligning the strand by MAF. NOTE: As with -align_by_maf_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%.
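
The strand file layout described for -strand_g (one SNP per line, base pair position and '+' or '-' separated by a single space) is simple enough to generate with a short script. The sketch below is one possible way to write and read such a file; the function names are hypothetical and the positions in the usage example are made up for illustration.

```python
from typing import Dict


def format_strand_file(strands: Dict[int, str]) -> str:
    """Render {position: strand} as strand-file text, one SNP per line."""
    lines = []
    for position, strand in strands.items():
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {strand!r}")
        # Two columns separated by exactly one space, as the table requires.
        lines.append(f"{position} {strand}")
    return "\n".join(lines) + "\n"


def parse_strand_file(text: str) -> Dict[int, str]:
    """Parse strand-file text back into {position: strand}."""
    strands = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        position, strand = line.split(" ")
        strands[int(position)] = strand
    return strands


if __name__ == "__main__":
    # Hypothetical positions; SNP order does not matter in a strand file.
    print(format_strand_file({5421000: "-", 5420000: "+"}), end="")
```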

Filtering options

The options in this table affect the way that the program filters the input data. Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.

Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref.

Flag	Default	Description
-filt_rules_l <str> <str> ...	none	This option provides flexible variant filtering in the reference panel via "filter rules", which are based on annotation columns in a -l file. Each column should be labeled by a contiguous string (no whitespace) describing its contents. For example, the Example/ directory in the software download packages includes a file named example.chr22.1kG.annot.legend that contains several labeled annotation columns. To filter variants based on the numeric annotation values in the -l file, you should combine a column string with a cutoff value and one of the six supported comparison operators (such as < or >). For example, a filter string that joins a column name, the < operator, and the cutoff 0.05 would tell the program to remove any variants whose values in that column are less than 0.05 from the reference panel. You can include an arbitrary number of filtering strings after the -filt_rules_l option, in which case the filtering conditions will be applied in 'or' fashion: if any condition is true, the variant will be removed. Each filtering string should be enclosed in single quotes; otherwise, the command-line environment may interpret symbols like < and > as Linux redirection operators. There should be no white space within the single quotes. You can develop annotations yourself and add them to the -l file, or you can use the annotations that we provide in some of our reference download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1,000 Genomes Phase 1 integrated variant reference panel. For an illustration of using -filt_rules_l in practice, see this example command.
-exclude_snps_g <file>	none	List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option.
-exclude_snps_g_ref <file>	none	Same as -exclude_snps_g, but applies to the -g_ref file.
-impute_excluded		Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored.
-include_snps <file>	none	List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the reference SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on SNPs in the -g file.
-sample_g <file>	none	File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first two columns are necessary, but they must be present and labeled "ID_1" and "ID_2". NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option, or if you are analyzing chromosome X data via the -chrX option.
-sample_g_ref <file>	none	Same as -sample_g, but applies to the -g_ref file.
-exclude_samples_g <file>	none	List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is required if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples", where -o is the name of the main output file. NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option.
-exclude_samples_g_ref <file>	none	Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g.
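
To make the 'or' semantics of -filt_rules_l concrete, here is a small Python sketch that applies filter strings to one row of a legend file. It is not the program's implementation. The operator set in the code (<, <=, >, >=, ==, !=) is an assumption, and the column names maf_a and maf_b are hypothetical placeholders rather than columns from any distributed legend file.

```python
import operator
import re
from typing import Callable, Dict, Iterable, Tuple

# ASSUMPTION: the exact operator spellings accepted by IMPUTE2 are not
# listed here; this set is a guess used only for illustration.
_OPS: Dict[str, Callable[[float, float], bool]] = {
    "<=": operator.le, ">=": operator.ge, "==": operator.eq,
    "!=": operator.ne, "<": operator.lt, ">": operator.gt,
}
# Column name, operator, numeric cutoff, with no white space anywhere.
_RULE = re.compile(r"^(\S+?)(<=|>=|==|!=|<|>)(\S+)$")


def parse_rule(rule: str) -> Tuple[str, Callable[[float, float], bool], float]:
    """Split a filter string such as 'maf_a<0.05' into its three parts."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse filter rule: {rule!r}")
    column, symbol, cutoff = match.groups()
    return column, _OPS[symbol], float(cutoff)


def is_removed(variant: Dict[str, float], rules: Iterable[str]) -> bool:
    """A variant is removed if ANY rule is true for it ('or' fashion)."""
    for column, compare, cutoff in map(parse_rule, rules):
        if compare(float(variant[column]), cutoff):
            return True
    return False


if __name__ == "__main__":
    rules = ["maf_a<0.05", "maf_b<0.05"]  # hypothetical column names
    print(is_removed({"maf_a": 0.01, "maf_b": 0.20}, rules))  # first rule fires
    print(is_removed({"maf_a": 0.10, "maf_b": 0.20}, rules))  # no rule fires
```

Because the rules combine with 'or', adding more filter strings can only remove more variants, never fewer.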

MCMC options

IMPUTE2 uses a Markov chain Monte Carlo (MCMC) algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.

Flag	Default	Description
-iter <int>	30	Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly, although increasing -k generally leads to greater improvements for a fixed computational cost.
-burnin <int>	10	Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for unphased individuals during each of the first [-burnin] iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations is enough to ensure good results in a variety of different datasets.
-k <int>	80	Number of haplotypes (in the reference or study data) to use as templates when phasing observed genotypes. Increasing this value will lead to higher accuracy at the cost of longer running times, which scale quadratically with -k. The default value should be sufficient for most analyses.
-k_hap <int>	500	Number of reference haplotypes to use as templates when imputing missing genotypes. If this value is less than the total number of haplotypes in your reference panel, IMPUTE2 will choose a "custom" set of -k_hap haplotypes each time it imputes missing alleles in a study haplotype. If all of your reference haplotypes have similar ancestry to the subjects in your study, each haplotype is potentially useful for imputation, so the best accuracy can be achieved by setting -k_hap to the total number of reference haplotypes. Using smaller values will decrease the running time linearly while incurring a slight loss of accuracy. Conversely, we now recommend running IMPUTE2 with large reference panels containing haplotypes of diverse ancestry. (For more details, see here.) In this context, our rule of thumb suggests setting -k_hap to be smaller than the total size of the reference panel. Imputation accuracy is robust to different values of -k_hap within a sensible range, so it should usually be sufficient to choose a value by intuition. When in doubt, it is safer to err on the side of a larger value, since we often find that diverse reference panels contain more useful haplotypes than one might expect. As of software version 2.3.0, -k_hap can accept two values when you are imputing from two reference panels, for example '-k_hap 500 200'. In this context, the first value is the number of haplotypes to be chosen from Panel 0 and the second value is the number to be chosen from Panel 1. This flexibility can be useful when merging reference panels.
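
Two pieces of arithmetic follow from the table above: how many iterations actually contribute to the final probabilities, and how the phasing cost grows with -k. The sketch below works through both. The function names are hypothetical, and the cost model keeps only the stated quadratic scaling; it ignores constant factors and every other part of the run, so it should be read as a rough guide rather than a timing prediction.

```python
def sampling_iterations(iters: int = 30, burnin: int = 10) -> int:
    """Iterations that contribute to the final imputation probabilities.

    -iter counts all iterations including burn-in, so the contributing
    iterations are the ones left after the burn-in is discarded.
    """
    if not 0 <= burnin < iters:
        raise ValueError("burn-in must be non-negative and smaller than -iter")
    return iters - burnin


def relative_phasing_cost(k: int, baseline_k: int = 80) -> float:
    """Phasing cost relative to the default -k, assuming purely quadratic scaling."""
    return (k / baseline_k) ** 2


if __name__ == "__main__":
    print(sampling_iterations())       # defaults: 30 total, 10 discarded
    print(relative_phasing_cost(160))  # doubling -k under the quadratic assumption
```

With the default settings, 20 of the 30 iterations contribute to the output, and doubling -k from 80 to 160 would roughly quadruple the phasing cost under this simplified model.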

Pre-phasing options

You can greatly speed up your imputation through a process called "pre-phasing". The idea of this approach is to first phase your GWAS genotypes, then use the estimated GWAS haplotypes to impute untyped variants from a reference panel. The options in this table activate the corresponding functionality in IMPUTE2. You can see how these options are applied in this example command.

Flag	Default	Description
-prephase_g		Tells IMPUTE2 to phase the genotypes in the -g file. The estimated haplotypes are printed to a dedicated output file named "[-o]_haps", where [-o] is the name supplied for the main output file. To avoid edge effects in downstream imputation, IMPUTE2 will extend the estimated haplotypes into the buffer regions that flank the main region specified via -int.
-use_prephased_g		Tells IMPUTE2 to perform imputation with pre-phased GWAS haplotypes, which must be supplied via a -known_haps_g file. This file will often be produced by a pre-phasing run that used -prephase_g on the same imputation interval (-int), although it may also come from a different phasing algorithm like SHAPEIT, which can print haplotypes in -known_haps_g format.

Panel merging options

These options allow IMPUTE2 to efficiently combine two reference panels typed on partially overlapping sets of variants.

Flag	Default	Description
-merge_ref_panels		Tells the program to combine information across two reference panels using the approach described here.
-merge_ref_panels_output_ref <file>	none	Activates -merge_ref_panels and tells the program to store the merged panel in two output files: a legend file named <file>.legend and a haplotype file named <file>.hap.
-merge_ref_panels_output_gen <file>	none	Activates -merge_ref_panels and tells the program to store the merged panel in .gen format in an output file named <file>.gen.

NOTE: If you want IMPUTE2 to print a merged reference panel with buffer regions included, you should use one of the last two options together with the -include_buffer_in_output flag.

NOTE: You can see an example run that uses -merge_ref_panels here.

Chromosome X options

These options facilitate the analysis of genotype data from human chromosome X.

Flag	Default	Description
-chrX		Specifies that this is an analysis of chromosome X data. This flag changes the model parameters by automatically reducing the -Ne value by 25%, and it allows the -g file to include a mixture of diploid females and hemizygous males. When using the -chrX option, it is essential to provide a -sample_g file with a column that records each individual's sex, since this tells the program which individuals are males and which are females. More details on the file formats for chromosome X analysis are available here, and you can see an example run here.
-Xpar		Specifies that the current dataset comes from a pseudoautosomal region (PAR) of chromosome X, where both males and females are diploid. When used together with -chrX, this flag will reduce -Ne by 25% but otherwise run the analysis in the same way as on the autosomes.

Expert options

The options in this table are meant for experts only. Don't use them unless you know what you are doing!

Flag	Default	Description
-seed <int>	random	Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option.
-no_warn		Turns warnings off, so that the -w file does not get printed.
-fill_holes		Turns on the "hole-filling" function, which allows SNPs that are typed in the -g file but not in the lowest reference panel to contribute to the inference.
-no_remove		Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. Such SNPs will be retained in the output, but they will not be used for inference.

Best Practices for Imputation

IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.

Pre-imputation filtering of study genotypes

Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.

Variant position matching across input files

When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). If two or more variants have the same position (perhaps because one is a SNP and one is an overlapping INDEL), then these variants are matched across panels based on their allele labels.

It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".

Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please send a message to our mail list.

Strand alignment between study and reference data

It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.
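
The two scenarios above can be checked mechanically before running imputation. The following Python sketch flips alleles by complementing them and flags the A/T and C/G SNPs that cannot be aligned by label alone. It is a pre-processing illustration with hypothetical function names, not a description of how IMPUTE2 aligns strands internally, and it handles only single-base alleles.

```python
from typing import Optional, Tuple

_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles: Tuple[str, str]) -> Tuple[str, str]:
    """Complement both alleles, i.e. express the SNP on the other strand."""
    return (_COMPLEMENT[alleles[0]], _COMPLEMENT[alleles[1]])


def is_strand_ambiguous(a0: str, a1: str) -> bool:
    """A/T and C/G SNPs look identical on both strands."""
    return _COMPLEMENT[a0] == a1


def align_to_reference(study: Tuple[str, str],
                       ref: Tuple[str, str]) -> Optional[Tuple[str, str]]:
    """Return the study alleles on the reference strand.

    Returns None when the SNP is strand-ambiguous or when the allele
    labels cannot be reconciled with the reference at all.
    """
    if is_strand_ambiguous(*study):
        return None  # needs assay information or a frequency-based guess
    if set(study) == set(ref):
        return study
    flipped = flip(study)
    if set(flipped) == set(ref):
        return flipped
    return None


if __name__ == "__main__":
    print(align_to_reference(("A", "G"), ("T", "C")))  # flipped to match
    print(align_to_reference(("A", "T"), ("A", "T")))  # ambiguous by label
```

SNPs for which this check returns None are the ones that need vendor assay information, a strand file, or one of the educated-guess options.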

Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.

Choosing a reference panel

Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy can increase accuracy at low-frequency variants, and it avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population. You can read our paper on this strategy here, learn about practical ways of applying it here, and download state-of-the-art reference haplotypes here.

If you have collected a custom reference panel for your study population (say, exome-wide or genome-wide sequencing data), you can combine it with the 1,000 Genomes data to maximize accuracy and genomic coverage at the same time. To learn how IMPUTE2 does this, see here.

Genome-wide imputation

It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a few mechanisms to help with this process:

IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See here for suggestions on how to use this functionality.
IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach here. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the original IMPUTE2 MCMC algorithm for maximizing accuracy in smaller regions.
Sequence-based reference panels contain large numbers of rare and low-frequency variants, which can drive up the computational cost of imputation. When computing power is limited, it may be desirable to remove some of these variants (e.g., those with very low frequencies in the population of interest) before running imputation. To facilitate this process, we have added the -filt_rules_l option, which can flexibly remove reference variants based on command-line input to an IMPUTE2 run. You can see an example application of this approach and some guidelines for using it here.

Post-imputation filtering

It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.

Association testing

We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.

Follow-up imputation of putative associations

Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:

In contrast to the pre-phasing approach that we recommend for genome-wide imputation, we suggest using the standard IMPUTE2 MCMC algorithm for follow-up imputation. This method takes longer to run in each region, but it should lead to slightly higher accuracy (especially at low-frequency variants) and remain computationally feasible when run on a limited portion of the genome.
If time permits, the overall accuracy may be improved by increasing the value of the -k parameter.
If time permits, the accuracy at low-frequency variants may be improved by increasing the size of the -buffer region, say from the default value of 250 kb to 1000 kb (1 Mb).

Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.

Pre-Phasing GWAS

Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.

The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy since the estimation uncertainty in the study haplotypes is ignored, but this allows for very fast imputation. This speedup is especially important because modern reference collections (such as those from the 1,000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.

For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. Pre-phasing is implemented through three program options: -prephase_g, -use_prephased_g, and -known_haps_g. The best way to learn how to use this approach is by example.

If you use this functionality in your study, please remember to cite our article about pre-phasing in GWAS and the original IMPUTE2 article.
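
As a rough illustration of how the three pre-phasing options fit together, the Python sketch below assembles the argument lists for a two-step analysis: one run that phases the study genotypes, and a second run that imputes into the resulting haplotypes. Several things here are assumptions rather than documented facts from this section: the executable name impute2, the -m flag for a recombination map file, and every file name in the usage example. Only the "[-o]_haps" naming of the phased output is taken from the option table.

```python
from typing import List


def prephase_command(geno_file: str, map_file: str,
                     lower: int, upper: int, out_prefix: str) -> List[str]:
    """Step 1: phase the study genotypes (-g) within one interval."""
    return [
        "impute2", "-prephase_g",      # executable name is an assumption
        "-m", map_file,                # ASSUMPTION: -m supplies the genetic map
        "-g", geno_file,
        "-int", str(lower), str(upper),
        "-o", out_prefix,
    ]


def impute_command(prephase_out_prefix: str, map_file: str, hap_file: str,
                   legend_file: str, lower: int, upper: int,
                   out_file: str) -> List[str]:
    """Step 2: impute into the haplotypes produced by step 1."""
    return [
        "impute2", "-use_prephased_g",
        "-m", map_file,
        "-h", hap_file,
        "-l", legend_file,
        # The pre-phasing run writes its haplotypes to "[-o]_haps".
        "-known_haps_g", prephase_out_prefix + "_haps",
        "-int", str(lower), str(upper),  # same interval as step 1
        "-o", out_file,
    ]


if __name__ == "__main__":
    # All file names below are hypothetical.
    step1 = prephase_command("study.gen", "chr22.map", 20400000, 20500000, "study.prephased")
    step2 = impute_command("study.prephased", "chr22.map", "ref.hap", "ref.legend",
                           20400000, 20500000, "study.imputed")
    print(" ".join(step1))
    print(" ".join(step2))
```

Keeping the two steps separate reflects the main benefit of pre-phasing: step 1 is run once per study dataset, while step 2 can be repeated cheaply whenever the reference panel is updated.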

Analyzing Whole Chromosomes

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on, all without changing the input files. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:
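
The chunking scheme described above is easy to generate programmatically. The short Python sketch below produces non-overlapping, adjacent -int intervals for a chromosome of a given length; the function name and the chromosome length in the usage example are hypothetical, and clipping the final chunk to the chromosome length is a choice made for this example.

```python
from typing import List, Tuple


def chunk_intervals(chrom_length: int, chunk_size: int = 5_000_000) -> List[Tuple[int, int]]:
    """Split positions 1..chrom_length into adjacent, non-overlapping chunks."""
    intervals = []
    lower = 1
    while lower <= chrom_length:
        upper = min(lower + chunk_size - 1, chrom_length)
        intervals.append((lower, upper))
        lower = upper + 1  # next chunk starts right after this one ends
    return intervals


if __name__ == "__main__":
    # Hypothetical 12-Mb chromosome, split into 5-Mb chunks.
    for lower, upper in chunk_intervals(12_000_000):
        print(f"-int {lower} {upper}")
```

Each printed pair can be passed to a separate run with the same input files, since the buffer regions are handled by the program rather than by the interval boundaries.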

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
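
When a chromosome is split into ten or more chunks, the files need to be concatenated in numeric order, and plain alphabetical sorting would place chunk10 before chunk2. The Python sketch below is one way to handle this; it assumes the chunk number appears in the file name as "chunk" followed by digits, as in the example above, and the function names are hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import Iterable, List


def chunk_number(path: str) -> int:
    """Extract N from a file name containing 'chunkN' (assumed naming scheme)."""
    match = re.search(r"chunk(\d+)", Path(path).name)
    if match is None:
        raise ValueError(f"no chunk number in file name: {path}")
    return int(match.group(1))


def concatenate_chunks(chunk_paths: Iterable[str], combined_path: str) -> List[str]:
    """Append chunk files to combined_path in numeric chunk order."""
    ordered = sorted(chunk_paths, key=chunk_number)
    with open(combined_path, "wb") as combined:
        for path in ordered:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, combined)  # streams without loading the file
    return ordered
```

A call such as concatenate_chunks(paths, "chr16_chunkAll.impute2") then plays the same role as the cat command, with the ordering made explicit.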

Merging Reference Panels

Problem statement

Modern genotyping and sequencing technologies are generating a variety of reference datasets that can be used for genotype imputation in association studies. Combining reference panels from different populations can often improve imputation accuracy (e.g., see Howie et al. 2011), but it is not clear how best to merge panels that are genotyped at different sets of variants.

Howie et al. 2009 proposed a solution for the special case where one reference panel contains a subset of the variants in another reference panel. We previously released a combined 1,000 Genomes + HapMap 3 panel that takes advantage of this framework, and it was also used in the WTCCC2 studies.

Many association studies are now using the latest 1,000 Genomes data to drive their genotype imputation, but they may also have sequenced additional individuals from the population being studied. It makes sense to combine these resources in order to use all available reference information, but in this case each reference panel will contain many variants that are not found in the other; that is, the "hierarchical" variant framework of Howie et al. 2009 no longer applies.

With this in mind, we have devised a new strategy for combining reference panels created by different sequencing or genotyping studies.

Our approach

There are many possible ways to merge two reference panels. We are exploring several of these options, but we decided to start with the simple approach depicted in the figure below. The top panel of this figure shows two reference panels and a GWAS cohort; you can think of the rows as individuals and the columns as positions along the genome. Each vertical line represents a genotyped variant in a given panel, and each reference panel includes variants that are not found in the other.

We impute the untyped variants in this figure in three steps:

Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation.
Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation.
Now that we have imputed the two reference panels up to the union of their variants, treat the imputed haplotypes as known (i.e., take the best-guess haplotypes) and impute the GWAS cohort in the usual way.
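
The bookkeeping behind these three steps can be sketched with set operations on variant positions. The Python example below only works out which variants need to be imputed into which panel and what the merged panel contains; it does not perform any imputation, and it identifies variants by position alone, which is a simplification. The function name and the positions in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_plan(panel0_positions: Iterable[int],
               panel1_positions: Iterable[int]) -> Dict[str, List[int]]:
    """Work out which variants each merging step has to handle."""
    panel0, panel1 = set(panel0_positions), set(panel1_positions)
    return {
        # Step 1: variants specific to Panel 0 are imputed into Panel 1.
        "impute_into_panel1": sorted(panel0 - panel1),
        # Step 2: variants specific to Panel 1 are imputed into Panel 0.
        "impute_into_panel0": sorted(panel1 - panel0),
        # Step 3: the study cohort is imputed from the union of variants.
        "merged_panel": sorted(panel0 | panel1),
    }


if __name__ == "__main__":
    plan = merge_plan([100, 200, 300], [200, 300, 400])  # hypothetical positions
    for step, positions in plan.items():
        print(step, positions)
```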

This process can be performed with IMPUTE2 (version 2.3 and later) in a streamlined way: all you have to do is add the -merge_ref_panels flag to the command line. You can see a working example command here.

Practical considerations

Using pre-phased study data

The -merge_ref_panels flag works with both unphased study genotypes (-g file) and pre-phased study haplotypes (-known_haps_g file).

Parameter settings

For finer control of the merging step, you can supply two values to -k_hap on the command line, for example '-k_hap 500 200'. This setting tells IMPUTE2 to use 500 haplotypes from Panel 0 and 200 haplotypes from Panel 1. These values should reflect the number of haplotypes in each panel that you expect to be useful for imputation in the study population, which could be less than the total number if either panel is multi-ethnic.

Reference panel ordering

The order in which you supply the reference panels on the command line should not affect the accuracy of imputation from the merged panel: inside the program, the calculations are completely symmetric. One practical limitation is that only the first legend file in an IMPUTE2 command is allowed to have more than four columns. The 1,000 Genomes legend files we distribute typically have more than four columns, so if you are using these files it makes sense to provide the 1,000 Genomes panel before your other panel on the command line.

Printing the merged panel

By default,IMPUTE2 does not print the merged reference panel (the outcome of Steps 1 and 2 above); the merging is done internally, and the output shows only the imputed genotypes for the study cohort. If you want the program to output the merged panel, you can replace-merge_ref_panels with one of two options:

-merge_ref_panels_output_ref: This option tells the software to merge the two reference panels and print the results in IMPUTE2 reference file format: one legend file and one haplotypes file. See the link for more information.
-merge_ref_panels_output_gen: This option tells the software to merge the two reference panels and print the results in IMPUTE2 .gen file format. Phase information is ignored when creating this file, which can be useful if you want to re-phase the merged reference panel. See the link for more information.

If you want to merge two reference panels without imputing into a study dataset (i.e., to skip Step 3 above), you should use one of these two options and omit the study data (-g file or -known_haps_g file) from your IMPUTE2 command.

Normally, these options print the merged reference panel within the region specified by the -int argument. If you want to include the buffer regions in the output, you should add the -include_buffer_in_output flag to your command line.

Publication and citation

Our approach for merging reference panels has not yet been published outside this website. We have tested the method on realistic datasets, and it has performed well in all of our analyses. We are actively working to document our work on this approach and to compare it with other strategies; we aim to report the results of these experiments and the details of our methodology as soon as possible.

In the meantime, we are happy to answer thoughtful questions and to hear about your experiences with this new functionality. If you would like to send comments, please do so through our mail list.

Imputation Concordance Tables

What is a concordance table?

Every run of IMPUTE2 produces a concordance table, except under certain settings that are not commonly used. A concordance table shows the results of an internal cross-validation that the program performs automatically. For this analysis, IMPUTE2 masks the genotypes of one variant at a time in the study data (Panel 2), then imputes the masked genotypes with information from the reference data and nearby study variants. The imputed genotypes are then compared with the original genotypes to evaluate the quality of the imputation. The results are summarized in a table like the one below:

If you are interested in the results of this experiment at a given variant, you can find this information in the _info file printed by IMPUTE2. The concord_typeX column shows the concordance between input genotypes and best-guess imputed genotypes at each variant, while the r2_typeX column gives the squared correlation between input genotypes and expected genotypes (or "dosages") from imputation. Note that the cross-validation cannot be performed at variants that were not provided in a Panel 2 input file (-g or -known_haps_g), so reference-only variants are assigned values of -1 in the _info file. To learn more about the format of this output file, see here.
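To make the two per-variant statistics concrete, here is a minimal sketch of how they could be computed for one variant. It assumes each imputed genotype is a triple of posterior probabilities for genotypes 0, 1, and 2, and that the expected genotype (dosage) is P(G=1) + 2*P(G=2); the function names are ours, and this is an illustration of the definitions, not the program's internal code.

```python
def dosage(probs):
    """Expected genotype from a triple P(G=0), P(G=1), P(G=2)."""
    _, p1, p2 = probs
    return p1 + 2.0 * p2


def best_guess(probs):
    """Genotype (0, 1, or 2) with the highest posterior probability."""
    return max(range(3), key=lambda g: probs[g])


def variant_concordance(input_genotypes, imputed_probs):
    """Fraction of best-guess imputed genotypes that match the input."""
    matches = sum(
        1 for g, p in zip(input_genotypes, imputed_probs) if best_guess(p) == g
    )
    return matches / len(input_genotypes)


def variant_r2(input_genotypes, imputed_probs):
    """Squared Pearson correlation between input genotypes and dosages."""
    x = [float(g) for g in input_genotypes]
    y = [dosage(p) for p in imputed_probs]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variation
    return (sxy * sxy) / (sxx * syy)


# Toy data: four individuals at one masked variant.
truth = [0, 1, 2, 1]
imputed = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.4, 0.0)]
concord = variant_concordance(truth, imputed)
r2 = variant_r2(truth, imputed)
```

Concordance uses only the best-guess call, while the squared correlation uses the full dosage, so a variant can score differently on the two measures.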

How are concordance tables made?

Only variants with input data from a -g or -known_haps_g file are masked and imputed in this analysis. When a -known_haps_g file is provided, all input genotypes are treated as being true. When a -g file is provided, we make hard genotype calls by applying a threshold (default = 0.9) to the maximum value in each input probability triple. For example, a genotype with P(G=0,1,2) = (0.03, 0.95, 0.02) would be called as a '1' (heterozygous), while a genotype with P(G=0,1,2) = (0.1, 0.7, 0.2) would be left uncalled and omitted from the concordance calculations.
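The hard-calling rule for -g input can be written in a few lines. This is a minimal sketch: the function name is ours, and treating a probability exactly equal to the threshold as a call is an assumption.

```python
def call_input_genotype(probs, threshold=0.9):
    """Return 0, 1, or 2 if the largest probability reaches the threshold.

    Returns None when the genotype is left uncalled. Whether the comparison
    is inclusive (>=) is an assumption of this sketch.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


called = call_input_genotype((0.03, 0.95, 0.02))   # confident heterozygote
uncalled = call_input_genotype((0.1, 0.7, 0.2))    # below the threshold
```

Applied to the two triples from the text, the first is called as genotype 1 and the second is left uncalled.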

The genotype probabilities from imputation are used somewhat differently. In the first three columns of the table, we assign each imputed genotype to a bin (Interval) based on its maximum posterior probability. Then, for each bin we report the number of imputed genotypes that passed the calling threshold in the input data (#Genotypes). We then convert the imputed probabilities to 'best-guess' genotypes: for each posterior probability triple, we select the genotype with the highest value, regardless of magnitude. Finally, we compare the input genotype calls with the best-guess imputed genotypes and report the concordance (%Concordance) within each bin.

In the last three columns of the table, we again bin the imputed genotypes based on their maximum posterior probabilities, but this time the binning is cumulative: the bin at the bottom of the table includes only genotypes that were confidently imputed (max prob >= 0.9), while each bin above includes all genotypes that pass a more lenient certainty threshold. These thresholds are shown in the fourth column (Interval). The fifth column (%Called) shows the percentage of imputed genotypes that pass a given probability threshold, where the denominator is the total number of imputed genotypes for which hard calls are available in the input data. The sixth column (%Concordance) shows the percentage of imputed genotypes in a given bin that match the masked input genotypes.
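The two halves of the table can be sketched as follows. This is a simplified illustration under stated assumptions: ten equal-width bins of 0.1, input calls already reduced to 0, 1, 2, or None (uncalled), and hypothetical field names chosen for this example. It is not the program's own implementation.

```python
def concordance_table(input_calls, imputed_probs, n_bins=10):
    """Build per-bin and cumulative concordance rows.

    input_calls: list of 0/1/2, or None for uncalled input genotypes.
    imputed_probs: list of posterior probability triples.
    """
    counts = [0] * n_bins
    matches = [0] * n_bins
    for call, probs in zip(input_calls, imputed_probs):
        if call is None:
            continue  # uncalled input genotypes are omitted
        best = max(range(3), key=lambda g: probs[g])
        # Bin by maximum posterior; a probability of 1.0 goes in the last bin.
        b = min(int(probs[best] * n_bins), n_bins - 1)
        counts[b] += 1
        matches[b] += int(best == call)

    total = sum(counts)
    rows = []
    for b in range(n_bins):
        cum_n = sum(counts[b:])      # genotypes with max prob >= threshold
        cum_m = sum(matches[b:])
        rows.append({
            "interval": (b / n_bins, (b + 1) / n_bins),
            "n_genotypes": counts[b],
            "pct_concordance": 100 * matches[b] / counts[b] if counts[b] else None,
            "threshold": b / n_bins,
            "pct_called": 100 * cum_n / total if total else None,
            "cum_pct_concordance": 100 * cum_m / cum_n if cum_n else None,
        })
    return rows


# Toy data: five masked genotypes, one of them uncalled in the input.
calls = [0, 1, 2, None, 1]
probs = [
    (0.95, 0.03, 0.02),
    (0.05, 0.92, 0.03),
    (0.35, 0.10, 0.55),
    (0.10, 0.80, 0.10),
    (0.45, 0.40, 0.15),
]
rows = concordance_table(calls, probs)
```

In this layout the first row's cumulative concordance covers every called genotype, so it corresponds to the overall concordance from the cross-validation.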

What can I do with a concordance table?

We can learn a couple of things from this kind of analysis. First, the results can alert us to problems in the imputation: if the concordance between imputed and input genotypes is abnormally low, it may indicate that something went wrong in the analysis or input files. A useful summary statistic is the number in the upper right-hand corner of the table, which gives the overall concordance from the cross-validation. This number should typically be around 95%; it may be lower in certain populations or regions of the genome, but if it is much lower then you may need to double-check the analysis. If you are worried about your results, please send a message with details of your analysis (including a _summary output file from IMPUTE2) to our mail list.

Concordance tables can also be used to predict the general quality of imputed genotypes at SNPs where we do not know the true genotypes. SNPs on GWAS microarrays tend to be easier to impute than untyped SNPs of the same frequency, so the cross-validation results may be somewhat optimistic, but they are often useful for relative comparisons (say, between different parameter settings of IMPUTE2).

Finally, the per-variant results of the cross-validation in the output _info file can help identify poorly genotyped SNPs and strand flips. For example, an input SNP that has a low concord_typeX value (implying that the imputed genotypes do not agree with the original genotypes) and a high info_typeX value (implying that the imputation is confident) might be worth investigating or removing from subsequent imputation runs.
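A screen of this kind might look like the sketch below. It assumes the _info file is space-delimited with a header line that includes rs_id, info_type0, and concord_type0 columns; the cutoffs (concordance below 0.8, info above 0.9) and the sample rows are made up for illustration and should be tuned to your own data.

```python
import csv
import io


def flag_suspect_variants(info_lines, max_concord=0.8, min_info=0.9):
    """Return IDs of typed variants with low concordance but high info.

    info_lines: an iterable of lines from an _info file (assumed layout).
    The thresholds are illustrative, not recommended values.
    """
    reader = csv.DictReader(info_lines, delimiter=" ")
    flagged = []
    for row in reader:
        concord = float(row["concord_type0"])
        info = float(row["info_type0"])
        if concord < 0:
            continue  # -1 marks variants with no cross-validation result
        if concord < max_concord and info > min_info:
            flagged.append(row["rs_id"])
    return flagged


# Made-up rows in the assumed layout.
sample = io.StringIO(
    "snp_id rs_id position a0 a1 exp_freq_a1 info certainty type "
    "info_type0 concord_type0 r2_type0\n"
    "--- rs1 100 A G 0.20 0.95 0.98 0 -1 -1 -1\n"
    "--- rs2 200 C T 0.35 1.00 1.00 2 0.97 0.99 0.98\n"
    "--- rs3 300 G A 0.50 1.00 1.00 2 0.96 0.41 0.10\n"
)
suspects = flag_suspect_variants(sample)
```

With a real file, the same function can be given an open file handle in place of the in-memory sample.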

Multiple reference panels

If you provide two reference panels to IMPUTE2, the program will perform the cross-validation in two different ways. First it will use only a single reference panel (Panel 0) to mimic Type 0 SNPs, and then it will use both reference panels together (Panels 0 and 1) to mimic Type 1 SNPs. In this case, IMPUTE2 will print two concordance tables, one for each type of reference SNP. Note that the same masked study genotypes are used to evaluate accuracy in both cases; the only difference is how much reference data we allow the program to see when imputing the masked genotypes.

Where can I find the concordance table?

The concordance table is printed at the end of an IMPUTE2 run. One copy is printed to STDOUT, and another copy is printed in the _summary output file.

Scripts

The following scripts are designed to help with various parts of an IMPUTE2 analysis. We provide them in the hope that they will be useful, but due to inconsistencies in file formats, assumptions, etc., they may not work as expected on every dataset. If you want to use one of these scripts, we suggest that you first read through the code to understand how it works.

All of these scripts are released under the GNU General Public License. Each script will print a list of command line options if you run it with no arguments.

Script name	Function
vcf2impute_legend_haps.pl	Convert a phased VCF file into reference panel format: one legend file and one haplotypes file.
vcf2impute_gen.pl	Convert a phased or unphased VCF file into genotype file format (.gen).

FAQ

Our FAQ has moved to this Google document.

References

[1] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]

[2] B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]

[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511 [Restricted Access PDF] [Supplementary Material]

[4] B. Howie, J. Marchini, and M. Stephens (2011) Genotype imputation with thousands of genomes. G3: Genes, Genomics, Genetics 1(6): 457-470 [Open Access Article] [Supplementary Material]

[5] B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8): 955-959 [Restricted Access PDF]

Contributors

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini

Mail List

If you have a question about IMPUTE2, please send a message to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.

Note: If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.

The version number of IMPUTE2 and the type of computer you are using to run it, e.g., "IMPUTE v2.2.2 on Mac OSX 10.6".
Any log files and/or screen output from the program; e.g., the "_summary" output file.
For difficult problems like memory access errors (e.g., "segmentation faults"), we may need you to send data files that show the problem. These files should ideally be small, and we can provide suggestions if you are not allowed to share your actual data.