
Cross-population analysis using GWAS summary statistics


XPASS

The XPASS package implements the XPASS approach for constructing PRS in an underrepresented target population by leveraging biobank-scale GWAS data from European populations.

Installation

# install.packages("devtools")
devtools::install_github("YangLabHKUST/XPASS")

Quick start

We illustrate the usage of XPASS using the GWAS summary statistics of BMI from UKB and BBJ. For demonstration, we use the easily accessible genotypes from The 1000 Genomes Project as the reference panel. However, because this reference panel only contains 1.3 million SNPs from 377 EAS samples and 417 EUR samples, the result of XPASS is slightly less accurate than what we reported in the AJHG paper. The performance of XPASS can be better when a GWAS dataset with a larger sample size (e.g., n > 2000) and more SNPs (e.g., 3M) is used as the reference panel. We strongly suggest that users use their own reference panels with sufficiently large sample sizes if available.

Data preparation

The datasets involved in the following example can be downloaded from here.

Input files of XPASS include:

summary statistics file of the target population
summary statistics file of the auxiliary population
reference panel of the target population in plink 1 format
reference panel of the auxiliary population in plink 1 format
covariates file associated with the target population reference panel. The covariates files can include sex, age and population information of the individuals from the reference panel.
covariates file associated with the auxiliary population reference panel
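
Because the reference panels are passed as PLINK 1 prefixes, it can help to confirm that each prefix has its three files before starting a long run. The sketch below is only an illustration: `check_plink_prefix` is a hypothetical helper, not part of XPASS, and the demonstration uses empty placeholder files in a temporary directory.

```r
# Hypothetical helper (not part of XPASS): verify that a PLINK 1 prefix
# has its .bed, .bim and .fam files before starting a long run.
check_plink_prefix <- function(prefix) {
  files <- paste0(prefix, c(".bed", ".bim", ".fam"))
  missing <- files[!file.exists(files)]
  if (length(missing) > 0) {
    stop("Missing PLINK files: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy demonstration with empty placeholder files in a temporary directory.
demo_prefix <- file.path(tempdir(), "toy_ref")
file.create(paste0(demo_prefix, c(".bed", ".bim", ".fam")))
check_plink_prefix(demo_prefix)  # passes silently
```

In practice the same check could be applied to the real prefixes (for example the EAS and EUR reference panels) before calling XPASS.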

Unlike LDSC, which only utilizes LD from local SNPs, XPASS uses the LD information from entire chromosomes to estimate heritability and co-heritability. This approach yields smaller standard errors, but requires the population structure in the reference panel to be properly corrected by including covariates (e.g., principal components). Otherwise, the estimated heritability and co-heritability can be biased by population structure (see here for an example). Therefore, we strongly suggest that users include covariates (e.g., principal components) when using XPASS, although they are optional in the software.

The XPASS format GWAS summary statistics file has 5 fields:

SNP: SNP rsid
N: sample size
Z: Z-scores
A1: effect allele
A2: other allele.
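
Many GWAS releases report effect sizes and standard errors rather than Z-scores. As a minimal sketch, assuming a table with hypothetical column names `rsid`, `beta`, `se`, `n`, `ea` and `oa`, the five XPASS fields could be derived as follows (Z is computed as beta divided by its standard error; the tab delimiter is also an assumption):

```r
# Toy GWAS table; the column names rsid/beta/se/n/ea/oa are hypothetical.
gwas <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  beta = c(0.02, -0.01, 0.005),
  se   = c(0.01, 0.01, 0.005),
  n    = c(1000, 1000, 1000),
  ea   = c("T", "C", "A"),
  oa   = c("C", "G", "G"),
  stringsAsFactors = FALSE
)

# Build the five fields described above; Z = beta / se.
xpass_fmt <- data.frame(
  SNP = gwas$rsid,
  N   = gwas$n,
  Z   = gwas$beta / gwas$se,
  A1  = gwas$ea,
  A2  = gwas$oa,
  stringsAsFactors = FALSE
)

# Write a delimited text file with a header row (tab-delimited by assumption).
out_file <- file.path(tempdir(), "toy_sumstats_format.txt")
write.table(xpass_fmt, out_file, quote = FALSE, row.names = FALSE, sep = "\t")
```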

Here, we use the BMI GWAS from BBJ male as the target training set and the BMI GWAS from UKB as the auxiliary training set:

$ head BMI_bbj_male_3M_format_3MImp.txt
SNP	N	Z	A1	A2
rs117086422	159095	-1.20413423957762	T	C
rs28612348	159095	-1.25827042089233	T	C
rs4475691	159095	-1.19842287777303	T	C
rs950122	159095	-1.2014434974188	C	G
rs3905286	159095	-1.27046106136441	T	C
rs28407778	159095	-1.26746342605063	A	G
rs4246505	159095	-1.24706211285128	A	G
rs4626817	159095	-1.26366297625074	A	G
rs11507767	159095	-1.28611566069053	G	A

$ head height_ukb_3M_format.txt
SNP	N	Z	A1	A2
rs117086422	429312	1.42436004338939	T	C
rs28612348	429312	1.48706291417224	T	C
rs4475691	429312	1.53977372135067	T	C
rs950122	429312	1.37958155329171	C	G
rs3905286	429312	1.77045946243262	T	C
rs28407778	429312	1.9908370573435	A	G
rs4246505	429312	1.90922505355565	A	G
rs4626817	429312	1.53216668392479	A	G
rs11507767	429312	1.55873328059033	G	A

We keep the BMI GWAS from BBJ female as the external validation dataset:

$ head BMI_bbj_female_3M_format_3MImp.txt
SNP	N	Z	A1	A2
rs117086422	72390	0.897252679561414	T	C
rs28612348	72390	0.904738461538462	T	C
rs4475691	72390	0.89177374486687	T	C
rs950122	72390	0.891523178807947	C	G
rs3905286	72390	0.827441738675046	T	C
rs28407778	72390	0.816801884323476	A	G
rs4246505	72390	0.812148186935463	A	G
rs4626817	72390	0.811624558188245	A	G
rs11507767	72390	0.811100929441026	G	A

The covariates files should not include row names or column names. The rows should correspond exactly, and in the same order, to the individuals in the .fam file of the reference genotypes. Each column corresponds to one covariate. A column of ones (an intercept) should not be included in the file.

$ head 1000G.EAS.QC.hm3.ind.pc5.txt
-0.0242863 0.0206888 -0.0028171 0.0343263 0.0211044
-0.025051 0.0275325 -0.0332156 -0.0233166 0.0588989
-0.0198603 0.0286747 0.008464 -0.0215478 0.0189802
-0.0117008 0.0251493 -0.0228276 -0.0670019 0.0186454
-0.0246233 0.0281415 -0.0418795 0.0059947 0.0196607
-0.0270411 0.0127586 -0.0376545 0.0182809 0.0344744
-0.0193744 0.0367263 -0.0176168 -0.0307868 0.0254125
-0.0212299 0.022658 0.0225698 -0.0249273 0.0144324
-0.0159054 0.00558952 -0.00609582 -0.033497 0.0518336
-0.0258843 0.0476758 0.0073353 0.0164056 0.0072118

$ head 1000G.EUR.QC.hm3.ind.pc20.txt
0.0290072 0.0717627 -0.0314029 0.0317316 0.0618357 0.0385132 0.122857 -0.0289581 -0.0114267 -0.0205926 -0.0466983 0.0836711 0.00690379 0.0345008 -0.0179313 0.0109661 -0.0214763 0.0014544 0.0182944 0.0399625
0.0378568 0.0499758 -0.00811504 0.0363021 -0.0579984 0.0422509 0.141279 -0.0167868 -0.0181999 0.0165593 -0.0304088 0.0423324 0.0226001 0.00843853 0.0212477 -0.0666462 -0.0787379 0.0136196 0.108933 0.0801246
0.0417318 0.0732938 -0.0409508 0.0176872 0.0801957 0.0124742 0.0252689 -0.0444921 -0.00305238 0.0035535 -0.0070929 0.0240194 -0.00543616 0.0272464 -0.0048309 -0.0207223 0.0415044 0.0494025 0.0213837 -0.0021093
0.0348071 0.0715395 -0.0266058 0.00280025 -0.0166164 0.0440144 0.135709 0.017364 0.0276564 -0.00286321 0.0314583 0.00299185 0.0792055 0.012042 -0.0337269 -0.00999033 0.0186435 -0.0699027 0.00791191 -0.0131168
0.0444763 0.0713348 0.00162451 -0.0107805 0.0868909 0.0218014 0.0314216 -0.0429928 -0.0137937 0.00913544 -0.062828 0.0555199 0.0378234 -0.0162297 0.000344947 -0.0164497 0.0523967 -0.0861731 -0.038893 0.028166
0.0300713 0.0522082 -0.0116997 -0.029994 0.0539977 0.0162067 -0.0522507 0.0022036 0.0255471 0.0129012 0.0371803 0.0701241 0.0314957 0.00870374 0.00566795 0.116672 0.0267188 0.0451948 0.00288655 -0.0427304
0.0339424 0.0718272 0.000614329 -0.0183555 0.0162768 -0.0600599 0.00343218 0.0149793 -0.0236301 -0.0267658 0.0387814 -0.00624387 -0.0364751 0.00486515 -0.0341221 0.0415286 -0.0274807 -0.013188 0.0695243 0.0495376
0.0423378 0.0656126 -0.0331807 -0.0361484 -0.0155739 0.0557459 0.00428138 0.0840953 -0.034451 0.0753096 -0.0180153 0.0595412 -0.0367107 -0.0285888 -0.0986386 0.00845412 -0.00388558 -0.0641134 -0.05815 0.0242433
0.0429698 0.0431213 -0.0357636 -0.00270477 -0.0567958 0.0892002 0.0980711 0.0468321 -0.0359592 0.011195 -0.000235849 -0.0192522 -0.00271491 0.0381155 -0.00845755 -0.0171629 -0.026532 -0.0415778 0.0274635 0.122063
0.0290721 0.0541744 -0.0238317 0.0254426 0.0986334 0.0706142 0.0977585 0.00427919 -0.0381976 0.0020029 -0.0161052 -0.016666 -0.00627125 -0.00490556 -0.0410802 0.0125096 -0.0175252 0.0320359 0.00866061 0.0736791
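
The row-order requirement for covariates files is easy to violate when principal components come from a separate tool. A minimal sketch of one way to align and write such a file is given below; the toy `fam` and `pcs` tables and their values are hypothetical, and only base R is used.

```r
# Toy .fam table (first two PLINK columns) and a PC table in a different order.
fam <- data.frame(FID = c("F1", "F2", "F3"), IID = c("I1", "I2", "I3"),
                  stringsAsFactors = FALSE)
pcs <- data.frame(IID = c("I3", "I1", "I2"),
                  PC1 = c(0.03, -0.01, 0.02),
                  PC2 = c(0.00, 0.04, -0.02),
                  stringsAsFactors = FALSE)

# Reorder the PC rows to follow the .fam order and fail loudly on missing IDs.
idx <- match(fam$IID, pcs$IID)
stopifnot(!anyNA(idx))
cov_mat <- as.matrix(pcs[idx, c("PC1", "PC2")])

# Write without row names, column names or an intercept column.
cov_file <- file.path(tempdir(), "toy_cov.txt")
write.table(cov_mat, cov_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The resulting file has one row per individual in .fam order and one column per covariate, matching the layout shown above.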

Run XPASS

Once the input files are formatted, XPASS will automatically process the datasets, including SNP overlapping and allele matching. Run XPASS with the following command:

# library(devtools)
# install_github("https://github.com/YangLabHKUST/XPASS")
library(XPASS)
library(data.table)
library(RhpcBLASctl)
blas_set_num_threads(30)

# reference genotypes for EAS (prefix of plink file bim/bed/fam)
ref_EAS <- "1000G.EAS.QC.hm3.ind"
# covariates of EAS reference genotypes
cov_EAS <- "1000G.EAS.QC.hm3.ind.pc5.txt"
# reference genotypes for EUR (prefix of plink file bim/bed/fam)
ref_EUR <- "1000G.EUR.QC.hm3.ind"
# covariates of EUR reference genotypes
cov_EUR <- "1000G.EUR.QC.hm3.ind.pc20.txt"

# genotype file of test data (plink prefix).
# Note: for demonstration, we assume that the genotypes of the prediction target are used as the reference panel of the target population.
# In practice, one can also use genotypes from other sources as the reference panel.
BMI_test <- "1000G.EAS.QC.hm3.ind"

# sumstats of BMI
BMI_bbj_male <- "BMI_bbj_male_3M_format_3MImp.txt"  # target
BMI_ukb <- "BMI_ukb_sumstat_format_all.txt"  # auxiliary
BMI_bbj_female <- "BMI_bbj_female_3M_format_3MImp.txt"  # external validation

fit_bbj <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb, file_ref1 = ref_EAS,
                 file_ref2 = ref_EUR,
                 file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                 file_predGeno = BMI_test,
                 compPRS=T,
                 pop = "EAS", sd_method="LD_block", compPosMean = T,
                 file_out = "BMI_bbj_ukb_ref_TGP")

Summary statistics file 1: BMI_bbj_male_3M_format_3MImp.txt
Summary statistics file 2: BMI_ukb_sumstat_format_all.txt
Reference file 1: 1000G.EAS.QC.hm3.ind
Reference file 2: 1000G.EUR.QC.hm3.ind
Covariates file 1: 1000G.EAS.QC.hm3.ind.pc5.txt
Covariates file 2: 1000G.EUR.QC.hm3.ind.pc20.txt
Reading data from summary statisitcs...
3506148 and 3777871 SNPs found in summary statistics files 1 and 2.
Reading SNP info from reference panels...
1209411 and 1313833 SNPs found in reference panel 1 and 2.
746454 SNPs are matched in all files.
0 SNPs are removed because of ambiguity; 746454 SNPs remained.
Calculating kinship matrix from the both reference panels...
127749 SNPs in the second reference panel are alligned for alleles according to the first.
14337 SNPs have different minor alleles in population 1, z-scores are corrected according to reference panel.
14332 SNPs have different minor alleles in population 2, z-scores are corrected according to reference panel.
Assigning SNPs to LD Blocks...
Calculate PVE...
              h1          h2         h12        rho
[1,] 0.165790395 0.247603545 0.129399058 0.63866483
[2,] 0.007631346 0.008542654 0.006856057 0.02002658
...
Predicting PRS from test genotypes...
Done.
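
The SNP overlapping and allele matching mentioned in this section are handled internally by XPASS. For readers who want intuition, the sketch below shows the generic idea on toy data: keep SNPs present in both files and flip the sign of a z-score when its effect allele is swapped relative to the other file. It is not the XPASS implementation, all values are hypothetical, and strand flips are ignored.

```r
# Toy summary statistics from two studies (hypothetical values).
s1 <- data.frame(SNP = c("rs1", "rs2", "rs3"), Z = c(1.5, -2.0, 0.3),
                 A1 = c("T", "A", "G"), A2 = c("C", "G", "A"),
                 stringsAsFactors = FALSE)
s2 <- data.frame(SNP = c("rs2", "rs3", "rs4"), Z = c(1.0, 0.8, -0.5),
                 A1 = c("G", "G", "T"), A2 = c("A", "A", "C"),
                 stringsAsFactors = FALSE)

# Keep SNPs present in both files.
m <- merge(s1, s2, by = "SNP", suffixes = c(".1", ".2"))

# Same orientation, swapped orientation, or neither.
same    <- m$A1.1 == m$A1.2 & m$A2.1 == m$A2.2
swapped <- m$A1.1 == m$A2.2 & m$A2.1 == m$A1.2

# Flip the sign of the second study's z-score where alleles are swapped,
# and drop SNPs whose alleles cannot be reconciled.
m$Z.2[swapped] <- -m$Z.2[swapped]
m <- m[same | swapped, ]
```

Here rs2 is swapped between the two toy files, so its second z-score changes sign, while rs3 is kept unchanged.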

XPASS output

XPASS returns a list of results. Some key outputs are:

H: a table of estimated heritabilities, co-heritability and genetic correlation (first row) and their corresponding standard errors (second row).

> fit_bbj$H
              h1          h2         h12        rho
[1,] 0.165790395 0.247603545 0.129399058 0.63866483
[2,] 0.007631346 0.008542654 0.006856057 0.02002658
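
As a sanity check on how the columns of H relate, the printed rho appears consistent with the usual definition of genetic correlation, h12 / sqrt(h1 * h2). The sketch below recomputes it from the numbers printed for this example and forms an approximate 95% Wald interval from the reported standard error; the interval is an illustration, not an output of XPASS.

```r
# Estimates (row 1) and standard errors (row 2) copied from the H table of this example.
H <- matrix(c(0.165790395, 0.247603545, 0.129399058, 0.63866483,
              0.007631346, 0.008542654, 0.006856057, 0.02002658),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("h1", "h2", "h12", "rho")))

# Genetic correlation recomputed from the heritability and co-heritability estimates.
rho_check <- H[1, "h12"] / sqrt(H[1, "h1"] * H[1, "h2"])

# Approximate 95% Wald interval for rho from its reported standard error.
rho_ci <- H[1, "rho"] + c(-1, 1) * qnorm(0.975) * H[2, "rho"]
```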

mu: a data frame storing the posterior means computed by LDpred-inf using only the target dataset (mu1) and only the auxiliary dataset (mu2), and the posterior means of the target population and the auxiliary population computed by XPASS (mu_XPASS1 and mu_XPASS2). In the eff_type1 and eff_type2 columns, "RE" indicates that the SNP effect is treated as a random effect and "FE" indicates that it is treated as a population-specific fixed effect in the corresponding population. SNP information is also returned: A1 is the effect allele and A2 is the other allele.

> head(fit_bbj$mu)
  CHR       SNP    POS A1 A2           mu1           mu2     mu_XPASS1
1   1 rs4475691 846808  T  C -1.271443e-04 -0.0002770532 -0.0003248579
2   1 rs7537756 854250  G  A -4.778206e-05 -0.0003401613 -0.0002979870
3   1 rs3748592 880238  A  G -3.201406e-04 -0.0012640981 -0.0007477506
4   1 rs2340582 882803  A  G -3.396992e-04 -0.0012885694 -0.0008076512
5   1 rs4246503 884815  A  G -3.318260e-04 -0.0013238182 -0.0008148704
6   1 rs3748597 888659  T  C -3.327131e-04 -0.0013031399 -0.0008113772
      mu_XPASS2 eff_type1 eff_type2
1 -0.0003319743        RE        RE
2 -0.0003509346        RE        RE
3 -0.0014312273        RE        RE
4 -0.0014550808        RE        RE
5 -0.0014828292        RE        RE
6 -0.0014623186        RE        RE

PRS (if file_predGeno provided and compPRS=T): a data frame storing the PRS generated using mu1, mu2, mu_XPASS1, and mu_XPASS2, respectively.

> head(fit_bbj$PRS)
      FID     IID        PRS1      PRS2 PRS_XPASS1 PRS_XPASS2
1 HG00403 HG00403  0.14310344 1.1430451 0.55960910  1.1638272
2 HG00404 HG00404 -0.19478304 0.7149414 0.03773900  0.6465811
3 HG00406 HG00406 -0.09440555 0.8028816 0.14173517  0.7110257
4 HG00407 HG00407  0.05533467 0.1135734 0.05091083  0.1354040
5 HG00409 HG00409  0.25063980 1.9968242 0.83527097  1.9851891
6 HG00410 HG00410  0.13454737 0.2543978 0.10214320  0.2608015

# One can also compute PRS after fitting the model:
> PRS <- predict_XPASS(fit_bbj$mu,ref_EAS)
> head(PRS)
      FID     IID        PRS1      PRS2 PRS_XPASS1 PRS_XPASS2
1 HG00403 HG00403  0.14310344 1.1430451 0.55960910  1.1638272
2 HG00404 HG00404 -0.19478304 0.7149414 0.03773900  0.6465811
3 HG00406 HG00406 -0.09440555 0.8028816 0.14173517  0.7110257
4 HG00407 HG00407  0.05533467 0.1135734 0.05091083  0.1354040
5 HG00409 HG00409  0.25063980 1.9968242 0.83527097  1.9851891
6 HG00410 HG00410  0.13454737 0.2543978 0.10214320  0.2608015
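
Conceptually, a PRS is a weighted sum of allele counts, with the weights taken from one of the effect-size columns. The toy sketch below illustrates only that generic idea; the genotype matrix and weights are hypothetical, and the scaling and allele matching that `predict_XPASS` performs are not reproduced, so its numbers are not comparable to the output above.

```r
# Toy genotype matrix: 3 individuals x 4 SNPs, coded as counts of allele A1.
G <- matrix(c(0, 1, 2, 1,
              1, 1, 0, 2,
              2, 0, 1, 0), nrow = 3, byrow = TRUE,
            dimnames = list(c("ind1", "ind2", "ind3"), paste0("rs", 1:4)))

# Toy per-allele weights standing in for one effect-size column (hypothetical values).
w <- c(rs1 = 0.10, rs2 = -0.05, rs3 = 0.02, rs4 = 0.00)

# PRS as the weighted sum of allele counts, with SNPs matched by name.
prs <- drop(G[, names(w)] %*% w)
```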

XPASS will also write the above outputs to files with the file_out prefix, if provided.

External validation using independent GWAS data

We use the GWAS of female BMI from BBJ as the external validation dataset to approximate the prediction R2. Specifically, we use the following equation:

$R^2=corr(y,\hat{y})^2=\left(\frac{cov(y,\hat{y})}{\sqrt{var(y)var(\hat{y})}}\right)^2=\left(\frac{z^T\tilde{\mu}/\sqrt{n}}{\sqrt{\tilde{\mu}^T\Sigma\tilde{\mu}}}\right)^2,$

where z is the vector of z-scores from the external summary statistics, n is its sample size, $\tilde{\mu}$ is the posterior mean of effect sizes on the standardized genotype scale, and $\Sigma$ is the LD matrix estimated from the reference panel.
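
To make the equation concrete, the sketch below evaluates it on a three-SNP toy example. The function name `approx_r2` and all input values are hypothetical; this is not the `evalR2_XPASS` implementation, which also reads files and matches SNPs and alleles.

```r
# Approximate R^2 from external z-scores, following the equation above.
# z: z-scores; n: sample size; mu: standardized-scale effects; Sigma: LD matrix.
approx_r2 <- function(z, n, mu, Sigma) {
  num <- sum(z * mu) / sqrt(n)
  den <- sqrt(drop(t(mu) %*% Sigma %*% mu))
  (num / den)^2
}

# Toy inputs (hypothetical values, three SNPs).
Sigma <- matrix(c(1.0, 0.2, 0.0,
                  0.2, 1.0, 0.1,
                  0.0, 0.1, 1.0), nrow = 3, byrow = TRUE)
mu <- c(0.05, -0.02, 0.03)
z  <- c(3.0, -1.0, 2.0)
n  <- 10000

r2 <- approx_r2(z, n, mu, Sigma)
```

Because the effect sizes appear in both the numerator and the denominator, rescaling them by a constant leaves the result unchanged, which is expected of a squared correlation.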

The evalR2_XPASS function takes the posterior means from the XPASS output and the external validation summary statistics as its input to evaluate the approximated predictive R2. The output is a vector of predictive R2 values evaluated using mu1, mu2, mu_XPASS1, and mu_XPASS2, respectively.

> R2 <- evalR2_XPASS(fit_bbj$mu,BMI_bbj_female,ref_EAS)
> R2
      PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
0.02235596 0.01638359 0.02905665 0.01949609

Although the reference panels have only limited samples, XPASS (PRS_XPASS1) still achieves a 30% relative improvement in R2 compared to LDpred-inf trained on the target dataset alone (PRS1).
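
The relative improvement quoted here can be reproduced from the R2 values printed for this example. The short sketch below hard-codes those four numbers rather than calling XPASS.

```r
# Predictive R2 values copied from the evalR2_XPASS output of this example.
R2 <- c(PRS1 = 0.02235596, PRS2 = 0.01638359,
        PRS_XPASS1 = 0.02905665, PRS_XPASS2 = 0.01949609)

# Relative gain of XPASS (target population) over LDpred-inf on the target data.
rel_gain <- unname((R2["PRS_XPASS1"] - R2["PRS1"]) / R2["PRS1"])
round(100 * rel_gain, 1)  # about 30 (percent)
```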

XPASS+

XPASS+ allows the population-specific effects to be utilized in PRS construction. To fit XPASS+, we first need to apply the P+T procedure to construct a set of pre-selected SNPs that are population-specific. Here, we use the ieugwasr package.

library(ieugwasr)

# P+T procedure for bbj
z_bbj <- fread(BMI_bbj_male) # read summary statistics file
pval <- data.frame(rsid=z_bbj$SNP,pval=2*pnorm(abs(z_bbj$Z),lower.tail=F))
clp_bbj <- ld_clump(pval, clump_kb=1000, clump_r2=0.1, clump_p=1e-6,
                    bfile=ref_EAS,
                    plink_bin="/import/home/mcaiad/plink/plink")
snps_l <- clp_bbj$rsid

The rs IDs of the pre-selected SNPs are stored in snps_l, which is then passed to the snps_fe1 argument of the XPASS function.
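
The thresholding half of P+T relies on converting z-scores to two-sided p-values. The sketch below shows that conversion on toy z-scores (hypothetical values) and selects the SNPs that would pass the 1e-6 threshold used for the target population; the LD clumping performed by `ld_clump` is not reproduced here.

```r
# Toy z-scores (hypothetical values).
z <- c(rs1 = 5.0, rs2 = -4.8, rs3 = 1.2, rs4 = -6.1)

# Two-sided p-values under a standard normal null, as in the P+T code above.
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

# SNPs passing the significance threshold used for the target population.
candidates <- names(z)[p < 1e-6]
```

In this toy example rs2 narrowly misses the threshold, which illustrates how sensitive the pre-selected set can be to the choice of clump_p.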

fit_bbj <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb, file_ref1 = ref_EAS,
                 file_ref2 = ref_EUR,
                 file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                 file_predGeno = BMI_test,
                 snps_fe1 = snps_l,
                 compPRS=T,
                 pop = "EAS", sd_method="LD_block", compPosMean = T,
                 file_out = "BMI_bbj_ukb_plus_ref_TGP")
R2 <- evalR2_XPASS(fit_bbj$mu,BMI_bbj_female,ref_EAS)

> R2
      PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
0.02562267 0.01638359 0.03149972 0.01963232

By including the population-specific effects, XPASS+ achieves a predictive R^2 of 3.15%, a relative improvement of about 8% over XPASS (2.91%).

While the above example only includes population-specific effects in the target population, our implementation of XPASS+ allows the inclusion of such effects for both populations. To include fixed effects in both populations, the pre-selected SNPs for the corresponding populations should be passed to the snps_fe1 and snps_fe2 arguments, respectively.

# P+T procedure for ukb
z_ukb <- fread(BMI_ukb) # read summary statistics file
pval <- data.frame(rsid=z_ukb$SNP,pval=2*pnorm(abs(z_ukb$Z),lower.tail=F))
clp_ukb <- ld_clump(pval, clump_kb=1000, clump_r2=0.1, clump_p=1e-10,
                    bfile=ref_EUR,
                    plink_bin="/import/home/mcaiad/plink/plink")
snps_ukb <- clp_ukb$rsid

fit_both <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb, file_ref1 = ref_EAS,
                  file_ref2 = ref_EUR,
                  file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                  file_predGeno = BMI_test,
                  snps_fe1 = snps_l,  # SNPs pre-selected for BBJ by the P+T step above
                  snps_fe2 = snps_ukb,
                  compPRS=T,
                  pop = "EAS", sd_method="LD_block", compPosMean = T,
                  file_out = "BMI_bbj_ukb_plus_ref_TGP")
R2 <- evalR2_XPASS(fit_both$mu,BMI_bbj_female,ref_EAS)

> R2
      PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
0.02353173 0.01192856 0.03294502 0.01419943

This yields results similar to those obtained when the population-specific effects are included only in the target population.

An additional example for constructing PRS of Type 2 Diabetes can be found in this PDF.

FAQ

Common issues are discussed in the FAQ.

Development

The XPASS package is developed by Mingxuan Cai (mcaiad@ust.hk).

Contact information

Please contact Mingxuan Cai (mcaiad@ust.hk) or Prof. Can Yang (macyang@ust.hk) with any enquiries.

Reference

Mingxuan Cai, Jiashun Xiao, Shunkang Zhang, Xiang Wan, Hongyu Zhao, Gang Chen, Can Yang. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. The American Journal of Human Genetics. 108, 632-655, April 2021.
