YangLabHKUST/XPASS
The XPASS package implements the XPASS approach for constructing PRS in an under-represented target population by leveraging biobank-scale GWAS data from European populations.
    # install.packages("devtools")
    devtools::install_github("YangLabHKUST/XPASS")
We illustrate the usage of XPASS using the GWAS summary statistics of BMI from UKB and BBJ. For demonstration, we use the easily accessible genotypes from the 1000 Genomes Project as the reference panel. However, because this reference panel contains only 1.3 million SNPs from 377 EAS samples and 417 EUR samples, the results of XPASS are slightly less accurate than those reported in the AJHG paper. XPASS can perform better when a GWAS dataset with a larger sample size (e.g., n > 2000) and more SNPs (e.g., 3M) is used as the reference panel. We strongly suggest that users use their own reference panels with sufficiently large sample sizes if available.
The datasets involved in the following example can be downloaded from here.
Input files of XPASS include:
- summary statistics file of the target population
- summary statistics file of the auxiliary population
- reference panel of the target population in PLINK 1 format
- reference panel of the auxiliary population in PLINK 1 format
- covariates file associated with the target population reference panel. The covariates files can include sex, age and population information of the individuals from the reference panel.
- covariates file associated with the auxiliary population reference panel
Unlike LDSC, which only uses LD from local SNPs, XPASS uses the LD information from entire chromosomes to estimate heritability and co-heritability. This approach yields smaller standard errors, but it requires the population structure in the reference panel to be properly corrected by including covariates (e.g., principal components). Otherwise, the estimated heritability and co-heritability can be biased by population structure (see here for an example). Therefore, we strongly suggest that users include covariates (e.g., principal components) when using XPASS, although they are optional in the software.
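As an illustration of what such covariates might look like, the sketch below derives a few principal components from a genotype matrix and writes them as a plain numeric table. It is only a minimal sketch on simulated data: the matrix `geno`, the number of components `n_pc`, and the output file name are hypothetical, and in practice the components would usually be computed from the actual reference genotypes with a dedicated tool.

```r
# Minimal sketch: principal components from a (simulated) genotype matrix,
# written as a headerless numeric table. All names here are hypothetical.
set.seed(1)
n_ind <- 50    # individuals (rows)
n_snp <- 200   # SNPs (columns)
maf <- runif(n_snp, 0.05, 0.5)
geno <- sapply(maf, function(p) rbinom(n_ind, size = 2, prob = p))

# Drop monomorphic SNPs, then centre and scale each SNP column.
keep <- apply(geno, 2, sd) > 0
geno_std <- scale(geno[, keep])

# Top principal components of the standardized genotypes.
n_pc <- 5
pcs <- prcomp(geno_std, center = FALSE, scale. = FALSE)$x[, seq_len(n_pc)]

# One row per individual, one column per covariate,
# no row names, no column names, no intercept column.
out_file <- file.path(tempdir(), "example.pc5.txt")
write.table(pcs, out_file, quote = FALSE, row.names = FALSE, col.names = FALSE)
```

The rows of `pcs` follow the row order of `geno`, so the genotype matrix would need to be in the same individual order as the reference panel.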
The XPASS format GWAS summary statistics file has 5 fields:
- SNP: SNP rsid
- N: sample size
- Z: Z-scores
- A1: effect allele
- A2: other allele.
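Summary statistics are often distributed with effect sizes and standard errors rather than z-scores. The sketch below shows one way to build the five fields listed above from such a table, assuming Z is the effect size divided by its standard error. The input column names (`rsid`, `beta`, `se`, and so on) and the tab delimiter are assumptions for illustration, not requirements stated by XPASS.

```r
# Hypothetical raw GWAS table with effect sizes and standard errors.
raw <- data.frame(
  rsid          = c("rs1", "rs2", "rs3"),
  effect_allele = c("T", "C", "A"),
  other_allele  = c("C", "G", "G"),
  beta          = c(0.02, -0.01, 0.03),
  se            = c(0.01, 0.01, 0.015),
  n             = c(1000, 1000, 990),
  stringsAsFactors = FALSE
)

# Build the five fields: SNP, N, Z, A1 (effect allele), A2 (other allele).
xpass_fmt <- data.frame(
  SNP = raw$rsid,
  N   = raw$n,
  Z   = raw$beta / raw$se,   # z-score = effect size / standard error
  A1  = raw$effect_allele,
  A2  = raw$other_allele,
  stringsAsFactors = FALSE
)

# Write with a header line; the tab delimiter is an assumption.
out_file <- file.path(tempdir(), "example_sumstats_format.txt")
write.table(xpass_fmt, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```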
Here, we use the BMI GWAS from BBJ males as the target training set and the BMI GWAS from UKB as the auxiliary training set:
    $ head BMI_bbj_male_3M_format_3MImp.txt
    SNP          N       Z                  A1  A2
    rs117086422  159095  -1.20413423957762  T   C
    rs28612348   159095  -1.25827042089233  T   C
    rs4475691    159095  -1.19842287777303  T   C
    rs950122     159095  -1.2014434974188   C   G
    rs3905286    159095  -1.27046106136441  T   C
    rs28407778   159095  -1.26746342605063  A   G
    rs4246505    159095  -1.24706211285128  A   G
    rs4626817    159095  -1.26366297625074  A   G
    rs11507767   159095  -1.28611566069053  G   A

    $ head height_ukb_3M_format.txt
    SNP          N       Z                 A1  A2
    rs117086422  429312  1.42436004338939  T   C
    rs28612348   429312  1.48706291417224  T   C
    rs4475691    429312  1.53977372135067  T   C
    rs950122     429312  1.37958155329171  C   G
    rs3905286    429312  1.77045946243262  T   C
    rs28407778   429312  1.9908370573435   A   G
    rs4246505    429312  1.90922505355565  A   G
    rs4626817    429312  1.53216668392479  A   G
    rs11507767   429312  1.55873328059033  G   A
We keep the BMI GWAS from BBJ female as the external validation dataset:
    $ head BMI_bbj_female_3M_format_3MImp.txt
    SNP          N      Z                  A1  A2
    rs117086422  72390  0.897252679561414  T   C
    rs28612348   72390  0.904738461538462  T   C
    rs4475691    72390  0.89177374486687   T   C
    rs950122     72390  0.891523178807947  C   G
    rs3905286    72390  0.827441738675046  T   C
    rs28407778   72390  0.816801884323476  A   G
    rs4246505    72390  0.812148186935463  A   G
    rs4626817    72390  0.811624558188245  A   G
    rs11507767   72390  0.811100929441026  G   A
The covariates files should not include row names or column names. The rows must correspond exactly, and in the same order, to the individuals in the .fam file of the reference genotypes. Each column corresponds to one covariate. A column of ones (i.e., an intercept) should not be included in the file.
    $ head 1000G.EAS.QC.hm3.ind.pc5.txt
    -0.0242863 0.0206888 -0.0028171 0.0343263 0.0211044
    -0.025051 0.0275325 -0.0332156 -0.0233166 0.0588989
    -0.0198603 0.0286747 0.008464 -0.0215478 0.0189802
    -0.0117008 0.0251493 -0.0228276 -0.0670019 0.0186454
    -0.0246233 0.0281415 -0.0418795 0.0059947 0.0196607
    -0.0270411 0.0127586 -0.0376545 0.0182809 0.0344744
    -0.0193744 0.0367263 -0.0176168 -0.0307868 0.0254125
    -0.0212299 0.022658 0.0225698 -0.0249273 0.0144324
    -0.0159054 0.00558952 -0.00609582 -0.033497 0.0518336
    -0.0258843 0.0476758 0.0073353 0.0164056 0.0072118
    $ head 1000G.EUR.QC.hm3.ind.pc20.txt
    0.0290072 0.0717627 -0.0314029 0.0317316 0.0618357 0.0385132 0.122857 -0.0289581 -0.0114267 -0.0205926 -0.0466983 0.0836711 0.00690379 0.0345008 -0.0179313 0.0109661 -0.0214763 0.0014544 0.0182944 0.0399625
    0.0378568 0.0499758 -0.00811504 0.0363021 -0.0579984 0.0422509 0.141279 -0.0167868 -0.0181999 0.0165593 -0.0304088 0.0423324 0.0226001 0.00843853 0.0212477 -0.0666462 -0.0787379 0.0136196 0.108933 0.0801246
    0.0417318 0.0732938 -0.0409508 0.0176872 0.0801957 0.0124742 0.0252689 -0.0444921 -0.00305238 0.0035535 -0.0070929 0.0240194 -0.00543616 0.0272464 -0.0048309 -0.0207223 0.0415044 0.0494025 0.0213837 -0.0021093
    0.0348071 0.0715395 -0.0266058 0.00280025 -0.0166164 0.0440144 0.135709 0.017364 0.0276564 -0.00286321 0.0314583 0.00299185 0.0792055 0.012042 -0.0337269 -0.00999033 0.0186435 -0.0699027 0.00791191 -0.0131168
    0.0444763 0.0713348 0.00162451 -0.0107805 0.0868909 0.0218014 0.0314216 -0.0429928 -0.0137937 0.00913544 -0.062828 0.0555199 0.0378234 -0.0162297 0.000344947 -0.0164497 0.0523967 -0.0861731 -0.038893 0.028166
    0.0300713 0.0522082 -0.0116997 -0.029994 0.0539977 0.0162067 -0.0522507 0.0022036 0.0255471 0.0129012 0.0371803 0.0701241 0.0314957 0.00870374 0.00566795 0.116672 0.0267188 0.0451948 0.00288655 -0.0427304
    0.0339424 0.0718272 0.000614329 -0.0183555 0.0162768 -0.0600599 0.00343218 0.0149793 -0.0236301 -0.0267658 0.0387814 -0.00624387 -0.0364751 0.00486515 -0.0341221 0.0415286 -0.0274807 -0.013188 0.0695243 0.0495376
    0.0423378 0.0656126 -0.0331807 -0.0361484 -0.0155739 0.0557459 0.00428138 0.0840953 -0.034451 0.0753096 -0.0180153 0.0595412 -0.0367107 -0.0285888 -0.0986386 0.00845412 -0.00388558 -0.0641134 -0.05815 0.0242433
    0.0429698 0.0431213 -0.0357636 -0.00270477 -0.0567958 0.0892002 0.0980711 0.0468321 -0.0359592 0.011195 -0.000235849 -0.0192522 -0.00271491 0.0381155 -0.00845755 -0.0171629 -0.026532 -0.0415778 0.0274635 0.122063
    0.0290721 0.0541744 -0.0238317 0.0254426 0.0986334 0.0706142 0.0977585 0.00427919 -0.0381976 0.0020029 -0.0161052 -0.016666 -0.00627125 -0.00490556 -0.0410802 0.0125096 -0.0175252 0.0320359 0.00866061 0.0736791
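Because the covariates file carries no identifiers, a mismatch with the .fam file is easy to miss. The following is a minimal sketch of a sanity check for the layout described above (same number of rows as the .fam file, numeric columns only, no constant column). The helper `check_covariates` and the toy files are hypothetical and not part of the XPASS package.

```r
# Sketch of a sanity check for a covariates file against a PLINK .fam file.
# `check_covariates` is a hypothetical helper, not an XPASS function.
check_covariates <- function(fam_file, cov_file) {
  fam <- read.table(fam_file, header = FALSE, stringsAsFactors = FALSE)
  cov <- read.table(cov_file, header = FALSE, stringsAsFactors = FALSE)
  problems <- character(0)
  if (nrow(cov) != nrow(fam)) {
    problems <- c(problems, "row count differs from the .fam file")
  }
  if (!all(vapply(cov, is.numeric, logical(1)))) {
    problems <- c(problems, "non-numeric column (header or row names present?)")
  }
  is_const <- vapply(cov, function(x) length(unique(x)) == 1, logical(1))
  if (any(is_const)) {
    problems <- c(problems, "constant column (intercept should be omitted)")
  }
  problems  # empty character vector when no problem is found
}

# Toy example: four individuals, two covariates.
tmp <- tempdir()
fam_file <- file.path(tmp, "toy.fam")
cov_file <- file.path(tmp, "toy.cov.txt")
writeLines(c("F1 I1 0 0 1 -9", "F2 I2 0 0 2 -9",
             "F3 I3 0 0 1 -9", "F4 I4 0 0 2 -9"), fam_file)
writeLines(c("-0.02 0.01", "0.03 -0.04", "0.00 0.02", "-0.01 0.01"), cov_file)
ok <- check_covariates(fam_file, cov_file)

# A file that wrongly includes a column of ones is flagged.
bad_file <- file.path(tmp, "toy.bad.txt")
writeLines(c("1 -0.02", "1 0.03", "1 0.00", "1 -0.01"), bad_file)
bad <- check_covariates(fam_file, bad_file)
```

The check cannot confirm that the rows are in the same order as the .fam file; that still has to be guaranteed when the covariates are generated.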
Once the input files are formatted, XPASS will automatically process the datasets, including finding the overlapping SNPs and matching alleles. Run XPASS with the following command:
    # library(devtools)
    # install_github("https://github.com/YangLabHKUST/XPASS")
    library(XPASS)
    library(data.table)
    library(RhpcBLASctl)
    blas_set_num_threads(30)

    # reference genotypes for EAS (prefix of plink file bim/bed/fam)
    ref_EAS <- "1000G.EAS.QC.hm3.ind"
    # covariates of EAS reference genotypes
    cov_EAS <- "1000G.EAS.QC.hm3.ind.pc5.txt"
    # reference genotypes for EUR (prefix of plink file bim/bed/fam)
    ref_EUR <- "1000G.EUR.QC.hm3.ind"
    # covariates of EUR reference genotypes
    cov_EUR <- "1000G.EUR.QC.hm3.ind.pc20.txt"

    # genotype file of test data (plink prefix).
    # Note: for demonstration, we assume that the genotypes of prediction target are used as the reference panel of target population.
    # In practice, one can also use genotypes from other sources as reference panel.
    BMI_test <- "1000G.EAS.QC.hm3.ind"

    # sumstats of BMI
    BMI_bbj_male <- "BMI_bbj_male_3M_format_3MImp.txt"     # target
    BMI_ukb <- "BMI_ukb_sumstat_format_all.txt"            # auxiliary
    BMI_bbj_female <- "BMI_bbj_female_3M_format_3MImp.txt" # external validation

    fit_bbj <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb,
                     file_ref1 = ref_EAS, file_ref2 = ref_EUR,
                     file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                     file_predGeno = BMI_test,
                     compPRS = T, pop = "EAS", sd_method = "LD_block", compPosMean = T,
                     file_out = "BMI_bbj_ukb_ref_TGP")

    Summary statistics file 1: BMI_bbj_male_3M_format_3MImp.txt
    Summary statistics file 2: BMI_ukb_sumstat_format_all.txt
    Reference file 1: 1000G.EAS.QC.hm3.ind
    Reference file 2: 1000G.EUR.QC.hm3.ind
    Covariates file 1: 1000G.EAS.QC.hm3.ind.pc5.txt
    Covariates file 2: 1000G.EUR.QC.hm3.ind.pc20.txt
    Reading data from summary statisitcs...
    3506148 and 3777871 SNPs found in summary statistics files 1 and 2.
    Reading SNP info from reference panels...
    1209411 and 1313833 SNPs found in reference panel 1 and 2.
    746454 SNPs are matched in all files.
    0 SNPs are removed because of ambiguity; 746454 SNPs remained.
    Calculating kinship matrix from the both reference panels...
    127749 SNPs in the second reference panel are alligned for alleles according to the first.
    14337 SNPs have different minor alleles in population 1, z-scores are corrected according to reference panel.
    14332 SNPs have different minor alleles in population 2, z-scores are corrected according to reference panel.
    Assigning SNPs to LD Blocks...
    Calculate PVE...
                  h1          h2         h12        rho
    [1,] 0.165790395 0.247603545 0.129399058 0.63866483
    [2,] 0.007631346 0.008542654 0.006856057 0.02002658
    ...
    Predicting PRS from test genotypes...
    Done.
XPASS returns a list of results. Some key outputs are:
- H: a table of the estimated heritabilities, co-heritability and genetic correlation (first row) and their corresponding standard errors (second row).
    > fit_bbj$H
                  h1          h2         h12        rho
    [1,] 0.165790395 0.247603545 0.129399058 0.63866483
    [2,] 0.007631346 0.008542654 0.006856057 0.02002658
- mu: a data frame storing the posterior means computed by LDpred-inf using only the target dataset (mu1) and only the auxiliary dataset (mu2), and the posterior means of the target population and the auxiliary population computed by XPASS (mu_XPASS1 and mu_XPASS2). In the eff_type1 and eff_type2 columns, "RE" indicates that the SNP effect is treated as a random effect and "FE" indicates that it is treated as a population-specific fixed effect in the corresponding population. SNP information is also returned: A1 is the effect allele and A2 is the other allele.
    > head(fit_bbj$mu)
      CHR       SNP    POS A1 A2           mu1           mu2     mu_XPASS1
    1   1 rs4475691 846808  T  C -1.271443e-04 -0.0002770532 -0.0003248579
    2   1 rs7537756 854250  G  A -4.778206e-05 -0.0003401613 -0.0002979870
    3   1 rs3748592 880238  A  G -3.201406e-04 -0.0012640981 -0.0007477506
    4   1 rs2340582 882803  A  G -3.396992e-04 -0.0012885694 -0.0008076512
    5   1 rs4246503 884815  A  G -3.318260e-04 -0.0013238182 -0.0008148704
    6   1 rs3748597 888659  T  C -3.327131e-04 -0.0013031399 -0.0008113772
          mu_XPASS2 eff_type1 eff_type2
    1 -0.0003319743        RE        RE
    2 -0.0003509346        RE        RE
    3 -0.0014312273        RE        RE
    4 -0.0014550808        RE        RE
    5 -0.0014828292        RE        RE
    6 -0.0014623186        RE        RE
- PRS (if file_predGeno provided and compPRS=T): a data frame storing the PRS generated using mu1, mu2, mu_XPASS1, and mu_XPASS2, respectively.
    > head(fit_bbj$PRS)
          FID     IID        PRS1      PRS2 PRS_XPASS1 PRS_XPASS2
    1 HG00403 HG00403  0.14310344 1.1430451 0.55960910  1.1638272
    2 HG00404 HG00404 -0.19478304 0.7149414 0.03773900  0.6465811
    3 HG00406 HG00406 -0.09440555 0.8028816 0.14173517  0.7110257
    4 HG00407 HG00407  0.05533467 0.1135734 0.05091083  0.1354040
    5 HG00409 HG00409  0.25063980 1.9968242 0.83527097  1.9851891
    6 HG00410 HG00410  0.13454737 0.2543978 0.10214320  0.2608015

    # One can also compute PRS after fitting the model:
    > PRS <- predict_XPASS(fit_bbj$mu,ref_EAS)
    > head(PRS)
          FID     IID        PRS1      PRS2 PRS_XPASS1 PRS_XPASS2
    1 HG00403 HG00403  0.14310344 1.1430451 0.55960910  1.1638272
    2 HG00404 HG00404 -0.19478304 0.7149414 0.03773900  0.6465811
    3 HG00406 HG00406 -0.09440555 0.8028816 0.14173517  0.7110257
    4 HG00407 HG00407  0.05533467 0.1135734 0.05091083  0.1354040
    5 HG00409 HG00409  0.25063980 1.9968242 0.83527097  1.9851891
    6 HG00410 HG00410  0.13454737 0.2543978 0.10214320  0.2608015
XPASS will also write the above outputs to files with the `file_out` prefix, if provided.
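The columns of `H` can be cross-checked against each other. Assuming the genetic correlation is defined in the usual way, as the co-heritability divided by the geometric mean of the two heritabilities, the `rho` value reported above should be reproducible from `h1`, `h2` and `h12`. The sketch below rebuilds the table from the printed numbers and performs that check; it does not call XPASS.

```r
# Rebuild the H table from the values printed above.
H <- matrix(
  c(0.165790395, 0.247603545, 0.129399058, 0.63866483,   # estimates
    0.007631346, 0.008542654, 0.006856057, 0.02002658),  # standard errors
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("h1", "h2", "h12", "rho"))
)

# Assumed definition: rho = h12 / sqrt(h1 * h2).
rho_check <- unname(H[1, "h12"] / sqrt(H[1, "h1"] * H[1, "h2"]))

# Difference from the reported value (expected to be near zero).
rho_diff <- abs(rho_check - unname(H[1, "rho"]))
```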
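We use the GWAS of female BMI from BBJ as the external validation dataset to approximate the prediction R2. Specifically, we use the following equation:

$$R^2 \approx \frac{\left(z^\top \tilde{\mu} / \sqrt{n}\right)^2}{\tilde{\mu}^\top \Sigma \tilde{\mu}},$$

where $z$ is the vector of z-scores from the external summary statistics, $n$ is its sample size, $\tilde{\mu}$ is the posterior mean of the effect sizes at the standardized genotype scale, and $\Sigma$ is the LD matrix estimated from the reference panel.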
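To make the approximation concrete, the sketch below assumes it takes the standard summary-statistics form R2 ≈ (z'μ/√n)² / (μ'Σμ) and compares it with the squared correlation computed from individual-level data on a small simulated example. The function `approx_r2` and all simulated quantities are hypothetical; this is not the implementation used inside XPASS.

```r
# Approximate predictive R2 from summary statistics (assumed form).
approx_r2 <- function(z, n, mu, Sigma) {
  num <- (sum(z * mu) / sqrt(n))^2
  den <- as.numeric(t(mu) %*% Sigma %*% mu)
  num / den
}

# Simulated individual-level data with standardized genotypes and phenotype.
set.seed(42)
n <- 2000
p <- 20
X <- scale(matrix(rnorm(n * p), n, p))
beta <- rnorm(p, sd = 0.1)
y <- as.numeric(scale(X %*% beta + rnorm(n)))

# Summary-level quantities derived from the same data.
z <- as.numeric(crossprod(X, y)) / sqrt(n)  # marginal z-scores
Sigma <- cor(X)                             # LD (correlation) matrix
mu <- beta                                  # effect-size weights to evaluate

r2_sumstat <- approx_r2(z, n, mu, Sigma)
r2_individual <- cor(y, as.numeric(X %*% mu))^2
```

In this toy setting the two quantities differ only by a factor of ((n - 1) / n)^2, which comes from the sample standard deviations used by `scale`. With an external reference panel the LD matrix is itself an estimate, so the agreement would be looser.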
The `evalR2_XPASS` function takes the posterior means from the XPASS output and the external validation summary statistics as its input to evaluate the approximate predictive R2. The output is a vector of predictive R2 values evaluated using mu1, mu2, mu_XPASS1, and mu_XPASS2, respectively.
    > R2 <- evalR2_XPASS(fit_bbj$mu,BMI_bbj_female,ref_EAS)
    > R2
          PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
    0.02235596 0.01638359 0.02905665 0.01949609
Although the reference panels contain only a limited number of samples, XPASS (PRS_XPASS1) still achieves a relative improvement of about 30% in R2 over LDpred-inf (PRS1).
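The relative improvement can be recomputed directly from the R2 vector printed above. The sketch below copies those four values and compares PRS_XPASS1 with PRS1, taking PRS1 (LDpred-inf on the target data only) as the baseline.

```r
# R2 values copied from the evalR2_XPASS output above.
R2 <- c(PRS1 = 0.02235596, PRS2 = 0.01638359,
        PRS_XPASS1 = 0.02905665, PRS_XPASS2 = 0.01949609)

# Relative gain of XPASS over LDpred-inf in the target population.
rel_gain <- unname(R2["PRS_XPASS1"] / R2["PRS1"] - 1)

# Expressed as a percentage, rounded to one decimal place.
rel_gain_pct <- round(100 * rel_gain, 1)
```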
XPASS+ allows the population-specific effects to be utilized in PRS construction. To fit XPASS+, we first need to apply the P+T procedure to construct a set of pre-selected SNPs that are population-specific. Here, we use the ieugwasr package.
    library(ieugwasr)

    # P+T procedure for bbj
    z_bbj <- fread(BMI_bbj_male) # read summary statistics file
    # two-sided p-values from z-scores
    pval <- data.frame(rsid=z_bbj$SNP, pval=2*pnorm(abs(z_bbj$Z),lower.tail=F))
    # plink_bin is the path to a local plink binary; replace it with your own
    clp_bbj <- ld_clump(pval, clump_kb=1000, clump_r2=0.1, clump_p=1e-6, bfile=ref_EAS, plink_bin="/import/home/mcaiad/plink/plink")
    snps_l <- clp_bbj$rsid
The rsIDs of the pre-selected SNPs are stored in `snps_l`, which is then passed to the `snps_fe1` argument of the `XPASS` function.
    fit_bbj <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb,
                     file_ref1 = ref_EAS, file_ref2 = ref_EUR,
                     file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                     file_predGeno = BMI_test,
                     snps_fe1 = snps_l,
                     compPRS = T, pop = "EAS", sd_method = "LD_block", compPosMean = T,
                     file_out = "BMI_bbj_ukb_plus_ref_TGP")
    R2 <- evalR2_XPASS(fit_bbj$mu,BMI_bbj_female,ref_EAS)

    > R2
          PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
    0.02562267 0.01638359 0.03149972 0.01963232
By including the population-specific effects, XPASS+ achieves a predictive R^2 of 3.15%, a relative improvement of about 8% over XPASS.
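The pre-selection step in the P+T procedure above turns z-scores into two-sided p-values before clumping. The sketch below repeats that conversion on a few made-up z-scores and shows which |z| value corresponds to a p-value threshold of 1e-6. The z-scores are hypothetical, and LD clumping itself is not reproduced here.

```r
# Hypothetical z-scores for four SNPs.
z <- c(-5.2, 1.3, 4.0, 6.1)

# Two-sided p-values, computed as in the P+T step.
pval <- 2 * pnorm(abs(z), lower.tail = FALSE)

# |z| value at which the two-sided p-value equals 1e-6.
p_thresh <- 1e-6
z_cut <- qnorm(p_thresh / 2, lower.tail = FALSE)

# SNPs that would pass the p-value filter (before any LD clumping).
selected <- pval < p_thresh
```

Only SNPs whose absolute z-score exceeds `z_cut` (roughly 4.9) pass the filter, so a stricter `clump_p` shrinks the candidate set of population-specific SNPs.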
While in the above example we only include population-specific effects in the target population, our implementation of XPASS+ allows the inclusion of such effects for both populations. To include fixed effects in both populations, the pre-selected SNPs for the corresponding populations should be passed to the `snps_fe1` and `snps_fe2` arguments, respectively.
    # P+T procedure for ukb
    z_ukb <- fread(BMI_ukb) # read summary statistics file
    pval <- data.frame(rsid=z_ukb$SNP, pval=2*pnorm(abs(z_ukb$Z),lower.tail=F))
    clp_ukb <- ld_clump(pval, clump_kb=1000, clump_r2=0.1, clump_p=1e-10, bfile=ref_EUR, plink_bin="/import/home/mcaiad/plink/plink")
    snps_ukb <- clp_ukb$rsid

    # snps_l holds the pre-selected BBJ SNPs from the P+T step for bbj
    fit_both <- XPASS(file_z1 = BMI_bbj_male, file_z2 = BMI_ukb,
                      file_ref1 = ref_EAS, file_ref2 = ref_EUR,
                      file_cov1 = cov_EAS, file_cov2 = cov_EUR,
                      file_predGeno = BMI_test,
                      snps_fe1 = snps_l, snps_fe2 = snps_ukb,
                      compPRS = T, pop = "EAS", sd_method = "LD_block", compPosMean = T,
                      file_out = "BMI_bbj_ukb_plus_ref_TGP")
    R2 <- evalR2_XPASS(fit_both$mu,BMI_bbj_female,ref_EAS)

    > R2
          PRS1       PRS2 PRS_XPASS1 PRS_XPASS2
    0.02353173 0.01192856 0.03294502 0.01419943
This yields results similar to those obtained when the population-specific effects are included only in the target population.
An additional example for constructing PRS of Type 2 Diabetes can be found in this PDF.
Common issues are discussed in the FAQ.
The XPASS package is developed by Mingxuan Cai (mcaiad@ust.hk).
Please contact Mingxuan Cai (mcaiad@ust.hk) or Prof. Can Yang (macyang@ust.hk) with any enquiries.
Mingxuan Cai, Jiashun Xiao, Shunkang Zhang, Xiang Wan, Hongyu Zhao, Gang Chen, Can Yang. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. The American Journal of Human Genetics. 108, 632-655, April 2021.