pgsc_calc: a reproducible workflow to calculate polygenic scores
The pgsc_calc workflow makes it easy to calculate a polygenic score (PGS) using scoring files published in the Polygenic Score (PGS) Catalog 🧬 and/or custom scoring files.
The calculator workflow automates PGS downloads from the Catalog, variant matching between scoring files and target genotyping samplesets, and the parallel calculation of multiple PGS. Genetic ancestry assignment and PGS normalisation methods are also supported.
Workflow summary

The workflow performs the following steps:
Downloads scoring files using the PGS Catalog API in a specified genome build (GRCh37 or GRCh38).
Reads custom scoring files (and performs a liftover if genotyping data is in a different build).
Automatically combines and creates scoring files for efficient parallel computation of multiple PGS.
Matches variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format).
Calculates PGS for all samples (linear sum of weights and dosages).
Creates a summary report to visualize score distributions and pipeline metadata (variant matching QC).
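The matching and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow's implementation: the `ScoreRow` and `TargetVariant` classes, the toy data, and the matching rule (same chromosome and position, with the effect and other alleles equal to the target's ALT/REF alleles in either orientation) are all hypothetical. The score itself is the linear sum of weights and effect-allele dosages described in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """One line of a (hypothetical) scoring file."""
    chrom: str
    pos: int
    effect_allele: str
    other_allele: str
    weight: float


@dataclass(frozen=True)
class TargetVariant:
    """One variant in the (hypothetical) target genotype data."""
    chrom: str
    pos: int
    ref: str
    alt: str
    alt_dosages: tuple  # one ALT-allele dosage (0..2) per sample


def effect_dosages(row, variant):
    """Return per-sample effect-allele dosages, or None if the alleles differ."""
    if (row.effect_allele, row.other_allele) == (variant.alt, variant.ref):
        return list(variant.alt_dosages)
    if (row.effect_allele, row.other_allele) == (variant.ref, variant.alt):
        # The effect allele is the REF allele, so flip the ALT dosage.
        return [2 - d for d in variant.alt_dosages]
    return None


def calculate_pgs(score_rows, target_variants, n_samples):
    """Sum weight * dosage over matched variants; also report the match rate."""
    by_position = {(v.chrom, v.pos): v for v in target_variants}
    totals = [0.0] * n_samples
    matched = 0
    for row in score_rows:
        variant = by_position.get((row.chrom, row.pos))
        dosages = effect_dosages(row, variant) if variant else None
        if dosages is None:
            continue  # unmatched variants do not contribute to the score
        matched += 1
        for i, dosage in enumerate(dosages):
            totals[i] += row.weight * dosage
    return totals, matched / len(score_rows)


score_rows = [
    ScoreRow("1", 1000, "A", "G", 0.5),
    ScoreRow("1", 2000, "C", "T", -0.25),
    ScoreRow("2", 3000, "G", "A", 1.0),  # absent from the target data
]
target_variants = [
    TargetVariant("1", 1000, "G", "A", (0, 1, 2)),
    TargetVariant("1", 2000, "C", "T", (0, 2, 1)),
]

scores, match_rate = calculate_pgs(score_rows, target_variants, n_samples=3)
print(scores, match_rate)
```

Here two of the three scoring-file variants are found in the target data, and the second one is matched with its dosage flipped because its effect allele is the target's REF allele. The match rate is the kind of figure a variant matching QC summary would report.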
Optionally, the workflow can also:
Use a reference panel to obtain genetic ancestry data using PCA, and define the most similar population in the reference panel for each target sample.
Report PGS using methods to adjust for genetic ancestry.
Tip
To enable these optional steps, see How do I normalise calculated scores across different genetic ancestry groups?
See the Features under development section for information about planned updates.
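As a rough illustration of the optional ancestry steps, the sketch below assigns a target sample to its most similar reference population in PCA space and then reports the PGS as a z-score within that population. It is a simplified stand-in, not the workflow's method: nearest-centroid assignment, the two-component toy data, and the function names are assumptions chosen for brevity, and the linked how-to guide describes the adjustments the workflow actually supports.

```python
import math
from statistics import mean, stdev

# Hypothetical reference panel: (population label, (PC1, PC2), PGS).
reference = [
    ("POP_A", (0.0, 0.0), 1.0),
    ("POP_A", (0.2, 0.0), 2.0),
    ("POP_A", (0.1, 0.3), 3.0),
    ("POP_B", (1.0, 1.0), 4.0),
    ("POP_B", (1.2, 1.0), 6.0),
    ("POP_B", (1.1, 1.3), 8.0),
]


def population_centroids(reference):
    """Average the principal components of each reference population."""
    grouped = {}
    for label, pcs, _score in reference:
        grouped.setdefault(label, []).append(pcs)
    return {
        label: tuple(mean(axis) for axis in zip(*rows))
        for label, rows in grouped.items()
    }


def most_similar_population(sample_pcs, centroids):
    """Pick the population whose centroid is closest in PC space."""
    return min(centroids, key=lambda label: math.dist(sample_pcs, centroids[label]))


def normalise(raw_score, label, reference):
    """Express a raw PGS as a z-score within the chosen reference population."""
    ref_scores = [score for pop, _pcs, score in reference if pop == label]
    return (raw_score - mean(ref_scores)) / stdev(ref_scores)


centroids = population_centroids(reference)
label = most_similar_population((0.9, 1.1), centroids)
z = normalise(7.0, label, reference)
print(label, z)
```

The point of the example is the order of operations: ancestry similarity is established first, and the score is then interpreted relative to the distribution of the matching reference group rather than the whole panel.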
The workflow relies on open source scientific software. The included software is described in full in Reference: container images.
Quick example
Install Nextflow
Install Docker or Singularity (minimum v3.8.3) for full reproducibility, or Conda as a fallback
Calculate some polygenic scores using synthetic test data:
$ nextflow run pgscatalog/pgsc_calc -profile test,docker
The workflow should output:
... <configuration messages intentionally not shown> ...
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:
* The Polygenic Score Catalog https://doi.org/10.1038/s41588-021-00783-5
* The nf-core framework https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies https://github.com/pgscatalog/pgsc_calc/blob/master/CITATIONS.md
------------------------------------------------------
executor > local (7)
[49/d28766] process > PGSC_CALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (samplesheet.csv) [100%] 1 of 1 ✔
[c3/a8e0d9] process > PGSC_CALC:PGSCALC:INPUT_CHECK:SCOREFILE_CHECK [100%] 1 of 1 ✔
[- ] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF -
[7c/5cca6c] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_BFILE (cineca_synthetic_subset) [100%] 1 of 1 ✔
[3b/ce0e39] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:MATCH_VARIANTS (cineca_synthetic_subset) [100%] 1 of 1 ✔
[2e/fb3233] process > PGSC_CALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE (cineca_synthetic_subset) [100%] 1 of 1 ✔
[b5/fc5b1e] process > PGSC_CALC:PGSCALC:APPLY_SCORE:SCORE_REPORT (1) [100%] 1 of 1 ✔
[03/009cb6] process > PGSC_CALC:PGSCALC:DUMPSOFTWAREVERSIONS (1) [100%] 1 of 1 ✔
-[pgscatalog/pgsc_calc] Pipeline completed successfully-
Note
The docker profile option can be replaced with singularity or conda depending on your local environment.
If you want to try the workflow with your own data, have a look at the Getting started section.
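For scripted setups, the choice between the docker, singularity, and conda profiles can be made programmatically. The helper below is a hypothetical convenience, not part of pgsc_calc: it assumes the container or package manager executables are named `docker`, `singularity`, and `conda` on your `PATH`, and it only builds the test command shown in this section rather than running it.

```python
import shlex
import shutil

# Preference order follows the quick example: containers first, Conda as a fallback.
SUPPORTED_ENGINES = ("docker", "singularity", "conda")


def first_available_engine():
    """Return the first supported engine found on PATH, or None."""
    for engine in SUPPORTED_ENGINES:
        if shutil.which(engine):
            return engine
    return None


def build_test_command(engine):
    """Build the test-profile command for the chosen engine."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    return ["nextflow", "run", "pgscatalog/pgsc_calc", "-profile", f"test,{engine}"]


command = build_test_command("singularity")
print(shlex.join(command))
```

In practice you might call `build_test_command(first_available_engine())` after checking that an engine was found, and pass the resulting list to `subprocess.run`.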
Documentation
Get started: install pgsc_calc and calculate some polygenic scores quickly
How-to guides: step-by-step guides, covering different use cases
Reference guides: technical information about workflow configuration
Explanations: more detailed explanations about PGS calculation and results
Changelog
The Changelog page describes fixes and enhancements for each version.
Features under development
These are some of the features and improvements we’re planning for pgsc_calc:
Further optimizations to the PCA & ancestry similarity analysis steps focused on improving automatic QC
Performance improvements to make pgsc_calc work with 1000s of scoring files in parallel (e.g. integration with OmicsPred)
Credits
pgscatalog/pgsc_calc is developed as part of the PGS Catalog project, a collaboration between the University of Cambridge’s Department of Public Health and Primary Care (Michael Inouye, Samuel Lambert) and the European Bioinformatics Institute (Helen Parkinson, Laura Harris).
The pipeline seeks to provide a standardized workflow for PGS calculation and ancestry inference, implemented in Nextflow and derived from an existing set of tools/scripts developed by the Inouye lab (Rodrigo Canovas, Scott Ritchie, Jingqin Wu) and PGS Catalog teams (Samuel Lambert, Laurent Gil).
The adaptation of the codebase, Nextflow implementation, and PGS Catalog features are written by Benjamin Wingfield, Samuel Lambert, and Laurent Gil, with additional input from Aoife McMahon (EBI). Development of new features, testing, and code review is ongoing, including Inouye lab members (Rodrigo Canovas, Scott Ritchie) and others. A manuscript describing the tool is in preparation (see Citations) and we welcome ongoing community feedback before then via our discussion board or issue tracker.
Citations
If you use pgscatalog/pgsc_calc in your analysis, please cite:
Lambert, Wingfield, et al. (2024) Enhancing the Polygenic Score Catalog with tools for score calculation and ancestry normalization. Nature Genetics. doi:10.1038/s41588-024-01937-x.
In addition, please remember to cite the primary publications for any PGS Catalog scores you use in your analyses, and the underlying data/software tools described in the citations file.
License Information
This pipeline is distributed under an Apache 2.0 license, but makes use of multiple open-source software tools and datasets (complete list in the citations file) that are distributed under their own licenses. Notably:
Nextflow (Apache 2.0 license) and nf-core (MIT license). See and cite Ewels et al. Nature Biotech (2020) for additional information about the project.
PLINK 1/2 software (GPLv3+)
CINECA synthetic cohort data for test dataset (CC-BY-NC-SA)
We note that it is up to end-users to ensure that their use of the pipeline and test data conforms to the license restrictions.
Funding
This work has received funding from EMBL-EBI core funds, the Baker Institute, the University of Cambridge, Health Data Research UK (HDRUK), and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016775 INTERVENE.