Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

License

NotificationsYou must be signed in to change notification settings

brentp/somalier

Repository files navigation

Actions StatusCiteinstall with bioconda

Quick Start

somalier makes checking any number of samples for identity easydirectly from the alignments or from jointly-called VCFs:

The first step is to extract sites. ForVCF just use:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $cohort.vcf.gz

with a sites file fromreleases

ForBAM orCRAM, use:This is parallelizable by sample via cluster or cloud, but here, using a for loop:

for f in *.cram; do    somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $fdone

--sites is a VCF of known polymorphic sites in VCF format. A good set is provided inthereleases but any set of common variants will work.

⚠️somalier can work on GVCF and individual VCFs, but it isrecommended to extract from bam/cram when possible. It is also good toextract from a jointly-called VCF/BCF when only looking within that cohort.While extracting from a single-sample VCF is possible (with --unknown) andGVCF is also supported, these options are less accurate and more prone toproblems.

The next step is to calculate relatedness on the extracted data:

somalier relate --ped $pedigree extracted/*.somalier

This will create text and interactive HTML outputthat makes it fast and easy to detect mismatched samples and sample-swaps.The files created are:

  • `{output_prefix}.samples.tsv # creates a .ped like file with extra QC columns
  • `{output_prefix}.pairs.tsv # shows IBS for all possible sample pairs
  • `{output_prefix}.groups.tsv # shows pairs of samples above a certain relatedness
  • `{output_prefix}.html # interactive html

Example html output ishere

Note that thesomalier relate command runs extremely quickly (< 2 seconds for 600 samples and ~1 minute for 4,500 samples) so it's possibleto add/remove samples or adjust a pedigree file and re-run iteratively.

For example to add then + 1th samples, just runsomalier extract on the new sample and then re-usethe already extracted data from then original samples.

Forhuge sample-sets, if you run into a bash error forargument list too long, you can pass the somalier files as quotedglob strings like:"/path/to/set-a/*.somalier" "/path/to/set-b/*.somalier".

Example Output

  • Interactive output fromsomalier relate ishere
  • Interactive output fromsomalier ancestry ishere

Infer

somalier can also infer first-degree relationships (parent-child) when both-parentsare present and can often build entire pedigrees on high-qualty data. To do this, usesomalier relate --infer ... and thesamples.tsv output will be a pedigree fileindicating the inferred relationships.

Seewiki for more detail.

Usage

The usage is also described above. Briefly, after downloading the somalier binary and a sites vcf from thereleases run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $sample.bam

for each sample to create a small binary file of the ref and alt counts for the variants listedin sites.hg38.vcf.gz.

for a vcf, run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $cohort.bcf

somalier canextract from a multi or single-sample VCF or a GVCF. This will be much faster, in cases where it's available,this would look like:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa joint.vcf.gz

following this, there will be a$sample.somalier file for each sample in thejoint.vcf.gz

Note thatsomalier uses theAD field to extract depth information. If that FORMAT field is not present in theheader, then it will use the genotypes only and use a total depth of 20 (10,10 for heterozygote), etc.

Then run:

somalier relate --ped $pedigree_file cohort/*.somalier

This will create an html file for QC in a few seconds.

Note that if a new sample is added to the cohort, it's only necessary to performtheextract step on that sample and then run the (fast)relate step again with allof the extracted files.

Extended Usage

For each command of somalier, extended parameters are listed in--help of each subcommand.

$./somalier --helpCommands:  extract      :   extract genotype-like information for a single sample from VCF/BAM/CRAM.  relate       :   aggregate `extract`ed information and calculate relatedness among samples.  ancestry     :   perform ancestry prediction on a set of samples, given a set of labeled samples  find-sites   :   create a new sites.vcf.gz file from a population VCF (this is rarely needed).

somalier extract

extract genotype-like information for a single-sample at selected sites

Usage:  somalier extract [options] sample_fileArguments:  sample_file      single-sample CRAM/BAM/GVCF file or multi/single-sample VCF from which to extractOptions:  -s, --sites=SITES          sites vcf file of variants to extract  -f, --fasta=FASTA          path to reference fasta file  -d, --out-dir=OUT_DIR      path to output directory (default: .)  --sample-prefix=SAMPLE_PREFIX                             prefix for the sample name stored inside the digest

somalier relate

calculate relatedness among samples from extracted, genotype-like information

Usage:  somalier relate [options] [extracted ...]Arguments:  [extracted ...]  $sample.somalier files for each sample. the first 10 are tested as a glob patternsOptions:  -g, --groups=GROUPS        optional path  to expected groups of samples (e.g. tumor normal pairs).specified as comma-separated groups per line e.g.:    normal1,tumor1a,tumor1b    normal2,tumor2a  --sample-prefix=SAMPLE_PREFIX                             optional sample prefixes that can be removed to find identical samples. e.g. batch1-sampleA batch2-sampleA  -p, --ped=PED              optional path to a ped/fam file indicating the expected relationships among samples.  -d, --min-depth=MIN_DEPTH  only genotype sites with at least this depth. (default: 7)  --min-ab=MIN_AB            hets sites must be between min-ab and 1 - min_ab. set this to 0.2 for RNA-Seq data (default: 0.3)  -u, --unknown              set unknown genotypes to hom-ref. it is often preferable to use this with VCF samples that were not jointly called  -i, --infer                infer relationships (https://github.com/brentp/somalier/wiki/pedigree-inference)  -o, --output-prefix=OUTPUT_PREFIX                             output prefix for results. (default: somalier)

Note that for large cohorts, by default,somalier relate will subset to interesting sample-pairs so as not toballoon the size of the output. By default, pairs that are expected to be unrelated and have a relatedness <= 0.05.If you wish to force all samples to be reported, then set the environment variableSOMALIER_REPORT_ALL_PAIRS to any non-emptyvalue, e.g.export SOMALIER_REPORT_ALL_PAIRS=1

find-sites

To create a set of new sites, usesomalier find-sites on a population VCF. More info on this tool is available inthe wiki

Install

Somalier is available via bioconda, seehere.

Alternatively, you can get a static binary fromhere.

Users can also get a docker imageherewhich contains htslib and a somalier binary ready-for-use.

How it works

somalier takes a list of known polymorphic sites. Even a few hundred (or dozen) sitescan be a very good indicator of relatedness. The best sites are those with a populationallele frequency close to 0.5 as that maximizes the probability that any 2 samples will differ.A list of such sites is provided in thereleasesfor GRCh37 and hg38.

In order to quickly calculate genotypes at these sites,somalier assays the exact base.The extraction step is done directly from the bam/cram files 1 sample at a time.

Therelate step is run on the output of theextract commands. It runs extremely quicklyso that new samples can be added and compared. It uses 3 bit-vectors per sample for hom-ref,het, hom-alt. Each bitvector is a sequence of 64 bit integers where each bit is set ifthe variant at that index in the sample is for example, heterozygous. With this setup,we can use fast bitwise operations andpopcounthardware instructions to calculate relatedness extremely quickly.

For each sample-pair, it reports:

  1. IBS0 -- the number of sites where one sample is hom-ref and another is hom-alt
  2. IBS2 -- the number of sites where the samples have the same genotype
  3. shared-hets -- the number of sites where both samples are heterozygotes
  4. shared-hom-alts -- the number of sites where both samples are homozygous alternate

These are used to calculaterelatednessand a measure of relatedness that is unaffected by loss-of-heterozygosity that is often seen in somecancers. The interactive output allows toggling between any of these measures.

It also reports depth information and the count ofHET,HOM_REF,HOM_ALT, andunknown genotypes for each samplealong with a number of metrics that are useful for general QC.

Example

example

Here, each point is a pair of samples. We can see that the expected identical sample-pairs (e.g. tumor-normal pairs) specified by the userand drawn in red mostly cluster together on the right. Unrelateds cluster on the lower left. The sample-swaps are the blue points that cluster withthe red. In the somalier output, the user canhover to see which sample-pairs are involved each point

Ancestry Estimate

somalier can predict ancestry on a set of query samples given a set of labelled samples. You can read more about this inthe wiki

Usage

Usage is intentionally very simple and runningsomalier extract orsomalier relate will give sufficient help for nearlyall cases.

By defaultsomalier will only consider variants that have a "PASS" or "RefCall" FILTER. To extend this list, setthe environment variableSOMALIER_ALLOWED_FILTERS to a comma-delimited list of additional filters to allow.

by default sites with an allele balance < 0.01 will be considered homozygous reference. To adjust this, use e.g. :SOMALIER_AB_HOM_CUTOFF=0.04 somalier relate ...

Other Work

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499645/

https://academic.oup.com/bioinformatics/article/33/4/596/2624551

Acknowledgement

This work was motivated by interaction and discussions with Preeti Aahir and severalearly users who provided valuable feedback.

About

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

[8]ページ先頭

©2009-2025 Movatter.jp