apriha/snps

snps

tools for reading, writing, merging, and remapping SNPs 🧬

snps strives to be an easy-to-use and accessible open-source library for working with genotype data.

Features

Input / Output

Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
Read and write VCF files (e.g., convert 23andMe to VCF)
Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
Handle several variations of file types, validated via openSNP parsing analysis
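As a rough illustration of what reading data from bytes, gzip, or zip can involve, the sketch below inspects the leading magic bytes and decompresses accordingly. It is not the snps reader; the function name decode_raw_data, the UTF-8 assumption, the single-member zip assumption, and the sample row are all hypothetical.

```python
import gzip
import io
import zipfile


def decode_raw_data(data: bytes) -> str:
    """Return the text held in plain, gzip-compressed, or zip-compressed bytes."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data).decode("utf-8")
    if data[:4] == b"PK\x03\x04":  # zip local file header signature
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: the archive holds one data file, so read the first member.
            return archive.read(archive.namelist()[0]).decode("utf-8")
    return data.decode("utf-8")  # assumption: uncompressed UTF-8 text


# Made-up sample row, only to exercise the function.
sample = "rsid\tchromosome\tposition\tgenotype\nrs3094315\t1\t752566\tAA\n"
print(decode_raw_data(gzip.compress(sample.encode("utf-8"))) == sample)
```

Checking magic bytes rather than file extensions lets the same function serve both file contents and in-memory bytes.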

Build / Assembly Detection and Remapping

Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
Remap SNPs between builds / assemblies

Data Cleaning

Perform quality control (QC) / filter low quality SNPs based on chip clusters
Fix several common issues when loading SNPs
Sort SNPs based on chromosome and position
Deduplicate RSIDs
Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
Deduplicate alleles on MT
Assign PAR SNPs to the X or Y chromosome
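Sorting by chromosome and position needs an ordering in which "2" comes before "10" and the named chromosomes follow the autosomes. The following is a minimal sketch of one such ordering, assuming SNPs are held in a plain dict; the helper names and the X, Y, MT order are assumptions, not the library's implementation.

```python
def chrom_sort_key(chrom: str) -> tuple:
    """Order autosomes numerically, then X, Y, MT (an assumed convention)."""
    special = {"X": 23, "Y": 24, "MT": 25}
    if chrom.isdigit():
        return (int(chrom), "")
    # Unknown chromosome names sort last, alphabetically.
    return (special.get(chrom, 26), chrom)


def sort_snps(snps: dict) -> dict:
    """Sort {rsid: (chrom, pos, genotype)} by chromosome, then by position."""
    return dict(
        sorted(snps.items(), key=lambda item: (chrom_sort_key(item[1][0]), item[1][1]))
    )


# Hypothetical rsids and positions, for illustration only.
unsorted_snps = {
    "rs4": ("MT", 10, "A"),
    "rs3": ("X", 5, "AG"),
    "rs2": ("10", 7, "CC"),
    "rs1": ("2", 9, "TT"),
    "rs0": ("2", 3, "GG"),
}
print(list(sort_snps(unsorted_snps)))
```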

Analysis

Derive sex from SNPs
Detect deduced genotype / chip array and chip version based on chip clusters
Predict ancestry from SNPs (when installed with ezancestry)
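One common way to derive sex from genotype data is to measure how often X-chromosome SNPs are heterozygous, since a single X copy cannot carry two different alleles. The sketch below illustrates that idea only; the function name, the 0.03 default threshold, and the treatment of null genotypes are assumptions, and it expects the caller to pass only non-PAR X genotypes.

```python
def derive_sex(x_genotypes, heterozygous_x_threshold=0.03):
    """Guess sex from non-PAR X genotypes such as "AG" or "CC".

    Null genotypes are assumed to be None or empty strings and are ignored.
    Returns "Male", "Female", or "" when there is nothing to evaluate.
    """
    called = [g for g in x_genotypes if g]
    if not called:
        return ""
    # A genotype is heterozygous when it contains more than one distinct allele.
    heterozygous = sum(1 for g in called if len(set(g)) > 1)
    if heterozygous / len(called) > heterozygous_x_threshold:
        return "Female"
    return "Male"


print(derive_sex(["AA", "GG", "CC", "TT"]))
print(derive_sex(["AG", "AA", "CT", "GG"]))
```

The threshold tolerates a small share of heterozygous calls, which can appear through genotyping error even with a single X copy.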

Supported Genotype Files

snps supports VCF files and genotype files from the following DNA testing sources:

Additionally, snps can read a variety of "generic" CSV and TSV files.

Dependencies

snps requires Python 3.8+ and the following Python packages:

Installation

snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:

$ pip install snps

For ancestry prediction capability, snps can be installed with ezancestry:

$ pip install snps[ezancestry]

Examples

Download Example Data

First, let's set up logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we're ready to download some example data from openSNP:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz

Load Raw Data

Load a 23andMe raw data file:

>>> from snps import SNPs
>>> s = SNPs("resources/662.23andme.340.txt.gz")
>>> s.source
'23andMe'
>>> s.count
991786

The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to infer the data source and load the SNPs. The loaded SNPs are normalized and available via a pandas.DataFrame:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> df.chrom.dtype.name
'object'
>>> df.pos.dtype.name
'uint32'
>>> df.genotype.dtype.name
'object'
>>> len(df)
991786

snps also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
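One way to think about build detection is as a lookup: a SNP whose position differs between assemblies reveals the build when its position in the data matches a known value. The sketch below uses only the two positions of rs3094315 shown in the remapping example of this README (752566 in Build 37 and 817186 in Build 38). The table and function are hypothetical and are not presented as the library's method.

```python
# Positions of rs3094315 as shown in this README's remapping example.
KNOWN_POSITIONS = {"rs3094315": {37: 752566, 38: 817186}}


def detect_build(positions):
    """Return the build whose known position matches the data, else None.

    positions maps rsid -> position, e.g. {"rs3094315": 752566}.
    """
    for rsid, position_by_build in KNOWN_POSITIONS.items():
        observed = positions.get(rsid)
        for build, known in position_by_build.items():
            if observed == known:
                return build
    return None  # build could not be determined from the known SNPs


print(detect_build({"rs3094315": 752566}))
```

A real table would need several reference SNPs so that detection still works when a file lacks any single one of them.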

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing sources - let's combine these files. Specifically, we'll update the SNPs object with SNPs from a Family Tree DNA file.

>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> s.source
'23andMe, FTDNA'
>>> s.count
1006960
>>> s.build
37
>>> s.build_detected
True

If the SNPs being merged have a build that differs from the destination build, the SNPs to merge will be remapped automatically. After this example merge, the build is still detected, since the build was detected for all SNPs objects that were merged.

As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These discrepant SNPs are available for inspection after the merge via properties of the SNPs object.

>>> len(s.discrepant_merge_genotypes)
151

Additionally, any non-called / null genotypes will be updated during the merge, if the file being merged has a called genotype for the SNP.

Moreover, merge takes a chrom parameter - this enables merging of only SNPs associated with the specified chromosome (e.g., "Y" or "MT").

Finally, merge returns a list of dict, where each dict has information corresponding to the results of each merge (e.g., SNPs in common).

>>> sorted(list(merge_results[0].keys()))
['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
>>> merge_results[0]["merged"]
True
>>> len(merge_results[0]["common_rsids"])
692918
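The merge rules described above (keep the original position on a position discrepancy, null out a discrepant genotype, fill in a null genotype from the incoming file) can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library's code: the function name merge_snps is hypothetical, genotypes are compared ignoring allele order, and discrepancy thresholds and remapping are not modeled.

```python
def merge_snps(existing, incoming):
    """Merge incoming into existing in place.

    Both arguments map rsid -> [chrom, pos, genotype]; a null genotype is None.
    Returns a dict of rsid lists describing the merge.
    """
    result = {
        "common_rsids": [],
        "discrepant_position_rsids": [],
        "discrepant_genotype_rsids": [],
    }
    for rsid, (chrom, pos, genotype) in incoming.items():
        if rsid not in existing:
            existing[rsid] = [chrom, pos, genotype]  # new SNP, simply add it
            continue
        result["common_rsids"].append(rsid)
        current = existing[rsid]
        if (current[0], current[1]) != (chrom, pos):
            # Position discrepancy: record it and keep the original position.
            result["discrepant_position_rsids"].append(rsid)
        if current[2] is None:
            current[2] = genotype  # fill a non-called genotype
        elif genotype is not None and sorted(current[2]) != sorted(genotype):
            # Genotype discrepancy: record it and mark the genotype as null.
            result["discrepant_genotype_rsids"].append(rsid)
            current[2] = None
    return result
```

Comparing sorted alleles treats "AG" and "GA" as the same call, which avoids flagging files that merely report alleles in a different order.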

Remap SNPs

Now, let's remap the merged SNPs to change the assembly / build:

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
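Conceptually, remapping translates a position through a table of aligned segments between two assemblies. The sketch below shows that lookup with a binary search. The segment table is hypothetical: only the offset of the first segment is derived from the rs3094315 example above (752566 to 817186), the boundaries are made up, and strand changes are not modeled.

```python
from bisect import bisect_right

# Hypothetical segments: (source_start, source_end, target_start), sorted and
# non-overlapping. The first offset reproduces 752566 -> 817186 from the
# example above; everything else is invented for illustration.
SEGMENTS = [
    (700000, 800000, 764620),
    (900001, 950000, 1000001),
]


def remap_position(pos, segments):
    """Map pos through the segments; return None if no segment covers it."""
    starts = [segment[0] for segment in segments]
    index = bisect_right(starts, pos) - 1  # last segment starting at or before pos
    if index < 0:
        return None
    source_start, source_end, target_start = segments[index]
    if pos > source_end:
        return None  # pos falls in a gap between segments
    return target_start + (pos - source_start)


print(remap_position(752566, SEGMENTS))
```

Positions that fall outside every segment have no counterpart in the target assembly, so the sketch reports them as None rather than guessing.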

Save SNPs

Ok, so far we've merged the SNPs from two files (ensuring the same build in the process and identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let's save the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:

>>> saved_snps = s.to_tsv("out.txt")
Saving output/out.txt
>>> print(saved_snps)
output/out.txt

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = s.to_vcf("out.vcf")
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Saving output/out.vcf
1 SNP positions were found to be discrepant when saving VCF

When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked as discrepant and are available via a property of the SNPs object.

All output files are saved to the output directory.
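To illustrate why a reference sequence is needed and how a position can end up discrepant, the sketch below builds VCF data lines by looking up the REF base at each position and sets aside any SNP whose position lies outside the sequence. The function name, the in-memory reference dict, and the assumption that every genotype is a called string of A/C/G/T alleles are hypothetical; this is not the library's writer.

```python
def to_vcf_lines(snps, reference):
    """Return (vcf_lines, discrepant_rsids).

    snps: list of (rsid, chrom, pos, genotype) with 1-based positions.
    reference: {chrom: sequence string}.
    """
    lines, discrepant = [], []
    for rsid, chrom, pos, genotype in snps:
        sequence = reference.get(chrom, "")
        if not 1 <= pos <= len(sequence):
            discrepant.append(rsid)  # position is outside the reference sequence
            continue
        ref = sequence[pos - 1]
        alts = sorted({allele for allele in genotype if allele != ref})
        alleles = [ref] + alts  # index 0 is REF, 1.. are ALT alleles
        gt = "/".join(str(alleles.index(allele)) for allele in genotype)
        alt = ",".join(alts) if alts else "."
        # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
        lines.append("\t".join([chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt]))
    return lines, discrepant


# Tiny made-up reference and SNPs, for illustration only.
example_lines, example_discrepant = to_vcf_lines(
    [("rs1", "1", 2, "CT"), ("rs2", "1", 9, "AA")], {"1": "ACGT"}
)
print(example_lines, example_discrepant)
```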

Documentation

Documentation is available here.

Acknowledgements

Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP, Open Humans, and Sano Genetics.

snps incorporates code and concepts generated with the assistance of OpenAI's ChatGPT. ✨

License

snps is licensed under the BSD 3-Clause License.

About

Releases: 46 (latest: v2.10.0, Mar 22, 2025)

Contributors: 11

Languages: Python 100.0%
