tools for reading, writing, merging, and remapping SNPs
apriha/snps
tools for reading, writing, merging, and remapping SNPs 🧬
snps strives to be an easy-to-use and accessible open-source library for working with genotype data.
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing sources with a SNPs object
- Read and write VCF files (e.g., convert 23andMe to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Read data in a variety of formats (e.g., files, bytes, compressed with gzip or zip)
- Handle several variations of file types, validated via openSNP parsing analysis
- Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
- Remap SNPs between builds / assemblies
- Perform quality control (QC) / filter low quality SNPs based on chip clusters
- Fix several common issues when loading SNPs
- Sort SNPs based on chromosome and position
- Deduplicate RSIDs
- Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
- Deduplicate alleles on MT
- Assign PAR SNPs to the X or Y chromosome
- Derive sex from SNPs
- Detect deduced genotype / chip array and chip version based on chip clusters
- Predict ancestry from SNPs (when installed with ezancestry)
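As an illustration of two of the cleanup steps listed above (sorting by chromosome and position, and deduplicating RSIDs), here is a minimal standard-library sketch. It is not the snps implementation: the record layout, the helper names, the rsids, and the choice to keep the first occurrence of a duplicated rsid are all assumptions made for the example.

```python
# Illustrative sketch only; snps itself works on a pandas.DataFrame.
# Assumed chromosome ordering: autosomes 1-22, then X, Y, MT.
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
CHROM_RANK = {chrom: rank for rank, chrom in enumerate(CHROM_ORDER)}


def sort_key(record):
    """Order by chromosome rank, then position; unknown chromosomes go last."""
    _rsid, chrom, pos, _genotype = record
    return (CHROM_RANK.get(chrom, len(CHROM_ORDER)), pos)


def sort_and_deduplicate(records):
    """Keep the first record seen for each rsid, then sort the survivors."""
    seen = set()
    unique = []
    for record in records:
        rsid = record[0]
        if rsid not in seen:
            seen.add(rsid)
            unique.append(record)
    return sorted(unique, key=sort_key)


# Hypothetical records: (rsid, chrom, pos, genotype)
records = [
    ("rs4", "MT", 100, "A"),
    ("rs2", "10", 500, "AG"),
    ("rs1", "2", 900, "CC"),
    ("rs3", "X", 50, "TT"),
    ("rs1", "2", 900, "CC"),  # duplicate rsid, dropped
    ("rs5", "2", 100, "GG"),
]
result = sort_and_deduplicate(records)
print([record[0] for record in result])
```

Sorting on a numeric rank rather than on the chromosome string avoids the lexical ordering in which "10" would come before "2".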
snps supports VCF files and genotype files from the following DNA testing sources:
- 23andMe
- Ancestry
- CircleDNA
- Código 46
- DNA.Land
- Family Tree DNA
- Genes for Good
- LivingDNA
- Mapmygenome
- MyHeritage
- PLINK
- Sano Genetics
- SelfDecode
- tellmeGen
Additionally, snps can read a variety of "generic" CSV and TSV files.
snps requires Python 3.8+ and the following Python packages:
snps is available on the Python Package Index. Install snps (and its required Python dependencies) via pip:
$ pip install snps
For ancestry prediction capability, snps can be installed with ezancestry:
$ pip install snps[ezancestry]
First, let's set up logging to get some helpful output:
>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))
Now we're ready to download some example data from openSNP:
>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
Load a 23andMe raw data file:
>>> from snps import SNPs
>>> s = SNPs("resources/662.23andme.340.txt.gz")
>>> s.source
'23andMe'
>>> s.count
991786
The SNPs class accepts a path to a file or a bytes object. A Reader class attempts to infer the data source and load the SNPs. The loaded SNPs are normalized and available via a pandas.DataFrame:
>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> df.chrom.dtype.name
'object'
>>> df.pos.dtype.name
'uint32'
>>> df.genotype.dtype.name
'object'
>>> len(df)
991786
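To make the normalized layout concrete (an rsid key with chrom, pos, and genotype fields), the following sketch parses a small tab-separated raw data sample with the standard library. It is only an approximation of what a reader might do, not the snps Reader: the sample rows, the function name, and the treatment of "--" as a non-called genotype are assumptions.

```python
import csv
import io

# Hypothetical raw data in a 23andMe-like layout (comment header, tab-separated).
RAW = """# rsid\tchromosome\tposition\tgenotype
rs1\t1\t1000\tAG
rs2\tX\t2000\t--
rs3\tMT\t300\tA
"""


def parse_raw(text):
    """Return {rsid: {"chrom", "pos", "genotype"}} from tab-separated text."""
    snps = {}
    # Skip comment lines, which start with "#"
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    for rsid, chrom, pos, genotype in csv.reader(lines, delimiter="\t"):
        snps[rsid] = {
            "chrom": chrom,            # kept as a string ("1", "X", "MT")
            "pos": int(pos),           # positions are integers
            # Assumption: "--" marks a non-called genotype
            "genotype": None if genotype == "--" else genotype,
        }
    return snps


parsed = parse_raw(RAW)
print(parsed["rs1"])
```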
snps also attempts to detect the build / assembly of the data:
>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
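One plausible way to detect a build is to compare the positions of well-known SNPs against a table of their positions in each build. The sketch below assumes that approach; it is not taken from the snps source. The two positions for rs3094315 are the ones shown in the remapping example in this README, and the function and table names are hypothetical.

```python
# Known positions per build for a reference SNP. The values for rs3094315
# come from the remapping example in this README (Build 37 and Build 38).
REFERENCE_POSITIONS = {
    "rs3094315": {37: 752566, 38: 817186},
}


def detect_build(positions):
    """Return the build whose reference position matches, or None.

    `positions` maps rsid -> position for the loaded data.
    """
    for rsid, by_build in REFERENCE_POSITIONS.items():
        pos = positions.get(rsid)
        if pos is None:
            continue  # this reference SNP is not in the data; try the next
        for build, reference_pos in by_build.items():
            if pos == reference_pos:
                return build
    return None


print(detect_build({"rs3094315": 752566}))
```

A real detector would need more than one reference SNP, since any single SNP may be absent from a given file.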
The dataset consists of raw data files from two different DNA testing sources - let's combine these files. Specifically, we'll update the SNPs object with SNPs from a Family Tree DNA file.
>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> s.source
'23andMe, FTDNA'
>>> s.count
1006960
>>> s.build
37
>>> s.build_detected
True
If the SNPs being merged have a build that differs from the destination build, the SNPs to merge will be remapped automatically. After this example merge, the build is still detected, since the build was detected for all SNPs objects that were merged.
As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These discrepant SNPs are available for inspection after the merge via properties of the SNPs object.
>>> len(s.discrepant_merge_genotypes)
151
Additionally, any non-called / null genotypes will be updated during the merge, if the file being merged has a called genotype for the SNP.
Moreover, merge takes a chrom parameter - this enables merging of only SNPs associated with the specified chromosome (e.g., "Y" or "MT").
Finally, merge returns a list of dict, where each dict has information corresponding to the results of each merge (e.g., SNPs in common).
>>> sorted(list(merge_results[0].keys()))
['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
>>> merge_results[0]["merged"]
True
>>> len(merge_results[0]["common_rsids"])
692918
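The merge behavior described above (keep original positions, null out conflicting genotypes, fill in non-called genotypes) can be sketched in plain Python. This is a simplified illustration, not the snps implementation: it ignores discrepancy thresholds and remapping, it assumes genotypes match regardless of allele order, and the data and function names are hypothetical. Only the result keys mirror those shown in the session above.

```python
def same_genotype(a, b):
    """Assumption: allele order does not matter, so "AG" equals "GA"."""
    return sorted(a) == sorted(b)


def merge(target, incoming):
    """Merge `incoming` into `target` in place; both map rsid -> (chrom, pos, genotype)."""
    result = {
        "merged": True,
        "common_rsids": set(),
        "discrepant_position_rsids": set(),
        "discrepant_genotype_rsids": set(),
    }
    for rsid, (chrom, pos, genotype) in incoming.items():
        if rsid not in target:
            target[rsid] = (chrom, pos, genotype)  # new SNP, simply added
            continue
        result["common_rsids"].add(rsid)
        t_chrom, t_pos, t_genotype = target[rsid]
        if pos != t_pos:
            # Position conflict: record it and keep the original position
            result["discrepant_position_rsids"].add(rsid)
        if t_genotype is None:
            t_genotype = genotype  # fill a non-called genotype
        elif genotype is not None and not same_genotype(t_genotype, genotype):
            # Genotype conflict: record it and mark the genotype as null
            result["discrepant_genotype_rsids"].add(rsid)
            t_genotype = None
        target[rsid] = (t_chrom, t_pos, t_genotype)
    return result


# Hypothetical data
target = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, None), "rs3": ("2", 300, "CC")}
incoming = {
    "rs1": ("1", 100, "GA"),  # same genotype, different allele order
    "rs2": ("1", 200, "TT"),  # fills the null genotype
    "rs3": ("2", 305, "CT"),  # discrepant position and genotype
    "rs4": ("3", 400, "GG"),  # not yet in target
}
merge_result = merge(target, incoming)
print(sorted(merge_result["common_rsids"]))
```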
Now, let's remap the merged SNPs to change the assembly / build:
>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
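Conceptually, remapping translates each position through a table of segment-to-segment mappings between assemblies. The sketch below assumes that model and is not the snps implementation. The segment boundaries are made up, chosen only so that the rs3094315 positions from the session above (752566 in Build 37, 817186 in Build 38) fall inside the first segment; the second segment illustrates a reverse-strand mapping.

```python
# Hypothetical mapping segments for a single chromosome:
# (source_start, source_end, target_start, target_end, strand)
SEGMENTS = [
    (700000, 799999, 764620, 864619, 1),      # forward strand, made-up bounds
    (900000, 900099, 1000000, 1000099, -1),   # reverse strand, made-up bounds
]


def remap_position(pos, segments):
    """Return the mapped position, or None if no segment covers `pos`."""
    for src_start, src_end, dst_start, dst_end, strand in segments:
        if src_start <= pos <= src_end:
            offset = pos - src_start
            if strand == 1:
                return dst_start + offset
            # Reverse strand: count back from the end of the target segment
            return dst_end - offset
    return None


print(remap_position(752566, SEGMENTS))
```

Positions that no segment covers return None here, which corresponds to SNPs that cannot be remapped.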
Ok, so far we've merged the SNPs from two files (ensuring the same build in the process and identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let's save the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:
>>> saved_snps = s.to_tsv("out.txt")
Saving output/out.txt
>>> print(saved_snps)
output/out.txt
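For readers unfamiliar with the format, a tab-separated genotype file can be produced with the standard csv module, as in the sketch below. The header names, the "--" placeholder for null genotypes, and the function name are assumptions for illustration and may differ from what to_tsv actually writes.

```python
import csv
import io


def write_tsv(records, handle):
    """Write (rsid, chrom, pos, genotype) records as tab-separated rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header names; check the real output for the exact columns
    writer.writerow(["rsid", "chromosome", "position", "genotype"])
    for rsid, chrom, pos, genotype in records:
        # Assumption: null genotypes are written as "--"
        writer.writerow([rsid, chrom, pos, genotype if genotype is not None else "--"])


# Hypothetical records, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_tsv([("rs1", "1", 100, "AG"), ("rs2", "X", 200, None)], buffer)
print(buffer.getvalue())
```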
Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
>>> saved_snps = s.to_vcf("out.vcf")
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Saving output/out.vcf
1 SNP positions were found to be discrepant when saving VCF
When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked as discrepant and are available via a property of the SNPs object.
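The position check described above can be pictured as a bounds test against each chromosome's reference sequence, which is also where a VCF REF allele would be looked up. The sketch below is a simplified illustration under that assumption, not the snps implementation; the tiny reference sequences, rsids, and function name are hypothetical.

```python
# Hypothetical, tiny reference sequences keyed by chromosome
REFERENCE = {"1": "ACGTACGTAC", "MT": "GATTACA"}


def reference_alleles(snps, reference):
    """Look up the reference base for each SNP (positions are 1-based).

    Returns (ref_by_rsid, discrepant_rsids); SNPs whose position falls
    outside the reference sequence are reported as discrepant.
    """
    ref_by_rsid = {}
    discrepant = []
    for rsid, chrom, pos in snps:
        sequence = reference[chrom]
        if 1 <= pos <= len(sequence):
            ref_by_rsid[rsid] = sequence[pos - 1]  # convert to 0-based index
        else:
            discrepant.append(rsid)
    return ref_by_rsid, discrepant


# Hypothetical SNPs: (rsid, chrom, pos)
snps = [("rs1", "1", 3), ("rs2", "1", 11), ("rs3", "MT", 7)]
refs, discrepant = reference_alleles(snps, REFERENCE)
print(refs, discrepant)
```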
All output files are saved to the output directory.
Documentation is available here.
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP, Open Humans, and Sano Genetics.
snps incorporates code and concepts generated with the assistance of OpenAI's ChatGPT. ✨
snps is licensed under the BSD 3-Clause License.