tools for genetic genealogy and the analysis of consumer DNA test results
apriha/lineage
lineage provides a framework for analyzing genotype (raw data) files from direct-to-consumer (DTC) DNA testing companies, primarily for the purposes of genetic genealogy.
- Find shared DNA and genes between individuals
- Compute centiMorgans (cMs) of shared DNA using a variety of genetic maps (e.g., HapMap Phase II, 1000 Genomes Project)
- Plot shared DNA between individuals
- Find discordant SNPs between child and parent(s)
- Read, write, merge, and remap SNPs for an individual via the snps package
lineage supports all genotype files supported by snps.
lineage is available on the Python Package Index. Install lineage (and its required Python dependencies) via pip:
$ pip install lineage
Also see the installation documentation.
lineage requires Python 3.8+ and the following Python packages:
Import Lineage and instantiate a Lineage object:
>>> from lineage import Lineage
>>> l = Lineage()
First, let's set up logging to get some helpful output:
>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))
Now we're ready to download some example data from openSNP:
>>> paths = l.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
Downloading resources/663.23andme.305.txt.gz
Downloading resources/4583.ftdna-illumina.3482.csv.gz
Downloading resources/4584.ftdna-illumina.3483.csv.gz
We'll call these datasets User662, User663, User4583, and User4584.
Create an Individual in the context of the lineage framework to interact with the User662 dataset:
>>> user662 = l.create_individual('User662', ['resources/662.23andme.340.txt.gz', 'resources/662.ftdna-illumina.341.csv.gz'])
Loading SNPs('662.23andme.340.txt.gz')
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
Here we created user662 with the name User662. In the process, we merged two raw data files for this individual. Specifically:
- 662.23andme.340.txt.gz was loaded.
- Then, 662.ftdna-illumina.341.csv.gz was merged. In the process, it was found to have Build 36. So, it was automatically remapped to Build 37 (downloading the remapping data in the process) to match the build of the SNPs already loaded. After this merge, 27 SNP positions and 151 SNP genotypes were found to be discrepant.
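The merge behavior reported in the log above (keep the original position when positions disagree, mark the genotype as null when genotypes disagree) can be illustrated with a minimal sketch. This is not the snps implementation: the `merge_snps` function, the dictionary layout, and the choice to treat "AG" and "GA" as the same genotype are assumptions made for illustration only.

```python
def merge_snps(base, other):
    """Merge `other` into `base`; both map rsid -> (chrom, pos, genotype).

    Hypothetical helper: keeps the position already loaded when positions
    disagree and sets the genotype to None when genotypes disagree.
    """
    merged = dict(base)
    discrepant_positions = []
    discrepant_genotypes = []
    for rsid, (chrom, pos, genotype) in other.items():
        if rsid not in merged:
            # SNP only present in the second file: add it as-is
            merged[rsid] = (chrom, pos, genotype)
            continue
        base_chrom, base_pos, base_genotype = merged[rsid]
        if (base_chrom, base_pos) != (chrom, pos):
            # keep the original position, but record the discrepancy
            discrepant_positions.append(rsid)
        if base_genotype is None:
            # fill a missing genotype from the second file
            base_genotype = genotype
        elif genotype is not None and sorted(base_genotype) != sorted(genotype):
            # assumption: allele order is ignored when comparing genotypes
            discrepant_genotypes.append(rsid)
            base_genotype = None
        merged[rsid] = (base_chrom, base_pos, base_genotype)
    return merged, discrepant_positions, discrepant_genotypes


# Made-up example data
base = {"rs1": ("1", 100, "AA"), "rs2": ("1", 200, "AG"), "rs3": ("2", 300, None)}
other = {"rs1": ("1", 105, "AA"), "rs2": ("1", 200, "GG"),
         "rs3": ("2", 300, "CT"), "rs4": ("3", 400, "TT")}
merged, bad_positions, bad_genotypes = merge_snps(base, other)
```

In this made-up data, rs1 keeps its original position, rs2 ends up with a null genotype, rs3 is filled in, and rs4 is added.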
user662 is represented by an Individual object, which inherits from snps.SNPs. Therefore, all of the properties and methods available to a SNPs object are available here; for example:
>>> len(user662.discrepant_merge_genotypes)
151
>>> user662.build
37
>>> user662.build_detected
True
>>> user662.assembly
'GRCh37'
>>> user662.count
1006960
As such, SNPs can be saved, remapped, merged, etc. See the snps package for further examples.
Let's create another Individual for the User663 dataset:
>>> user663 = l.create_individual('User663', 'resources/663.23andme.305.txt.gz')
Loading SNPs('663.23andme.305.txt.gz')
Now we can perform some analysis between the User662 and User663 datasets.
First, let's find discordant SNPs (i.e., SNP data that is not consistent with Mendelian inheritance):
>>> discordant_snps = l.find_discordant_snps(user662, user663, save_output=True)
Saving output/discordant_snps_User662_User663_GRCh37.csv
All output files are saved to the output directory (a parameter to Lineage).
This method also returns a pandas.DataFrame, and it can be inspected interactively at the prompt, although the same output is available in the CSV file.
>>> len(discordant_snps.loc[discordant_snps['chrom'] != 'MT'])
37
Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets.
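To make "not consistent with Mendelian inheritance" concrete, here is a minimal sketch of a per-SNP check. It is an illustration under stated assumptions, not the logic lineage uses: the `is_discordant` function and the example rows are hypothetical, genotypes are plain strings of alleles, and missing data and genotyping errors are ignored.

```python
def is_discordant(child, parent1, parent2=None):
    """Return True if the child's genotype cannot be explained by the parent(s).

    Hypothetical helper; genotypes are strings of alleles such as "AG".
    """
    if parent2 is None:
        # one parent: the child must carry at least one of the parent's alleles
        return not (set(child) & set(parent1))
    # two parents: one child allele must come from each parent, in either order
    a, b = child[0], child[-1]
    return not ((a in parent1 and b in parent2) or (a in parent2 and b in parent1))


# Made-up rows of (rsid, chrom, child genotype, parent genotype)
rows = [
    ("rs1", "1", "AG", "AA"),
    ("rs2", "1", "CC", "TT"),
    ("rs3", "MT", "A", "G"),
    ("rs4", "2", "GT", "GG"),
]
discordant = [row for row in rows if is_discordant(row[2], row[3])]
# mirror the filter above by dropping mitochondrial SNPs
non_mt = [row for row in discordant if row[1] != "MT"]
```

Here rs2 and rs3 are flagged as discordant, and only rs2 remains once the MT row is excluded.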
lineage uses the probabilistic recombination rates throughout the human genome from the International HapMap Project and the 1000 Genomes Project to compute the shared DNA (in centiMorgans) between two individuals. Additionally, lineage denotes when the shared DNA is shared on either one or both chromosomes in a pair. For example, when siblings share a segment of DNA on both chromosomes, they inherited the same DNA from their mother and father for that segment.
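As a rough illustration of how a genetic map turns a physical segment into centiMorgans, the sketch below linearly interpolates a cumulative map at the segment's start and end positions and takes the difference. The function names, the `(position, cumulative cM)` layout, and the map values are assumptions for illustration; the actual map files and the interpolation lineage performs may differ.

```python
from bisect import bisect_right


def interpolate_cm(genetic_map, pos):
    """Linearly interpolate the cumulative cM value at base-pair position `pos`.

    `genetic_map` is a list of (position, cumulative_cM) tuples sorted by
    position (a hypothetical layout). Positions outside the map are clamped.
    """
    positions = [p for p, _ in genetic_map]
    if pos <= positions[0]:
        return genetic_map[0][1]
    if pos >= positions[-1]:
        return genetic_map[-1][1]
    i = bisect_right(positions, pos)
    (p0, c0), (p1, c1) = genetic_map[i - 1], genetic_map[i]
    return c0 + (c1 - c0) * (pos - p0) / (p1 - p0)


def segment_cm(genetic_map, start, end):
    """Genetic length of the segment [start, end] in centiMorgans."""
    return interpolate_cm(genetic_map, end) - interpolate_cm(genetic_map, start)


# Made-up map: 1 cM over the first megabase, then 1 cM over the next two
example_map = [(1_000_000, 0.0), (2_000_000, 1.0), (4_000_000, 2.0)]
length = segment_cm(example_map, 1_500_000, 3_000_000)
```

Because recombination rates vary along the genome, two segments of equal physical length can have different genetic lengths, which is why a map is needed at all.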
With that background, let's find the shared DNA between the User662 and User663 datasets, calculating the centiMorgans of shared DNA and plotting the results:
>>> results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100)
Downloading resources/genetic_map_HapMapII_GRCh37.tar.gz
Downloading resources/cytoBand_hg19.txt.gz
Saving output/shared_dna_User662_User663_0p75cM_1100snps_GRCh37_HapMap2.png
Saving output/shared_dna_one_chrom_User662_User663_0p75cM_1100snps_GRCh37_HapMap2.csv
Notice that the centiMorgan and SNP thresholds for each DNA segment can be tuned. Additionally, notice that two files were downloaded to facilitate the analysis and plotting - future analyses will use the downloaded files instead of downloading the files again. Finally, notice that a list of individuals is passed to find_shared_dna... This list can contain an arbitrary number of individuals, and lineage will find shared DNA across all individuals in the list (i.e., where all individuals share segments of DNA on either one or both chromosomes).
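The idea of sharing on one or both chromosomes across a list of individuals, combined with a SNP-count threshold, can be sketched as follows. This is a simplified illustration, not lineage's algorithm: the helper names are hypothetical, genotypes are unphased two-letter strings, and missing data, genotyping errors, and the centiMorgan threshold are all ignored.

```python
def shares_one(genotypes):
    """True if at least one allele is common to every individual's genotype."""
    common = set(genotypes[0])
    for genotype in genotypes[1:]:
        common &= set(genotype)
    return bool(common)


def shares_two(genotypes):
    """True if every individual has the same genotype (allele order ignored)."""
    first = sorted(genotypes[0])
    return all(sorted(genotype) == first for genotype in genotypes[1:])


def find_runs(flags, min_snps):
    """Return inclusive (start, end) index pairs for runs of True values
    that span at least `min_snps` consecutive SNPs."""
    runs = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(flags) - start >= min_snps:
        runs.append((start, len(flags) - 1))
    return runs


# Made-up genotypes for three individuals at six consecutive SNPs
snps = [
    ("AA", "AG", "AA"),
    ("CT", "CC", "CT"),
    ("GG", "GG", "GG"),
    ("AA", "GG", "AG"),
    ("TT", "CT", "TT"),
    ("CC", "CC", "CC"),
]
one_chrom_runs = find_runs([shares_one(g) for g in snps], min_snps=3)
two_chrom_runs = find_runs([shares_two(g) for g in snps], min_snps=1)
```

Raising `min_snps` in this sketch discards short runs, which is the same kind of trade-off the SNP threshold controls: fewer spurious matches at the cost of missing short segments.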
Output is returned as a dictionary with the following keys (pandas.DataFrame and pandas.Index items):
>>> sorted(results.keys())
['one_chrom_discrepant_snps', 'one_chrom_shared_dna', 'one_chrom_shared_genes', 'two_chrom_discrepant_snps', 'two_chrom_shared_dna', 'two_chrom_shared_genes']
In this example, there are 27 segments of shared DNA:
>>> len(results['one_chrom_shared_dna'])
27
Also, output files are created; these files are detailed in the documentation and their generation can be disabled with a save_output=False argument. In this example, the output files consist of a CSV file that details the shared segments of DNA on one chromosome and a plot that illustrates the shared DNA:
The Central Dogma of Molecular Biology states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology after all), but generally, one mRNA produces one protein, and the mRNA / protein is considered a gene.
Therefore, it would be interesting to understand not just what DNA is shared between individuals, but what genes are shared between individuals with the same variations. In other words, what genes are producing the same proteins? [*] Since lineage can determine the shared DNA between individuals, it can use that information to determine what genes are also shared on either one or both chromosomes.
[*] In theory, shared segments of DNA should be producing the same proteins, but there are many complexities, such as copy number variation (CNV), gene expression, etc.
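Going from shared segments to shared genes is essentially an interval lookup. The sketch below assumes a gene counts as shared only when its transcribed region lies entirely inside a shared segment; that containment rule, the `genes_in_segments` function, and the gene names and coordinates are all hypothetical and may not match how lineage decides.

```python
def genes_in_segments(genes, segments):
    """Return the names of genes fully contained in any shared segment.

    `genes` holds (name, chrom, tx_start, tx_end) tuples and `segments`
    holds (chrom, start, end) tuples; both layouts are assumptions.
    """
    shared = []
    for name, chrom, tx_start, tx_end in genes:
        for seg_chrom, seg_start, seg_end in segments:
            if chrom == seg_chrom and seg_start <= tx_start and tx_end <= seg_end:
                shared.append(name)
                break  # one containing segment is enough
    return shared


# Made-up genes and shared segments
genes = [
    ("GENE_A", "1", 1000, 2000),
    ("GENE_B", "1", 4500, 6000),
    ("GENE_C", "2", 1500, 1800),
]
segments = [("1", 500, 5000), ("2", 1000, 3000)]
shared_genes = genes_in_segments(genes, segments)
```

GENE_B is excluded in this example because it extends past the end of the segment on chromosome 1.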
For this example, let's create two more Individuals for the User4583 and User4584 datasets:
>>> user4583 = l.create_individual('User4583', 'resources/4583.ftdna-illumina.3482.csv.gz')
Loading SNPs('4583.ftdna-illumina.3482.csv.gz')
>>> user4584 = l.create_individual('User4584', 'resources/4584.ftdna-illumina.3483.csv.gz')
Loading SNPs('4584.ftdna-illumina.3483.csv.gz')
Now let's find the shared genes, specifying a population-specific 1000 Genomes Project genetic map (e.g., as predicted by ezancestry!):
>>> results = l.find_shared_dna([user4583, user4584], shared_genes=True, genetic_map="CEU")
Downloading resources/CEU_omni_recombination_20130507.tar
Downloading resources/knownGene_hg19.txt.gz
Downloading resources/kgXref_hg19.txt.gz
Saving output/shared_dna_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.png
Saving output/shared_dna_one_chrom_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_dna_two_chroms_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_genes_one_chrom_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_genes_two_chroms_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
The plot that illustrates the shared DNA is shown below. Note that in addition to outputting the shared DNA segments on either one or both chromosomes, the shared genes on either one or both chromosomes are also output.
Note
Shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic maps since the X chromosome is not included in these genetic maps.
In this example, there are 77,776 shared genes/transcripts on both chromosomes transcribed from 36 segments of shared DNA:
>>> len(results['two_chrom_shared_genes'])
77776
>>> len(results['two_chrom_shared_dna'])
36
Documentation is available here.
Thanks to Whit Athey, Ryan Dale, Binh Bui, Jeff Gill, Gopal Vashishtha, CS50, and openSNP.
lineage incorporates code and concepts generated with the assistance of OpenAI's ChatGPT. ✨
lineage is licensed under the MIT License.