tools for genetic genealogy and the analysis of consumer DNA test results


apriha/lineage


https://raw.githubusercontent.com/apriha/lineage/main/docs/images/lineage_banner.png


lineage

lineage provides a framework for analyzing genotype (raw data) files from direct-to-consumer (DTC) DNA testing companies, primarily for the purposes of genetic genealogy.

Capabilities

  • Find shared DNA and genes between individuals
  • Compute centiMorgans (cMs) of shared DNA using a variety of genetic maps (e.g., HapMap Phase II, 1000 Genomes Project)
  • Plot shared DNA between individuals
  • Find discordant SNPs between child and parent(s)
  • Read, write, merge, and remap SNPs for an individual via the snps package

Supported Genotype Files

lineage supports all genotype files supported by snps.

Installation

lineage is available on the Python Package Index. Install lineage (and its required Python dependencies) via pip:

$ pip install lineage

Also see the installation documentation.

Dependencies

lineage requires Python 3.8+ and several Python packages, including snps and pandas; pip installs them automatically.

Examples

Initialize the lineage Framework

Import Lineage and instantiate a Lineage object:

>>> from lineage import Lineage
>>> l = Lineage()

Download Example Data

First, let's set up logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we're ready to download some example data from openSNP:

>>> paths = l.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
Downloading resources/663.23andme.305.txt.gz
Downloading resources/4583.ftdna-illumina.3482.csv.gz
Downloading resources/4584.ftdna-illumina.3483.csv.gz

We'll call these datasets User662, User663, User4583, and User4584.

Load Raw Data

Create an Individual in the context of the lineage framework to interact with the User662 dataset:

>>> user662 = l.create_individual('User662', ['resources/662.23andme.340.txt.gz', 'resources/662.ftdna-illumina.341.csv.gz'])
Loading SNPs('662.23andme.340.txt.gz')
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null

Here we created user662 with the name User662. In the process, we merged two raw data files for this individual. Specifically:

  • 662.23andme.340.txt.gz was loaded.
  • Then, 662.ftdna-illumina.341.csv.gz was merged. In the process, it was found to have Build 36, so it was automatically remapped to Build 37 (downloading the remapping data in the process) to match the build of the SNPs already loaded. After this merge, 27 SNP positions and 151 SNP genotypes were found to be discrepant.
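To make the two kinds of discrepancy concrete, here is a minimal sketch of a merge that keeps the original position when positions disagree and nulls the genotype when calls disagree, mirroring the log messages above. The merge_snps function and the dictionary layout are hypothetical and written only for illustration; the snps package's actual merge logic may handle these cases differently.

```python
def merge_snps(base, other):
    """Merge `other` into `base`; both map rsid -> (chrom, pos, genotype).

    Hypothetical helper, not part of lineage or snps.
    Returns (merged, discrepant_positions, discrepant_genotypes).
    """
    merged = dict(base)
    discrepant_positions = []
    discrepant_genotypes = []
    for rsid, (chrom, pos, genotype) in other.items():
        if rsid not in merged:
            merged[rsid] = (chrom, pos, genotype)  # SNP only in `other`
            continue
        base_chrom, base_pos, base_genotype = merged[rsid]
        if (chrom, pos) != (base_chrom, base_pos):
            # Positions disagree: record it and keep the original position.
            discrepant_positions.append(rsid)
        if genotype is None:
            continue  # nothing to merge for a missing call
        if base_genotype is None:
            # Fill in a genotype that was missing in `base`.
            merged[rsid] = (base_chrom, base_pos, genotype)
        elif sorted(genotype) != sorted(base_genotype):
            # Calls disagree: record it and mark the genotype as null.
            discrepant_genotypes.append(rsid)
            merged[rsid] = (base_chrom, base_pos, None)
    return merged, discrepant_positions, discrepant_genotypes


# Made-up example data
base = {
    "rs1": ("1", 1000, "AA"),
    "rs2": ("1", 2000, "AG"),
    "rs3": ("2", 3000, None),
}
other = {
    "rs2": ("1", 2000, "GA"),  # same call, alleles in a different order
    "rs3": ("2", 3005, "CC"),  # position differs; fills a missing genotype
    "rs1": ("1", 1000, "AG"),  # conflicting genotype
    "rs4": ("3", 4000, "TT"),  # new SNP
}
merged, bad_positions, bad_genotypes = merge_snps(base, other)
print(bad_positions)  # ['rs3']
print(bad_genotypes)  # ['rs1']
```

In this toy run, rs3 keeps position 3000 and rs1 ends up with a null genotype, which corresponds to what discrepant_merge_genotypes counts in the real merge.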

user662 is represented by an Individual object, which inherits from snps.SNPs. Therefore, all of the properties and methods available to a SNPs object are available here; for example:

>>> len(user662.discrepant_merge_genotypes)
151
>>> user662.build
37
>>> user662.build_detected
True
>>> user662.assembly
'GRCh37'
>>> user662.count
1006960

As such, SNPs can be saved, remapped, merged, etc. See the snps package for further examples.

Compare Individuals

Let's create another Individual for the User663 dataset:

>>> user663 = l.create_individual('User663', 'resources/663.23andme.305.txt.gz')
Loading SNPs('663.23andme.305.txt.gz')

Now we can perform some analysis between the User662 and User663 datasets.

First, let's find discordant SNPs (i.e., SNP data that is not consistent with Mendelian inheritance):

>>> discordant_snps = l.find_discordant_snps(user662, user663, save_output=True)
Saving output/discordant_snps_User662_User663_GRCh37.csv

All output files are saved to the output directory (a parameter to Lineage).

This method also returns a pandas.DataFrame, which can be inspected interactively at the prompt; the same data is available in the CSV file.

>>> len(discordant_snps.loc[discordant_snps['chrom'] != 'MT'])
37

Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets.
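As a rough illustration of what "not consistent with Mendelian inheritance" means for a child and one parent, the sketch below flags a SNP when the two genotypes have no allele in common. The function names and data are hypothetical, and the rule is a simplification: lineage's own method also accepts a second parent and may treat sex chromosomes and mtDNA differently.

```python
def is_discordant(child, parent):
    """Return True if the child shares no allele with the parent at a SNP.

    Genotypes are two-letter strings such as "AG"; None means no call.
    Hypothetical helper for illustration only.
    """
    if child is None or parent is None:
        return False  # a missing call cannot be judged
    return not (set(child) & set(parent))


def find_discordant(child_snps, parent_snps):
    """Return the sorted rsids at which two rsid -> genotype dicts are discordant."""
    common = child_snps.keys() & parent_snps.keys()
    return sorted(
        rsid for rsid in common
        if is_discordant(child_snps[rsid], parent_snps[rsid])
    )


# Made-up genotypes
child = {"rs1": "AA", "rs2": "AG", "rs3": "CC", "rs4": None}
parent = {"rs1": "AG", "rs2": "GG", "rs3": "TT", "rs4": "AA"}
print(find_discordant(child, parent))  # ['rs3']
```

Only rs3 is flagged: a child with CC must have received a C from each parent, which a TT parent cannot supply.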

lineage uses the probabilistic recombination rates throughout the human genome from the International HapMap Project and the 1000 Genomes Project to compute the shared DNA (in centiMorgans) between two individuals. Additionally, lineage denotes when the shared DNA is shared on either one or both chromosomes in a pair. For example, when siblings share a segment of DNA on both chromosomes, they inherited the same DNA from their mother and father for that segment.
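One common way to turn a genetic map into a segment length is to interpolate the cumulative map position (in cM) at the segment's start and end and take the difference. The sketch below assumes a map given as sorted (base-pair position, cumulative cM) pairs; the function names, the toy map, and this layout are assumptions for illustration and are not taken from lineage's implementation or the actual map file format.

```python
from bisect import bisect_right


def genetic_position(genetic_map, pos):
    """Linearly interpolate the cumulative cM value at base-pair `pos`.

    `genetic_map` is a list of (base_pair_position, cumulative_cM) tuples
    sorted by position. Positions outside the map are clamped to its ends.
    """
    positions = [p for p, _ in genetic_map]
    if pos <= positions[0]:
        return genetic_map[0][1]
    if pos >= positions[-1]:
        return genetic_map[-1][1]
    i = bisect_right(positions, pos)  # first map point to the right of pos
    (p0, c0), (p1, c1) = genetic_map[i - 1], genetic_map[i]
    return c0 + (c1 - c0) * (pos - p0) / (p1 - p0)


def segment_cm(genetic_map, start, end):
    """Length of the segment [start, end] in centiMorgans."""
    return genetic_position(genetic_map, end) - genetic_position(genetic_map, start)


# Toy map: 1 cM over the first megabase, then 1 cM over the next two
toy_map = [(1_000_000, 0.0), (2_000_000, 1.0), (4_000_000, 2.0)]
length = segment_cm(toy_map, 1_500_000, 3_000_000)
print(length)  # 1.0
```

Because recombination rates vary along the genome, two segments with the same length in base pairs can differ considerably in centiMorgans, which is why the choice of genetic map matters.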

With that background, let's find the shared DNA between the User662 and User663 datasets, calculating the centiMorgans of shared DNA and plotting the results:

>>> results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100)
Downloading resources/genetic_map_HapMapII_GRCh37.tar.gz
Downloading resources/cytoBand_hg19.txt.gz
Saving output/shared_dna_User662_User663_0p75cM_1100snps_GRCh37_HapMap2.png
Saving output/shared_dna_one_chrom_User662_User663_0p75cM_1100snps_GRCh37_HapMap2.csv

Notice that the centiMorgan and SNP thresholds for each DNA segment can be tuned. Additionally, notice that two files were downloaded to facilitate the analysis and plotting; future analyses will use the downloaded files instead of downloading them again. Finally, notice that a list of individuals is passed to find_shared_dna. This list can contain an arbitrary number of individuals, and lineage will find shared DNA across all individuals in the list (i.e., where all individuals share segments of DNA on either one or both chromosomes).
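To show how the two thresholds interact, here is a simplified sketch that scans one chromosome for runs of matching SNPs and keeps only the runs that exceed both thresholds. Everything in it is an assumption for illustration: the helper names, the tuple layout, the strict "greater than" comparison, and the requirement that every genotype is called. lineage's actual segment detection may differ in all of these respects.

```python
def shares_one(a, b):
    """At least one allele in common (shared on one chromosome or both)."""
    return bool(set(a) & set(b))


def shares_two(a, b):
    """Same pair of alleles, ignoring order (shared on both chromosomes)."""
    return sorted(a) == sorted(b)


def find_segments(snps, match, cM_threshold=0.75, snp_threshold=1100):
    """Return (start, end, cMs, snp_count) for each qualifying run of matches.

    `snps` is a position-sorted list of (pos, cumulative_cM, genotype_a,
    genotype_b) tuples for one chromosome, with no missing genotypes.
    """
    segments, run = [], []

    def close_run():
        # Keep the current run only if it exceeds both thresholds.
        if run:
            cms = run[-1][1] - run[0][1]
            if cms > cM_threshold and len(run) > snp_threshold:
                segments.append((run[0][0], run[-1][0], cms, len(run)))
        run.clear()

    for snp in snps:
        if match(snp[2], snp[3]):
            run.append(snp)
        else:
            close_run()  # a mismatch ends the current run
    close_run()
    return segments


# Made-up data; thresholds are lowered to suit the tiny example
snps = [
    (100, 0.0, "AA", "AG"),
    (200, 0.5, "CC", "CC"),
    (300, 1.0, "GT", "TG"),
    (400, 1.5, "AA", "GG"),  # no allele in common: ends the run
    (500, 2.0, "CT", "CT"),
]
one = find_segments(snps, shares_one, cM_threshold=0.75, snp_threshold=2)
two = find_segments(snps, shares_two, cM_threshold=0.25, snp_threshold=1)
print(one)  # [(100, 300, 1.0, 3)]
print(two)  # [(200, 300, 0.5, 2)]
```

Raising either threshold discards short runs that are more likely to match by chance, at the cost of missing small segments that are truly shared.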

Output is returned as a dictionary with the following keys (pandas.DataFrame and pandas.Index items):

>>> sorted(results.keys())
['one_chrom_discrepant_snps', 'one_chrom_shared_dna', 'one_chrom_shared_genes', 'two_chrom_discrepant_snps', 'two_chrom_shared_dna', 'two_chrom_shared_genes']

In this example, there are 27 segments of shared DNA:

>>> len(results['one_chrom_shared_dna'])
27

Also, output files are created; these files are detailed in the documentation, and their generation can be disabled with a save_output=False argument. In this example, the output files consist of a CSV file that details the shared segments of DNA on one chromosome and a plot that illustrates the shared DNA:

https://raw.githubusercontent.com/apriha/lineage/main/docs/images/shared_dna_User662_User663_0p75cM_1100snps_GRCh37_HapMap2.png

The Central Dogma of Molecular Biology states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology after all), but generally, one mRNA produces one protein, and the mRNA / protein is considered a gene.

Therefore, it would be interesting to understand not just what DNA is shared between individuals, but what genes are shared between individuals with the same variations. In other words, what genes are producing the same proteins? [*] Since lineage can determine the shared DNA between individuals, it can use that information to determine what genes are also shared on either one or both chromosomes.

[*] In theory, shared segments of DNA should be producing the same proteins, but there are many complexities, such as copy number variation (CNV), gene expression, etc.
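Conceptually, going from shared DNA to shared genes is an interval lookup: given the shared segments and a table of gene coordinates, report the genes that fall within a segment. The sketch below assumes a gene counts as shared only when it lies entirely inside a segment; that rule, the function name, and the gene names and coordinates are all hypothetical and are not taken from lineage's implementation.

```python
def genes_in_segments(genes, segments):
    """Return names of genes that lie entirely within a shared segment.

    `genes` holds (name, chrom, start, end) tuples and `segments` holds
    (chrom, start, end) tuples; coordinates are base-pair positions.
    """
    found = []
    for name, chrom, start, end in genes:
        for seg_chrom, seg_start, seg_end in segments:
            if chrom == seg_chrom and seg_start <= start and end <= seg_end:
                found.append(name)
                break  # one containing segment is enough
    return found


# Made-up segments and genes
segments = [("1", 1_000, 5_000), ("2", 10_000, 20_000)]
genes = [
    ("GENE_A", "1", 1_500, 2_500),    # inside the first segment
    ("GENE_B", "1", 4_500, 6_000),    # extends past the segment end
    ("GENE_C", "2", 12_000, 18_000),  # inside the second segment
    ("GENE_D", "3", 100, 200),        # no shared DNA on this chromosome
]
print(genes_in_segments(genes, segments))  # ['GENE_A', 'GENE_C']
```

The nested loop is fine for a handful of segments; with genome-wide gene tables, sorting by position or using an interval index would scale better.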

For this example, let's create two more Individuals for the User4583 and User4584 datasets:

>>> user4583 = l.create_individual('User4583', 'resources/4583.ftdna-illumina.3482.csv.gz')
Loading SNPs('4583.ftdna-illumina.3482.csv.gz')
>>> user4584 = l.create_individual('User4584', 'resources/4584.ftdna-illumina.3483.csv.gz')
Loading SNPs('4584.ftdna-illumina.3483.csv.gz')

Now let's find the shared genes, specifying a population-specific 1000 Genomes Project genetic map (e.g., as predicted by ezancestry!):

>>> results = l.find_shared_dna([user4583, user4584], shared_genes=True, genetic_map="CEU")
Downloading resources/CEU_omni_recombination_20130507.tar
Downloading resources/knownGene_hg19.txt.gz
Downloading resources/kgXref_hg19.txt.gz
Saving output/shared_dna_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.png
Saving output/shared_dna_one_chrom_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_dna_two_chroms_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_genes_one_chrom_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv
Saving output/shared_genes_two_chroms_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.csv

The plot that illustrates the shared DNA is shown below. Note that in addition to outputting the shared DNA segments on either one or both chromosomes, the shared genes on either one or both chromosomes are also output.

Note

Shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic maps since the X chromosome is not included in these genetic maps.

In this example, there are 77,776 shared genes/transcripts on both chromosomes transcribed from 36 segments of shared DNA:

>>> len(results['two_chrom_shared_genes'])
77776
>>> len(results['two_chrom_shared_dna'])
36

https://raw.githubusercontent.com/apriha/lineage/main/docs/images/shared_dna_User4583_User4584_0p75cM_1100snps_GRCh37_CEU.png

Documentation

Documentation is available here.

Acknowledgements

Thanks to Whit Athey, Ryan Dale, Binh Bui, Jeff Gill, Gopal Vashishtha, CS50, and openSNP.

lineage incorporates code and concepts generated with the assistance of OpenAI's ChatGPT. ✨

License

lineage is licensed under the MIT License.
