makovalab-psu/RecoverYPublic

NotificationsYou must be signed in to change notification settings
Fork5
Star10

K-mer based read classification for Y-chromosome specific sequencing and assembly

License

MIT license

10 stars 5 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
data		data
dependency		dependency
img		img
scripts		scripts
LICENSE.md		LICENSE.md
README.md		README.md
recoverY.py		recoverY.py

Repository files navigation

RecoverY

RecoverY is a tool for shortlisting enriched reads from a sequencing dataset, based on k-mer abundance. Specifically, it can be used for isolating Y-specific reads from a Y flow-sorted dataset.

Usage

Before running RecoverY, the following input files are required in ./data folder.Note that currently the names of the "data" folder and of the files are hardcoded into RecoverY.

r1.fastq : Enriched raw reads (first in pair) r2.fastq : Enriched raw reads (second in pair) kmers_from_reads : kmer counts for r1.fastqtrusted_kmers : kmers occuring in one copy on the Y chromosome.

See below for more info on generating these files.

A typical run of RecoverY looks like this.

python recoverY.py --read_length 250 --kmer_size 31 --Ymer_match_threshold 50 --threads 8

RecoverY accepts the following parameters.

python recoverY.py [--help] [--read_length READ_LENGTH] [--kmer_size KMER_SIZE]               [--Ymer_match_threshold YMER_MATCH_THRESHOLD]       [--abundance_threshold ABUNDANCE_THRESHOLD]               [--threads THREADS] [--plots]

--kmer_size (default: 25): kmer size used for classifying reads. This must be the same as DSK's k-mer-size. We recommend a value between 25 and 31 for Illumina 150x150 bp reads. This value is used for calculating Ymer_match_threshold below
--read_length (default:150) : The length of longest un-trimmed read input to RecoverY. This value is used for calculating Ymer_match_threshold below
--Ymer_match_threshold : the number of k-mers a read must match to the Ymer table in order to be classified as coming from the Y. The default value is calculated by the formula : 0.4 * (read_len - kmer_size + 1 - (2*kmer_size*read_len/100)). User may change this, but we recommend a value between 20 and 50 for Illumina 150x150 bp reads.
--abundance_threshold : the abundance threshold beyond which all k-mers are assumed to be Ymers. Specify this only if user does not have access to a trusted k-mers file and would like to manually set an abundance threshold.
--threads (default: 2): The number of threads that RecoverY should use.
--help: print usage information.
--plots: generates a k-mer abundance plot inoutput/kmerplot.png (requires Matplotlib and Seaborn python packages to be installed, see below). This plot visualizes the abundance threshold selected by RecoverY, by plotting the abundance of raw read k-mers as well as trusted k-mers. Seehere for an example.

The output of RecoverY are two files:output/op_r1.fastq andoutput/op_r2.fastq.This is a subset of the read pairs indata/r1.fastq anddata/r2.fastq that are deemed to have originated on the Y chromosome.

Installation

To download,

git clone https://github.com/makovalab-psu/RecoverY

RecoverY also requires the numpy and biopython python packages in order to run.Additionally, if you use the--plot option, RecoverY needs the matplotlib and seaborn packages for python.However, RecoverY can be run without the matplotlib or seaborn packages, as long as the--plot option is not used.These packages can be installed on many systems as follows:

pip install numpypip install biopythonpip install matplotlibpip install seaborn

RecoverY also uses the k-mer counter DSK. The latest DSK binaries (v2.2.0 for Linux 64 bit and v2.2.0 for Mac OSX) are provided in the dependency folder. Thus, if you are using either of these operating systems, DSK need not be installed, and you may use the binaries as provided. For other operating systems, or if alternate versions or functionality of DSK is desired, seehttps://gatb.inria.fr/software/dsk/.

Example

To test if RecoverY works properly, thedata folder comes with an example dataset.The example read data is a random subsample of 100,000 reads from a larger simulated data set in which reads from hg38 autosomes are sampled at 6x depth, X chromosome at 3x depth, and Y chromosome at 300x depth (see Section 3.1 in the paper).

To test the example, you must first untar it and then run RecoverY, as follows:

cd data/tar xf kmers_from_reads.tar.xzcd ../python recoverY.py

The reads in this example contain their chromosome of origin as part of their FASTQ header.We can therefore check the number of correctly retrieved Y-reads by RecoverY:

grep "@chrY" data/r1.fastq | wc -lgrep "@chrY" output/op_r1.fastq | wc -l

The result should be32116 and31468, respectively.This indicates that RecoverY correctly identified most of the Y chromosome reads from the sample.

Input files

Generating`trusted_kmers`

The providedtrusted_kmers file contains those k-mers that occur exactly oncein the X-degenerate gene sequences from the human Y chromosome. It can be used without any modifications.

Alternatively, the user can generate their owntrusted_kmers file, for example, by using the X-degenerate gene sequencesfrom a different species, or using other single-copy regions of a Y chromosome.The file can be generated by running DSK k-mer counter on the relevant sequences and extracting only k-mers with a count of 1.Thetrusted_kmers file should contain one kmer per line.Extra information after the kmer is allowed but ignored.

Generating`kmers_from_reads`

Thekmers_from_reads is a file which includes a line for every distinct kmer inr1.fastq, andeach line contains the kmer sequence followed by whitespace followed by the number of times it occurs inr1.fastq.For example, the first three lines could look like this.

AAAAAAAAAAAAAAAAGAAAAACAA 5AAAAAAAAAAAAAAACAAGCTGAAT 8AAAAAAAAAAAAAAAGAAAAACAAA 3

Thekmers_from_reads file can be generated by the k-mer counter DSK. The ./dependency folder contains DSK binaries for Linux 64 bit and Mac OSX. Usage is as follows (example shown below is for a Linux system) :

cd dependency./run_dsk_Linux.sh <FASTQ_file> <kmer_size>

If the k-mer counts file for raw reads (r1.fastq) is not already provided, the user may need to generate k-mer counts manually using DSK. To generate k-mer counts with DSK, the following steps are needed :

cd dependency ln -s ../data/r1.fastq   # make sure the correct reads file is provided to DSK./run_dsk_Linux.sh r1.fastq 25

The kmer_counts table will be generated in :

dependency/dsk_output/kmers_from_reads

This file can be copied or linked to the data folder so that RecoverY can use it :

cd ../dataln -s ../dependency/kmers_from_reads

RecoverY ignored any k-mer inkmers_from_reads that has a count less than 3.If using DSK to generate the file, k-mers with counts less than 3 are discarded by default,thereby saving space (i.e. --abundance-min is set to 3 by default in DSK) .

Miscellaneous

The following scripts are included with this distribution of RecoverY, and are automatically run by recovery.py as part of the pipeline. Users may consider them separately for custom needs if required.

kmers.py

a set of general purpose functions to work with kmers

kmerPaint.py

input : trusted_kmers and reads_from_kmers output : Ymer_table with new abundance threshold

classify_as_Y_chr.py

input : all raw reads (first in pair) and Ymer tableoutput : Y-specific reads according to RecoverY algorithm (first in pair)

find_mates.py

input : all raw reads (second in pair) and Y-specific reads accoding to RecoverY algorithm (first in pair)output : Y-specific reads according to RecoverY algorithm (second in pair)

License

This program is released under the MIT License. Please see LICENSE.md for details

Citation

If you use RecoverY in your research, please cite

RecoverY : k-mer-based read classification for Y-chromosome-specific sequencing and assembly,

Samarth Rangavittal, Robert S. Harris, Monika Cechova, Marta Tomaszkiewicz, Rayan Chikhi, Kateryna Makova, Paul Medvedev

Bioinformatics 2017https://doi.org/10.1093/bioinformatics/btx771.

About

K-mer based read classification for Y-chromosome specific sequencing and assembly

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

RecoverY

Usage

Installation

Example

Input files

Generating`trusted_kmers`

Generating`kmers_from_reads`

Miscellaneous

License

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Contributors3

Uh oh!

Languages

Movatterモバイル変換

License

makovalab-psu/RecoverY

Folders and files

Latest commit

History

Repository files navigation

RecoverY

Usage

Installation

Example

Input files

Generatingtrusted_kmers

Generatingkmers_from_reads

Miscellaneous

License

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Contributors3

Uh oh!

Languages

Generating`trusted_kmers`

Generating`kmers_from_reads`

Packages