- Notifications
You must be signed in to change notification settings - Fork1
Evaluation tools for "A performant bridge between fixed-size and variable-size seeding"
License
ITBE-Lab/seed-evaluation
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This repository contains the scripts for the experiments performed inA performant bridge between fixed-size and variable-size seeding.
❗ The pseudocode of Algorithm 2b contains an error: The Build-Max-Heap operation in line 7 should be "descending byl; for equall descending byq" (more info below) |
---|
Name | Linux install command | Recommended version | Remarks |
---|---|---|---|
This repo | git clone https://github.com/ITBE-Lab/seed-evaluation | latest | Core evaluation scripts. |
DWGSIM | sudo apt-get install zlib1g-dev; sudo apt-get install libncurses5-dev; git clone --recursive https://github.com/nh13/DWGSIM; make -j$(nproc) | 0.1.11 | Tool for generating Illumina reads. |
SURVIVOR | git clone https://github.com/ITBE-Lab/SURVIVOR; cd SURVIVOR/Debug; make -j$(nproc); cd ..; unzip *.zip | 1.0.5 | Tool for generating PacBio reads. (Modified by us, to generate specific amounts of reads.) |
Python 3 | sudo apt-get install python3 | 3.5.3 | Python 3 environment. |
Bokeh | sudo apt-get install python3-pip; pip3 install bokeh | 1.4.0 | Plotting library. |
MA - The Modular Aligner | see below | 1.1.1-ef9ab22 | C++ library implementing all runtime critical code. |
cmake | sudo apt-get install cmake | 3.13.2 | Compiling MA. |
Our testing environment: Debian GNU/Linux with a 4.9.0 kernel.
The MA github page can be foundhere.
The following sequence of commands creates the MA library:
git clone https://github.com/ITBE-Lab/MAgit checkout b7cf5e7 # commit used for experimentsmkdir buildcd buildcmake -DWITH_PYTHON=ON ../MA/ # with python required for evaluation scriptsmake -j$(nproc)cd ..export PYTHONPATH=$PYTHONPATH:`pwd`/build:`pwd`/MA/python # setup system environment
Type./build/maCMD
for checking if MA was built successfully.
If you get an error during the cmake setup or compilation, here are some things that might have gone wrong:
- MA is written in C++17, so you will need an appropriate compiler. We recommend GCC 6.3.0 or above.
- Multiple Python 3 instances on your system can confuse cmake.
- The export command (last line in the above script) is not persistent between different terminals and logins.
config.py
contains the configuration:
- Set
dwgsim_str
,survivor_str
andsurvivor_error_profile
to the appropriate paths for your system. - Set
prefix
to the folder that shall contain the temporary data and output data. - Create an FMD-Index via
./build/maCMD --Create_Index <fasta_file_name>,<output_folder>,<index_name>
, where<fasta_file_name>
is the FASTA file containing the reference genome.
Setreference_genome_path
to<output_folder>/<index_name>
.
Once everything is configured, you can runpython3 compute_times.py
.
This will trigger 4 functions (very bottom of the script):
read_generation
: Generates the reads for all experiments and saves them in theprefix
folder. Hence, this needs to be run only once and can be removed once the reads have been generated.runtime_analysis
: Performs the time evaluation (Figure 4 of the manuscript) and generates all indices.seed_entropy_analysis
: Performs the entropy analysis (Figure 5 of the manuscript).seed_set_diff_analysis
: Performs the seed set difference analysis (Figure 3 of the manuscript).
Tip: Double clicking on a plot will toggle the legend visibility.
The published pseudocode for extracting maximal spanning seeds from MEMs contains an error:
The Build-Max-Heap operation in line 7 should be "descending byl; for equall descending byq" so that thefirst element in the max-heap is the largest seed (there can be multiple seeds with the same size) that reaches the furthest right.
This error was only in the pseudocode; the actual implementation and measurements are correct.
Further, a max-heap is not actually required. Instead a single iteration overT is enough to extract all relevant seeds:
We would like to thank Roman Cheplyaka for pointing out our mistake as well as this optimization.