tdseher/addtag-project

CRISPR/Cas-directed HDR genome editing suite: finds+scores gRNA targets, generates donor DNAs, & produces optimal cPCR primer designs.

License

AGPL-3.0 license

CRISPR/Cas AddTag Readme

Program for identifying exclusive endogenous gRNA sites and creating unique synthetic gRNA sites.

Features • Requirements • Installing • Usage • Aligners • Thermodynamics • Algorithms • Citing • Contributing

☑ Features

Basic Features:

Supports both direct (1-step) and indirect (2-step) genome editing through CRISPR/Cas-induced homology-directed repair (HDR).
Analyzes arbitrary genomic DNA (gDNA).
- Fully supports ambiguous characters or polymorphisms (RYMKWSBDHVN) in genome contigs.
- Respects case-masked gDNA for Target and Primer identification.
Uses an intuitive syntax to locate RNA-guided nuclease (RGN) cut sites (Targets) within a locus of interest (Feature).
- Fully supports ambiguous bases (RYMKWSBDHVN) in Spacer or PAM sequences.
- Accepts 3'-adjacent PAM sequences, such as Cas9 (>NGG).
- Accepts 5'-adjacent PAM sequences, such as Cas12a (TTTN<).
- Supports arbitrary Spacer length and composition constraints, such as for plant experiments (G{,2}N{19,20}).
- Supports arbitrary PAM sequences (MAD7:YTTN<, Cas12d:TA<, BlCas9:>NGGNCNDD, etc).
- Supports any number of stranded forward (/), reverse (\) and unstranded (|) cut sites.
- Supports PAM sequences defined by complex nested logic, such as xCas9 (>(N{1,2}G,GAW,CAA)).
Simultaneously calculates any number of on-target and off-target scores (see Algorithms).
- Includes a "weight" calculation for balancing both on-target and off-target scores.
- The "weight" allows for comparing efficiency and specificity between Targets from different RGNs.
Searches for Targets using a selectable pairwise alignment program (see Aligners).
Generates exogenous, donor DNA (dDNA) sequences for modifying the same locus successively.
- Assembles unique Target sites on dDNA so the locus can be edited again (addtag).
- Adds unique Targets to dDNAs while introducing minimal amounts of extrinsic DNA (mintag).
Engineers a single set of verification PCR (vPCR) Primers for assessing genome editing.
- Performs in silico recombination between gDNA and dDNAs to predict the genome sequences after editing.
- Same Primers work for all genotypes (reference, intermediary, and add-back).
- Positive amplification shows if the Feature was edited correctly.
- A different, positive amplification shows if the Feature was edited incorrectly.
- Determines thermodynamic properties of Primer pairs (Tm, minimum ΔG, amplicon size, etc).
- Uses a genetic algorithm to select Primer sequences that have compatible properties, so they can be run in parallel with the same thermal cycler conditions.
Facilitates ploidy-aware editing (multi-allelic, allele-specific, and allele-agnostic).
- Identifies ploidy-aware Targets.
- Produces dDNAs that have ploidy-aware homology arms.
Contains the most-complete index of all known Target motifs for RGNs.

📋 Requirements

Hardware recommendations

Processor:

≥ 4 cores, ≥ 3 GHz

Computations scale fairly linearly, so the more computational cores you can assign to the task, the faster it will go.

Memory:

≥ 4 Gb (for evaluation)

See Notes for tips on memory optimization.

Software requirements

Below are lists of AddTag requirements. Each entry is marked with a 🗹 or ☐, indicating whether or not an additional download/setup is required:

🗹 All requirements included in AddTag
☐ Additional download/setup required

For tips on setting up AddTag requirements, please review the commands in the .azure-pipelines.yml file.

Basic prerequisites

Base operation of AddTag requires the following:

Python ≥ 3.5.1 (source, binaries, documentation)
regex Python module (source, whls, documentation)

Certain optional AddTag functionality (version information and software updates) depends on the following:

Git ≥ 1.7.1 (source, binaries, documentation)

📐 Supported sequence Aligners

One pairwise sequence aligner is required:

BLAST+ ≥ 2.6.0 (source, binaries, documentation)
Bowtie 2 ≥ 2.3.4.1 (source, binaries, documentation)
BWA ≥ 0.7.12 (source, ugene binaries, bioconda binaries, documentation)
Cas-OFFinder ≥ 2.4 (source, binaries, documentation)

For polymorphism-aware expansion (using the --homologs option), one multiple sequence aligner is required:

MAFFT (source, binaries, documentation)

🌡 Supported thermodynamics calculators

For oligo design, AddTag requires one of the following third-party thermodynamics solutions to be installed:

UNAFold ≥ 3.8 (source, documentation) with patch440
primer3-py Python module (source, whls, documentation)
ViennaRNA Python module (source, official binaries, bioconda binaries, documentation)

📈 Supported scoring Algorithms

The following scoring algorithms are subclasses of SingleSequenceAlgorithm.

Azimuth (Doench, Fusi, et al (2016))
note: Either Azimuth 2 or Azimuth 3 can be used to calculate Azimuth scores. There is no need to have both installed.
- Azimuth 3 Python module (source, documentation)
  note: requires specific versions of numpy, scikit-learn, and pandas. Other dependencies include click, biopython, scipy, GPy, hyperopt, paramz, theanets, glmnet_py, dill, matplotlib, pytz, python-dateutil, six, tqdm, future, networkx, pymongo, decorator, downhill, theano, nose-parameterized, joblib, kiwisolver, cycler, pyparsing, setuptools, glmnet-py.
- Azimuth 2 Python module (source, documentation) on 2.7.10 ≤ Python < 3.0.0 (source, binaries, documentation)
  note: requires python-tk to be installed. Also requires specific versions of scipy, numpy, matplotlib, nose, scikit-learn, pandas, biopython, pyparsing, cycler, six, pytz, python-dateutil, functools32, subprocess32.
CINDEL/DeepCpf1 (Kim, Song, et al (2016), Kim, Song, et al (2018))
note: Requires both Keras and Theano Python modules.
- Keras Python module (source, whls, documentation)
- Theano Python module (source, whls, documentation)
Doench-2014 (Doench, et al (2014))
Housden (Housden, et al (2015))
Moreno-Mateos (Moreno-Mateos, et al (2015))
CRISPRater (Labuhn, et al. (2018))
GC (Wang, et al (2014))
Homopolymer (Hough, et al. (2017))
ProximalG
PolyT
PAM Identity
Position

The following scoring algorithms are subclasses of PairedSequenceAlgorithm.

CFD (Doench, Fusi, et al (2016))
Substitutions, Insertions, Deletions, Errors (Needleman, Wunsch (1970))
Hsu-Zhang (Hsu, et al (2013))
Linear

Python package setup

There are several standard ways to make modules available to your Python installation. The easiest way to install a package is through pip.

For example, the following command will download and set up the regex package from PyPI in your default Python installation.

pip install regex

If you want to make the module available to a specific Python installation, use a command like this:

/path/to/python -m pip install regex

Often, the package is not available on PyPI, or you need a development version. In these cases, you can direct pip to download and set up a package from a code repository. pip will install it and take care of all dependencies, assuming git is available in the PATH environment variable. Here is how to install the Azimuth package from GitHub.

pip2.7 install git+https://github.com/MicrosoftResearch/Azimuth.git

Some Python packages are available through bioconda. To install viennarna using conda, use this command:

conda install -c bioconda viennarna

⤵ Installing AddTag

You can download the latest version of AddTag over HTTPS using git with the following command.

git clone https://github.com/tdseher/addtag-project.git

This will download AddTag into a folder called addtag-project/ in your current working directory. Go ahead and change the working directory into the AddTag folder.

cd addtag-project/

git should automatically make the addtag program executable. If it does not, you can use the following command to do it.

chmod +x addtag

To make the AddTag executable accessible from any working directory, you can add the absolute path of the current working directory to the PATH variable.

On Windows, run:

set PATH=%PATH%;%CD%

On Linux or macOS, run:

export PATH=$PATH:$PWD

If you run AddTag with no parameters, you should get the following output:

usage: addtag [-h] [-v] action ...

Special note

One way to obtain AddTag is by downloading and extracting the code directly from GitHub:

wget https://github.com/tdseher/addtag-project/archive/master.zip
unzip master.zip
cd addtag-project-master/

If you try running addtag, you will get a message similar to the following:

./addtag

fatal: Not a git repository (or any parent up to mount point /media/sf_VirtualBox_share)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

This message means that the AddTag directory isn't a valid git repository (it is missing the .git subfolder). As a consequence, the version information will not be accessible.

./addtag --version

addtag missing (revision missing)

To fix this, simply ensure git is installed and available in the PATH environment variable (see Software prerequisites), and run the following:

./addtag update

Now, when you run addtag, you should not receive the warnings, and the version field will be populated.

./addtag --version

addtag 9e8748b (revision 460)

🔁 Updating AddTag

The commands in this section assume the working directory is the AddTag folder.

cd addtag-project/

If you would like to update your local copy to the newest version available, use the following command from within the addtag-project/ directory.

./addtag update

If you want the newest version, but you made changes to the source code, then you can first discard your changes, and then update. Use the following command from inside the addtag-project/ folder.

./addtag update --discard_local_changes

Alternatively, if you want to keep the local modifications, you can use the --keep_local_changes option to stash, pull, then reapply them afterwards.

./addtag update --keep_local_changes

Each one of these commands assumes git is available on the PATH environment variable.

💻 Program Instructions

Displaying the usage

Because AddTag is being updated regularly, the most current feature set and usage can be viewed by running AddTag with the --help command line option.

The following commands assume the current working directory is the AddTag folder addtag-project/. This will print out command line parameter descriptions and examples.

./addtag --help

Additionally, you may view the included man page, which is probably not up-to-date.

man ./addtag.1

Format of input data

FASTA input

AddTag requires a FASTA genome of the organism you wish to manipulate. FASTA files resemble the following:

>primary_header1 attribute1=value1 attribute2=value2
NNNNCGAAATCGGCGCATAGGCCTAAGAGCTCCTATAGAGATCGATATAAACGCTAGAAGATAGAGAGAGGCTCGCGCTCGATCGCGATAAGAGAGCTCTCGGCCGATAGAGGAATCTCGggctcgcatatatyhcgcggcatatGGCCTAGAGGACCAATAAAGATATATAGCCTAAAGGAATATATAGAGAGATATATATAGNNNN
>primary_header2 attribute1=value1 attribute2=value2
AGCTAGAGACWWWCTCCTCTCCTAGAGASSSAGAGGAGAGCTCTCCGAGAGACGCTCGCTCGTATGCCTCTATATCGATATATAGGAGAATCCTCGATATATAG

FASTA files are plain text files that use newline (\n or \r\n) characters as delimiters. If a line begins with a greater than (>) symbol, it represents the start of a new sequence record. All characters between the > and \n are considered the 'header' of the record. Everything between the > and the first whitespace character (space or \t), if one exists, is considered the 'primary identifier' for the record. All subsequent lines until the next 'header' line contain the sequence information for that record. Therefore, FASTA files can contain many sequence records. Each record in a genome assembly's FASTA file is called a 'contig'.

Typically, the DNA sequence information in FASTA files is a string of canonical nucleotide abbreviations (ACGT). However, FASTA files can contain any number of ambiguous characters (RYMKWSBDHVN), which can represent allelic variation expected within the sample or sequencing uncertainty. FASTA files can also contain a mix of UPPER and lower cased characters. Typical use for lower case characters is to exclude these residues from Target or Primer identification.
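The record structure described above can be sketched in a few lines of Python. This is only an illustration of the format rules (header line, primary identifier, multi-line sequence, case masking), not AddTag's own parser; the function names `parse_fasta` and `count_masked` and the sample record are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into (primary_identifier, header, sequence) tuples."""
    records = []
    header = None
    chunks = []

    def flush():
        # Store the record collected so far, if any.
        if header is not None:
            fields = header.split()
            primary_id = fields[0] if fields else ""
            records.append((primary_id, header, "".join(chunks)))

    for line in text.splitlines():  # splitlines() accepts both \n and \r\n
        line = line.rstrip()
        if line.startswith(">"):
            flush()
            header = line[1:]
            chunks = []
        elif line and header is not None:
            chunks.append(line)
    flush()
    return records


def count_masked(sequence):
    """Count lower-case (case-masked) residues in a sequence."""
    return sum(1 for residue in sequence if residue.islower())


# Hypothetical two-record input with mixed line endings and a masked region.
example = ">contig_1 attribute1=value1\r\nNNACGT\r\nacgtRY\r\n>contig_2\nGGCC\n"
records = parse_fasta(example)
for primary_id, header, sequence in records:
    print(primary_id, len(sequence), count_masked(sequence))
```

The primary identifier is simply the first whitespace-delimited token of the header, and sequence lines are concatenated until the next header.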

GFF input

AddTag requires a GFF file containing annotations for the Features you wish to manipulate (technical specifications of GFF format). GFF files resemble the following:

# seqid	source	feature	start	end	score	strand	frame	attribute
C1A	DB	gene	3489	5146	.	+	.	ID=C1A_001;Name=C1_001;Gene=GENE1
C1A	DB	mRNA	3489	5146	.	+	.	ID=C1A_001-T;Parent=C1A_001
C1A	DB	exon	3489	5146	.	+	.	ID=C1A_001-T-E1;Parent=C1A_001-T
C1A	DB	CDS	3489	5146	.	+	0	ID=C1A_001-P;Parent=C1A_001-T
C1B	DB	gene	3267	4924	.	+	.	ID=C1B_001
C1B	DB	mRNA	3267	4924	.	+	.	ID=C1B_001-T;Parent=C1B_001
C1B	DB	exon	3267	4924	.	+	.	ID=C1B_001-T-E1;Parent=C1B_001-T
C1B	DB	CDS	3267	4924	.	+	0	ID=C1B_001-P;Parent=C1B_001-T

GFF files describe the contig locations of important genomic Features. Empty lines and lines that begin with the pound (#) symbol are ignored. Of note is the far-right attribute column, which AddTag assumes is a semicolon-delimited set of key/value pairs. AddTag assumes each Feature has a unique identifier. By default, it uses the ID attribute as the unique name for each Feature. If your GFF file does not have an ID attribute, then you can select a different one with the --tag command line option.
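As a rough illustration of the rules in the paragraph above, the following Python sketch splits one GFF line into its nine tab-delimited columns and turns the attribute column into a dictionary. It is not AddTag's implementation; `parse_gff_line` and `feature_name` are hypothetical helper names, and the `tag` parameter only mimics what the --tag option is described as doing.

```python
def parse_gff_line(line):
    """Return a dict for one GFF record, or None for blank/comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        return None  # not a well-formed 9-column GFF record
    seqid, source, feature, start, end, score, strand, frame, attribute = fields
    attributes = {}
    for pair in attribute.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqid": seqid,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def feature_name(record, tag="ID"):
    """Look up the attribute used as the Feature's unique name."""
    return record["attributes"].get(tag)


line = "C1A\tDB\tgene\t3489\t5146\t.\t+\t.\tID=C1A_001;Name=C1_001;Gene=GENE1"
record = parse_gff_line(line)
print(feature_name(record))              # uses the ID attribute
print(feature_name(record, tag="Gene"))  # uses a different attribute
```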

Typical AddTag analyses require at least one GFF file. AddTag can handle GFF files in two ways.

For the first method, all Features matching the selected type, designated by the --features command line argument, will be included for analysis. By default, only lines in the GFF file containing gene in the feature column will be considered. This system is useful if your GFF file contains only the Features you wish to manipulate.
If your GFF file contains all annotations for the entire genome (which is typical), the second approach requires you to select only the few Features you want to edit using the --selection command line argument.
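The two selection modes can be pictured with a small Python sketch. It assumes, as the example commands in this README suggest, that the feature type and the selection list are applied together; the real precedence inside AddTag may differ. The function `select_features` and the record layout are hypothetical.

```python
def select_features(records, feature_type="gene", selection=None, tag="ID"):
    """Keep records of the given type; optionally restrict to selected names."""
    selected = []
    for record in records:
        if record["feature"] != feature_type:
            continue
        if selection is not None and record["attributes"].get(tag) not in selection:
            continue
        selected.append(record)
    return selected


# Hypothetical, already-parsed GFF records.
records = [
    {"feature": "gene", "attributes": {"ID": "C1A_001"}},
    {"feature": "mRNA", "attributes": {"ID": "C1A_001-T"}},
    {"feature": "gene", "attributes": {"ID": "C1B_001"}},
]

# First method: every Feature of the chosen type.
all_genes = select_features(records, feature_type="gene")
# Second method: only the named Features.
chosen = select_features(records, feature_type="gene", selection={"C1B_001"})
print(len(all_genes), len(chosen))
```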

Often, you will have a GFF file with annotations for the entire genome. The attributes column is not often structured intuitively, and can prove cumbersome to search (grep) or sort (sort) manually. To make it easy to identify the desired lines of a GFF file, AddTag includes the find_feature subroutine. Here is an example that tries to find all lines associated with HSP90 by searching several attribute tags, and outputting a GFF with a commented line containing field names:

addtag find_feature --gff genome.gff --query HSP90 --linked_tags Name Alias Parent Gene --header > features.gff

Target motif input

The Target motif is written from 5' to 3'. Use a greater than (>) symbol if your RGN has a 3'-adjacent PAM, and use a less than (<) symbol if your RGN has a 5'-adjacent PAM. Ambiguous nucleotide characters are accepted. {a,b} are quantifiers. (a,b,…) are permitted alternatives. / is a sense strand cut, \ is an antisense strand cut, and | is a double-strand cut. A period (.) is a base used for positional information, but not enzymatic recognition. Be sure to enclose each motif in quotes so your shell does not interpret STDIN/STDOUT redirection.

You can specify any number of Target motifs to be considered 'on-target' using the --motifs command line option. You can also designate any number of Target motifs to be considered 'off-target' using the --off_target_motifs command line option.
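To make the motif syntax more concrete, here is a simplified Python sketch that translates a motif into a regular expression and scans the forward strand of an unambiguous sequence. It is an approximation written for this explanation, not AddTag's motif parser: it drops the cut-site and PAM-side markers, does not search the reverse strand, and does not handle ambiguous bases in the genome. The names `motif_to_regex` and `find_targets` are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "W": "[AT]", "S": "[CG]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a Target motif into a regex (simplified sketch)."""
    out = []
    in_braces = False
    for ch in motif:
        if ch == "{":
            in_braces = True
            out.append(ch)
        elif ch == "}":
            in_braces = False
            out.append(ch)
        elif in_braces:
            out.append(ch)          # digits and commas of a quantifier
        elif ch in "/\\|><":
            continue                # markers carry no sequence of their own
        elif ch == "(":
            out.append("(?:")
        elif ch == ")":
            out.append(")")
        elif ch == ",":
            out.append("|")         # comma separates alternatives
        elif ch == ".":
            out.append("[ACGT]")    # positional base, any nucleotide
        elif ch.upper() in IUPAC:
            out.append(IUPAC[ch.upper()])
        else:
            raise ValueError("unsupported motif character: " + ch)
    return "".join(out)


def find_targets(motif, sequence):
    """Return (start, match) pairs on the forward strand, overlaps included."""
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


sequence = "TT" + "ACGT" * 5 + "TGG" + "AA"
hits = find_targets("N{17}|N{3}>NGG", sequence)
print(motif_to_regex("N{17}|N{3}>NGG"))
print(hits)
```

A lookahead is used in `find_targets` so that overlapping candidate sites are all reported rather than only the first non-overlapping one.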

To see an exhaustive list of all identified Target motifs for each known RGN, run the following command:

addtag list_motifs

Homologs input

Some researchers are lucky enough to get to work on organisms with phased genomes. This means that full haplotype information is known for each chromosome. AddTag can accommodate haploid, diploid, and polyploid genomes when homologous Features are linked by the addition of the --homologs command line option. The 'homologs' file has the following format:

# group	hom_a	hom_b	hom_c
GENE1	C1A_001	C1B_001
GENE2	C1A_002	C1B_002	C1C_002

Each Feature identifier has its contig start and end position defined in the input GFF file. The 'homologs' file merely links them together. Columns in the homologs file are delimited by the \t character. The first column is the name of the group of Features. Every subsequent column should contain the identifier of a Feature to consider as a homolog. Homolog groups can each have any number of Features. If a Feature identifier appears on multiple lines, then all those Features are linked together as one homolog group. The identifier can be changed with the --tag command line option.
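The linking rule (identifiers shared between lines merge those lines into one group) can be sketched as follows. This is a minimal illustration rather than AddTag's code; `parse_homologs` is a hypothetical name, and treating lines that start with # as comments is an assumption based on the header line in the example above.

```python
def parse_homologs(text):
    """Return a list of disjoint sets of Feature identifiers."""
    groups = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # assumed: blank lines and comment lines are skipped
        fields = [field for field in line.split("\t") if field]
        merged = set(fields[1:])  # first column is the group name
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group   # shared identifier: link the groups
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups


text = (
    "# group\thom_a\thom_b\thom_c\n"
    "GENE1\tC1A_001\tC1B_001\n"
    "GENE2\tC1A_002\tC1B_002\tC1C_002\n"
    "GENE1_extra\tC1B_001\tC1C_001\n"  # hypothetical line sharing C1B_001
)
groups = parse_homologs(text)
print(sorted(sorted(group) for group in groups))
```

Because existing groups are always kept disjoint, one pass over them per input line is enough to merge everything that shares an identifier.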

Format of output data

AddTag outputs most of the experimental results you need to STDOUT. However, for simplicity, sequences are output to FASTA files. Please note that the output table formats are not consistent among AddTag versions; more recent releases are more thorough and useful.

STDOUT

The final data are printed to STDOUT as tab-delimited tables. Lines containing column headers start with a # character.

The reTarget results table contains information on optimal Targets that exist within the ★tag insert on the r1-gDNA.

The exTarget results table contains information on optimal Targets that exist within the extended Feature on the r0-gDNA.

The AmpF/AmpR results table contains information on optimal Primer Pairs for amplifying the Feature to create the r2-dDNAs.

The reDonor results table lists information on r2-dDNAs.

The exDonor results table lists information on r1-dDNAs.

The Region definitions table lists the genome, contig, start and end coordinates for where cPCR Primers will be selected from.

The Primer sequences table lists the optimal PrimerSets by weight order, with non-redundant Primers.

The PrimerPairs table lists the PrimerPair attributes for each amplicon.

The Amplicon diagram succinctly relates the primer names to the regions in the genomes they bind to and amplify.

The In silico recombination table lists where in the gDNAs the dDNAs were incorporated.

STDERR

If the AddTag software fails for any reason, error messages will be printed to STDERR. If you redirect STDERR into a file, and the file size is nonzero, then this indicates that an error occurred.
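When AddTag is driven from a script, the nonzero-size check can be automated. The following Python sketch shows one way to do it; the helper name `run_reported_errors` and the temporary file names are hypothetical, and in practice the path would be whatever file you redirected STDERR into.

```python
import os
import tempfile


def run_reported_errors(stderr_path):
    """Return True if the redirected STDERR file exists and is not empty."""
    return os.path.exists(stderr_path) and os.path.getsize(stderr_path) > 0


# Demonstration with two throwaway files standing in for redirected STDERR.
with tempfile.TemporaryDirectory() as folder:
    clean_path = os.path.join(folder, "clean.err")
    failed_path = os.path.join(folder, "failed.err")
    open(clean_path, "w").close()  # empty file: nothing was written to STDERR
    with open(failed_path, "w") as handle:
        handle.write("error: missing required argument\n")
    clean_result = run_reported_errors(clean_path)
    failed_result = run_reported_errors(failed_path)

print(clean_result, failed_result)
```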

Often, errors happen if required AddTag arguments are missing, or input data is improperly formatted.

log.txt

AddTag outputs intermediate calculations and computation status to the log.txt file. This includes the exact commands used when calling any external programs (such as Aligners), alignments of Target sequences to dDNA sequences, and timestamps.

excision-dDNAs.fasta

The excision-dDNAs.fasta file contains the dDNA sequences for creating the intermediary genome that are referenced by the tables from STDOUT. These dDNA sequences contain the mintag, addtag, unitag, bartag, or sigtag as requested by the AddTag invocation arguments.

An example of a nominal mintag that targets both alleles of a diploid chromosome:

>exDonor-0 spacers=4 C1A_002:C1A:+:272323..272373::274197..274247 C1B_002:C1B:+:272338..272388::274212..274262
ACTAAAATGAAAACCACATACAGCAGTAATAGTACTAGCCAACTCACTATTTTGATTTTGGGAACGGAGTTGAGCGGTATATGTGACAACAGTGACTATG

An example of an addtag experiment:

>exDonor-0 spacers=1 C1A_003:C1A:+:109972..110010:ctccgctctcgcctagactcggg:112195..112234 C1B_003:C1B:+:109967..110005:ctccgctctcgcctagactcggg:112220..112259
GCATAGGCTAGAGATAGTCCTCAGATAATAATAGAGCTctccgctctcgcctagactcgggAATATAAGATCAGTCTCTCCCGACTAGAATCTCTAGCAA
>exDonor-1 spacers=1 C1A_003:C1A:+:109972..110010:cccgagtctaggcgagagcggag:112195..112234 C1B_003:C1B:+:109967..110005:cccgagtctaggcgagagcggag:112220..112259
GCATAGGCTAGAGATAGTCCTCAGATAATAATAGAGCTcccgagtctaggcgagagcggagAATATAAGATCAGTCTCTCCCGACTAGAATCTCTAGCAA

These dDNAs each are predicted to recombine with contigs C1A and C1B. Note that each dDNA incorporates the exogenous addtag sequence in an opposite orientation.
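"Opposite orientation" here means the two inserts are reverse complements of each other, which a short Python check can confirm using the two lower-case insert sequences from the example headers. The helper `reverse_complement` is written for this illustration and handles only unambiguous bases.

```python
# Translation table for unambiguous bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


insert_0 = "ctccgctctcgcctagactcggg"  # insert carried by exDonor-0
insert_1 = "cccgagtctaggcgagagcggag"  # insert carried by exDonor-1

print(reverse_complement(insert_0) == insert_1)
```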

excision-targets.fasta

This file contains only the Target sequences that are contained within the Feature, but in FASTA format. For the most part, the exTarget results table from STDOUT contains more information. We intend this file to be used as input to the find_header subroutine.

reversion-dDNAs.fasta

This file is structured identically to the excision-dDNAs.fasta file.

If you direct AddTag to find Primers to amplify the wild type Feature, then their amplicon sequences will be stored in the reversion-dDNAs.fasta file. If you do not have AddTag find the AmpF/AmpR primers, then the entire region containing the Feature, upstream, and downstream sequences is written to the reversion-dDNAs.fasta file.

This example shows that polymorphisms at the Feature and its flanking sequences mean there are two possible dDNAs:

>reDonor-0 spacers=0 C3A_005:C3A:+:1722491..1722834
TTTTTTTTGGTTAACCACTTTGTGTCCCTTGCATACTTTTACATTGGAAACATACATACACTAACATTCACACTCAATACACTCATATTATTTACCATTTTTGTTGTGAAGATACACGTATTTATTGAGTATTCCTTCATAACATTTAATTTATATTCCAAGAGTTAATTGATTAAACAACTTGGTCCAAACAAACATAAACATAAACAAAAACGTTTTCTTTTTTTGCATAATATCTATCTATGTATATGTATATATATGTGTGTAAGTCATTGTCTTTTCCATTTTCTTTTCCATTTTCTTTTTTTTTTAGTTTTGTTTTCAAGTGTGTAATAATAATAAT
>reDonor-1 spacers=0 C3B_005:C3B:+:1723088..1723418
TTTTTTTTGGTTAACCCCTTTGTGTCCCTTGCATACTTTTACATTGGAAACATACATACACTAACATTCACACTCAATACACTCACATTATTTACCATTTTTGTTGTGAAGATACACGTATTTATTGAGTATTCCTTCATAACATTTAATTTATATTCCAAGAGTTAATTGATTAAACAACTTGGTCCAAAAAACAAAAACGTTTTCTTTTTTTGCATAATATCTATCTATGTATATGTATATATATGTGTGTAAGTCATTGTCTTTTCCATTTTCTTTTCCATTTTCTTTTCTTTTTAGTTTTGTTTTCAAGTGTGTAATAATAATAAT

reversion-targets.fasta

This file contains only the RGN Target sequences compatible with the exDonor sequences (and by extension, the intermediary genome). For the most part, the reTarget results table from STDOUT contains more information.

genome-rN.fasta

In silico recombination will integrate the input dDNAs into their respective loci within the input genome. Contig names (primary identifiers) are modified with the incorporated dDNAs as well as the round.

For example, genome-r0.fasta may resemble the following:

>contig_001
GCTAAGCGCATCGCGCATAGGGCGGCAAAAAAGCGCTAGAGACTCAGAGGAGCGCTAGCGGCTCGAATATAATAGATAGCTATAGCCTAGGAGATAGGAAACTCAGAAATAGACCATAAA
>contig_002
AATAAGCTCAGATAATATAGCTCGCTCTCTCGATAGCTCTAGACTCCCTAGAGCCCTAAGCCCGCTCGCGAATAGATCCTCTAGACTAGATGAGAGCCGGCCCTCGCGCGCGATAGAGAA

If the first round dDNA contains the following:

>dDNA1
GCTCGAATATAATAGATAGCTATAGcccgggAGGAAACTCAGAAATAGACCATAAA

After the first round of in silico recombination, genome-r1.fasta will be:

>contig_001-r1[dDNA1]
GCTAAGCGCATCGCGCATAGGGCGGCAAAAAAGCGCTAGAGACTCAGAGGAGCGCTAGCGGCTCGAATATAATAGATAGCTATAGcccgggAGGAAACTCAGAAATAGACCATAAA
>contig_002
AATAAGCTCAGATAATATAGCTCGCTCTCTCGATAGCTCTAGACTCCCTAGAGCCCTAAGCCCGCTCGCGAATAGATCCTCTAGACTAGATGAGAGCCGGCCCTCGCGCGCGATAGAGAA
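The example above can be reproduced with a toy Python model of recombination: find both homology arms by exact string matching and replace whatever lies between them with the dDNA insert. This is a deliberately naive sketch; AddTag itself relies on alignment, and the function names `recombine` and `rename_contig` are hypothetical. Splitting the dDNA into arms by letter case is only a convenience of this particular example.

```python
def recombine(contig, up_arm, insert, down_arm):
    """Replace the sequence between two exactly matching arms with the insert."""
    start = contig.find(up_arm)
    if start < 0:
        return None  # upstream homology arm not found
    up_end = start + len(up_arm)
    down_start = contig.find(down_arm, up_end)
    if down_start < 0:
        return None  # downstream homology arm not found
    return contig[:up_end] + insert + contig[down_start:]


def rename_contig(name, round_number, ddna_name):
    """Build a contig name in the style shown in the example, e.g. contig_001-r1[dDNA1]."""
    return "{}-r{}[{}]".format(name, round_number, ddna_name)


contig_001 = (
    "GCTAAGCGCATCGCGCATAGGGCGGCAAAAAAGCGCTAGAGACTCAGAGGAGCGCTAGCG"
    "GCTCGAATATAATAGATAGCTATAGCCTAGGAGATAGGAAACTCAGAAATAGACCATAAA"
)
# dDNA1 split into upstream arm, insert, and downstream arm.
up_arm = "GCTCGAATATAATAGATAGCTATAG"
insert = "cccggg"
down_arm = "AGGAAACTCAGAAATAGACCATAAA"

edited = recombine(contig_001, up_arm, insert, down_arm)
print(">" + rename_contig("contig_001", 1, "dDNA1"))
print(edited)
```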

Available subroutines

The AddTag program contains a set of subroutines that can be run independently. There are four categories of subroutines.

The evaluate_* subroutines run only a very specific analysis on input data.
The find_* subroutines are used to search input files for specific things, so the user can easily learn the correct parameters to use for AddTag input.
The generate_* subroutines perform the deep computational analyses.
The list_* subroutines just print information the user might find useful.

Available RGN scoring Algorithms

Over the past few years, several Algorithms have been proposed to describe RGN behavior within certain biological contexts. We implemented most of the commonly used ones in the AddTag software. To view information about each, use the following command:

addtag list_algorithms

This will write the pertinent information for all implemented Algorithms to STDOUT.

If an Algorithm is used for pre-alignment filtering (Prefilter) or post-alignment filtering (Postfilter), then the score of the Target must lie between the Min and Max values for the Target to continue through the analysis. For instance, the 'off-target' scoring CFD Algorithm has a Min of 1.0. This means that some positions with significant sequence similarity to the query Target (because they are identified in the Alignment step) will not contribute to the final 'off-target' score if their score is less than 1.0.
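The Min/Max rule amounts to a range check on each score. The sketch below illustrates it in Python; it assumes the bounds are inclusive and uses a Max of 100.0 purely as a placeholder, since only the CFD Min of 1.0 is stated here. The function names and the sample scores are hypothetical.

```python
def passes_filter(score, minimum, maximum):
    """Return True if the score lies within [minimum, maximum] (assumed inclusive)."""
    return minimum <= score <= maximum


def contributing_scores(scores, minimum, maximum):
    """Keep only the scores that pass the filter."""
    return [score for score in scores if passes_filter(score, minimum, maximum)]


# Hypothetical CFD scores for positions found during the Alignment step.
cfd_scores = [0.4, 1.0, 57.83, 99.9]
kept = contributing_scores(cfd_scores, minimum=1.0, maximum=100.0)  # Max is a placeholder
print(kept)
```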

Available oligonucleotide thermodynamics calculators

To view which thermodynamics calculators are available on your system, use the following command:

addtag list_thermodynamics

Workflow for editing loci in the manuscript

These are instructions for using the current version of AddTag to re-design the experiments featured in the manuscript. The commands for the original design are in the methods.md file.

Get genome data

Download the Candida albicans reference genome and annotations used for this study.

wget http://www.candidagenome.org/download/sequence/C_albicans_SC5314/Assembly22/archive/C_albicans_SC5314_version_A22-s07-m01-r19_chromosomes.fasta.gz
gunzip C_albicans_SC5314_version_A22-s07-m01-r19_chromosomes.fasta.gz
wget http://www.candidagenome.org/download/gff/C_albicans_SC5314/archive/C_albicans_SC5314_version_A22-s07-m01-r19_features.gff

Set convenience variables for referencing these files.

GENOME_FASTA=C_albicans_SC5314_version_A22-s07-m01-r19_chromosomes.fasta
GENOME_GFF=C_albicans_SC5314_version_A22-s07-m01-r19_features.gff
GENOME_HOMOLOGS=C_albicans_SC5314_version_A22-s07-m01-r19_homologs.txt

Create the *.homologs file for the C. albicans genome.

python3 gff2homologs.py ${GENOME_GFF} > ${GENOME_HOMOLOGS}

ADE2_CDS

For simplicity, we use a variable to hold the label for this computational experiment.

GENE=ADE2

Create and enter the directory for this experiment.

mkdir ${GENE}_CDS
cd ${GENE}_CDS

Extract the feature IDs of the genes we want to remove from the *.homologs file.

SELECTION=$(grep ${GENE} ../${GENOME_HOMOLOGS} | cut -f 2- --output-delimiter ' ')

Identify the optimal Target sites and generate potential dDNAs.

addtag generate_all \
  --fasta ../${GENOME_FASTA} \
  --gff ../${GENOME_GFF} \
  --homologs ../${GENOME_HOMOLOGS} \
  --selection ${SELECTION} \
  --features gene \
  --tag ID \
  --ko-gRNA \
  --ko-dDNA mintag \
  --ki-gRNA \
  --ki-dDNA \
  --motifs 'N{17}|N{3}>NGG' \
  --off_target_motifs 'N{17}|N{3}>NAG' \
  --excise_insert_lengths 0 4 \
  --revert_amplification_primers \
  --revert_homology_length 100 200 \
  --folder ${GENE}ga > ${GENE}ga.out 2> ${GENE}ga.err

Select the best +Target and ΔTarget.

addtag find_header --fasta ${GENE}ga/excision-targets.fasta --query '\brank=0\b' > ko-target.fasta
addtag find_header --fasta ${GENE}ga/reversion-targets.fasta --query '\brank=0\b' > ki-target.fasta

Select an arbitrary ΔdDNA associated with the top-ranked ΔTarget, and select the AdDNA with the best AmpF/AmpR primer pair.

DONOR=$(grep '# reTarget results' -A 2 ${GENE}ga.out | tail -n +3 | cut -f 9 | cut -d ',' -f 1)
addtag find_header --fasta ${GENE}ga/excision-dDNAs.fasta --query "${DONOR}\b" > ko-dDNA.fasta
addtag find_header --fasta ${GENE}ga/reversion-dDNAs.fasta --query '\brank=0\b' > ki-dDNA.fasta

Calculate a decent Primer Design for validating each genome engineering step.

addtag generate_primers \
  --fasta ../${GENOME_FASTA} \
  --dDNAs ko-dDNA.fasta ki-dDNA.fasta \
  --primer_scan_limit 600 \
  --primer_pair_limit 300 \
  --o_primers_required y n y \
  --i_primers_required y n y \
  --oligo ViennaRNA \
  --specificity all \
  --max_number_designs_reported 1000 \
  --folder ${GENE}gp > ${GENE}gp.out 2> ${GENE}gp.err

The file ADE2gp.out contains any identified sets of primers, ordered by weight. Choose one that has the highest weight for the number of primers you need.

Finally, change back to the parent folder.

cd ..

EFG1_CDS

*~ Section incomplete ~*

BRG1_CDS

*~ Section incomplete ~*

ZAP1_US

*~ Section incomplete ~*

ZRT2_US

*~ Section incomplete ~*

WOR1_USd

*~ Section incomplete ~*

WOR1_USp

*~ Section incomplete ~*

WOR2_DS

*~ Section incomplete ~*

Typical workflows

1-step deletion of a single Feature

In this simplest of examples, we will choose a Feature to delete from a genome, identify the optimal Target to design the gRNA against, create the necessary dDNA, and generate the set of Primers to validate the deletion.

This process uses a 'nominal' mintag, which means the generated dDNA consists of homology arms concatenated together with no insert.

The first step is to obtain input data. Let's download the sequences (FASTA) and annotations (GFF) for a haploid C. albicans assembly into the current working directory:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Candida_albicans/all_assembly_versions/GCF_000182965.3_ASM18296v3/GCF_000182965.3_ASM18296v3_genomic.fna.gz
gunzip GCF_000182965.3_ASM18296v3_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/Candida_albicans/all_assembly_versions/GCF_000182965.3_ASM18296v3/GCF_000182965.3_ASM18296v3_genomic.gff.gz
gunzip GCF_000182965.3_ASM18296v3_genomic.gff.gz

For convenience, let's use a variable to abbreviate these paths:

GENOME=GCF_000182965.3_ASM18296v3_genomic

The 1-step approach is appropriate when the Feature you wish to remove contains a high quality Target within it. We will select a Feature from the GFF file using the --selection option.

Let's pretend we are interested in the gene GCN20. Let's store it into a variable.

GENE=GCN20

For the purposes of this walkthrough, GCN20 is interesting because all its potential Cas9 Targets have several off-targets across the genome. Because there is no precise Target, the Algorithm weight is especially useful for balancing the on-target and off-target scores.

If we know its gene ID, we can directly include the option --selection ID. However, we don't know the ID for this gene, so we can search for it. To do this, we will use the addtag find_feature subroutine to find all Features associated with GCN20:

addtag find_feature --linked_tags --header --query ${GENE} --gff ${GENOME}.gff

# seqid	source	feature	start	end	score	strand	frame	attribute
NC_032089.1	RefSeq	CDS	75573	77828	.	-	0	ID=cds-XP_719022.1;Parent=rna-XM_713929.2;Dbxref=CGD:CAL0000181616,GeneID:3639314,Genbank:XP_719022.1;Name=XP_719022.1;Note=YEF3-subfamily ABC family protein%2C predicted not to be a transporter;gbkey=CDS;gene=GCN20;locus_tag=CAALFM_C100480CA;orig_transcript_id=gnl|WGS:AACQ|mrna_CAALFM_C100480CA;product=putative AAA family ATPase;protein_id=XP_719022.1;transl_table=12
NC_032089.1	RefSeq	exon	75573	77828	.	-	.	ID=exon-XM_713929.2-1;Parent=rna-XM_713929.2;Dbxref=GeneID:3639314,Genbank:XM_713929.2;end_range=77828,.;gbkey=mRNA;gene=GCN20;locus_tag=CAALFM_C100480CA;orig_protein_id=gnl|WGS:AACQ|CAALFM_C100480CA;orig_transcript_id=gnl|WGS:AACQ|mrna_CAALFM_C100480CA;partial=true;product=putative AAA family ATPase;start_range=.,75573;transcript_id=XM_713929.2
NC_032089.1	RefSeq	gene	75573	77828	.	-	.	ID=gene-CAALFM_C100480CA;Dbxref=GeneID:3639314;Name=GCN20;end_range=77828,.;gbkey=Gene;gene=GCN20;gene_biotype=protein_coding;locus_tag=CAALFM_C100480CA;partial=true;start_range=.,75573
NC_032089.1	RefSeq	mRNA	75573	77828	.	-	.	ID=rna-XM_713929.2;Parent=gene-CAALFM_C100480CA;Dbxref=GeneID:3639314,Genbank:XM_713929.2;Name=XM_713929.2;end_range=77828,.;gbkey=mRNA;gene=GCN20;locus_tag=CAALFM_C100480CA;orig_protein_id=gnl|WGS:AACQ|CAALFM_C100480CA;orig_transcript_id=gnl|WGS:AACQ|mrna_CAALFM_C100480CA;partial=true;product=putative AAA family ATPase;start_range=.,75573;transcript_id=XM_713929.2

We see there are 4 annotations associated with GCN20, each a different Feature type (CDS, exon, gene, mRNA), and they all point toward the same 2256 nt on chromosome 1.
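The 2256 nt figure follows from the GFF coordinates, which are 1-based and inclusive at both ends. A one-line Python check makes the arithmetic explicit; the helper name `feature_length` is hypothetical.

```python
def feature_length(start, end):
    """Length of a Feature given 1-based, inclusive GFF start and end coordinates."""
    return end - start + 1


length = feature_length(75573, 77828)
print(length)
```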

Let's choose the Feature type gene, and its corresponding attribute ID gene-CAALFM_C100480CA.

We will use a Target motif, an on-target score, and an off-target score each appropriate for Cas9. We use default score weights for both Azimuth and CFD. To make the specificity estimate more stringent, we broaden the set of sequences that can be considered off-target by specifying the --off_target_motifs option.
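To build intuition for what a Cas9 Target motif such as N{17}|N{3}>NGG selects, here is a rough Python sketch that scans the forward strand of a sequence for a 20 nt spacer followed by a 3'-adjacent PAM. This is not AddTag's motif parser: the function find_sites, its parameters, and the regular-expression approach are assumptions made for illustration, and the sketch ignores the reverse strand, cut-site markers, and ambiguous bases in the spacer.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "W": "[AT]", "S": "[CG]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_sites(sequence, spacer_length=20, pam="NGG"):
    """Yield (start, spacer, pam) for forward-strand sites with a 3'-adjacent PAM."""
    pam_regex = "".join(IUPAC[base] for base in pam)
    # A zero-width lookahead lets overlapping sites all be reported
    pattern = re.compile("(?=([ACGT]{%d})(%s))" % (spacer_length, pam_regex))
    for match in pattern.finditer(sequence.upper()):
        yield match.start(), match.group(1), match.group(2)

# A 20 nt spacer followed by a GGG PAM (hypothetical input sequence)
for site in find_sites("CCAACGAAACAGTTTTCAGGGGG"):
    print(site)
```

Calling find_sites with pam="NAG" reports the kind of additional sites that the extra off-target motif brings into consideration.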

We will keep the rest of the AddTag default options. Our final command to identify the best Target sequences and generate the dDNA is the following:

addtag generate_all \
  --features gene \
  --selection gene-CAALFM_C100480CA \
  --motifs 'N{17}|N{3}>NGG' \
  --off_target_motifs 'N{17}|N{3}>NAG' \
  --ontargetfilters Azimuth \
  --offtargetfilters CFD \
  --excise_insert_lengths 0 0 \
  --ko-gRNA \
  --ko-dDNA mintag \
  --fasta ${GENOME}.fna \
  --gff ${GENOME}.gff \
  --folder ${GENE}g > ${GENE}g.out 2> ${GENE}g.err

This will output a single table, with the best Targets at the top of the output, and the worst toward the bottom.

head ${GENE}g.out

# exTarget results
# gene	features	weight	exTarget name	exTarget sequence	OT:CFD	Azimuth	reDonors	None	feature:contig:strand:start..end	warnings
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.876615628563555	exTarget-96	CCAACGAAACAGTTTTCAGG>GGG	71.63	64.65	None	gene-CAALFM_C100480CA:NC_032089.1:-:76719..76742	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6877504215295679	exTarget-110	CATTATTACGTGCCTTGTCG>AGG	57.83	57.84	None	gene-CAALFM_C100480CA:NC_032089.1:-:77090..77113	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.650732417867368	exTarget-86	CTCTTTCTATGCAACTCGTG>AGG	49.97	59.21	None	gene-CAALFM_C100480CA:NC_032089.1:-:76456..76479	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6497399064714101	exTarget-84	ACAGTCTCGTATCAAGAAGT>TGG	47.83	61.06	None	gene-CAALFM_C100480CA:NC_032089.1:-:76324..76347	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6373121738894936	exTarget-21	GACTTTCGTATTCACGACGT>TGG	61.75	55.93	None	gene-CAALFM_C100480CA:NC_032089.1:+:76420..76443	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6105741443075372	exTarget-117	GAGCGAGGCGTCATTGACAT>TGG	61.2	55.21	None	gene-CAALFM_C100480CA:NC_032089.1:-:77164..77187	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.49714752260403006	exTarget-56	GGATGAACCGTCCAATCACT>TGG	49.7	54.11	None	gene-CAALFM_C100480CA:NC_032089.1:-:75787..75810	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.4582728976377312	exTarget-89	AGATATAATCCATCAACACT>CGG	40.04	66.96	None	gene-CAALFM_C100480CA:NC_032089.1:-:76513..76536	None

If you run this command again, but omit the --off_target_motifs option, you get the following:

# exTarget results
# gene	features	weight	exTarget name	exTarget sequence	OT:CFD	Azimuth	reDonors	None	feature:contig:strand:start..end	warnings
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.8776689032281014	exTarget-96	CCAACGAAACAGTTTTCAGG>GGG	74.29	64.65	None	gene-CAALFM_C100480CA:NC_032089.1:-:76719..76742	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.8360644970889776	exTarget-95	CAACGAAACAGTTTTCAGGG>GGG	50.0	74.38	None	gene-CAALFM_C100480CA:NC_032089.1:-:76718..76741	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.8056514544867718	exTarget-84	ACAGTCTCGTATCAAGAAGT>TGG	91.67	61.06	None	gene-CAALFM_C100480CA:NC_032089.1:-:76324..76347	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.7897087206837166	exTarget-30	GTTTAACTCTCTCCTCGACA>AGG	49.97	67.38	None	gene-CAALFM_C100480CA:NC_032089.1:+:77078..77101	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.7553979473976637	exTarget-86	CTCTTTCTATGCAACTCGTG>AGG	76.79	59.21	None	gene-CAALFM_C100480CA:NC_032089.1:-:76456..76479	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.7126907697600093	exTarget-110	CATTATTACGTGCCTTGTCG>AGG	73.09	57.84	None	gene-CAALFM_C100480CA:NC_032089.1:-:77090..77113	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6433552008729548	exTarget-100	GTATGGTTTGGGTTTCACAA>AGG	72.38	55.81	None	gene-CAALFM_C100480CA:NC_032089.1:-:76756..76779	None
gene-CAALFM_C100480CA	gene-CAALFM_C100480CA	0.6373121738894936	exTarget-21	GACTTTCGTATTCACGACGT>TGG	61.75	55.93	None	gene-CAALFM_C100480CA:NC_032089.1:+:76420..76443	None

Notice that by including the additional off-target motif, we see generally lower off-target scores (the OT:CFD column).
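To make the comparison concrete, the short Python sketch below tabulates how much the OT:CFD score drops for the five Targets that appear in both outputs above. The values are copied from the two tables; the function name score_drop is ours for illustration.

```python
# OT:CFD values copied from the two tables above, keyed by exTarget name
with_nag = {"exTarget-96": 71.63, "exTarget-110": 57.83, "exTarget-86": 49.97,
            "exTarget-84": 47.83, "exTarget-21": 61.75}
without_nag = {"exTarget-96": 74.29, "exTarget-110": 73.09, "exTarget-86": 76.79,
               "exTarget-84": 91.67, "exTarget-21": 61.75}

def score_drop(before, after):
    """Return {name: before - after} for the names present in both tables."""
    shared = sorted(set(before) & set(after))
    return {name: round(before[name] - after[name], 2) for name in shared}

for name, drop in score_drop(without_nag, with_nag).items():
    print(name, drop)
```

None of the shared Targets scores higher once NAG sites are counted, and exTarget-21 is unchanged, which matches the "generally lower" trend described above.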

Next we will identify the best cPCR primers for verifying the 'GCN20' full CDS deletion.

2-step deletion of a single Feature

We will delete a Feature that has no Target within it.

*~ Section incomplete ~*

1-step editing of a single Feature

We will edit a Feature.

*~ Section incomplete ~*

2-step editing of a single Feature

We will edit a Feature that has no Target within it.

*~ Section incomplete ~*

2-step deletion and add-back of a single Feature

In this example, we will go through creating a nominal mintag to knock out a single input Feature, then creating the primers necessary to revert back to the wild type Feature.

The standard procedure is to first run addtag generate_all, and use its output as input for addtag generate_primers.

For simplicity, we will assume the name of the Feature you are interested in is GENE.

The first thing you will want to do is compose a Target motif for the RGN your biological system uses. To see a list of commonly-used Target motifs, run the following:

addtag list_motifs

Let's pretend our biological system uses 'AsCpf1'. So we will use the associated TTTN<N{19}/.{4,6}\ Target motif. Thus, we will add --motifs 'TTTN<N{19}/.{4,6}\' to the addtag generate_all command.

The next step is to select one or more Algorithms to calculate the 'on-target' and 'off-target' scores for this RGN. To see a list of all implemented Algorithms, run the following:

addtag list_algorithms

Let's choose the DeepCpf1 Algorithm for our 'on-target' score. Let's also choose the Linear Algorithm for the 'off-target' score, whose implicit behavior severely penalizes insertions and deletions at 'off-target' sites, but is explicitly less biased against mismatches. Therefore we add --ontargetfilters DeepCpf1 --offtargetfilters Linear to the command. Because we would like the output Target sites to be ranked based on their specificity, and because the Linear algorithm does not have a default weight, we define a weight for it using the --weights command line option.

Let's use the mintag method for creating an RGN Target on the dDNA we generate for creating the intermediary genome. Because we don't want to add any extra bases (only remove the Feature), we include --excise_insert_lengths 0 0. Finally, we want merely to revert back to wild type at the native locus, so we direct AddTag to generate the optimal AmpF/AmpR primers using the --revert_amplification_primers option.

Because our input genome is a phased diploid assembly, and we want our gRNAs to target both alleles, we use the default --target_specificity. Because we want a single dDNA to repair both alleles, we also use the default --donor_specificity. Since we want the computer to use all available compute power, we use the default number of processors (which automatically selects all available). Let's also use the default thermodynamics calculator and the default aligner.

Let's store all the output in paths that start with GENEga, where 'ga' is for 'generate_all'.

To identify the best Target locations within our Feature of interest, and to generate dDNA for knock-out, we run the full command:

addtag generate_all \
  --motifs 'TTTN<N{19}/.{4,6}\' \
  --ontargetfilters DeepCpf1 \
  --offtargetfilters Linear \
  --weights Linear:85+1.7 \
  --excise_insert_lengths 0 0 \
  --ko-gRNA \
  --ko-dDNA mintag \
  --revert_amplification_primers \
  --fasta genome.fasta \
  --gff genome.gff \
  --folder GENEga > GENEga.out 2> GENEga.err

This writes 4 output tables to the GENEga.out file. Each of these tables refers to sequences in output FASTA files. Please note that certain sequence Aligners, such as 'Bowtie2', can have non-deterministic output. Therefore, your results may vary from what is presented here.

Now would be a good time to explain the terminology you will see in the AddTag input and output. For simplicity in text processing, we use different labels than what are presented in the manuscript, though they are equivalent.

OUTPUT           PAPER    DESCRIPTION
r0-gDNA          +gDNA    Wild type genome
r1-gDNA          ΔgDNA    Intermediary genome
r2-gDNA          AgDNA    Final genome
exTarget         +Target  Target site in wild type +Feature that is used to 'excise' the feature
reTarget         ΔTarget  Target site introduced with ★tag insert that is used to 'revert' the genotype
exDonor/r1-dDNA  ΔdDNA    Excision, or knock out dDNA (ko-dDNA)
reDonor/r2-dDNA  AdDNA    Reversion, add-back, or knock-in dDNA (ki-dDNA)

Thus, we refer to the first round of genome engineering (r1) as the knock-out round, and the second round (r2) as the knock-in round.

From the first table, we select the highest-weighted reTarget ('reversion Target', abbreviated), and then we store it in its own FASTA file.

addtag find_header \
  --fasta GENEga/reversion-targets.fasta \
  --query '\brank=0\b' > ki-target.fasta

Each reTarget can target one or more identified exDonor dDNA sequences. In this example, we expect only a single exDonor associated with the highest-weight reTarget. We extract that sequence, and store it in a convenient file, ko-dDNA.fasta.

DONOR=$(grep '# reTarget results' -A 2 GENEga.out | tail -n +3 | cut -f 9 | cut -d',' -f 1)
addtag find_header \
  --fasta GENEga/excision-dDNAs.fasta \
  --query "${DONOR}\b" > ko-dDNA.fasta
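For readers less familiar with shell pipelines, the extraction of DONOR can be sketched in Python. The pipeline finds the '# reTarget results' title, skips the column-header line, takes the first data row, reads the 9th tab-delimited column, and keeps the first comma-separated name. The function name first_donor and the placeholder table below are ours for illustration; a real GENEga.out has AddTag's own column names and values.

```python
def first_donor(table_text, header="# reTarget results", column=9):
    """Mimic: grep header -A 2 | tail -n +3 | cut -f 9 | cut -d',' -f 1"""
    lines = table_text.splitlines()
    for index, line in enumerate(lines):
        if header in line:
            # Two lines below the title: skip the column-header line, take the first data row
            row = lines[index + 2]
            return row.split("\t")[column - 1].split(",")[0]
    raise ValueError("table header not found: %r" % header)

# Placeholder table (hypothetical values, not real AddTag output)
example = "\n".join([
    "# reTarget results",
    "# " + "\t".join("column%d" % i for i in range(1, 10)),
    "\t".join(["value%d" % i for i in range(1, 9)] + ["donor-A,donor-B"]),
])
print(first_donor(example))
```

The sketch only covers the case where the title line occurs once in the file, which is what the shell pipeline also assumes.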

From the second table, we select the highest-weighted exTarget ('excision Target', abbreviated), which is used for excising the input Feature from the input gDNA:

addtag find_header \
  --fasta GENEga/excision-targets.fasta \
  --query '\brank=0\b' > ko-target.fasta

Finally, we identify the highest-weight dDNA for reverting back to the wild type, and put it in its own FASTA file:

addtag find_header \
  --fasta GENEga/reversion-dDNAs.fasta \
  --query 'reDonor-0\b' > ki-dDNA.fasta

Next we need to identify a single cPCR verification primer design. Let's use the default pairwise sequence aligner.

addtag generate_primers \
  --fasta genome.fasta \
  --dDNAs ko-dDNA.fasta ki-dDNA.fasta \
  --folder GENEgp > GENEgp.out 2> GENEgp.err

2-step deletion and add-back of a single, phased Feature

*~ Section incomplete ~*

2-step editing of several Features

*~ Section incomplete ~*

Multiplexed, 2-step editing of several Features

All Features in input GFF file will be evaluated simultaneously.

*~ Section incomplete ~*

📝 Citing AddTag

If you use the AddTag indirect genome editing method, please cite the paper with the initial proof-of-concept [1] as well as the full method description [2]. If you use the AddTag software for your research, please cite [2]. If you comment on, or further develop, AddTag's computational methods (such as Target identification, dDNA generation, or primer design—specifically the weight equations), please cite [3]:

1. Namkha Nguyen, Morgan M. F. Quail, and Aaron D. Hernday. An efficient, rapid, and recyclable system for CRISPR-mediated genome editing in Candida albicans. mSphere, Volume 2, Number 2 (2017). doi:10.1128/mSphereDirect.00149-17, PMID:28497115, PMCID:PMC5422035.
2. Thaddeus D. Seher, Namkha Nguyen, Diana Ramos, Priyanka Bapat, Clarissa J. Nobile, Suzanne S. Sindi, and Aaron D. Hernday. AddTag, a two-step approach with supporting software package that facilitates CRISPR/Cas-mediated precision genome editing. G3 Genes|Genomes|Genetics, Volume 11, Issue 9 (2021). doi:10.1093/g3journal/jkab216, retrieved from: <https://github.com/tdseher/addtag-project>.
3. Thaddeus D. Seher. A computational approach for microbial genome editing. eScholarship: UC Merced Electronic Theses and Dissertations (2021). item:uc/item/4rd9215f.

✍ Authors

Who do I talk to?

Aaron D. Hernday (🔬 PI leading the project)
Thaddeus D. Seher (💻 programmer) (💬 @tdseher)

See also the list of contributors who participated in this project.

👥 Contributing

🤔 What can I do to help improve AddTag?

We are always looking for ways to broaden the usability of the AddTag software. Here is a list of things that would be great contributions.

Improvements to the documentation, such as additional example workflows.
More Target motifs (SPACER≷PAM combinations) from new CRISPR/Cas literature to add to the motifs.txt file.
Support for additional pairwise sequence Aligners.
Support for additional scoring Algorithms.
Support for additional thermodynamics calculators.
Running AddTag on different types of genomes with different parameters to test proper logic and assess compatibilities.

🐞 How do I submit a bug report?

First, check to see if the problem you are having has already been added to the issue tracker. If not, then please submit a new issue.

⚠ How do I make a feature request?

Send a message to @tdseher.

⤴ How do I add my code to the AddTag software?

Please submit a pull request.

📈 Adding scoring Algorithms

Scoring Algorithms have been broken down into two general types.

SingleSequenceAlgorithm objects calculate scores by comparing a potential RNA or DNA to a model trained on empirical data.
PairedSequenceAlgorithm instances generate scores that compare a potential RNA to a DNA.

To add a new scoring algorithm, you must subclass one of the above types, and add it to a *.py file in the source/algorithms/ subdirectory. AddTag will automatically calculate the score on every generated.

We welcome any git pull requests to widen the repertoire of scoring algorithms available to AddTag. The easiest way to get started is to copy and modify one of the provided subclasses.

📐 Adding sequence Aligners

AddTag comes with wrappers for several alignment programs. Depending on your experimental design and computing system, you may decide to use an aligner with no included wrapper. To implement your own, create a subclass of Aligner, and put it in a *.py file in the source/aligners/ subdirectory. AddTag will automatically make that aligner available for you.

Share your code with us so we can make it available to all AddTag users.

🌡 Adding Thermodynamics calculators

Several wrappers to popular oligonucleotide conformation, free energy, and melting temperature calculation programs are included. You can add your own by subclassing the Oligo class, and then adding its *.py file to the source/thermodynamics/ subdirectory.

If you create your own wrapper, please submit a git pull request so we can add it to the next version of the software.

📖 License

Please see the LICENSE.md file.

Notes

Below are tips and descriptions of AddTag limitations that will help you make successful designs.

If you are identifying cPCR primers, then it is often useful to use the --cache option. This lets you decrease the stringency of the PCR conditions and run the generate_primers subroutine again, pointing to the same --folder, and AddTag will use the results from the previous calculations when it can instead of doing the computations from scratch.
The RGN protein you use should be engineered specifically for your organism. If you are using a eukaryotic system, the RGN should contain an appropriate nuclear localization sequence. To determine a codon-optimized sequence for your experimental organism, you can use Simple Codon Optimizer.
By default, AddTag will avoid designing homology regions and Targets against polymorphisms whenever possible.
Sequences in FASTA files should have unique names. In other words, the primary sequence identifier (everything following the '>' character and preceding the first whitespace or tab character) should exist only once across all input *.fasta files.
AddTag makes no effort to restrict which Target motifs the user can use according to the selected Algorithms. Therefore, the user needs to independently verify which Target motifs are compatible with the selected Algorithms.
Right now AddTag can only handle linear chromosomes. If you want to analyze a circular chromosome, then you will need to artificially concatenate the ends of the chromosome together and adjust any annotations before running AddTag. Features and their flanking regions cannot span the junction created when the contig end is concatenated to the start (typically the starting position on a contig is labeled the ORIGIN). To address this, the user should manually shift the coordinates of the experimental Features, and wrap the contigs as appropriate.
AddTag assumes one Feature copy per contig. The current implementation of AddTag assumes homology regions around Features are not repeated across any one contig. This means that it will fail to generate cPCR oligos for a large proportion of genes in transposon-rich genomes such as wheat. This limitation is currently a result of both the in silico recombination and the primer identification routine. If there are tandem Features on a contig, then the sF and sR primers are likely duplicated across these adjacent loci. The shared primers thus can't specifically amplify one of the tandem duplications and not the other.
AddTag uses the in silico recombination phase of the generate_primers subroutine to determine if flanking homology regions of dDNAs are too repetitive across the genome (ideally, this would be performed in the generate_all subroutine).
A single Feature cannot span two or more contigs (partially a limitation of the GFF format). AddTag assumes that the entire feature sequence, and any flanking regions, are not in terminal regions of the reference contig.
AddTag does not address overlapping genes, such as when an intron contains an exon for another gene, or when the same DNA encodes for genes on opposite strands. Everything between the Feature bounds is removed in the first engineering step. Currently, if the selected Feature overlaps with any other feature, only the selected Feature is considered. The other Feature will be disrupted. AddTag will report a warning that these other Features may be disrupted, but it does not attempt to reconcile this in any way. However, AddTag does have the ability to limit Feature expansion to keep the deletion outside of neighboring Features.
AddTag was not designed to perform paired Cas design, such as FokI-dCas9 nickase. You would need to run the program and select two gRNAs designed for opposite strands within a certain distance from each other. Alternatively, you could probably make some really-long Target motif. One way to mitigate errors is to use PAM-out nickases. This requires Cas9 cutting by two targets to get a double-stranded break. This significantly decreases off-target genome editing. However, this initial AddTag version does not explicitly facilitate this.
AddTag can only identify cut sites for Cas enzymes that require a PAM site. No functionality is provided for finding sites without an adjacent PAM sequence: AddTag requires motifs to define a PAM sequence, so Cas14a is not supported. This can probably be circumvented by using an N character as the PAM sequence, but this hasn't been tested. The number of CRISPR/Cas genome editing technologies is rapidly growing. With the recent discovery of Cas14a, which targets single-stranded DNA (ssDNA) molecules without requiring a PAM site, the expanded prevalence of CRISPR/Cas methods in biological sciences is assured. However, often researchers wish to edit sites on double-stranded DNA (dsDNA) using an RGN (such as Cas9 or Cas12a) that requires binding to a PAM motif.
Please note, that at this time, no special restriction sites will be taken into account when designing primers.
For simplicity, all calculated scores ignore terms dealing with proximity to exon/CDS/ORF sequences. In cases such as the Stemmer and Azimuth calculations, the authors attempted to include the risk of disrupting genes neighboring potential targets in their models. We don’t attempt to do this.
Additionally, some scoring Algorithms take chromatin structure (DNA accessibility) into account. For simplicity, AddTag treats all input gDNA as equally accessible.
During the course of writing this software, a paper was published that outlines how hairpins can be inserted into the pre-spacer and spacer regions of the gRNA in order to increase specificity. AddTag does not model pre-spacer sequences.
AddTag assumes the RGN template type is dsDNA. AddTag was designed specifically to enable efficient gDNA editing. It does not use predictive models for ssDNA or RNA templates.
A corollary of this is that AddTag assumes all input sequences are DNA sequences. So the --fasta file specified will be treated as a DNA template. Thus, if there are any non-DNA residues, such as U, AddTag will probably fail. Also, since the Primer thermodynamics calculators are all set to estimate DNA:DNA hybridization (not DNA:RNA or RNA:RNA), any resulting calculations will be incorrect.
Since Bartag motifs are user-specified, simple pre-computed lists of compatible 'bartag' sequences would be incomplete. Thus we implemented a greedy 'bartag' generation algorithm. When evaluating candidate 'bartag' sequences, AddTag will keep 'bartags' that satisfy all edit distance requirements with all previously-accepted 'bartags'. To limit runtime to a reasonable amount, we limited the total number of Features and 'bartags' that can be generated.
Of special note are things the Primer design does not explicitly consider, such as characteristics of the cPCR template molecule. AddTag does not exploit the differential nature of template sequence composition (e.g. H. sapiens compared to E. coli). Also, AddTag does not use information on the presence of known secondary modifications to the template, such as methylated residues or oxidative damage.
One of the big limitations of this version of AddTag is that the Primer attribute stringencies are held uniform across all regions. You specify this using the --cycle_start N and --cycle_stop N options. If any one of the desired Primer Pairs is not found under the selected stringency, then no simulated annealing is performed. Cycle values N range from 0 to 21, with 0 being the most restrictive, and 21 being the most permissive. Due to the brute-force nature of the Primer Pair calculations, increasing N will exponentially increase the amount of memory needed to evaluate primers. So if you increase the cycles, be sure to monitor system RAM.
To facilitate more straightforward programming, AddTag outputs 0-based genomic coordinates (as opposed to traditional 1-based coordinates). All input data, such as GFF files, are expected to use 1-based genomic coordinates.
If Algorithm columns in the STDOUT of the generate_all subroutine return 0.0, then a likely cause is that the Algorithm prerequisites are not correctly installed. For instance, if Azimuth scores are all 0.0 on a Linux machine, then you might be missing the python-tk system package. In this case, try to run source/algorithms/addtag_wrapper.py in isolation to troubleshoot the problem.
The 'forward' and 'reverse' cognomina are absolute to the input contig coordinates. For instance, the 'sF' and 'sR' Primers are not relative to the orientation of the Feature defined in the input GFF. Instead, 'sF' is earlier in the contig (lower number), and 'sR' is later in the contig (higher number).
In this current version, AddTag's generate_primers subroutine assumes that there is a double-stranded break (DSB) in the gDNA between the locations the dDNA homology arms are similar to. Furthermore, it assumes these DSBs are repaired perfectly through homology-directed repair (HDR). This makes sense in our experimental biological system C. albicans. In other systems, such as H. sapiens, there is a higher amount of error-prone DSB repair.
Run time is a function of the number of potential primers that need to be analyzed. Thus, genes that are longer have more potential primers. Also, the number of potential primers actually analyzed depends on the sequence composition of each region. If a region has great complexity, then more primers will be analyzed with the full suite of filters, and the analysis will take longer. If a region has little complexity, then more potential primers will be discarded at early filters, and the analysis will take less time.
In rare cases, endogenous RNA may bind to the RGN to drive cutting at non-target loci. AddTag does not screen the input gDNA for this possibility because it does not analyze the scaffold section of the gRNA.
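Two of the notes above (unique FASTA sequence names, and DNA-only input) can be checked before starting a long run. The following Python sketch is a stand-alone pre-flight check and is not part of AddTag; the function name check_fasta and the choice of accepted characters (the IUPAC DNA codes, in either case) are our assumptions.

```python
from collections import Counter

DNA_ALPHABET = set("ACGTRYMKWSBDHVN")  # IUPAC DNA codes, including ambiguity characters

def check_fasta(paths):
    """Return (duplicated identifiers, {identifier: unexpected residues}) across FASTA files."""
    counts = Counter()
    bad_residues = {}
    for path in paths:
        current = None
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\r\n")
                if line.startswith(">"):
                    # Primary identifier: text after '>' up to the first whitespace
                    header = line[1:].split(None, 1)
                    current = header[0] if header else ""
                    counts[current] += 1
                elif current is not None and line:
                    # Case is ignored so that case-masked sequence is accepted
                    unexpected = set(line.upper()) - DNA_ALPHABET
                    if unexpected:
                        bad_residues.setdefault(current, set()).update(unexpected)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return duplicates, bad_residues
```

Both return values being empty suggests the input satisfies these two notes. For example, an RNA sequence containing U would be reported under its identifier in the second return value.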
