| Type: | Package |
| Title: | Basic Biological Sequence Handling |
| Version: | 2.1.7 |
| Date: | 2025-09-17 |
| Maintainer: | Lars Snipen <lars.snipen@nmbu.no> |
| Description: | Basic functions for microbial sequence data analysis. The idea is to use generic R data structures as much as possible, making R data wrangling possible also for sequence data. |
| License: | GPL-2 |
| Depends: | R (≥ 4.0.0), tibble, stringr, dplyr, rlang |
| Suggests: | R.utils |
| Imports: | Rcpp (≥ 0.12.0), data.table (≥ 1.9.8) |
| ZipData: | TRUE |
| LinkingTo: | Rcpp (≥ 0.12.0) |
| URL: | https://github.com/larssnip/microseq |
| RoxygenNote: | 7.3.2 |
| NeedsCompilation: | yes |
| Packaged: | 2025-09-18 08:00:23 UTC; larssn |
| Author: | Lars Snipen [aut, cre], Kristian Hovde Liland [aut] |
| Repository: | CRAN |
| Date/Publication: | 2025-09-18 08:30:02 UTC |
Replace amino acids with codons
Description
Replaces aligned amino acids with their original codon triplets.
Usage
backTranslate(aa.msa, nuc.ffn)
Arguments
aa.msa | A fasta object with a multiple alignment, see |
nuc.ffn | A fasta object with the coding sequences, see |
Details
This function replaces the aligned amino acids in aa.msa with their original codon triplets. This is possible only when the nucleotide sequences in nuc.ffn are the exact nucleotide sequences behind the protein sequences that are aligned in aa.msa.
It is required that the first token of the ‘Header’ lines is identical for a protein sequence in aa.msa and its nucleotide version in ‘nuc.ffn’, otherwise it is impossible to match them. Because matching is done on this token, the sequences need not appear in the same order in the two input fasta objects.
When aligning coding sequences, one should in general always align their protein sequences, to keep the codon structure, and then use backTranslate to convert this into a nucleotide alignment, if required.
If the nucleotide sequences contain the stop codons, these will be removed.
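The replacement described above can be illustrated with a small base-R sketch. The helper back_translate_one is hypothetical and is not the implementation used by the package; it assumes upper-case sequences, a single sequence pair, and the stop codons of translation table 11.

```r
# Hypothetical helper (not part of microseq): back-translate one aligned
# protein sequence, given the unaligned coding sequence behind it.
back_translate_one <- function(aa.aligned, nuc) {
  # Split the coding sequence into codons (triplets)
  n.codons <- nchar(nuc) %/% 3
  first <- seq(1, by = 3, length.out = n.codons)
  codons <- substring(nuc, first, first + 2)
  # Remove a terminal stop codon, if present (table 11 stops assumed)
  if (n.codons > 0 && codons[n.codons] %in% c("TAA", "TAG", "TGA")) {
    codons <- codons[-n.codons]
  }
  aa <- strsplit(aa.aligned, "")[[1]]
  is.gap <- aa == "-"
  # The number of amino acids must equal the number of codons
  stopifnot(sum(!is.gap) == length(codons))
  out <- rep("---", length(aa))   # every gap becomes a gap triplet
  out[!is.gap] <- codons          # every amino acid becomes its codon
  paste(out, collapse = "")
}

nuc.aligned <- back_translate_one("M-KT", "ATGAAAACGTAA")
print(nuc.aligned)   # "ATG---AAAACG"
```

The check inside the helper reflects the requirement that the nucleotide sequence is exactly the one behind the aligned protein.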
Value
A fasta object similar to aa.msa, but where each amino acid has been replaced by its corresponding codon. All gaps ‘"-"’ are replaced by triplets ‘"---"’.
Author(s)
Lars Snipen.
See Also
Examples
msa.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.msa")
aa.msa <- readFasta(msa.file)
nuc.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.ffn")
nuc <- readFasta(nuc.file)
nuc.msa <- backTranslate(aa.msa, nuc)
Finding coding genes
Description
Finding coding genes in genomic DNA using the Prodigal software.
Usage
findGenes(
  genome,
  prodigal.exe = "prodigal",
  faa.file = "",
  ffn.file = "",
  proc = "single",
  trans.tab = 11,
  mask.N = FALSE,
  bypass.SD = FALSE
)
Arguments
genome | A table with columns Header and Sequence, containing the genome sequence(s). |
prodigal.exe | Command to run the external software prodigal on the system (text). |
faa.file | If provided, prodigal will output all proteins to this fasta-file (text). |
ffn.file | If provided, prodigal will output all DNA sequences to this fasta-file (text). |
proc | Either |
trans.tab | Either 11 or 4 (see below). |
mask.N | Turn on masking of N's (logical) |
bypass.SD | Bypass Shine-Dalgarno filter (logical) |
Details
The external software Prodigal is used to scan through a prokaryotic genome to detect the protein coding genes. The text in prodigal.exe must contain the exact command to invoke prodigal on the system.
In addition to the standard output from this function, FASTA files with protein and/or DNA sequences may be produced directly by providing filenames in faa.file and ffn.file.
The input proc allows you to specify if the input data should be treated as a single genome (default) or as a metagenome. In the latter case the genome sequences are (un-binned) contigs.
The translation table is by default 11 (the bacterial and archaeal code), but table 4 should be used for Mycoplasma etc.
The mask.N option will prevent genes having runs of N inside. The bypass.SD option turns off the search for a Shine-Dalgarno motif.
Value
A GFF-table (see readGFF for details) with one row for each detected coding gene.
Note
The prodigal software must be installed on the system for this function to work, i.e. the command ‘system("prodigal -h")’ must be recognized as a valid command if you run it in the Console window.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
## Not run:
# This example requires the external prodigal software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Searching for coding sequences, this is Mycoplasma (trans.tab = 4)
genome <- readFasta(genome.file)
gff.tbl <- findGenes(genome, trans.tab = 4)
# Retrieving the sequences
cds.tbl <- gff2fasta(gff.tbl, genome)
# You may use the pipe operator
library(ggplot2)
readFasta(genome.file) %>%
  findGenes(trans.tab = 4) %>%
  filter(Score >= 50) %>%
  ggplot() +
  geom_histogram(aes(x = Score), bins = 25)
## End(Not run)
Finding ORFs in genomes
Description
Finds all ORFs in prokaryotic genome sequences.
Usage
findOrfs(genome, circular = F, trans.tab = 11)
Arguments
genome | A fasta object (see |
circular | Logical indicating if the genome sequences are completed, circular sequences. |
trans.tab | Translation table. |
Details
A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless trans.tab = 4, see below). This function will locate all such ORFs in a genome.
The argument genome is a fasta object, i.e. a table with columns ‘Header’ and ‘Sequence’, and will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (characters before first space) of every ‘Header’ is unique, since this will be used to identify these genome sequences in the output.
By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the orf.table returned by this function this is marked in the ‘Attributes’ column. The texts "Truncated=10" or "Truncated=01" indicate truncation at the beginning or end of the genomic sequence, respectively. If the supplied genome is a completed genome, with circular chromosome/plasmids, set the flag circular = TRUE and no truncated ORFs will be listed. In cases where an ORF runs across the origin of a circular genome sequence, the stop coordinate will be larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where a ‘Start’ cannot be larger than the corresponding ‘End’.
An alternative translation table may be specified, and as of now the only alternative implemented is table 4. This means codon TGA is no longer a stop, but codes for Tryptophan. This coding is used by some bacteria (e.g. under the orders Entomoplasmatales and Mycoplasmatales).
Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function will return all such nested ORFs, i.e. the same stop position may appear multiple times. If you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function with lorfs.
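To make the ORF definition concrete, here is a minimal base-R sketch that scans the three forward reading frames of a single sequence. The function find_orfs_forward is hypothetical and much simpler than findOrfs: it assumes upper-case DNA, translation table 11, the forward strand only, and it ignores truncated ORFs and circular genomes.

```r
# Hypothetical helper (not part of microseq): all complete ORFs on the
# forward strand of one DNA string, using the table 11 start/stop codons.
find_orfs_forward <- function(dna) {
  starts <- c("ATG", "GTG", "TTG")
  stops  <- c("TAA", "TGA", "TAG")
  res <- list()
  for (frame in 0:2) {
    n.codons <- (nchar(dna) - frame) %/% 3
    if (n.codons < 2) next
    pos <- frame + seq(1, by = 3, length.out = n.codons)  # codon start positions
    codons <- substring(dna, pos, pos + 2)
    stop.idx <- which(codons %in% stops)
    for (i in which(codons %in% starts)) {
      nxt <- stop.idx[stop.idx > i]   # stop codons downstream, same frame
      if (length(nxt) > 0) {
        res[[length(res) + 1]] <- data.frame(Start = pos[i], End = pos[nxt[1]] + 2)
      }
    }
  }
  do.call(rbind, res)   # NULL if no ORF was found
}

orfs <- find_orfs_forward("CCATGAAAGTGCCCTAAGG")
print(orfs)
```

In this toy sequence both the ATG at position 3 and the GTG at position 9 lie in the same frame as the TAA ending at position 17, so two nested ORFs sharing one stop position are reported.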
Value
This function returns an orf.table, which is simply a tibble with columns adhering to the GFF3 format specifications (a gff.table), see readGFF. If you want to retrieve the actual ORF sequences, use gff2fasta.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)
# Pipeline for finding LORFs of minimum length 100 amino acids
# and collecting their sequences from the genome
findOrfs(genome) %>%
  lorfs() %>%
  filter(orfLength(., aa = TRUE) > 100) %>%
  gff2fasta(genome) -> lorf.tbl
Finding rRNA genes
Description
Finding rRNA genes in genomic DNA using the barrnap software.
Usage
findrRNA(genome, barrnap.exe = "barrnap", bacteria = TRUE, cpu = 1)
Arguments
genome | A table with columns Header and Sequence, containing the genome sequence(s). |
barrnap.exe | Command to run the external software barrnap on the system (text). |
bacteria | Logical, the genome is either from a bacterium (default) or an archaeon. |
cpu | Number of CPUs to use, default is 1. |
Details
The external software barrnap is used to scan through a prokaryotic genome to detect the rRNA genes (5S, 16S, 23S). The text in barrnap.exe must contain the exact command to invoke barrnap on the system.
Value
A GFF-table (see readGFF for details) with one row for each detected rRNA sequence.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
## Not run:
# This example requires the external barrnap software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Searching for rRNA sequences, and inspecting
genome <- readFasta(genome.file)
gff.tbl <- findrRNA(genome)
print(gff.tbl)
# Retrieving the sequences
rRNA <- gff2fasta(gff.tbl, genome)
## End(Not run)
Retrieving annotated sequences
Description
Retrieving from a genome the sequences specified in a gff.table.
Usage
gff2fasta(gff.table, genome)
Arguments
gff.table | A |
genome | A fasta object ( |
Details
Each row in gff.table (see readGFF) describes a genomic feature in the genome, which is a tibble with columns ‘Header’ and ‘Sequence’. The information in the columns Seqid, Start, End and Strand is used to retrieve the sequences from the ‘Sequence’ column of genome. Every Seqid in the gff.table must match the first token in one of the ‘Header’ texts, in order to retrieve from the correct ‘Sequence’.
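The use of Start, End and Strand can be sketched in base R for a single feature. The helper extract_feature is hypothetical, not the package code; it assumes 1-based inclusive coordinates as in GFF3, upper-case A, C, G and T only, and that features on the "-" strand are returned reverse-complemented.

```r
# Hypothetical helper (not part of microseq): retrieve one feature from a
# genome sequence given its GFF coordinates and strand.
extract_feature <- function(sequence, start, end, strand) {
  sub <- substr(sequence, start, end)
  if (strand == "-") {
    # Complement the bases, then reverse the order of the characters
    comp <- chartr("ACGT", "TGCA", sub)
    sub <- paste(rev(strsplit(comp, "")[[1]]), collapse = "")
  }
  sub
}

genome.seq <- "AACCGGTTAC"
fwd <- extract_feature(genome.seq, 3, 6, "+")    # "CCGG"
rev.str <- extract_feature(genome.seq, 7, 10, "-")   # "GTAA"
print(c(fwd, rev.str))
```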
Value
A fasta object with one row for each row in gff.table. The Header for each sequence is a summary of the information in the corresponding row of gff.table.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# Using two files in this package
gff.file <- file.path(path.package("microseq"),"extdata","small.gff")
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading the genome first
genome <- readFasta(genome.file)
# Retrieving sequences
gff.table <- readGFF(gff.file)
fa.tbl <- gff2fasta(gff.table, genome)
# Alternative, using piping
readGFF(gff.file) %>%
  gff2fasta(genome) -> fa.tbl
Extended gregexpr with substring retrieval
Description
An extension of the function base::gregexpr enabling retrieval of the matching substrings.
Usage
gregexpr(
  pattern,
  text,
  ignore.case = FALSE,
  perl = FALSE,
  fixed = FALSE,
  useBytes = FALSE,
  extract = FALSE
)
Arguments
pattern | Character string containing a regular expression (or character string for |
text | A character vector where matches are sought, or an object which can be coerced by |
ignore.case | If |
perl | Logical. Should perl-compatible regexps be used? Has priority over |
fixed | Logical. If |
useBytes | Logical. If |
extract | Logical indicating if matching substrings should be extracted and returned. |
Details
Extended version of base::gregexpr that enables the return of the substrings matching the pattern. The last argument ‘extract’ is the only difference to base::gregexpr. The default behaviour is identical to base::gregexpr, but setting extract = TRUE means the matching substrings are returned.
Value
It will either return what base::gregexpr would (extract = FALSE) or a ‘list’ of substrings matching the pattern (extract = TRUE). There is one ‘list’ element for each string in ‘text’, and each list element contains a character vector of all matching substrings in the corresponding entry of ‘text’.
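For comparison, a list of this shape can also be produced with base R alone by combining base::gregexpr with regmatches. This is only an illustration of the returned structure, and it is an assumption that the result agrees with what extract = TRUE returns in every case.

```r
# Base R only: matching substrings via gregexpr() + regmatches()
sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG")
hits <- base::gregexpr("ATG", sequences)    # match positions, one list element per string
matches <- regmatches(sequences, hits)      # the matching substrings themselves
print(matches)
print(lengths(matches))   # 2 matches in the first string, 1 in the second
```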
Author(s)
Lars Snipen and Kristian Liland.
See Also
Examples
sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG")
gregexpr("ATG", sequences, extract = TRUE)
Ambiguity symbol conversion
Description
Converting DNA ambiguity symbols to regular expressions, and vice versa.
Usage
iupac2regex(sequence)
regex2iupac(sequence)
Arguments
sequence | Character vector containing DNA sequences. |
Details
The DNA alphabet may contain ambiguity symbols, e.g. a W means either A or T. When using a regular expression search, these letters must be replaced by the proper regular expression, e.g. W is replaced by [AT] in the string. The iupac2regex makes this translation, while regex2iupac converts the other way again (replace [AT] with W).
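The translation from ambiguity symbols to character classes can be sketched in base R as follows. The lookup table iupac_map and the function iupac_to_regex are hypothetical names; the order of bases inside each bracket is a choice made here and may differ from what iupac2regex produces.

```r
# Hypothetical lookup table of the IUPAC DNA ambiguity symbols
iupac_map <- c(W = "[AT]", S = "[CG]", R = "[AG]", Y = "[CT]",
               K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
               H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper (not part of microseq): replace each ambiguity
# symbol by its character class. The replacements contain only A, C, G, T
# and brackets, so earlier substitutions are never altered by later ones.
iupac_to_regex <- function(sequence) {
  for (sym in names(iupac_map)) {
    sequence <- gsub(sym, iupac_map[[sym]], sequence, fixed = TRUE)
  }
  sequence
}

pattern <- iupac_to_regex("ACWGT")
print(pattern)                                # "AC[AT]GT"
print(grepl(pattern, c("ACAGT", "ACCGT")))    # TRUE FALSE
```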
Value
A string where the ambiguity symbol has been replaced by a regular expression (iupac2regex) or a regular expression has been replaced by an ambiguity symbol (regex2iupac).
Author(s)
Lars Snipen.
Examples
iupac2regex("ACWGT")
regex2iupac("AC[AG]GT")
Longest ORF
Description
Filtering an orf.table with ORF information to keep only the LORFs.
Usage
lorfs(orf.tbl)
Arguments
orf.tbl | A |
Details
For every stop-codon there are usually multiple possible start-codons in the same reading frame (nested ORFs). The LORF (Longest ORF) is defined as the longest of these nested ORFs, i.e. the ORF starting at the most upstream start-codon matching the stop-codon.
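The LORF definition can be expressed as a small filter in base R. The function keep_lorfs below is a hypothetical sketch, not the code behind lorfs; it assumes a table with columns Seqid, Start, End and Strand, and that the stop codon sits at End on the "+" strand and at Start on the "-" strand.

```r
# Hypothetical helper (not part of microseq): among ORFs sharing the same
# stop codon, keep only the longest one.
keep_lorfs <- function(orf.tbl) {
  # Position of the stop codon depends on the strand
  stop.pos <- ifelse(orf.tbl$Strand == "+", orf.tbl$End, orf.tbl$Start)
  len <- orf.tbl$End - orf.tbl$Start + 1
  key <- paste(orf.tbl$Seqid, stop.pos, orf.tbl$Strand, sep = ";")
  longest <- ave(len, key, FUN = max)   # longest length within each group
  orf.tbl[len == longest, ]
}

toy.orfs <- data.frame(Seqid  = "contig1",
                       Start  = c(3, 9, 20, 20),
                       End    = c(17, 17, 40, 46),
                       Strand = c("+", "+", "-", "-"),
                       stringsAsFactors = FALSE)
toy.lorfs <- keep_lorfs(toy.orfs)
print(toy.lorfs)   # rows 1 and 4 remain
```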
Value
A tibble with a subset of the rows of the argument orf.tbl. After this filtering the Type variable in orf.tbl is changed to "LORF". If you want to retrieve the LORF sequences, use gff2fasta.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# See the example in the Help-file for findOrfs.
Basic Biological Sequence Analysis
Description
A collection of functions for basic analysis of microbial sequence data.
Usage
microseq()
Details
| Package: | microseq |
| Type: | Package |
| Version: | 2.1.7 |
| Date: | 2025-09-17 |
| License: | GPL-2 |
Author(s)
Lars Snipen, Kristian Hovde Liland
Maintainer: Lars Snipen <lars.snipen@nmbu.no>
Convert alignment to matrix
Description
Converts a FASTA formatted multiple alignment to a matrix.
Usage
msa2mat(msa)
Arguments
msa | A fasta object with a multiple alignment, see |
Details
This function converts the fasta object msa, containing a multiple alignment, to a matrix. This means each position in the alignment is a column in the matrix, and the content of the ‘Header’ column of msa is used as rownames of the matrix.
Such a matrix is useful for conversion to a DNAbin object that is used by the ape package for re-constructing phylogenetic trees.
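The conversion can be illustrated with base R on a toy alignment. This is a sketch of the idea only, and the toy data frame toy.msa is made up for the illustration.

```r
# Toy alignment with the two columns of a fasta object
toy.msa <- data.frame(Header   = c("seq1", "seq2"),
                      Sequence = c("AC-GT", "ACCGT"),
                      stringsAsFactors = FALSE)

# One row per sequence, one column per alignment position
toy.mat <- do.call(rbind, strsplit(toy.msa$Sequence, ""))
rownames(toy.mat) <- toy.msa$Header
print(toy.mat)
print(dim(toy.mat))   # 2 rows, 5 columns
```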
Value
A matrix where each row is a vector of aligned bases/amino acids.
Author(s)
Lars Snipen.
See Also
Examples
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa")
msa <- readFasta(msa.file)
msa.mat <- msa2mat(msa)  # to use with ape::as.DNAbin(msa.mat)
Trimming multiple sequence alignments
Description
Trimming a multiple sequence alignment by discarding columns with too many gaps.
Usage
msaTrim(msa, gap.end = 0.5, gap.mid = 0.9)
Arguments
msa | A fasta object containing a multiple alignment. |
gap.end | Fraction of gaps tolerated at the ends of the alignment (0-1). |
gap.mid | Fraction of gaps tolerated inside the alignment (0-1). |
Details
A multiple alignment is trimmed by removing columns with too many indels (gap-symbols). Any columns containing a fraction of gaps larger than gap.mid are discarded. For this reason, gap.mid should always be fairly close to 1.0, otherwise too many columns may be discarded, destroying the alignment.
Due to the heuristics of multiple alignment methods, both ends of the alignment tend to be uncertain and most of the trimming should be done at the ends. Starting at each end, columns are discarded as long as their fraction of gaps surpasses gap.end. Typically gap.end can be much smaller than gap.mid, but if set too low you risk that all columns are discarded!
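The two thresholds can be illustrated with a base-R sketch operating on a character vector of aligned sequences. The function trim_msa is hypothetical and only mirrors the description above; details such as the handling of ties may differ from msaTrim.

```r
# Hypothetical helper (not part of microseq): trim alignment columns by
# their fraction of gaps.
trim_msa <- function(sequences, gap.end = 0.5, gap.mid = 0.9) {
  mat <- do.call(rbind, strsplit(sequences, ""))
  gap.frac <- colMeans(mat == "-")          # fraction of gaps per column
  # From each end, discard columns while the gap fraction surpasses gap.end
  ok <- which(gap.frac <= gap.end)
  if (length(ok) == 0) return(rep("", length(sequences)))
  keep <- seq(min(ok), max(ok))
  # Inside the retained region, discard columns with more gaps than gap.mid
  keep <- keep[gap.frac[keep] <= gap.mid]
  apply(mat[, keep, drop = FALSE], 1, paste, collapse = "")
}

toy.aln <- c("--ACG-T-", "-TAC--TA", "--ACG-T-", "---CG-TA")
toy.trimmed <- trim_msa(toy.aln)
print(toy.trimmed)
```

In this toy alignment the first two columns are removed by the end rule and the all-gap sixth column by the inside rule, leaving five columns.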
Value
The trimmed alignment is returned as a fasta object.
Author(s)
Lars Snipen.
See Also
Examples
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa")
msa <- readFasta(msa.file)
print(str_length(msa$Sequence))
msa.trimmed <- msaTrim(msa)
print(str_length(msa.trimmed$Sequence))
msa.mat <- msa2mat(msa)  # for use with ape::as.DNAbin(msa.mat)
Multiple alignment
Description
Quickly computing a smallish multiple sequence alignment.
Usage
msalign(fsa.tbl, machine = "microseq::muscle")
Arguments
fsa.tbl | A fasta object (data.frame or tibble) with input sequences. |
machine | Function that does the 'dirty work'. |
Details
This function computes a multiple sequence alignment given a set of sequences in a fasta object, see readFasta for more on fasta objects.
It is merely a wrapper for the function named in machine to avoid explicit writing and reading of files. This function should only be used for small data sets, since no result files are stored. For heavier jobs, use the machine function directly.
At present, the only machine function implemented is muscle, but other third-party machines may be included later.
Note that this function will run muscle with default settings, which is fine for small data sets.
Value
Results are returned as a fasta object, i.e. a tibble with columns ‘Header’ and ‘Sequence’.
Author(s)
Lars Snipen.
See Also
Examples
## Not run:
prot.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa")
faa <- readFasta(prot.file)
msa <- msalign(faa)
## End(Not run)
Multiple alignment using MUSCLE
Description
Computing a multiple sequence alignment using the MUSCLE software.
Usage
muscle(
  in.file,
  out.file,
  muscle.exe = "muscle",
  quiet = FALSE,
  diags = FALSE,
  maxiters = 16
)
Arguments
in.file | Name of FASTA file with input sequences. |
out.file | Name of file to store the result. |
muscle.exe | Command to run the external software muscle on the system (text). |
quiet | Logical, |
diags | Logical, |
maxiters | Maximum number of iterations. |
Details
The software MUSCLE (Edgar, 2004) must be installed and available on the system. The text in muscle.exe must contain the exact command to invoke muscle on the system.
By default diags = FALSE but can be set to TRUE to increase speed. This should be done only if sequences are highly similar.
By default maxiters = 16. If you have a large number of sequences (a few thousand), or they are very long, then this may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option maxiters = 2.
Value
The result is written to the file specified in out.file.
Author(s)
Lars Snipen.
References
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res, 32, 1792-1797.
See Also
Examples
## Not run:
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa")
muscle(in.file = fa.file, out.file = "delete_me.msa")
## End(Not run)
Length of ORF
Description
Computing the lengths of all ORFs in an orf.table.
Usage
orfLength(orf.table, aa = FALSE)
Arguments
orf.table | A GFF-formatted |
aa | Logical, length in amino acids instead of bases. |
Details
By default, computes the length of an ORF in bases, including the stop codon. However, if aa = TRUE, then the length is in amino acids after translation. This aa-length is the base-length divided by 3 and minus 1, unless the ORF is truncated and lacks a stop codon.
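The arithmetic can be written out in a few lines of base R. The function orf_length and its argument has.stop are hypothetical; in an orf.table the information about a missing stop codon is carried by the Attributes column rather than by a separate argument.

```r
# Hypothetical helper (not part of microseq): ORF length from coordinates.
orf_length <- function(start, end, aa = FALSE, has.stop = TRUE) {
  bases <- end - start + 1             # length in bases, stop codon included
  if (!aa) return(bases)
  bases / 3 - ifelse(has.stop, 1, 0)   # the stop codon is not an amino acid
}

print(orf_length(3, 17))                                  # 15 bases
print(orf_length(3, 17, aa = TRUE))                       # 4 amino acids
print(orf_length(3, 17, aa = TRUE, has.stop = FALSE))     # 5 amino acids
```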
Value
A vector of lengths.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# See the example in the Help-file for findOrfs.
Signature for each ORF
Description
Creates a signature text for ORFs in an orf.table.
Usage
orfSignature(orf.table, full = TRUE)
Arguments
orf.table | A |
full | Logical indicating type of signature. |
Details
A signature is a text that uniquely identifies each ORF in an orf.table, which is a GFF-table with columns Seqid, Start, End and Strand.
The full signature (full = TRUE) contains the Seqid, Start, End and Strand information for each ORF, separated by semicolon ";". This text is always unique to each ORF. If full = FALSE the Signature will not contain the starting position information for each ORF. This means all nested ORFs ending at the same stop-codon will then get identical Signatures. This is useful for identifying which ORFs are nested within the same LORF.
Note that the signature you get with full = FALSE contains Seqid, then End if on the positive Strand, Start otherwise, and then the Strand.
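Both signature types can be sketched in base R directly from this description. The function orf_signature is a hypothetical stand-in, and the exact text produced by orfSignature may differ in details.

```r
# Hypothetical helper (not part of microseq): signature texts built from
# the columns Seqid, Start, End and Strand.
orf_signature <- function(orf.tbl, full = TRUE) {
  if (full) {
    paste(orf.tbl$Seqid, orf.tbl$Start, orf.tbl$End, orf.tbl$Strand, sep = ";")
  } else {
    # Only the stop codon end: End on the + strand, Start on the - strand
    stop.pos <- ifelse(orf.tbl$Strand == "+", orf.tbl$End, orf.tbl$Start)
    paste(orf.tbl$Seqid, stop.pos, orf.tbl$Strand, sep = ";")
  }
}

sig.orfs <- data.frame(Seqid  = "contig1",
                       Start  = c(3, 9, 20),
                       End    = c(17, 17, 40),
                       Strand = c("+", "+", "-"),
                       stringsAsFactors = FALSE)
sig.full <- orf_signature(sig.orfs)
sig.reduced <- orf_signature(sig.orfs, full = FALSE)
print(sig.full)
print(sig.reduced)   # the two nested ORFs share one reduced signature
```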
Value
A text vector with the Signature for each ORF.
Author(s)
Lars Snipen.
See Also
Examples
# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)
# Compute signatures
signature.full <- orfSignature(orf.tbl)
signature.reduced <- orfSignature(orf.tbl, full = FALSE)
Read and write FASTA files
Description
Reads and writes biological sequences (DNA, RNA, protein) in the FASTA format.
Usage
readFasta(in.file)
writeFasta(fdta, out.file, width = 0)
Arguments
in.file | url/directory/name of (gzipped) FASTA file to read. |
fdta | A data.frame or tibble with sequence data, see ‘Details’ below. |
out.file | Name of (gzipped) FASTA file to create. |
width | Number of characters per line, or 0 for no line breaks. |
Details
These functions handle input/output of sequences in the commonly used FASTA format. For every sequence it is presumed there is one Header-line starting with a ‘>’. If filenames (in.file or out.file) have the extension .gz they will automatically be compressed/uncompressed. NOTE: This requires the R.utils R package.
The sequences are stored in a tibble, opening up all the possibilities in R for fast and easy manipulations. The content of the file is stored as two columns, ‘Header’ and ‘Sequence’. If other columns are added, these will be ignored by writeFasta.
The default width = 0 in writeFasta results in no line breaks in the sequences (one sequence per line).
Value
readFasta returns a tibble with the contents of the (gzipped) FASTA file stored in two columns of text. The first, named ‘Header’, contains the header lines and the second, named ‘Sequence’, contains the sequences.
writeFasta produces a (gzipped) FASTA file.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
## Not run:
# We need a FASTA-file to read, here is one example file:
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
# Read and write
fdta <- readFasta(fa.file)
ok <- writeFasta(fdta[4:5,], out.file = "delete_me.fasta")
# Make use of dplyr to copy parts of the file to another file
readFasta(fa.file) %>%
  filter(str_detect(Sequence, "TGA$")) %>%
  writeFasta(out.file = "TGAstop.fasta", width = 80) -> ok
## End(Not run)
Read and write FASTQ files
Description
Reads and writes files in the FASTQ format.
Usage
readFastq(in.file)
writeFastq(fdta, out.file)
Arguments
in.file | url/directory/name of (gzipped) FASTQ file to read. |
fdta | FASTQ object to write. |
out.file | url/directory/name of (gzipped) FASTQ file to write. |
Details
These functions handle input/output of sequences in the commonly used FASTQ format, typically used for storing DNA sequences (reads) after sequencing. If filenames (in.file or out.file) have the extension .gz they will automatically be compressed/uncompressed. NOTE: This requires the R.utils package.
The sequences are stored in a tibble, opening up all the possibilities in R for fast and easy manipulations. The content of the file is stored as three columns, ‘Header’, ‘Sequence’ and ‘Quality’. If other columns are added, these will be ignored by writeFastq.
Value
readFastq returns a tibble with the contents of the (gzipped) FASTQ file stored in three columns of text. The first, named ‘Header’, contains the header lines, the second, named ‘Sequence’, contains the sequences and the third, named ‘Quality’, contains the base quality scores.
writeFastq produces a (gzipped) FASTQ file.
Note
These functions will only handle files where each entry spans one single line, i.e. not the (uncommon) multiline FASTQ format.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
## Not run:
# We need a FASTQ-file to read, here is one example file:
fq.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.fastq.gz")
# Read and write
fdta <- readFastq(fq.file)
ok <- writeFastq(fdta[1:3,], out.file = "delete_me.fq")
# Make use of dplyr to copy parts of the file to another file
readFastq(fq.file) %>%
  mutate(Length = str_length(Sequence)) %>%
  filter(Length > 200) %>%
  writeFasta(out.file = "long_reads.fa") # writing to FASTA file
## End(Not run)
Reading and writing GFF-tables
Description
Reading or writing a GFF-table from/to file.
Usage
readGFF(in.file)
writeGFF(gff.table, out.file)
Arguments
in.file | Name of file with a GFF-table. |
gff.table | A table (data.frame or tibble) with genomic features information. |
out.file | Name of file. |
Details
A GFF-table is simply a tibble with columns adhering to the format specified by the GFF3 format, see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for details. There is one row for each feature.
The following columns should always be in a full gff.table of the GFF3 format:
Seqid. A unique identifier of the genomic sequence on which the feature resides.
Source. A description of the procedure that generated the feature, e.g. "R-package micropan::findOrfs".
Type. The type of feature, e.g. "ORF", "16S" etc.
Start. The leftmost coordinate. This is the start if the feature is on the Sense strand, but the end if it is on the Antisense strand.
End. The rightmost coordinate. This is the end if the feature is on the Sense strand, but the start if it is on the Antisense strand.
Score. A numeric score (E-value, P-value) from the Source.
Strand. A "+" indicates Sense strand, a "-" Antisense.
Phase. Only relevant for coding genes. The values 0, 1 or 2 indicate the reading frame, i.e. the number of bases to offset the Start in order to be in the reading frame.
Attributes. A single string with semicolon-separated tokens providing additional information.
Missing values are described by "." in the GFF3 format. This is also done here, except for the numerical columns Start, End, Score and Phase. Here NA is used, but this is replaced by "." when writing to file.
The readGFF function will also read files where sequences in FASTA format are added after the GFF-table. This file section must always start with the line ##FASTA. This fasta object is added to the GFF-table as an attribute (use attr(gff.tbl, "FASTA") to retrieve it).
Value
readGFF returns a gff.table with the columns described above.
writeGFF writes the supplied gff.table to a text-file.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
# Using a GFF file in this package
gff.file <- file.path(path.package("microseq"),"extdata","small.gff")
# Reading gff-file
gff.tbl <- readGFF(gff.file)
Reverse-complementation of DNA
Description
The standard reverse-complement of nucleotide sequences.
Usage
reverseComplement(nuc.sequences, reverse = TRUE)
Arguments
nuc.sequences | Character vector containing the nucleotide sequences. |
reverse | Logical indicating if complement should be reversed. |
Details
With ‘reverse = FALSE’ the DNA sequence is only complemented, not reversed.
This function will handle the IUPAC ambiguity symbols, i.e. ‘R’ is reverse-complemented to ‘Y’ etc.
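How ambiguity symbols pair up under complementation can be sketched with chartr in base R. The function rev_comp is hypothetical and assumes upper-case input; the symbols S, W and N are their own complements and are therefore left unchanged.

```r
# Hypothetical helper (not part of microseq): complement, and optionally
# reverse, DNA strings that may contain IUPAC ambiguity symbols.
rev_comp <- function(nuc.sequences, reverse = TRUE) {
  # Pairs: A-T, C-G, R-Y, K-M, B-V, D-H
  comp <- chartr("ACGTRYKMBVDH", "TGCAYRMKVBHD", nuc.sequences)
  if (!reverse) return(comp)
  vapply(strsplit(comp, ""),
         function(x) paste(rev(x), collapse = ""),
         character(1))
}

print(rev_comp("AACGR"))                    # "YCGTT"
print(rev_comp("AACGR", reverse = FALSE))   # "TTGCY"
```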
Value
A character vector of reverse-complemented sequences.
Author(s)
Lars Snipen and Kristian Hovde Liland.
See Also
Examples
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
fa <- readFasta(fa.file)
reverseComplement(fa$Sequence)
# Or, make use of dplyr to manipulate tables
readFasta(fa.file) %>%
  mutate(RevComp = reverseComplement(Sequence)) -> fa.tbl
Translation according to the standard genetic code
Description
The translation from DNA(RNA) to amino acid sequence according to the standard genetic code.
Usage
translate(nuc.sequences, M.start = TRUE, no.stop = TRUE, trans.tab = 11)
Arguments
nuc.sequences | Character vector containing the nucleotide sequences. |
M.start | A logical indicating if the amino acid sequence should start with M regardless of start codon. |
no.stop | A logical indicating if terminal stops (*) should be eliminated from the translated sequence |
trans.tab | Translation table, either 11 or 4 |
Details
Codons are by default translated according to translation table 11, i.e. the possible start codons are ATG, GTG or TTG and stop codons are TAA, TGA and TAG. The only alternative implemented here is translation table 4, which is used by some bacteria (e.g. Mycoplasma, Mesoplasma). If trans.tab is 4, the stop codon TGA is translated to W (Tryptophan).
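The codon lookup and the three arguments can be illustrated with a base-R sketch for a single sequence. The function translate_one is hypothetical, not the compiled code used by translate; it assumes upper-case DNA with A, C, G and T only, and builds the codon table from the amino acid string of the NCBI genetic code listing with bases ordered T, C, A, G.

```r
# Hypothetical helper (not part of microseq): translate one DNA string.
translate_one <- function(nuc, M.start = TRUE, no.stop = TRUE, trans.tab = 11) {
  bases <- c("T", "C", "A", "G")
  # Third base varies fastest, first base slowest (NCBI ordering)
  grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
  aa <- strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
  names(aa) <- paste0(grid$b1, grid$b2, grid$b3)
  if (trans.tab == 4) aa["TGA"] <- "W"       # TGA codes for Tryptophan in table 4
  n <- nchar(nuc) %/% 3
  pos <- seq(1, by = 3, length.out = n)
  prot <- aa[substring(nuc, pos, pos + 2)]
  if (M.start && n > 0) prot[1] <- "M"       # start codon always gives M
  if (no.stop && n > 0 && isTRUE(unname(prot[n]) == "*")) prot <- prot[-n]
  paste(prot, collapse = "")
}

print(translate_one("GTGAAATGGTGA"))                  # "MKW"
print(translate_one("GTGAAATGGTGA", trans.tab = 4))   # "MKWW"
```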
Value
A character vector of translated sequences.
Author(s)
Lars Snipen and Kristian Hovde Liland.
Examples
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
fa <- readFasta(fa.file)
translate(fa$Sequence)
# Or, make use of dplyr to manipulate tables
readFasta(fa.file) %>%
  mutate(Protein = translate(Sequence)) -> fa.tbl