Type: Package
Title: Basic Biological Sequence Handling
Version: 2.1.7
Date: 2025-09-17
Maintainer: Lars Snipen <lars.snipen@nmbu.no>
Description: Basic functions for microbial sequence data analysis. The idea is to use generic R data structures as much as possible, making R data wrangling possible also for sequence data.
License: GPL-2
Depends: R (≥ 4.0.0), tibble, stringr, dplyr, rlang
Suggests: R.utils
Imports: Rcpp (≥ 0.12.0), data.table (≥ 1.9.8)
ZipData: TRUE
LinkingTo: Rcpp (≥ 0.12.0)
URL: https://github.com/larssnip/microseq
RoxygenNote: 7.3.2
NeedsCompilation: yes
Packaged: 2025-09-18 08:00:23 UTC; larssn
Author: Lars Snipen [aut, cre], Kristian Hovde Liland [aut]
Repository: CRAN
Date/Publication: 2025-09-18 08:30:02 UTC

Replace amino acids with codons

Description

Replaces aligned amino acids with their original codon triplets.

Usage

backTranslate(aa.msa, nuc.ffn)

Arguments

aa.msa

A fasta object with a multiple alignment, see msalign.

nuc.ffn

A fasta object with the coding sequences, see readFasta.

Details

This function replaces the aligned amino acids in aa.msa with their original codon triplets. This is possible only when the nucleotide sequences in nuc.ffn are the exact nucleotide sequences behind the protein sequences that are aligned in aa.msa.

It is required that the first token of the ‘Header’ lines is identical for a protein sequence in aa.msa and its nucleotide version in nuc.ffn, otherwise it is impossible to match them. Because matching is done on this token, the sequences need not appear in the same order in the two input fasta objects.

When aligning coding sequences, one should in general always align their protein sequences, to keep the codon structure, and then use backTranslate to convert this into a nucleotide alignment, if required.

If the nucleotide sequences contain the stop codons, these will be removed.
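The codon replacement described here can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helpers first_token() and back_translate_one() are hypothetical names, the toy sequences are made up, and each nucleotide sequence is assumed to be exactly the coding sequence behind its aligned protein.

```r
# Hypothetical helper: first token of a Header (text before first whitespace)
first_token <- function(header) sub("\\s.*$", "", header)

# Hypothetical helper: replace each aligned amino acid by its codon
back_translate_one <- function(aa.aligned, nuc) {
  # Split the coding sequence into triplets (codons)
  pos <- seq(1, nchar(nuc) - 2, by = 3)
  codons <- substring(nuc, pos, pos + 2)
  # Split the aligned protein into single symbols
  aa <- strsplit(aa.aligned, "")[[1]]
  is.gap <- aa == "-"
  # Gaps become "---", amino acids take the codons in order.
  # A trailing stop codon is never used and is thereby dropped.
  out <- rep("---", length(aa))
  out[!is.gap] <- codons[seq_len(sum(!is.gap))]
  paste(out, collapse = "")
}

# Toy data (made up), deliberately in different order in the two tables
aa.msa <- data.frame(Header = c("seq2 protein", "seq1 protein"),
                     Sequence = c("MK-V", "M-KV"),
                     stringsAsFactors = FALSE)
nuc.ffn <- data.frame(Header = c("seq1 gene", "seq2 gene"),
                      Sequence = c("ATGAAAGTTTAA", "ATGAAGGTGTGA"),
                      stringsAsFactors = FALSE)

# Match on the first token of the Header, then back-translate row by row
idx <- match(first_token(aa.msa$Header), first_token(nuc.ffn$Header))
nuc.msa <- mapply(back_translate_one, aa.msa$Sequence, nuc.ffn$Sequence[idx],
                  USE.NAMES = FALSE)
print(nuc.msa)
```

Matching on the first token is what makes the row order of the two inputs irrelevant in this sketch.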

Value

A fasta object similar to aa.msa, but where each amino acid has been replaced by its corresponding codon. All gaps ‘"-"’ are replaced by triplets ‘"---"’.

Author(s)

Lars Snipen.

See Also

msalign, readFasta.

Examples

msa.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.msa")
aa.msa <- readFasta(msa.file)
nuc.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.ffn")
nuc <- readFasta(nuc.file)
nuc.msa <- backTranslate(aa.msa, nuc)

Finding coding genes

Description

Finding coding genes in genomic DNA using the Prodigal software.

Usage

findGenes(
  genome,
  prodigal.exe = "prodigal",
  faa.file = "",
  ffn.file = "",
  proc = "single",
  trans.tab = 11,
  mask.N = FALSE,
  bypass.SD = FALSE
)

Arguments

genome

A table with columns Header and Sequence, containing the genome sequence(s).

prodigal.exe

Command to run the external software prodigal on the system (text).

faa.file

If provided, prodigal will output all proteins to this fasta-file (text).

ffn.file

If provided, prodigal will output all DNA sequences to this fasta-file (text).

proc

Either "single" or "meta", see below.

trans.tab

Either 11 or 4 (see below).

mask.N

Turn on masking of N's (logical)

bypass.SD

Bypass Shine-Dalgarno filter (logical)

Details

The external software Prodigal is used to scan through a prokaryotic genome to detect the protein coding genes. The text in prodigal.exe must contain the exact command to invoke prodigal on the system.

In addition to the standard output from this function, FASTA files with protein and/or DNA sequences may be produced directly by providing filenames in faa.file and ffn.file.

The input proc allows you to specify if the input data should be treated as a single genome (default) or as a metagenome. In the latter case the genome sequences are (un-binned) contigs.

The translation table is by default 11 (the bacterial, archaeal and plant plastid code, which shares its codon assignments with the standard code), but table 4 should be used for Mycoplasma etc.

Setting mask.N = TRUE prevents genes from having runs of N inside. Setting bypass.SD = TRUE turns off the search for a Shine-Dalgarno motif.

Value

A GFF-table (see readGFF for details) with one row for each detected coding gene.

Note

The prodigal software must be installed on the system for this function to work, i.e. the command ‘system("prodigal -h")’ must be recognized as a valid command if you run it in the Console window.
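Before calling a function that depends on external software, it may help to check that the command can be found at all. The following is a small base R sketch; has_command() is a hypothetical helper, not part of this package, and it only tests whether the command is on the PATH, not whether it runs correctly.

```r
# Hypothetical helper: TRUE if the command is found on the PATH
has_command <- function(cmd) {
  unname(nzchar(Sys.which(cmd)))
}

if (!has_command("prodigal")) {
  message("prodigal was not found on the PATH, findGenes() will not work")
}
```

The same check can be used for the other external commands mentioned in this manual, e.g. has_command("barrnap") or has_command("muscle").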

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readGFF, gff2fasta.

Examples

## Not run: 
# This example requires the external prodigal software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Searching for coding sequences, this is Mycoplasma (trans.tab = 4)
genome <- readFasta(genome.file)
gff.tbl <- findGenes(genome, trans.tab = 4)
# Retrieving the sequences
cds.tbl <- gff2fasta(gff.tbl, genome)
# You may use the pipe operator
library(ggplot2)
readFasta(genome.file) %>% 
  findGenes(trans.tab = 4) %>% 
  filter(Score >= 50) %>% 
  ggplot() +
  geom_histogram(aes(x = Score), bins = 25)
## End(Not run)

Finding ORFs in genomes

Description

Finds all ORFs in prokaryotic genome sequences.

Usage

findOrfs(genome, circular = F, trans.tab = 11)

Arguments

genome

A fasta object (see readFasta) with the genome sequence(s).

circular

Logical indicating if the genome sequences are completed, circular sequences.

trans.tab

Translation table.

Details

A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless trans.tab = 4, see below). This function will locate all such ORFs in a genome.
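The ORF definition can be made concrete with a small base R sketch. It is a simplified illustration only and makes several assumptions: find_orfs_forward() is a hypothetical helper, it scans the forward strand of one linear sequence only, it ignores truncated ORFs, and it is not how this package implements the search.

```r
# Hypothetical helper: all ORFs on the forward strand of one linear sequence
find_orfs_forward <- function(dna,
                              starts = c("ATG", "GTG", "TTG"),
                              stops = c("TAA", "TGA", "TAG")) {
  res <- list()
  for (frame in 1:3) {
    n.codons <- (nchar(dna) - frame + 1) %/% 3
    if (n.codons < 2) next
    # Positions of the first base of every codon in this reading frame
    pos <- frame + 3 * (seq_len(n.codons) - 1)
    codons <- substring(dna, pos, pos + 2)
    open.starts <- numeric(0)   # start-codons seen since the previous stop
    for (i in seq_len(n.codons)) {
      if (codons[i] %in% stops) {
        # Every open start-codon forms an ORF with this stop (nested ORFs)
        if (length(open.starts) > 0) {
          res[[length(res) + 1]] <- data.frame(Start = open.starts,
                                               End = pos[i] + 2)
        }
        open.starts <- numeric(0)
      } else if (codons[i] %in% starts) {
        open.starts <- c(open.starts, pos[i])
      }
    }
  }
  do.call(rbind, res)   # NULL if no ORF was found
}

# Made-up toy sequence with two nested ORFs ending at the same stop-codon
orfs <- find_orfs_forward("CCATGAAATTGCCCTAAGG")
print(orfs)
```

Passing stops = c("TAA", "TAG") to this sketch mimics translation table 4, where TGA is not a stop-codon.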

The argument genome is a fasta object, i.e. a table with columns ‘Header’ and ‘Sequence’, and will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (characters before first space) of every ‘Header’ is unique, since this will be used to identify these genome sequences in the output.

By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the orf.table returned by this function this is marked in the ‘Attributes’ column. The texts "Truncated=10" or "Truncated=01" indicate truncation at the beginning or end of the genomic sequence, respectively. If the supplied genome is a completed genome, with circular chromosome/plasmids, set the flag circular = TRUE and no truncated ORFs will be listed. In cases where an ORF runs across the origin of a circular genome sequence, the stop coordinate will be larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where a ‘Start’ cannot be larger than the corresponding ‘End’.

An alternative translation table may be specified, and as of now the only alternative implemented is table 4. This means codon TGA is no longer a stop, but codes for Tryptophan. This coding is used by some bacteria (e.g. under the orders Entomoplasmatales and Mycoplasmatales).

Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function will return all such nested ORFs, i.e. the same stop position may appear multiple times. If you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function with lorfs.

Value

This function returns an orf.table, which is simply a tibble with columns adhering to the GFF3 format specifications (a gff.table), see readGFF. If you want to retrieve the actual ORF sequences, use gff2fasta.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readGFF, gff2fasta, lorfs.

Examples

# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)
# Pipeline for finding LORFs of minimum length 100 amino acids
# and collecting their sequences from the genome
findOrfs(genome) %>%
  lorfs() %>%
  filter(orfLength(., aa = TRUE) > 100) %>%
  gff2fasta(genome) -> lorf.tbl

Finding rRNA genes

Description

Finding rRNA genes in genomic DNA using the barrnap software.

Usage

findrRNA(genome, barrnap.exe = "barrnap", bacteria = TRUE, cpu = 1)

Arguments

genome

A table with columns Header and Sequence, containing the genome sequence(s).

barrnap.exe

Command to run the external software barrnap on the system (text).

bacteria

Logical, the genome is either a bacterium (default) or an archaeon.

cpu

Number of CPUs to use, default is 1.

Details

The external software barrnap is used to scan through a prokaryotic genome to detect the rRNA genes (5S, 16S, 23S). The text in barrnap.exe must contain the exact command to invoke barrnap on the system.

Value

A GFF-table (see readGFF for details) with one row for each detected rRNA sequence.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readGFF, gff2fasta.

Examples

## Not run: 
# This example requires the external barrnap software
# Using a genome file in this package.
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Searching for rRNA sequences, and inspecting
genome <- readFasta(genome.file)
gff.tbl <- findrRNA(genome)
print(gff.tbl)
# Retrieving the sequences
rRNA <- gff2fasta(gff.tbl, genome)
## End(Not run)

Retrieving annotated sequences

Description

Retrieving from a genome the sequences specified in a gff.table.

Usage

gff2fasta(gff.table, genome)

Arguments

gff.table

A gff.table (data.frame or tibble) with genomic features information.

genome

A fasta object (tibble) with the genome sequence(s).

Details

Each row in gff.table (see readGFF) describes a genomic feature in the genome, which is a tibble with columns ‘Header’ and ‘Sequence’. The information in the columns Seqid, Start, End and Strand is used to retrieve the sequences from the ‘Sequence’ column of genome. Every Seqid in the gff.table must match the first token in one of the ‘Header’ texts, in order to retrieve from the correct ‘Sequence’.
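How Seqid, Start, End and Strand can be combined to extract a sequence may be sketched in base R as follows. This is an illustration with made-up toy data, not the package's code: rev_comp_simple() is a hypothetical helper that only handles A, C, G and T, and features running across the origin of a circular sequence are not considered.

```r
# Toy genome and feature table (made up for illustration)
genome <- data.frame(Header = c("contig1 some description", "contig2"),
                     Sequence = c("CCATGAAATTGCCCTAAGG", "TTAGGGCAT"),
                     stringsAsFactors = FALSE)
gff <- data.frame(Seqid = c("contig1", "contig2"),
                  Start = c(3, 1),
                  End = c(17, 9),
                  Strand = c("+", "-"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: reverse-complement, plain A, C, G, T only
rev_comp_simple <- function(x) {
  vapply(strsplit(chartr("ACGT", "TGCA", x), ""),
         function(z) paste(rev(z), collapse = ""),
         character(1))
}

# Match every Seqid against the first token of the Header texts
idx <- match(gff$Seqid, sub("\\s.*$", "", genome$Header))
# Cut out the sub-sequences, and reverse-complement those on the negative strand
seqs <- substring(genome$Sequence[idx], gff$Start, gff$End)
neg <- gff$Strand == "-"
seqs[neg] <- rev_comp_simple(seqs[neg])
print(seqs)
```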

Value

A fasta object with one row for each row in gff.table. The Header for each sequence is a summary of the information in the corresponding row of gff.table.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readGFF, findOrfs.

Examples

# Using two files in this package
gff.file <- file.path(path.package("microseq"),"extdata","small.gff")
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading the genome first
genome <- readFasta(genome.file)
# Retrieving sequences
gff.table <- readGFF(gff.file)
fa.tbl <- gff2fasta(gff.table, genome)
# Alternative, using piping
readGFF(gff.file) %>% gff2fasta(genome) -> fa.tbl

Extended gregexpr with substring retrieval

Description

An extension of the function base::gregexpr enabling retrieval of the matching substrings.

Usage

gregexpr(
  pattern,
  text,
  ignore.case = FALSE,
  perl = FALSE,
  fixed = FALSE,
  useBytes = FALSE,
  extract = FALSE
)

Arguments

pattern

Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are not allowed.

text

A character vector where matches are sought, or an object which can be coerced by as.character to a character vector.

ignore.case

If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.

perl

Logical. Should perl-compatible regexps be used? Has priority over extended.

fixed

Logical. If TRUE, ‘pattern’ is a string to be matched as is. Overrides all conflicting arguments.

useBytes

Logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See grep for details.

extract

Logical indicating if matching substrings should be extracted and returned.

Details

Extended version of base::gregexpr that enables the return of the substrings matching the pattern. The last argument ‘extract’ is the only difference to base::gregexpr. The default behaviour is identical to base::gregexpr, but setting extract = TRUE means the matching substrings are returned.
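For comparison, the substring retrieval can also be done with base R alone, by passing the result of base::gregexpr to regmatches. This sketch uses only base functions and is not necessarily how extract = TRUE is implemented in this package.

```r
sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG")

# Match positions, as returned by the base version
hits <- base::gregexpr("ATG", sequences)

# The matching substrings, one list element per input string
substrings <- regmatches(sequences, hits)
print(substrings)
```

Calling base::gregexpr with its namespace prefix avoids ambiguity when this package, which defines its own gregexpr, is attached.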

Value

It will either return what base::gregexpr would (extract = FALSE) or a ‘list’ of substrings matching the pattern (extract = TRUE). There is one ‘list’ element for each string in ‘text’, and each list element contains a character vector of all matching substrings in the corresponding entry of ‘text’.

Author(s)

Lars Snipen and Kristian Liland.

See Also

grep

Examples

sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG")
gregexpr("ATG", sequences, extract = TRUE)

Ambiguity symbol conversion

Description

Converting DNA ambiguity symbols to regular expressions, and vice versa.

Usage

iupac2regex(sequence)
regex2iupac(sequence)

Arguments

sequence

Character vector containing DNA sequences.

Details

The DNA alphabet may contain ambiguity symbols, e.g. a W means either A or T. When using a regular expression search, these letters must be replaced by the proper regular expression, e.g. W is replaced by [AT] in the string. The iupac2regex makes this translation, while regex2iupac converts the other way again (replace [AT] with W).
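The translation from ambiguity symbols to regular expressions can be sketched with a lookup table in base R. The table below follows the standard IUPAC nucleotide codes; iupac.map and iupac_to_regex() are hypothetical names used for illustration, and the sketch is not the package's implementation.

```r
# IUPAC ambiguity symbols and the bases they represent
iupac.map <- c(W = "[AT]", S = "[CG]", M = "[AC]", K = "[GT]",
               R = "[AG]", Y = "[CT]", B = "[CGT]", D = "[AGT]",
               H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: replace every ambiguity symbol by a character class.
# The replacements contain only A, C, G, T, so the order does not matter.
iupac_to_regex <- function(sequence) {
  for (symbol in names(iupac.map)) {
    sequence <- gsub(symbol, iupac.map[[symbol]], sequence, fixed = TRUE)
  }
  sequence
}

pattern <- iupac_to_regex("ACWGT")
print(pattern)
# The result can be used directly in a regular expression search
found <- grepl(pattern, c("TTACTGTAA", "TTACGGTAA"))
print(found)
```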

Value

A string where the ambiguity symbol has been replaced by a regular expression (iupac2regex) or a regular expression has been replaced by an ambiguity symbol (regex2iupac).

Author(s)

Lars Snipen.

Examples

iupac2regex("ACWGT")
regex2iupac("AC[AG]GT")

Longest ORF

Description

Filtering an orf.table with ORF information to keep only the LORFs.

Usage

lorfs(orf.tbl)

Arguments

orf.tbl

A tibble with the nine columns of the GFF-format (see findOrfs).

Details

For every stop-codon there are usually multiple possible start-codons in the same reading frame (nested ORFs). The LORF (Longest ORF) is defined as the longest of these nested ORFs, i.e. the ORF starting at the most upstream start-codon matching the stop-codon.
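The LORF definition amounts to grouping ORFs by their stop-codon and keeping the longest in each group. A base R sketch with made-up coordinates is shown here. It assumes that the stop-codon sits at End on the positive strand and at Start on the negative strand, and it is an illustration, not the package's implementation.

```r
# Toy ORF table (made up): two nested ORFs on each strand
orfs <- data.frame(Seqid = "contig1",
                   Start = c(3, 9, 40, 40),
                   End = c(17, 17, 60, 75),
                   Strand = c("+", "+", "-", "-"),
                   stringsAsFactors = FALSE)

# Coordinate of the stop-codon end of each ORF
stop.coord <- ifelse(orfs$Strand == "+", orfs$End, orfs$Start)
# Nested ORFs share sequence, strand and stop coordinate
key <- paste(orfs$Seqid, stop.coord, orfs$Strand, sep = ";")
# Keep the longest ORF within each group of nested ORFs
len <- orfs$End - orfs$Start + 1
lorf.sketch <- orfs[len == ave(len, key, FUN = max), ]
print(lorf.sketch)
```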

Value

A tibble with a subset of the rows of the argument orf.tbl. After this filtering the Type variable in orf.tbl is changed to "LORF". If you want to retrieve the LORF sequences, use gff2fasta.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readGFF, findOrfs, gff2fasta.

Examples

# See the example in the Help-file for findOrfs.

Basic Biological Sequence Analysis

Description

A collection of functions for basic analysis of microbial sequence data.

Usage

microseq()

Details

Package: microseq
Type: Package
Version: 2.1.7
Date: 2025-09-17
License: GPL-2

Author(s)

Lars Snipen, Kristian Hovde Liland
Maintainer: Lars Snipen <lars.snipen@nmbu.no>


Convert alignment to matrix

Description

Converts a FASTA formatted multiple alignment to a matrix.

Usage

msa2mat(msa)

Arguments

msa

A fasta object with a multiple alignment, see msalign.

Details

This function converts the fasta object msa, containing a multiple alignment, to a matrix. This means each position in the alignment is a column in the matrix, and the content of the ‘Header’ column of msa is used as rownames of the matrix.

Such a matrix is useful for conversion to a DNAbin object that is used by the ape package for re-constructing phylogenetic trees.
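The conversion itself is simple enough to sketch in base R. The toy alignment below is made up, and the code is an illustration of the idea rather than the package's implementation.

```r
# Toy alignment (made up): all sequences have the same length
msa <- data.frame(Header = c("seqA", "seqB", "seqC"),
                  Sequence = c("ATG-CA", "ATGGCA", "A-GGCA"),
                  stringsAsFactors = FALSE)

# One row per sequence, one column per alignment position
msa.mat <- do.call(rbind, strsplit(msa$Sequence, ""))
rownames(msa.mat) <- msa$Header
print(msa.mat)
```

A matrix like this makes column-wise summaries straightforward, e.g. colMeans(msa.mat == "-") gives the fraction of gaps in every alignment column.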

Value

A matrix where each row is a vector of aligned bases/amino acids.

Author(s)

Lars Snipen.

See Also

msalign, readFasta.

Examples

msa.file <- file.path(path.package("microseq"),"extdata", "small.msa")
msa <- readFasta(msa.file)
msa.mat <- msa2mat(msa)  # to use with ape::as.DNAbin(msa.mat)

Trimming multiple sequence alignments

Description

Trimming a multiple sequence alignment by discarding columns with too many gaps.

Usage

msaTrim(msa, gap.end = 0.5, gap.mid = 0.9)

Arguments

msa

A fasta object containing a multiple alignment.

gap.end

Fraction of gaps tolerated at the ends of the alignment (0-1).

gap.mid

Fraction of gaps tolerated inside the alignment (0-1).

Details

A multiple alignment is trimmed by removing columns with too many indels (gap-symbols). Any columns containing a fraction of gaps larger than gap.mid are discarded. For this reason, gap.mid should always be fairly close to 1.0, otherwise too many columns may be discarded, destroying the alignment.

Due to the heuristics of multiple alignment methods, both ends of the alignment tend to be uncertain and most of the trimming should be done at the ends. Starting at each end, columns are discarded as long as their fraction of gaps surpasses gap.end. Typically gap.end can be much smaller than gap.mid, but if set too low you risk that all columns are discarded!
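The two-step trimming can be sketched in base R. This is a minimal reading of the description above, not the package's code: msa_trim_sketch() is a hypothetical helper working on a plain character vector of aligned sequences, and the toy alignment is made up.

```r
# Hypothetical helper: trim a character vector of aligned sequences
msa_trim_sketch <- function(sequences, gap.end = 0.5, gap.mid = 0.9) {
  mat <- do.call(rbind, strsplit(sequences, ""))
  gap.frac <- colMeans(mat == "-")          # fraction of gaps per column
  # Step 1: from each end, discard columns as long as gaps surpass gap.end
  ok.end <- which(gap.frac <= gap.end)
  if (length(ok.end) == 0) return(rep("", length(sequences)))
  keep <- seq(min(ok.end), max(ok.end))
  # Step 2: inside, discard columns with a gap fraction larger than gap.mid
  keep <- keep[gap.frac[keep] <= gap.mid]
  apply(mat[, keep, drop = FALSE], 1, paste, collapse = "")
}

msa.seq <- c("--ATG-A-",
             "-CATG-AT",
             "--ATG-A-",
             "--ATGCA-")
trimmed <- msa_trim_sketch(msa.seq)
print(trimmed)
```

In this toy alignment the two leading columns and the last column are removed by gap.end, while the gappy column inside survives because its gap fraction (0.75) does not exceed gap.mid.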

Value

The trimmed alignment is returned as a fasta object.

Author(s)

Lars Snipen.

See Also

muscle, msalign.

Examples

msa.file <- file.path(path.package("microseq"),"extdata", "small.msa")
msa <- readFasta(msa.file)
print(str_length(msa$Sequence))
msa.trimmed <- msaTrim(msa)
print(str_length(msa.trimmed$Sequence))
msa.mat <- msa2mat(msa)  # for use with ape::as.DNAbin(msa.mat)

Multiple alignment

Description

Quickly computing a smallish multiple sequence alignment.

Usage

msalign(fsa.tbl, machine = "microseq::muscle")

Arguments

fsa.tbl

A fasta object (data.frame or tibble) with input sequences.

machine

Function that does the 'dirty work'.

Details

This function computes a multiple sequence alignment given a set of sequences in a fasta object, see readFasta for more on fasta objects.

It is merely a wrapper for the function named in machine to avoid explicit writing and reading of files. This function should only be used for small data sets, since no result files are stored. For heavier jobs, use the machine function directly.

At present, the only machine function implemented is muscle, but other third-party machines may be included later.

Note that this function will run muscle with default settings, which is fine for small data sets.

Value

Results are returned as a fasta object, i.e. a tibble with columns ‘Header’ and ‘Sequence’.

Author(s)

Lars Snipen.

See Also

muscle, msaTrim.

Examples

## Not run: 
prot.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa")
faa <- readFasta(prot.file)
msa <- msalign(faa)
## End(Not run)

Multiple alignment using MUSCLE

Description

Computing a multiple sequence alignment using the MUSCLE software.

Usage

muscle(
  in.file,
  out.file,
  muscle.exe = "muscle",
  quiet = FALSE,
  diags = FALSE,
  maxiters = 16
)

Arguments

in.file

Name of FASTA file with input sequences.

out.file

Name of file to store the result.

muscle.exe

Command to run the external software muscle on the system (text).

quiet

Logical, quiet = FALSE produces screen output during computations.

diags

Logical, diags = TRUE gives faster but less reliable alignment.

maxiters

Maximum number of iterations.

Details

The software MUSCLE (Edgar, 2004) must be installed and available on the system. The text in muscle.exe must contain the exact command to invoke muscle on the system.

By default diags = FALSE but can be set to TRUE to increase speed. This should be done only if sequences are highly similar.

By default maxiters = 16. If you have a large number of sequences (a few thousand), or they are very long, then this may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option maxiters = 2.

Value

The result is written to the file specified inout.file.

Author(s)

Lars Snipen.

References

Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res, 32, 1792-1797.

See Also

msaTrim.

Examples

## Not run: 
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa")
muscle(in.file = fa.file, out.file = "delete_me.msa")
## End(Not run)

Length of ORF

Description

Computing the lengths of all ORFs in an orf.table.

Usage

orfLength(orf.table, aa = FALSE)

Arguments

orf.table

A GFF-formatted tibble.

aa

Logical, length in amino acids instead of bases.

Details

By default, computes the length of an ORF in bases, including the stop codon. However, if aa = TRUE, then the length is in amino acids after translation. This aa-length is the base-length divided by 3, minus 1 for the stop codon, unless the ORF is truncated and lacks a stop codon.
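The arithmetic can be written out as a short base R sketch. Here orf_length_sketch() is a hypothetical helper, and whether an ORF has a stop codon is passed in explicitly as the assumed argument has.stop, instead of being read from the Attributes column as one would do with a real orf.table.

```r
# Hypothetical helper: ORF length in bases or in amino acids
orf_length_sketch <- function(start, end, has.stop = TRUE, aa = FALSE) {
  bases <- end - start + 1          # length in bases, stop codon included
  if (!aa) return(bases)
  # One codon per amino acid, minus the stop codon if present
  bases / 3 - ifelse(has.stop, 1, 0)
}

len.bases <- orf_length_sketch(3, 17)
len.aa <- orf_length_sketch(3, 17, aa = TRUE)
len.aa.trunc <- orf_length_sketch(3, 17, has.stop = FALSE, aa = TRUE)
print(c(len.bases, len.aa, len.aa.trunc))
```

An ORF from position 3 to 17 covers 15 bases, i.e. five codons, of which four are translated to amino acids when the last one is a stop codon.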

Value

A vector of lengths.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

findOrfs.

Examples

# See the example in the Help-file for findOrfs.

Signature for each ORF

Description

Creates a signature text for ORFs in an orf.table.

Usage

orfSignature(orf.table, full = TRUE)

Arguments

orf.table

A tibble with ORF information.

full

Logical indicating type of signature.

Details

A signature is a text that uniquely identifies each ORF in an orf.table, which is a GFF-table with columns Seqid, Start, End and Strand.

The full signature (full = TRUE) contains the Seqid, Start, End and Strand information for each ORF, separated by semicolon ";". This text is always unique to each ORF. If full = FALSE the Signature will not contain the starting position information for each ORF. This means all nested ORFs ending at the same stop-codon will then get identical Signatures. This is useful for identifying which ORFs are nested within the same LORF.

Note that the signature you get with full = FALSE contains Seqid, then End if on the positive Strand, Start otherwise, and then the Strand.
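Both signature types can be built with paste in base R. The sketch below uses a made-up table and follows the description above; it is an illustration, and the exact text produced by the package may differ in detail.

```r
# Toy ORF table (made up)
orf.tbl <- data.frame(Seqid = "contig1",
                      Start = c(3, 9, 40, 40),
                      End = c(17, 17, 60, 75),
                      Strand = c("+", "+", "-", "-"),
                      stringsAsFactors = FALSE)

# Full signature: Seqid, Start, End and Strand
signature.full <- with(orf.tbl, paste(Seqid, Start, End, Strand, sep = ";"))

# Reduced signature: Seqid, End on the positive strand or Start otherwise, Strand
signature.reduced <- with(orf.tbl,
                          paste(Seqid, ifelse(Strand == "+", End, Start), Strand,
                                sep = ";"))
print(signature.full)
print(signature.reduced)
# Number of nested ORFs sharing each stop-codon
print(table(signature.reduced))
```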

Value

A text vector with the Signature for each ORF.

Author(s)

Lars Snipen.

See Also

findOrfs.

Examples

# Using a genome file in this package
genome.file <- file.path(path.package("microseq"),"extdata","small.fna")
# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)
# Compute signatures
signature.full <- orfSignature(orf.tbl)
signature.reduced <- orfSignature(orf.tbl, full = FALSE)

Read and write FASTA files

Description

Reads and writes biological sequences (DNA, RNA, protein) in the FASTA format.

Usage

readFasta(in.file)
writeFasta(fdta, out.file, width = 0)

Arguments

in.file

url/directory/name of (gzipped) FASTA file to read.

fdta

A data.frame or tibble with sequence data, see ‘Details’ below.

out.file

Name of (gzipped) FASTA file to create.

width

Number of characters per line, or 0 for no line breaks.

Details

These functions handle input/output of sequences in the commonly used FASTA format. For every sequence it is presumed there is one Header-line starting with a ‘>’. If filenames (in.file or out.file) have the extension .gz they will automatically be compressed/uncompressed. NOTE: This requires the R.utils R package.

The sequences are stored in a tibble, opening up all the possibilities in R for fast and easy manipulations. The content of the file is stored as two columns, ‘Header’ and ‘Sequence’. If other columns are added, these will be ignored by writeFasta.

The default width = 0 in writeFasta results in no line breaks in the sequences (one sequence per line).
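To make the two-column layout concrete, here is a base R sketch of reading and writing an uncompressed FASTA file. The helpers read_fasta_sketch() and write_fasta_sketch() are hypothetical and much simpler than the package functions: they return a data.frame instead of a tibble, do not handle .gz files or line width, and assume every Header-line is followed by at least one sequence line.

```r
# Hypothetical helper: read a FASTA file into Header and Sequence columns
read_fasta_sketch <- function(in.file) {
  lines <- readLines(in.file)
  lines <- lines[nzchar(lines)]               # drop empty lines
  is.header <- startsWith(lines, ">")
  entry <- cumsum(is.header)                  # entry number of every line
  # Paste together the sequence lines belonging to the same entry
  seqs <- tapply(lines[!is.header], entry[!is.header], paste, collapse = "")
  data.frame(Header = sub("^>", "", lines[is.header]),
             Sequence = as.character(seqs),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: write one Header-line and one sequence line per entry
write_fasta_sketch <- function(fdta, out.file) {
  writeLines(paste0(">", fdta$Header, "\n", fdta$Sequence), out.file)
}

# Round trip through temporary files, starting from a multi-line FASTA file
in.file <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 first", "ATG", "CCC", ">seq2", "TTGA"), in.file)
fdta <- read_fasta_sketch(in.file)
out.file <- tempfile(fileext = ".fasta")
write_fasta_sketch(fdta, out.file)
fdta2 <- read_fasta_sketch(out.file)
print(fdta)
```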

Value

readFasta returns a tibble with the contents of the (gzipped) FASTA file stored in two columns of text. The first, named ‘Header’, contains the headerlines and the second, named ‘Sequence’, contains the sequences.

writeFasta produces a (gzipped) FASTA file.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readFastq.

Examples

## Not run: 
# We need a FASTA-file to read, here is one example file:
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
# Read and write
fdta <- readFasta(fa.file)
ok <- writeFasta(fdta[4:5,], out.file = "delete_me.fasta")
# Make use of dplyr to copy parts of the file to another file
readFasta(fa.file) %>% 
  filter(str_detect(Sequence, "TGA$")) %>% 
  writeFasta(out.file = "TGAstop.fasta", width = 80) -> ok
## End(Not run)

Read and write FASTQ files

Description

Reads and writes files in the FASTQ format.

Usage

readFastq(in.file)
writeFastq(fdta, out.file)

Arguments

in.file

url/directory/name of (gzipped) FASTQ file to read.

fdta

FASTQ object to write.

out.file

url/directory/name of (gzipped) FASTQ file to write.

Details

These functions handle input/output of sequences in the commonly used FASTQ format, typically used for storing DNA sequences (reads) after sequencing. If filenames (in.file or out.file) have the extension .gz they will automatically be compressed/uncompressed. NOTE: This requires the R.utils package.

The sequences are stored in a tibble, opening up all the possibilities in R for fast and easy manipulations. The content of the file is stored as three columns, ‘Header’, ‘Sequence’ and ‘Quality’. If other columns are added, these will be ignored by writeFastq.
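The three-column layout corresponds directly to the four lines of each FASTQ entry: header, sequence, a separator line starting with "+", and quality. A base R sketch for uncompressed, single-line entries follows. read_fastq_sketch() is a hypothetical helper, and stripping the leading "@" from the header is a choice made in this sketch, not a statement about what the package does.

```r
# Hypothetical helper: read a FASTQ file with exactly four lines per entry
read_fastq_sketch <- function(in.file) {
  lines <- readLines(in.file)
  first <- seq(1, length(lines), by = 4)     # first line of every entry
  data.frame(Header = sub("^@", "", lines[first]),
             Sequence = lines[first + 1],
             Quality = lines[first + 3],     # line 3 of each entry is the "+" separator
             stringsAsFactors = FALSE)
}

# Made-up toy file with two reads
fq.file <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII",
             "@read2", "GGCA", "+", "FFFA"), fq.file)
fq <- read_fastq_sketch(fq.file)
print(fq)
```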

Value

readFastq returns a tibble with the contents of the (gzipped) FASTQ file stored in three columns of text. The first, named ‘Header’, contains the headerlines, the second, named ‘Sequence’, contains the sequences and the third, named ‘Quality’, contains the base quality scores.

writeFastq produces a (gzipped) FASTQ file.

Note

These functions will only handle files where each entry spans one single line, i.e. not the(uncommon) multiline FASTQ format.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

readFasta.

Examples

## Not run: 
# We need a FASTQ-file to read, here is one example file:
fq.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.fastq.gz")
# Read and write
fdta <- readFastq(fq.file)
ok <- writeFastq(fdta[1:3,], out.file = "delete_me.fq")
# Make use of dplyr to copy parts of the file to another file
readFastq(fq.file) %>% 
  mutate(Length = str_length(Sequence)) %>% 
  filter(Length > 200) %>% 
  writeFasta(out.file = "long_reads.fa") # writing to FASTA file
## End(Not run)

Reading and writing GFF-tables

Description

Reading or writing a GFF-table from/to file.

Usage

readGFF(in.file)
writeGFF(gff.table, out.file)

Arguments

in.file

Name of file with a GFF-table.

gff.table

A table (data.frame or tibble) with genomic features information.

out.file

Name of file.

Details

A GFF-table is simply a tibble with columns adhering to the format specified by the GFF3 format, see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for details. There is one row for each feature.

The following columns should always be in a full gff.table of the GFF3 format: Seqid, Source, Type, Start, End, Score, Strand, Phase and Attributes.

Missing values are described by "." in the GFF3 format. This is also done here, except for the numerical columns Start, End, Score and Phase. Here NA is used, but this is replaced by "." when writing to file.

The readGFF function will also read files where sequences in FASTA format are added after the GFF-table. This file section must always start with the line ##FASTA. This fasta object is added to the GFF-table as an attribute (use attr(gff.tbl, "FASTA") to retrieve it).
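The handling of missing values can be illustrated with a base R sketch that parses a few made-up GFF3 lines. The nine column names follow the GFF3 specification with the capitalisation used in this manual; the code is an illustration only, it does not handle a ##FASTA section, and it is not the package's implementation.

```r
# Made-up GFF3 content: one directive line and two feature lines
gff.lines <- c(
  "##gff-version 3",
  "contig1\tsketch\tORF\t3\t17\t.\t+\t0\tTruncated=00",
  "contig1\tsketch\tORF\t40\t75\t12.5\t-\t.\tTruncated=00"
)
gff.cols <- c("Seqid", "Source", "Type", "Start", "End",
              "Score", "Strand", "Phase", "Attributes")

# Read everything as text first, skipping lines starting with '#'
gff.tbl <- read.delim(text = gff.lines, header = FALSE, sep = "\t",
                      quote = "", comment.char = "#",
                      col.names = gff.cols, colClasses = "character")

# In the numerical columns the missing value "." becomes NA
for (cn in c("Start", "End", "Score", "Phase")) {
  x <- gff.tbl[[cn]]
  x[x == "."] <- NA
  gff.tbl[[cn]] <- as.numeric(x)
}
str(gff.tbl)
```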

Value

readGFF returns a gff.table with the columns described above.

writeGFF writes the supplied gff.table to a text-file.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

findOrfs, lorfs.

Examples

# Using a GFF file in this package
gff.file <- file.path(path.package("microseq"),"extdata","small.gff")
# Reading gff-file
gff.tbl <- readGFF(gff.file)

Reverse-complementation of DNA

Description

The standard reverse-complement of nucleotide sequences.

Usage

reverseComplement(nuc.sequences, reverse = TRUE)

Arguments

nuc.sequences

Character vector containing the nucleotide sequences.

reverse

Logical indicating if complement should be reversed.

Details

With ‘⁠reverse = FALSE⁠’ the DNA sequence is only complemented, not reversed.

This function will handle the IUPAC ambiguity symbols, i.e. ‘R’ is reverse-complemented to ‘Y’ etc.
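A base R sketch of reverse-complementation with ambiguity symbols is given below. It relies on chartr for the complement, using the standard IUPAC pairs (R/Y, K/M, B/V, D/H, while S, W and N are their own complements). The helper reverse_complement_sketch() is hypothetical, does not handle the RNA base U, and is not the package's implementation.

```r
# Hypothetical helper: (reverse-)complement of DNA with IUPAC symbols
reverse_complement_sketch <- function(nuc.sequences, reverse = TRUE) {
  # Complement every symbol; S, W and N are left unchanged
  comp <- chartr("ACGTRYKMBDHVacgtrykmbdhv",
                 "TGCAYRMKVHDBtgcayrmkvhdb",
                 nuc.sequences)
  if (!reverse) return(comp)
  # Reverse every complemented sequence
  vapply(strsplit(comp, ""),
         function(z) paste(rev(z), collapse = ""),
         character(1))
}

rc <- reverse_complement_sketch(c("ATGCR", "AAWN"))
comp.only <- reverse_complement_sketch("ATGCR", reverse = FALSE)
print(rc)
print(comp.only)
```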

Value

A character vector of reverse-complemented sequences.

Author(s)

Lars Snipen and Kristian Hovde Liland.

See Also

iupac2regex, regex2iupac.

Examples

fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
fa <- readFasta(fa.file)
reverseComplement(fa$Sequence)
# Or, make use of dplyr to manipulate tables
readFasta(fa.file) %>%
  mutate(RevComp = reverseComplement(Sequence)) -> fa.tbl

Translation according to the standard genetic code

Description

The translation from DNA(RNA) to amino acid sequence according to the standard genetic code.

Usage

translate(nuc.sequences, M.start = TRUE, no.stop = TRUE, trans.tab = 11)

Arguments

nuc.sequences

Character vector containing the nucleotide sequences.

M.start

A logical indicating if the amino acid sequence should start with M regardless of start codon.

no.stop

A logical indicating if terminal stops (*) should be eliminated from the translated sequence

trans.tab

Translation table, either 11 or 4

Details

Codons are by default translated according to translation table 11, i.e. the possible start codons are ATG, GTG or TTG and stop codons are TAA, TGA and TAG. The only alternative implemented here is translation table 4, which is used by some bacteria (e.g. Mycoplasma, Mesoplasma). If trans.tab is 4, the stop codon TGA is translated to W (Tryptophan).
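The effect of the arguments can be illustrated with a base R sketch built on a codon lookup table. The amino acid string is the usual 64-codon listing with bases in the order T, C, A, G. The helper translate_sketch() is hypothetical and handles a single sequence: it forces the first amino acid to M when M.start = TRUE, removes one terminal stop when no.stop = TRUE, and writes X for codons it does not recognise, all of which are choices made for this sketch and not claims about the package.

```r
# Codon table: bases in the order T, C, A, G, third base varying fastest
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codons <- paste0(grid$b1, grid$b2, grid$b3)
aa.string <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
genetic.code <- setNames(strsplit(aa.string, "")[[1]], codons)

# Hypothetical helper: translate one DNA sequence
translate_sketch <- function(dna, M.start = TRUE, no.stop = TRUE, trans.tab = 11) {
  code <- genetic.code
  if (trans.tab == 4) code["TGA"] <- "W"     # TGA codes for Tryptophan in table 4
  n <- nchar(dna) %/% 3
  pos <- 3 * seq_len(n) - 2
  aa <- unname(code[substring(toupper(dna), pos, pos + 2)])
  aa[is.na(aa)] <- "X"                       # unrecognised codon
  if (M.start && n > 0) aa[1] <- "M"
  protein <- paste(aa, collapse = "")
  if (no.stop) protein <- sub("\\*$", "", protein)
  protein
}

prot.11 <- translate_sketch("GTGAAATGATAA")
prot.4 <- translate_sketch("GTGAAATGATAA", trans.tab = 4)
print(c(prot.11, prot.4))
```

With table 11 the internal TGA in this made-up sequence is shown as a stop (*), while with table 4 it is translated to W.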

Value

A character vector of translated sequences.

Author(s)

Lars Snipen and Kristian Hovde Liland.

Examples

fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn")
fa <- readFasta(fa.file)
translate(fa$Sequence)
# Or, make use of dplyr to manipulate tables
readFasta(fa.file) %>%
  mutate(Protein = translate(Sequence)) -> fa.tbl
