Goalign is a set of command line tools and an API to manipulate multiple sequence alignments. It is implemented in Go language.
evolbioinfo/goalign
Goalign is a set of command line tools to manipulate multiple alignments. It is implemented in the Go language.
Goalign aims to handle multiple alignments in Phylip, Fasta, Nexus, and Clustal formats, through several basic commands. Each command may print its result (an alignment, for example) to the standard output, so it can be piped to the standard input of the next goalign command.
Input files may be local or remote files:
- If the file name is of the form http(s)://<URL>, the file is downloaded from the given URL.
- Otherwise, the file is considered local.
Gzipped input files (.gz extension) are supported, as well as XZ files (.xz extension) and BZipped files (.bz[2] extension).
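The rules above (URL prefix for remote files, file extension for compression) can be sketched in a few lines of Go. This is only an illustration of the stated conventions, not goalign's own code: the helper names isRemote and compression are hypothetical, and example.org is a placeholder URL.

```go
package main

import (
	"fmt"
	"strings"
)

// isRemote reports whether name looks like an http(s) URL.
// Hypothetical helper, not part of the goalign API.
func isRemote(name string) bool {
	return strings.HasPrefix(name, "http://") || strings.HasPrefix(name, "https://")
}

// compression returns the compression scheme implied by the file extension,
// or "none" when the extension is not one of .gz, .xz, .bz, .bz2.
// Hypothetical helper, not part of the goalign API.
func compression(name string) string {
	switch {
	case strings.HasSuffix(name, ".gz"):
		return "gzip"
	case strings.HasSuffix(name, ".xz"):
		return "xz"
	case strings.HasSuffix(name, ".bz"), strings.HasSuffix(name, ".bz2"):
		return "bzip2"
	default:
		return "none"
	}
}

func main() {
	names := []string{"align.fasta", "align.phy.gz", "https://example.org/align.fa.xz"}
	for _, name := range names {
		fmt.Printf("%s remote=%t compression=%s\n", name, isRemote(name), compression(name))
	}
}
```

A real reader would then wrap the opened stream in the matching decompressor before parsing.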
Note: To manipulate phylogenetic trees, see also Gotree.
If you use Gotree or Goalign, please cite:
Frédéric Lemoine, Olivier Gascuel
Gotree/Goalign: toolkit and Go API to facilitate the development of phylogenetic workflows,
NAR Genomics and Bioinformatics, Volume 3, Issue 3, September 2021, lqab075, doi
You can download ready-to-run binaries for the latest release in the release section. Binaries are available for MacOS, Linux, and Windows (32 and 64 bits).
Once downloaded, you can just run the executable without any other downloads.
The Goalign Docker image is available from Docker Hub. You may use it as follows:
# Display goalign help
docker run -v $PWD:$PWD -w $PWD -i -t evolbioinfo/goalign:v0.2.6 -h
The Goalign Docker image can also be used with Singularity. You may use it as follows:
# Pull image from docker hub
singularity pull docker://evolbioinfo/goalign:v0.2.6
# Display goalign help
./goalign-v0.2.6.simg -h
Goalign is also available on bioconda. Just type:
conda install -c bioconda goalign
To build goalign, you must first download and install Go on your system (
Then you just have to type:
git clone git@github.com:evolbioinfo/goalign.git
cd goalign
make && make install
# or go get . && go build .
# or go get . && go install .
The goalign executable should be located in the current folder (or in $GOPATH/bin).
To test the executable:
./test.sh
goalign uses cobra, and therefore provides a command to generate auto-completion scripts:
goalign completion -h
You may go to the doc for more detailed documentation of the commands.
- addid: Adds a string to each sequence identifier of the input alignment
- append: Concatenates several alignments by adding new alignments as new sequences of the first alignment
- build: Builds output files (bootstrap alignments, for example)
  - seqboot: Generate bootstrap alignments
- clean: Removes gap sites/sequences
  - sites: Removes sites with gaps
  - seqs: Removes sequences with gaps
- codonalign: Aligns a given nt fasta file using a corresponding aa alignment (by codons)
- compress: Removes identical patterns/sites from alignment
- compute: Different computations (distances, etc.)
  - distances: Compute evolutionary distances for nucleotide alignment
  - entropy: Compute entropy of alignment sites
  - pssm: Compute position-specific scoring matrix
- concat: Concatenates several alignments by concatenating the sequences having the same name
- consensus: Compute a basic majority consensus of an input alignment
- dedup: Remove sequences that have the same sequence
- diff: Compare all sequences to the first one of the alignment, and count the differences
- divide: Divide an input alignment in several output files (one per alignment)
- draw: Draw alignments
  - biojs: Display an input alignment in an html file using BioJS
  - png: Display an input alignment in a png file, one sequence per line and one pixel per character
- extract: Extract several sub-alignments, potentially composed of several blocks, from an input alignment, using a coordinate file
- identical: Tell whether two alignments are identical
- mask: Replace positions by N (if nucleotides) or X (if amino acids)
- mutate: Add substitutions (~sequencing errors), or gaps, uniformly in an input alignment
  - gaps: Add gaps uniformly in an input alignment
  - snvs: Add substitutions uniformly in an input alignment
- orf: Find the longest orf in all given sequences in forward strand
- phase: Try to find reference orf(s) (aa) in input sequences, and align them in the same phase
- phasent: Try to find reference sequence (nt) in input sequences, and align them in the same phase
- random: Generate random sequences
- reformat: Reformats input alignment into several formats
  - fasta
  - nexus
  - paml
  - clustal
  - phylip
  - tnt
- rename: Rename sequences of the input alignment (using a map file, with a regexp, or just clean names)
- replace: Replace characters in sequences of input alignment using a regex
- sample: Samples sequences or subalignments
  - seqs: Randomly samples a subset of sequences from the input alignment
  - sites: Extracts a sub-alignment starting at a random position, and with a given length
  - rarefy: Down-samples input alignment, taking into account weights/counts of all sequences
- shuffle: A set of commands to shuffle an alignment
  - recomb: Recombine some sequences (copy/paste)
  - rogue: Simulate a kind of rogue taxa by shuffling some sequences
  - seqs: Shuffle sequence order in the alignment
  - sites: Shuffle "vertically" some sites of the alignments
  - swap: Swap portions of some sequences (cut/paste)
- split: Split an input alignment according to partitions defined in a partition file
- stats: Prints different characteristics of the alignment
  - alleles
  - alphabet
  - char
  - gaps
  - length
  - mutations
  - nalign
  - nseq
  - taxa
- subseq: Extract a subsequence from the alignment (coordinates on alignment reference or on a given sequence reference)
- subsites: Extract sites from the input alignment (coordinates on alignment reference or on a given sequence reference, or informative sites)
- subset: Take a subset of sequences from the input alignment
- sw: Aligns 2 sequences using Smith & Waterman algorithm
- translate: Translate input sequences/alignment (supports IUPAC code)
- transpose: Transpose input alignment
- trim: This command trims names of sequences or sequences themselves
  - name
  - seq
- unalign: Unaligns input alignment
- version: Prints the current version of goalign
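To make one of the commands above concrete, here is a minimal, standard-library-only sketch of what a "basic majority consensus" (the consensus command) could look like. It is not goalign's implementation: the function name majorityConsensus is hypothetical, all sequences are assumed to have the same length, and ties are broken in favour of the smallest byte value, which is an arbitrary choice made for this example.

```go
package main

import "fmt"

// majorityConsensus returns, for each column, the most frequent character.
// Assumptions of this sketch: all sequences have the same length, and ties
// are resolved by picking the smallest byte value.
func majorityConsensus(seqs []string) string {
	if len(seqs) == 0 {
		return ""
	}
	consensus := make([]byte, len(seqs[0]))
	for col := range consensus {
		// Count every byte value observed in this column.
		var counts [256]int
		for _, s := range seqs {
			counts[s[col]]++
		}
		// Keep the first byte value with the highest count.
		best := 0
		for c := 1; c < 256; c++ {
			if counts[c] > counts[best] {
				best = c
			}
		}
		consensus[col] = byte(best)
	}
	return string(consensus)
}

func main() {
	fmt.Println(majorityConsensus([]string{"ACGT", "ACGA", "TCGA"}))
}
```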
- Generate a random alignment and print statistics
goalign random | goalign stats
- Trim names of a random alignment and finally rename it back
goalign random > align.fa
goalign trim name -n 3 -m map -i align.fa > align_rename.fa
goalign rename -i align_rename.fa -m map -r
- Reformat a fasta alignment to phylip
goalign random | goalign reformat phylip
- Reformat a clustal alignment to fasta
goalign random --amino-acids --clustal --nb-seqs 2 | goalign reformat fasta --clustal
- Reformat a phylip alignment to fasta
goalign random -p | goalign reformat fasta -p
- Add a prefix to all sequence names of the alignment
goalign random | goalign addid -n "Dataset1_"
- Add a suffix to all sequence names of the alignment
goalign random | goalign addid -r -n "_Dataset1"
- Take a random sample (10 sequences) from an input alignment
goalign random -n 10000 | goalign sample -n 10
- Extract all sequences whose name starts with "mammal"
goalign subset -e '^mammal.*$' -i align.fasta
- Extract all sequences whose name does not match the regexp
goalign subset -r -e '^mammal.*$' -i align.fasta
- Extract a subsequence starting at position 10, with a length of 10
goalign subseq -i align.fasta -s 9 -l 10
- Compute a "logo" like consensus
goalign compute pssm -n 4 -i align.fasta
- Compute an evolutionary distance matrix (dna alignment only, 5 threads)
goalign compute distance -m k2p -i align.fasta -t 5
- Compute site entropy
goalign compute entropy -i align.fasta
- Build 100 bootstrap alignments from an input alignment, in a single tar.gz file (5 threads)
goalign random -n 500 | goalign build seqboot -S -n 100 --gz --tar -t 5 -o boot
- Build 100 bootstrap alignments from an input alignment, in 100 .gz files (5 threads)
goalign random -n 500 | goalign build seqboot -S -n 100 --gz -t 5 -o boot
- Parse a Phylip single alignment file and export it in Fasta
package main

import (
	"fmt"
	"os"

	"github.com/evolbioinfo/goalign/align"
	"github.com/evolbioinfo/goalign/io/fasta"
	"github.com/evolbioinfo/goalign/io/phylip"
)

func main() {
	var err error
	var f *os.File
	var align align.Alignment

	f, err = os.Open("f.phy")
	if err != nil {
		panic(err)
	}
	if align, err = phylip.NewParser(f).Parse(); err != nil {
		panic(err)
	} else {
		fmt.Println(fasta.WriteSequences(align))
	}
}
- Parse a Phylip multi alignments file and export it in Fasta
package main

import (
	"fmt"
	"os"

	"github.com/evolbioinfo/goalign/align"
	"github.com/evolbioinfo/goalign/io/fasta"
	"github.com/evolbioinfo/goalign/io/phylip"
)

func main() {
	var f *os.File
	var aligns chan align.Alignment
	var err error

	f, err = os.Open("f.phy")
	if err != nil {
		panic(err)
	}
	aligns = make(chan align.Alignment, 15)
	if err = phylip.NewParser(f).ParseMultiple(aligns); err != nil {
		panic(err)
	} else {
		for al := range aligns {
			fmt.Println(fasta.WriteSequences(al))
		}
	}
}
- Parse a Fasta file and export it in Nexus
package main

import (
	"fmt"
	"os"

	"github.com/evolbioinfo/goalign/align"
	"github.com/evolbioinfo/goalign/io/fasta"
	"github.com/evolbioinfo/goalign/io/nexus"
)

func main() {
	var f *os.File
	var align align.Alignment
	var err error

	f, err = os.Open("f.fasta")
	if err != nil {
		panic(err)
	}
	if align, err = fasta.NewParser(f).Parse(); err != nil {
		panic(err)
	} else {
		fmt.Println(nexus.WriteAlignment(align))
	}
}
- Parse a Fasta file and export it in Phylip
package main

import (
	"fmt"
	"os"

	"github.com/evolbioinfo/goalign/align"
	"github.com/evolbioinfo/goalign/io/fasta"
	"github.com/evolbioinfo/goalign/io/phylip"
)

func main() {
	var f *os.File
	var align align.Alignment
	var err error

	f, err = os.Open("f.fasta")
	if err != nil {
		panic(err)
	}
	if align, err = fasta.NewParser(f).Parse(); err != nil {
		panic(err)
	} else {
		fmt.Println(phylip.WriteAlignment(align, false))
	}
}
- Iterating over alignment sequences
align.IterateChar(func(name string, sequence []uint8) {
	fmt.Printf("Sequence: %s\n", name)
})
- Append identifier at the beginning of all sequence names
align.AppendSeqIdentifier("IDENT",false)
- Alignment statistics
var n int = align.NbSequences()
var l int = align.Length()
- Extract a sub alignment
var subalign align.Alignment
var err error
subalign, err = align.SubAlign(0, 100)
- Sort sequences by alphanumerical order
align.Sort()
- Copy/Clone the alignment
var clonealign align.Alignment
var err error
clonealign, err = align.Clone()
- Get the sequence having a specific name
var sequence string
var err error
sequence, err = align.GetSequence("nameofsequence")
- Build a bootstrap replicate
var bootstrap align.Alignment
bootstrap = align.BuildBootstrap()
- Randomly shuffle sequence order of alignment
align.ShuffleSequences()
- Compute evolutionary distance matrix (5 threads)
import "github.com/evolbioinfo/goalign/distance"
// ...
var model distance.DistModel
var distMatrix [][]float64
model = distance.Model("k2p", false)
distMatrix = distance.DistMatrix(align, nil, model, 5)
- Other functions
Other functions are described in the godoc.