CGG toolkit: Software components for computational genomics

Fig 2

Representation of a typical workflow using the reported tools.

Pre-processing may start with a genome collection (database symbol, upper left), optionally mixed with a curated sequence resource such as UniProt (database symbol in green, upper left). To cross-index entries at the sequence level or simply identify them, MagicMatch can be used as an option. The sequence collection can be submitted to GeneCAST to mask compositional bias and prepare the query for sensitive searches (disk symbol with Q, lower left). For genome-scale analysis, species codes can be generated for the reference (target) set with cogent_utils, to create a uniformly named sequence set (disk symbol with R, lower middle, optionally mixed with UniProt or any other annotated collection). Sequence comparisons are executed with BLAST or other options with query Q vs. reference R (or in the case of all-vs-all, disk symbol in green-blue gradient, upper middle). The vertical gray line divides this pre-processing phase from the next phase, signifying the computationally intensive step or long wall-time. Two (non-mutually exclusive) output alternatives are shown: the pairs-list (in pink, upper right) or full alignments (also in pink, lower right). The former can be treated with clustt_utils that launches Tribe-MCL and generates protein families or can be used as input for network visualization with BioLayout or other similar software, while the latter can be further processed for GeneRAGE or DifFuse for multi-domain or gene-fusion detection, respectively, as well as for inspection and parsing for multiple alignments.

doi:https://doi.org/10.1371/journal.pcbi.1011498.g002
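
As an illustration of the pairs-list output described in the caption, the following minimal Python sketch reduces BLAST tabular results to a list of weighted sequence pairs, the kind of input that graph clustering or network visualization tools typically consume. It is not part of the CGG toolkit and does not reproduce how its scripts work. The function name `blast_tab_to_pairs`, the E-value cutoff, and the `-log10(E-value)` weighting capped at 200 are all assumptions made for this example. The only format assumption is the standard 12-column BLAST tabular layout, with query ID, subject ID and E-value in columns 1, 2 and 11.

```python
import csv
import io
import math


def blast_tab_to_pairs(handle, max_evalue=1e-5):
    """Collapse BLAST tabular hits into unique, weighted sequence pairs.

    Hypothetical helper for illustration only (not a CGG toolkit function).
    Returns a sorted list of (id_a, id_b, weight) tuples.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < 11:
            continue
        query, target, evalue = row[0], row[1], float(row[10])
        # Drop self-hits and hits weaker than the (assumed) cutoff.
        if query == target or evalue > max_evalue:
            continue
        # Assumed weighting: -log10(E-value), capped at 200 (also used for E = 0).
        if evalue == 0.0:
            weight = 200.0
        else:
            weight = round(min(200.0, -math.log10(evalue)), 2)
        # Treat A-B and B-A as the same pair and keep the stronger hit.
        # Pairs whose weight is not positive are dropped.
        key = tuple(sorted((query, target)))
        if weight > best.get(key, 0.0):
            best[key] = weight
    return [(a, b, w) for (a, b), w in sorted(best.items())]


# Small made-up example: two reciprocal hits, one self-hit, one weak hit.
sample = "\n".join([
    "A\tB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "B\tA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180",
    "A\tA\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t300",
    "A\tC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
])
pairs = blast_tab_to_pairs(io.StringIO(sample))
for id_a, id_b, weight in pairs:
    print(f"{id_a}\t{id_b}\t{weight}")
```

Keeping only the strongest hit per unordered pair is one simple way to make the list symmetric; other choices, such as averaging the two directions, would be equally defensible.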
