
Stockholm format

From Wikipedia, the free encyclopedia

File format for biological sequence alignments

Stockholm format
Filename extensions	`.sto`,`.stk`
Internet media type	`text/x-stockholm-alignment`
Developed by	Erik Sonnhammer
Type of format	Bioinformatics
Open format?	yes
Website	sonnhammer.sbc.su.se/Stockholm.html

Stockholm format is a multiple sequence alignment format used by Pfam, Rfam and Dfam to disseminate protein, RNA and DNA sequence alignments.^[1]^[2]^[3] The alignment editors Ralee,^[4] Belvu and Jalview support Stockholm format, as do the probabilistic database search tools Infernal and HMMER and the phylogenetic analysis tool Xrate. Stockholm format files often have the filename extension .sto or .stk.^[5]

Syntax


A well-formed Stockholm file always contains a header which states the format and version identifier, currently '# STOCKHOLM 1.0'. The header is followed by multiple lines, a mix of markup (starting with #) and sequences. Finally, the "//" line indicates the end of the alignment.

An example without markup looks like:

# STOCKHOLM 1.0
#=GF ID   EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//

Sequences are written one per line. The sequence name is written first, followed by any amount of whitespace and then the sequence. Sequence names are typically in the form "name/start-end" or just "name". Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
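The rules above (header line, one sequence per line, name then whitespace then sequence, "//" terminator) can be sketched as a small Python reader. This is a minimal illustration rather than a reference parser: the function name `parse_stockholm_sequences` is hypothetical, markup lines are simply skipped, and concatenating lines that repeat a sequence name is an assumption that goes beyond the text above.

```python
def parse_stockholm_sequences(text):
    """Return {seqname: aligned_sequence} for one Stockholm alignment."""
    non_blank = [line for line in text.splitlines() if line.strip()]
    # The first non-blank line must be the format/version header.
    if not non_blank or non_blank[0].strip() != "# STOCKHOLM 1.0":
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    sequences = {}
    for line in non_blank[1:]:
        if line.strip() == "//":       # end of the alignment
            break
        if line.startswith("#"):       # markup line, ignored in this sketch
            continue
        name, aligned = line.split(None, 1)
        # Assumption (not stated above): a repeated name continues that sequence.
        sequences[name] = sequences.get(name, "") + aligned.strip()
    else:
        # The loop ended without seeing the terminator line.
        raise ValueError("missing '//' terminator")
    return sequences


example = """# STOCKHOLM 1.0
#=GF ID   EXAMPLE
seq1/1-6   ACGU..AC
seq2/1-7   AC-UGGAC
//
"""
sequences = parse_stockholm_sequences(example)
```

Here `sequences` maps each name to its gapped string, so the alignment width is the length of any value.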

Mark-up lines start with #. The "parameters" are separated by whitespace, so an underscore ("_") should be used instead of a space in the 1-char-per-column markups. Mark-up types defined include:

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>
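As an illustration of the four mark-up types, the following Python sketch splits a single mark-up line into its parts. The function name `parse_markup_line` and the tuple layout it returns are hypothetical choices made for this example.

```python
def parse_markup_line(line):
    """Split one '#=Gx' line into (kind, seqname, feature, text).

    seqname is None for the per-file (GF) and per-column (GC) kinds.
    """
    kind, rest = line.split(None, 1)
    if kind in ("#=GF", "#=GC"):
        # Layout: #=GF <feature> <text>  /  #=GC <feature> <markup>
        feature, text = rest.split(None, 1)
        return kind[2:], None, feature, text.strip()
    if kind in ("#=GS", "#=GR"):
        # Layout: #=GS <seqname> <feature> <text>  /  #=GR <seqname> <feature> <markup>
        seqname, feature, text = rest.split(None, 2)
        return kind[2:], seqname, feature, text.strip()
    raise ValueError(f"not a recognised mark-up line: {line!r}")


parsed = parse_markup_line("#=GS O31698/88-139 OS Bacillus subtilis")
```

Only the free-text kinds (GF and GS) can contain spaces in their last field, which is why the split is limited to the leading fields.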

Recommended features


These feature names are used by Pfam and Rfam for specific types of annotation. (See the Pfam and the Rfam documentation under "Description of fields".)

#=GF


Pfam and Rfam may use the following tags:

Compulsory fields:
------------------
AC   Accession number:      Accession number in form PFxxxxx (Pfam) or RFxxxxx (Rfam).
ID   Identification:        One word name for family.
DE   Definition:            Short description of family.
AU   Author:                Authors of the entry.
SE   Source of seed:        The source suggesting the seed members belong to one family.
SS   Source of structure:   The source (prediction or publication) of the consensus RNA secondary structure used by Rfam.
BM   Build method:          Command line used to generate the model
SM   Search method:         Command line used to perform the search
GA   Gathering threshold:   Search threshold to build the full alignment.
TC   Trusted Cutoff:        Lowest sequence score (and domain score for Pfam) of match in the full alignment.
NC   Noise Cutoff:          Highest sequence score (and domain score for Pfam) of match not in full alignment.
TP   Type:                  Type of family -- presently Family, Domain, Motif or Repeat for Pfam.
                                           -- a tree with roots Gene, Intron or Cis-reg for Rfam.
SQ   Sequence:              Number of sequences in alignment.

Optional fields:
----------------
DC   Database Comment:      Comment about database reference.
DR   Database Reference:    Reference to external database.
RC   Reference Comment:     Comment about literature reference.
RN   Reference Number:      Reference Number.
RM   Reference Medline:     Eight digit medline UI number.
RT   Reference Title:       Reference Title.
RA   Reference Author:      Reference Author
RL   Reference Location:    Journal location.
PI   Previous identifier:   Record of all previous ID lines.
KW   Keywords:              Keywords.
CC   Comment:               Comments.
NE   Pfam accession:        Indicates a nested domain.
NL   Location:              Location of nested domains - sequence ID, start and end of insert.
WK   Wikipedia link:        Wikipedia page
CL   Clan:                  Clan accession
MB   Membership:            Used for listing Clan membership

For embedding trees:
--------------------
NH   New Hampshire          A tree in New Hampshire eXtended format.
TN   Tree ID                A unique identifier for the next tree.

Other:
------
FR   False discovery Rate:  A method used to set the bit score threshold based on the ratio of expected false positives to true positives. Floating point number between 0 and 1.
CB   Calibration method:    Command line used to calibrate the model (Rfam only, release 12.0 and later)

Notes: A tree may be stored on multiple #=GF NH lines.
If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
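The tree rules in these notes can be sketched in Python as follows. The function name `collect_trees` is hypothetical, the input is assumed to be the (feature, text) pairs of the #=GF lines in file order, and joining the pieces of a multi-line tree without a separator is an assumption of this sketch.

```python
def collect_trees(gf_lines):
    """Group '#=GF NH' fragments into trees keyed by their '#=GF TN' identifier.

    gf_lines: iterable of (feature, text) pairs in file order.
    A single tree without a TN line is stored under the key None.
    """
    trees = {}
    current_id = None
    for feature, text in gf_lines:
        if feature == "TN":
            if text in trees:
                raise ValueError(f"duplicate tree identifier: {text}")
            current_id = text
            trees[current_id] = ""
        elif feature == "NH":
            # Assumption: fragments of one tree are concatenated directly.
            trees[current_id] = trees.get(current_id, "") + text
    return trees


trees = collect_trees([
    ("ID", "EXAMPLE"),
    ("TN", "tree1"),
    ("NH", "(A:0.1,"),
    ("NH", "B:0.2);"),
    ("TN", "tree2"),
    ("NH", "(A:0.3,B:0.4);"),
])
```

Rejecting a repeated TN value mirrors the requirement that each tree identifier be unique.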

#=GS


Rfam and Pfam may use these features:

      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              Organism (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)

#=GR


      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>(){}[]AaBb.-_] --supports pseudoknot and further structure markup (see WUSS documentation)
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X]
                                       (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*]
                                       (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
      pAS       AS - Pfam predicted    [*]
      sAS       AS - from SwissProt    [*]
      IN        INtron (in or after)   [0-2]

      For RNA tertiary interactions:
      ------------------------------
      tWW       WC/WC        in trans   For basepairs: [<>AaBb...Zz]  For unpaired: [.]
      cWH       WC/Hoogsteen in cis
      cWS       WC/SugarEdge in cis
      tWS       WC/SugarEdge in trans

      notes: (1) {c,t}{W,H,S}{W,H,S} for general format.
             (2) cWW is equivalent to SS.
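The numeric SA and PP codes can be decoded with a few lines of Python. This is a sketch under stated assumptions: the function names `sa_range` and `pp_range` are hypothetical, ranges are returned as integer percentages, and the PP ranges for the digits 2 to 9 are extrapolated from the three values listed above (0, 1 and *) rather than taken from the table.

```python
def sa_range(ch):
    """Surface accessibility code -> (low %, high %), or None for 'X' (unknown)."""
    if ch == "X":
        return None
    digit = int(ch)
    return (10 * digit, 10 * digit + 10)       # 0 -> 0-10, ..., 9 -> 90-100


def pp_range(ch):
    """Posterior probability code -> (low %, high %)."""
    if ch == "*":
        return (95, 100)
    digit = int(ch)
    if digit == 0:
        return (0, 5)
    # Assumption: digits 2-9 follow the same pattern as 1 (5-15).
    return (10 * digit - 5, 10 * digit + 5)


sa_codes = [sa_range(ch) for ch in "09X"]
pp_codes = [pp_range(ch) for ch in "01*"]
```

Working in whole percentages avoids floating-point rounding in the range boundaries.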

#=GC


The list of valid features includes those shown below, as well as the same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".

      Feature   Description            Details
      -------   -----------            -------
      RF        ReFerence annotation   Often the consensus RNA or protein sequence is used as a reference
                                       Any non-gap character (e.g. x's) can indicate consensus/conserved/match columns
                                       .'s or -'s indicate insert columns
                                       ~'s indicate unaligned insertions
                                       Upper and lower case can be used to discriminate strong and weakly conserved
                                       residues respectively
      MM        Model Mask             Indicates which columns in an alignment should be masked, such
                                       that the emission probabilities for match states corresponding to
                                       those columns will be the background distribution.
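The RF convention described above lends itself to a short Python sketch that separates match columns from insert columns. The function names `match_columns` and `keep_match_columns` are hypothetical, and the example strings are made up for illustration.

```python
def match_columns(rf):
    """Return 0-based indices of the columns an RF line marks as match columns.

    '.', '-' and '~' mark insert or unaligned columns; any other character
    is treated as a consensus/match column.
    """
    return [i for i, ch in enumerate(rf) if ch not in ".-~"]


def keep_match_columns(aligned, rf):
    """Project one aligned sequence onto the match columns of the RF line."""
    return "".join(aligned[i] for i in match_columns(rf))


rf_line = "xx..xX~x"
columns = match_columns(rf_line)
projected = keep_match_columns("ACguGU.A", rf_line)
```

Projecting every sequence this way yields an alignment restricted to the consensus columns.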

Notes


Do not use multiple lines with the same #=GC label.
For a single sequence, do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
"X" in SA and SS means "residue with unknown structure".
The protein SS letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
The RNA SS letters are taken from WUSS (Washington University Secondary Structure) notation. Matching nested bracket characters <>, (), [], or {} indicate a basepair. The symbols '.', ',' and ';' indicate unpaired regions. Matched upper and lower case characters from the English alphabet indicate pseudoknot interactions. The 5' nucleotide within the knot should be in uppercase and the 3' nucleotide lowercase.
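The WUSS pairing rules in the last note can be illustrated with a stack-based Python sketch. The function name `wuss_basepairs` is hypothetical, and pairing each pseudoknot letter innermost-first (one stack per letter) is an assumption of this sketch. The input string is the SS_cons line of the UPSK example in this article.

```python
def wuss_basepairs(ss):
    """Return sorted (i, j) pairs of 0-based columns that form a basepair."""
    closers = {">": "<", ")": "(", "]": "[", "}": "{"}
    openers = set(closers.values())
    stacks = {}      # one stack of open positions per bracket type or letter
    pairs = []
    for i, ch in enumerate(ss):
        if ch in openers or "A" <= ch <= "Z":
            # Opening bracket, or the 5' (uppercase) side of a pseudoknot.
            stacks.setdefault(ch, []).append(i)
        elif ch in closers or "a" <= ch <= "z":
            # Closing bracket, or the 3' (lowercase) side of a pseudoknot.
            key = closers.get(ch, ch.upper())
            if not stacks.get(key):
                raise ValueError(f"unmatched {ch!r} at column {i}")
            pairs.append((stacks[key].pop(), i))
        # Every other character marks an unpaired column.
    if any(stacks.values()):
        raise ValueError("unclosed basepair symbol")
    return sorted(pairs)


pairs = wuss_basepairs(".AAA....<<<<aaa....>>>>")
```

For this string the three `A`/`a` pairs cross the four `<`/`>` pairs, which is what makes the structure a pseudoknot.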

Recommended placements


#=GF Above the alignment
#=GC Below the alignment
#=GS Above the alignment or just below the corresponding sequence
#=GR Just below the corresponding sequence
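A writer that follows these placements might look like the Python sketch below. The function name `format_stockholm` and its argument layout are hypothetical. Of the two options allowed for #=GS lines, this sketch places them above the alignment.

```python
def format_stockholm(seqs, gf=(), gs=(), gr=(), gc=()):
    """Serialise an alignment using the recommended mark-up placements.

    seqs: list of (name, aligned_sequence)
    gf:   list of (feature, text)
    gs:   list of (name, feature, text)
    gr:   list of (name, feature, markup)
    gc:   list of (feature, markup)
    """
    out = ["# STOCKHOLM 1.0"]
    out += [f"#=GF {feature} {text}" for feature, text in gf]        # above
    out += [f"#=GS {name} {feature} {text}" for name, feature, text in gs]
    for name, aligned in seqs:
        out.append(f"{name} {aligned}")
        # Per-residue mark-up goes just below its own sequence.
        out += [f"#=GR {n} {feature} {markup}"
                for n, feature, markup in gr if n == name]
    out += [f"#=GC {feature} {markup}" for feature, markup in gc]    # below
    out.append("//")
    return "\n".join(out) + "\n"


document = format_stockholm(
    [("seq1", "ACGU"), ("seq2", "AC-U")],
    gf=[("ID", "EXAMPLE")],
    gs=[("seq1", "OS", "Example organism")],
    gr=[("seq1", "SS", "<..>")],
    gc=[("SS_cons", "<..>")],
)
```

A single space separates fields here; real files usually pad names so that the sequence columns line up, which the format also permits.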

Size limits


There are no explicit size limits on any field. However, a simple parser that uses fixed field sizes should work safely on Pfam and Rfam alignments with these limits:

Line length: 10000.
<seqname>: 255.
<feature>: 255.
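A file can be checked against these limits before it is handed to a fixed-buffer parser. The Python sketch below is one way to do that; the function name `check_size_limits`, the constant names and the message strings are all hypothetical.

```python
MAX_LINE_LENGTH = 10000
MAX_NAME_LENGTH = 255
MAX_FEATURE_LENGTH = 255


def check_size_limits(text):
    """Return (line_number, message) pairs for lines exceeding the limits."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append((number, "line too long"))
        fields = line.split()
        if not fields or fields[0] == "//":
            continue
        name = feature = ""
        if fields[0] in ("#=GF", "#=GC"):
            feature = fields[1] if len(fields) > 1 else ""
        elif fields[0] in ("#=GS", "#=GR"):
            name = fields[1] if len(fields) > 1 else ""
            feature = fields[2] if len(fields) > 2 else ""
        elif not fields[0].startswith("#"):
            name = fields[0]                  # plain sequence line
        if len(name) > MAX_NAME_LENGTH:
            problems.append((number, "<seqname> too long"))
        if len(feature) > MAX_FEATURE_LENGTH:
            problems.append((number, "<feature> too long"))
    return problems


report = check_size_limits("# STOCKHOLM 1.0\n" + "s" * 300 + " ACGU\n//\n")
```

An empty list means the file fits within all three limits.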

Examples


A simple example of an Rfam alignment (UPSK RNA) with a pseudoknot in Stockholm format is shown below:^[6]

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
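In this example the reference title is spread over three #=GF RT lines. A reader will usually want such repeated tags merged into one value, which the Python sketch below illustrates. The function name `merge_gf` is hypothetical, and joining the pieces with a single space is an assumption that suits free text such as RT or CC.

```python
def merge_gf(lines):
    """Collect '#=GF' lines into {feature: text}, joining repeated features."""
    merged = {}
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "#=GF":
            _, feature, text = parts
            merged.setdefault(feature, []).append(text.strip())
    # Assumption: continuation lines are joined with one space.
    return {feature: " ".join(pieces) for feature, pieces in merged.items()}


annotations = merge_gf([
    "#=GF ID    UPSK",
    "#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic",
    "#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA",
    "#=GF RT    polymerase.",
    "#=GF RL    J Virol 1997;71:5990-5996.",
])
```

Dictionary order follows first appearance in the file, so the merged record keeps the original tag order.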

Here is a slightly more complex example showing the Pfam CBS domain:

# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246          MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA  9998877564535242525515252536463774777
O83071/259-312          MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS  CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71            MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS   CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons            CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139           EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS   ________________*____________________
#=GR O31699/88-139 IN   ____________1____________2______0____
//
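Files that mix sequences with #=GR and #=GC lines, as this one does, can be checked mechanically against the rules in the Notes section. The Python sketch below is one possible check. The function name `check_annotations` and the message wording are hypothetical, and the sketch assumes a non-interleaved file in which each annotation string has one character per alignment column.

```python
def check_annotations(text):
    """Return a list of problems found in the #=GC and #=GR lines of a file."""
    seqs = {}             # seqname -> aligned sequence
    seen = set()          # labels already used
    annotations = []      # (label, seqname or None, markup)
    problems = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue                       # blank line, "//" or malformed line
        if fields[0] == "#=GC" and len(fields) >= 3:
            label, name, markup = f"#=GC {fields[1]}", None, fields[2]
        elif fields[0] == "#=GR" and len(fields) >= 4:
            label, name, markup = f"#=GR {fields[1]} {fields[2]}", fields[1], fields[3]
        elif fields[0].startswith("#"):
            continue                       # header or other mark-up
        else:
            seqs[fields[0]] = fields[1]    # plain sequence line
            continue
        if label in seen:
            problems.append(f"repeated label: {label}")
        seen.add(label)
        annotations.append((label, name, markup))

    width = max((len(s) for s in seqs.values()), default=0)
    for label, name, markup in annotations:
        expected = width if name is None else len(seqs.get(name, ""))
        if len(markup) != expected:
            problems.append(f"{label}: {len(markup)} characters, expected {expected}")
    return problems


issues = check_annotations(
    "# STOCKHOLM 1.0\n"
    "seq1 ACGU\n"
    "#=GC SS_cons <..>\n"
    "#=GC SS_cons <.>\n"
    "//\n"
)
```

Lengths are compared only after the whole file has been read, so a #=GR line placed above its sequence is still checked against the right string.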

References


[1] Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. (January 2009). "Rfam: updates to the RNA families database". Nucleic Acids Research. 37 (Database issue): D136–D140. doi:10.1093/nar/gkn766. PMC 2686503. PMID 18953034.
[2] Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, et al. (January 2008). "The Pfam protein families database". Nucleic Acids Research. 36 (Database issue): D281–D288. doi:10.1093/nar/gkm960. PMC 2238907. PMID 18039703.
[3] Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF (January 2021). "The Dfam community resource of transposable element families, sequence models, and genome annotations". Mobile DNA. 12 (1): 2. doi:10.1186/s13100-020-00230-y. PMC 7805219. PMID 33436076.
[4] Griffiths-Jones S (January 2005). "RALEE--RNA ALignment editor in Emacs". Bioinformatics. 21 (2): 257–259. doi:10.1093/bioinformatics/bth489. PMID 15377506.
[5] "Alignment Fileformats". 22 May 2019. Retrieved 22 May 2019.
[6] Deiman BA, Kortlever RM, Pleij CW (August 1997). "The role of the pseudoknot at the 3' end of turnip yellow mosaic virus RNA in minus-strand synthesis by the viral RNA-dependent RNA polymerase". Journal of Virology. 71 (8): 5990–5996. doi:10.1128/JVI.71.8.5990-5996.1997. PMC 191855. PMID 9223489.

External links


Erik Sonnhammer's definition of Stockholm format
