Nature Reviews Genetics
  • Review Article

Classification schemes for protein structure and function

Nature Reviews Genetics, volume 4, pages 508–519 (2003)

Key Points

  • There have been many attempts to classify proteins into groups of related molecules in terms of their primary, secondary and tertiary structure. An overview of these classification schemes is presented.

  • Structural classifications of proteins rely on similarity metrics and the subsequent clustering of related molecules, or protein families, into groups.

  • Some schemes are manually curated whereas others are automatically derived. Although the quality of manual curations might generally be higher, they do not scale well with the increasing amount of protein sequence and structure information.

  • The coverage of known protein-sequence space is not homogeneous across these schemes. We argue that more comparative analyses of these schemes are required to derive a unified structural classification and to understand their similarities and differences.

  • More recently, functional classifications of proteins have also appeared, ranging from enzyme/reaction classifications to functional roles or cellular localization. These classifications rely heavily on functional genomics experiments and sequence annotations.

  • In principle, structural and functional classifications of proteins are independent of one another, and studying the relationships between them will further our understanding of the cellular roles of protein families.

  • Ultimately, a unified scheme for both structure and function should rely on a natural classification that takes into account both the phylogeny and biochemistry of the different protein types in nature.


We examine the structural and functional classifications of the protein universe, providing an overview of the existing classification schemes, their features and inter-relationships. We argue that a unified scheme should be based on a natural classification approach and that more comparative analyses of the present schemes are required both to understand their limitations and to help delimit the number of known protein folds and their corresponding functional roles in cells.

Figure 1: Coverage of protein sequence space by structural versus functional classifications.
Figure 2: Overlaying structural and functional classifications for a group of interconnected proteins.
Figure 3: Relative frequencies of subcellular localization labels for Saccharomyces cerevisiae gene products of known function.
Figure 4: Topology and network structure of known protein interactions from yeast.

We also thank C. Von Mering, I. Rigoutsos and colleagues at the European Bioinformatics Institute (EBI) for providing information on Figure 1.

A criterion or set of criteria that are stated in quantifiable terms.


Clustering is the process of grouping objects on the basis of their similarity. Distance-based hierarchical clustering is used to construct a tree of nested clusters on the basis of the proximity (or distance) between data points.
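As a rough illustration of the distance-based hierarchical clustering described above, the following sketch builds nested clusters by repeatedly merging the two closest groups. It assumes single linkage (the distance between two clusters is the smallest distance between any of their members), which is only one of several possible linkage rules. The protein labels and the distance values are invented for illustration and do not come from any real data set.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative clustering with single linkage.

    Returns the merges in the order they happen, each as
    (members_of_cluster_a, members_of_cluster_b, merge_distance).
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []

    def linkage(a, b):
        # Single linkage: smallest pairwise distance between the two clusters.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append(
            (sorted(labels[i] for i in a), sorted(labels[i] for i in b), linkage(a, b))
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical proteins and a made-up symmetric distance matrix.
labels = ["P1", "P2", "P3", "P4"]
dist = [
    [0.00, 0.10, 0.80, 0.90],
    [0.10, 0.00, 0.70, 0.85],
    [0.80, 0.70, 0.00, 0.20],
    [0.90, 0.85, 0.20, 0.00],
]

merges = single_linkage(labels, dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Read in order, the list of merges is the tree of nested clusters: early merges join close relatives, and the final merge joins the two most distant groups.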


Initiatives to solve the structures of proteins that are encoded in an entire genome by high-throughput methods.


Dynamic programming is an algorithmic approach to solve sequential or multi-stage decision problems, such as finding optimal protein-sequence alignments. The position-weighted dynamic-programming method incorporates a matrix of substitution frequencies between amino acids, weighted by the degree of conservation of particular residues.
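To make the multi-stage idea concrete, here is a minimal sketch of dynamic programming applied to global sequence alignment. It is not the position-weighted method mentioned above: it assumes a flat match/mismatch score instead of a substitution matrix weighted by residue conservation, and the scoring values are arbitrary choices for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    score[i][j] holds the best score for aligning a[:i] with b[:j];
    each cell is filled from three already-solved subproblems.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACDE", "ACE"))
```

A position-weighted variant would replace the flat `substitution` value with a lookup that depends on both the residue pair and the conservation of the alignment position.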


Hand-edited multiple sequence alignments that incorporate sequences that are described in the literature as belonging to the same family. From these seed alignments, hidden Markov models can be created that can in turn be used to search databases and identify new members of the family.


Hidden Markov model (HMM). A pattern-recognition approach that is used in bioinformatics for DNA/protein feature detection and sequence comparison. HMMs are based on transition probabilities for discrete states. These probabilities are usually derived from training sets such as seed alignments.
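The role of transition probabilities between discrete states can be sketched with a toy model and the Viterbi algorithm, which finds the most probable state path for an observed sequence. Everything in the example is an assumption made for illustration: the two states ("conserved" and "variable"), the two observation symbols ("H" for hydrophobic, "P" for polar) and all probabilities are invented rather than estimated from a seed alignment.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]

    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                for prev in states
            )
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical two-state model with made-up probabilities.
states = ["conserved", "variable"]
start_p = {"conserved": 0.5, "variable": 0.5}
trans_p = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
emit_p = {
    "conserved": {"H": 0.9, "P": 0.1},
    "variable": {"H": 0.2, "P": 0.8},
}

path, prob = viterbi("HHPP", states, start_p, trans_p, emit_p)
print(path, prob)
```

In practice the probabilities would be trained from data such as seed alignments, and profile HMMs for protein families use match, insert and delete states rather than the two states assumed here.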


An approach that produces all patterns in any given data set in an efficient way that avoids the explicit enumeration of the entire pattern space.


Genes of common origin that have diverged through speciation rather than duplication. This term is sometimes ambiguously used to denote functionally equivalent genes that are of common origin in different organisms.


Nuclear magnetic resonance (NMR). An analytical chemistry technique that is used to study molecular structure and dynamics, which explores spectrum differences that are caused by the differential alignment of atomic spins in the presence of a strong magnetic field.


A measure of the disorder or unavailability of energy within a closed system.


The development and life cycle of a single organism.


An explicit formal specification of how to represent the objects, concepts and other entities within a domain of discourse, and the relationships among them. Ontologies are designed to create agreed vocabularies for exchanging information.

