Protein primary structure is the linear sequence of amino acids in a peptide or protein.[1] By convention, the primary structure of a protein is reported from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesis is most commonly performed by ribosomes in cells. Peptides can also be synthesized in the laboratory. Protein primary structures can be directly sequenced, or inferred from DNA sequences.
Amino acids are polymerised via peptide bonds to form a long backbone, with the different amino acid side chains protruding along it. In biological systems, proteins are produced during translation by a cell's ribosomes. Some organisms can also make short peptides by non-ribosomal peptide synthesis, which often uses amino acids other than the encoded 22; these peptides may be cyclised, modified and cross-linked.
Peptides can be synthesised chemically via a range of laboratory methods. Chemical methods typically synthesise peptides in the opposite order (starting at the C-terminus) to biological protein synthesis (starting at the N-terminus).
Protein sequence is typically notated as a string of letters, listing the amino acids from the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 22 naturally encoded amino acids, as well as mixtures or ambiguous amino acids (similar to nucleic acid notation).[1][2][3]
Peptides can be directly sequenced, or inferred from DNA sequences. Large sequence databases now exist that collate known protein sequences.
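Inferring a protein sequence from DNA amounts to reading the coding strand three bases at a time and looking up each codon. The following is a minimal sketch, assuming the standard genetic code, an input that is already in the correct reading frame, and no introns; the function name `translate` is illustrative and not taken from any particular library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an unrecognised codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```

The result is written N-terminus first, matching the reporting convention described above. Pyrrolysine and selenocysteine are not produced by this sketch, since they are encoded by context-dependent recoding of stop codons.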
Amino Acid | 3-Letter[4] | 1-Letter[4] |
---|---|---|
Alanine | Ala | A |
Arginine | Arg | R |
Asparagine | Asn | N |
Aspartic acid | Asp | D |
Cysteine | Cys | C |
Glutamic acid | Glu | E |
Glutamine | Gln | Q |
Glycine | Gly | G |
Histidine | His | H |
Isoleucine | Ile | I |
Leucine | Leu | L |
Lysine | Lys | K |
Methionine | Met | M |
Phenylalanine | Phe | F |
Proline | Pro | P |
Pyrrolysine | Pyl | O |
Selenocysteine | Sec | U |
Serine | Ser | S |
Threonine | Thr | T |
Tryptophan | Trp | W |
Tyrosine | Tyr | Y |
Valine | Val | V |
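The table above maps directly onto a lookup dictionary. The sketch below converts between the two notations; the function names and the hyphen separator for three-letter strings are assumptions made for illustration.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Glu": "E",
    "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Pyl": "O", "Sec": "U", "Ser": "S",
    "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Ala-Lys' to 'MAK' (N-terminus first)."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues.split(sep))


def to_three_letter(sequence: str, sep: str = "-") -> str:
    """Convert e.g. 'MAK' to 'Met-Ala-Lys' (N-terminus first)."""
    return sep.join(ONE_TO_THREE[r] for r in sequence.upper())


print(to_one_letter("Met-Ala-Lys"))  # MAK
print(to_three_letter("MAK"))        # Met-Ala-Lys
```

Unknown codes raise a `KeyError` in this sketch rather than being silently replaced.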
Symbol | Description | Residues represented |
---|---|---|
X | Any amino acid, or unknown | All |
B | Aspartate or Asparagine | D, N |
Z | Glutamate or Glutamine | E, Q |
J | Leucine or Isoleucine | I, L |
Φ | Hydrophobic | V, I, L, F, W, M |
Ω | Aromatic | F, W, Y, H |
Ψ | Aliphatic | V, I, L, M |
π | Small | P, G, A, S |
ζ | Hydrophilic | S, T, H, N, Q, E, D, K, R, Y |
+ | Positively charged | K, R, H |
- | Negatively charged | D, E |
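One use of the ambiguity symbols is to write short sequence patterns. As a rough sketch, each symbol in the table above can be expanded into a regular-expression character class; the residue sets are copied from the table, while the function name `pattern_to_regex` is an assumption made for illustration.

```python
import re

STANDARD = "ACDEFGHIKLMNOPQRSTUVWY"  # the 22 encoded amino acids
AMBIGUITY = {
    "X": STANDARD, "B": "DN", "Z": "EQ", "J": "IL",
    "Φ": "VILFWM", "Ω": "FWYH", "Ψ": "VILM", "π": "PGAS",
    "ζ": "STHNQEDKRY", "+": "KRH", "-": "DE",
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Expand a pattern of residue and ambiguity symbols into a compiled regex."""
    parts = []
    for symbol in pattern:
        # Symbols not in the table are treated as literal one-letter codes.
        residues = AMBIGUITY.get(symbol, re.escape(symbol))
        parts.append(f"[{residues}]")
    return re.compile("".join(parts))


motif = pattern_to_regex("+ΦB")  # positive, then hydrophobic, then D or N
print(bool(motif.fullmatch("KVD")))  # True
print(bool(motif.fullmatch("KVE")))  # False
```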
In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure also requires specifying the cross-linking atoms, e.g., specifying the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine.
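A record of primary structure that includes cross-links therefore needs more than a string. The sketch below is one possible representation, not a standard file format: the class name `PrimaryStructure`, the 1-based residue numbering, and the toy peptide are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryStructure:
    """A sequence (N- to C-terminus) plus the cysteine pairs joined by disulfide bonds."""
    sequence: str
    disulfides: list[tuple[int, int]] = field(default_factory=list)  # 1-based positions

    def __post_init__(self) -> None:
        used = set()
        for pair in self.disulfides:
            for pos in pair:
                if not 1 <= pos <= len(self.sequence):
                    raise ValueError(f"position {pos} is outside the sequence")
                if self.sequence[pos - 1] != "C":
                    raise ValueError(f"residue {pos} is not a cysteine")
                if pos in used:
                    raise ValueError(f"cysteine {pos} is used in more than one bond")
                used.add(pos)


# Toy peptide (not a real protein) with one disulfide between Cys2 and Cys5.
peptide = PrimaryStructure("GCKACEC", disulfides=[(2, 5)])
print(peptide)
```

Validating on construction catches the most common bookkeeping error, a bond that points at a residue other than cysteine.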
The chiral centers of a polypeptide chain can undergo racemization. Although racemization does not change the sequence, it does affect the chemical properties of the protein. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases. Additionally, proline can form stable trans-isomers at the peptide bond.
Additionally, the protein can undergo a variety of post-translational modifications, which are briefly summarized here.
The N-terminal amino group of a polypeptide can be modified covalently, e.g.,
The C-terminal carboxylate group of a polypeptide can also be modified, e.g.,
Finally, the peptide side chains can also be modified covalently, e.g.,
Most of the polypeptide modifications listed above occur post-translationally, i.e., after the protein has been synthesized on the ribosome, typically in the endoplasmic reticulum, a subcellular organelle of the eukaryotic cell.
Many other chemical reactions (e.g., cyanylation) have been applied to proteins by chemists, although they are not found in biological systems.
In addition to those listed above, the most important modification of primary structure is peptide cleavage (by chemical hydrolysis or by proteases). Proteins are often synthesized in an inactive precursor form; typically, an N-terminal or C-terminal segment blocks the active site of the protein, inhibiting its function. The protein is activated by cleaving off the inhibitory peptide.
Some proteins can even cleave themselves. Typically, the hydroxyl group of a serine (rarely, threonine) or the thiol group of a cysteine residue attacks the carbonyl carbon of the preceding peptide bond, forming a tetrahedrally bonded intermediate [classified as a hydroxyoxazolidine (Ser/Thr) or hydroxythiazolidine (Cys) intermediate]. This intermediate tends to revert to the amide form, expelling the attacking group, since the amide form is usually favored by free energy (presumably due to the strong resonance stabilization of the peptide group). However, additional molecular interactions may render the amide form less stable; the amino group is then expelled instead, resulting in an ester (Ser/Thr) or thioester (Cys) bond in place of the peptide bond. This chemical reaction is called an N-O acyl shift.
The ester/thioester bond can be resolved in several ways:
The compression of amino acid sequences is a comparatively challenging task. The compression achieved by existing specialized amino acid sequence compressors is low compared with that of DNA sequence compressors, mainly because of the characteristics of the data. For example, modeling inversions is harder because information is lost when going from DNA to amino acids, so the reverse mapping (from amino acids to DNA sequence) is not unique. A current lossless data compressor that provides higher compression is AC2.[5] AC2 mixes various context models using neural networks and encodes the data using arithmetic encoding.
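To give a feel for what a context model measures, the sketch below computes the order-0 Shannon entropy of a sequence, i.e. the average number of bits per residue that an ideal arithmetic coder would need if each residue were modeled independently of its neighbours. This is not AC2 or any part of it; it is the simplest possible baseline, and the function name `order0_entropy` is an assumption made for illustration.

```python
import math
from collections import Counter


def order0_entropy(sequence: str) -> float:
    """Shannon entropy in bits per residue, using only residue frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


print(order0_entropy("ACAC"))  # 1.0
print(math.log2(20))           # upper bound for 20 equally likely residues
```

Because real protein sequences use the 20 common residues at fairly similar frequencies, an order-0 model typically gains little over the `log2(20)` bound, which is consistent with the difficulty described above; higher-order context models try to recover more by conditioning on preceding residues.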
The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux.[6]
Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules.[6]
Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules. A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional fabric. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.
Any linear-chain heteropolymer can be said to have a "primary structure" by analogy to the usage of the term for proteins, but this usage is rare compared to the extremely common usage in reference to proteins. In RNA, which also has extensive secondary structure, the linear chain of bases is generally just referred to as the "sequence", as it is in DNA (which usually forms a linear double helix with little secondary structure). Other biological polymers such as polysaccharides can also be considered to have a primary structure, although the usage is not standard.
The primary structure of a biological polymer to a large extent determines the three-dimensional shape (tertiary structure). Protein sequence can be used to predict local features, such as segments of secondary structure, or trans-membrane regions. However, the complexity of protein folding currently prohibits predicting the tertiary structure of a protein from its sequence alone. Knowing the structure of a similar homologous sequence (for example a member of the same protein family) allows highly accurate prediction of the tertiary structure by homology modeling. If the full-length protein sequence is available, it is possible to estimate its general biophysical properties, such as its isoelectric point.
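As an illustration of estimating a biophysical property from sequence alone, the sketch below approximates the isoelectric point by finding the pH at which the summed Henderson-Hasselbalch charges of the termini and ionisable side chains cancel. The pKa values are approximate and assumed for illustration (published scales differ), the model ignores any effect of the folded environment on pKa, and the function names are hypothetical.

```python
# Approximate pKa values (assumed for illustration; published scales differ).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}           # charged when protonated
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}  # charged when deprotonated
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of an unmodified linear peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # free N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))  # free C-terminal carboxylate
    for aa in sequence.upper():
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisect on pH; net charge decreases monotonically as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(isoelectric_point("MAKDDE"), 2))
```

Bisection is sufficient here because the net charge is a monotonically decreasing function of pH, so there is exactly one zero crossing between pH 0 and 14.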
Sequence families are often determined by sequence clustering, and structural genomics projects aim to produce a set of representative structures to cover the sequence space of possible non-redundant sequences.