The coding region of a gene, also known as the coding DNA sequence (CDS), is the portion of a gene's DNA or RNA that codes for a protein.[1] Studying the length, composition, regulation, splicing, structures, and functions of coding regions compared to non-coding regions across different species and time periods can provide important information about gene organization and the evolution of prokaryotes and eukaryotes.[2] This can further assist in mapping the human genome and developing gene therapy.[3]
Although the term is sometimes used interchangeably with exon, the two are not the same: an exon can comprise the coding region as well as the 3' and 5' untranslated regions of the RNA, so an exon may be only partially made up of coding region. The 3' and 5' untranslated regions of the RNA, which do not code for protein, are termed non-coding regions and are not discussed on this page.[4]
Coding regions are also often confused with exomes, but the two terms are distinct. While the exome refers to all exons within a genome, the coding region refers to sections of the DNA (or primary transcript), or a single section of processed mRNA, that specifically code for a certain kind of protein.
In 1978, Walter Gilbert published "Why Genes in Pieces", which first explored the idea that the gene is a mosaic: each full nucleic acid strand is not coded continuously but is interrupted by "silent" non-coding regions. This was the first indication that there needed to be a distinction between the parts of the genome that code for protein, now called coding regions, and those that do not.[5]

Evidence suggests a general interdependence between base composition patterns and coding region availability.[6] The coding region is thought to have a higher GC-content than non-coding regions. Further research has found that the longer the coding strand, the higher the GC-content. Short coding strands remain comparatively GC-poor, similar to the low GC-content of the translational stop codons TAG, TAA, and TGA.[7]
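As a minimal illustration of the quantity discussed above, GC-content can be computed as the fraction of G and C bases in a sequence. The function name `gc_content` is illustrative and does not come from any cited method.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # convention chosen here for an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq)


# The three stop codons named in the text contain at most one G or C each.
STOP_CODONS = ("TAG", "TAA", "TGA")

if __name__ == "__main__":
    for codon in STOP_CODONS:
        print(codon, round(gc_content(codon), 2))
```

Each stop codon has a GC-content of one third or less, which matches the GC-poor composition mentioned above.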
GC-rich areas are also where the ratio of point mutation types is altered slightly: there are more transitions, which are changes from purine to purine or pyrimidine to pyrimidine, than transversions, which are changes from purine to pyrimidine or pyrimidine to purine. Transitions are less likely to change the encoded amino acid and more likely to remain a silent mutation (especially if they occur in the third nucleotide of a codon), which is usually beneficial to the organism during translation and protein formation.[8]
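The distinction between transitions and transversions can be sketched in a few lines. The helper names below are hypothetical, and the substitution list in the usage example is made up for illustration, not observed data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base DNA substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("expected DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) = transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_ratio(substitutions) -> float:
    """Return transitions divided by transversions for (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in substitutions:
        counts[classify_substitution(ref, alt)] += 1
    if counts["transversion"] == 0:
        return float("inf")  # no transversions seen; the ratio is unbounded
    return counts["transition"] / counts["transversion"]


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "T")]  # illustrative input only
    print(ts_tv_ratio(example))
```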
This indicates that essential coding regions (gene-rich) are higher in GC-content and more stable and resistant to mutation than accessory and non-essential regions (gene-poor).[9] However, it is still unclear whether this came about through neutral and random mutation or through a pattern of selection.[10] There is also debate over whether the methods used to ascertain the relationship between GC-content and coding region, such as gene windows, are accurate and unbiased.[11]


In DNA, the coding region is flanked by the promoter sequence on the 5' end of the template strand and the termination sequence on the 3' end. During transcription, the RNA polymerase (RNAP) binds to the promoter sequence and moves along the template strand to the coding region. RNAP then adds RNA nucleotides complementary to the coding region in order to form the mRNA, substituting uracil in place of thymine.[12] This continues until the RNAP reaches the termination sequence.[12]
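The base-pairing step can be sketched as follows, as a simplified model that ignores promoters, terminators, and RNA processing. It assumes the template strand is written 3' to 5', so the resulting mRNA reads 5' to 3'; the function names and the example sequence are hypothetical.

```python
# Map each DNA template base to the RNA base that pairs with it.
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe_template(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') by pairing RNA bases with a template strand written 3'->5'."""
    return template_3_to_5.upper().translate(TEMPLATE_TO_RNA)


def transcribe_coding(coding_5_to_3: str) -> str:
    """Equivalent shortcut: the mRNA matches the coding strand with U in place of T."""
    return coding_5_to_3.upper().replace("T", "U")


if __name__ == "__main__":
    template = "TACGGCATT"  # 3'->5', illustrative sequence only
    coding = "ATGCCGTAA"    # 5'->3', the complementary coding strand
    print(transcribe_template(template))  # AUGCCGUAA
    print(transcribe_coding(coding))      # AUGCCGUAA
```

Both routes give the same transcript, because the mRNA is complementary to the template strand and therefore matches the coding strand apart from uracil replacing thymine.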
After transcription and maturation, the mature mRNA formed encompasses multiple parts important for its eventual translation into protein. The coding region in an mRNA is flanked by the 5' untranslated region (5'-UTR) and 3' untranslated region (3'-UTR),[1] the 5' cap, and poly-A tail. During translation, the ribosome facilitates the attachment of the tRNAs to the coding region, 3 nucleotides at a time (codons).[13] The tRNAs transfer their associated amino acids to the growing polypeptide chain, eventually forming the protein defined in the initial DNA coding region.
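Reading a coding region three nucleotides at a time can be illustrated with the standard genetic code. This is a minimal sketch that assumes the input begins at the start codon and is already in frame; it does not model the ribosome, tRNAs, or the untranslated regions, and the function name `translate` is illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G ordering.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an in-frame mRNA coding region into one-letter amino acids.

    Reading stops at the first stop codon; trailing bases that do not
    fill a complete codon are ignored.
    """
    mrna = mrna.upper()
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUGGUGA"))  # MAW
```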

The coding region can be modified in order to regulate gene expression.
Alkylation is one form of regulation of the coding region.[15] The gene that would have been transcribed can be silenced by targeting a specific sequence. The bases in this sequence are blocked using alkyl groups, which creates the silencing effect.[16]
While the regulation of gene expression manages the abundance of RNA or protein made in a cell, these mechanisms can themselves be controlled by a regulatory sequence found before the open reading frame begins in a strand of DNA. The regulatory sequence then determines the location and time at which a protein-coding region is expressed.[17]
RNA splicing ultimately determines what part of the sequence becomes translated and expressed, and this process involves cutting out introns and putting together exons. Where the RNA spliceosome cuts, however, is guided by the recognition of splice sites, in particular the 5' splicing site, which is one of the substrates for the first step in splicing.[18] The coding regions are within the exons, which become covalently joined together to form the mature messenger RNA.
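The joining of exons can be sketched as simple string slicing over a transcript with known exon coordinates. The coordinates and sequence below are hypothetical, and the check that introns begin with GU and end with AG reflects the most common splice-site dinucleotides, which is an assumption added here and not a statement from the text above.

```python
def splice(pre_mrna: str, exons) -> str:
    """Join exon intervals (0-based, half-open, sorted) into a mature transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def introns_between(exons):
    """Return the intervals lying between consecutive exons."""
    return [(end1, start2) for (_, end1), (start2, _) in zip(exons, exons[1:])]


def has_canonical_sites(pre_mrna: str, intron) -> bool:
    """Check for GU at the 5' splice site and AG at the 3' splice site (assumed rule)."""
    start, end = intron
    return pre_mrna[start:start + 2] == "GU" and pre_mrna[end - 2:end] == "AG"


if __name__ == "__main__":
    # Illustrative pre-mRNA: exon 1, one intron, exon 2.
    pre_mrna = "AUGGCC" + "GUAAGUCCAG" + "UGGUGA"
    exons = [(0, 6), (16, 22)]
    print(splice(pre_mrna, exons))  # AUGGCCUGGUGA
    print(all(has_canonical_sites(pre_mrna, i) for i in introns_between(exons)))
```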
Mutations in the coding region can have very diverse effects on the phenotype of the organism. While some mutations in this region of DNA/RNA can result in advantageous changes, others can be harmful and sometimes even lethal to an organism's survival. In contrast, changes in the non-coding region may not always result in detectable changes in phenotype.

There are various forms of mutations that can occur in coding regions. One form is silent mutations, in which a change in nucleotides does not result in any change in amino acid after transcription and translation.[20] There also exist nonsense mutations, where base alterations in the coding region code for a premature stop codon, producing a shorter final protein. Point mutations, or single base pair changes in the coding region, that code for different amino acids during translation, are called missense mutations. Other types of mutations include frameshift mutations such as insertions or deletions.[20]
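The categories above can be sketched as a small classifier over a single-base substitution in an in-frame coding sequence. It uses the standard genetic code; the function names and example sequence are hypothetical. The "stop-loss" label for a mutated stop codon is an extra case added here for completeness and is not one of the categories in the text.

```python
from itertools import product

# Standard genetic code over DNA codons; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(cds: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position `pos` of an in-frame CDS."""
    cds, alt = cds.upper(), alt.upper()
    if cds[pos] == alt:
        raise ValueError("alt matches the reference base; nothing changes")
    start = pos - pos % 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos - start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"  # not a category from the text; see lead-in
    return "missense"


def shifts_frame(indel_length: int) -> bool:
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


if __name__ == "__main__":
    cds = "ATGGAATGGTAA"  # illustrative CDS: Met-Glu-Trp-stop
    print(classify_point_mutation(cds, 5, "G"))  # silent (GAA -> GAG, both Glu)
    print(classify_point_mutation(cds, 3, "T"))  # nonsense (GAA -> TAA)
    print(classify_point_mutation(cds, 4, "T"))  # missense (GAA -> GTA, Glu -> Val)
```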
Some forms of mutations are hereditary (germline mutations), or passed on from a parent to its offspring.[21] Such mutated coding regions are present in all cells within the organism. Other forms of mutations are acquired (somatic mutations) during an organism's lifetime, and may not be constant cell-to-cell.[21] These changes can be caused by mutagens, carcinogens, or other environmental agents (e.g. UV). Acquired mutations can also be a result of copy-errors during DNA replication and are not passed down to offspring. Changes in the coding region can also be de novo (new); such changes are thought to occur shortly after fertilization, resulting in a mutation present in the offspring's DNA while being absent in both the sperm and egg cells.[21]
Several mechanisms help prevent lethality due to deleterious mutations in the coding region. Such measures include proofreading by some DNA polymerases during replication, mismatch repair following replication,[22] and the 'Wobble Hypothesis', which describes the degeneracy of the third base within an mRNA codon.[23]
While it is well known that the genome of one individual can have extensive differences when compared to the genome of another, recent research has found that some coding regions are highly constrained, or resistant to mutation, between individuals of the same species. This is similar to the concept of interspecies constraint in conserved sequences. Researchers termed these highly constrained sequences constrained coding regions (CCRs), and have also discovered that such regions may be involved in high purifying selection. On average, there is approximately 1 protein-altering mutation every 7 coding bases, but some CCRs can have over 100 bases in sequence with no observed protein-altering mutations, some without even synonymous mutations.[24] These patterns of constraint between genomes may provide clues to the sources of rare developmental diseases or potentially even embryonic lethality. Clinically validated variants and de novo mutations in CCRs have been previously linked to disorders such as infantile epileptic encephalopathy, developmental delay and severe heart disease.[24]

While identification of open reading frames within a DNA sequence is straightforward, identifying coding sequences is not, because the cell translates only a subset of all open reading frames to proteins.[26] Currently CDS prediction uses sampling and sequencing of mRNA from cells, although there is still the problem of determining which parts of a given mRNA are actually translated to protein. CDS prediction is a subset of gene prediction, the latter also including prediction of DNA sequences that code not only for protein but also for other functional elements such as RNA genes and regulatory sequences.
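The straightforward part, scanning for open reading frames, can be sketched as below. This is a simplified model that assumes ORFs run from an ATG to the first in-frame stop codon and searches only the three forward-strand frames; the function name, the `min_codons` parameter, and the example sequence are hypothetical. As the text notes, the ORFs it reports are only candidates, since not every ORF is translated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) intervals (0-based, half-open) of forward-strand ORFs.

    Each ORF starts at the first ATG seen in a frame and ends at the next
    in-frame stop codon, which is included in the interval.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATGACC"  # illustrative sequence only
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])  # 2 11 ATGAAATGA
```

An ATG with no downstream in-frame stop codon, such as the one at position 7 in the example, is not reported.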
In both prokaryotes and eukaryotes, gene overlapping occurs relatively often in both DNA and RNA viruses as an evolutionary advantage to reduce genome size while retaining the ability to produce various proteins from the available coding regions.[27][28] For both DNA and RNA, pairwise alignments can detect overlapping coding regions, including short open reading frames in viruses, but require a known coding strand against which to compare the potential overlapping coding strand.[29] An alternative method that uses single genome sequences does not require multiple genome sequences for comparison, but needs an overlap of at least 50 nucleotides in order to be sensitive.[30]