In molecular biology, a CCAAT box (also sometimes abbreviated a CAAT box or CAT box) is a distinct pattern of nucleotides with the consensus sequence GGCCAATCT that occurs 60–100 bases upstream of the transcription initiation site. The CAAT box signals the binding site for the RNA transcription factor and is typically accompanied by a conserved consensus sequence. It is an invariant DNA sequence at about position −70 (70 base pairs upstream of the origin of transcription) in many eukaryotic promoters. Genes that have this element seem to require it in order to be transcribed in sufficient quantities. It is frequently absent from genes that encode proteins used in virtually all cells. This box, along with the GC box, is known for binding general transcription factors. Both of these consensus sequences belong to the regulatory promoter. Full gene expression occurs when transcription activator proteins bind to each module within the regulatory promoter. Activation of the CCAAT box requires the binding of specific proteins, known as CCAAT box binding proteins or CCAAT box binding factors.
A CCAAT box is frequently found upstream of eukaryotic coding regions but is not found in prokaryotes.[2]
In the direction of transcription of the template strand, the consensus sequence (the calculated order of the most frequent residues) for the CAAT box is 3'-TG ATTGG (T/C)(T/C)(A/G)-5'. Parentheses denote positions at which either base may be present; the relative frequencies of the two bases are not specified. For example, "(T/C)" means that either thymine or cytosine is preferentially selected at that position.[3] Within metazoa (the animal kingdom), the core binding factor (CBF)-DNA complex retains a high degree of conservation within the CCAAT binding motif, as well as in the sequences flanking this pentameric motif. The CCAAT motif in plants (spinach was used in one experiment) differs slightly from that in metazoa in two ways. First, it is actually a CAAT binding motif: the promoter lacks one of the two C residues of the pentameric motif, and the artificial addition of the second C has no significant effect on binding activity. Some sequences lack the CAAT box completely. Second, the surrounding nucleotides in plants do not match the consensus sequence above, which was determined by Bi et al.[4]
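The degenerate positions in the consensus above can be written compactly with the IUPAC ambiguity codes Y (T or C) and R (A or G), which are a notational assumption not used in the text itself. The following is a minimal sketch, not a published method: it treats the consensus simply as the left-to-right string TGATTGGYYR and searches a sequence for both that string and its reverse complement, so it does not depend on which strand a sequence was recorded from. The function names are hypothetical.

```python
import re

# IUPAC codes used here: Y = T or C, R = A or G (a notational assumption).
IUPAC_TO_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "R": "[AG]"}
COMPLEMENT = str.maketrans("ACGTYR", "TGCARY")


def reverse_complement(seq):
    """Complement every base (including Y and R), then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Turn a motif written with A, C, G, T, Y, R into a compiled regex."""
    return re.compile("".join(IUPAC_TO_REGEX[base] for base in motif))


def find_motif(seq, motif="TGATTGGYYR"):
    """Return sorted (start_index, strand) pairs for non-overlapping matches.

    '+' means the motif matched as written; '-' means its reverse
    complement matched (for the default motif, a CCAAT-containing site).
    """
    seq = seq.upper()
    hits = [(m.start(), "+") for m in motif_to_regex(motif).finditer(seq)]
    rc = reverse_complement(motif)
    hits += [(m.start(), "-") for m in motif_to_regex(rc).finditer(seq)]
    return sorted(hits)


# Toy sequences, made up for illustration only.
print(find_motif("AAATGATTGGCTAGGG"))  # [(3, '+')]
print(find_motif("GGCAGCCAATCATT"))    # [(2, '-')]
```

The reverse complement of the default motif is YRRCCAATCA, which is why the second toy sequence, containing CCAAT, is reported on the '-' strand.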
The CAAT box is what is known as a core promoter, also called the basal promoter or simply the promoter: a region of DNA that initiates transcription of a particular gene. For the CAAT box in particular, this region is located about 60–100 bases upstream (towards the 5' end) of the transcription initiation site of a eukaryotic gene, and no less than 27 base pairs away from it; there, a complex of general transcription factors binds with RNA polymerase II prior to the initiation of transcription.[5][6] It is essential to transcription that these core binding factors (also referred to as nuclear factor Y or NF-Y) are able to bind to the CCAAT motif. Experiments in many laboratories have shown that mutations in the CCAAT motif that cause a loss of CBF binding also decrease transcriptional activity in these promoters, suggesting that CBF-CCAAT complexes are essential for optimum transcriptional activity.[3]
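The positional criterion described above (roughly 60–100 bases upstream of the transcription initiation site) can be illustrated with a short sketch. It rests on several assumptions that are not from the text: the initiation site is numbered +1, upstream bases are numbered −1, −2, and so on with no position 0, a motif's position is that of its first base, and only the given strand is scanned. The function names and the toy sequence are hypothetical.

```python
def motif_positions(seq, tss_index, motif="CCAAT"):
    """Positions of each motif occurrence relative to the initiation site.

    tss_index is the 0-based index of the first transcribed base (+1).
    Upstream bases are numbered -1, -2, ...; there is no position 0.
    """
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        if start < tss_index:
            positions.append(start - tss_index)       # upstream: negative
        else:
            positions.append(start - tss_index + 1)   # downstream: +1, +2, ...
        start = seq.find(motif, start + 1)
    return positions


def in_upstream_window(positions, far=-100, near=-60):
    """Keep only positions between `far` and `near` (inclusive)."""
    return [p for p in positions if far <= p <= near]


# Toy promoter: 100 upstream bases followed by the start of the transcript.
toy = "G" * 20 + "CCAAT" + "G" * 65 + "CCAAT" + "G" * 5 + "ATGAAA"
found = motif_positions(toy, tss_index=100)
print(found)                      # [-80, -10]
print(in_upstream_window(found))  # [-80]
```

Only the occurrence at −80 falls inside the window; the one at −10 is too close to the initiation site to count under this criterion.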
In an experiment done with core binding factor (CBF)-DNA complexes, researchers were able to determine the preferred promoter sequences in the region over and immediately adjacent to the CAAT box, and in two regions on either side of it. Using a PCR-mediated random binding selection process, they showed that the sequence "3' - (T/C)G ATTGG (T/C)(T/C)(A/G) - 5'", which immediately flanks the ATTGG region (CCAAT in the complementary strand), was preferentially selected on the coding strand (opposite the template strand).[3][7][8] This was shown using an oligonucleotide sequence (R1) that contained 27 random nucleotides flanked by a defined 20-nucleotide sequence on each side. While no single nucleotide was selected in every clone on either side of the ATTGG motif (CCAAT in the complementary strand), several positions carried a nucleotide selected with high frequency. Most notable in the sequence above was the G residue towards the 5' end of the ATTGG. The other listed positions were also notable, but at each of them selection was split between two residues. The same experiment yielded the same sequence when a different oligonucleotide (R2) was used, one that contained an ATTGG core flanked by 12 random nucleotides on the 5' side and 10 random nucleotides on the 3' side. The sequences obtained from the two oligonucleotides were very similar and were confirmed in multiple experiments. Flanking the ATTGG motif with two adenine residues (AA) on its 5' end and G(A/G) on its 3' end seems to inhibit formation of the CBF-DNA complex; such sequences occurred in only 1% of the promoter sequences.[3]
Another experiment was performed with the major late promoter (MLP) of adenoviruses from a variety of host species. It examined mutation of the CAAT box and its CCAAT sequence, which is thought to play a pivotal role in the MLP of subgroup C human adenoviruses, alongside species with a deficient CAAT sequence. Transcription initiation at the mutant MLPs was significantly reduced compared with that of the wild type or of species in which there was a CAAT mutant. The failure to restore the normal function seen with an intact CAAT box is consistent with the idea that the CAAT box plays a vital role in the adenovirus MLP and is preferred over other transcriptional elements.[9]
These core binding factors, or nuclear factors (NF-Y), are composed of three subunits: NF-YA, NF-YB, and NF-YC. Whereas in animals each NF-Y subunit is encoded by a single gene, in plants the subunits have diversified in both structure and function. Plant NF-Y families consist of between 8 and 39 members per subunit. This diversification is largely the result of gene duplications and tandem duplications, which have contributed to the larger NF-Y family sizes in plants compared with the single-gene-encoded animal nuclear factors.[10] Each subunit contains an evolutionarily conserved part: the C-terminal region of NF-YA, the central part of NF-YB, and the N-terminal region of NF-YC. More than 70% of each of these regions remains conserved across species. Neighboring regions, however, are generally not conserved.[6]
The NF-YA family encodes transcription factors that are variable in length (between 207 and 347 amino acids in M. truncatula). NF-YA proteins are generally characterized by two domains that are strongly conserved in all higher eukaryotes investigated to date. The first domain (A1) consists of 20 amino acids that form an alpha helix, which appears to be significant in interactions with NF-YB and NF-YC. The second domain (A2), joined to the A1 domain by a conserved linker sequence, is a sequence of 21 amino acids vital for specific binding to the CCAAT box in DNA. The A1 and A2 domains lie towards the C-terminus in mammals but occupy a more central region in plant NF-YA subunits. In plants, the NF-YA subunit has evolved to regulate the development of a facultative root organ present only in leguminous plants, and it has been shown to be expressed in root tissue. It has also been linked to drought resistance, becoming upregulated during drought stress in the roots and leaves of Arabidopsis. NF-YA mutants have shown loss of function and hypersensitivity to drought-like conditions, whereas overexpression of NF-YA has resulted in drought resistance.[10]
Like the NF-YA family, the NF-YB family is variable in length, but its members are on average much smaller than NF-YA subunits (90–240 amino acids in M. truncatula). They are characterized by a structure and amino acid composition similar to the histone fold motif (HFM), which is composed of three alpha helices separated by two beta strand-loop domains. Like NF-YA, NF-YB has been shown to improve drought resistance when overexpressed; it also promotes flowering in Arabidopsis.[10]
The NF-YC proteins are intermediate in size between the NF-YA and NF-YB proteins (117–292 amino acids in M. truncatula) and also contain the HFM that is prevalent in NF-YB proteins. NF-YC has also been shown to be involved in flowering time in certain plants (overexpression leads to earlier flowering), where its influence is potentially regulated by the binding of the protein CONSTANS (CO) to the NF-YC subunit.[10]
Because of the evolutionary changes in NF-Y-encoding genes in plants, plants have a large range of potential trimeric complexes. For example, in Arabidopsis, 36 NF-Y transcription factor subunits (10 NF-YA, 13 NF-YB, and 13 NF-YC subunits) have been identified, which could theoretically form 1690 unique complexes, each containing one subunit of each type. This number is, of course, higher than what actually occurs, since some subunits have specific binding patterns. Functional analyses of NF-Y-encoding genes in plants have shown that, as a result of their evolutionary diversification relative to their animal counterparts, these genes have acquired diverse specific functions, such as roles in embryo development, flowering time control, ER stress, drought stress, and nodule and root development. This may be only a small portion of their capabilities, since the number of theoretical combinations of NF-Y complexes is so large and only a small portion can actually be formed (less than 10% of all possible interactions were confirmed in both directions in yeast).[10]
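The figure of 1690 follows from choosing one subunit of each type independently: 10 × 13 × 13 = 1690. The short sketch below reproduces that arithmetic and, as a cross-check, enumerates the theoretical trimers explicitly. The numbered subunit labels it generates are placeholders for illustration, and the count is a theoretical upper bound that ignores the binding specificity mentioned above.

```python
from itertools import product
from math import prod

# Family sizes for Arabidopsis as given in the text.
FAMILY_SIZES = {"NF-YA": 10, "NF-YB": 13, "NF-YC": 13}

# One subunit from each family, chosen independently: 10 * 13 * 13.
theoretical_trimers = prod(FAMILY_SIZES.values())


def enumerate_trimers(sizes):
    """Yield every (NF-YA, NF-YB, NF-YC) combination using placeholder labels."""
    families = [
        [f"{name}{i}" for i in range(1, count + 1)]
        for name, count in sizes.items()
    ]
    return product(*families)


print(theoretical_trimers)                              # 1690
print(sum(1 for _ in enumerate_trimers(FAMILY_SIZES)))  # 1690
```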
Another aspect of the CCAAT binding motif is the CCAAT/enhancer binding proteins (C/EBPs). They are a group of six highly conserved transcription factors (α–ζ) that bind to the CCAAT motif. While research on these binding proteins is relatively recent, they have been shown to play vital roles in cellular proliferation and differentiation, metabolism, inflammation, and immunity in various cells, particularly hepatocytes, adipocytes, and hematopoietic cells.[11] In adipocytes, for example, this has been shown in a variety of experiments with mice: ectopic expression of C/EBPα and C/EBPβ was able to initiate the cell's differentiation program, that is, the differentiation of preadipocytes into adipocytes (fat cells), even in the absence of adipogenic hormones. In addition, an overabundance of these C/EBPs (specifically, C/EBPδ) causes an accelerated response. Furthermore, cells lacking C/EBP and C/EBP-deficient mice are both unable to undergo adipogenesis. In the mice, this results in death from hypoglycemia or in reduced lipid accumulation in adipose tissue.[12] The C/EBPs share a basic leucine zipper (bZIP) domain at the C-terminus and are able to form dimers with other C/EBPs or with other transcription factors. This dimerization allows the C/EBPs to bind specifically to DNA through a palindromic sequence in the major groove of DNA. They are regulated through various means, including hormones, mitogens, cytokines, nutrients, and other factors.[11]