- Research article
- Open access
Evolution of the NANOG pseudogene family in the human and chimpanzee genomes
BMC Evolutionary Biology volume 6, Article number: 12 (2006)
Abstract
Background
The NANOG gene is expressed in mammalian embryonic stem cells where it maintains cellular pluripotency. An unusually large family of pseudogenes arose from it, with one unprocessed and ten processed pseudogenes in the human genome. This article compares the NANOG gene and its pseudogenes in the human and chimpanzee genomes and derives an evolutionary history of this pseudogene family.
Results
The NANOG gene and all pseudogenes except NANOGP8 are present at their expected orthologous chromosomal positions in the chimpanzee genome when compared to the human genome, indicating that their origins predate the human-chimpanzee divergence. Analysis of flanking DNA sequences demonstrates that NANOGP8 is absent from the chimpanzee genome.
Conclusion
Based on the most parsimonious ordering of inferred source-gene mutations, the deduced evolutionary origins for the NANOG pseudogene family in the human and chimpanzee genomes, in order of most ancient to most recent, are NANOGP6, NANOGP5, NANOGP3, NANOGP10, NANOGP2, NANOGP9, NANOGP7, NANOGP1, and NANOGP4. All of these pseudogenes were fixed in the genome of the human-chimpanzee common ancestor. NANOGP8 is the most recent pseudogene and it originated exclusively in the human lineage after the human-chimpanzee divergence. NANOGP1 is apparently an unprocessed pseudogene. Comparison of its sequence to the functional NANOG gene's reading frame suggests that this apparent pseudogene remained functional after duplication and, therefore, was subject to selection-driven conservation of its reading frame, and that it may retain some functionality or that its loss of function may be evolutionarily recent.
Background
Processed pseudogenes are derived from reverse transcription of RNA molecules followed by insertion of DNA copies into the genome. Therefore, for a processed pseudogene to be inherited from one organismal generation to the next, it must be derived from RNAs encoded by genes expressed in cells of the germline or the embryonic precursors of these cells. The homeobox gene NANOG is expressed in mammalian embryonic stem cells where its product, a homeobox transcription factor, maintains pluripotency of these cells [1–3]. Therefore, NANOG is an excellent candidate as a possible source of inherited processed pseudogenes. In fact, ten processed pseudogenes derived from NANOG are present in the human genome, an unusually large family of inherited processed pseudogenes derived from a single gene [4–6].
The human chromosomal region containing the NANOG gene has also undergone a tandem duplication resulting in two copies of the NANOG gene on chromosome 12. The two copies are approximately 97% identical and their transcripts are spliced differently [4,5]. Although there is EST-based evidence that both copies are transcribed, Booth and Holland [4] have argued that one of the two copies is an unprocessed pseudogene, which they named NANOGP1. They named the ten processed pseudogenes NANOGP2 through NANOGP11. Two are located on the X chromosome, two on chromosome 6, and one each on chromosomes 2, 7, 9, 10, 14, and 15. NANOGP2 and NANOGP4 through NANOGP10 are full-length or nearly full-length processed pseudogenes lacking introns. NANOGP3 and NANOGP11 are truncated fragments of processed pseudogenes [4,6].
Studies of processed pseudogene evolution in primates are abundant, dating back to the early 1980s [7–9]. Although several studies are directed at pseudogene families [4,6,10], most focus on the evolution of a single processed pseudogene [11–14]. The relatively large number of processed NANOG pseudogenes and the recent release of the Build 1.1 assembly of the chimpanzee (Pan troglodytes) genome [15] provide an excellent opportunity to elucidate the evolutionary history of the NANOG gene and its large pseudogene family. This article compares the human NANOG gene and its pseudogenes with their chimpanzee orthologues and from this comparison derives an evolutionary history of this pseudogene family.
Results and discussion
We identified the chimpanzee orthologues of the human NANOG gene and all of its pseudogenes except NANOGP8 using MEGABLAST and BLASTN searches of the Build 1.1 version of the chimpanzee genome assembly. Table 1 summarizes the chromosomal and genomic locations of the human NANOG gene and pseudogenes and their chimpanzee orthologues. MEGABLAST and BLASTN searches of the chimpanzee genome did not reveal any other NANOG sequences, suggesting that no new NANOG pseudogenes have arisen in the chimpanzee lineage since it diverged from the human lineage. However, we cannot rule out the possibility that additional NANOG pseudogenes may be present in the chimpanzee genome because unsequenced gaps remain in the Build 1.1 assembly. Our data indicate that the NANOG gene and all pseudogenes except NANOGP8 are in their expected orthologous positions in the chimpanzee genome, and that NANOGP8 is not present in the chimpanzee genome.
Chimpanzee orthologues of NANOG and NANOGP1
NANOG [GenBank:NM_024865] is the functional gene in the human genome, whereas NANOGP1 [GenBank:AK097770] is apparently an unprocessed pseudogene derived from tandem duplication of the chromosomal region containing NANOG. However, cDNA and EST data show that NANOGP1 may be transcriptionally active, albeit at a lower level than NANOG, and that its transcripts are spliced differently than those derived from NANOG. Hart et al. [5] designated NANOGP1 as NANOG2 and referred to it as a functional gene, whereas Booth and Holland [4] argued that because of its relatively high degree of divergence from NANOG, and the comparative paucity and ambiguity of transcripts derived from it, NANOGP1 is an unprocessed duplication pseudogene.
MEGABLAST searches of the chimpanzee genome readily identified the orthologues of NANOG and NANOGP1. However, the organization of the chimpanzee orthologue of the human NANOG gene in the chimpanzee Build 1.1 genome assembly suggests that the gene is either rearranged in the chimpanzee genome, or that the assembly is incorrect within this gene. All four exons of the orthologue are present in the assembly but in two different GenBank accessions. The entire sequences of the 5' UTR, exon 1, and exon 2 are found in the region spanning nucleotides 683046 through 686855 of the chromosome 12 contig [GenBank:NW_114668], in a region on the short arm of chromosome 12 near the telomere at a location orthologous to that of the human NANOG gene at 12p13.31. Introns 1 and 2 of the chimpanzee orthologue are also within this region but large segments of them are unsequenced. The complete sequences of exon 3, intron 3, exon 4, and the 3' UTR of the chimpanzee orthologue are found in nucleotides 3808 through 5350 of another accession [GenBank:NW_115304], which is known to reside on chromosome 12 but has not been placed in the Build 1.1 assembly of this chromosome. Furthermore, exon 4 in this accession contains an apparent single nucleotide-pair insertion mutation, resulting in a frameshift and premature termination codon in the reading frame.
To determine if the apparent gene rearrangement and frameshift mutation are present in the chimpanzee NANOG gene, or whether these are assembly and sequencing errors, we compared the available sequences of NANOG and NANOGP1 in the chimpanzee assembly and selected PCR primer sequences in regions that differed sufficiently to ensure specific amplification of the NANOG gene. To verify that the amplicons were not derived from processed NANOG pseudogenes, all target sequences included at least a portion of a NANOG-specific intron.
Two primer combinations amplified fragments that include the region of apparent misassembly within intron 2. Both of these primer combinations amplified PCR fragments of the sizes expected if the gene is intact. We sequenced these fragments (and all other amplified fragments) of the gene and found that their sequences most closely matched those of the intact human NANOG gene and less closely the corresponding sequences in the human pseudogenes, including NANOGP1, confirming that our sequences are derived from the intact chimpanzee NANOG gene. Furthermore, our sequences show that the apparent frameshift mutation in exon 4 in the Build 1.1 assembly is a sequencing error. Our sequencing enabled us to assemble and annotate the genomic sequence of the intact chimpanzee NANOG gene [GenBank:DQ179631].
Human NANOGP8 and its absence in the chimpanzee genome
Human NANOGP8 [GenBank:NG_004093] is located on human chromosome 15 at 15q13.3. It is the most recent of the NANOG processed pseudogenes and is the only one that carries an Alu element found in the 3' UTR of the human NANOG gene. MEGABLAST and BLASTN searches of the chimpanzee genome failed to reveal the presence of a NANOGP8 orthologue; all significant hits were to the NANOG gene and other NANOG pseudogenes. To determine whether NANOGP8 is indeed absent from the chimpanzee genome, we used 762 nucleotides flanking the 5' end and 458 nucleotides flanking the 3' end of the human NANOGP8 pseudogene as queries in a BLASTN search of the chimpanzee genome. The search identified highly homologous and contiguous sequences on chimpanzee chromosome 15, spanning nucleotides 2765812 through 2767049 of the chromosome 15 contig [GenBank:NW_116401.1]. As shown in Figure 1, the NANOGP8 pseudogene is indeed absent from its predicted site in the chimpanzee genome.
Figure 1. Evidence that the NANOGP8 pseudogene is absent from the chimpanzee genome. Sequences flanking the human NANOGP8 pseudogene on chromosome 15 are present in chromosome 15 of the chimpanzee genome but the pseudogene is absent. Comparison of the human and chimpanzee sequences shows that the NANOGP8 pseudogene inserted itself into human chromosome 15 without duplication of the surrounding sequences.
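The contiguity argument can be checked with simple coordinate arithmetic. The sketch below uses only the flank lengths and contig coordinates quoted above; the function name and the interpretation of the small length excess (minor insertions or deletions between the two species) are illustrative assumptions, not part of the reported analysis.

```python
def span_length(start: int, end: int) -> int:
    """Length of an inclusive, 1-based coordinate interval."""
    return end - start + 1

# Values quoted in the text: human flanking queries and the chimpanzee hit.
FLANK_5_PRIME = 762
FLANK_3_PRIME = 458
flank_total = FLANK_5_PRIME + FLANK_3_PRIME

chimp_span = span_length(2765812, 2767049)

# If a full-length processed pseudogene sat between the two flanks, the
# chimpanzee span would exceed the flank total by roughly the pseudogene's
# length. A small excess is what an empty insertion site would produce.
excess = chimp_span - flank_total
print(f"flanks: {flank_total} nt, chimpanzee span: {chimp_span} nt, excess: {excess} nt")
```

The chimpanzee span is only slightly longer than the two human flanks combined, far less than the length of a full-length processed pseudogene, which fits the conclusion that the flanks are contiguous in chimpanzee.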
Other NANOG pseudogenes in the chimpanzee genome
We identified the chimpanzee orthologues of the human NANOG processed pseudogenes NANOGP2, NANOGP3, NANOGP4, NANOGP5, NANOGP6, NANOGP7, NANOGP9, NANOGP10, and NANOGP11 in the Build 1.1 assembly. All of these pseudogenes are in their predicted chromosomal locations when compared to the human genome. The complete sequences of all of these pseudogenes except NANOGP5 and NANOGP9 are present in the chimpanzee genome assembly. A 100-nucleotide segment in the 3' UTR of NANOGP5 and a 1760-nucleotide segment containing the 5' UTR and the entire reading frame of NANOGP9 are unsequenced in the Build 1.1 assembly. However, the presence of 454 nucleotides of the 3' UTR, as well as orthologous flanking sequences, confirms the presence of NANOGP9 at its expected position.
We attempted to amplify the chimpanzee orthologue of NANOGP9 with primers designed to include small regions of flanking sequence on both ends to fully place it within the genome assembly. This pseudogene is embedded in repetitive sequences and, although our primers were designed to match what appeared to be small regions of nonrepetitive sequences in the flanking regions, they failed to amplify the target sequence. We designed primers to match unique sequences near the ends of the NANOGP9 reading frame (based on the human sequence) and successfully amplified and sequenced a region in NANOGP9 corresponding to positions 43–841 of the 918 nucleotide-pair reading frame in the functional NANOG gene [GenBank:DQ301869]. We verified that the sequence is indeed from NANOGP9 by its high similarity to the human orthologue. This sequence further confirms the presence of NANOGP9 in the chimpanzee genome and it allowed us to compare the sequences of the human and chimpanzee orthologues.
This sequence also resolved a question about the origins of NANOGP9 and NANOGP10. Both are located on the X chromosome and both contain a 15 nucleotide-pair deletion that does not appear in the alignment when these two pseudogenes are aligned with each other, suggesting that they share this deletion. These observations imply that NANOGP9 and NANOGP10 may be the products of a single insertion event followed by duplication of the chromosomal segment containing the pseudogene. However, these deletions reside in a region consisting of ten copies of an imperfect 15 nucleotide-pair tandem repeat within the reading frame. The chimpanzee NANOGP9 orthologue does not contain the deletion present in the human orthologue, whereas the chimpanzee and human orthologues of NANOGP10 have the same deletion. This observation indicates that the deletion in human NANOGP9 occurred after the H/C divergence and its origin is thus independent of the deletion in NANOGP10. Furthermore, we examined 5000 nucleotides on both sides of these pseudogenes and found no evidence of a duplication. We conclude that NANOGP9 and NANOGP10 originated independently.
Evolution of the NANOG gene and pseudogene family
The entire functional NANOG gene (according to our sequencing data) and NANOGP1 are present in both the human and chimpanzee genome assemblies at orthologous chromosomal positions. In the 3' UTR of the NANOG gene, there is an Alu element, which is missing from NANOGP1 in both genomes. Therefore, the NANOGP1 unprocessed pseudogene arose through duplication of the chromosomal region containing NANOG before the human-chimpanzee (H/C) divergence and before insertion of the Alu element into the NANOG gene. Because the same Alu element is present in both the human and chimpanzee NANOG genes, its insertion must also have preceded the H/C divergence. The processed pseudogenes NANOGP2, NANOGP3, NANOGP4, NANOGP5, NANOGP6, NANOGP7, NANOGP9, and NANOGP10 lack this Alu element. They thus likely arose before its insertion and, therefore, also predate the H/C divergence. The presence of the NANOGP11 pseudogene fragment in both the human and chimpanzee genomes likewise shows that its origin preceded the H/C divergence.
The human NANOGP8 pseudogene is highly similar to the NANOG gene, is absent from the chimpanzee genome, and contains the same Alu element as the NANOG gene, indicating that this processed pseudogene is the most recent of the NANOG pseudogenes and was inserted into human chromosome 15 after the H/C divergence.
Based on the assumption of a pseudogene mutation rate of 1.25 × 10⁻⁹ mutations per site per year in humans [16,17], Booth and Holland [4] estimated the origin of the NANOGP8 pseudogene as the most recent at 5.2 million years ago, about the time of the H/C divergence. Our results demonstrate that NANOGP8 arose after the H/C divergence, and thus are consistent with this date. Booth and Holland [4] estimated the origins of the other pseudogenes as ranging from over 150 million years ago for NANOGP6 to 22 million years ago for NANOGP1, with the caveat that these dates may be inaccurate, and are likely overestimates, because nucleotide substitution rates for pseudogenes are not well calibrated within this range.
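A rate-based age estimate of this kind reduces to dividing the per-site divergence by the assumed rate. The following is a minimal sketch of that arithmetic, assuming that all observed differences accumulated in the pseudogene after insertion (the functional gene being held constant by selection). The function name is hypothetical, and the substitution count in the usage line is a made-up input chosen only to exercise the function; it is not a value reported by Booth and Holland [4].

```python
def pseudogene_age_years(substitutions: int, sites: int,
                         rate_per_site_per_year: float = 1.25e-9) -> float:
    """Age estimate = (substitutions per site) / (substitutions per site per year).

    Assumes a constant neutral rate and that every difference arose in the
    pseudogene lineage after insertion.
    """
    if sites <= 0:
        raise ValueError("sites must be positive")
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return (substitutions / sites) / rate_per_site_per_year

# Hypothetical input: 6 differences over a 918-site reading frame.
age = pseudogene_age_years(substitutions=6, sites=918)
print(f"estimated age: {age / 1e6:.1f} million years")
```

Because the estimate scales linearly with the count, a handful of misattributed mutations shifts the date by millions of years, which is one reason such dates are treated cautiously.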
Booth and Holland [4] determined the relative ages of the human NANOG pseudogenes by counting the number of mutations in the reading-frame regions of the human NANOG pseudogenes when compared to the reading frame of the functional NANOG gene, scaling their analysis by counting adjacent deletions as a unit-site size of one to compensate for the reduced opportunity of substitution mutation in deleted regions. They concluded that NANOGP6 is the most ancient of the pseudogenes, followed in order of most ancient to most recent by NANOGP5 or NANOGP3, then NANOGP10, then NANOGP9 or NANOGP2, then NANOGP7, then NANOGP4, then NANOGP1, and NANOGP8 as the most recent. Booth and Holland's analysis did not distinguish the order of NANOGP5 and NANOGP3 relative to each other, nor of NANOGP2 and NANOGP9 relative to each other, because of similar degrees of divergence for each of these pairs of pseudogenes from NANOG.
We conducted a similar analysis of relative age, with the same scaling for multiple-nucleotide deletions as a single unit site when those deletions were shared by the human and chimpanzee sequences. We identified mutations that occurred after the H/C divergence as differences between the human and chimpanzee sequences and corrected them to reflect the ancestral sequence at the time of the H/C divergence before completing our analysis. This correction was especially important for NANOGP10, which has accumulated 20 mutations since the H/C divergence, compared to 1–10 mutations for the other pseudogenes. We excluded NANOGP8 from this correction because of its absence in the chimpanzee genome. Also, since NANOGP3 is a truncated pseudogene with only 254 nucleotides within the NANOG coding region, we compared only the portions of NANOG and the other pseudogenes that aligned with these 254 nucleotides when determining the relative age of NANOGP3. The pseudogene fragment NANOGP11 was not included in Booth and Holland's analysis nor ours because it lacks the entire reading frame and has no significant homology with several of the other processed pseudogenes.
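The two adjustments described above, correction to the ancestral state and counting a run of adjacent deleted nucleotides as one unit site, can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the ancestral state is chosen by simple majority among the other aligned sequences (an assumption), and inputs are assumed to be pre-aligned strings with "-" marking deleted positions.

```python
from collections import Counter
from typing import Iterable, Optional

def ancestral_base(human: str, chimp: str, other_bases: Iterable[str]) -> Optional[str]:
    """Infer the base at the H/C divergence for one pseudogene site.

    If human and chimpanzee agree, that base is taken as ancestral. Otherwise
    the base matching more of the other sequences (gene and other pseudogenes)
    is chosen; None means the ancestral state cannot be determined.
    """
    if human == chimp:
        return human
    counts = Counter(other_bases)
    if counts[human] > counts[chimp]:
        return human
    if counts[chimp] > counts[human]:
        return chimp
    return None

def scaled_differences(reference: str, pseudogene: str) -> tuple[int, int]:
    """Return (differences, unit_sites) for two aligned sequences.

    Each run of adjacent gap characters counts as one difference at one unit site.
    """
    if len(reference) != len(pseudogene):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    unit_sites = 0
    open_gap = None  # which sequence currently has an open gap run, if any
    for ref_base, pg_base in zip(reference.upper(), pseudogene.upper()):
        if ref_base == "-" and pg_base == "-":
            continue  # column carries no information for this pair
        if ref_base == "-" or pg_base == "-":
            owner = "reference" if ref_base == "-" else "pseudogene"
            if owner != open_gap:  # a new gap run starts here
                differences += 1
                unit_sites += 1
                open_gap = owner
            continue
        open_gap = None
        unit_sites += 1
        if ref_base != pg_base:
            differences += 1
    return differences, unit_sites

# Toy alignment (not real NANOG sequence): one 3-nt deletion, one substitution.
diffs, sites = scaled_differences("ACGTACGTAC", "ACG---GTAT")
print(diffs, sites, f"{100 * (1 - diffs / sites):.1f}% similar")
```

Dividing the scaled difference count by the scaled number of unit sites gives a similarity figure comparable in kind to the percentages used to rank the pseudogenes.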
Comparison of the sequences after these adjustments results in a relative order that is the same as that determined by Booth and Holland [4]. Also similar to Booth and Holland's conclusions, our analysis showed that NANOGP3 and NANOGP5 were almost identical in the degree of similarity to NANOG (88.6% and 88.2%, respectively), and that NANOGP2 and NANOGP9 were likewise nearly identical in the degree of divergence from NANOG (94.6% and 94.4%, respectively). Thus, like Booth and Holland [4], we could not conclusively determine the relative orders within each of these two pairs of pseudogenes using this type of analysis.
Such an analysis assumes that natural selection has conserved the functional gene's sequence so that the modern sequence of the reading frame represents the source sequence of each of the pseudogenes. Under most circumstances, such an assumption cannot readily be tested. However, the periodic insertion and fixation of ten NANOG pseudogenes with a complete or partial reading frame should have left a record, albeit an imperfect one, of the functional NANOG gene-sequence evolution. If we assume that the reading frame of the functional NANOG gene has changed during the time when the pseudogenes were inserted into the genome, the mutational differences in the pseudogenes should consist of three different types: 1) source-gene mutations, defined as those that occurred in the functional NANOG gene after the insertion of one pseudogene but before the insertion of another, resulting in a polymorphism between these pseudogenes, 2) post-insertion mutations, defined as those that occurred in a pseudogene after its insertion but before the H/C divergence, and 3) post-H/C divergence mutations, defined as mutations that occurred in the NANOG gene and its pseudogenes after the H/C divergence. We readily identified 88 post-H/C divergence mutations in the reading-frame regions of the NANOG gene and its pseudogenes, and in all but four cases we were able to determine the mutant and ancestral nucleotides at each site by comparison of the human and chimpanzee orthologues with the NANOG gene and the other pseudogenes.
Some of the source-gene mutations should be distinguishable from post-insertion pseudogene mutations in our data as a nucleotide that is identical in a set of older pseudogenes, which then changes to a different nucleotide in a set of younger pseudogenes. Moreover, if possible source-gene mutations can be identified, they can be used to reconstruct the evolutionary history of the pseudogene family, and to some extent the evolutionary history of the gene itself.
To reconstruct the evolutionary history of the NANOG gene and its pseudogene family with source-gene mutation analysis, we aligned the reading frame of the human and chimpanzee NANOG gene with the corresponding sequences in all pseudogenes (except NANOGP11, which lacks the reading frame), and corrected (in all but four cases) post-H/C divergence mutations to reflect the ancestral sequence. We identified sites with possible source-gene mutations as a nucleotide shared by two or more pseudogenes and a different nucleotide shared by two or more additional pseudogenes. Any nucleotide present in a particular position in only one pseudogene was considered as a post-insertion pseudogene mutation. A total of 68 sites (out of 918) within the reading frame met these criteria for identification of possible source-gene mutations. We then identified the most parsimonious order of pseudogenes as the one that required the fewest source-gene mutations across these 68 sites.
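The site-selection criterion stated above can be expressed compactly. The sketch below is one possible reading of it, under the assumptions that the input is a dictionary of pre-aligned, ancestral-corrected pseudogene sequences of equal length and that gaps or ambiguous characters never count as a shared nucleotide. The function name and the toy sequences are hypothetical.

```python
from collections import Counter

def candidate_source_gene_sites(alignment: dict[str, str]) -> list[int]:
    """Return 1-based positions where two or more different nucleotides
    are each shared by at least two pseudogenes.

    A nucleotide seen in only one pseudogene is treated as a post-insertion
    mutation and does not make a site a candidate.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    length = lengths.pop()
    sites = []
    for index in range(length):
        counts = Counter(seq[index].upper() for seq in alignment.values())
        shared = [base for base, n in counts.items() if base in "ACGT" and n >= 2]
        if len(shared) >= 2:
            sites.append(index + 1)
    return sites

# Toy alignment, not real NANOG pseudogene sequence.
toy = {
    "P6": "ACGT",
    "P5": "ACGA",
    "P10": "ATGA",
    "P2": "ATGT",
    "P7": "ATCT",
}
print(candidate_source_gene_sites(toy))
```

In the toy alignment, the third column differs in only one sequence, so it is classed as a post-insertion mutation and excluded, whereas the second and fourth columns each have two nucleotides shared by at least two sequences.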
The most parsimonious ordering of the NANOG pseudogenes (154 possible source-gene mutations across 68 sites) from most ancient to most recent is NANOGP6, NANOGP5, NANOGP3, NANOGP10, NANOGP2, NANOGP9, NANOGP7, NANOGP1, NANOGP4, and NANOGP8 as the most recent. The next most parsimonious ordering (156 mutations) is the same as the above order but with the positions of NANOGP5 and NANOGP3 reversed. As a truncated pseudogene, NANOGP3 contains only 19 possible source-gene mutation sites. Of these, only five are informative in distinguishing NANOGP3 and NANOGP5, three supporting NANOGP5 as the older pseudogene and two supporting NANOGP3. Sites with only one mutation in a particular order are more likely to represent a true source-gene mutation than sites with multiple mutations, which probably consist of a combination of source-gene and post-insertion mutations. The three sites, 399, 531, and 568, that support NANOGP5 as the older pseudogene require 1, 2, and 1 mutations to explain the order, respectively. The two sites that support NANOGP3 as the older pseudogene (sites 390 and 566) require 5 and 4 mutations, respectively, to explain that order, suggesting that the most parsimonious order (NANOGP5 older than NANOGP3) is also the most plausible with respect to these two pseudogenes. Additionally, our analysis clarifies the relative order of NANOGP2 and NANOGP9 by clearly placing NANOGP2 as the older of the two (reversing their positions in the order requires 168 mutations).
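A parsimony search over orderings can be sketched as below. The text does not spell out exactly how mutations were tallied for a given order, so the scoring rule here is an assumption: at each candidate site, nucleotides seen in only one pseudogene are ignored as post-insertion mutations, and every change of nucleotide between consecutive pseudogenes in the order counts as one source-gene mutation. Function names and the toy alignment are hypothetical, and the exhaustive search is only practical for small families (ten sequences already mean about 3.6 million permutations).

```python
from collections import Counter
from itertools import permutations

def changes_along_order(order, alignment, sites):
    """Count nucleotide changes implied by an oldest-to-youngest ordering.

    `sites` are 1-based alignment positions. Gaps and nucleotides present in
    only one sequence at a site are skipped before counting changes.
    """
    total = 0
    for position in sites:
        column = [alignment[name][position - 1].upper() for name in order]
        counts = Counter(column)
        informative = [b for b in column if b in "ACGT" and counts[b] >= 2]
        total += sum(1 for a, b in zip(informative, informative[1:]) if a != b)
    return total

def most_parsimonious_orders(alignment, sites):
    """Exhaustively score every ordering; return (best_score, list_of_orders)."""
    best_score = None
    best_orders = []
    for order in permutations(alignment):
        score = changes_along_order(order, alignment, sites)
        if best_score is None or score < best_score:
            best_score, best_orders = score, [order]
        elif score == best_score:
            best_orders.append(order)
    return best_score, best_orders

# Toy data, not real NANOG sequence: two informative sites, five sequences.
toy = {"A": "AA", "B": "AA", "C": "GA", "D": "GG", "E": "GG"}
score, orders = most_parsimonious_orders(toy, sites=[1, 2])
print(score, len(orders), orders[0])
```

Under this scoring an ordering and its mirror image receive the same score, so the direction of time has to come from other evidence, such as overall similarity to the modern gene. Ties within the best score correspond to pairs whose relative order the informative sites cannot resolve.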
The only notable discrepancy between the results of source-gene mutation analysis and ordering by overall similarity to the modern NANOG gene is the relative placement of NANOGP1 and NANOGP4. In the latter analysis, the functional NANOG gene is more similar to NANOGP1 (98.6%) than it is to NANOGP4 (96.4%), implying that NANOGP4 is the older pseudogene. However, source-gene mutation analysis places NANOGP4 as the more recent of the two. Examination of the mutations that distinguish NANOGP1 from NANOG provides compelling evidence that NANOGP1 is indeed the older pseudogene. NANOGP1 is an unprocessed pseudogene that arose from duplication of a segment of chromosome 12, and thus may have remained functional for an undetermined period of time after its formation. As Booth and Holland [4] pointed out, NANOGP1 cannot use the same initiation codon as NANOG because a mutation at position 25 in the reading frame produced a premature termination codon after only eight amino acids. This mutation is present in both the human and chimpanzee orthologues, indicating that it preceded the H/C divergence. Booth and Holland noted, however, that of the three characterized human transcripts from NANOGP1, two are alternatively spliced to remove all of exon 1, so that the NANOGP1 reading frame begins at a position corresponding to the 58th amino acid in the protein encoded by NANOG, which is an internal methionine in the NANOG protein. If NANOGP1 did indeed remain functional after its formation, we would expect natural selection to conserve the sequence within its reading frame when compared to NANOG.
After correction to the ancestral sequence for post-H/C divergence mutations, 15 mutations distinguish NANOGP1 from the NANOG reading frame, and they are nonrandomly distributed. Twelve are clustered in a 121-nucleotide region entirely within exon 1 of the NANOG gene, a region removed during splicing in two characterized NANOGP1 transcripts. Of the three mutations in NANOGP1's apparent reading frame, two are nonsynonymous and one is synonymous. A nonsynonymous mutation at position 246 is a guanine-to-thymine substitution that results in a lysine-to-asparagine substitution in the protein. Comparison with the human and chimpanzee sequences of the other pseudogenes reveals that this is a source-gene mutation that supports NANOGP1 as being older than NANOGP4. The same comparison shows that the guanine in NANOGP1, and therefore the lysine in the protein, are ancestral, and that the source-gene mutation occurred after duplication of NANOGP1 but before insertion of NANOGP4. Interestingly, Booth and Holland [4] found through experimental sequencing that this particular mutation (and amino acid substitution) is polymorphic in modern humans, suggesting that neither lysine nor asparagine is detrimental to protein function at this position.
The other nonsynonymous mutation is a cytosine-to-thymine substitution at position 477, resulting in a proline-to-leucine substitution in the protein. Because proline and leucine have similar biochemical properties, this mutation is also not likely to adversely affect protein function. The NANOG gene and all other pseudogenes in both the human and chimpanzee genomes have a cytosine residue at this position, indicating that this is a post-duplication mutation in NANOGP1.
The single synonymous mutation in the apparent reading frame is at position 384, which lies within the homeobox region. This is clearly a source-gene mutation that also supports the ordering of NANOGP1 as being older than NANOGP4. Only NANOG, NANOGP4, and NANOGP8 have a cytosine at this position; all other pseudogenes, including NANOGP1, have a thymine at this position.
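Classifying a substitution as synonymous or nonsynonymous, and locating it relative to the homeobox, follows mechanically from the reading-frame position and the standard genetic code. The sketch below illustrates this with hypothetical helper names. The homeobox bounds (positions 283–462) are taken from the text; the codon AAG in the usage line is inferred from the reported lysine with a third-position guanine, and the full NANOG reading frame itself is not reproduced here.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

HOMEOBOX_START, HOMEOBOX_END = 283, 462  # reading-frame positions from the text

def codon_coordinates(position: int) -> tuple[int, int]:
    """Map a 1-based reading-frame position to (codon number, position in codon)."""
    if position < 1:
        raise ValueError("position must be 1-based")
    return (position - 1) // 3 + 1, (position - 1) % 3 + 1

def classify_substitution(codon: str, position_in_codon: int, new_base: str):
    """Return (old_amino_acid, new_amino_acid, 'synonymous' | 'nonsynonymous')."""
    codon = codon.upper()
    mutated = list(codon)
    mutated[position_in_codon - 1] = new_base.upper()
    old_aa = CODON_TABLE[codon]
    new_aa = CODON_TABLE["".join(mutated)]
    kind = "synonymous" if old_aa == new_aa else "nonsynonymous"
    return old_aa, new_aa, kind

def in_homeobox(position: int) -> bool:
    """True if a reading-frame position falls within the homeobox region."""
    return HOMEOBOX_START <= position <= HOMEOBOX_END

# Position 246: third base of codon 82; a lysine codon ending in G must be AAG.
codon_number, offset = codon_coordinates(246)
print(codon_number, offset, classify_substitution("AAG", offset, "T"), in_homeobox(246))
```

The same helpers reproduce the mapping of site 52 to amino-acid residue 18 and flag position 384 as lying inside the homeobox.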
Taken in the aggregate, these observations strongly support the hypothesis that NANOGP1 remained functional after duplication and, therefore, was subject to selection-driven conservation of its reading frame. They also raise the possibility that NANOGP1 may retain some functionality or that its loss of function may be evolutionarily recent.
Nucleotide polymorphisms at possible source-gene mutation sites may represent true source-gene mutations or post-insertion pseudogene mutations. Sites in which a single mutation separates a set of older pseudogenes from a set of younger pseudogenes are the most plausible sites for identification of true source-gene mutations. In the most parsimonious ordering, 29 of the 68 sites contained a single possible source-gene mutation (Figure 2). Twenty of these mutations are nonsynonymous and nine are synonymous. If a mutation is indeed a true source-gene mutation, the amino acid it encodes may be reflected in the NANOG proteins of other vertebrates. To determine if this is the case, we used the amino acid sequence of the polypeptide encoded by the human NANOG gene [GenBank:NP_079141] as a query for a BLASTP search of the protein database of all organisms. Proteins from six species displayed full-length or nearly full-length homology to the NANOG protein: crab-eating macaque (Macaca fascicularis [GenBank:BAD72891]), house mouse (Mus musculus [GenBank:XP_132755]), Norway rat (Rattus norvegicus [GenBank:XP_575662]), domestic cattle (Bos taurus [GenBank:AAY84556]), domestic goat (Capra hircus [GenBank:AAW50709]), and domestic dog (Canis familiaris [GenBank:XP_543828]). We excluded a match to a computationally generated hypothetical protein in chimpanzee [GenBank:XP_510125] because it is derived from the DNA sequence of chimpanzee NANOGP7.
Figure 2. Potential single source-gene mutations in the most parsimonious ordering of the NANOG pseudogenes by source-gene mutation analysis. The left side depicts nucleotide sequences of the NANOG gene and pseudogenes after correction of post-H/C divergence mutations to the ancestral sequence. In two instances (sites 565 and 903), the ancestral sequence could not be determined, so both human and chimpanzee sequences are indicated with the human sequence on the left. At site 253, the human and chimpanzee sequences differ for NANOG, and the chimpanzee sequence is ancestral. However, we included the polymorphism because it explains the guanine in NANOGP8 (reverse arrow). Asterisks (*) denote post-insertion mutations and hyphens (-) denote deletions in the DNA sequences of the pseudogenes. The right side depicts inferred amino acid substitutions and the corresponding amino acids in the NANOG proteins of eight species: Cf = Canis familiaris, Ch = Capra hircus, Bt = Bos taurus, Rn = Rattus norvegicus, Mm = Mus musculus, Mf = Macaca fascicularis, Hs = Homo sapiens, Pt = Pan troglodytes. The "h" designation following a site number indicates that the site lies within the homeobox region.
As shown in Figure 2, several of the putative source-gene mutations and their inferred effect on amino acid sequence in the human/chimpanzee NANOG pseudogene family are consistent with the corresponding amino acids in the NANOG proteins of other eutherian mammals. For example, at site 52 in the reading frame, an adenine-to-guanine substitution in the NANOG gene apparently occurred after the insertion of NANOGP10 but before the insertion of NANOGP2, resulting in an asparagine-to-aspartic acid substitution in amino-acid residue 18 of the polypeptide. The dog, cattle, and rat proteins have asparagine at this position, whereas the macaque, chimpanzee, and human have aspartic acid at this position. Similar patterns of congruence between amino acid substitution and amino acid sequences in other mammals are evident at positions 250–251, 275, 568, 713, 817, and 820–821 of the reading frame (Figure 2).
Another feature of the putative source-gene mutations is the paucity of amino acid substitutions at source-gene mutation sites within the homeobox region (positions 283–462 in the reading frame), indicative of high source-gene sequence conservation in this region. Six possible source-gene mutation sites are present within the homeobox region (three of which are single-mutation sites depicted in Figure 2). Five of these six sites have only synonymous mutations. The single nonsynonymous mutation is at position 358, with thymine present in NANOGP7 and NANOGP9 and cytosine present in all other pseudogenes and the NANOG gene, resulting in a leucine-to-phenylalanine substitution in the NANOGP7 and NANOGP9 sequences. These thymines may be independent post-insertion mutations or they could be a source-gene mutation that reverted to its original sequence after the insertion of NANOGP7.
Pseudogene mutations can be used to estimate the dates of origin for individual pseudogenes. However, only post-insertion mutations not subject to purifying selection are reliable indicators of the age of a pseudogene. Our analysis shows that, in the case of the NANOG pseudogene family, source-gene mutations are present and may contribute to a significant number of polymorphisms in the pseudogenes. Although some source-gene and post-insertion mutations may be readily distinguished based on their patterns when the pseudogenes are ordered, others may not be so easily discerned. Even when post-insertion mutations can be reliably identified, pseudogene evolution rates have not been well calibrated prior to the H/C divergence, as pointed out by Booth and Holland [4]. For these reasons, we have avoided age estimations in this study, focusing instead on the relative order of NANOG pseudogene origins.
Conclusion
A synthesis of the results from this article with those of Booth and Holland [4] produces a straightforward evolutionary history of the NANOG pseudogene family in the human and chimpanzee genomes. NANOGP6 is the most ancient of the pseudogenes, followed in order of most ancient to most recent by the processed pseudogenes NANOGP5, NANOGP3, NANOGP10, NANOGP2, NANOGP9, NANOGP7, and NANOGP4. Before insertion of NANOGP4, the region on chromosome 12 containing NANOG underwent a duplication producing NANOGP1, which remained functional and subject to selection-driven conservation of its reading frame. All of these events, and the resulting fixation of their products in the genome, preceded the H/C divergence. Following the H/C divergence, NANOGP8 inserted itself into chromosome 15 in the human lineage.
Methods
DNA amplification, cloning, and sequencing
We obtained chimpanzee DNA (individual PR00226) from the Integrated Primate Biomaterials and Information Resource (IPBIR) of the Coriell Institute for Medical Research (Camden, NJ, USA). We selected sequences for PCR primers specific to the chimpanzee NANOG gene by comparing the NANOG and NANOGP1 sequences from the Build 1.1 assembly and selecting sites with at least two variant nucleotides, with a variant nucleotide on the 3' end of each primer. We selected primer sequences for the NANOGP9 reading frame by identifying sites that contained two variants unique to human NANOGP9 with a variant nucleotide on the 3' end of each primer. All oligonucleotide primers were manufactured by Integrated DNA Technologies (Coralville, IA, USA). We amplified DNA using Accuprime™ Hi-Fidelity Taq polymerase (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's recommendation at 2.5 mM MgCl₂. The PCR amplification protocol consisted of an initial denaturation step of 1.5 min at 94°C, followed by 35 cycles of amplification consisting of 30 s denaturation at 94°C, 30 s for primer annealing at 57°C and between 1 and 5 min of extension at 68°C, depending on the anticipated product size (1 min/1 kb). We cloned the resulting amplicon using the pGEM-T Easy Vector System II (Promega, Madison, WI, USA), and identified recombinant clones by standard blue/white screening methods with IPTG and X-Gal. We purified plasmid DNA from each selected recombinant clone using a GenElute™ plasmid miniprep kit (Sigma, St. Louis, MO, USA) and quantified the DNA using a spectrophotometer. Isolated plasmid DNA was sequenced bidirectionally from M13 (F/R) primers. A 3,889 nucleotide-pair clone containing exon 1, intron 1 and part of exon 2 of the NANOG gene was sequenced by primer walking. DNA sequencing was performed at the Brigham Young University DNA Sequencing Center (Provo, UT, USA) using standard ABI Prism Taq dye-terminator cycle-sequencing methodology.
DNA sequence chromatograms were analyzed with the Contig Express program in the Vector NTI software suite (InforMax, Frederick, MD, USA).
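The primer-selection criterion described above (at least two variant nucleotides within the primer site, one of them at the primer's 3' end) can be illustrated with a minimal Python sketch. It is not the procedure the authors used in practice; it assumes two pre-aligned sequences of equal length, considers only forward primers, and uses toy sequences and a hypothetical primer length chosen for readability.

```python
def forward_primer_sites(target, paralog, primer_len=20, min_variants=2):
    """Find forward-primer windows that discriminate target from paralog.

    Both sequences must already be aligned to the same length. A window is kept
    when it contains no gap characters, differs from the paralog at
    min_variants or more positions, and differs at its last base
    (the 3' end of a forward primer).
    Returns a list of (0-based start, primer sequence, variant count).
    """
    if len(target) != len(paralog):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for start in range(len(target) - primer_len + 1):
        window_t = target[start:start + primer_len]
        window_p = paralog[start:start + primer_len]
        if "-" in window_t or "-" in window_p:
            continue  # skip windows that span alignment gaps
        variants = sum(1 for a, b in zip(window_t, window_p) if a != b)
        if variants >= min_variants and window_t[-1] != window_p[-1]:
            sites.append((start, window_t, variants))
    return sites


# Toy aligned sequences (hypothetical, not NANOG or NANOGP1 data).
target = "ACGTACGGTA"
paralog = "ACGAACGCTA"
print(forward_primer_sites(target, paralog, primer_len=5))  # [(3, 'TACGG', 2)]
```

A reverse primer would apply the same test with the variant required at the first base of the window on the top strand, since that base pairs with the primer's 3' end.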
DNA sequence analysis
To initially identify the locations and DNA sequences of the NANOG gene and its pseudogenes in the chimpanzee genome, we used the GenBank entries for the human NANOG gene and its 11 pseudogenes as queries for MEGABLAST searches of the chimpanzee genome Build 1.1 assembly with default settings including filtering for repetitive sequences. After identifying the genes and pseudogenes in the chimpanzee genome, we copied the sequences and alignments, then confirmed and refined them with the "align two sequences" (bl2seq) BLAST tool with a word size of seven and filtering disabled. We further refined the alignments manually, especially on the ends of the sequences where word-size limitations failed at times to identify true alignments.
We copied flanking DNA sequences on both sides of the human NANOGP8 and NANOGP9 pseudogenes and used them as MEGABLAST queries with default settings and filtering to search the chimpanzee genome to confirm whether or not these pseudogenes were present. After MEGABLAST identified the sequences, we refined alignments with the bl2seq tool with a word size of seven and filtering disabled and with manual refinements.
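The logic of the flanking-sequence test can be sketched in Python as follows. This is a simplified illustration only: exact string matching stands in for the MEGABLAST and bl2seq alignments actually used, the sequences are toy data, and the function name and the `max_empty_gap` tolerance are hypothetical.

```python
def insertion_status(genome, left_flank, right_flank, max_empty_gap=0):
    """Report whether an insertion lies between two flanking sequences.

    Exact matching is an assumption made for brevity; real flanks would be
    located by alignment, which tolerates mismatches and small indels.
    Returns (status, gap length), or ("flank not found", None).
    """
    left_at = genome.find(left_flank)
    if left_at == -1:
        return "flank not found", None
    left_end = left_at + len(left_flank)
    right_at = genome.find(right_flank, left_end)
    if right_at == -1:
        return "flank not found", None
    gap = right_at - left_end  # bases separating the two flanks
    status = "absent" if gap <= max_empty_gap else "present"
    return status, gap


# Toy data: the same flanks with and without an intervening insert.
left, right = "AAACCCGGG", "CATCATCAT"
with_insert = left + "TTTTTTTT" + right
empty_site = left + right

print(insertion_status(with_insert, left, right))  # ('present', 8)
print(insertion_status(empty_site, left, right))   # ('absent', 0)
```

Flanks that are contiguous in one genome but separated by the pseudogene in the other indicate that the insertion is missing from the first genome rather than simply unassembled.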
To facilitate determination of the evolutionary order of pseudogene origin, we copied the reading frame of the functional human NANOG gene and used it as a query in the bl2seq tool with a word size of seven and filtering disabled to determine the best alignment with the corresponding regions of each of the pseudogenes except NANOGP11, which does not include the reading-frame region. We used these alignments to generate a multiple alignment of the reading-frame region of human and chimpanzee orthologues of the NANOG gene and all pseudogenes except NANOGP11. This multiple alignment allowed us to identify post-H/C divergence mutations and correct them to reflect the ancestral sequences, and to identify and distinguish between potential source-gene mutations and post-insertion mutations in the pseudogenes, as described in the results and discussion section.
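One way to picture how a column of the multiple alignment can be read against an ordering of the pseudogenes is the following Python sketch. It is a deliberately simplified rule set assumed for illustration, not the authors' full procedure: a variant confined to a single pseudogene is flagged as a possible post-insertion mutation, a variant shared by an unbroken block of the oldest pseudogenes is flagged as a possible source-gene mutation, and every other pattern is left ambiguous. The ordering is the one given in the Conclusion, with NANOGP8 last; the position-358 bases are those reported in the text, while the other two columns are hypothetical.

```python
# Pseudogene order from most ancient to most recent, as stated in the Conclusion.
ORDER = ["NANOGP6", "NANOGP5", "NANOGP3", "NANOGP10", "NANOGP2",
         "NANOGP9", "NANOGP7", "NANOGP1", "NANOGP4", "NANOGP8"]


def classify_site(bases, gene_base, order=ORDER):
    """Classify one alignment column under the simplified rules described above.

    bases maps each pseudogene name to its base at the site;
    gene_base is the base in the functional NANOG gene.
    """
    variant = [name for name in order if bases[name] != gene_base]
    if not variant:
        return "invariant"
    if len(variant) == 1:
        return "possible post-insertion mutation"
    oldest_block = order[:len(variant)]
    shares_one_base = len({bases[name] for name in variant}) == 1
    if variant == oldest_block and shares_one_base:
        return "possible source-gene mutation"
    return "ambiguous"


# Position 358 as reported: T in NANOGP7 and NANOGP9, C everywhere else.
site_358 = {name: "C" for name in ORDER}
site_358["NANOGP7"] = site_358["NANOGP9"] = "T"
print(classify_site(site_358, "C"))  # ambiguous

# Hypothetical column: the three oldest pseudogenes share a base the gene lacks.
step_site = {name: "G" for name in ORDER}
for name in ORDER[:3]:
    step_site[name] = "A"
print(classify_site(step_site, "G"))  # possible source-gene mutation

# Hypothetical column: a single pseudogene differs from all other sequences.
single_site = {name: "G" for name in ORDER}
single_site["NANOGP2"] = "T"
print(classify_site(single_site, "G"))  # possible post-insertion mutation
```

Under these rules position 358 comes out as ambiguous, which matches the two alternative explanations offered for it in the discussion.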
Abbreviations
- UTR: untranslated region
- H/C divergence: human-chimpanzee divergence
- EST: expressed sequence tag
References
Mitsui K, Tokuzawa Y, Itoh H, Segawa K, Murakami M, Takahashi K, Maruyama M, Maeda M, Yamanaka S: The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells. Cell. 2003, 113 (5): 631-642. 10.1016/S0092-8674(03)00393-3.
Chambers I, Colby D, Robertson M, Nichols J, Lee S, Tweedie S, Smith A: Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells. Cell. 2003, 113 (5): 643-655. 10.1016/S0092-8674(03)00392-1.
Wang SH, Tsai MS, Chiang MF, Li H: A novel NK-type homeobox gene, ENK (early embryo specific NK), preferentially expressed in embryonic stem cells. Gene Expr Patterns. 2003, 3 (1): 99-103. 10.1016/S1567-133X(03)00005-X.
Booth HAF, Holland PWH: Eleven daughters of NANOG. Genomics. 2004, 84 (2): 229-238. 10.1016/j.ygeno.2004.02.014.
Hart AH, Hartley L, Ibrahim M, Robb L: Identification, cloning and expression analysis of the pluripotency promoting Nanog genes in mouse and human. Dev Dyn. 2004, 230 (1): 187-198. 10.1002/dvdy.20034.
Pain D, Chirn GW, Strassel C, Kemp DM: Multiple retropseudogenes from pluripotent cell-specific gene expression indicates a potential signature for novel gene identification. J Biol Chem. 2005, 280 (8): 6265-6268. 10.1074/jbc.C400587200.
Proudfoot NJ, Maniatis T: The structure of a human alpha-globin pseudogene and its relationship to alpha-globin gene duplication. Cell. 1980, 21 (2): 537-544. 10.1016/0092-8674(80)90491-2.
Chang LY, Slightom JL: Isolation and nucleotide sequence analysis of the beta-type globin pseudogene from human, gorilla and chimpanzee. Journal of Molecular Biology. 1984, 180 (4): 767-784. 10.1016/0022-2836(84)90256-0.
Hasegawa M, Kishino H, Yano T: Man's place in Hominoidea as inferred from molecular clocks of DNA. J Mol Evol. 1987, 26 (1-2): 132-147. 10.1007/BF02111287.
Zhang Z, Gerstein M: The human genome has 49 cytochrome c pseudogenes, including a relic of a primordial gene that still functions in mouse. Gene. 2003, 312: 61-72. 10.1016/S0378-1119(03)00579-1.
Ueda S, Watanabe Y, Saitou N, Omoto K, Hayashida H, Miyata T, Hisajima H, Honjo T: Nucleotide sequences of immunoglobulin-epsilon pseudogenes in man and apes and their phylogenetic relationships. J Mol Biol. 1989, 205 (1): 85-90. 10.1016/0022-2836(89)90366-5.
Kawamura S, Ueda S: Immunoglobulin CH gene family in hominoids and its evolutionary history. Genomics. 1992, 13 (1): 194-200. 10.1016/0888-7543(92)90220-M.
Lomax MI, Welch MD, Darras BT, Francke U, Grossman LI: Novel use of a chimpanzee pseudogene for chromosomal mapping of human cytochrome c oxidase subunit IV. Gene. 1990, 86 (2): 209-216. 10.1016/0378-1119(90)90281-U.
Freimuth RR, Wiepert M, Chute CG, Wieben ED, Weinshilboum RM: Human cytosolic sulfotransferase database mining: identification of seven novel genes and pseudogenes. Pharmacogenomics J. 2004, 4 (1): 54-65. 10.1038/sj.tpj.6500223.
Chimpanzee Sequencing and Analysis Consortium: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005, 437 (7055): 69-87. 10.1038/nature04072.
Nachman MW, Crowell SL: Estimate of the mutation rate per nucleotide in humans. Genetics. 2000, 156 (1): 297-304.
Martínez-Arias R, Calafell F, Mateu E, Comas D, Andres A, Bertranpetit J: Sequence variability of a human pseudogene. Genome Res. 2001, 11 (6): 1071-1085. 10.1101/gr.GR-1677RR.
Author information
Authors and Affiliations
Department of Plant and Animal Sciences, Brigham Young University, Provo, UT, 84602, USA
Daniel J Fairbanks & Peter J Maughan
Corresponding author
Correspondence to Daniel J Fairbanks.
Additional information
Authors' contributions
DJF carried out all BLAST searches, sequence alignments, and evolutionary analyses. PJM carried out all DNA amplification, cloning, and experimental sequence assemblies. Both authors drafted the manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Fairbanks, D.J., Maughan, P.J. Evolution of the NANOG pseudogene family in the human and chimpanzee genomes. BMC Evol Biol 6, 12 (2006). https://doi.org/10.1186/1471-2148-6-12