BMC Biology

Classification and nomenclature of all human homeobox genes

Peter WH Holland 1,✉,#, H Anne F Booth 1,#, Elspeth A Bruford 2
1 Department of Zoology, University of Oxford, South Parks Road, Oxford, OX1 3PS, UK
2 HUGO Gene Nomenclature Committee, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK

✉ Corresponding author.

# Contributed equally.

Received 2007 Mar 30; Accepted 2007 Oct 26; Collection date 2007.

Copyright © 2007 Holland et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PMCID: PMC2211742; PMID: 17963489

Abstract

Background

The homeobox genes are a large and diverse group of genes, many of which play important roles in the embryonic development of animals. Increasingly, homeobox genes are being compared between genomes in an attempt to understand the evolution of animal development. Despite their importance, the full diversity of human homeobox genes has not previously been described.

Results

We have identified all homeobox genes and pseudogenes in the euchromatic regions of the human genome, finding many unannotated, incorrectly annotated, unnamed, misnamed or misclassified genes and pseudogenes. We describe 300 human homeobox loci, which we divide into 235 probable functional genes and 65 probable pseudogenes. These totals include 3 genes with partial homeoboxes and 13 pseudogenes that lack homeoboxes but are clearly derived from homeobox genes. These figures exclude the repetitive DUX1 to DUX5 homeobox sequences, of which we identified 35 probable pseudogenes, with many more expected in heterochromatic regions. Nomenclature is established for approximately 40 formerly unnamed loci, reflecting their evolutionary relationships to other loci in human and other species, and nomenclature revisions are proposed for around 30 other loci. We use a classification that recognizes 11 homeobox gene 'classes' subdivided into 102 homeobox gene 'families'.

Conclusion

We have conducted a comprehensive survey of homeobox genes and pseudogenes in the human genome, described many new loci, and revised the classification and nomenclature of homeobox genes. The classification scheme may be widely applicable to homeobox genes in other animal genomes and will facilitate comparative genomics of this important gene superclass.

Background

Homeobox genes are characterized by the possession of a particular DNA sequence, the homeobox, which encodes a recognizable although very variable protein domain, the homeodomain [1,2]. Most homeodomains are 60 amino acids in length, although exceptions are known. Many homeodomain proteins are transcription factors with important roles in embryonic patterning and cell differentiation, and several have been implicated in human diseases and congenital abnormalities [3].

The homeobox genes have been variously subdivided into superclasses, classes, subclasses or groups, although there has been much inconsistency in the use of these terms. The most commonly recognized groupings are the ANTP, PRD, LIM, POU, HNF, SINE, TALE, CUT, PROS and ZF groups (or variants of these names), although these are not always given equal rank in classification schemes [1,2,4-8]. There is more consensus in classification at a lower level, just above the level of the gene, where very similar genes are grouped into gene families. Widely recognized gene families include Dlx, Evx, Msx, Cdx, En, Otx, Pitx and Emx (or variants of these names), amongst many others, although there is variation, particularly concerning how many gene families are used for the HOX, PAX and NK homeobox genes. Despite the numerous discrepancies, the common principle of classification is the same. The goal of any scheme is to mirror evolutionary diversification, so that 'closely related' genes are placed in the same gene family, and related gene families are placed in the same gene class or other higher grouping. It should be borne in mind, however, that the pathway of evolutionary diversification is never completely known for any large and complex set of genes.

The initial analyses of the draft human genome sequence published in 2001 included estimates of the number of human homeobox genes. Venter et al [9] found 160 homeobox genes, containing 178 homeobox sequences, using large-scale automated classification; while the IHGSC team [10] gave a much higher estimate of 267 homeobox genes. Both were based on draft coverage of the human genome and would be expected to be missing some genes, as well as confusing pseudogenes with genes. In the same year, Banerjee-Basu and Baxevanis [8] presented an analysis of 129 human homeodomain sequences, but this was far from a comprehensive survey. More recently, there have been two more accurate surveys of homeobox genes in the human genome. Nam and Nei [11] found 230 homeobox genes, containing 257 homeobox sequences. Ryan et al [7] found 228 homeodomain sequences in the NCBI RefSeq database of October 2004. Our analyses (described here) revealed many homeobox genes that were incorrectly annotated, named or classified and many homeobox pseudogenes that had previously been missed. We report a complete survey of homeobox loci in the euchromatic regions of the human genome, appropriate gene nomenclature and a consistent classification scheme.

Results and Discussion

How many homeobox genes and pseudogenes?

Using exhaustive database screening, followed by manual examination of sequences, we identified 300 homeobox loci in the human genome. Distinguishing which of these loci are functional genes and which are non-functional pseudogenes was difficult in some cases. Most loci classified as pseudogenes in this study are integrated reverse-transcribed transcripts, readily recognized by their dispersed genomic location, complete lack of intron sequences, and (in some cases) a 3' homopolymeric run of adenine residues. A small minority are duplicated copies of genes, recognized by physical linkage to their functional counterparts and the same (or similar) exon-intron arrangement. In general, retrotransposed gene copies are non-functional (and therefore pseudogenes) from the moment of integration because they lack the 5' promoter regions necessary for transcription. However, such sequences can occasionally acquire new promoters and become functional as 'retrogenes'. Duplicated gene copies often possess 5' promoter regions (as these are often encompassed by the duplication event); most degenerate to pseudogenes due to redundancy, in a process known as non-functionalization, although some can be preserved as functional genes through sub- or neo-functionalization. Thus, in both instances, reliable indicators of non-functionality were sought in order to assign pseudogene status, notably frameshift mutations, premature stop codons and non-synonymous substitutions at otherwise conserved sites in the original coding region.
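Two of the indicators named above, frameshifts and premature stop codons, can be illustrated with a minimal Python sketch. This is not the procedure used in the study, which relied on manual examination of sequences; the function name `disabling_features` and the assumption that the input is an ungapped coding sequence running from the start codon to the expected stop codon are hypothetical. Non-synonymous substitutions at conserved sites are not covered, since detecting them requires an alignment to the functional counterpart.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disabling_features(cds: str) -> list:
    """Return simple indicators of non-functionality in a coding sequence.

    Assumption (hypothetical input format): `cds` begins at the first base
    of the start codon and is expected to end with the normal stop codon.
    """
    cds = cds.upper()
    features = []
    # A length that is not a multiple of three implies a net frameshift.
    if len(cds) % 3 != 0:
        features.append("frameshift")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # An in-frame stop before the final codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        features.append("premature stop codon")
    return features


print(disabling_features("ATGAAATAA"))     # intact toy sequence
print(disabling_features("ATGTAAAAATAA"))  # in-frame stop at codon 2
print(disabling_features("ATGAAAATAA"))    # one inserted base
```

A locus with an empty result is not thereby shown to be functional; the check only flags the most obvious disabling changes.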

We currently estimate that the 300 human homeobox loci comprise 235 functional genes and 65 pseudogenes (Table 1). These figures include three functional genes that possess partial homeobox sequences (PAX2, PAX5 and PAX8) and retrotransposed pseudogenes that correspond to only part of the original transcript, whether or not it includes the homeobox region or indeed any of the original coding region. Consequently, 13 retrotransposed pseudogenes that lack homeobox sequences are included (NANOGP11, TPRX1P1, TPRX1P2, POU5F1P7, POU5F1P8, IRX4P1, TGIF2P2, TGIF2P3, TGIF2P4, CUX2P1, CUX2P2, SATB1P1, ZEB2P1). We do not include PAX1, PAX9 and CERS1; these are functional genes without homeobox motifs, albeit closely related to true homeobox genes (the other PAX and CERS genes).

Table 1.

Numbers of human genes, pseudogenes and gene families in each homeobox gene class. The human homeobox gene superclass contains a total of 235 probable functional genes and 65 probable pseudogenes. These are divided between 102 gene families, which are in turn divided between eleven gene classes.

Class | Subclass | Number of gene families | Number of genes | Number of pseudogenes
ANTP | HOXL | 14 | 52 | 0
ANTP | NKL | 23 | 48 | 19 (b)
PRD | PAX | 3 | 7 (a) | 0
PRD | PAXL | 28 | 43 | 24 (c, d)
LIM | | 6 | 12 | 0
POU | | 7 | 16 | 8 (e)
HNF | | 2 | 3 | 0
SINE | | 3 | 6 | 0
TALE | | 6 | 20 | 10 (f)
CUT | | 3 | 7 | 3 (g)
PROS | | 1 | 2 | 0
ZF | | 5 | 14 | 1 (h)
CERS | | 1 | 5 (i) | 0
Totals | | 102 | 235 (a) | 65 (b-h)

(a) Includes PAX2, PAX5 and PAX8 that have a partial homeobox; excludes PAX1 and PAX9 that lack a homeobox.

(b) Includes NANOGP11 that lacks a homeobox.

(c) Excludes intronless and repetitive DUX1 to DUX5 sequences.

(d) Includes TPRX1P1 and TPRX1P2 that lack a homeobox.

(e) Includes POU5F1P7 and POU5F1P8 that lack a homeobox.

(f) Includes IRX4P1, TGIF2P2, TGIF2P3 and TGIF2P4 that lack a homeobox.

(g) Includes CUX2P1, CUX2P2 and SATB1P1 that lack a homeobox.

(h) Includes ZEB2P1 that lacks a homeobox.

(i) Excludes CERS1 that lacks a homeobox.
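The per-class figures in Table 1 can be cross-checked against the stated totals with a short Python sketch. The counts are transcribed from the table; the names `TABLE_1` and `class_totals` are illustrative only.

```python
# (class, subclass, gene families, genes, pseudogenes), transcribed from Table 1.
TABLE_1 = [
    ("ANTP", "HOXL", 14, 52, 0),
    ("ANTP", "NKL", 23, 48, 19),
    ("PRD", "PAX", 3, 7, 0),
    ("PRD", "PAXL", 28, 43, 24),
    ("LIM", None, 6, 12, 0),
    ("POU", None, 7, 16, 8),
    ("HNF", None, 2, 3, 0),
    ("SINE", None, 3, 6, 0),
    ("TALE", None, 6, 20, 10),
    ("CUT", None, 3, 7, 3),
    ("PROS", None, 1, 2, 0),
    ("ZF", None, 5, 14, 1),
    ("CERS", None, 1, 5, 0),
]


def class_totals(rows):
    """Sum families, genes and pseudogenes per class across its subclasses."""
    totals = {}
    for gene_class, _subclass, families, genes, pseudogenes in rows:
        f, g, p = totals.get(gene_class, (0, 0, 0))
        totals[gene_class] = (f + families, g + genes, p + pseudogenes)
    return totals


totals = class_totals(TABLE_1)
grand_total = tuple(sum(column) for column in zip(*totals.values()))

print(len(totals))     # 11 gene classes
print(totals["ANTP"])  # (37, 100, 19)
print(grand_total)     # (102, 235, 65)
```

The grand total of 235 genes plus 65 pseudogenes gives the 300 loci reported in the text.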

The total number of homeobox sequences in the human genome is higher than 300 for two reasons. First, several genes and pseudogenes possess more than one homeobox sequence, notably members of the Dux (double homeobox), Zfhx and Zhx/Homez gene families. Second, we have excluded a set of sequences related to human DUX4 (DUX1 to DUX5), which have become part of 3.3 kb repetitive DNA elements present in multiple copies in the genome [12-14]. Few of these tandemly-repeated sequences are likely to be functional as expressed proteins, and all were probably derived by retrotransposition from functional DUX gene transcripts (see below). The fact that they are not included in the total count, therefore, is likely to have limited bearing on understanding the diversity and normal function of human homeobox genes. Hence, our figure of 300 homeobox loci is the most useful current estimate of the repertoire of human homeobox genes and pseudogenes.

Classification

We propose a simple classification scheme for homeobox genes, based on two principal ranks: gene class and gene family. A gene class contains one or more gene families, which in turn will contain one or more genes. In a few cases, it is useful to erect an intermediate rank between these levels, and for this we use the term subclass. For the entire set of homeobox genes, we use the term superclass.
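The ranks described above (superclass, class, optional subclass, family, gene) map naturally onto a nested data structure. The following Python sketch is one possible representation, not part of the published scheme; the type names and the `locate` helper are hypothetical, and only a few ANTP-class entries from this article are filled in.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GeneFamily:
    name: str                                   # e.g. "Cdx"
    genes: List[str] = field(default_factory=list)


@dataclass
class GeneClass:
    name: str                                   # e.g. "ANTP"
    # Families grouped by subclass; the key None stands for "no subclass",
    # since the intermediate rank is only used in a few cases.
    subclasses: Dict[Optional[str], List[GeneFamily]] = field(default_factory=dict)


@dataclass
class Superclass:
    name: str
    classes: List[GeneClass] = field(default_factory=list)

    def locate(self, symbol: str) -> Optional[Tuple[str, Optional[str], str]]:
        """Return (class, subclass, family) for a gene symbol, or None."""
        for gene_class in self.classes:
            for subclass, families in gene_class.subclasses.items():
                for family in families:
                    if symbol in family.genes:
                        return (gene_class.name, subclass, family.name)
        return None


antp = GeneClass("ANTP", {
    "HOXL": [GeneFamily("Cdx", ["CDX1", "CDX2", "CDX4"])],
    "NKL": [GeneFamily("En", ["EN1", "EN2"])],
})
homeobox = Superclass("homeobox", [antp])

print(homeobox.locate("CDX2"))  # ('ANTP', 'HOXL', 'Cdx')
print(homeobox.locate("EN1"))   # ('ANTP', 'NKL', 'En')
```

Keeping the subclass as an optional key, rather than a mandatory level, mirrors the fact that most classes are divided directly into families.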

For the rank of gene family, we use a specific evolutionary-based definition based on common practice in the field of comparative genomics and developmental biology. We define a gene family as a set of genes derived from a single gene in the most recent common ancestor of bilaterian animals (here defined as the latest common ancestor of Drosophila and human). This definition has been made explicitly in previous work [2,6] but is actually a principle that has been in widespread, but rather inconsistent, use for over a decade [15]. For example, amongst the homeobox genes, the En (engrailed) gene family was originally defined to include human EN1 and EN2, plus Drosophila en and inv [16]; these four genes arose by independent duplication from a single gene in the most recent common ancestor of insects and vertebrates. Moving outside the homeobox genes, this principle is also widespread; for example, the Hh (hedgehog) gene family was defined to include mouse Shh, Dhh and Ihh, plus Drosophila hh [17]. To clarify boundaries between gene families, we conducted molecular phylogenetic analyses of human homeodomain sequences, using a range of protostome and occasionally cnidarian homeodomain sequences as outgroups (Additional files 1 and 2).

While the gene family definition described above is generally workable for homeobox genes, by necessity there are some exceptions. One type of exception relates to genes with an unknown ancestral number. For example, there is uncertainty as to whether there were one or two Dlx (distal-less) genes in the most recent common ancestor of bilaterians; however it is common practice to refer to a single Dlx gene family [18]. Thus, we stick with convention for this set of genes. There is similar uncertainty over the ancestral number of Irx (iroquois) genes [19], and again we treat these as a single gene family. The HOX genes are an interesting case as their precise number in the most recent common ancestor of bilaterians is unknown due to lack of phylogenetic resolution between 'central' genes [20]. Here we divide the HOX genes into seven gene families: the 'anterior' Hox1 and Hox2 gene families, the 'group 3' Hox3 gene family, the 'central' Hox4, Hox5 and Hox6-8 gene families, and the 'posterior' Hox9-13 gene family. Another type of exception relates to 'orphan' genes. These are genes that have been found in one species (for example human) but not in other species, or at least not in a wide diversity of Metazoa. Some of these will be ancient genes that have been secondarily lost from the genomes of some species, in which case these comply with our evolutionary definition of a gene family made above. Others, however, will be rapidly evolving genes that originated from another homeobox gene and then diverged to such an extent that their origins are unclear [21]. Whenever origins are unclear, we must define a new gene family to encompass those genes, even though they may not date back to the latest common ancestor of bilaterians. In these cases, the gene family is erected to recognize a set of distinct genes on the basis of DNA and protein sequence, rather than on evolutionary origins.

Using the aforementioned criteria, we recognize 102 homeobox gene families in the human genome (Table 1). We are aware that other homeobox gene families exist in bilaterians but have been lost from humans (for example, Nk7, Ro, Hbn, Repo and Cmp; [7]), and we recognize that some gene family boundaries will alter as new information is obtained. Nonetheless, at the present time the 102 gene families provide a sound framework for the study of human homeobox genes.

It is much more difficult to propose a rigorous evolutionary definition for the rank of gene class. Every attempt to classify genes above the level of gene family involves a degree of arbitrariness. We define gene classes by taking two principal criteria into account. First, gene classes should ideally be monophyletic assemblages of gene families. To identify probable monophyletic groups of gene families, we conducted molecular phylogenetic analyses of homeodomain sequences, and looked for sets of gene families that group together stably, regardless of the precise composition of the dataset used (Figures 1, 2, 3; Additional files 3, 4, 5). Some gene families were difficult to place from sequence data alone, and were found in different gene classes (or subclasses) depending on the precise dataset analyzed or the phylogenetic method employed. This is perhaps not surprising as trees that encompass many homeobox genes can only be built with a short sequence alignment (the homeodomain); under these conditions, phylogenetic trees can only be used as a guide to possible classification, not the absolute truth. In ambiguous cases, we used the chromosomal location of genes to guide possible resolution between alternative hypotheses. Second, some homeobox gene classes can be characterized by the presence of additional protein domains outside of the homeodomain [2]. Recognized protein domains associated with homeodomains include the PRD domain, LIM domain, POU-specific domain, POU-like domain, SIX domain, various MEINOX-related domains, the CUT domain, PROS domain, and various ZF domains [2].
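The first criterion, monophyly, can be made concrete with a small Python sketch. It assumes a rooted tree written as nested tuples, which is a simplification: the trees in this article are arbitrarily rooted, so this is only an illustration of the concept, and the toy topology below is hypothetical rather than taken from Figures 1 to 3. A set of genes is treated as monophyletic if some subtree contains exactly those genes and no others.

```python
def leaves(node):
    """Return the set of leaf names under a node (a str leaf or a tuple of children)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def is_monophyletic(tree, members):
    """True if some subtree of the rooted tree has exactly `members` as its leaves."""
    members = set(members)
    if leaves(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, members) for child in tree)


# Hypothetical toy topology, for illustration only.
toy_tree = ((("HOXA1", "HOXB1"), "CDX1"), ("EN1", "EN2"))

print(is_monophyletic(toy_tree, {"EN1", "EN2"}))             # True
print(is_monophyletic(toy_tree, {"HOXA1", "HOXB1", "CDX1"}))  # True
print(is_monophyletic(toy_tree, {"HOXA1", "CDX1"}))           # False
```

In practice such a test would be repeated over trees built from different datasets and methods, since the text stresses that groupings should be stable rather than supported by a single tree.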

Figure 1.


Maximum likelihood phylogenetic tree of human ANTP-class homeodomains. Arbitrarily rooted phylogenetic tree of human ANTP-class homeodomains constructed using the maximum likelihood method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. The proposed division between the HOXL and NKL subclasses is indicated. The position of EN1 and EN2 is unstable; this tree places them in the NKL subclass, whereas neighbor-joining analysis of the same dataset places them at the base of the two subclasses (Additional file 3). Interrelationships of genes in the Nk4 and Nk2.2 families are also unstable (in this tree and Additional file 3 respectively); in these cases synteny within and between genomes clearly resolves gene families. Detailed relationships between different gene families should not be inferred from this tree.

Figure 2.


Maximum likelihood phylogenetic tree of human PRD-class homeodomains. Arbitrarily rooted phylogenetic tree of human PRD-class homeodomains constructed using the maximum likelihood method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded, as are the partial homeodomains of PAX2, PAX5 and PAX8, and the HOPX homeodomain because its extremely divergent sequence destabilizes the overall tree topology. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single Dux-family gene. In this tree Dux-family homeodomains are not monophyletic, even within the same gene; however, monophyly is recovered by neighbor-joining analysis (Additional file 4). Detailed relationships between different gene families should not be inferred from this tree.

Figure 3.


Maximum likelihood phylogenetic tree of human homeodomains excluding ANTP and PRD classes. Arbitrarily rooted phylogenetic tree of human homeodomains excluding the ANTP and PRD classes constructed using the maximum likelihood method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single gene. Classes and/or families are color coded as shown in the key. The LIM and ZF classes are not recovered as two distinct monophyletic groups, a result also found by neighbor-joining analysis (Additional file 5). The multiple homeodomains of Zfhx-family proteins and Zhx/Homez-family proteins are also dispersed in the tree, presumably artefactually. Detailed relationships between different gene families should not be inferred from this tree.

Using the aforementioned criteria, we recognize eleven homeobox gene classes in the human genome: ANTP, PRD, LIM, POU, HNF, SINE, TALE, CUT, PROS, ZF and CERS (Table 1). There is no expectation that the eleven gene classes will be of similar size, simply because some classes will have undergone more expansion by gene duplication than others. In the human genome, the ANTP and PRD classes are much larger than the other classes. Although gene classes should ideally be monophyletic, it is possible that the ZF homeobox gene class, characterized by the presence of zinc finger motifs in most of its members, is polyphyletic (Figure 3; Additional file 5). In other words, domain shuffling may have brought together a homeobox sequence and a zinc finger sequence on more than one occasion. The same may also be true for the LIM class; alternatively the apparent polyphyly of LIM-class homeodomains could be a consequence of LIM domain loss or artefactual placement of some ZF-class homeodomains in phylogenetic analyses (Figure 3; Additional file 5).

In theory, it is possible to recognize higher level associations above the level of the gene class, because the diversification of homeobox genes will have taken place by a continual series of gene duplication events. We do not propose names for hierarchical levels above the rank of class, and consider that gene name, gene family and gene class (and occasionally subclass) convey sufficient information for most purposes.

We use a consistent convention for writing gene classes and gene families. We present the names of all gene classes in abbreviated non-italicized upper case (for example, the ANTP and PRD classes) to avoid confusion with gene symbols (Antp and prd) or indeed gene names (Antennapedia and paired). In contrast, we present the names of all gene families in non-italicized title case; for example, the Cdx, En and Gsc gene families. We have used this style consistently in recent work [6,21-23] and note that several other authors have done likewise [4,7,24]. We suggest that this style, and most of these gene family names, can be used in other bilaterian genomes. Extending the scheme to non-bilaterians is more difficult, however, and awaits clarification of the relationship between the homeobox genes of sponges, placozoans, cnidarians and bilaterians [7,25].
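The capitalization convention can be expressed as two small formatting helpers. This Python sketch is illustrative only; the function names are hypothetical, and the handling of compound family names such as Nk5/Hmx (each part capitalized separately) is an assumption inferred from the names used in this article.

```python
def format_class_name(name: str) -> str:
    """Gene classes are written in upper case, e.g. 'ANTP'."""
    return name.upper()


def format_family_name(name: str) -> str:
    """Gene families are written in title case, e.g. 'Cdx'.

    Compound names separated by '/' are capitalized part by part,
    so that 'nk5/hmx' becomes 'Nk5/Hmx'.
    """
    parts = name.split("/")
    return "/".join(part[:1].upper() + part[1:].lower() for part in parts)


print(format_class_name("antp"))      # ANTP
print(format_family_name("CDX"))      # Cdx
print(format_family_name("HOX6-8"))   # Hox6-8
print(format_family_name("nk5/hmx"))  # Nk5/Hmx
```

Italicization, which distinguishes gene symbols from class and family names in print, is outside the scope of plain strings and is not modeled here.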

The ANTP homeobox class

The ANTP class derives its name from the Antennapedia (Antp) gene, one of the HOX genes within the ANT-C homeotic complex of Drosophila melanogaster. The human genome has 39 HOX genes, arranged into four Hox clusters. Here we divide the HOX genes into seven gene families: Hox1, Hox2, Hox3, Hox4, Hox5, Hox6-8 and Hox9-13. The HOX genes are not the only ANTP-class genes, and we recognize a total of 37 gene families in this class (Table 1). We divide these 37 gene families between two subclasses that are relatively well-supported in phylogenetic analyses: the HOXL and the NKL subclasses (Figure 1; Additional file 3). As previously discussed, the subclasses are largely consistent with the chromosomal positions of genes [26,27]. The HOXL (HOX-Like or HOX-Linked) genes primarily map to two fourfold paralogous regions: the Hox paralogon (2q, 7p/q, 12q and 17q) and the ParaHox paralogon (4q, 5q, 13q and Xq) (Figure 4). The NKL (NK-Like or NK-Linked) genes are more dispersed, but there is a concentration on the NKL or MetaHox paralogon (2p/8p, 4p, 5q and 10q) (Figure 4). Somewhat aberrantly, the Dlx and En gene families group with the NKL subclass in phylogenetic analyses (Figure 1; Additional file 3), but with the HOXL subclass on the basis of chromosomal positions (Figure 4).
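The chromosome arms listed for each paralogon can be turned into a simple lookup from a cytogenetic location to candidate paralogons. The Python sketch below is a rough aid only: the arm sets are copied from the text, the function names are hypothetical, and matching at the level of a whole chromosome arm is a deliberate simplification, since a paralogon covers only part of each arm.

```python
import re

# Chromosome arms named in the text for each paralogon.
PARALOGONS = {
    "Hox": {"2q", "7p", "7q", "12q", "17q"},
    "ParaHox": {"4q", "5q", "13q", "Xq"},
    "NKL": {"2p", "8p", "4p", "5q", "10q"},
}


def chromosome_arm(location: str) -> str:
    """Extract the chromosome arm from a band such as '13q12.2' -> '13q'."""
    match = re.match(r"^(\d+|X|Y)([pq])", location)
    if match is None:
        raise ValueError("unrecognized cytogenetic location: " + location)
    return match.group(1) + match.group(2)


def candidate_paralogons(location: str) -> list:
    """Return the paralogons whose listed arms include this location."""
    arm = chromosome_arm(location)
    return sorted(name for name, arms in PARALOGONS.items() if arm in arms)


print(candidate_paralogons("13q12.2"))  # CDX2, GSX1, PDX1 -> ['ParaHox']
print(candidate_paralogons("7q36.3"))   # EN2, MNX1 -> ['Hox']
print(candidate_paralogons("5q35.1"))   # ['NKL', 'ParaHox']
print(candidate_paralogons("1q41"))     # HLX -> []
```

Because 5q appears in both the ParaHox and NKL lists, the function returns a list of candidates rather than a single answer, and the En example shows why chromosomal position alone cannot settle subclass membership.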

Figure 4.


Chromosomal distribution of human homeobox genes. Ideograms of human chromosomes showing the locations of human homeobox genes. Hox clusters are each shown as a single line for simplicity. Probable pseudogenes are not shown. Genes are color coded according to their class or family (see key). Map positions were obtained through the Ensembl Genome Browser.

Most of the 37 gene families in the ANTP class have been clearly defined before. We draw attention here to several cases that could cause confusion. Other details can be found in Table 2.

Table 2.

Human ANTP class homeobox genes and pseudogenes

Family | Gene symbol | Gene name | Location | Entrez Gene ID | Previous symbols

HOXL subclass

Cdx | CDX1 | caudal type homeobox 1 | 5q32 | 1044 |
Cdx | CDX2 | caudal type homeobox 2 | 13q12.2 | 1045 | CDX3
Cdx | CDX4 | caudal type homeobox 4 | Xq13.2 | 1046 |
Evx | EVX1 | even-skipped homeobox 1 | 7p15.2 | 2128 |
Evx | EVX2 | even-skipped homeobox 2 | 2q31.1 | 344191 |
Gbx | GBX1 | gastrulation brain homeobox 1 | 7q36.1 | 2636 |
Gbx | GBX2 | gastrulation brain homeobox 2 | 2q37.2 | 2637 |
Gsx | GSX1 | GS homeobox 1 | 13q12.2 | 219409 | GSH1
Gsx | GSX2 | GS homeobox 2 | 4q12 | 170825 | GSH2
Hox1 | HOXA1 | homeobox A1 | 7p15.2 | 3198 | HOX1F
Hox1 | HOXB1 | homeobox B1 | 17q21.32 | 3211 | HOX2I
Hox1 | HOXD1 | homeobox D1 | 2q31.1 | 3231 | HOX4G
Hox2 | HOXA2 | homeobox A2 | 7p15.2 | 3199 | HOX1K
Hox2 | HOXB2 | homeobox B2 | 17q21.32 | 3212 | HOX2H
Hox3 | HOXA3 | homeobox A3 | 7p15.2 | 3200 | HOX1E
Hox3 | HOXB3 | homeobox B3 | 17q21.32 | 3213 | HOX2G
Hox3 | HOXD3 | homeobox D3 | 2q31.1 | 3232 | HOX4A
Hox4 | HOXA4 | homeobox A4 | 7p15.2 | 3201 | HOX1D
Hox4 | HOXB4 | homeobox B4 | 17q21.32 | 3214 | HOX2F
Hox4 | HOXC4 | homeobox C4 | 12q13.13 | 3221 | HOX3E
Hox4 | HOXD4 | homeobox D4 | 2q31.1 | 3233 | HOX4B
Hox5 | HOXA5 | homeobox A5 | 7p15.2 | 3202 | HOX1C
Hox5 | HOXB5 | homeobox B5 | 17q21.32 | 3215 | HOX2A
Hox5 | HOXC5 | homeobox C5 | 12q13.13 | 3222 | HOX3D
Hox6-8 | HOXA6 | homeobox A6 | 7p15.2 | 3203 | HOX1B
Hox6-8 | HOXB6 | homeobox B6 | 17q21.32 | 3216 | HOX2B
Hox6-8 | HOXC6 | homeobox C6 | 12q13.13 | 3223 | HOX3C
Hox6-8 | HOXA7 | homeobox A7 | 7p15.2 | 3204 | HOX1A
Hox6-8 | HOXB7 | homeobox B7 | 17q21.32 | 3217 | HOX2C
Hox6-8 | HOXB8 | homeobox B8 | 17q21.32 | 3218 | HOX2D
Hox6-8 | HOXC8 | homeobox C8 | 12q13.13 | 3224 | HOX3A
Hox6-8 | HOXD8 | homeobox D8 | 2q31.1 | 3234 | HOX4E
Hox9-13 | HOXA9 | homeobox A9 | 7p15.2 | 3205 | HOX1G
Hox9-13 | HOXB9 | homeobox B9 | 17q21.32 | 3219 | HOX2E
Hox9-13 | HOXC9 | homeobox C9 | 12q13.13 | 3225 | HOX3B
Hox9-13 | HOXD9 | homeobox D9 | 2q31.1 | 3235 | HOX4C
Hox9-13 | HOXA10 | homeobox A10 | 7p15.2 | 3206 | HOX1H
Hox9-13 | HOXC10 | homeobox C10 | 12q13.13 | 3226 | HOX3I
Hox9-13 | HOXD10 | homeobox D10 | 2q31.1 | 3236 | HOX4D, HOX4E
Hox9-13 | HOXA11 | homeobox A11 | 7p15.2 | 3207 | HOX1I
Hox9-13 | HOXC11 | homeobox C11 | 12q13.13 | 3227 | HOX3H
Hox9-13 | HOXD11 | homeobox D11 | 2q31.1 | 3237 | HOX4F
Hox9-13 | HOXC12 | homeobox C12 | 12q13.13 | 3228 | HOX3F
Hox9-13 | HOXA13 | homeobox A13 | 7p15.2 | 3209 | HOX1J
Hox9-13 | HOXB13 | homeobox B13 | 17q21.32 | 10481 |
Hox9-13 | HOXC13 | homeobox C13 | 12q13.13 | 3229 | HOX3G
Hox9-13 | HOXD13 | homeobox D13 | 2q31.1 | 3239 | HOX4I
Mnx | MNX1 | motor neuron and pancreas homeobox 1 | 7q36.3 | 3110 | HLXB9, HB9, HOXHB9
Meox | MEOX1 | mesenchyme homeobox 1 | 17q21.31 | 4222 | MOX1
Meox | MEOX2 | mesenchyme homeobox 2 | 7p21.1 | 4223 | MOX2, GAX
Pdx | PDX1 | pancreatic and duodenal homeobox 1 | 13q12.2 | 3651 | IPF1, IUF1, IDX1, STF1

NKL subclass

Barhl | BARHL1 | BarH-like homeobox 1 | 9q34.13 | 56751 |
Barhl | BARHL2 | BarH-like homeobox 2 | 1p22.2 | 343472 |
Barx | BARX1 | BARX homeobox 1 | 9q22.32 | 56033 |
Barx | BARX2 | BARX homeobox 2 | 11q24.3 | 8538 |
Bsx | BSX | brain specific homeobox | 11q24.1 | 390259 |
Dbx | DBX1 | developing brain homeobox 1 | 11p15.1 | 120237 |
Dbx | DBX2 | developing brain homeobox 2 | 12q12 | 440097 |
Dlx | DLX1 | distal-less homeobox 1 | 2q31.1 | 1745 |
Dlx | DLX2 | distal-less homeobox 2 | 2q31.1 | 1746 | TES1
Dlx | DLX3 | distal-less homeobox 3 | 17q21.33 | 1747 |
Dlx | DLX4 | distal-less homeobox 4 | 17q21.33 | 1748 | DLX7, DLX8, DLX9, BP1
Dlx | DLX5 | distal-less homeobox 5 | 7q21.3 | 1749 |
Dlx | DLX6 | distal-less homeobox 6 | 7q21.3 | 1750 |
Emx | EMX1 | empty spiracles homeobox 1 | 2p13.2 | 2016 |
Emx | EMX2 | empty spiracles homeobox 2 | 10q26.11 | 2018 |
En | EN1 | engrailed homeobox 1 | 2q14.2 | 2019 |
En | EN2 | engrailed homeobox 2 | 7q36.3 | 2020 |
Hhex | HHEX | hematopoietically expressed homeobox | 10q23.33 | 3087 | HEX, PRH, PRHX
Hlx | HLX | H2.0-like homeobox | 1q41 | 3142 | HLX1, HB24
Lbx | LBX1 | ladybird homeobox 1 | 10q24.32 | 10660 | LBX1H, HPX6
Lbx | LBX2 | ladybird homeobox 2 | 2p13.1 | 85474 |
Msx | MSX1 | msh homeobox 1 | 4p16.2 | 4487 | HOX7
Msx | MSX2 | msh homeobox 2 | 5q35.2 | 4488 | HOX8, MSH
Msx | MSX2P1 | msh homeobox 2 pseudogene | 17q23.2 | 55545 | HPX5, MSX2P
Nanog | NANOG | Nanog homeobox | 12p13.31 | 79923 |
Nanog | NANOGP1 | Nanog homeobox pseudogene 1 | 12p13.31 | 404635 | NANOG2
Nanog | NANOGP2 | Nanog homeobox pseudogene 2 | 2q36.1 | 414131 | NANOGP4
Nanog | NANOGP3 | Nanog homeobox pseudogene 3 | 6p12.1 | 340217 |
Nanog | NANOGP4 | Nanog homeobox pseudogene 4 | 7p15.1 | 414132 | NANOGP2
Nanog | NANOGP5 | Nanog homeobox pseudogene 5 | 9q31.1 | 414133 |
Nanog | NANOGP6 | Nanog homeobox pseudogene 6 | 10q24.2 | 414134 |
Nanog | NANOGP7 | Nanog homeobox pseudogene 7 | 14q32.12 | 414130 | NANOGP3
Nanog | NANOGP8 | Nanog homeobox pseudogene 8 | 15q14 | 388112 | NANOGP1
Nanog | NANOGP9 | Nanog homeobox pseudogene 9 | Xq12 | 349386 | NANOGP6
Nanog | NANOGP10 | Nanog homeobox pseudogene 10 | Xp11.3 | 349372 | NANOGP5
Nanog | NANOGP11 | Nanog homeobox pseudogene 11 | 6q25.2 | 414135 |
Nk1 | NKX1-1 | NK1 homeobox 1 | 4p16.3 | 54729 | NKX1.1, HSPX153, HPX153
Nk1 | NKX1-2 | NK1 homeobox 2 | 10q26.13 | 390010 | NKX1.2, C10orf121
Nk2.1 | NKX2-1 | NK2 homeobox 1 | 14q13.3 | 7080 | NKX2.1, NKX2A, TTF1, TITF1
Nk2.1 | NKX2-4 | NK2 homeobox 4 | 20p11.22 | 4823 | NKX2.4, NKX2D
Nk2.2 | NKX2-2 | NK2 homeobox 2 | 20p11.22 | 4821 | NKX2.2, NKX2B
Nk2.2 | NKX2-8 | NK2 homeobox 8 | 14q13.3 | 26257 | NKX2.8, NKX2H
Nk3 | NKX3-1 | NK3 homeobox 1 | 8p21.2 | 4824 | NKX3.1, NKX3A
Nk3 | NKX3-2 | NK3 homeobox 2 | 4p15.33 | 579 | NKX3.2, NKX3B, BAPX1
Nk4 | NKX2-3 | NK2 homeobox 3 | 10q24.2 | 159296 | NKX2.3, NKX2C, NKX4-3, CSX3
Nk4 | NKX2-5 | NK2 homeobox 5 | 5q35.1 | 1482 | NKX2.5, NKX2E, NKX4-1, CSX, CSX1
Nk4 | NKX2-6 | NK2 homeobox 6 | 8p21.2 | 137814 | NKX2.6, NKX4-2, CSX2
Nk5/Hmx | HMX1 | H6 family homeobox 1 | 4p16.1 | 3166 | NKX5-3, H6
Nk5/Hmx | HMX2 | H6 family homeobox 2 | 10q26.13 | 3167 | NKX5-2, H6L
Nk5/Hmx | HMX3 | H6 family homeobox 3 | 10q26.13 | 340784 | NKX5-1
Nk6 | NKX6-1 | NK6 homeobox 1 | 4q21.23 | 4825 | NKX6.1, NKX6A
Nk6 | NKX6-2 | NK6 homeobox 2 | 10q26.3 | 84504 | NKX6.2, NKX6B, GTX
Nk6 | NKX6-3 | NK6 homeobox 3 | 8p11.21 | 157848 | NKX6.3
Noto | NOTO | notochord homeobox | 2p13.2 | 344022 |
Tlx | TLX1 | T-cell leukemia homeobox 1 | 10q24.32 | 3195 | HOX11, TCL3
Tlx | TLX2 | T-cell leukemia homeobox 2 | 2p13.1 | 3196 | HOX11L1, NCX
Tlx | TLX3 | T-cell leukemia homeobox 3 | 5q35.1 | 30012 | HOX11L2, RNX
Vax | VAX1 | ventral anterior homeobox 1 | 10q26.11 | 11023 |
Vax | VAX2 | ventral anterior homeobox 2 | 2p13.3 | 25806 |
Ventx | VENTX | VENT homeobox | 10q26.3 | 27287 | VENTX2, HPX42B
Ventx | VENTXP1 | VENT homeobox pseudogene 1 | Xp21.3 | 139538 | VENTX2P1, NA88A
Ventx | VENTXP2 | VENT homeobox pseudogene 2 | 13q31.1 | 347975 | VENTX2P2
Ventx | VENTXP3 | VENT homeobox pseudogene 3 | 12q21.1 | 349814 | VENTX2P3
Ventx | VENTXP4 | VENT homeobox pseudogene 4 | 3p24.2 | 152101 | VENTX2P4
Ventx | VENTXP5 | VENT homeobox pseudogene 5 | 8p12 | 442384 |
Ventx | VENTXP6 | VENT homeobox pseudogene 6 | 8q21.11 | 552879 |
Ventx | VENTXP7 | VENT homeobox pseudogene 7 | 3p24.3 | 391518 | VENTX1, HPX42

Human ANTP class homeobox genes and pseudogenes including full names, chromosomal locations, Entrez Gene IDs and previous symbols. NANOGP1 is a duplicate of NANOG.

∘ Cdx, Gsx and Pdx gene families. Some authors refer to the Pdx gene family as the Xlox gene family [28]. One gene from each of these families (CDX2, GSX1 and PDX1) forms the ParaHox cluster at 13q12.2 (Figure 4), and clustering of Cdx, Gsx and Pdx genes is ancestral for chordates [28].

∘ Mnx gene family. This gene family name derives from a previous study [29]. The family includes one gene in the human genome: MNX1 (formerly HLXB9), and two genes in the chicken genome: Mnx1 (formerly HB9) and Mnx2 (formerly MNR2). Some authors refer to the Mnx gene family as the Exex gene family due to the Drosophila ortholog exex [7].

∘ Dlx gene family. It is currently unclear if this gene family is derived from one or more genes in the common ancestor of bilaterians [18]. Phylogenetic analyses place this gene family firmly within the NKL subclass (Figure 1; Additional file 3), but chromosomal positions (on the Hox chromosomes 2, 7 and 17) place it within the HOXL subclass (Figure 4). Here we favor placement of the Dlx gene family within the NKL subclass due to strong phylogenetic support.

∘ En gene family. Phylogenetic analyses place this gene family either within the NKL subclass (maximum likelihood; Figure 1) or close to the division between the NKL and HOXL subclasses (neighbor-joining; Additional file 3). Here we place the En gene family within the NKL subclass, although we note that human EN2 maps close to the clear HOXL-subclass genes GBX1 and MNX1 on chromosome 7 (Figure 4).

∘ Nk2.1 and Nk2.2 gene families. The genes NKX2-1 (formerly TITF1), NKX2-4, NKX2-2 and NKX2-8 divide into two distinct gene families each with an invertebrate ortholog, not a single Nk2 gene family. NKX2-1 and NKX2-4 are collectively orthologous to Drosophila scro and amphioxus AmphiNk2-1 [30,31]; these comprise one gene family: Nk2.1. NKX2-2 and NKX2-8 are collectively orthologous to Drosophila vnd and amphioxus AmphiNk2-2 [31,32]; these comprise a second gene family: Nk2.2.

∘ Nk4 gene family. The genes NKX2-3, NKX2-5 and NKX2-6 form a gene family, quite distinct from other human genes that confusingly share the prefix NKX2. These three genes are actually orthologs of Drosophila tin (formerly NK4); they are not orthologs of Drosophila vnd (formerly NK2) or scro [33]. Therefore, they do not belong to the Nk2.1 or Nk2.2 gene families, but belong to a separate Nk4 gene family. As the three gene names have very extensive current usage, it may be difficult for revised names to be used consistently. In this situation, we do not alter the current names, but raise for discussion the possibility of these genes being renamed to the more logical NKX4-1 (NKX2-5), NKX4-2 (NKX2-6) and NKX4-3 (NKX2-3), or to CSX1 (NKX2-5), CSX2 (NKX2-6) and CSX3 (NKX2-3), based on the alternative name CSX1 for NKX2-5 [34].

∘ Noto gene family. This gene family falls close to the division between the ANTP and PRD classes in phylogenetic analyses (Additional files 1 and 2). We favor placement within the ANTP class as the human NOTO gene is chromosomally linked to the clear ANTP-class (NKL-subclass) genes EMX1, LBX2, TLX2 and VAX2 on chromosome 2 (Figure 4), suggesting ancestry by ancient tandem duplication.

Most of the 100 genes in the ANTP class have been adequately named previously. However, several genes were unnamed or misnamed prior to this study. We have updated these as follows.

GSX2 [Entrez Gene ID: 170825] is the second of two human members of the Gsx gene family. This previously unnamed gene has clear orthology to mouse Gsh2, inferred from sequence identity and synteny. We designate the gene GSX2 and revise the nomenclature of the other human member of the family from GSH1 to GSX1 [Entrez Gene ID: 219409], in accordance with homeobox gene nomenclature convention.

MNX1 [Entrez Gene ID: 3110] is the only member of the Mnx gene family in the human genome. This gene was previously known as HLXB9; we rename it MNX1 because it is not part of a series of at least nine related genes.

PDX1 [Entrez Gene ID: 3651] is the only member of the Pdx gene family in the human genome. This gene was previously known as IPF1; we rename it PDX1 because the majority of published studies use this as the gene symbol.

BSX [Entrez Gene ID: 390259] is the only member of the Bsx gene family in the human genome. We designate this previously unnamed gene BSX on the basis of clear orthology to the mouse Bsx gene, inferred from sequence identity and synteny.

DBX1 [Entrez Gene ID: 120237] and DBX2 [Entrez Gene ID: 440097] are the only two members of the Dbx gene family in the human genome. We designate these previously unnamed genes DBX1 and DBX2 on the basis of clear orthology to mouse Dbx1 and Dbx2, inferred from sequence identity and synteny.

NKX1-1 [Entrez Gene ID: 54729] and NKX1-2 [Entrez Gene ID: 390010] are the only two members of the Nk1 gene family in the human genome. These genes were previously known as HSPX153 and C10orf121 respectively; we rename them NKX1-1 and NKX1-2 on the basis of clear orthology to mouse Nkx1-1 and Nkx1-2, inferred from sequence identity and synteny.

NKX2-1 [Entrez Gene ID: 7080] is the first of two human members of the Nk2.1 gene family. This gene was previously known as TITF1; we rename it NKX2-1 to show that it is a member of the Nk2.1 gene family.

NKX2-6 [Entrez Gene ID: 137814] is the third of three human members of the Nk4 gene family. We designate this previously unnamed gene NKX2-6 on the basis of clear orthology to mouse Nkx2-6, inferred from sequence identity and synteny, although nomenclature revision for the entire Nk4 gene family should be discussed (see above).

NKX3-2 [Entrez Gene ID: 579] is the second of two human members of the Nk3 gene family. This gene was previously known as BAPX1; we rename it NKX3-2 to show that it is a member of the Nk3 gene family.

NKX6-3 [Entrez Gene ID: 157848] is the third of three human members of the Nk6 gene family. We designate this previously unnamed gene NKX6-3 on the basis of clear orthology to mouse Nkx6-3, inferred from sequence identity and synteny.

VENTX [Entrez Gene ID: 27287] is the only functional member of the Ventx gene family in the human genome. This gene was previously known as VENTX2. We remove the numerical suffix from this gene symbol because we discovered that the sequence formerly known as VENTX1 is actually a retrotransposed pseudogene derived from this gene. Accordingly, we also replace the VENTX1 symbol with VENTXP7 (see below).

In contrast to the previous descriptions of probable functional genes, there has been much less research on pseudogenes within the ANTP class. Eleven pseudogenes derived from the human NANOG gene have been described previously [22], while four pseudogenes in the Ventx gene family have been reported following routine annotation of the human genome. We have identified two additional Ventx-family pseudogenes (VENTXP5 and VENTXP6), and also found two cases of pseudogenes that were originally mistaken for functional genes (MSX2P1 and VENTXP7). In all cases, we have clarified the origins and organization of these pseudogenes. This research brings the total number of ANTP-class pseudogenes in the human genome to 19.
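The total of 19 follows directly from the four categories listed in this paragraph. As a quick check, the tally can be sketched in Python; the dictionary keys are descriptive labels introduced here for illustration, not official terms.

```python
# Pseudogene counts for the ANTP class, as stated in the paragraph above.
# The category labels are illustrative names chosen for this sketch.
antp_pseudogenes = {
    "NANOG-derived, previously described": 11,
    "Ventx family, previously annotated": 4,
    "Ventx family, newly identified (VENTXP5, VENTXP6)": 2,
    "formerly mistaken for functional genes (MSX2P1, VENTXP7)": 2,
}

total = sum(antp_pseudogenes.values())
print(total)
```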

MSX2P1 [Entrez Gene ID: 55545]. A short cDNA sequence [EMBL:X74862] related to the Msx gene family was reported previously [35]; the former Entrez Gene record labeled HSHPX5 was based on this sequence. This locus was later provisionally called MSX4, as it was distinct from human MSX1 and MSX2, and by synteny it was clearly not the ortholog of mouse Msx3 [27]. It is now clear that this locus was formed by retrotransposition of mRNA from MSX2 and hence we name it MSX2P1. The genomic sequence of MSX2P1 can now be accessed via the Reference Sequence collection [RefSeq:NR_002307]. The pseudogene shares 91% sequence identity with MSX2 mRNA, lacks intronic sequence, and has remnants of a 3' poly(A) tail. It is intriguing, but probably coincidental, that the MSX2P1 pseudogene has integrated at 17q23.2, close to several ANTP-class genes (HOXB cluster, MEOX1, DLX3 and DLX4).

NANOGP1 [Entrez Gene ID: 404635]. We follow Booth and Holland [22] and classify NANOGP1 as a pseudogene that arose by tandem duplication of NANOG. The alternative view, argued by Hart et al [36], is that this locus is a functional gene, and should be named NANOG2. There is evidence for transcription of this locus in human embryonic stem cells [36], and for selection-driven conservation of the open reading frame [37], but as yet no clear evidence for function.

NANOGP8 [Entrez Gene ID: 388112]. We follow Booth and Holland [22] and classify NANOGP8 as a retrotransposed pseudogene. The alternative view, argued by Zhang et al [38], is that this locus is a functional retrogene. There is evidence for transcription and translation of this locus in cancer cell lines and tumors [38], but no evidence yet for a role in normal tissues.

VENTXP1 [Entrez Gene ID: 139538], VENTXP2 [Entrez Gene ID: 347975], VENTXP3 [Entrez Gene ID: 349814] and VENTXP4 [Entrez Gene ID: 152101]. These four VENTX retrotransposed pseudogenes have been reported previously, and were originally known as VENTX2P1 to VENTX2P4. The correction of the VENTX2 gene symbol to simply VENTX (see above) means that each of the pseudogene names should also change; we rename them VENTXP1 to VENTXP4. VENTXP1 is transcribed but due to mutations it can no longer encode a homeodomain protein; it can however encode an antigenic peptide (NA88A) responsible for T-cell stimulation in response to melanoma [39].

VENTXP5 [Entrez Gene ID: 442384]. We designate this previously unnamed sequence VENTXP5 because it is clearly a retrotransposed pseudogene of VENTX. The genomic sequence of VENTXP5 can now be accessed via the Reference Sequence collection [RefSeq:NG_005091]. The pseudogene shares 83% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence), lacks intronic sequence, and has remnants of a 3' poly(A) tail.

VENTXP6 [Entrez Gene ID: 552879]. We designate this previously unannotated sequence VENTXP6 because it is clearly a retrotransposed pseudogene of VENTX. Its lack of annotation may reflect the fact that it is located within an intron of an unrelated and well characterized gene, STAU2. The genomic sequence of VENTXP6 can now be accessed via the Reference Sequence collection [RefSeq:NG_005090]. The pseudogene shares 87% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence) and lacks intronic sequence.

VENTXP7 [Entrez Gene ID: 391518]. A short cDNA sequence [EMBL:X74864] was reported previously and named HPX42 [35]. This was later renamed the VENTX1 gene, after it was found to be related to Xenopus Ventx-family genes. Our analysis of the genomic sequence at this locus reveals that it is actually a retrotransposed pseudogene of the VENTX gene (formerly VENTX2); thus we designate it VENTXP7. The genomic sequence of VENTXP7 can now be accessed via the Reference Sequence collection [RefSeq:NR_002311]. The pseudogene shares 86% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence), lacks intronic sequence, and has remnants of a 3' poly(A) tail.
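The percent identity values quoted for these pseudogenes compare each pseudogene with the parental mRNA. The exact alignment procedure is not described in this section, so the following Python function is only a minimal sketch of one common convention: identity is counted over aligned columns in which neither sequence has a gap. The function name and the toy sequences are hypothetical and are not real MSX2 or VENTX data; a masked region (such as the Alu element mentioned above) could be handled by replacing it with gap characters before the comparison.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap (or masked) columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy, made-up aligned sequences for illustration only.
mrna = "ATGCGTACGT-ACGT"
pseudo = "ATGCGTTCGTAACGA"
print(round(percent_identity(mrna, pseudo), 1))
```

Other conventions (for example, counting gap columns as mismatches) give different values, so any comparison with the figures in the text should state which convention was used.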

One other gene could conceivably be included in the ANTP class, but is excluded from our survey. This gene [Entrez Gene ID: 360030; GenBank:AY151139] has been annotated as a homeobox gene and is located just 20 kb from NANOG. However, no homeodomain was detected when the deduced protein was analyzed for conserved domains. Also, secondary structure prediction did not predict the expected organisation of alpha helices. Alignment with the NANOG homeodomain reveals identity of the KQ and WF motifs, either side of the same intron position (44/45), but few other shared residues. It is possible, but unproven, that the locus arose by tandem duplication of part, or all, of the NANOG homeobox gene. This gene has generated two retrotransposed pseudogenes: one at 2q11.2 and another at 12q24.33.

The PRD homeobox class

The PRD class derives its name from the paired (prd) gene of Drosophila melanogaster. In previous studies, the PRD class has been subdivided in several different ways, often based on the identity of the amino acid at residue 50 in the homeodomain, for example S50, K50 and Q50. These categories are not monophyletic groupings of genes and so can be misleading if we aim for a classification scheme that reflects evolution [5]. Here we divide the PRD class into two subclasses of unequal size: the PAX subclass (containing seven PAX genes, excluding PAX1 and PAX9), and the PAXL subclass (containing 43 non-PAX genes and many pseudogenes) (Table 1). PAX genes are defined by possession of a conserved paired-box motif, distinct from the homeobox, coding for the 128-amino-acid PRD domain. Of the nine human genes possessing a paired-box (PAX1 to PAX9), only four also contain a complete homeobox (PAX3, PAX7, PAX4 and PAX6). Three genes have a partial homeobox (PAX2, PAX5 and PAX8), while two lack a homeobox entirely (PAX1 and PAX9). Phylogenetic analyses using PAX genes from a range of species suggest that these are secondary conditions, and that the ancestral PAX gene probably possessed both motifs [40]. The PAX genes do not constitute a single gene family, because it is clear that the latest common ancestor of the Bilateria contained four PAX genes. Three of these are ancestors of the PRD-class homeobox gene families Pax2/5/8, Pax3/7 and Pax4/6; the fourth is the ancestor of PAX1 and PAX9. Thus the PAX subclass contains three gene families. We divide the PAXL subclass into 28 gene families, although as explained below not all of these date to the base of the Bilateria. Thus, we recognize a total of 31 gene families in the PRD class (Table 1).
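The counts in this paragraph can be reproduced from the homeobox status of the nine PAX genes. The Python sketch below encodes only what the text states; the variable names and the status labels ("complete", "partial", "none") are assumptions made for illustration.

```python
# Homeobox status of the nine human PAX genes, as stated in the text.
pax_homeobox = {
    "PAX1": "none", "PAX9": "none",
    "PAX2": "partial", "PAX5": "partial", "PAX8": "partial",
    "PAX3": "complete", "PAX7": "complete",
    "PAX4": "complete", "PAX6": "complete",
}

# PAX subclass of the PRD class: PAX genes with a complete or partial homeobox.
pax_subclass = sorted(g for g, status in pax_homeobox.items() if status != "none")

PAXL_GENES = 43          # non-PAX genes in the PAXL subclass (from the text)
prd_total = len(pax_subclass) + PAXL_GENES
prd_families = 3 + 28    # PAX-subclass families + PAXL-subclass families

print(pax_subclass, prd_total, prd_families)
```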

Many of the 31 gene families in the PRD class have been clearly defined before. We draw attention here to newly defined gene families and cases that could cause confusion. Other details can be found in Table 3.

Table 3.

Human PRD class homeobox genes and pseudogenes

| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
| Alx | ALX1 | ALX homeobox 1 | 12q21.31 | 8092 | CART1 |
| | ALX3 | ALX homeobox 3 | 1p13.3 | 257 | |
| | ALX4 | ALX homeobox 4 | 11p11.2 | 6059 | |
| Argfx | ARGFX | arginine-fifty homeobox | 3q13.33 | 503582 | |
| | ARGFXP1 | arginine-fifty homeobox pseudogene 1 | 5q23.2 | 503583 | |
| | ARGFXP2 | arginine-fifty homeobox pseudogene 2 | 17q11.2 | 503640 | |
| Arx | ARX | aristaless related homeobox | Xp21.3 | 170302 | ISSX |
| Dmbx | DMBX1 | diencephalon/mesencephalon brain homeobox 1 | 1p34.1 | 127343 | MBX, OTX3, PAXB |
| Dprx | DPRX | divergent paired-related homeobox | 19q13.42 | 503834 | |
| | DPRXP1 | divergent paired-related homeobox pseudogene 1 | 2q32.1 | 503641 | |
| | DPRXP2 | divergent paired-related homeobox pseudogene 2 | 6p21.31 | 503643 | |
| | DPRXP3 | divergent paired-related homeobox pseudogene 3 | 14q13.2 | 503644 | |
| | DPRXP4 | divergent paired-related homeobox pseudogene 4 | 17q11.2 | 503645 | |
| | DPRXP5 | divergent paired-related homeobox pseudogene 5 | 21q22.13 | 503646 | |
| | DPRXP6 | divergent paired-related homeobox pseudogene 6 | Xp11.4 | 503647 | |
| | DPRXP7 | divergent paired-related homeobox pseudogene 7 | Xq23 | 503648 | |
| Drgx | DRGX | dorsal root ganglia homeobox | 10q11.23 | 644168 | DRG11, PRRXL1 |
| Dux | DUXA | double homeobox A | 19q13.43 | 503835 | |
| | DUXAP1 | double homeobox A pseudogene 1 | 2p11.2 | 503630 | |
| | DUXAP2 | double homeobox A pseudogene 2 | 8q22.3 | 503631 | |
| | DUXAP3 | double homeobox A pseudogene 3 | 10q11.21 | 503632 | |
| | DUXAP4 | double homeobox A pseudogene 4 | 10q11.21 | 503633 | |
| | DUXAP5 | double homeobox A pseudogene 5 | 11q23.3 | 503634 | |
| | DUXAP6 | double homeobox A pseudogene 6 | 15q26.1 | 503635 | |
| | DUXAP7 | double homeobox A pseudogene 7 | 20p11.23 | 503636 | |
| | DUXAP8 | double homeobox A pseudogene 8 | 22q11.21 | 503637 | |
| | DUXAP9 | double homeobox A pseudogene 9 | 14qcen | 503638 | |
| | DUXAP10 | double homeobox A pseudogene 10 | 14q11.2 | 503639 | |
| | DUXB | double homeobox B | 16q23.1 | 100033411 | |
| Esx | ESX1 | ESX homeobox 1 | Xq22.2 | 80712 | ESX1L, ESXR1 |
| Gsc | GSC | goosecoid homeobox | 14q32.13 | 145258 | GSC1 |
| | GSC2 | goosecoid homeobox 2 | 22q11.21 | 2928 | GSCL |
| Hesx | HESX1 | HESX homeobox 1 | 3p14.3 | 8820 | RPX, ANF |
| Hopx | HOPX | HOP homeobox | 4q12 | 84525 | HOP, OB1, LAGY, NECC1, SMAP31 |
| Isx | ISX | intestine specific homeobox | 22q12.3 | 91464 | RAXLX |
| Leutx | LEUTX | Leucine twenty homeobox | 19q13.2 | 342900 | |
| Mix | MIXL | Mix paired-like homeobox | 1q42.12 | 83881 | MIX, MIXL1, MILD1 |
| Nobox | NOBOX | NOBOX oogenesis homeobox | 7q35 | 135935 | OG2, OG2X |
| Otp | OTP | orthopedia homeobox | 5q14.1 | 23440 | |
| Otx | OTX1 | orthodenticle homeobox 1 | 2p15 | 5013 | |
| | OTX2 | orthodenticle homeobox 2 | 14q22.3 | 5015 | |
| | OTX2P1 | orthodenticle homeobox 2 pseudogene | 9q21.2 | 100033409 | OTX2P |
| | CRX | cone-rod homeobox | 19q13.32 | 1406 | OTX3 |
| Pax2/5/8 | PAX2 | paired box 2 | 10q24.31 | 5076 | |
| | PAX5 | paired box 5 | 9p13.2 | 5079 | BSAP |
| | PAX8 | paired box 8 | 2q13 | 7849 | |
| Pax3/7 | PAX3 | paired box 3 | 2q36.1 | 5077 | HUP2 |
| | PAX7 | paired box 7 | 1p36.13 | 5081 | HUP1, PAX7B |
| Pax4/6 | PAX4 | paired box 4 | 7q32.1 | 5078 | |
| | PAX6 | paired box 6 | 11p13 | 5080 | |
| Phox | PHOX2A | paired-like homeobox 2a | 11q13.4 | 401 | PMX2A, ARIX |
| | PHOX2B | paired-like homeobox 2b | 4p13 | 8929 | PMX2B, NBPhox |
| Pitx | PITX1 | pituitary homeobox 1 | 5q31.1 | 5307 | PTX1, POTX, BFT |
| | PITX2 | pituitary homeobox 2 | 4q25 | 5308 | PTX2, ARP1, RGS, RIEG, RIEG1 |
| | PITX3 | pituitary homeobox 3 | 10q24.32 | 5309 | PTX3 |
| Prop | PROP1 | PROP paired-like homeobox 1 | 5q35.3 | 5626 | |
| Prrx | PRRX1 | paired related homeobox 1 | 1q24.3 | 5396 | PRX1, PMX1, PHOX1 |
| | PRRX2 | paired related homeobox 2 | 9q34.11 | 51450 | PRX2, PMX2 |
| Rax | RAX | retina and anterior neural fold homeobox | 18q21.31 | 30062 | RX |
| | RAX2 | retina and anterior neural fold homeobox 2 | 19p13.3 | 84839 | QRX, RAXL1 |
| Rhox | RHOXF1 | Rhox homeobox family, member 1 | Xq24 | 158800 | PEPP1, OTEX |
| | RHOXF2 | Rhox homeobox family, member 2 | Xq24 | 84528 | PEPP2 |
| | RHOXF2B | Rhox homeobox family, member 2B | Xq24 | 727940 | PEPP2L |
| Sebox | SEBOX | SEBOX homeobox | 17q11.2 | 645832 | OG9, OG9X |
| Shox | SHOX | short stature homeobox | Xp22.33/Yp11.32 | 6473 | SHOXY, GCFX, PHOG |
| | SHOX2 | short stature homeobox 2 | 3q25.32 | 6474 | SHOT, OG12, OG12X |
| Tprx | TPRX1 | tetra-peptide repeat homeobox 1 | 19q13.32 | 284355 | |
| | TPRX2P | tetra-peptide repeat homeobox 2 pseudogene | 19q13.32 | 503627 | |
| | TPRX1P1 | tetra-peptide repeat homeobox 1 pseudogene 1 | 10q22.3 | 503628 | |
| | TPRX1P2 | tetra-peptide repeat homeobox 1 pseudogene 2 | 10q22.3 | 503629 | |
| | TPRXL | tetra-peptide repeat homeobox-like | 3p25.1 | 348825 | |
| Uncx | UNCX | UNC homeobox | 7p22.3 | 340260 | PHD1, UNCX4.1 |
| Vsx | VSX1 | visual system homeobox 1 | 20p11.21 | 30813 | KTCN, RINX |
| | VSX2 | visual system homeobox 2 | 14q24.3 | 338917 | RET1, HOX10, CHX10 |

Human PRD class homeobox genes and pseudogenes including full names, chromosomal locations, Entrez Gene IDs and previous symbols. Pax2/5/8-family genes contain a partial homeobox. RHOXF2B is a duplicate of RHOXF2. TPRX2P is a duplicate of TPRX1.

∘ Argfx, Dprx and Tprx gene families. There are no known invertebrate members of these three gene families. Therefore, these are exceptions to the rule defining gene families as dating to the base of the Bilateria. The Dprx and Tprx gene families may have arisen by duplication and very extensive divergence from CRX, a member of the Otx gene family, during mammalian evolution; origins of ARGFX are obscure [21].

∘ Dux gene family. Members of this gene family are characterized by the presence of two closely-linked homeobox motifs. Most members are intronless sequences present in multiple polymorphic copies within the 3.3 kb family of tandemly repeated elements associated with heterochromatin. These comprise the sequences known as DUX1 to DUX5 reported in previous studies [12-14] and numerous DUX4 copies detected in this study (see below). The absence of introns suggests that these sequences may have originated by retrotransposition from an mRNA transcript, thus they are probably non-functional. There are two notable exceptions: the members known as DUXA and DUXB possess introns, so either one could be the progenitor of the large number of intronless Dux-family sequences found in the human genome. DUXA has spawned 10 retrotransposed pseudogenes and has been described previously [21]. DUXB is described here (see below).

∘ Hopx gene family. Phylogenetic analyses place this gene family, containing a single very divergent homeobox gene HOPX (formerly HOP), either within the PRD class (maximum likelihood; Additional file 1) or close to Zhx/Homez-family genes (neighbor-joining; Additional file 2). We favor placement in the PRD class for three reasons. First, the HOPX homeodomain has highest sequence identity with PRD-class homeodomains (GSC: 38% and PAX6: 36%). Second, the HOPX homeodomain possesses the same combination of residues that are invariably conserved across human PRD-class homeodomains (Additional file 6). Third, the HOPX homeodomain shares the 46/47 intron position seen in many PRD-class homeodomains. HOPX does not map particularly near any other homeobox genes, although the closest is GSX2 in the ANTP class at 4q12 (Figure 4). HOPX is not a typical PRD-class homeobox gene; the homeodomain has a single amino acid insertion between helix I and helix II (Additional file 6), and lacks the ability to bind DNA [41,42].

∘ Leutx gene family. This gene family contains a single gene in the human genome, LEUTX, and no known invertebrate members. We place LEUTX in the PRD class for four reasons. First, there is weak phylogenetic support for this placement (Additional files 1 and 2). Second, the LEUTX homeodomain possesses the same combination of residues that are invariably conserved across human PRD-class homeodomains (except for a leucine at position 20; Additional file 6). Third, the LEUTX homeodomain shares the 46/47 intron position seen in many PRD-class homeodomains. Fourth, the LEUTX gene is located close to the PRD-class genes TPRX1, CRX, DPRX and DUXA on the distal end of the long arm of chromosome 19 (Figure 4). This fourth observation leads us to hypothesize that this gene family arose by tandem duplication and extensive divergence during mammalian evolution.

∘ Nobox gene family. This gene family falls close to the division between the ANTP and PRD classes in both maximum likelihood and neighbor-joining phylogenetic analyses (Additional files 1 and 2). We favor placement within the PRD class because the NOBOX homeodomain has higher sequence identity with PRD-class homeodomains (up to 55%) than with ANTP-class homeodomains (up to 46%). Chromosomal position does not shed light on the issue, as its location at 7q35 is close to both ANTP- and PRD-class genes (Figure 4).

∘ Otx gene family. This very well known gene family was originally considered to contain human OTX1 and OTX2 (and their mouse orthologs) and the Drosophila otd gene [43]. Later, it was shown that the CRX gene is a member of the same gene family, deriving from the same ancestral gene. Thus, CRX could be considered the true OTX3 gene [44]. Unfortunately, the OTX3 symbol was formerly used erroneously for a gene in a different family, now called DMBX1, thus complicating its future use. The gene family name Otx is derived by majority rule from the constituent genes.

∘ Pax2/5/8 gene family. This gene family is also known as Pax group II; it contains PAX2, PAX5 and PAX8, clearly derived from a single ancestral gene [45]. These genes have partial homeoboxes.

∘ Pax3/7 gene family. This gene family is also known as Pax group III; it contains PAX3 and PAX7, clearly derived from a single ancestral gene [46].

∘ Pax4/6 gene family. This gene family is also known as Pax group IV; it contains PAX4 and PAX6. There is confusion as to whether this should be split into two gene families, because invertebrate homologs generally group with PAX6 in phylogenetic analyses and not as an outgroup to the two genes as might be expected. We follow the generally accepted view and group PAX4 and PAX6 into a single gene family, proposing that PAX4 is a divergent member, not an ancient gene [40].

∘ Rhox gene family. The mouse Rhox cluster was first described as comprising twelve X-linked homeobox genes, all selectively expressed in reproductive tissues [47]. Subsequent studies reported a total of 32 genes in the cluster, with the additional genes attributed to recent tandem duplications [48-51]. The human genome contains three homeobox genes at Xq24 that are clearly members of the Rhox gene family based on sequence identity, molecular phylogenetics, intron positions and chromosomal location. These are RHOXF1 (formerly OTEX/PEPP1), RHOXF2 (formerly PEPP2) and RHOXF2B (formerly PEPP2b/PEPP3).

Most of the 50 genes in the PRD class have been adequately named previously. However, several genes were unnamed or misnamed prior to this study. We have updated these as follows.

ALX1 [Entrez Gene ID: 8092] is the first of three human members of the Alx gene family. This gene was previously known as CART1; we rename it ALX1 because it is related to ALX3 and ALX4; all three genes were formed by duplication from a single ancestral invertebrate gene [52].

DRGX [Entrez Gene ID: 644168] is the only member of the newly defined Drgx gene family in the human genome. This gene was previously known as PRRXL1 and DRG11, and there is a clear mouse ortholog (Prrxl1). The symbol PRRXL1 is misleading because it implies membership of the Prrx gene family, containing PRRX1 and PRRX2 in the human genome. Several lines of evidence suggest it belongs to a different gene family. First, this gene (at 10q11.23) is not located in the same paralogon as PRRX1 (1q24.3) and PRRX2 (9q34.11) so they are not three paralogs generated during genome duplication in early vertebrate evolution. Second, it has a completely different exon-intron structure from the Prrx-family genes, and it does not contain a Prrx domain or an OAR domain (present in PRRX1 and PRRX2; [53]). Third, the homeodomain is only 73% identical to PRRX1 and PRRX2 homeodomains, much lower than the 80-100% usually encountered for members of the same gene family in humans. Finally, we have identified the Drosophila ortholog, IP09201. The homeodomains of Drosophila IP09201 and human DRGX form a highly supported monophyletic group in our maximum likelihood (90%; Additional file 1) and neighbor-joining (97%; Additional file 2) phylogenetic analyses. The new symbol DRGX (dorsal root ganglia homeobox) incorporates the root of the former symbol DRG11, referring to expression of the rodent ortholog in dorsal root ganglia neurons [54].

DUXB [Entrez Gene ID: 100033411] is a human member of the Dux (double homeobox) gene family. As previously discussed, most members of this gene family are intronless and are probably derived by retrotransposition of an mRNA transcript from a functional intron-containing Dux gene (or duplication of such an integrant). Booth and Holland [21] described the DUXA gene containing five introns (including one within each homeobox), and noted the existence of a second intron-containing human Dux-family gene provisionally designated DUXB. The DUXB nomenclature is endorsed here. No cDNA or EST sequences have been reported for DUXB.

GSC2 [Entrez Gene ID: 2928] is the second of two human members of the Gsc gene family. This gene was previously known as GSCL; we rename it GSC2 to remove the inadvertent implication that it is not a true gene, and also to reflect the clear orthology to chick Gsc2 as inferred by phylogenetic analysis and synteny.

HOPX [Entrez Gene ID: 84525] is the only member of the newly defined Hopx gene family in the human genome. The mouse version of the gene was identified first and named Hop (homeodomain only protein) because the encoded protein is just 73 amino acids long, with 61 of these making up the homeodomain [41,42]. The HOP gene symbol is not ideal as it is also used for unrelated genes, including hopscotch in Drosophila and hop-sterile in mouse. Therefore, we revise the gene symbol from HOP to HOPX (HOP homeobox) in accordance with homeobox gene nomenclature convention.

LEUTX [Entrez Gene ID: 342900] is the only member of the newly defined Leutx gene family in the human genome. We designate this previously unnamed gene LEUTX (leucine twenty homeobox) to reflect the presence of a leucine residue at the otherwise highly conserved homeodomain position 20; other PRD-class homeodomains have a phenylalanine at this position (Additional file 6). Studies of mutations in other homeobox genes suggest that mutation to leucine alters transcriptional activity of a homeodomain protein [55].

RAX2 [Entrez Gene ID: 84839] is the second of two human members of the Rax gene family. This gene was previously known as RAXL1; we rename it RAX2 to standardize nomenclature.

RHOXF1 [Entrez Gene ID: 158800] and RHOXF2 [Entrez Gene ID: 84528] are two of three human members of the Rhox gene family. These genes were previously known as OTEX/PEPP1 and PEPP2 respectively. The prefix PEPP is not suitable as it is used for numerous aminopeptidase P-encoding genes. Thus, we replace the gene symbols OTEX/PEPP1 and PEPP2 with RHOXF1 and RHOXF2 respectively, to reflect their orthologous relationship with the mouse Rhox cluster (containing 32 genes, see above) whilst avoiding inadvertent equivalence to specific genes within the cluster.

RHOXF2B [Entrez Gene ID: 727940] is the third human member of the Rhox gene family. This locus was referred to in previous studies as PEPP2b [56] and PEPP3 [51]. The prefix PEPP cannot be approved for reasons noted above. RHOXF2B is located very close to RHOXF1 and RHOXF2 at Xq24 and is clearly a very recent duplicate of RHOXF2. The genomic sequences at these two loci share 99% identity over exonic, intronic and approximately 20 kb flanking regions. Over the coding region, there are just two nucleotide substitutions (both nonsynonymous); one of these results in an unusual change within the homeodomain (arginine to cysteine at position 18). We currently list RHOXF2B as a functional gene, although it is possible that it is a duplicated pseudogene.

SEBOX [Entrez Gene ID: 645832] is the only member of the Sebox gene family in the human genome. The human gene is the ortholog of mouse Sebox based on their locations in syntenic chromosomal regions (17q11.2 and 11B5 respectively) and presence of the same intron positions. However, sequence identity is lower than normal for orthologous genes in mouse and human (78% amino acid identity over the homeodomain) and there is evidence that the human gene has undergone divergence. Most surprisingly, the human sequence has two unusual substitutions in the homeodomain [57]. At homeodomain position 51, the human sequence codes for lysine whereas mouse has asparagine; an earlier analysis of 346 homeodomain sequences found asparagine to be invariant at this position [1,2]. Similarly, at homeodomain position 53, human has tryptophan whereas mouse has arginine; this position is almost invariably arginine [1,2]. These sequence changes in the important third helix raise the possibility that human SEBOX could have accumulated mutations as a non-functional pseudogene. Until this is shown more clearly we consider it to be a functional, but divergent, gene. This gene was previously known as OG9X with SEBOX as the alternative symbol; we favor SEBOX because the OG prefix was originally used for several unrelated homeobox genes.
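The SEBOX argument rests on checking specific homeodomain positions against the residues expected there. A minimal Python sketch of such a check follows. The expected residues (asparagine at position 51, arginine at position 53) are taken from the text; the function names and the placeholder sequences are hypothetical, and the placeholders are synthetic strings of the correct length, not real SEBOX homeodomains.

```python
# Residues the text describes as invariant (51) or almost invariant (53),
# keyed by 1-based homeodomain position.
EXPECTED = {51: "N", 53: "R"}


def unusual_residues(homeodomain: str, expected=EXPECTED):
    """Return {position: (expected, observed)} for positions that deviate."""
    found = {}
    for pos, residue in expected.items():
        observed = homeodomain[pos - 1]  # 1-based position to 0-based index
        if observed != residue:
            found[pos] = (residue, observed)
    return found


def placeholder(pos51: str, pos53: str) -> str:
    """Build a synthetic 60-residue string; X marks unspecified residues."""
    return "X" * 50 + pos51 + "X" + pos53 + "X" * 7


mouse_like = placeholder("N", "R")   # residues reported for mouse
human_like = placeholder("K", "W")   # residues reported for human
print(unusual_residues(mouse_like))
print(unusual_residues(human_like))
```

The same pattern extends to other diagnostic positions mentioned in this section, such as position 20 in LEUTX, by adding entries to the expected-residue dictionary.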

UNCX [Entrez Gene ID: 340260] is the only member of the Uncx gene family in the human genome. This gene was previously known as UNCX4.1; we remove the numerals to give UNCX as these do not denote a series within a gene family.

VSX2 [Entrez Gene ID: 338917] is the second of two human members of the Vsx gene family. This gene was previously known as CHX10; we rename it VSX2 to better reflect its paralogous relationship to VSX1. VSX2 has been used as an alias for this gene in other vertebrate species and the gene symbol CHX10 has the disadvantage of implicitly suggesting presence of at least nine paralogs in human (CHX1 to CHX9), which do not exist.
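For readers working with older literature or annotation files, the PRD-class symbol revisions described above can be collected into a simple lookup. The Python sketch below lists only revisions stated in this section; the dictionary and function names are hypothetical.

```python
# Former symbol -> revised symbol, for the PRD-class revisions listed above.
PRD_SYMBOL_REVISIONS = {
    "CART1": "ALX1",
    "PRRXL1": "DRGX",
    "DRG11": "DRGX",
    "GSCL": "GSC2",
    "HOP": "HOPX",
    "RAXL1": "RAX2",
    "OTEX": "RHOXF1",
    "PEPP1": "RHOXF1",
    "PEPP2": "RHOXF2",
    "PEPP2b": "RHOXF2B",
    "PEPP3": "RHOXF2B",
    "OG9X": "SEBOX",
    "UNCX4.1": "UNCX",
    "CHX10": "VSX2",
}


def current_symbol(symbol: str) -> str:
    """Return the revised symbol, or the input unchanged if none is listed."""
    return PRD_SYMBOL_REVISIONS.get(symbol, symbol)


print(current_symbol("CHX10"), current_symbol("PAX6"))
```

The lookup is case-sensitive, which matters here because PEPP2 and PEPP2b map to different loci.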

Unlike the situation with the ANTP class, many of the pseudogenes within the PRD class have been well characterized. A previous study has described and named two pseudogenes in the Argfx gene family, seven pseudogenes in the Dprx gene family, four pseudogenes in the Tprx gene family, and 10 pseudogenes derived from the DUXA gene [21]. There is also a possibility that the SEBOX and RHOXF2B loci are non-functional pseudogenes, as described above. We have identified a previously undescribed pseudogene from the Otx gene family (OTX2P1), and argue that the majority of Dux-family sequences are pseudogenes.

OTX2P1 [Entrez Gene ID: 100033409]. We designate this previously undescribed sequence OTX2P1 because it is clearly a retrotransposed pseudogene of OTX2. The genomic DNA sequence of OTX2P1 shares significant homology with OTX2 transcript variant 2 [RefSeq:NM_172337]. There is an Alu element (AluSx subfamily) insertion, a Made1 (Mariner derived element 1) insertion, and a 1182-nucleotide deletion in OTX2P1 compared to OTX2. The OTX2P1 sequence lacks introns, ends with a poly(A) tail, and harbors critical sequence alterations (including a three-nucleotide insertion introducing a stop codon into the deduced homeodomain).

DUX1 [EMBL:AJ001481], DUX2 [GenBank:AF068744], DUX3 [GenBank:AF133130] and DUX5 [GenBank:AF133131]. These sequences have been cloned in previous studies [12,13]. We detected no matches with 100% identity to DUX1, DUX2, DUX3 or DUX5 in build 35.1 of the human genome sequence, which covers the euchromatic regions of each chromosome. This concurs with previous studies indicating that DUX1, DUX2, DUX3 and DUX5 are found in heterochromatin on human acrocentric chromosomes; each is apparently present in multiple copies within members of the 3.3 kb family of tandemly repeated DNA elements [12,13]. Because the majority of human heterochromatin has not been sequenced, and may be variable between individuals, the exact number of copies of DUX1, DUX2, DUX3 and DUX5 is unknown. It is also debatable whether these loci encode functional proteins. These sequences lack introns and, as discussed above, are most likely derived from intron-containing genes in the Dux family, such as DUXA or DUXB.

DUX4 [GenBank:AF117653]. This sequence has been extensively studied as some of its multiple copies exist within the 3.3 kb repetitive elements of the D4Z4 locus at 4q35 [14]. The polymorphic D4Z4 locus is linked to facioscapulohumeral muscular dystrophy (FSHD); between 12 and 96 tandem copies of 3.3 kb elements are present in unaffected individuals and deletions leaving a maximum of eight such elements have been associated with FSHD [58]. In build 35.1 of the human genome sequence, we identified 35 loci at 10 chromosomal locations containing a total of 58 DUX4 (and highly similar) homeobox sequences. This should not be taken as a precise figure due to copy number polymorphism and the possibility of additional copies existing in currently unsequenced heterochromatic regions. Some of the copies are 100% identical to the previously reported DUX4 sequence over the homeobox regions, others have single nucleotide polymorphisms, some have critical sequence mutations, and others have just a single homeobox. Most of the copies are located in tandemly repeated arrays (for example, on chromosomes 4, 10 and 16) and others are alone in the genome (for example, a single copy resides at 3p12.3). The majority of DUX4 copies are unlikely to encode functional proteins as suggested by their intronless, mutated and tandemly repeated nature. The lack of introns indicates they are most likely derived from intron-containing genes in the Dux family, such as DUXA or DUXB.
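The D4Z4 repeat ranges quoted above (12 to 96 elements in unaffected individuals; at most eight elements associated with FSHD [58]) can be expressed as a small lookup. The Python sketch below simply restates those ranges; the function name and return labels are hypothetical, counts between the two ranges are reported as falling outside what the text states, and nothing here is intended as a diagnostic rule.

```python
def d4z4_category(n_repeats: int) -> str:
    """Place a D4Z4 3.3 kb repeat count into the ranges quoted in the text."""
    if n_repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if n_repeats <= 8:
        return "FSHD-associated range (at most 8 elements)"
    if 12 <= n_repeats <= 96:
        return "range reported in unaffected individuals (12 to 96)"
    return "outside the ranges quoted in the text"


for n in (6, 10, 40):
    print(n, d4z4_category(n))
```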

The LIM homeobox class

The LIM class encodes proteins with two LIM domains (named from the nematode lin-11, mammalian Isl1 and nematode mec-3 genes) N-terminal to a typical (i.e. 60-amino-acid) homeodomain. The LIM domain is a protein-protein interaction domain of approximately 55 amino acids comprising two specialised cysteine-rich zinc fingers in tandem [59]. Importantly, human genes also exist that encode LIM domains but not homeodomains. These LIM domains are divergent from the LIM domains encoded by LIM homeobox genes, and hence these genes are unlikely to be derived by loss of the homeobox. There is one exception: the human Lmo gene family encodes LIM domains that have been grouped by sequence similarity and domain arrangement with the LIM domains of the LIM homeobox gene class [59]. Thus, this gene family may have secondarily lost the homeobox, although this remains untested. Only genes encoding both LIM domains and homeodomains are included in our LIM homeobox gene count.

We have identified a total of twelve LIM-class homeobox genes in the human genome (Tables 1 and 4), consistent with previous work [60]. Phylogenetic analyses of homeodomains do not always recover the LIM class as a monophyletic group, depending on the dataset and method used (Figure 3; Additional files 1, 2 and 5), but it is likely that the class evolved from a single fusion event that brought together LIM domains and a homeodomain. Phylogenetic analyses of homeodomains divide the LIM class into six gene families (Figure 3; Additional files 1, 2 and 5), consistent with previous studies [60]. Each gene family has two human members and dates to a single ancestral gene in the most recent common ancestor of bilaterians [60]. We have not found any human LIM-class pseudogenes.

Table 4.

Human LIM, POU, HNF, SINE, TALE, CUT, PROS, ZF and CERS class homeobox genes and pseudogenes

Human LIM-class homeobox genes
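| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Isl | ISL1 | ISL LIM homeobox 1 | 5q11.2 | 3670 | |
| | ISL2 | ISL LIM homeobox 2 | 15q24.3 | 64843 | |
| Lhx1/5 | LHX1 | LIM homeobox 1 | 17q12 | 3975 | LIM1 |
| | LHX5 | LIM homeobox 5 | 12q24.13 | 64211 | |
| Lhx2/9 | LHX2 | LIM homeobox 2 | 9q33.3 | 9355 | LH2 |
| | LHX9 | LIM homeobox 9 | 1q31.3 | 56956 | |
| Lhx3/4 | LHX3 | LIM homeobox 3 | 9q34.3 | 8022 | M2-LHX3 |
| | LHX4 | LIM homeobox 4 | 1q25.3 | 89884 | GSH4 |
| Lhx6/8 | LHX6 | LIM homeobox 6 | 9q33.2 | 26468 | LHX6.1 |
| | LHX8 | LIM homeobox 8 | 1p31.1 | 431707 | LHX7 |
| Lmx | LMX1A | LMX LIM homeobox 1A | 1q24.1 | 4009 | LMX1, LMX1.1 |
| | LMX1B | LMX LIM homeobox 1B | 9q33.3 | 4010 | LMX2, LMX1.2 |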

Human POU-class homeobox genes and pseudogenes

| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Hdx | HDX | highly divergent homeobox | Xq21.1 | 139324 | CXorf43 |
| Pou1 | POU1F1 | POU class 1 homeobox 1 | 3p11.2 | 5449 | PIT1, GHF1 |
| Pou2 | POU2F1 | POU class 2 homeobox 1 | 1q24.2 | 5451 | OCT1, OTF1 |
| | POU2F2 | POU class 2 homeobox 2 | 19q13.2 | 5452 | OCT2, OTF2 |
| | POU2F3 | POU class 2 homeobox 3 | 11q23.3 | 25833 | OCT11, PLA1, EPOC1, SKN1A |
| Pou3 | POU3F1 | POU class 3 homeobox 1 | 1p34.3 | 5453 | OCT6, OTF6, SCIP |
| | POU3F2 | POU class 3 homeobox 2 | 6q16.2 | 5454 | OCT7, OTF7, BRN2, POUF3 |
| | POU3F3 | POU class 3 homeobox 3 | 2q12.1 | 5455 | OTF8, BRN1 |
| | POU3F4 | POU class 3 homeobox 4 | Xq21.1 | 5456 | OTF9, BRN4 |
| Pou4 | POU4F1 | POU class 4 homeobox 1 | 13q31.1 | 5457 | BRN3A, RDC1, Oct-T1 |
| | POU4F2 | POU class 4 homeobox 2 | 4q31.22 | 5458 | BRN3B, BRN3.2 |
| | POU4F3 | POU class 4 homeobox 3 | 5q32 | 5459 | BRN3C |
| Pou5 | POU5F1 | POU class 5 homeobox 1 | 6p21.33 | 5460 | OCT3, OTF3, OCT4, OTF4 |
| | POU5F1P1 | POU class 5 homeobox 1 pseudogene 1 | 8q24.21 | 5462 | OTF3C, OTF3P1, POU5FLC8 |
| | POU5F1P2 | POU class 5 homeobox 1 pseudogene 2 | 8q22.3 | 100009665 | |
| | POU5F1P3 | POU class 5 homeobox 1 pseudogene 3 | 12p13.31 | 642559 | OTF3L, POU5F1L, POU5FLC12 |
| | POU5F1P4 | POU class 5 homeobox 1 pseudogene 4 | 1q22 | 645682 | POU5FLC1 |
| | POU5F1P5 | POU class 5 homeobox 1 pseudogene 5 | 10q21.3 | 100009667 | |
| | POU5F1P6 | POU class 5 homeobox 1 pseudogene 6 | 3q21.3 | 100009668 | |
| | POU5F1P7 | POU class 5 homeobox 1 pseudogene 7 | 3q12.1 | 100009669 | |
| | POU5F1P8 | POU class 5 homeobox 1 pseudogene 8 | 17q25.3 | 100009670 | |
| | POU5F2 | POU class 5 homeobox 2 | 5q15 | 134187 | SPRM1 |
| Pou6 | POU6F1 | POU class 6 homeobox 1 | 12q13.13 | 5463 | BRN5, MPOU, TCFB1 |
| | POU6F2 | POU class 6 homeobox 2 | 7p14.1 | 11281 | WT5, WTSL, RPF1 |

Human HNF-class homeobox genes

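| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Hmbox | HMBOX1 | homeobox containing 1 | 8p12 | 79618 | HNF1LA, PBHNF |
| Hnf1 | HNF1A | HNF1 homeobox A | 12q24.31 | 6927 | TCF1, HNF1, LFB1 |
| | HNF1B | HNF1 homeobox B | 17q12 | 6928 | TCF2, LFB3, VHNF1 |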

Human SINE-class homeobox genes

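| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Six1/2 | SIX1 | SIX homeobox 1 | 14q23.1 | 6495 | |
| | SIX2 | SIX homeobox 2 | 2p21 | 10736 | |
| Six3/6 | SIX3 | SIX homeobox 3 | 2p21 | 6496 | |
| | SIX6 | SIX homeobox 6 | 14q23.1 | 4990 | OPTX2, Six9 |
| Six4/5 | SIX4 | SIX homeobox 4 | 14q23.1 | 51804 | AREC3 |
| | SIX5 | SIX homeobox 5 | 19q13.32 | 147912 | DMAHP |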

Human TALE-class homeobox genes and pseudogenes

| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Irx | IRX1 | iroquois homeobox 1 | 5p15.33 | 79192 | IRX-5 |
| | IRX1P1 | iroquois homeobox 1 pseudogene 1 | 13q12.12 | 646390 | IRXA1 |
| | IRX2 | iroquois homeobox 2 | 5p15.33 | 153572 | |
| | IRX3 | iroquois homeobox 3 | 16q12.2 | 79191 | IRX-1 |
| | IRX4 | iroquois homeobox 4 | 5p15.33 | 50805 | |
| | IRX4P1 | iroquois homeobox 4 pseudogene 1 | 18p11.22 | 100009671 | |
| | IRX5 | iroquois homeobox 5 | 16q12.2 | 10265 | IRX2A |
| | IRX6 | iroquois homeobox 6 | 16q12.2 | 79190 | IRX-3, IRX7 |
| Meis | MEIS1 | Meis homeobox 1 | 2p14 | 4211 | |
| | MEIS2 | Meis homeobox 2 | 15q14 | 4212 | MRG1 |
| | MEIS3 | Meis homeobox 3 | 19q13.32 | 56917 | MRG2 |
| | MEIS3P1 | Meis homeobox 3 pseudogene 1 | 17p12 | 4213 | MRG2, MEIS3, MEIS4 |
| | MEIS3P2 | Meis homeobox 3 pseudogene 2 | 17p11.2 | 257468 | |
| Mkx | MKX | mohawk homeobox | 10p12.1 | 283078 | IRXL1, IFRX, C10orf48 |
| Pbx | PBX1 | pre-B-cell leukemia homeobox 1 | 1q23.3 | 5087 | |
| | PBX2 | pre-B-cell leukemia homeobox 2 | 6p21.32 | 5089 | G17, HOX12, PBX2MHC |
| | PBX2P1 | pre-B-cell leukemia homeobox 2 pseudogene 1 | 3q24 | 5088 | PBXP1, PBX2 |
| | PBX3 | pre-B-cell leukemia homeobox 3 | 9q33.3 | 5090 | |
| | PBX4 | pre-B-cell leukemia homeobox 4 | 19p13.11 | 80714 | |
| Pknox | PKNOX1 | PBX/knotted homeobox 1 | 21q22.3 | 5316 | PREP1, PKNOX1C |
| | PKNOX2 | PBX/knotted homeobox 2 | 11q24.2 | 63876 | PREP2 |
| Tgif | TGIF1 | TGFB-induced factor homeobox 1 | 18p11.31 | 7050 | TGIF, HPE4 |
| | TGIF1P1 | TGFB-induced factor homeobox 1 pseudogene 1 | 19q13.32 | 126052 | |
| | TGIF2 | TGFB-induced factor homeobox 2 | 20q11.23 | 60436 | |
| | TGIF2P1 | TGFB-induced factor homeobox 2 pseudogene 1 | 1q44 | 126826 | |
| | TGIF2P2 | TGFB-induced factor homeobox 2 pseudogene 2 | 15q21.1 | 100009674 | |
| | TGIF2P3 | TGFB-induced factor homeobox 2 pseudogene 3 | 15q21.1 | 100009672 | |
| | TGIF2P4 | TGFB-induced factor homeobox 2 pseudogene 4 | 14q24.2 | 100009673 | |
| | TGIF2LX | TGFB-induced factor homeobox 2-like, X-linked | Xq21.31 | 90316 | TGIFLX (retrogene) |
| | TGIF2LY | TGFB-induced factor homeobox 2-like, Y-linked | Yp11.2 | 90655 | TGIFLY (retrogene) |

Human CUT-class homeobox genes and pseudogenes

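| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Onecut | ONECUT1 | one cut homeobox 1 | 15q21.3 | 3175 | HNF6, HNF6A |
| | ONECUT2 | one cut homeobox 2 | 18q21.31 | 9480 | OC2 |
| | ONECUT3 | one cut homeobox 3 | 19p13.3 | 390874 | |
| Cux | CUX1 | cut-like homeobox 1 | 7q22.1 | 1523 | CUTL1, CUX, CDP, COY1 |
| | CUX2 | cut-like homeobox 2 | 12q24.12 | 23316 | CUTL2 |
| | CUX2P1 | cut-like homeobox 2 pseudogene 1 | 10p14 | - | |
| | CUX2P2 | cut-like homeobox 2 pseudogene 2 | 4q32.1 | - | |
| Satb | SATB1 | SATB homeobox 1 | 3p24.3 | 6304 | |
| | SATB2 | SATB homeobox 2 | 2q33.1 | 23314 | |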

Human PROS-class homeobox genes

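| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Prox | PROX1 | prospero homeobox 1 | 1q41 | 5629 | |
| | PROX2 | prospero homeobox 2 | 14q24.3 | 283571 | |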

Human ZF-class homeobox genes and pseudogenes

| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Adnp | ADNP | activity-dependent neuroprotector homeobox | 20q13.13 | 23394 | ADNP1 |
| | ADNP2 | ADNP homeobox 2 | 18q23 | 22850 | ZNF508 |
| Tshz | TSHZ1 | teashirt zinc finger homeobox 1 | 18q22.3 | 10194 | TSH1 |
| | TSHZ2 | teashirt zinc finger homeobox 2 | 20q13.2 | 128553 | TSH2, ZNF218, ZABC2, OVC10-2 |
| | TSHZ3 | teashirt zinc finger homeobox 3 | 19q12 | 57616 | TSH3, ZNF537 |
| Zeb | ZEB1 | zinc finger E-box binding homeobox 1 | 10p11.22 | 6935 | ZFHX1A, deltaEF1, TCF8, ZEB |
| | ZEB2 | zinc finger E-box binding homeobox 2 | 2q22.3 | 9839 | ZFHX1B, SIP1, SMADIP1 |
| | ZEB2P1 | zinc finger E-box binding homeobox 2 pseudogene 1 | 4p15.32 | 100033412 | |
| Zfhx | ZFHX2 | zinc finger homeobox 2 | 14q11.2 | 85446 | |
| | ZFHX3 | zinc finger homeobox 3 | 16q22.3 | 463 | ATBT, ATBF1 |
| | ZFHX4 | zinc finger homeobox 4 | 8q21.11 | 79776 | ZFH4 |
| Zhx/Homez | ZHX1 | zinc fingers and homeoboxes 1 | 8q24.13 | 11244 | |
| | ZHX2 | zinc fingers and homeoboxes 2 | 8q24.13 | 22882 | |
| | ZHX3 | zinc fingers and homeoboxes 3 | 20q12 | 23051 | TIX1 |
| | HOMEZ | homeobox and leucine zipper encoding | 14q11.2 | 57594 | |

Human CERS-class homeobox genes

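| Family | Gene symbol | Gene name | Location | Entrez gene ID | Previous symbols |
|---|---|---|---|---|---|
| Cers | CERS2 | ceramide synthase 2 | 1p36.13-q24.1 | 29956 | LASS2, TRH3, TMSG1 |
| | CERS3 | ceramide synthase 3 | 15q26.3 | 204219 | LASS3 |
| | CERS4 | ceramide synthase 4 | 19p13.3 | 79603 | LASS4, TRH1 |
| | CERS5 | ceramide synthase 5 | 12q13.12 | 91012 | LASS5, TRH4 |
| | CERS6 | ceramide synthase 6 | 2q31 | 253782 | LASS6 |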

Human homeobox genes and pseudogenes, excepting the ANTP and PRD classes, including full names, chromosomal locations, Entrez Gene IDs and previous symbols. The HOMEZ gene is in the ZF class but encodes a protein with leucine zippers instead of zinc fingers.
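As a quick consistency check, the per-class totals stated in the class sections of this paper for the nine classes covered by Table 4 can be tallied against the overall totals of 235 probable functional genes and 65 probable pseudogenes. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the remainder is simply what arithmetic leaves for the two classes not listed in Table 4 (ANTP and PRD).

```python
# (genes, pseudogenes) per class, as stated in the class sections of the text.
TABLE4_COUNTS = {
    "LIM": (12, 0),
    "POU": (16, 8),
    "HNF": (3, 0),
    "SINE": (6, 0),
    "TALE": (20, 10),
    "CUT": (7, 3),
    "PROS": (2, 0),
    "ZF": (14, 1),
    "CERS": (5, 0),
}

# Overall totals reported for the euchromatic human genome.
TOTAL_GENES = 235
TOTAL_PSEUDOGENES = 65


def tally(counts):
    """Return (total genes, total pseudogenes) over all classes in counts."""
    genes = sum(g for g, _ in counts.values())
    pseudogenes = sum(p for _, p in counts.values())
    return genes, pseudogenes


genes, pseudogenes = tally(TABLE4_COUNTS)
# Whatever is left must belong to the ANTP and PRD classes combined.
remaining_genes = TOTAL_GENES - genes
remaining_pseudogenes = TOTAL_PSEUDOGENES - pseudogenes
print(genes, pseudogenes, remaining_genes, remaining_pseudogenes)
```

Under these figures the nine classes account for 85 genes and 22 pseudogenes, leaving 150 genes and 43 pseudogenes for the ANTP and PRD classes combined.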

The POU homeobox class

The POU class generally encodes proteins with a POU-specific domain (named from the mammalian genes Pit1 (now Pou1f1), OCT1 and OCT2 (now POU2F1 and POU2F2), and nematode unc-86) N-terminal to a typical homeodomain. The POU-specific domain is a DNA-binding domain of approximately 75 amino acids; the POU-specific domain and the homeodomain are collectively known as the bipartite POU domain [61].

We have identified a total of 16 POU-class homeobox genes in the human genome (Tables 1 and 4). The genes form a distinct grouping even if the POU-specific domain is disregarded: phylogenetic analyses of homeodomains recover the POU class as a monophyletic group (Figure 3; Additional files 1, 2 and 5). There are six widely recognized gene families within the POU class (Pou1 to Pou6), and nomenclature revisions approximately 10 years ago clarified which genes belong to which gene family [62]. We have placed two additional genes (HDX and POU5F2) in the POU class on the basis of their deduced homeodomain sequences, even though one of these genes (HDX) does not encode a POU-specific domain. We have erected a new gene family for this gene, bringing the total number of gene families in the POU class to seven. We have also identified a total of eight POU-class pseudogenes in the human genome (Tables 1 and 4); we have named six of these (POU5F1P2, POU5F1P4 to POU5F1P8), and revised the nomenclature of one other (POU5F1P3).

HDX [Entrez Gene ID: 139324]. This gene was previously known as CXorf43. The gene encodes a highly divergent atypical (68-amino-acid) homeodomain but not a POU-specific domain, and thus it is debatable whether it should be placed within the POU class. Phylogenetic analyses of homeodomains place it basally in a clade with the POU class (Figure 3; Additional files 1 and 5), or within the POU class (Additional file 2), suggesting that the HDX protein either diverged before the POU-specific domain became associated with the homeodomain or lost the POU-specific domain during evolution. Further information on this gene may allow this tentative classification to be revisited.

POU5F2 [Entrez Gene ID: 134187]. We designate this previously unnamed gene POU5F2 on the basis of clear orthology to the mouse Sprm1 gene, which has been assigned as the second member of the Pou5 gene family [63]. The symbol POU5F2 ensures the gene conforms with standardized nomenclature for the POU class.

POU5F1P2 [GeneID: 100009665], POU5F1P3 (formerly POU5F1L) [GeneID: 5461], POU5F1P4 [GeneID: 100009666], POU5F1P5 [GeneID: 100009667], POU5F1P6 [GeneID: 100009668], POU5F1P7 [GeneID: 100009669] and POU5F1P8 [GeneID: 100009670]. Prior to this study, a single retrotransposed pseudogene of the POU5F1 gene had been annotated and designated POU5F1P1 [Entrez Gene ID: 5462]. Another POU5F1-related sequence of unknown status had been annotated and designated POU5F1L [GeneID: 5461]. We replace the gene symbol POU5F1L with POU5F1P3 as this sequence is a retrotransposed pseudogene of POU5F1. Our analyses of the human genome sequence identified a further six pseudogenes of POU5F1, which we name sequentially POU5F1P2, POU5F1P4 through to POU5F1P8. Each clearly aligns to the mRNA sequence of POU5F1 but with sequence alterations, indicating origin by retrotransposition. POU5F1P2 and POU5F1P6 have frameshift mutations in the homeobox. POU5F1P5 and POU5F1P6 have stop codons in the homeobox. POU5F1P7 and POU5F1P8 are partial integrants of POU5F1 mRNA excluding the homeobox: POU5F1P7 covers part of the 3' untranslated region and POU5F1P8 a short region around the start codon.

The HNF homeobox class

The HNF class (named after the rat gene Hnf1) encodes proteins with a POU-like domain N-terminal to a highly atypical homeodomain. The POU-like domain, as its name indicates, is weakly similar in sequence to the POU-specific domain [64]; more importantly, it has nearly the same three-dimensional structure and mode of DNA binding as the POU-specific domain [65].

We have identified a total of three HNF-class homeobox genes in the human genome (Tables 1 and 4), consistent with previous work [66,67]. The homeodomains encoded by the human HNF1A and HNF1B genes are atypical in possessing 21 extra amino acid residues between the second and third alpha helices (Additional file 6). We place these two genes in a single gene family (Hnf1) within the HNF class, implying derivation from a single invertebrate gene. Examination of their chromosomal locations concurs with this view. HNF1A and HNF1B map to parts of the genome known to have duplicated in early vertebrate evolution, namely 12q24.31 (HNF1A, near LHX5 and on the same arm as the HOXC cluster) and 17q12 (HNF1B, between LHX1 and the HOXB cluster) (Figure 4). The use of the A and B suffixes is unfortunate, as numerals are generally used to distinguish paralogs of this age, but is retained at present due to widespread and stable use. The homeodomain encoded by the human HMBOX1 gene is atypical in possessing 15 extra amino acid residues between the second and third alpha helices (Additional file 6). Phylogenetic analyses confirm previous suggestions [67] that HMBOX1 is more distantly related to HNF1A and HNF1B (Figure 3; Additional files 1, 2 and 5). We place this gene in a separate gene family (Hmbox) within the same class. We have not found any human HNF-class pseudogenes.

The SINE homeobox class

The SINE class (named after the Drosophila gene so: sine oculis) encodes proteins with a SIX domain N-terminal to a typical homeodomain. The SIX domain is a DNA-binding domain of approximately 115 amino acids; both the SIX domain and the homeodomain are required for DNA binding [68].

We have identified a total of six SINE-class homeobox genes in the human genome (Tables 1 and 4), consistent with previous work [68,69]. The genes form a distinct grouping even if the SIX domain is disregarded: phylogenetic analyses of homeodomains recover the SINE class as a monophyletic group (Figure 3; Additional files 1, 2 and 5). Phylogenetic analyses of homeodomains divide the SINE class into three gene families (Figure 3; Additional files 1, 2 and 5), consistent with previous studies [68,69]. Each gene family has two human members and dates to a single ancestral gene in the most recent common ancestor of bilaterians [68,69]. We have not found any human SINE-class pseudogenes.

The TALE homeobox class

TALE (three amino acid loop extension) class genes are distinguished by the presence of three extra amino acids between the first and second alpha helices of the encoded homeodomain [1,2,70]. Genes belonging to the TALE class encode proteins with various domains outside of the atypical homeodomain.

We have identified a total of 20 TALE-class homeobox genes in the human genome (Tables 1 and 4). The genes form a distinct grouping in phylogenetic analyses even when the three extra homeodomain residues are excluded from the sequence alignment (Figure 3; Additional file 5). Bürglin [2] has given the TALE group the rank of 'superclass' and distinguished between several 'classes' by the presence of distinct domains outside of the homeodomain. These are the IRX domain, MKX domains, the MEIS domain, the PBC domain and TGIF domains [2,71-73]. Along with some others [4,7,24], we have given the TALE group the rank of 'class' containing several 'gene families'; this maintains consistent terminology throughout the present paper. Phylogenetic analyses of homeodomains divide the TALE class into six gene families (Figure 3; Additional files 1, 2 and 5), including an Mkx family containing the recently described MKX gene, which is distinguished from Irx-family genes phylogenetically and by absence of an IRX domain [73,74]. It should be noted that the established name of the Pknox gene family does not indicate orthology with Knox-family genes of plants. We have also identified a total of 10 TALE-class pseudogenes in the human genome (Tables 1 and 4); we have named six of these (IRX4P1, TGIF1P1 and TGIF2P1 to TGIF2P4), and revised the nomenclature of two others (IRX1P1 and PBX2P1).

IRX1P1 [Entrez Gene ID: 646390]. This sequence was previously known as IRXA1; we rename it IRX1P1 because it is clearly a retrotransposed pseudogene of IRX1 and not a functional gene. The IRX1P1 sequence aligns to the mRNA of IRX1 but has a frameshift mutation and two stop codons in the homeobox.

IRX4P1 [Entrez Gene ID: 100009671]. We designate this previously unannotated sequence IRX4P1 because it is clearly a retrotransposed pseudogene of IRX4. The IRX4P1 sequence is a partial integrant derived from a region of the IRX4 mRNA around the stop codon; it lacks the homeobox.

PBX2P1 [Entrez Gene ID: 5088]. This sequence was previously known as PBXP1; we rename it PBX2P1 because it is clearly a retrotransposed pseudogene of PBX2. The former name, PBXP1, did not indicate its transcript of origin. The PBX2P1 sequence aligns to the mRNA of PBX2 but has a frameshift mutation in the coding region.

TGIF1P1 [Entrez Gene ID: 126052]. We designate this previously unannotated sequence TGIF1P1 because it is clearly a retrotransposed pseudogene of TGIF1. The locus has many sequence alterations when compared to TGIF1 mRNA, including a 48-nucleotide insertion within the homeobox.

TGIF2P1 [GeneID: 126826], TGIF2P2 [GeneID: 100009674], TGIF2P3 [GeneID: 100009672] and TGIF2P4 [GeneID: 100009673]. These four sequences were unannotated prior to this study. We designate them TGIF2P1 to TGIF2P4 because they are clearly pseudogenes of TGIF2. Each aligns to the mRNA sequence of TGIF2 but with sequence alterations, indicating origin by retrotransposition. TGIF2P1 has many sequence alterations, including a frameshift mutation in the homeobox. TGIF2P2 and TGIF2P3 are very similar neighboring loci that must have originated by tandem duplication of a retrotransposed TGIF2 mRNA; neither includes the homeobox. TGIF2P4 is a short partial integrant derived from part of the 3' untranslated region of TGIF2 mRNA.
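The two kinds of disabling change cited for these pseudogenes, in-frame stop codons and frameshifts, can be illustrated with a minimal sketch. It is not the procedure used in this study; the function names and the toy sequence are hypothetical, and the input is assumed to be a DNA sequence already in the correct reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq):
    """Return 0-based codon indices of stop codons in reading frame 0."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [i // 3 for i in range(0, usable, 3) if seq[i:i + 3] in STOP_CODONS]


def indel_shifts_frame(indel_length):
    """An insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


# Toy sequence (hypothetical): codons ATG AAA TAG CCC
print(in_frame_stops("ATGAAATAGCCC"))
# A 48-nucleotide insertion, as reported for TGIF1P1, adds 16 codons
# without changing the reading frame downstream of it.
print(indel_shifts_frame(48), 48 // 3)
```

Note that an in-frame insertion can still disrupt the homeodomain even though it does not shift the reading frame.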

The CUT homeobox class

The CUT class (named after the Drosophila gene cut) generally encodes proteins with one or more CUT domains N-terminal to a typical homeodomain. The CUT domain is a DNA-binding domain of approximately 75 amino acids [75]. There are three widely recognized gene families within the CUT class in humans (Onecut, Cux, Satb; [76]). A fourth gene family (Cmp), lacking a CUT domain but sharing a CMP domain with the Satb gene family, is absent from vertebrates. Bürglin and Cassata [76] have proposed that the vertebrate Satb gene family evolved from the invertebrate Cmp gene family.

We have identified a total of seven CUT-class homeobox genes in the human genome (Tables 1 and 4). Although grouped together by presence of CUT domains, the homeodomains of the Onecut, Cux and Satb gene families are quite divergent and do not always form a monophyletic group in phylogenetic analyses (Additional files 2 and 5). Topologies that separate the gene families are also only weakly supported, so it is most parsimonious to assume that the class is actually monophyletic but the constituent genes underwent rapid sequence divergence following their initial duplications. We have revised the nomenclature of two CUT-class genes (CUX1 and CUX2). We have also identified a total of three CUT-class pseudogenes in the human genome (Tables 1 and 4); we have named all of these (CUX2P1, CUX2P2 and SATB1P1).

CUX1 [Entrez Gene ID: 1523] and CUX2 [Entrez Gene ID: 23316]. These genes were previously known as CUTL1 and CUTL2 respectively. We rename them CUX1 and CUX2 in accordance with homeobox gene nomenclature convention.

CUX2P1 and CUX2P2. These sequences were unannotated prior to this study. We designate them CUX2P1 and CUX2P2 because they are clearly retrotransposed pseudogenes of CUX2. Both are short partial integrants derived from CUX2 mRNA, excluding the homeobox: CUX2P1 covers part of the coding region at the 5' end and CUX2P2 part of the 3' untranslated region.

SATB1P1 [Entrez Gene ID: 100033410]. We designate this previously unannotated sequence SATB1P1 because it is clearly a retrotransposed pseudogene of SATB1. SATB1P1 is a short partial integrant derived from part of the 3' untranslated region of SATB1 mRNA; it does not encompass the homeobox.

The PROS homeobox class

The PROS class (named after the Drosophila gene pros) encodes proteins with a PROS domain C-terminal to an atypical homeodomain. The PROS domain is a DNA-binding domain of approximately 100 amino acids [77]. PROS-class genes encode a highly divergent homeodomain with three extra amino acids. These additional residues are inserted at a different position compared to the TALE class, being between the second and third alpha helices (Additional file 6).
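The atypical homeodomains described in this paper differ from the typical 60-amino-acid homeodomain by a stated number of extra residues at a stated position. The following sketch simply records those figures and derives the implied lengths. It assumes that the extra residues are pure additions to an otherwise 60-residue domain, which the text does not state explicitly; the names used are hypothetical.

```python
TYPICAL_LENGTH = 60  # typical homeodomain length in amino acids

# group -> (extra residues, position of the insertion), as stated in the text
EXTRA_RESIDUES = {
    "TALE": (3, "between helices 1 and 2"),
    "PROS": (3, "between helices 2 and 3"),
    "HNF1A/HNF1B": (21, "between helices 2 and 3"),
    "HMBOX1": (15, "between helices 2 and 3"),
}


def implied_length(group):
    """Length implied by adding the extra residues to a 60-residue homeodomain."""
    extra, _position = EXTRA_RESIDUES[group]
    return TYPICAL_LENGTH + extra


for group, (extra, position) in EXTRA_RESIDUES.items():
    print(f"{group}: {implied_length(group)} aa ({extra} extra, {position})")
```

TALE and PROS homeodomains therefore have the same implied length but differ in where the extra residues sit, which is the distinction drawn in the text.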

We have identified a total of two PROS-class homeobox genes in the human genome (Tables 1 and 4), which we have placed in a single gene family (Prox). The highly divergent homeodomain sequence and unusual structural features provide justification for PROS being a separate gene class, despite the small number of genes. In phylogenetic analyses, PROS-class homeodomains are situated on a long branch, very distant from other classes (Figure 3; Additional files 1, 2 and 5). The human PROX1 gene is well characterized; we have identified and named its paralog, PROX2. We have not found any human PROS-class pseudogenes.

PROX2 [Entrez Gene ID: 283571]. We designate this previously unannotated gene PROX2 on the basis of clear orthology to the mouse Prox2 gene, inferred from sequence identity and synteny. The homeobox of human PROX2 has two introns and, unusually, the splice sites of the first (5') intron (AT-AA) do not follow the GT-AG donor-acceptor rule. This has also been noted for mouse Prox2 [78].
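The GT-AG rule referred to above states that an intron normally begins with the dinucleotide GT (donor site) and ends with AG (acceptor site). A minimal check is sketched below; the function names and the toy intron sequences are hypothetical and are not the actual PROX2 sequences.

```python
def splice_dinucleotides(intron):
    """Return the (donor, acceptor) dinucleotides: first two and last two bases."""
    seq = intron.upper()
    return seq[:2], seq[-2:]


def follows_gt_ag(intron):
    """True if the intron starts with GT and ends with AG."""
    if len(intron) < 4:
        return False
    donor, acceptor = splice_dinucleotides(intron)
    return donor == "GT" and acceptor == "AG"


# Toy introns (hypothetical sequences, for illustration only)
canonical = "GTAAGTCCCTTTCAG"
at_aa = "ATATCCTTTTAA"  # same boundary pattern as reported for PROX2 intron 1

print(splice_dinucleotides(canonical), follows_gt_ag(canonical))
print(splice_dinucleotides(at_aa), follows_gt_ag(at_aa))
```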

The ZF homeobox class

The ZF (zinc finger) class generally encodes proteins with zinc finger motifs, in addition to one or more homeodomains. As noted earlier, phylogenetic analyses of homeodomains do not recover the ZF class as a monophyletic group (Figure 3; Additional files 1, 2 and 5). We recognize that this suggests that zinc finger motifs and homeodomains may have been brought together on three separate occasions in evolution; nonetheless, it is convenient and informative to group these into a single class. Inclusion of the HOMEZ gene in the ZF class may be surprising, as this gene does not encode zinc fingers. However, as previously noted [79] and reproduced in our phylogenetic analyses (Figure 3; Additional files 1, 2 and 5), the multiple homeodomain sequences of this gene are clearly related to those encoded by the ZHX1, ZHX2 and ZHX3 genes.

We have identified a total of 14 ZF-class homeobox genes in the human genome (Tables 1 and 4), which we have placed in five gene families (Adnp, Tshz, Zeb, Zfhx and Zhx/Homez). We have also identified one ZF-class pseudogene in the human genome (Tables 1 and 4). We have revised the nomenclature of five of these loci (ADNP2, ZEB1, ZEB2, ZEB2P1 and ZFHX3).

ADNP2 [Entrez Gene ID: 22850]. This gene was previously known as ZNF508; we rename it ADNP2 to reflect its paralogous relationship to ADNP.

ZEB1 [Entrez Gene ID: 6935] and ZEB2 [Entrez Gene ID: 9839]. These genes were previously known as ZFHX1A and ZFHX1B respectively. We rename them ZEB1 and ZEB2 to distinguish them from genes belonging to the distantly related Zfhx gene family.

ZEB2P1 [Entrez Gene ID: 100033412]. This retrotransposed pseudogene of ZEB2 has been described previously [80]. Our new nomenclature (ZEB2P1) reflects the origin of this locus.

ZFHX3 [Entrez Gene ID: 463]. This gene was previously known as ATBF1; we rename it ZFHX3 to reflect its close relationship to ZFHX2 and ZFHX4; indeed, ZFHX3 was already a synonym for this gene.

The CERS homeobox class

The highly unusual CERS (ceramide synthase) class, also known as the LASS (longevity assurance) class, comprises a single gene family that is highly conserved amongst eukaryotes and includes the yeast gene and original member LAG1. There are six CERS-class genes in the human genome (CERS1 to CERS6) and five of these (CERS2 to CERS6) encode proteins with a homeodomain sequence [81,82]. These are, however, extremely divergent from the homeodomains of other gene classes. Secondary structure prediction analyses suggest these sequences have the potential to encode three alpha helices in the appropriate positions (data not shown). The most surprising characteristic of these genes is that biochemical studies predict them to encode transmembrane proteins, with the homeodomain on the cytosolic side of the endoplasmic reticulum membrane, and hence they could not act as DNA-binding proteins or transcription factors [81,82]. It is possible that an ancestor of these genes gained a homeobox through exon shuffling, or alternatively this could represent convergent evolution. We include only CERS2 to CERS6 in our comprehensive compilation of human homeobox genes, as CERS1 lacks a homeobox motif.

Chromosomal distribution of human homeobox genes

The chromosomal locations of genes can give clues to evolutionary ancestry, including patterns of gene duplication, and the possible existence of gene clusters. In Figure 4, we show the chromosomal locations of all human homeobox genes. We do not include probable pseudogenes on these ideograms, because most of these have originated by reverse transcription of mRNA and secondary integration into the genome, and hence give no insight into ancestral locations of genes. The highly repetitive DUX1 to DUX5 sequences are also not shown, as these have undergone secondary amplification and are also most likely non-functional (see above).

The first observation is that there are homeobox genes on every human chromosome. Even the two sex chromosomes harbor homeobox genes, with SHOX (short stature homeobox) in the PAR1 pseudoautosomal region at the tip of the short arms of X and Y being the best known. Haploinsufficiency of SHOX is implicated in the short stature phenotype of Turner syndrome patients who lack one copy of the X chromosome [83]. There are also nine other homeobox genes in non-pseudoautosomal regions of the X chromosome, including three tandemly arranged members of the Rhox gene family, collectively homologous to the multiple Rhox (reproductive homeobox) genes of mouse. Only one of the homeobox genes on the X chromosome, the TALE-class gene TGIF2LX, has a distinct homolog on the Y chromosome, called TGIF2LY. These genes map to the largest homology block shared by the unique regions of the X and Y chromosomes, spanning 3.5 Mb. It has been proposed that the ancestor of these two genes arose by retrotransposition of TGIF2 mRNA [84].

The autosomes with the lowest number of homeobox genes are chromosomes 21 (with just PKNOX1) and 22 (with GSC2 and ISX). Examination of the remaining autosomes reveals that homeobox genes are quite dispersed with some interesting regional accumulations. The best known examples of close linkage between homeobox genes are the four Hox clusters on human chromosomes 2, 7, 12 and 17, comprising 9, 11, 9 and 10 genes respectively; each of these is shown as just a single line on each ideogram for simplicity (Figure 4). These should not be considered in isolation, however, because many other ANTP-class genes map in the vicinity of the Hox clusters [26,27]. These include genes very tightly linked to the Hox clusters, notably the Evx-family genes (on chromosomes 2 and 7), Dlx-family genes (on chromosomes 2 and 17), and Meox-family genes (on chromosomes 7 and 17).

There are other concentrations of ANTP-class genes away from the Hox clusters. These are the ParaHox cluster (GSX1, PDX1, CDX2) on chromosome 13, and four sets of NKL-subclass genes on 2p/8p (split), 4p, 5q and 10q, hypothesized to be derived from an ancestral array by duplication [26,33]. The accumulation on the distal half of the long arm of chromosome 10 is particularly striking, comprising eleven ANTP-class genes from 10 gene families. This is not a tight gene cluster, but it is compatible with ancestry by extensive tandem gene duplication followed by dispersal. Discounting the rather aberrant case of the Hox clusters, this region of the long arm of chromosome 10 is the most homeobox-rich region of the human genome.

There are additional groupings of homeobox genes outside the ANTP class. These include two TALE-class Irx clusters on chromosomes 5 and 16 homologous to the described mouse Irx clusters [19], and a set of PRD-class genes on chromosome 19 proposed to be derived from the CRX homeobox gene by duplication and rapid divergence [21]. Perhaps the most interesting case, however, is found on the tip of the long arm of chromosome 9, where there is a concentration of homeobox genes from disparate gene classes. Four LIM-class genes, one ANTP-class gene, one PRD-class gene and one TALE-class gene are found in this location. Although dispersed over a large region, and not forming a tight gene cluster, the linkages are nonetheless intriguing. It is possible that these linkages reflect ancestry from the very ancient gene duplications that must have generated the distinctive homeobox gene classes found within animal genomes.

Conclusion

We identified 300 homeobox loci in the euchromatic regions of the human genome, and divide these into 235 probable functional genes and 65 probable pseudogenes. Not all of these loci possess a homeobox because, for completeness, we include all sequences derived from homeobox-containing genes. The number of homeobox sequences is also different from the number of loci because several genes contain multiple homeobox motifs. The figures exclude the repetitive DUX1 to DUX5 homeobox sequences, of which we identified 35 probable pseudogenes, with many more expected in heterochromatic regions.

New or revised nomenclature is proposed for approximately 70 of the 300 homeobox loci in order to clarify orthologous relationships between human and mouse, to indicate evolutionary relationships within a gene family, to distinguish genes from pseudogenes, and to indicate pseudogene origins. The loci are also classified into a simple hierarchical scheme, comprising 102 gene families within eleven gene classes. The classification scheme proposed may be widely applicable to homeobox genes from other animals.

The 235 probable functional homeobox genes map to every human chromosome with some interesting regional concentrations of genes. These include a large number of ANTP-class genes on the distal end of the long arm of chromosome 10, and a combination of LIM-, ANTP-, PRD- and TALE-class genes on the distal end of the long arm of chromosome 9. These associations may be remnants of common ancestry early in animal evolution.

Methods

The finished human genome sequence (build 35.1) was subjected to a series of tBLASTn searches [85,86] using known homeodomain sequences from the ANTP, PRD, LIM, POU, HNF, SINE, TALE, CUT, PROS and ZF classes. No arbitrary E-value cut-off was selected, but instead each list of hits was analyzed manually until true homeodomain sequences ceased to be detected. Definition of a homeodomain used a combination of CD-search for conserved protein domains implemented through BLASTp [85,86] and secondary structure prediction by JPred implemented through the Barton Group, University of Dundee [87]. Each time a new or divergent homeodomain match was found, the tBLASTn process was repeated. Six very divergent gene families were undetected by this method but found by text searching: Hopx, Adnp, Tshz, Zeb, Zhx/Homez and Cers. To ensure that every pseudogene was detected, including truncated or decayed versions lacking the homeobox, the full mRNA sequence of each gene was deduced and used in a BLASTn search of the human genome sequence [85,86]. Pseudogenes were recognized as those genomic regions with similarity to non-repetitive DNA sequences of the parent gene, even if aligning to only part of the locus. Pseudogenes undergo mutational decay and would eventually become unrecognizable, but in practice ambiguous cases were not encountered. Exon-intron structures of novel loci were deduced by comparison between genomic sequence and cDNA, EST or retrotransposed pseudogene sequences, as previously described [21]. Several unnamed human loci were identified as probable orthologs of known mouse genes; orthology was deduced by a combination of homeodomain sequence similarity and synteny, examined through the mouse genome sequence (build 34.1) and the Ensembl Genome Browser [88].
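The search strategy above is iterative: every new or divergent homeodomain found becomes a query for a further round, and searching stops when no new sequences turn up. The control flow can be sketched as a simple worklist loop. This is an illustration only and not the authors' pipeline; `search` is a hypothetical stand-in for one tBLASTn run followed by manual inspection of the hit list, and the toy hit table is invented.

```python
def iterative_search(seed_queries, search):
    """Repeat searches until no new hits appear.

    seed_queries: initial query identifiers (known homeodomains).
    search: callable mapping a query to an iterable of accepted hits
            (hypothetical stand-in for a tBLASTn search plus manual curation).
    """
    found = set()
    pending = list(seed_queries)
    while pending:
        query = pending.pop()
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                pending.append(hit)  # each new hit is used as a query in turn
    return found


# Toy data (hypothetical): which hits each query recovers
toy_hits = {
    "known_hd": ["hit1", "hit2"],
    "hit1": [],
    "hit2": ["hit3"],
    "hit3": ["hit2"],
}
print(sorted(iterative_search(["known_hd"], lambda q: toy_hits.get(q, []))))
```

Because each hit is queued only once, the loop terminates even when hits recover one another, as hit2 and hit3 do in the toy data.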

Phylogenetic analyses were performed with homeodomain sequences, after each had been edited to an alignment of 60 amino acids (Additional file 7), using the maximum likelihood [89] and neighbor-joining [90] methods. Maximum likelihood trees were constructed using PhyML [91], with a JTT model of amino acid substitution, four categories of between-site rate heterogeneity, a gamma distribution parameter estimated from the data and 500 bootstrap resamplings. Neighbor-joining trees were constructed using PHYLIP [92], with a JTT model of amino acid substitution and 1000 bootstrap resamplings. For defining human gene families, all Drosophila homeodomains were first combined with all human homeodomains in maximum likelihood and neighbor-joining analyses to enable divergent Drosophila genes to be identified and removed. These include genes lost from human, as well as genes known to have undergone unusually rapid evolution in Drosophila. For the Hox3 family the rapidly evolving Drosophila genes bcd, zen and zen2 were then replaced by an ortholog from centipede (Sm Hox3b), and for the Nk4 family the rapidly evolving Drosophila gene tin was replaced by an ortholog from annelid (Pd NK4). In addition, six genes from other protostome or cnidarian genomes were added to represent gene families known to be missing from Drosophila (Pdx family: Ps Xlox; Alx family: Nv CART1; Dmbx family: Hv manacle; Pou1 family: Nv POU1; Hnf1 family: Nv HNF; Pknox family: Am Prep). Only 100 bootstrap resamplings were performed on this dataset because of its large size (354 homeodomains). Trees were displayed using TreeExplorer [93]. Genes encoding partial homeodomains, and probable pseudogenes, were not included in the phylogenetic analyses. With short alignments, phylogenetic trees can only be used as guides to relationships, not absolute indicators of evolutionary history, and the trees presented in this paper should be interpreted in this light.
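Bootstrap resampling, as used above, builds each pseudo-replicate by drawing alignment columns at random with replacement until the replicate has as many columns as the original alignment; a tree is then inferred from each replicate. The sketch below shows only the resampling step, using Python's standard library. It is not how PhyML or PHYLIP implement it internally, and the function names and the short toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate).

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    n_cols = len(next(iter(alignment.values())))
    # The same sampled column indices are applied to every sequence.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate n_replicates pseudo-replicate alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment (hypothetical 8-column sequences, not real homeodomains)
toy = {"geneA": "RRKRTAFT", "geneB": "RRSRTTFS", "geneC": "KRKRTSFT"}
replicates = bootstrap_replicates(toy, n_replicates=100, seed=1)
print(len(replicates), replicates[0])
```

With a 60-column homeodomain alignment each replicate contains few informative sites, which is consistent with the caution in the text about interpreting trees built from short alignments.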

Authors' contributions

PWHH designed the study and contributed to gene identification and to gene nomenclature revisions. HAFB carried out database searches, annotations and phylogenetic analyses and contributed to gene nomenclature revisions. PWHH and HAFB drafted the manuscript. EAB contributed to gene nomenclature revisions, discussed these with the research community and databases, and implemented the agreed changes.

All authors edited and approved the final manuscript.

Supplementary Material

Additional file 1

Maximum likelihood phylogenetic tree of all human plus selected protostome and cnidarian homeodomains for identification of gene families. Arbitrarily rooted phylogenetic tree of all human plus selected protostome and cnidarian homeodomains constructed using the maximum likelihood (ML) method. Bootstrap values supporting gene family designations are shown. Homeodomain sequences derived from pseudogenes are excluded. This ML tree should be compared with the neighbor-joining (NJ) tree shown in Additional file 2. The dataset used for both ML and NJ analyses includes all human homeodomains, most Drosophila melanogaster homeodomains, plus selected additional homeodomains from other protostomes or cnidarians when the gene family is divergent or absent in Drosophila. Divergent Drosophila genes that do not group with human genes were identified by construction of preliminary, non-bootstrapped ML and NJ trees, and subsequently removed from the dataset. These include genes lost from human, as well as genes known to have undergone unusually rapid evolution in Drosophila. For the Hox3 family the rapidly evolving Drosophila genes bcd, zen and zen2 were replaced with Sm Hox3b, and for the Nk4 family the rapidly evolving Drosophila gene tin was replaced with Pd NK4. In addition, six genes from other protostome or cnidarian genomes were added to represent gene families known to be missing from Drosophila (Pdx family: Ps Xlox; Alx family: Nv CART1; Dmbx family: Hv manacle; Pou1 family: Nv POU1; Hnf1 family: Nv HNF; Pknox family: Am Prep). Species abbreviations: Am, Apis mellifera (honeybee); Dm, Drosophila melanogaster (fruitfly); Hv, Hydra vulgaris (hydrozoan); Nv, Nematostella vectensis (starlet sea anemone); Pd, Platynereis dumerilii (annelid worm); Ps, Phascolion strombus (sipunculan worm); Sm, Strigamia maritima (centipede). ML performed more poorly than NJ in recovering several well known gene families, notably Hox4, Hox5, Nk4 and Alx. In contrast, ML did recover PROP1 and CG32532 as a true gene family; NJ did not.
The invertebrate gene does not always lie as a strict outgroup to all human genes in a family; this effect is expected when using a short alignment. Instead, distinct grouping of invertebrate and human genes is taken as evidence of ancestry from a single gene. A few ambiguous cases were encountered, notably divergence of Drosophila H2.0 in the proposed Hlx gene family, and resolution within the Pax4/6 gene family, which is recovered as two families in NJ but one in ML. As explained in the text, several human gene families contain 'orphan' genes without invertebrate orthologs; these are Barx, Nanog, Noto, Vax, Ventx, Argfx, Dprx, Dux, Esx, Hesx, Hopx, Isx, Leutx, Mix, Nobox, Rhox, Sebox, Tprx, Hdx, Pou5, Hmbox, Satb, Adnp and Zhx/Homez. Zeb and Mkx would be placed in this category based on our ML and NJ trees, although other data suggest that Drosophila zfh1 and CG11617 respectively may be the protostome orthologs [73,94]. Tshz is only an apparent orphan family; the clear Drosophila ortholog simply lacks the homeobox [95,96]. Phylogenetic analysis is just one source of evidence for allocation of genes to gene families and identification of boundaries between gene families; complementary criteria used are synteny between species and paralogy within the human genome. Our ML and NJ trees should not be used to allocate gene families to gene classes, because other diagnostic characters such as insertions within the homeodomain, key amino acid residues, and several motifs outside of the homeodomain are excluded from the analysis. Indeed, artefactual mixing of the TALE and SINE classes occurs in both ML and NJ trees.

Additional file 2

Neighbor-joining phylogenetic tree of all human plus selected protostome and cnidarian homeodomains for identification of gene families. Arbitrarily rooted phylogenetic tree of all human plus selected protostome and cnidarian homeodomains constructed using the neighbor-joining (NJ) method. Bootstrap values supporting gene family designations are shown. Homeodomain sequences derived from pseudogenes are excluded. Comparison of NJ and ML trees, and description of the dataset used, is given in the legend to Additional file 1. Several instances of artefactual mixing of classes occur in this NJ tree, notably splitting of the CUT class, mixing of the TALE and SINE classes and aberrant placement of HOPX.

Additional file 3

Neighbor-joining phylogenetic tree of human ANTP-class homeodomains, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human ANTP-class homeodomains constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. The proposed division between the HOXL and NKL subclasses is indicated. The position of EN1 and EN2 is unstable; this tree places them close to the base of the HOXL/NKL divergence, whereas maximum likelihood analysis of the same dataset places them firmly in the NKL subclass (Figure 1). Interrelationships of genes in the Nk2.2 and Nk4 families are also unstable (in this tree and Figure 1 respectively); in these cases synteny within and between genomes clearly resolves gene families. Detailed relationships between different gene families should not be inferred from this tree.

Additional file 4

Neighbor-joining phylogenetic tree of human PRD-class homeodomains, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human PRD-class homeodomains constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded, as are the partial homeodomains of PAX2, PAX5 and PAX8, and the HOPX homeodomain because its extremely divergent sequence destabilizes the overall tree topology. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single Dux-family gene. Detailed relationships between different gene families should not be inferred from this tree.

Additional file 5

Neighbor-joining phylogenetic tree of human homeodomains excluding ANTP and PRD classes, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human homeodomains excluding the ANTP and PRD classes constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single gene. Classes and/or families are color coded as shown in the key. The LIM and ZF classes are not recovered as two distinct monophyletic groups, a result also found by maximum likelihood analysis (Figure 3). The multiple homeodomains of Zfhx-family proteins and Zhx/Homez-family proteins are also dispersed in the tree, presumably artefactually. Monophyly of the CUT class is not recovered in this tree, but is by maximum likelihood analysis (Figure 3). Detailed relationships between different gene families should not be inferred from this tree.

Additional file 6

Multiple sequence alignment of all human plus selected protostome and cnidarian homeodomains. The consensus homeodomain sequence (shown several times for reference) was derived from a compilation of 247 human homeodomain sequences. The three horizontal lines indicate the positions of the three alpha-helices. The numbering scheme refers to amino acid position in the canonical 60-amino-acid homeodomain; insertions relative to this sequence are shown when present. Black shaded residues are invariant between all human homeodomains within each class (or family in the case of the ZF homeodomains). Sequence accession numbers are shown. For each gene family designation, maximum likelihood and neighbor-joining bootstrap support values are indicated (see Additional files 1 and 2). These values are not shown if the gene family does not form a monophyletic group in phylogenetic analyses (in which case n/a is written) or if an invertebrate homolog could not be found.

Additional file 7

Phylogenetic input file. All human and invertebrate homeodomains used in phylogenetic analyses are shown, after alignment and removal of insertions to give a uniform 60-amino-acid alignment.

Acknowledgments

We thank Rebecca Furlong, Tokiharu Takahashi, Hidetoshi Saiga, Naohito Takatori, David Ferrier, Mario Pestarino, Thomas Bürglin and reviewers for helpful advice. Research undertaken by PWHH and HAFB was supported by the BBSRC and the Wellcome Trust. The work of EAB and the HUGO Gene Nomenclature Committee is supported by NHGRI grant P41 HG003345 and the Wellcome Trust.

Contributor Information

Peter WH Holland, Email: peter.holland@zoo.ox.ac.uk.

H Anne F Booth, Email: anne.booth@merton.oxon.org.

Elspeth A Bruford, Email: hgnc@genenames.org.

References

  1. Bürglin TR. A comprehensive classification of homeobox genes. In: Duboule D, editor. Guidebook to the Homeobox Genes. Oxford: Oxford University Press; 1994. pp. 25–71.
  2. Bürglin TR. Homeodomain proteins. In: Meyers RA, editor. Encyclopedia of Molecular Cell Biology and Molecular Medicine. 2. Vol. 6. Weinheim: Wiley-VCH Verlag GmbH & Co; 2005. pp. 179–222.
  3. Boncinelli E. Homeobox genes and disease. Curr Opin Genet Dev. 1997;7:331–337. doi: 10.1016/S0959-437X(97)80146-3.
  4. Edvardsen RB, Seo H-C, Jensen MF, Mialon A, Mikhaleva J, Bjordal M, Cartry J, Reinhardt R, Weissenbach J, Wincker P, et al. Remodelling of the homeobox gene complement in the tunicate Oikopleura dioica. Curr Biol. 2005;15:R12–R13. doi: 10.1016/j.cub.2004.12.010.
  5. Galliot B, de Vargas C, Miller D. Evolution of homeobox genes: Q50 Paired-like genes founded the Paired class. Dev Genes Evol. 1999;209:186–197. doi: 10.1007/s004270050243.
  6. Holland PWH, Takahashi T. The evolution of homeobox genes: implications for the study of brain development. Brain Res Bull. 2005;66:484–490. doi: 10.1016/j.brainresbull.2005.06.003.
  7. Ryan JF, Burton PM, Mazza ME, Kwong GK, Mullikin JC, Finnerty JR. The cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidence from the starlet sea anemone, Nematostella vectensis. Genome Biol. 2006;7:R64. doi: 10.1186/gb-2006-7-7-r64.
  8. Banerjee-Basu S, Baxevanis AD. Molecular evolution of the homeodomain family of transcription factors. Nucleic Acids Res. 2001;29:3258–3269. doi: 10.1093/nar/29.15.3258.
  9. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
  10. IHGSC. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
  11. Nam J, Nei M. Evolutionary change of the numbers of homeobox genes in bilateral animals. Mol Biol Evol. 2005;22:2386–2394. doi: 10.1093/molbev/msi229.
  12. Beckers M-C, Gabriëls J, van der Maarel S, De Vriese A, Frants RR, Collen D, Belayew A. Active genes in junk DNA? Characterization of DUX genes embedded within 3.3 kb repeated elements. Gene. 2001;264:51–57. doi: 10.1016/S0378-1119(00)00602-8.
  13. Ding H, Beckers M-C, Plaisance S, Marynen P, Collen D, Belayew A. Characterization of a double homeodomain protein (DUX1) encoded by a cDNA homologous to 3.3 kb dispersed repeated elements. Hum Mol Genet. 1998;7:1681–1694. doi: 10.1093/hmg/7.11.1681.
  14. Gabriëls J, Beckers M-C, Ding H, De Vriese A, Plaisance S, van der Maarel SM, Padberg GW, Frants RR, Hewitt JE, Collen D, et al. Nucleotide sequence of the partially deleted D4Z4 locus in a patient with FSHD identifies a putative gene within each 3.3 kb element. Gene. 1999;236:25–32. doi: 10.1016/S0378-1119(99)00267-X.
  15. Akam ME, Holland PWH, Ingham PW, Wray G. The evolution of developmental mechanisms. Development. 1994. pp. 135–142.
  16. Joyner AL, Hanks M. The engrailed genes: evolution of function. Semin Dev Biol. 1991;2:435–445.
  17. Echelard Y, Epstein DJ, St-Jacques B, Shen L, Mohler J, McMahon JA, McMahon AP. Sonic hedgehog, a member of a family of putative signaling molecules, is implicated in the regulation of CNS polarity. Cell. 1993;75:1417–1430. doi: 10.1016/0092-8674(93)90627-3.
  18. Stock DW, Ellies DL, Zhao Z, Ekker M, Ruddle FH, Weiss KM. The evolution of the vertebrate Dlx gene family. Proc Natl Acad Sci USA. 1996;93:10858–10863. doi: 10.1073/pnas.93.20.10858.
  19. Peters T, Dildrop R, Ausmeier K, Ruther U. Organization of mouse Iroquois homeobox genes in two clusters suggests a conserved regulation and function in vertebrate development. Genome Res. 2000;10:1453–1462. doi: 10.1101/gr.144100.
  20. de Rosa R, Grenier JK, Andreeva T, Cook CE, Adoutte A, Akam M, Carroll SB, Balavoine G. Hox genes in brachiopods and priapulids and protostome evolution. Nature. 1999;399:772–776. doi: 10.1038/21631.
  21. Booth HAF, Holland PWH. Annotation, nomenclature and evolution of four novel homeobox genes expressed in the human germ line. Gene. 2007;387:7–14. doi: 10.1016/j.gene.2006.07.034.
  22. Booth HAF, Holland PWH. Eleven daughters of NANOG. Genomics. 2004;84:229–238. doi: 10.1016/j.ygeno.2004.02.014.
  23. Castro LFC, Rasmussen SLK, Holland PWH, Holland ND, Holland LZ. A Gbx homeobox gene in amphioxus: insights into ancestry of the ANTP class and evolution of the midbrain/hindbrain boundary. Dev Biol. 2006;295:40–51. doi: 10.1016/j.ydbio.2006.03.003.
  24. Dearden PK, Wilson MJ, Sablan L, Osborne PW, Havler M, McNaughton E, Kimura K, Milshina NV, Hasselmann M, Gempe T, et al. Patterns of conservation and change in honey bee developmental genes. Genome Res. 2006;16:1376–1384. doi: 10.1101/gr.5108606.
  25. Monteiro AS, Schierwater B, Dellaporta SL, Holland PWH. A low diversity of ANTP class homeobox genes in Placozoa. Evol Dev. 2006;8:174–182. doi: 10.1111/j.1525-142X.2006.00087.x.
  26. Castro LFC, Holland PWH. Chromosomal mapping of ANTP class homeobox genes in amphioxus: piecing together ancestral genomes. Evol Dev. 2003;5:459–465. doi: 10.1046/j.1525-142X.2003.03052.x.
  27. Pollard SL, Holland PWH. Evidence for 14 homeobox gene clusters in human genome ancestry. Curr Biol. 2000;10:1059–1062. doi: 10.1016/S0960-9822(00)00676-X.
  28. Brooke NM, Garcia-Fernàndez J, Holland PWH. The ParaHox gene cluster is an evolutionary sister of the Hox gene cluster. Nature. 1998;392:920–922. doi: 10.1038/31933.
  29. Ferrier DEK, Brooke NM, Panopoulou G, Holland PWH. The Mnx homeobox gene class defined by HB9, MNR2 and amphioxus AmphiMnx. Dev Genes Evol. 2001;211:103–107. doi: 10.1007/s004270000124.
  30. Venkatesh TV, Holland ND, Holland LZ, Su M-T, Bodmer R. Sequence and developmental expression of amphioxus AmphiNk2-1: insights into the evolutionary origin of the vertebrate thyroid gland and forebrain. Dev Genes Evol. 1999;209:254–259. doi: 10.1007/s004270050250.
  31. Holland ND, Venkatesh TV, Holland LZ, Jacobs DK, Bodmer R. Amphink2-tin, an amphioxus homeobox gene expressed in myocardial progenitors: insights into evolution of the vertebrate heart. Dev Biol. 2003;255:128–137. doi: 10.1016/S0012-1606(02)00050-7.
  32. Hislop NR, de Jong D, Hayward DC, Ball EE, Miller DJ. Tandem organization of independently duplicated homeobox genes in the basal cnidarian Acropora millepora. Dev Genes Evol. 2005;215:268–273. doi: 10.1007/s00427-005-0468-y.
  33. Luke GN, Castro LFC, McLay K, Bird C, Coulson A, Holland PWH. Dispersal of NK homeobox gene clusters in amphioxus and humans. Proc Natl Acad Sci USA. 2003;100:5292–5295. doi: 10.1073/pnas.0836141100.
  34. Shiojima I, Komuro I, Mizuno T, Aikawa R, Akazawa H, Oka T, Yamazaki T, Yazaki Y. Molecular cloning and characterization of human cardiac homeobox gene CSX1. Circulation Res. 1996;79:920–929. doi: 10.1161/01.res.79.5.920.
  35. Moretti P, Simmons P, Thomas P, Haylock D, Rathjen P, Vadas M, D'Andrea R. Identification of homeobox genes expressed in human haemopoietic progenitor cells. Gene. 1994;144:213–219. doi: 10.1016/0378-1119(94)90380-8.
  36. Hart AH, Hartley L, Ibrahim M, Robb L. Identification, cloning and expression analysis of the pluripotency promoting Nanog genes in mouse and human. Dev Dynamics. 2004;230:187–198. doi: 10.1002/dvdy.20034.
  37. Fairbanks D, Maughan P. Evolution of the NANOG pseudogene family in the human and chimpanzee genomes. BMC Evol Biol. 2006;6:12. doi: 10.1186/1471-2148-6-12.
  38. Zhang J, Wang X, Li M, Han J, Chen B, Wang B, Dai J. NANOGP8 is a retrogene expressed in cancers. FEBS J. 2006;273:1723–1730. doi: 10.1111/j.1742-4658.2006.05186.x.
  39. Moreau-Aubry A, Le Guiner S, Labarrière N, Gesnel M-C, Jotereau F, Breathnach R. A processed pseudogene codes for a new antigen recognized by a CD8+ T cell clone on melanoma. J Exp Med. 2000;191:1617–1623. doi: 10.1084/jem.191.9.1617.
  40. Balczarek KA, Lai Z-C, Kumar S. Evolution and functional diversification of the paired box (Pax) DNA-binding domains. Mol Biol Evol. 1997;14:829–842. doi: 10.1093/oxfordjournals.molbev.a025824.
  41. Chen F, Kook H, Milewski R, Gitler AD, Lu MM, Li J, Nazarian R, Schnepp R, Jen K, Biben C, et al. Hop is an unusual homeobox gene that modulates cardiac development. Cell. 2002;110:713–723. doi: 10.1016/S0092-8674(02)00932-7.
  42. Shin CH, Liu Z-P, Passier R, Zhang C-L, Wang D-Z, Harris TM, Yamagishi H, Richardson JA, Childs G, Olson EN. Modulation of cardiac growth and development by HOP, an unusual homeodomain protein. Cell. 2002;110:725–735. doi: 10.1016/S0092-8674(02)00933-9.
  43. Simeone A, Acampora D, Gulisano M, Stornaiuolo A, Boncinelli E. Nested expression domains of four homeobox genes in developing rostral brain. Nature. 1992;358:687–690. doi: 10.1038/358687a0.
  44. Plouhinec J-L, Sauka-Spengler T, Germot A, Le Mentec C, Cabana T, Harrison G, Pieau C, Sire J-Y, Véron G, Mazan S. The mammalian Crx genes are highly divergent representatives of the Otx5 gene family, a gnathostome orthology class of orthodenticle-related homeogenes involved in the differentiation of retinal photoreceptors and circadian entrainment. Mol Biol Evol. 2003;20:513–521. doi: 10.1093/molbev/msg085.
  45. Wada H, Saiga H, Satoh N, Holland PWH. Tripartite organization of the ancestral chordate brain and the antiquity of placodes: insights from ascidian Pax-2/5/8, Hox and Otx genes. Development. 1998;125:1113–1122. doi: 10.1242/dev.125.6.1113.
  46. Wada H, Holland PWH, Sato S, Yamamoto H, Satoh N. Neural tube is partially dorsalized by overexpression of HrPax-37: the ascidian homologue of Pax-3 and Pax-7. Dev Biol. 1997;187:240–252. doi: 10.1006/dbio.1997.8626.
  47. MacLean JA, 2nd, Chen MA, Wayne CM, Bruce SR, Rao M, Meistrich ML, Macleod C, Wilkinson MF. Rhox: a new homeobox gene cluster. Cell. 2005;120:369–382. doi: 10.1016/j.cell.2004.12.022.
  48. Jackson M, Watt AJ, Gautier P, Gilchrist D, Driehaus J, Graham GJ, Keebler J, Prugnolle F, Awadalla P, Forrester LM. A murine specific expansion of the Rhox cluster involved in embryonic stem cell biology is under natural selection. BMC Genom. 2006;7:212. doi: 10.1186/1471-2164-7-212.
  49. MacLean JA, 2nd, Lorenzetti D, Hu Z, Salerno WJ, Miller J, Wilkinson MF. Rhox homeobox gene cluster: recent duplication of three family members. Genesis. 2006;44:122–129. doi: 10.1002/gene.20193.
  50. Morris L, Gordon J, Blackburn CC. Identification of a tandem duplicated array in the Rhox alpha locus on mouse chromosome X. Mamm Genome. 2006;17:178–187. doi: 10.1007/s00335-005-0138-4.
  51. Wang X, Zhang J. Remarkable expansions of an X-linked reproductive homeobox gene cluster in rodent evolution. Genomics. 2006;88:34–43. doi: 10.1016/j.ygeno.2006.02.007.
  52. Wimmer K, Zhu X-X, Rouillard JM, Ambros PF, Lamb BJ, Kuick R, Eckart M, Weinhäusl A, Fonatsch C, Hanash SM. Combined restriction landmark genomic scanning and virtual genome scans identify a novel human homeobox gene, ALX3, that is hypermethylated in neuroblastoma. Genes Chromosomes Cancer. 2002;33:285–294. doi: 10.1002/gcc.10030.
  53. Norris RA, Scott KK, Moore CS, Stetten G, Brown CR, Jabs EW, Wulfsberg EA, Yu J, Kern MJ. Human PRRX1 and PRRX2 genes: cloning, expression, genomic localization, and exclusion as disease genes for Nager syndrome. Mamm Genome. 2000;11:1000–1005. doi: 10.1007/s003350010193.
  54. Saito T, Greenwood A, Sun Q, Anderson DJ. Identification by differential RT-PCR of a novel paired homeodomain protein specifically expressed in sensory neurons and a subset of their CNS targets. Mol Cell Neurosci. 1995;6:280–292. doi: 10.1006/mcne.1995.1022.
  55. Heathcote K, Braybrook C, Abushaban L, Guy M, Khetyar ME, Patton MA, Carter ND, Scambler PJ, Syrris P. Common arterial trunk associated with a homeodomain mutation of NKX2.6. Hum Mol Genet. 2005;14:585–593. doi: 10.1093/hmg/ddi055.
  56. Wayne CM, MacLean JA, 2nd, Cornwall G, Wilkinson MF. Two novel human X-linked homeobox genes, hPEPP1 and hPEPP2, selectively expressed in the testis. Gene. 2002;301:1–11. doi: 10.1016/S0378-1119(02)01087-9.
  57. Cinquanta M, Rovescalli AC, Kozak CA, Nirenberg M. Mouse Sebox homeobox gene expression in skin, brain, oocytes, and two-cell embryos. Proc Natl Acad Sci USA. 2000;97:8904–8909. doi: 10.1073/pnas.97.16.8904.
  58. Wijmenga C, Frants RR, Hewitt JE, van Deutekom JCT, van Geel M, Wright TJ, Padberg GW, Hofker MH, van Ommen G-JB. Molecular genetics of facioscapulohumeral muscular dystrophy. Neuromusc Dis. 1993;3:487–491. doi: 10.1016/0960-8966(93)90102-P.
  59. Kadrmas JL, Beckerle MC. The LIM domain: from the cytoskeleton to the nucleus. Nat Rev Mol Cell Biol. 2004;5:920–931. doi: 10.1038/nrm1499.
  60. Hobert O, Westphal H. Functions of LIM-homeobox genes. Trends Genet. 2000;16:75–83. doi: 10.1016/S0168-9525(99)01883-1.
  61. Phillips K, Luisi B. The virtuoso of versatility: POU proteins that flex to fit. J Mol Biol. 2000;302:1023–1039. doi: 10.1006/jmbi.2000.4107.
  62. Ryan AK, Rosenfeld MG. POU domain family values: flexibility, partnerships, and developmental codes. Genes Dev. 1997;11:1207–1225. doi: 10.1101/gad.11.10.1207.
  63. Andersen B, Rosenfeld MG. POU domain factors in the neuroendocrine system: lessons from developmental biology provide insights into human disease. Endocrine Rev. 2001;22:2–35. doi: 10.1210/er.22.1.2.
  64. Baumhueter S, Mendel DB, Conley PB, Kuo CJ, Turk C, Graves MK, Edwards CA, Courtois G, Crabtree GR. HNF-1 shares three sequence motifs with the POU domain proteins and is identical to LF-B1 and APF. Genes Dev. 1990;4:372–379. doi: 10.1101/gad.4.3.372.
  65. Chi Y-I, Frantz JD, Oh B-C, Hansen L, Dhe-Paganon S, Shoelson SE. Diabetes mutations delineate an atypical POU domain in HNF-1alpha. Mol Cell. 2002;10:1129–1137. doi: 10.1016/S1097-2765(02)00704-9.
  66. Bach I, Mattei M-G, Cereghini S, Yaniv M. Two members of an HNF1 homeoprotein family are expressed in human liver. Nucleic Acids Res. 1991;19:3553–3559. doi: 10.1093/nar/19.13.3553.
  67. Chen S, Saiyin H, Zeng X, Xi J, Liu X, Li X, Yu L. Isolation and functional analysis of human HMBOX1, a homeobox containing protein with transcriptional repressor activity. Cytogen Genome Res. 2006;114:131–136. doi: 10.1159/000093328.
  68. Kawakami K, Sato S, Ozaki H, Ikeda K. Six family genes-structure and function as transcription factors and their roles in development. BioEssays. 2000;22:616–626. doi: 10.1002/1521-1878(200007)22:7<616::AID-BIES4>3.0.CO;2-R.
  69. Gallardo ME, Lopez-Rios J, Fernaud-Espinosa I, Granadino B, Sanz R, Ramos C, Ayuso C, Seller MJ, Brunner HG, Bovolenta P, et al. Genomic cloning and characterization of the human homeobox gene SIX6 reveals a cluster of SIX genes in chromosome 14 and associates SIX6 hemizygosity with bilateral anophthalmia and pituitary anomalies. Genomics. 1999;61:82–91. doi: 10.1006/geno.1999.5916.
  70. Bertolino E, Reimund B, Wildt-Perinic D, Clerc RG. A novel homeobox protein which recognizes a TGT core and functionally interferes with a retinoid-responsive motif. J Biol Chem. 1995;270:31178–31188. doi: 10.1074/jbc.270.52.31178.
  71. Bürglin TR. Analysis of TALE superclass homeobox genes (MEIS, PBC, KNOX, Iroquois, TGIF) reveals a novel domain conserved between plants and animals. Nucleic Acids Res. 1997;25:4173–4180. doi: 10.1093/nar/25.21.4173.
  72. Bürglin TR. The PBC domain contains a MEINOX domain: coevolution of Hox and TALE homeobox genes? Dev Genes Evol. 1998;208:113–116. doi: 10.1007/s004270050161.
  73. Bürglin TR, Mukherjee K. Comprehensive analysis of animal TALE homeobox genes: new conserved motifs and cases of accelerated evolution. J Mol Evol. 2007;65:137–153. doi: 10.1007/s00239-006-0023-0.
  74. Anderson DM, Arredondo J, Hahn K, Valente G, Martin JF, Wilson-Rawls J, Rawls A. Mohawk is a novel homeobox gene expressed in the developing mouse embryo. Dev Dynam. 2006;235:792–801. doi: 10.1002/dvdy.20671.
  75. Harada R, Bérubé G, Tamplin OJ, Denis-Larose C, Nepveu A. DNA-binding specificity of the cut repeats from the human cut-like protein. Mol Cell Biol. 1995;15:129–140. doi: 10.1128/mcb.15.1.129.
  76. Bürglin TR, Cassata G. Loss and gain of domains during evolution of cut superclass homeobox genes. Int J Dev Biol. 2002;46:115–123.
  77. Yousef MS, Matthews BW. Structural basis of prospero-DNA interaction: implications for transcription regulation in developing cells. Structure. 2005;13:601–607. doi: 10.1016/j.str.2005.01.023.
  78. Nishijima I, Ohtoshi A. Characterization of a novel prospero-related homeobox gene, Prox2. Mol Gen Genom. 2006;275:471–478. doi: 10.1007/s00438-006-0105-0.
  79. Bayarsaihan D, Enkhmandakh B, Makeyev A, Greally JM, Leckman JF, Ruddle FH. Homez, a homeobox leucine zipper gene specific to the vertebrate lineage. Proc Natl Acad Sci USA. 2003;100:10358–10363. doi: 10.1073/pnas.1834010100.
  80. Nelles L, Van de Putte T, van Grunsven L, Huylebroeck D, Verschueren K. Organization of the mouse Zfhx1b gene encoding the two-handed zinc finger repressor Smad-interacting protein-1. Genomics. 2003;82:460–469. doi: 10.1016/S0888-7543(03)00169-1.
  81. Mizutani Y, Kihara A, Igarashi Y. Mammalian Lass6 and its related family members regulate synthesis of specific ceramides. Biochem J. 2005;390:263–271. doi: 10.1042/BJ20050291.
  82. Pewzner-Jung Y, Ben-Dor S, Futerman AH. When do Lasses (longevity assurance genes) become CerS (ceramide synthases)?: Insights into the regulation of ceramide synthesis. J Biol Chem. 2006;281:25001–25005. doi: 10.1074/jbc.R600010200.
  83. Rao E, Weiss B, Fukami M, Rump A, Niesler B, Mertz A, Muroya K, Binder G, Kirsch S, Winkelmann M, et al. Pseudoautosomal deletions encompassing a novel homeobox gene cause growth failure in idiopathic short stature and Turner syndrome. Nat Genet. 1997;16:54–63. doi: 10.1038/ng0597-54.
  84. Blanco-Arias P, Sargent CA, Affara NA. The human-specific Yp11.2/Xq21.3 homology block encodes a potentially functional testis-specific TGIF-like retroposon. Mamm Genome. 2002;13:463–468. doi: 10.1007/s00335-002-3010-9.
  85. NCBI BLAST. http://www.ncbi.nlm.nih.gov/BLAST/
  86. McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004;32:W20–W25. doi: 10.1093/nar/gkh435.
  87. JPred. http://www.compbio.dundee.ac.uk/Software/JPred/jpred.html
  88. Ensembl Genome Browser. http://www.ensembl.org/
  89. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
  90. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  91. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704. doi: 10.1080/10635150390235520.
  92. Felsenstein J. PHYLIP: Phylogeny Inference Package (version 3.2). Cladistics. 1989;5:164–166.
  93. TreeExplorer. http://evolgen.biol.metro-u.ac.jp/TE/TE_man.html
  94. Liu M, Su M, Lyons GE, Bodmer R. Functional conservation of zinc-finger homeodomain gene zfh1/SIP1 in Drosophila heart development. Dev Genes Evol. 2006;216:683–693. doi: 10.1007/s00427-006-0096-1.
  95. Manfroid I, Caubit X, Kerridge S, Fasano L. Three putative murine Teashirt orthologues specify trunk structures in Drosophila in the same way as the Drosophila teashirt gene. Development. 2004;131:1065–1073. doi: 10.1242/dev.00977.
  96. Caubit X, Coré N, Boned A, Kerridge S, Djabali M, Fasano L. Vertebrate orthologues of the Drosophila region-specific patterning gene teashirt. Mech Dev. 2000;91:445–448. doi: 10.1016/S0925-4773(99)00318-4.

Supplementary Materials

Additional file 1

Maximum likelihood phylogenetic tree of all human plus selected protostome and cnidarian homeodomains for identification of gene families. Arbitrarily rooted phylogenetic tree of all human plus selected protostome and cnidarian homeodomains constructed using the maximum likelihood (ML) method. Bootstrap values supporting gene family designations are shown. Homeodomain sequences derived from pseudogenes are excluded. This ML tree should be compared with the neighbor-joining (NJ) tree shown in Additional file 2. The dataset used for both ML and NJ analyses includes all human homeodomains, most Drosophila melanogaster homeodomains, plus selected additional homeodomains from other protostomes or cnidarians when the gene family is divergent or absent in Drosophila. Divergent Drosophila genes that do not group with human genes were identified by construction of preliminary, non-bootstrapped ML and NJ trees, and subsequently removed from the dataset. These include genes lost from human, as well as genes known to have undergone unusually rapid evolution in Drosophila. For the Hox3 family the rapidly evolving Drosophila genes bcd, zen and zen2 were replaced with Sm Hox3b, and for the Nk4 family the rapidly evolving Drosophila gene tin was replaced with Pd NK4. In addition, six genes from other protostome or cnidarian genomes were added to represent gene families known to be missing from Drosophila (Pdx family: Ps Xlox; Alx family: Nv CART1; Dmbx family: Hv manacle; Pou1 family: Nv POU1; Hnf1 family: Nv HNF; Pknox family: Am Prep). Species abbreviations: Am, Apis mellifera (honeybee); Dm, Drosophila melanogaster (fruitfly); Hv, Hydra vulgaris (hydrozoan); Nv, Nematostella vectensis (starlet sea anemone); Pd, Platynereis dumerilii (annelid worm); Ps, Phascolion strombus (sipunculan worm); Sm, Strigamia maritima (centipede). ML performed more poorly than NJ in recovering several well known gene families, notably Hox4, Hox5, Nk4 and Alx. In contrast, ML did recover PROP1 and CG32532 as a true gene family; NJ did not.
The invertebrate gene does not always lie as a strict outgroup to all human genes in a family; this effect is expected when using a short alignment. Instead, distinct grouping of invertebrate and human genes is taken as evidence of ancestry from a single gene. A few ambiguous cases were encountered, notably divergence of Drosophila H2.0 in the proposed Hlx gene family, and resolution within the Pax4/6 gene family, which is recovered as two families in NJ but one in ML. As explained in the text, several human gene families contain 'orphan' genes without invertebrate orthologs; these are Barx, Nanog, Noto, Vax, Ventx, Argfx, Dprx, Dux, Esx, Hesx, Hopx, Isx, Leutx, Mix, Nobox, Rhox, Sebox, Tprx, Hdx, Pou5, Hmbox, Satb, Adnp and Zhx/Homez. Zeb and Mkx would be placed in this category based on our ML and NJ trees, although other data suggest that Drosophila zfh1 and CG11617, respectively, may be the protostome orthologs [73,94]. Tshz is only an apparent orphan family; the clear Drosophila ortholog simply lacks the homeobox [95,96]. Phylogenetic analysis is just one source of evidence for allocation of genes to gene families and identification of boundaries between gene families; complementary criteria used are synteny between species and paralogy within the human genome. Our ML and NJ trees should not be used to allocate gene families to gene classes, because other diagnostic characters such as insertions within the homeodomain, key amino acid residues, and several motifs outside of the homeodomain are excluded from the analysis. Indeed, artefactual mixing of the TALE and SINE classes occurs in both ML and NJ trees.

Click here for file (87.1KB, pdf)
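
The bootstrap values mentioned in these legends come from resampling alignment columns with replacement and rebuilding the tree for each replicate. The following is a minimal sketch of that general procedure, not the authors' actual pipeline; the function names `bootstrap_replicate` and `support_percent` and the toy sequences are hypothetical, and the tree-building step itself is left out.

```python
import random


def bootstrap_replicate(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Return one bootstrap replicate: columns drawn with replacement.

    `alignment` maps sequence names to equal-length aligned sequences.
    (Hypothetical helper for illustration only.)
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    # Draw as many column indices as the alignment has columns, with replacement.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support_percent(group_recovered: list[bool]) -> float:
    """Bootstrap support: percentage of replicates in which a group was recovered."""
    if not group_recovered:
        raise ValueError("no replicates supplied")
    return 100.0 * sum(group_recovered) / len(group_recovered)


# Toy alignment (not real homeodomain sequences).
toy = {"geneA": "RRKRT", "geneB": "RRARV", "geneC": "KRPRT"}
replicate = bootstrap_replicate(toy, random.Random(0))
```

In practice each replicate would be passed to a tree-building program, and `support_percent` would be applied to a record of whether a given gene family was recovered as a group in each resulting tree.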
Additional file 2

Neighbor-joining phylogenetic tree of all human plus selected protostome and cnidarian homeodomains for identification of gene families. Arbitrarily rooted phylogenetic tree of all human plus selected protostome and cnidarian homeodomains constructed using the neighbor-joining (NJ) method. Bootstrap values supporting gene family designations are shown. Homeodomain sequences derived from pseudogenes are excluded. Comparison of NJ and ML trees, and description of the dataset used, is given in the legend to Additional file 1. Several instances of artefactual mixing of classes occur in this NJ tree, notably splitting of the CUT class, mixing of the TALE and SINE classes and aberrant placement of HOPX.

Click here for file (26.1KB, pdf)
Additional file 3

Neighbor-joining phylogenetic tree of human ANTP-class homeodomains, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human ANTP-class homeodomains constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. The proposed division between the HOXL and NKL subclasses is indicated. The position of EN1 and EN2 is unstable; this tree places them close to the base of the HOXL/NKL divergence, whereas maximum likelihood analysis of the same dataset places them firmly in the NKL subclass (Figure 1). Interrelationships of genes in the Nk2.2 and Nk4 families are also unstable (in this tree and Figure 1 respectively); in these cases synteny within and between genomes clearly resolves gene families. Detailed relationships between different gene families should not be inferred from this tree.

Click here for file (23.6KB, pdf)
Additional file 4

Neighbor-joining phylogenetic tree of human PRD-class homeodomains, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human PRD-class homeodomains constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded, as are the partial homeodomains of PAX2, PAX5 and PAX8, and the HOPX homeodomain because its extremely divergent sequence destabilizes the overall tree topology. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single Dux-family gene. Detailed relationships between different gene families should not be inferred from this tree.

Click here for file (16.6KB, pdf)
Additional file 5

Neighbor-joining phylogenetic tree of human homeodomains excluding ANTP and PRD classes, for comparison to maximum likelihood tree. Arbitrarily rooted phylogenetic tree of human homeodomains excluding the ANTP and PRD classes constructed using the neighbor-joining method. Bootstrap values supporting internal nodes with over 70% are shown. Homeodomain sequences derived from pseudogenes are excluded. Roman numeral suffixes are used to distinguish multiple homeodomains encoded by a single gene. Classes and/or families are color coded as shown in the key. The LIM and ZF classes are not recovered as two distinct monophyletic groups, a result also found by maximum likelihood analysis (Figure 3). The multiple homeodomains of Zfhx-family proteins and Zhx/Homez-family proteins are also dispersed in the tree, presumably artefactually. Monophyly of the CUT class is not recovered in this tree, but is by maximum likelihood analysis (Figure 3). Detailed relationships between different gene families should not be inferred from this tree.

Click here for file (33.2KB, pdf)
Additional file 6

Multiple sequence alignment of all human plus selected protostome and cnidarian homeodomains. The consensus homeodomain sequence (shown several times for reference) was derived from a compilation of 247 human homeodomain sequences. The three horizontal lines indicate the positions of the three alpha-helices. The numbering scheme refers to amino acid position in the canonical 60-amino-acid homeodomain; insertions relative to this sequence are shown when present. Black shaded residues are invariant between all human homeodomains within each class (or family in the case of the ZF homeodomains). Sequence accession numbers are shown. For each gene family designation, maximum likelihood and neighbor-joining bootstrap support values are indicated (see Additional files 1 and 2). These values are not shown if the gene family does not form a monophyletic group in phylogenetic analyses (in which case n/a is written) or if an invertebrate homolog could not be found.

Click here for file (126.5KB, pdf)
Additional file 7

Phylogenetic input file. All human and invertebrate homeodomains used in phylogenetic analyses are shown, after alignment and removal of insertions to give a uniform 60-amino-acid alignment.

Click here for file (107.4KB, htm)
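
The legend to Additional file 7 describes removing insertions to give a uniform 60-amino-acid alignment. One way such trimming could be done is sketched below, assuming that insertions appear as gap characters in a reference (consensus) row of the alignment; the function name `trim_to_reference`, the gap symbol and the toy sequences are illustrative assumptions and are not taken from the paper.

```python
def trim_to_reference(reference: str, aligned: dict[str, str], gap: str = "-") -> dict[str, str]:
    """Drop every alignment column in which the reference row has a gap.

    Columns where the reference is gapped correspond to insertions in other
    sequences, so removing them leaves an alignment whose width equals the
    ungapped reference length. (Hypothetical helper for illustration only.)
    """
    for name, seq in aligned.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} does not match the reference alignment length")
    # Indices of columns that carry a residue in the reference row.
    keep = [i for i, residue in enumerate(reference) if residue != gap]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


# Toy example (not real homeodomain sequences): 'seq2' carries a two-residue insertion.
reference_row = "RRK--RTA"
toy_alignment = {"seq1": "RRK--RTA", "seq2": "KRPGSRTV"}
trimmed = trim_to_reference(reference_row, toy_alignment)
```

For a real homeodomain alignment the ungapped reference would be 60 residues long, so every trimmed sequence would have exactly 60 columns.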

Articles from BMC Biology are provided here courtesy of BMC
