The role of genomic rearrangements, primarily segmental and tandem duplications, in recent human evolution is now well recognized. Several examples of human genes that have recently arisen by fusion of separate parental genes have been documented.1
Somatic acquisition of fusion genes is also a well-described pathogenic mechanism in hematologic and soft tissue malignancies. Typically, these fusions encode oncogenic, chimeric proteins that play a critical role in driving malignant growth. Consequently, they are excellent targets for therapy, as exemplified by the dramatic clinical effects of the tyrosine kinase inhibitor imatinib against the BCR-ABL fusion in chronic myeloid leukemia.
In hematologic malignancies, the presence of a particular gene fusion is usually signaled by the finding of a recurrent, acquired chromosomal rearrangement in bone marrow-derived metaphases. However, some fusions are cytogenetically cryptic and thus may evade detection. Although the incidence of gene fusions associated with visible karyotypic abnormalities is well established, the incidence of cryptic fusions is unknown. The initial aim of our study was a systematic search for cryptic fusion genes in patients with atypical myeloproliferative neoplasms using array comparative genomic hybridization analysis, exploiting the fact that most of these abnormalities are associated with small genomic copy number changes that interrupt the target genes. For example, FIP1L1-PDGFRA in chronic eosinophilic leukemia is formed by an 800 kb intrachromosomal deletion2 and NUP214-ABL in T-cell acute lymphoblastic leukemia is formed by a 500 kb episomal amplification.3
The use of high-resolution array comparative genomic hybridization has revealed substantial constitutional copy number variation in large stretches of the human genome4–6 and, therefore, in any analysis of malignancy it is essential to distinguish acquired copy number variants (CNV) from those that are inherited. In this study we describe the unexpected finding of a constitutional polymorphic gene fusion occurring at a relatively high frequency in apparently healthy Europeans.
Samples of bone marrow or peripheral blood from patients with atypical myeloproliferative neoplasms were analyzed by CGH array (37 samples) and by reverse transcriptase polymerase chain reaction (RT-PCR) (575 samples). All were accrued between 1999 and 2007 at the Wessex Regional Genetics Laboratory (WRGL), UK. A cohort of 120 anonymized control peripheral blood samples was obtained from healthy first-degree relatives of individuals investigated for a variety of genetic conditions at the WRGL. DNA samples from ten individuals, previously identified as having a CNV in the TFG-GPR128 region7 from data deposited at the Database of Genomic Variants (http://projects.tcag.ca/variation/), were obtained from the PopGen Project based in Kiel, Germany.8 Two PopGen population cohorts were used as controls: cohort 1 comprised 737 anonymous blood donors of German descent collected by the PopGen biobank;8 cohort 2 comprised 539 individuals with phenotypic and Affymetrix SNP 6.0 array data recruited through the Blood Service of the University Hospital Schleswig-Holstein. All individuals completed questionnaires and underwent physical examination at PopGen facilities.
Targeted Agilent arrays (Agilent Technologies, Palo Alto, CA, USA) were designed using the Agilent online array design program eArray (https://earray.chem.agilent.com/earray). An initial array targeted approximately 500 candidate genes for involvement in atypical myeloproliferative neoplasms, including a 60 kb region comprising 50 oligonucleotides centered upon TFG. A second array included 1100 oligonucleotides targeting 500 kb centered upon the TFG-GPR128 CNV. All procedures were performed according to the manufacturer’s protocols. Array data were analyzed using CGH Analytics software (Agilent Technologies).
A 3100 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA) was used for Genescan analysis and genomic sequencing. Sequences were analyzed using Mutation Surveyor (Softgenetics, State College, PA, USA) and Genescan results by Genescan and Genotype software (Applied Biosystems). Primer sequences are provided in Online Supplementary Table S1.
Fluorescence in situ hybridization (FISH) was performed using standard techniques, as previously described.9
We used Galaxy (http://main.g2.bx.psu.edu),10 an interactive web-based portal which allows users to carry out computational operations on large data sets from remote databases, to perform an in silico screen for other examples of fusion genes resulting from CNV amplifications. We chose to screen only for CNV-amplification derived fusion expressed sequence tags (EST) for two reasons: (i) fusion EST caused by CNV deletion would be mimicked by fusion EST arising from the splicing together of exons from neighboring genes by exon skipping, thereby increasing the false discovery rate; and (ii) theoretically, the chance of CNV breakpoints lying within genes and therefore potentially giving rise to fusions is the same for both amplifications and deletions, so identifying only CNV-amplification derived fusions would still give an indication of the incidence of these events. We were aided in the screen by the fact that EST such as those from TFG-GPR128, which map to two different genes, have separate entries in the University of California, Santa Cruz (UCSC) Genome Bioinformatics table all_est (http://genome.ucsc.edu) for each locus. EST from the UCSC all_est table were, therefore, filtered to select those with two entries, for which the two entries satisfied the following conditions: (i) on the same chromosome; (ii) running in the same orientation; (iii) no overlap of more than 10 bp within the EST sequence; (iv) the 5′ table entry mapping to a genomic position ‘downstream’ of the 3′ entry; and (v) mapping to genomic positions not separated by more than 5 Mb (this would include all CNV; larger rearrangements would be detectable cytogenetically) and falling within two different genes. Resulting EST were then visualized on the UCSC browser to select those mapping to exons with canonical splice sites at the fusion gene junction. An explanatory diagram is shown in Figure 1.
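The five selection conditions can be written as a short filter. The following Python sketch is illustrative only and is not the Galaxy workflow used in the study: the record type `EstAlignment`, its field names and the function name are hypothetical, positions within the EST are assumed to be already oriented 5′ to 3′ along the transcript, each alignment is assumed to be annotated with the gene it falls in, and ‘downstream’ is interpreted relative to the direction of transcription.

```python
from dataclasses import dataclass

MAX_EST_OVERLAP_BP = 10          # condition (iii)
MAX_GENOMIC_SPAN_BP = 5_000_000  # condition (v)


@dataclass(frozen=True)
class EstAlignment:
    """One table entry for part of an EST (hypothetical layout)."""
    chrom: str
    strand: str     # "+" or "-"
    est_start: int  # start within the EST, counted 5' to 3'
    est_end: int    # end within the EST (exclusive)
    gen_start: int  # genomic start (lower coordinate)
    gen_end: int    # genomic end (higher coordinate)
    gene: str       # gene containing the alignment (assumed annotation)


def is_amplification_fusion_candidate(a: EstAlignment, b: EstAlignment) -> bool:
    """Apply conditions (i)-(v) to the two entries of one EST."""
    # The entry covering the start of the EST is taken as the 5' entry.
    five, three = sorted((a, b), key=lambda x: x.est_start)

    if five.chrom != three.chrom:        # (i) same chromosome
        return False
    if five.strand != three.strand:      # (ii) same orientation
        return False
    if five.est_end - three.est_start > MAX_EST_OVERLAP_BP:  # (iii)
        return False

    # (iv) the 5' entry lies downstream of the 3' entry with respect to
    # transcription, as expected for a tandem duplication.
    if five.strand == "+":
        downstream = five.gen_start >= three.gen_end
    else:
        downstream = five.gen_end <= three.gen_start
    if not downstream:
        return False

    # (v) within 5 Mb of each other and in two different genes.
    span = max(five.gen_end, three.gen_end) - min(five.gen_start, three.gen_start)
    return span <= MAX_GENOMIC_SPAN_BP and five.gene != three.gene
```

Under these assumptions, a pair in which the 5′ entry lies upstream of the 3′ entry (the arrangement expected for read-through or exon-skipping transcripts) is rejected by condition (iv), which is the reason given above for restricting the screen to amplifications.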
To search for copy number changes associated with cryptic fusion genes, we designed an Agilent 44K custom array that targeted 500 candidate genes, including all tyrosine kinases, known gene targets of oncogenic chromosomal rearrangements plus a selection of other oncogenes and signaling molecules. Each gene was targeted with 50–75 probes to ensure sensitive detection of mosaic acquired abnormalities in a variable background of normal cells. This report focuses on the results of one of the array gene targets, TFG (TRK-fused gene).
Of 37 samples from patients with atypical myeloproliferative neoplasms that we initially analyzed, one case with hypereosinophilic syndrome showed an amplified region with a breakpoint within TFG. This gene is a known target of acquired chromosomal translocations that generate fusions with ALK, NTRK1 and NR4A3 (NOR1) in anaplastic large cell lymphoma,11 thyroid carcinoma12 and skeletal myxoid chondrosarcoma,13 respectively.
We clarified the breakpoints by hybridization of the same sample to an Agilent 244K whole genome array, which demonstrated an amplification of 111 kb with breakpoints within intron 3 of TFG and intron 1 of the proximal gene, GPR128 (Figure 2A). A tandem duplication of this genomic region would, therefore, place 5′ TFG sequences upstream of the 3′ part of GPR128 with the possibility of forming a fusion gene (Figure 2B). RT-PCR on cDNA from the same patient demonstrated expression of a fusion transcript (Figure 3A) and sequencing showed this to be in frame (Figure 3B). To identify the genomic breakpoint, we employed PCR with primers placed at several kilobase intervals within the breakpoint introns. The breakpoints were found to lie at regions of homology in TFG at chr3:101,928,847 and GPR128 at chr3:101,817,568 (UCSC hg18) (Figure 3C). FISH with BAC RP11-398O21, which entirely overlapped the amplification, on metaphases from a TFG-GPR128-positive individual showed single signals on each chromosome 3 (Figure 4), indicating that the region of amplification had not been excised and relocated to another part of the genome.
Figure 1. Relationship between gene, EST and entries in the UCSC all_est table and criteria for EST selection.
Initially, two factors pointed to TFG-GPR128 being a novel leukemia fusion gene: (i) no CNV at that time was present in the relevant databases, and (ii) the predicted fusion protein was structurally similar to known oncogenic fusions, i.e. a partner (TFG) that was a known translocation target and which contributed a coiled-coil self-association motif, fused to a gene for a protein with signaling potential (GPR128).
Figure 2. Comparative genomic hybridization (CGH) array results and proposed structure of a 111 kb amplification. (A) Hybridization to an Agilent 244K CGH array allowed breakpoints to be positioned within TFG intron 3 and GPR128 intron 1. The log2 ratio of signals within the amplification suggested two additional copies of the CNV region. (B) Non-rearranged GPR128 lies upstream of TFG (top). The CNV amplification, with breakpoints within TFG and GPR128, results in one (middle) or two (bottom) copies of the TFG-GPR128 fusion gene. Arrows show direction of transcription.
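The link between an array log2 ratio and the number of additional copies can be illustrated with a short calculation. This is a minimal sketch under idealized assumptions that are not stated in the text: a diploid reference (two copies), no mosaicism, and a linear array response (in practice, measured ratios are often compressed towards zero). The function names are hypothetical.

```python
from math import log2


def expected_log2_ratio(extra_copies: int, baseline: int = 2) -> float:
    """Ideal log2(test/reference) for a region with extra copies."""
    return log2((baseline + extra_copies) / baseline)


def implied_total_copies(log2_ratio: float, baseline: int = 2) -> float:
    """Total copy number implied by an ideal log2 ratio."""
    return baseline * 2 ** log2_ratio


one_extra = expected_log2_ratio(1)   # three copies against two
two_extra = expected_log2_ratio(2)   # four copies against two
```

Under these assumptions, one additional copy corresponds to a log2 ratio of about 0.58 and two additional copies to a log2 ratio of 1.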
A cohort of 575 cDNA samples from patients with atypical myeloproliferative neoplasms was screened by RT-PCR, thereby identifying a further seven (1.2%) TFG-GPR128-positive cases. However, a screen of 120 DNA samples from healthy control individuals, using an amplification refractory mutation system (ARMS) PCR with primers flanking the breakpoints, identified three positive individuals (2.5%), indicating that the amplification is a constitutional CNV and not an acquired mutation. Supporting evidence was obtained in one patient who presented with an ETV6-PDGFRB fusion and became ETV6-PDGFRB-negative during imatinib therapy, as determined by RT-PCR, but in whom both presentation and remission samples were found to be positive for TFG-GPR128. In addition, all TFG-GPR128-positive samples were found to share precisely the same genomic breakpoints. Constitutional DNA was not available from the other cases with atypical myeloproliferative neoplasms who were positive.
Subsequent to this analysis, a CNV at this location was identified in three large genome-wide studies7,14,15 and data deposited in the Database of Genomic Variants (http://projects.tcag.ca/variation/). In one of these studies,7 506 unrelated individuals from the population-based PopGen biobank of samples from Schleswig-Holstein, Germany were genotyped on the Affymetrix 500K single nucleotide polymorphism (SNP) array, which led to the identification of ten individuals (2.0%) with CNV overlapping the TFG-GPR128 CNV region. We analyzed DNA from these ten individuals by ARMS PCR and found all to be positive for the fusion. Taken together, the data (3/120 local controls and 10/506 PopGen samples) suggest an incidence of around 2% in the UK and German populations.
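The estimate of around 2% follows from pooling the two control series. The sketch below reproduces that arithmetic and, as an addition that is not part of the original analysis, attaches an approximate 95% Wilson score interval to indicate the uncertainty of an estimate based on 13 carriers. Pooling the UK and German samples as if drawn from one population is itself an assumption, and the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts taken from the text: 3/120 local controls, 10/506 PopGen samples.
carriers = 3 + 10
sampled = 120 + 506
pooled_frequency = carriers / sampled      # 13/626, roughly 0.02
lower, upper = wilson_interval(carriers, sampled)
```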
Although we were able to confirm the presence of the TFG-GPR128 fusion in all ten PopGen samples, the 500K SNP array data from Pinto7 placed the breakpoints in nine of the ten samples several kilobases from the breakpoints we had identified (Figure 5A). It was, therefore, possible that the TFG-GPR128 CNV lay closely adjacent to other CNV. To clarify the precise distribution of CNV in this region, we examined all ten PopGen samples using a custom 4x44K Agilent array with 1100 probes targeted to lie within a 500 kb region centered on the TFG-GPR128 CNV. Eight samples showed two additional copies and two showed a single additional copy with no evidence of any adjacent CNV (Figure 5B). Individuals with two copies of TFG-GPR128 may have either two copies of the chromosome carrying the fusion or one chromosome without the fusion plus a chromosome with two copies of the fusion. Since the frequency of TFG-GPR128 homozygotes is predicted to be about 0.0001 (the square of an allele frequency of approximately 0.01, assuming Hardy-Weinberg proportions), the latter explanation seems more likely; however, in the absence of parental DNA this remains speculative.
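The predicted homozygote frequency can be checked with a brief calculation. This sketch assumes Hardy-Weinberg proportions and takes the carrier frequency of about 2% reported above as its only input; the function name is hypothetical.

```python
from math import sqrt


def allele_frequency_from_carriers(carrier_frequency: float) -> float:
    """Solve 1 - (1 - q)**2 = carrier_frequency for the allele frequency q.

    Carriers are individuals with at least one fusion-bearing chromosome.
    """
    return 1 - sqrt(1 - carrier_frequency)


q = allele_frequency_from_carriers(0.02)  # close to 0.01
homozygote_frequency = q ** 2             # close to 0.0001
```

With homozygotes expected in roughly 1 in 10,000 individuals, finding eight among ten carriers would be very unlikely, which is why a single chromosome carrying two copies of the fusion is the more plausible explanation.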
Figure 3. Analysis of TFG-GPR128 DNA and RNA. (A) RT-PCR with primers within TFG and GPR128 identified a fusion transcript. (B) Sequencing showed TFG exon 3 to be fused in frame to GPR128 exon 2. (C) Genomic sequencing demonstrated microhomology at the breakpoint regions. Genomic coordinates according to UCSC hg18.
The identification of ten TFG-GPR128-positive individuals from the PopGen biobank allowed us to examine whether the fusion is associated with any common disease phenotype (Online Supplementary Table S2). We compared the incidence of common diseases in the ten TFG-GPR128-positive individuals with that in 737 controls and an apparent excess of neuropathies (stroke, migraine, Parkinson’s disease, essential tremor, epilepsy or other unclassified neuropathy) was observed (5/10 TFG-GPR128-positive individuals versus 110/737 controls; P=0.01, Fisher’s exact test, two-tailed, uncorrected for multiple testing). However, this association could not be confirmed in a second cohort of 539 individuals for whom Affymetrix SNP 6.0 data were available. Using conservative criteria to avoid false positives, i.e. with amplification breakpoints closely overlapping the known CNV breakpoints, five TFG-GPR128-positive cases were identified, none of whom was classed as having neuropathies. Notably, there was no association of TFG-GPR128 positivity and malignancy; however, this does not exclude the possibility of a low penetrance predisposition to malignancy or other clinical phenotype.
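The comparison of 5/10 carriers with 110/737 controls can be re-derived from the counts alone. The sketch below is a plain implementation of a two-tailed Fisher’s exact test using only the Python standard library; it is not the software used in the study. It applies the common convention of summing the probabilities of all tables that are no more likely than the observed one, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    total_tables = comb(n, row1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k affected individuals in row 1.
        return comb(col1, k) * comb(n - col1, row1 - k) / total_tables

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: TFG-GPR128-positive, controls; columns: affected, unaffected.
p_value = fisher_exact_two_sided(5, 5, 110, 737 - 110)
```

Because the value is uncorrected for multiple testing, as stated above, it is best read as hypothesis-generating, in keeping with the failure to replicate in the second cohort.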
Figure 4. FISH on TFG-GPR128-positive cells with BAC RP11-398O21 (which entirely contains the CNV) and a chromosome 3 painting probe showing a single region of hybridization, indicating that the amplified region has not been excised and relocated to another chromosome.
To investigate the origin of TFG-GPR128, we examined markers within the CNV. A microsatellite at chr3:101,840,545-101,840,591 (UCSC hg18) was particularly informative, showing an allele with 12 TG dinucleotide repeats in all 19 TFG-GPR128-positive individuals examined, whereas only 2/52 controls without TFG-GPR128 had a 12-repeat allele. To investigate the haplotype structure of the CNV region, a linkage disequilibrium map of the TFG-GPR128 region was constructed using the LDMAP program16 and data from the 539 samples (534 controls and five TFG-GPR128-positive cases) that had Affymetrix SNP 6.0 genotypes. A core region of high linkage disequilibrium and relatively low haplotype diversity was determined as spanning 0.5 linkage disequilibrium units either side of the TG repeat. This region covered about 250 kb and contained 71 SNP between rs6777810 (chr3:101,729,832, UCSC hg18) and rs9850273 (chr3:101,982,055, UCSC hg18). Haplotype analysis was undertaken with PHASE,17 which identified 121 distinct haplotypes, with 15 having ten or more copies in the sample, collectively accounting for 81% of the assigned haplotypes. The five TFG-GPR128-positive cases were all found to share the fourth most common haplotype, which was also found in 95/534 (17.8%) of controls (Online Supplementary Table S3). The sharing of an extended haplotype spanning more than 250 kb in the cases is consistent with a single ancestral origin, as indicated by microsatellite analysis.
Figure 5. Variability of the TFG-GPR128 CNV region. (A) Previously published SNP array data demonstrated apparently variable CNV boundaries (from the Database of Genomic Variants). (B) Hybridization of ten TFG-GPR128-positive samples identified by Pinto7 to a custom Agilent 44K array (1100 probes targeted to lie within a 500 kb region centered on the TFG-GPR128 CNV) showed identical breakpoints with either two copies (left-hand array, eight cases) or one copy (right-hand array, two cases) of TFG-GPR128 with no evidence of adjacent CNV. Signal intensities are plotted on the x-axis as log2 ratio.
Table 1. Fusion EST identified by an in silico search.
Considerable diversity in CNV distribution has been described both within and between populations.15 A study of the original 270 HapMap samples (30 Yoruba trios from Nigeria; 30 trios of European descent from Utah, USA (CEPH); 45 unrelated Japanese from Tokyo, Japan and 45 unrelated Han Chinese from Beijing, China) using Affymetrix 500K SNP arrays5 did not find any examples of the TFG-GPR128 CNV. Because of the potential variability in CNV calls highlighted by the analysis of PopGen samples, we reanalyzed these data [using data deposited at the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), accession GSE5173], focusing specifically on the TFG-GPR128 region, and confirmed the absence of the CNV. In one worldwide study of 29 human populations comprising a total of 485 individuals,15 two TFG-GPR128-associated CNV-positive individuals were identified, one from a Middle Eastern Bedouin population (n=47) and a second from a European Basque population (n=13). In the study by Zogopoulos,14 of a total of 1190 control individuals, six TFG-GPR128 CNV-positive individuals, all of Caucasian origin, were identified. All TFG-GPR128 cases identified so far are, therefore, of European or Middle Eastern ancestry. In our haplotype analysis, the SNP rs9873555 was identified as tagging the extended haplotype common to the TFG-GPR128-positive cases and several minor haplotypes. From International HapMap Data (http://www.hapmap.org), the minor allele frequencies for rs9873555 in Japanese, Han Chinese, Yoruba and CEPH populations are 0, 0.006, 0.137 and 0.161, respectively, therefore providing further evidence that the background upon which TFG-GPR128 has arisen is restricted. However, the possibility that TFG-GPR128 might also have arisen independently in other populations cannot be excluded.
We sought to determine whether other CNV give rise to fusion genes. We used an in silico strategy to search for chimeric mRNA associated with known constitutional genomic copy number gains similar to the TFG-GPR128 CNV. A search of the UCSC Genome Bioinformatics database table all_est (http://genome.ucsc.edu) using Galaxy (http://main.g2.bx.psu.edu) and the strategy outlined in Figure 1 resulted in the identification of 122 fusion EST. These were examined manually using the UCSC browser to select those chimeric EST entirely overlapping with exons and with canonical splice sites at the junction of the fusion partner genes, i.e. formed by precise fusion of intact exons from the two genes. This resulted in 26 fusion EST from 19 different fusions, including five EST from TFG-GPR128 (Table 1), which confirmed the validity of our method. An additional search using the NCBI Basic Local Alignment Search Tool (BLAST) (http://blast.ncbi.nlm.nih.gov) with each fusion EST identified two further TFG-GPR128 EST and a single additional MSH5-AIF1 fusion EST. None of the EST fully overlapped with known CNV but three showed partial overlaps: CD44-PDHX, NRG4-SCAPER and SLC38A7-GOT2. Currently there is no clear evidence that these are CNV-derived rather than examples of trans-splicing or technical artifacts.
We describe here the finding of a polymorphic in-frame gene fusion, TFG-GPR128, associated with a CNV. The fusion appears to have been derived from a single ancestral event and, in a limited analysis, is not associated with any obvious clinical phenotype. The precise normal cellular roles of TFG and GPR128 are unclear. GPR128 is an orphan member of the adhesion subfamily of G-protein coupled receptors, which are characterized by a long serine/threonine-rich N-terminus,18 most of which is predicted to be retained in the TFG-GPR128 fusion protein. TFG has SH2 and SH3 binding motifs plus an N-terminal coiled-coil domain that may mediate self-association. The C. elegans homolog of TFG suppresses apoptosis and is required for normal cell-size control.19 Furthermore, a role for TFG in signaling is supported by the observations that TFG interacts with PTEN,20 the NF-κB pathway proteins NEMO (IKBKG) and TANK21 and regulates SHP-1 (PTPN6) activity.21 From NCBI Unigene EST profile data (http://www.ncbi.nlm.nih.gov/UniGene), expression of GPR128 appears to be restricted to the adrenal gland, intestine, kidney, liver, lung, placenta, skin, stomach and testis, whereas TFG is strongly and ubiquitously expressed. Consequently, the TFG-GPR128 fusion protein, if indeed it is translated from the chimeric mRNA, would be expected to have similar expression to TFG and potentially may affect signaling in diverse tissues. Although we found no obvious clinical phenotype associated with the presence of TFG-GPR128, it is possible that relatively subtle changes may emerge on more detailed analysis, for example once the normal functions of TFG and GPR128 are better understood.
It is remarkable that TFG is also the target of acquired, oncogenic translocations in malignancy, which might suggest a common mechanism of fusion gene formation. Currently, however, there is no evidence for this: the TFG breakpoint in TFG-GPR128 occurs in intron 3 while TFG-ALK breakpoints occur in introns 3, 4 and 6,22 and TFG-NOR1 in intron 7.13 Moreover, although we found microhomology at the TFG and GPR128 breakpoints, which suggests formation by non-allelic homologous recombination, no evidence of homology was reported at the breakpoint sites in TFG-ALK fusion genes.22 It seems likely that involvement of TFG in both oncogenic and non-oncogenic fusion genes is coincidental, although it is possible that some unknown local structural feature may increase the probability of translocations at the TFG locus.
Gene fusions are well known in evolution and may provide the opportunity for the acquisition of novel functions. Examples include POMZP3, formed by fusion of the POM121 membrane glycoprotein and ZP3 (zona pellucida glycoprotein 3) 3–5 million years ago,23 USP6, a hominoid-specific gene formed by fusion of USP32 and TBC1D3 approximately 21–33 million years ago,24 and LRTOMT, a candidate catechol-O-methyltransferase that is mutated in non-syndromic deafness.25 The finding of the TFG-GPR128 polymorphism may thus be viewed as an intermediate step in evolution that could theoretically culminate in fixation or elimination.
Given the high frequency of CNV in the human genome, we searched for other polymorphic gene fusions in EST databases. Although we failed to find any other examples, EST are derived predominantly from the 5′ and 3′ ends of genes and thus fusions involving the central regions of genes may be under-represented. High throughput transcriptome analysis using sequencing26 or paired end ditag analysis27 on a population basis may yield further examples of polymorphic gene fusions. Indeed, paired end ditag analysis of two cancer cell lines yielded a surprisingly large number of candidate chimeric transcripts that may potentially have been acquired or inherited.27
In summary, the work presented here describes a further example of genomic complexity at the population level which may have implications for understanding human evolution. Our work also emphasizes the importance of careful discrimination of oncogenic changes found in tumor samples from non-pathogenic normal variation.