Gene ID Mapping for Genotype-Tissue Expression (GTEx) Data

Nan Xiao <https://nanx.me>
Gao Wang <https://www.tigerwang.org>
Lei Sun <sunl@uchicago.edu>

Introduction

The Genotype-Tissue Expression (GTEx) project (Lonsdale et al. 2013) aims to measure human tissue-specific gene expression levels. With the collected data, we will be able to explore the landscape of gene expression, gene regulation, and their deep connections with genetic variations.

Raw GTEx data contains expression measurements from various types of elements (such as genes, pseudogenes, and noncoding DNA sequences) covering the whole genome. For some analyses, it might be desirable to keep only a subset of the data, for example, data from protein-coding genes. In such cases, mapping the original Ensembl gene IDs to Entrez gene IDs or HGNC symbols becomes an essential step in the analysis pipeline.

grex offers a minimal-dependency solution for such ID mappings. Currently, an Ensembl ID from GTEx can be mapped to its Entrez gene ID, HGNC gene symbol, and UniProt ID, with basic annotation information such as HGNC gene name, cytogenetic location, and gene type. We also limit our scope to the Ensembl IDs that appear in the gene read count data. Ensembl IDs from transcript data will be considered in future versions.

Mapping table

To facilitate such ID conversion tasks, the grex package has a built-in mapping table derived from the well-known annotation data package org.Hs.eg.db (Carlson 2015). The mapping data we used integrates mapping information from Ensembl and NCBI to maximize the chance of finding a matched Entrez ID. The R script for creating the mapping table is located here.

Not surprisingly, when creating such a table, there were hundreds of cases where a single Ensembl ID can be mapped to multiple Entrez gene IDs. To create a one-to-one mapping, we took a simple approach: for each such Ensembl ID, we removed the duplicated Entrez IDs and kept only the first one encountered in the original database. Therefore, there might be cases where the mapping is not 100% accurate. If you have doubts about particular results, please try searching for the original ID on the Ensembl website to check whether the mapped ID is correct.
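The first-match rule described above can be sketched in base R with duplicated(). This is only an illustration under stated assumptions: the toy table and every ID in it are hypothetical placeholders, and the code is not the script that was used to build the package's mapping table.

```r
# Hypothetical one-to-many mapping: one Ensembl ID appears with two Entrez IDs.
# All IDs here are placeholders, not real database records.
raw_map <- data.frame(
  ensembl_id = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  entrez_id  = c("1001", "1002", "2001", "3001"),
  stringsAsFactors = FALSE
)

# duplicated() marks the second and later occurrences of each Ensembl ID,
# so negating it keeps only the first Entrez ID encountered for each one.
one_to_one <- raw_map[!duplicated(raw_map$ensembl_id), ]

# Ensembl IDs whose extra matches were dropped and may deserve a manual check.
ambiguous <- unique(raw_map$ensembl_id[duplicated(raw_map$ensembl_id)])

print(one_to_one)
print(ambiguous)
```

Keeping a list such as ambiguous alongside the one-to-one table makes it easier to know which results are worth verifying on the Ensembl website.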

Code example

As an example, we use the Ensembl IDs from GTEx V7 gene count data and select 100 IDs:

library("grex")
# Load the Ensembl IDs from the GTEx V7 gene count data
data("gtexv7")
# Select 100 IDs
id <- gtexv7[101:200]
# Map the Ensembl IDs and attach the annotation columns
df <- grex(id)
tail(df)
#>          ensembl_id entrez_id hgnc_symbol                                   hgnc_name
#> 95  ENSG00000266075      <NA>        <NA>                                        <NA>
#> 96  ENSG00000272153      <NA>        <NA>                                        <NA>
#> 97  ENSG00000116198      9731      CEP104                     centrosomal protein 104
#> 98  ENSG00000169598      1677        DFFB       DNA fragmentation factor subunit beta
#> 99  ENSG00000198912    339448    C1orf174         chromosome 1 open reading frame 174
#> 100 ENSG00000236423 100133612   LINC01134 long intergenic non-protein coding RNA 1134
#>     cyto_loc uniprot_id   gene_biotype
#> 95      <NA>       <NA>           <NA>
#> 96      <NA>       <NA>           <NA>
#> 97   1p36.32     O60308 protein_coding
#> 98   1p36.32     B4DZS0 protein_coding
#> 99   1p36.32     Q8IYL3 protein_coding
#> 100  1p36.32       <NA>        lincRNA

The elements which cannot be mapped accurately will be NA.

Genes with a mapped Entrez ID:

filtered_genes <-
  df[
    !is.na(df$"entrez_id"),
    c("ensembl_id", "entrez_id", "hgnc_symbol", "gene_biotype")
  ]
head(filtered_genes)
#>        ensembl_id entrez_id hgnc_symbol         gene_biotype
#> 1 ENSG00000162576     54587       MXRA8       protein_coding
#> 2 ENSG00000175756     54998    AURKAIP1       protein_coding
#> 4 ENSG00000221978     81669       CCNL2       protein_coding
#> 5 ENSG00000224870    148413  MRPL20-AS1 processed_transcript
#> 6 ENSG00000242485     55052      MRPL20       protein_coding
#> 8 ENSG00000235098    441869     ANKRD65       protein_coding

If you want to start from the raw GENCODE gene IDs provided by GTEx (e.g. ENSG00000227232.4), the function cleanid() can help you remove the .version part from them to produce Ensembl IDs.
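To illustrate what removing the version suffix amounts to, here is a minimal base R sketch using a regular expression. It is not the implementation of cleanid(). It only assumes that the version is a trailing dot followed by digits, and the second ID below is made up for illustration.

```r
# Versioned GENCODE-style gene IDs; the first is the example from the text,
# the second is a made-up placeholder.
raw_id <- c("ENSG00000227232.4", "ENSG00000000001.12")

# Drop a trailing ".<digits>" version suffix, leaving the bare Ensembl ID.
ensembl_id <- sub("\\.[0-9]+$", "", raw_id)

print(ensembl_id)
```

IDs that carry no version suffix pass through this substitution unchanged.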

What’s next?

Conventionally, the next step is to remove (or impute) the genes with NA IDs and then select the genes to keep. Notably, as observed in the complete gene read count data, in about 100 cases multiple Ensembl IDs map to a single Entrez ID. Post-processing steps may also be needed for such genes.
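One possible shape for these post-processing steps is sketched below in base R. The data frame is a hypothetical stand-in that reuses the column names of the grex() output. Its rows are placeholders rather than real annotations, and keeping protein-coding genes is only one example of a selection rule.

```r
# Hypothetical mapping result with the same column names as the grex() output;
# the rows are placeholders, not real annotations.
df_toy <- data.frame(
  ensembl_id   = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  entrez_id    = c("1001", NA, "2001", "2001", "3001"),
  gene_biotype = c("protein_coding", NA, "protein_coding",
                   "protein_coding", "lincRNA"),
  stringsAsFactors = FALSE
)

# Step 1: drop the genes without a mapped Entrez ID.
mapped <- df_toy[!is.na(df_toy$entrez_id), ]

# Step 2: keep only the gene type of interest (protein-coding genes here).
# %in% treats a missing gene_biotype as "no match" instead of returning NA.
coding <- mapped[mapped$gene_biotype %in% "protein_coding", ]

# Step 3: flag Entrez IDs shared by more than one Ensembl ID.
shared_entrez <- unique(coding$entrez_id[duplicated(coding$entrez_id)])
needs_review <- coding[coding$entrez_id %in% shared_entrez, ]

print(needs_review)
```

How to resolve the flagged rows (for example, keeping one Ensembl ID per Entrez ID or aggregating their counts) depends on the downstream analysis and is left open here.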

Acknowledgements

We thank members of the Stephens lab (Kushal K Dey, Michael Turchin) for their valuable suggestions and helpful discussions on this problem.

References

Carlson, Marc. 2015. org.Hs.eg.db: Genome Wide Annotation for Human.
Lonsdale, John, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, et al. 2013. “The Genotype-Tissue Expression (GTEx) Project.” Nature Genetics 45 (6): 580–85.
