| Transcriptomics
HPA RNA-seq data overview
In total, 1206 cell lines, 40 human tissues, 193 samples from micro-dissected areas and regions of the human brain, and 18 immune cell types as well as total peripheral blood mononuclear cells (PBMC) have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples collected from the brain and retina of the animals were analyzed by RNA-seq. Normal tissue specimens were collected with consent from patients, and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank, and RNA samples were extracted from frozen tissue sections. For a total of 186 normal tissue samples, mRNA sequencing was performed on Illumina HiSeq 2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases.
Normalization of transcriptomics data
For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed.
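The pTPM rescaling just described can be illustrated with a minimal Python sketch. The function name `scale_to_ptpm` and the gene identifiers are hypothetical and chosen only for this example; the sketch assumes the non-coding transcripts have already been removed from the sample, so the remaining TPM values no longer sum to 1 million.

```python
def scale_to_ptpm(tpm_by_gene):
    """Rescale one sample's TPM values so that they sum to 1,000,000 (pTPM)."""
    total = sum(tpm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expression signal to rescale")
    factor = 1_000_000 / total
    return {gene: value * factor for gene, value in tpm_by_gene.items()}


# Hypothetical sample: protein-coding TPM values left after non-coding
# transcripts were removed, so they sum to 800,000 rather than 1,000,000.
sample = {"GENE_A": 300_000.0, "GENE_B": 100_000.0, "GENE_C": 400_000.0}
ptpm = scale_to_ptpm(sample)
```

Every gene in a sample is multiplied by the same factor, so the relative abundances within the sample are unchanged; only the total is restored to 1 million.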
Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections.
For the brain dataset, an additional normalization was performed using linear regression to correct for inter-individual variation, using the removeBatchEffect function in the R package limma with subject as the batch parameter. To reduce the technical variation between the MGI and Illumina platforms, 19 reference samples were included and run on both platforms. Intensity normalization based on the reference samples was conducted to minimize technical variation between the two platforms.
Consensus transcript expression levels for each gene were summarized in 51 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx. For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine), the maximum of all sub-tissues is used for the tissue type, and the total number of tissue types in the human tissue consensus set is 37.
The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue.
Mouse and pig transcriptomic data, generated by the HPA in collaboration with BGI, were normalized separately according to the same procedure used for human tissues and cell types. No limma adjustment was performed on the mouse and pig data.
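The consensus rule for human tissues described above (maximum nTPM across HPA and GTEx, then maximum across sub-tissues) can be sketched as follows. This is an illustration under stated assumptions, not the Atlas pipeline: the function name `consensus_ntpm`, the input layout, and the example values are all hypothetical.

```python
def consensus_ntpm(source_values, sub_tissue_to_tissue):
    """Return {tissue type: consensus nTPM} for a single gene.

    source_values: {source name: {sub-tissue: nTPM}}, e.g. "HPA" and "GTEx".
    sub_tissue_to_tissue: maps a sub-tissue to its consensus tissue type;
        tissues missing from the mapping are kept under their own name.
    """
    consensus = {}
    for per_sub_tissue in source_values.values():
        for sub_tissue, ntpm in per_sub_tissue.items():
            tissue = sub_tissue_to_tissue.get(sub_tissue, sub_tissue)
            # Taking the maximum over sources and then over sub-tissues is
            # the same as one running maximum over all contributing values.
            consensus[tissue] = max(consensus.get(tissue, 0.0), ntpm)
    return consensus


# Hypothetical nTPM values for one gene.
gene_values = {
    "HPA": {"liver": 12.0, "colon": 3.5, "small intestine": 7.2},
    "GTEx": {"liver": 15.5, "colon": 2.0},
}
grouping = {"colon": "intestine", "small intestine": "intestine"}
result = consensus_ntpm(gene_values, grouping)
```

In this example the liver value comes from GTEx and the intestine value from the HPA small intestine sample, because each is the largest value contributing to its tissue type.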
Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region.
Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell types in different datasets were then averaged to a single aggregated value. Only clusters with medium and high reliability were included, and clusters containing mixed cell types, neutrophils and platelets were excluded.
Classification of transcriptomics data
The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression into two different schemas: specificity category and distribution category. These are defined based on the total set of all nTPM values in 40 tissues, 154 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types, using a cutoff value of 1 nTPM as the limit for detection across all tissues or cell types.
Explanation of the specificity category
Explanation of the distribution category
Gene clustering of transcriptomics data
The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.
Data preprocessing
For each dataset, genes with expression level > 1 in at least one of the samples were selected. The data was scaled gene-wise to Z-scores to account for differences in dynamic range between genes across samples. After scaling, the expression data was projected into a lower-dimensional space using Principal Component Analysis (PCA), where the number of components was selected to satisfy Kaiser’s rule (eigenvalue ≥ 1) and to explain at least 80% of the total variance. Gene-to-gene distances were calculated from the Spearman correlation of gene expression across samples, transformed to Spearman distance (1 - Spearman correlation).
Gene clustering
Based on the distances, a k-nearest neighbors (kNN) graph was computed using 20 nearest neighbors, which was subsequently used to find clusters of similarly expressed genes via Louvain clustering. To account for the stochasticity of the Louvain algorithm, the clustering was performed 100 times. The results were then collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.
Cluster annotation
The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis against biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster, indicating the confidence of the specificity and function assignment.
Clustering visualization
The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster.
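Two of the steps described under data preprocessing and gene clustering, the Spearman distance and the assignment confidence, can be sketched in plain Python. This is a minimal sketch under assumptions: the function names are hypothetical, the text does not state how ties or the consensus collapse are handled (average ranks and the most frequent label are used here as stand-ins), and cluster labels are assumed to be already matched across the repeated Louvain runs.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_distance(x, y):
    """Return 1 - Spearman correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - cov / (var_x * var_y) ** 0.5


def assignment_confidence(labels_across_runs):
    """Return (most frequent cluster, fraction of runs that assigned it)."""
    cluster, count = Counter(labels_across_runs).most_common(1)[0]
    return cluster, count / len(labels_across_runs)


# Hypothetical expression profiles of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 25.0, 90.0]  # same ordering as gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]      # reversed ordering
d_ab = spearman_distance(gene_a, gene_b)
d_ac = spearman_distance(gene_a, gene_c)

# Hypothetical labels for one gene over 100 clustering runs.
cluster, confidence = assignment_confidence(["c1"] * 87 + ["c2"] * 13)
```

The distance ranges from 0 for identically ordered profiles to 2 for perfectly reversed ones, so genes with similar expression patterns end up close together in the kNN graph.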
First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, the contiguous areas were converted to two-dimensional polygons for each cluster.
GTEx RNA-seq data
The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped based on RSEM v1.3.0 (v8), and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes that could be mapped from Gencode v26 to Ensembl version 109. The GTEx retina data are based on EyeGEx data from Ratnapriya et al., Nature Genetics 2019, and transcript abundance estimation was performed using Kallisto v0.48.0 with Ensembl version 109 as reference genome.
FANTOM5 CAGE data
The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed in RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to Ensembl version 109.
|