Transcriptomics

HPA RNA-seq data overview

In total, 1206 cell lines, 40 human tissues, 193 samples from micro-dissected areas and regions of the human brain, and 18 immune cell types as well as total peripheral blood mononuclear cells (PBMC) have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples collected from the brain and retina of the animals were analyzed by RNA-seq.

Normal tissue specimens were collected with consent from patients and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank and RNA samples were extracted from frozen tissue sections.

For a total of 186 normal tissue samples, mRNA sequencing was performed on Illumina HiSeq 2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases.

Normalization of transcriptomics data

For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections.
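The pTPM rescaling step can be illustrated with a minimal Python sketch. The function name and the gene identifiers are hypothetical, and the TMM step that follows in the pipeline is not reproduced here.

```python
def rescale_to_ptpm(tpm_by_gene):
    """Rescale one sample's TPM values so that they sum to 1,000,000.

    tpm_by_gene: dict mapping a gene id to its TPM value after the
    non-coding transcripts have been removed (hypothetical input layout).
    """
    total = sum(tpm_by_gene.values())
    if total == 0:
        raise ValueError("sample has no expression signal to rescale")
    factor = 1_000_000 / total
    return {gene: value * factor for gene, value in tpm_by_gene.items()}


# Toy sample with three protein-coding genes (made-up values).
sample = {"GENE_A": 300.0, "GENE_B": 100.0, "GENE_C": 100.0}
ptpm = rescale_to_ptpm(sample)
print(ptpm)
```

After rescaling, the relative proportions between genes within the sample are unchanged; only the total is restored to one million.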

For the brain dataset, an additional normalization was performed using linear regression to correct for inter-individual variation, using the removeBatchEffect function in the R package Limma with subject as the batch parameter. To reduce the technical variation between the MGI and Illumina platforms, 19 reference samples were included and run on both platforms, and intensity normalization based on these reference samples was applied.

Consensus transcript expression levels for each gene were summarized in 51 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx. For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine), the maximum of all sub-tissues is used for the tissue type. The total number of tissue types in the human tissue consensus set is 37.
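A minimal Python sketch of the consensus rule for a single gene is given below. The function names, the toy values and the sub-tissue grouping are hypothetical; only the "maximum across sources, then maximum across sub-tissues" logic comes from the description above.

```python
def consensus_ntpm(hpa, gtex):
    """Per tissue, keep the larger nTPM of the two sources.

    A tissue present in only one source keeps that source's value.
    """
    tissues = set(hpa) | set(gtex)
    return {t: max(hpa.get(t, 0.0), gtex.get(t, 0.0)) for t in sorted(tissues)}


def collapse_subtissues(values, groups):
    """Collapse sub-tissues into tissue types using the maximum value."""
    return {tissue: max(values[s] for s in subs) for tissue, subs in groups.items()}


# Toy nTPM values for one gene (made-up numbers).
hpa = {"liver": 12.0, "amygdala": 3.0}
gtex = {"liver": 15.5, "amygdala": 2.0, "caudate": 4.5}
per_tissue = consensus_ntpm(hpa, gtex)

# Hypothetical grouping of sub-tissues into tissue types.
groups = {"liver": ["liver"], "brain": ["amygdala", "caudate"]}
per_type = collapse_subtissues(per_tissue, groups)
print(per_tissue, per_type)
```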

The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue.

Mouse and pig transcriptomic data, generated by the HPA in collaboration with BGI, were normalized separately according to the same procedure used for human tissues and cell types; no Limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region.

Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell type in different datasets were then averaged to a single aggregated value. Only clusters with medium and high reliability were included, and clusters containing mixed cell types, neutrophils and platelets were excluded.
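The two-stage aggregation can be sketched in Python as follows. The record layout and names are hypothetical, and weighting each cluster by its number of cells is an assumption, since the text does not state which weights are used.

```python
from collections import defaultdict


def aggregate_cell_types(clusters):
    """Aggregate cluster-level nTPM for one gene into one value per cell type.

    clusters: list of dicts with keys 'dataset', 'cell_type', 'n_cells', 'ntpm'
    (hypothetical layout). Weighting by cell count is an assumption.
    """
    # Stage 1: weighted mean per (dataset, cell type).
    sums = defaultdict(lambda: [0.0, 0])  # key -> [weighted sum, total cells]
    for c in clusters:
        key = (c["dataset"], c["cell_type"])
        sums[key][0] += c["ntpm"] * c["n_cells"]
        sums[key][1] += c["n_cells"]

    # Stage 2: plain mean of the per-dataset values for each cell type.
    per_cell_type = defaultdict(list)
    for (_dataset, cell_type), (weighted_sum, n_cells) in sums.items():
        per_cell_type[cell_type].append(weighted_sum / n_cells)
    return {ct: sum(v) / len(v) for ct, v in per_cell_type.items()}


clusters = [
    {"dataset": "d1", "cell_type": "T-cells", "n_cells": 100, "ntpm": 10.0},
    {"dataset": "d1", "cell_type": "T-cells", "n_cells": 300, "ntpm": 20.0},
    {"dataset": "d2", "cell_type": "T-cells", "n_cells": 50, "ntpm": 22.5},
]
aggregated = aggregate_cell_types(clusters)
print(aggregated)
```

In this toy input the first dataset yields a weighted mean of 17.5 and the second yields 22.5, so the aggregated value is their plain mean.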

Classification of transcriptomics data

The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression into two different schemas: specificity category and distribution category. These are defined based on the total set of all nTPM values in 40 tissues, 154 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types, using a cutoff value of 1 nTPM as the limit for detection across all tissues or cell types.

Explanation of the specificity category

Category	Description
Enriched	nTPM in a particular tissue/region/cell type at least four times any other tissue/region/cell type
Group enriched	nTPM in a group (of 2-5 tissues, brain regions, single cell types or cell lines, or 2-10 immune cell types) at least four times any other tissue/region/cell line/immune cell type/cell type
Enhanced	nTPM in one or several tissues, brain regions, cell lines, immune cell types or single cell types at least four times the mean of all tissues/regions/cell types
Low specificity	nTPM ≥ 1 in at least one tissue/region/cell type but not elevated in any tissue/region/cell type
Not detected	nTPM < 1 in all tissue/region/cell types

An additional category "elevated", containing all genes in the first three categories (tissue/cell line/cell type enriched, group enriched and tissue/cell line/cell type enhanced), has been used for some parts of the analysis. The TS/CS-score (Tissue Specificity/Cell Specificity score) is calculated for "elevated" tissues/cell lines as the fold change from the tissue/cell line with the highest RNA level to the tissue/cell line with the second highest RNA level.
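The specificity rules and the TS/CS-score can be sketched in Python for a single gene as shown below. This is one possible reading of the table, not the Atlas implementation: the function names are hypothetical, the group rule is interpreted as "the lowest value inside the group is at least four times the highest value outside it", group members are assumed to need nTPM ≥ 1, the default group size limit of 5 would be 10 for immune cell types, and returning infinity when the second highest value is zero is an assumption.

```python
DETECTION_CUTOFF = 1.0  # nTPM limit for detection
FOLD = 4.0              # fold threshold used by all elevated categories


def specificity_category(ntpm, max_group_size=5):
    """Classify one gene from a dict of tissue (or cell type) -> nTPM."""
    values = sorted(ntpm.values(), reverse=True)
    if values[0] < DETECTION_CUTOFF:
        return "Not detected"
    # Enriched: the top value is at least 4x every other value.
    if len(values) == 1 or values[0] >= FOLD * values[1]:
        return "Enriched"
    # Group enriched: the top `size` values are each at least 4x the rest.
    for size in range(2, min(max_group_size, len(values) - 1) + 1):
        lowest_in_group = values[size - 1]
        highest_outside = values[size]
        if lowest_in_group >= DETECTION_CUTOFF and lowest_in_group >= FOLD * highest_outside:
            return "Group enriched"
    # Enhanced: at least one value is 4x the mean over all tissues.
    if values[0] >= FOLD * (sum(values) / len(values)):
        return "Enhanced"
    return "Low specificity"


def ts_score(ntpm):
    """Fold change between the highest and second highest nTPM."""
    values = sorted(ntpm.values(), reverse=True)
    if len(values) < 2 or values[1] == 0:
        return float("inf")  # assumption: no finite fold change is defined
    return values[0] / values[1]


enriched = {"liver": 100.0, "kidney": 5.0, "lung": 2.0}
group = {"a": 40.0, "b": 30.0, "c": 2.0, "d": 1.0}
enhanced = {"t1": 10.0, "t2": 3.0, **{f"t{i}": 1.0 for i in range(3, 11)}}
low = {"a": 5.0, "b": 4.0, "c": 4.0, "d": 3.0, "e": 3.0}
for example in (enriched, group, enhanced, low):
    print(specificity_category(example))
print(ts_score(enriched))
```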

Explanation of the distribution category

Category	Description
Detected in single	Detected in a single tissue/region/cell type
Detected in some	Detected in more than one but less than one third of tissues/regions/cell types
Detected in many	Detected in at least a third but not all tissues/regions/cell types
Detected in all	Detected in all tissues/regions/cell types
Not detected	nTPM < 1 in all tissues/regions/cell types
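The distribution categories follow directly from counting the tissues in which a gene reaches the detection cutoff. A minimal Python sketch, with a hypothetical function name and toy values, could look like this:

```python
def distribution_category(ntpm, cutoff=1.0):
    """Classify one gene from a dict of tissue (or cell type) -> nTPM."""
    n_total = len(ntpm)
    n_detected = sum(1 for value in ntpm.values() if value >= cutoff)
    if n_detected == 0:
        return "Not detected"
    if n_detected == n_total:
        return "Detected in all"
    if n_detected == 1:
        return "Detected in single"
    # Integer comparison avoids rounding issues around one third.
    if 3 * n_detected < n_total:
        return "Detected in some"
    return "Detected in many"


# Nine toy tissues; the first k are above the cutoff.
def toy_profile(k, n=9):
    return {f"tissue_{i}": (5.0 if i < k else 0.2) for i in range(n)}


for k in (0, 1, 2, 3, 9):
    print(k, distribution_category(toy_profile(k)))
```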

Gene clustering of transcriptomics data

The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.

Clustering	Number of tissues, cell types or cell lines	Sample aggregation level
Tissue	78	Averaged nTPM expression per tissue type (40 HPA and 38 GTEx tissue types)
Single cell type	1175	Averaged nCPM expression per cell type cluster
Cell lines	1206	nTPM expression of individual cell lines
Immune cells	103	Averaged nTPM expression per immune cell
Brain	193	Averaged nTPM expression per brain region

Data preprocessing

For each dataset, genes with expression level > 1 in at least one of the samples were selected. The data was scaled gene-wise to Z-scores to account for differences in dynamic ranges between genes across samples. After scaling, the expression data was projected into a lower dimensional space using Principal Component Analysis (PCA), where the number of components was selected to satisfy Kaiser’s rule (eigenvalue ≥ 1) and to explain at least 80% of the total variance. Gene-to-gene distances were calculated from the Spearman correlation of gene expression across samples, transformed to Spearman distance (1 - Spearman correlation).
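The gene-wise Z-score scaling and the Spearman distance can be illustrated with a small standard-library Python sketch. The function names are hypothetical, the PCA projection is deliberately left out, and the use of the population standard deviation for the Z-scores is an assumption.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale one gene's expression vector to mean 0 and unit variance."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (vx * vy) ** 0.5


def spearman_distance(x, y):
    """1 - Spearman correlation (Pearson correlation of the ranks)."""
    return 1.0 - pearson(ranks(x), ranks(y))


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 45.0]  # same ordering as gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]      # reversed ordering
print(spearman_distance(gene_a, gene_b), spearman_distance(gene_a, gene_c))
```

The distance ranges from 0 for genes with identical sample ordering to 2 for genes with exactly opposite ordering.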

Gene clustering

Based on the distances, a k-nearest neighbors (kNN) graph was computed using 20 nearest neighbors, which was subsequently used to find clusters of similarly expressed genes via Louvain clustering. To account for the stochasticity in the Louvain algorithm, the clustering was performed 100 times. The results were later collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.
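The confidence calculation can be sketched in Python as shown below. The sketch does not run Louvain clustering itself and assumes that the cluster labels of the repeated runs have already been matched to a common set of consensus labels; how the Atlas pipeline performs that matching is not described here. The function name and the gene and cluster labels are hypothetical.

```python
from collections import Counter


def consensus_assignment(runs):
    """Collapse repeated clusterings into one label and confidence per gene.

    runs: list of dicts mapping gene -> cluster label, one dict per run,
    with labels already aligned across runs (assumption).
    Returns gene -> (most frequent label, fraction of runs with that label).
    """
    result = {}
    for gene in runs[0]:
        counts = Counter(run[gene] for run in runs)
        label, hits = counts.most_common(1)[0]
        result[gene] = (label, hits / len(runs))
    return result


# Four toy runs instead of 100.
runs = [
    {"G1": "A", "G2": "B"},
    {"G1": "A", "G2": "B"},
    {"G1": "A", "G2": "B"},
    {"G1": "B", "G2": "B"},
]
consensus = consensus_assignment(runs)
print(consensus)
```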

Cluster annotation

The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis towards biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster indicating the confidence of specificity and function assignment.

Clustering visualization

The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster. First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, the contiguous areas were converted to two-dimensional polygons for each cluster.

GTEx RNA-seq data

The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped based on RSEM v1.3.0 (v8) and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes that could be mapped from Gencode v26 to Ensembl version 109. The GTEx retina data are based on EyeGEx data from Ratnapriya et al., Nature Genetics 2019, and transcript abundance estimation was performed using Kallisto v0.48.0 with Ensembl version 109 as reference genome.

Tissue	GTEx tissue	Number of samples
Adipose tissue	Adipose - Subcutaneous	714
	Adipose - Visceral (Omentum)	587
Adrenal gland	Adrenal Gland	295
Amygdala	Brain - Amygdala	181
Blood vessel	Artery - Aorta	472
	Artery - Coronary	268
	Artery - Tibial	691
Breast	Breast - Mammary Tissue	514
Caudate	Brain - Caudate (basal ganglia)	300
Cerebellum	Brain - Cerebellar Hemisphere	277
	Brain - Cerebellum	266
Cerebral cortex	Brain - Anterior cingulate cortex (BA24)	233
	Brain - Cortex	270
	Brain - Frontal Cortex (BA9)	269
Cervix	Cervix - Ectocervix	24
	Cervix - Endocervix	23
Colon	Colon - Sigmoid	419
	Colon - Transverse	479
Endometrium	Uterus - Endometrium	27
Esophagus	Esophagus - Mucosa	614
Fallopian tube	Fallopian Tube	29
Heart muscle	Heart - Atrial Appendage	461
	Heart - Left Ventricle	452
Hippocampus	Brain - Hippocampus	255
Hypothalamus	Brain - Hypothalamus	257
Kidney	Kidney - Cortex	104
	Kidney - Medulla	11
Liver	Liver	262
Lung	Lung	604
Nucleus accumbens	Brain - Nucleus accumbens (basal ganglia)	285
Ovary	Ovary	193
Pancreas	Pancreas	362
Pituitary gland	Pituitary	313
Prostate	Prostate	282
Putamen	Brain - Putamen (basal ganglia)	254
Retina	Retina	105
Salivary gland	Minor Salivary Gland	181
Skeletal muscle	Muscle - Skeletal	818
Skin	Skin - Not Sun Exposed (Suprapubic)	651
	Skin - Sun Exposed (Lower leg)	754
Small intestine	Small Intestine - Terminal Ileum	207
Spinal cord	Brain - Spinal cord (cervical c-1)	204
Spleen	Spleen	277
Stomach	Stomach	407
Substantia nigra	Brain - Substantia nigra	183
Testis	Testis	414
Thyroid gland	Thyroid	684
Urinary bladder	Bladder	77
Vagina	Vagina	170

FANTOM5 CAGE data

The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed at RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to Ensembl version 109.

Tissue	FANTOM5 tissue	Sample description	FANTOM5 sample id
Adipose tissue	Adipose tissue	65,65,76 years, mixed	FF:10010-101C1
Amygdala	Amygdala	76 years, female	FF:10151-102I7
Appendix	Appendix	29 years, male	FF:10189-103D9
Breast	Breast	77 years, female	FF:10080-102A8
Caudate	Caudate nucleus	76 years, female	FF:10164-103B2
Cerebellum	Cerebellum	22-68 years, mixed	FF:10083-102B2
	Cerebellum	76 years, female	FF:10166-103B4
Cervix	Cervix	40,46,57,65 years, female	FF:10013-101C4
Colon	Colon	62,83,84 years, mixed	FF:10014-101C5
Corpus callosum	Corpus callosum	24-68 years, mixed	FF:10042-101F6
Ductus deferens	Ductus deferens	24 years, male	FF:10196-103E7
Endometrium	Uterus	23-63 years, female	FF:10100-102D1
Epididymis	Epididymis	24 years, male	FF:10197-103E8
Esophagus	Esophagus	68,74,75 years, mixed	FF:10015-101C6
Frontal lobe	Frontal lobe	32-61 years, mixed	FF:10040-101F4
Gallbladder	Gall bladder	57 years, male	FF:10198-103E9
Globus pallidus	Globus pallidus	76 years, female	FF:10161-103A8
	Globus pallidus	60 years, female	FF:10175-103C4
Heart muscle	Heart	70,73,74 years, mixed	FF:10016-101C7
	Left ventricle	73 years, female	FF:10078-102A6
	Left atrium	40 years, male	FF:10079-102A7
Hippocampus	Hippocampus	76 years, female	FF:10153-102I9
	Hippocampus	60 years, female	FF:10169-103B7
Insular cortex	Insula	20-68 years, mixed	FF:10039-101F3
Kidney	Kidney	60,62,63 years, female	FF:10017-101C8
Liver	Liver	64,69,70 years, mixed	FF:10018-101C9
Locus coeruleus	Locus coeruleus	76 years, female	FF:10165-103B3
	Locus coeruleus	60 years, female	FF:10182-103D2
Lung	Lung	46,65,94 years, mixed	FF:10019-101D1
	Lung - right lower lobe	29 years, male	FF:10075-102A3
Lymph node	Lymph node	30 years, male	FF:10077-102A5
Medial frontal gyrus	Medial frontal gyrus	76 years, female	FF:10150-102I6
Medial temporal gyrus	Medial temporal gyrus	76 years, female	FF:10156-103A3
	Medial temporal gyrus	60 years, female	FF:10183-103D3
Medulla oblongata	Medulla oblongata	18-64 years, mixed	FF:10038-101F2
	Medulla oblongata	76 years, female	FF:10155-103A2
	Medulla oblongata	60 years, female	FF:10174-103C3
Nucleus accumbens	Nucleus accumbens	23-56 years, mixed	FF:10037-101F1
Occipital cortex	Occipital cortex	76 years, female	FF:10163-103B1
Occipital lobe	Occipital lobe	27 years, male	FF:10076-102A4
Occipital pole	Occipital pole	22-68 years, mixed	FF:10036-101E9
Olfactory bulb	Olfactory region	87 years, female	FF:10195-103E6
Ovary	Ovary	47,75,84 years, female	FF:10020-101D2
Pancreas	Pancreas	52 years, male	FF:10049-101G4
Paracentral gyrus	Paracentral gyrus	22-69 years, mixed	FF:10035-101E8
Parietal lobe	Parietal lobe	35-89 years, mixed	FF:10034-101E7
	Parietal lobe	76 years, female	FF:10157-103A4
	Parietal lobe	60 years, female	FF:10171-103B9
Pituitary gland	Pituitary gland	76 years, female	FF:10162-103A9
Placenta	Placenta	female	FF:10021-101D3
Pons	Pons	18-54 years, mixed	FF:10033-101E6
Postcentral gyrus	Postcentral gyrus	44-52 years, mixed	FF:10032-101E5
Prostate	Prostate	73,79,93 years, male	FF:10022-101D4
Putamen	Putamen	60 years, female	FF:10176-103C5
Retina	Retina	24-65 years, mixed	FF:10030-101E3
Salivary gland	Salivary gland	16-60 years, mixed	FF:10093-102C3
	Parotid gland	23 years, male	FF:10199-103F1
	Submaxillary gland	24 years, male	FF:10202-103F4
Seminal vesicle	Seminal vesicle	24 years, male	FF:10201-103F3
Skeletal muscle	Skeletal muscle	55,79,79 years, mixed	FF:10023-101D5
	Skeletal muscle - soleus muscle	male	FF:10282-104F3
Small intestine	Small intestine	15,40,85 years, mixed	FF:10024-101D6
Smooth muscle	Smooth muscle	20-68 years, male	FF:10048-101G3
Spinal cord	Spinal cord	76 years, female	FF:10159-103A6
	Spinal cord	60 years, female	FF:10181-103D1
Spleen	Spleen	39,50,70 years, male	FF:10025-101D7
Substantia nigra	Substantia nigra	76 years, female	FF:10158-103A5
Temporal cortex	Temporal lobe	32-61 years, mixed	FF:10031-101E4
Testis	Testis	34,53,86 years, male	FF:10026-101D8
	Testis	14-64 years, male	FF:10096-102C6
Thalamus	Thalamus	76 years, female	FF:10154-103A1
Thymus	Thymus	0.5,0.5,0.83 years (infant), male	FF:10027-101D9
Thyroid gland	Thyroid	67,68,78 years, mixed	FF:10028-101E1
Tongue	Tongue	28 years, male	FF:10203-103F5
Tonsil	Tonsil	22-61 years, mixed	FF:10047-101G2
Urinary bladder	Bladder	55,58,79 years, mixed	FF:10011-101C2
Vagina	Vagina	68 years, female	FF:10204-103F6
