Immunohistochemistry - tissues

The Human Protein Atlas contains images of histological sections from normal and cancer tissues obtained by immunohistochemistry. Antibodies are labeled with DAB (3,3'-diaminobenzidine), and the resulting brown staining indicates where an antibody has bound to its corresponding antigen. The section is also counterstained with hematoxylin to enable visualization of microscopic features. Tissue microarrays are used to show antibody staining in samples from 144 individuals corresponding to 44 different normal tissue types, and in samples from 216 cancer patients corresponding to 20 different types of cancer (movie about tissue microarray production and immunohistochemical staining). Each sample is represented by 1 mm tissue cores, resulting in a total of 576 images for each antibody. Normal tissues are represented by samples from three individuals each, one core per individual, except for endometrium, skin, soft tissue and stomach, which are represented by samples from six individuals each, and parathyroid gland, which is represented by one sample. Protein expression is annotated in 76 different normal cell types present in these tissue samples. For cancer tissues, two cores are sampled from each individual and protein expression is annotated in tumor cells. A small fraction of the 576 images is missing for most antibodies due to technical issues. Specimens containing normal and cancer tissue have been collected and sampled from anonymized paraffin-embedded material of surgical specimens, in accordance with approval from the local ethics committee. For selected proteins, extended tissue profiling is performed in addition to standard tissue microarrays. Examined tissues include mouse brain, human lactating breast, eye, thymus and extended samples of adrenal gland, skin and brain.
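The total of 576 images per antibody follows from the sample counts quoted above. The short check below is only an illustration of that arithmetic; the variable names are made up for the example, and it assumes one image per tissue core.

```python
# Figures quoted in the text; variable names are illustrative only.
normal_individuals = 144          # one core sampled per individual
cancer_patients = 216             # two cores sampled per patient
cores_per_normal_individual = 1
cores_per_cancer_patient = 2

# Assumption: each tissue core yields exactly one image.
normal_images = normal_individuals * cores_per_normal_individual
cancer_images = cancer_patients * cores_per_cancer_patient
total_images = normal_images + cancer_images

print(normal_images, cancer_images, total_images)  # 144 432 576
```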
Since specimens are derived from surgical material, normal is here defined as non-neoplastic and morphologically normal. It is not always possible to obtain fully normal tissues, and thus several of the tissues denoted as normal will include alterations due to inflammation, degeneration and tissue remodeling. In rare tissues, hyperplasia or benign proliferations are included as exceptions. It should also be noted that within normal morphology there may exist interindividual differences and variations due to primary diseases, age, sex etc. Such differences may also affect protein expression and thereby immunohistochemical staining patterns. Samples from cancer are also derived from surgical material. Due to subgroups and heterogeneity of tumors within each cancer type, included cases represent a typical mix of specimens from surgical pathology. The inclusion of tumors is based on availability and representativeness; however, an effort has been made to include high and low grade malignancies where applicable. In certain tumor groups, subtypes have been included: for example, breast cancer includes both ductal and lobular cancer, lung cancer includes both squamous cell carcinoma and adenocarcinoma, and liver cancer includes both hepatocellular and cholangiocellular carcinoma. Tumor heterogeneity and interindividual differences may be reflected in diverse expression of proteins, resulting in variable immunohistochemical staining patterns.

Annotation

In order to provide an overview of protein expression patterns, all images of tissues stained by immunohistochemistry are manually annotated by a specialist, followed by verification by a second specialist. Annotation of each normal and cancer tissue is performed using fixed guidelines for classification of immunohistochemical results. Each tissue is examined for representativeness, and immunoreactivity in the different cell types present in normal or cancer tissues is then annotated. Basic annotation parameters include an evaluation of i) staining intensity (negative, weak, moderate or strong), ii) fraction of stained cells (<25%, 25-75% or >75%) and iii) subcellular localization (nuclear and/or cytoplasmic/membranous). The manual annotation also provides two summarizing texts describing the staining pattern for each antibody in normal tissues and in cancer tissues.
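The three basic annotation parameters can be pictured as a small record with a fixed vocabulary. The following is a minimal sketch, not the Human Protein Atlas data model: the class name, field names, and the handling of the exact 25% and 75% boundaries are assumptions made for illustration.

```python
from dataclasses import dataclass

# Category labels taken from the text above.
INTENSITIES = ("negative", "weak", "moderate", "strong")
FRACTIONS = ("<25%", "25-75%", ">75%")
LOCATIONS = ("nuclear", "cytoplasmic/membranous")


def fraction_bin(percent_stained: float) -> str:
    """Map a percentage of stained cells to one of the three categories.

    Assumption: values of exactly 25 and 75 fall into the middle bin.
    """
    if not 0 <= percent_stained <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_stained < 25:
        return "<25%"
    if percent_stained <= 75:
        return "25-75%"
    return ">75%"


@dataclass(frozen=True)
class StainingAnnotation:
    """Hypothetical record for one annotated cell type in one tissue."""
    cell_type: str
    intensity: str
    fraction: str
    locations: tuple  # zero or more entries from LOCATIONS

    def __post_init__(self):
        # Reject values outside the fixed vocabulary.
        if self.intensity not in INTENSITIES:
            raise ValueError(f"unknown intensity: {self.intensity}")
        if self.fraction not in FRACTIONS:
            raise ValueError(f"unknown fraction: {self.fraction}")
        for location in self.locations:
            if location not in LOCATIONS:
                raise ValueError(f"unknown location: {location}")


example = StainingAnnotation(
    cell_type="hepatocytes",            # illustrative cell type
    intensity="moderate",
    fraction=fraction_bin(60),
    locations=("cytoplasmic/membranous",),
)
print(example)
```

Restricting each field to the listed categories mirrors the idea of fixed annotation guidelines: a value outside the vocabulary is rejected rather than stored.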
The terminology and ontology used are compliant with standards used in pathology and medical science. SNOMED classification is used for assignment of topography and morphology. SNOMED classification also underlies the original diagnosis given for the material from which normal as well as cancer samples were collected.
A histological dictionary used in the annotation is available as a PDF document, containing images stained by immunohistochemistry using antibodies included in the Human Protein Atlas. The dictionary displays subtypes of cells distinguishable from each other and also shows specific expression patterns in different intracellular structures. Annotation dictionary: screen usage (15 MB), printing (95 MB).

Knowledge-based annotation

Knowledge-based annotation aims to create a comprehensive overview of protein expression patterns in normal human tissues. This is achieved by stringent evaluation of the immunohistochemical staining pattern, RNA-seq data from internal and external sources, and available protein/gene characterization data, with special emphasis on RNA-seq. Annotated protein expression profiles are created using single antibodies as well as independent antibodies (two or more antibodies directed against different, non-overlapping epitopes on the same protein). For independent antibodies, the immunohistochemical data from all the different antibodies are taken into consideration. The immunohistochemical staining pattern in normal tissues is subjectively annotated according to strict guidelines, based on an experienced evaluation of positive immunohistochemical signals in the 76 normal cell types analyzed. The review also takes suboptimal experimental procedures and interindividual variations into consideration.
The final annotated protein expression is considered a best estimate and as such reflects the most probable histological distribution and relative expression level for each protein. To enable a protein expression profile, one or several of the following additional data sources are necessary: i) an independent antibody targeting another epitope of the same protein, ii) RNA-seq data, and iii) available protein/gene characterization data. The result of the knowledge-based annotation is considered inconclusive when the information available at the time of analysis is judged insufficient for verification of the staining pattern and estimation of the expected protein expression. The knowledge-based protein expression profiles are created using fixed guidelines on evaluation and presentation of the resulting expression profiles. Standardized explanatory sentences are used when necessary to provide additional information required for full understanding of the expression profile. A reliability score of Enhanced, Supported, Approved, or Uncertain is set for each annotated protein expression profile based on evaluation of all available data.

Reliability score

A reliability score is manually set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on knowledge-based evaluation of available RNA-seq data, protein/gene characterization data and immunohistochemical data from one or several antibodies designed towards non-overlapping sequences of the same gene. The reliability score is based on the 44 normal tissues analyzed, and is displayed on both the Tissue Atlas and the Pathology Atlas.

The reliability score is divided into Enhanced, Supported, Approved, or Uncertain. If there is available data from more than one antibody, the staining patterns of all antibodies are taken into consideration during the evaluation of the reliability score.

Enhanced
One or several antibodies targeting non-overlapping sequences of the same gene have obtained enhanced validation based on either orthogonal or independent antibody validation methods.

Supported
If one of the following criteria is fulfilled:

  • At least one antibody shows high or medium consistency between RNA levels and staining pattern, but the antibody does not qualify for Orthogonal validation, and staining pattern is consistent with valid literature, or there is no valid literature available
  • At least one antibody has RNA consistency defined as “Cannot be evaluated”, and staining pattern is consistent with valid literature
  • Paired antibodies (several antibodies targeting non-overlapping sequences) show similar staining pattern, but the antibodies do not qualify for Independent antibody validation, and staining pattern is consistent with valid literature, or there is no valid literature available

Approved
If one of the following criteria is fulfilled:

  • At least one antibody shows high or medium consistency between RNA levels and staining pattern, and staining pattern is inconsistent with valid literature
  • At least one antibody shows low consistency between RNA levels and staining pattern, and staining pattern is consistent with valid literature
  • At least one antibody has RNA consistency defined as “Cannot be evaluated”, and staining pattern is partly consistent with valid literature, or consistent with limited literature
  • Paired antibodies show partly similar expression patterns

Uncertain
If one of the following criteria is fulfilled:

  • Only multi-targeting antibodies are available. Multi-targeting antibodies are used for genes where it was not possible to generate single-targeting antibodies due to high sequence identity among proteins belonging to different genes. These genes are in many cases closely related and belong to known gene families, and in these cases a multi-targeting antibody was produced that has >80% sequence identity to transcripts of the genes belonging to the family and low sequence identity to the transcripts of all other human genes.
  • At least one antibody shows low or very low consistency between RNA and staining pattern, or RNA consistency is defined as “Cannot be evaluated”, and staining pattern is inconsistent with valid literature, or there is no valid literature available
  • Paired antibodies show dissimilar expression patterns
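The criteria above can be read as a decision table. The sketch below covers only the single-antibody criteria and is an interpretation for illustration, not the procedure used by the Human Protein Atlas, where the score is set manually. The function name, the string labels, the order in which the rules are checked, and the `None` result for combinations the text does not mention are all assumptions.

```python
from typing import Optional

RNA_LEVELS = {"high", "medium", "low", "very low", "cannot be evaluated"}
LITERATURE = {"consistent", "partly consistent", "limited", "inconsistent", "none"}


def single_antibody_reliability(
    enhanced_validation: bool,
    multi_targeting_only: bool,
    rna_consistency: str,
    literature: str,
) -> Optional[str]:
    """Hypothetical reading of the single-antibody criteria.

    Returns None when the combination is not covered by the listed criteria.
    Paired-antibody criteria are deliberately left out of this sketch.
    """
    if rna_consistency not in RNA_LEVELS:
        raise ValueError(f"unknown RNA consistency: {rna_consistency}")
    if literature not in LITERATURE:
        raise ValueError(f"unknown literature status: {literature}")

    # Assumption: enhanced validation is checked before anything else.
    if enhanced_validation:
        return "Enhanced"
    if multi_targeting_only:
        return "Uncertain"

    if rna_consistency in ("high", "medium"):
        if literature in ("consistent", "none"):
            return "Supported"
        if literature == "inconsistent":
            return "Approved"
        return None

    if rna_consistency == "cannot be evaluated":
        if literature == "consistent":
            return "Supported"
        if literature in ("partly consistent", "limited"):
            return "Approved"
        return "Uncertain"  # inconsistent or no valid literature

    # Remaining cases: low or very low RNA consistency.
    if literature in ("inconsistent", "none"):
        return "Uncertain"
    if rna_consistency == "low" and literature == "consistent":
        return "Approved"
    return None


print(single_antibody_reliability(False, False, "medium", "none"))
```

Returning `None` for unlisted combinations keeps the sketch from guessing a score the criteria do not define.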

Multiplex immunohistochemistry/IF - tissues

As part of the Tissue Atlas resource, multiplex immunohistochemistry (mIHC)/IF data are generated by staining tissue microarrays of histological sections from normal tissues. The mIHC/IF tissue data consist of high-resolution, 6-plex images of proteins labeled by indirect mIHC. They complement conventional IHC by providing spatial information on protein expression patterns related to distinct single cells and cell types, or even cellular states and histological and biological structures embedded in the tissue.

As in conventional IHC, primary antibodies in mIHC/IF are first labeled with secondary antibodies coupled to horseradish peroxidase (HRP) or a similar enzyme. The method then uses tyramide signal amplification (TSA), in which HRP catalyzes the deposition of fluorescent tyramide molecules, creating a fluorescent precipitate on and proximal to the binding site. The ability to run several staining-stripping cycles allows for tissue sections with up to 6 labeled proteins per slide. Lastly, the slides are counterstained with DAPI (4′,6-diamidino-2-phenylindole). In this setup, tissue microarrays consisting of duplicate 1 mm cores from three patients are used to profile each protein.

Annotation

The protein localization is manually annotated by estimating the fraction of cells in which the target of interest overlaps with the panel antibodies and, when applicable, by annotating its subcellular localization. For each slide, the tissue cores are also examined for representativeness. The annotation parameters include an evaluation of i) fraction of cells with expression of the unknown protein that overlap with panel markers (<25%, 25-75% or >75%), and ii) subcellular localization (nuclear and/or cytoplasmic/plasma membrane/membrane) of the staining. The manual annotation also provides two summarizing texts describing the staining pattern for each antibody. The marker proteins, targeted by the panel antibodies, may be limited in their ability to label all cells of the intended cell type/structure, as defined in the literature.


Cilia panel

The panel for ciliated cells was developed with the aim to study the spatial protein expression of cilia proteins. For each unknown protein, the antibody targeting the protein is labeled with the available TSA-fluorophore (OPAL 520) not occupied by the marker proteins.

Cilia panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Cilia cell body | AGR3 | HPA053942 | OPAL570 | Cyan
Cilia cell nucleus | FOXJ1 | HPA005714 | OPAL780 | Magenta
Basal body | CROCC | HPA021191 | OPAL690 | Red
Cilia transition zone | NPHP4 | HPA065526 | OPAL480 | Yellow
Cilia axoneme | DNAH9 | HPA052641 | OPAL620 | White
Empty slot | Unknown protein of interest | - | OPAL520 | Green
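The "empty slot" is simply the one fluorophore not taken by a marker protein. The sketch below shows that bookkeeping for the cilia panel; the set of six OPAL labels is read off the table, and the function and variable names are hypothetical.

```python
# The six fluorescent labels that appear in the panel table.
OPAL_FLUOROPHORES = {
    "OPAL480", "OPAL520", "OPAL570", "OPAL620", "OPAL690", "OPAL780",
}

# Marker protein -> fluorescent label, copied from the cilia panel table.
CILIA_PANEL = {
    "AGR3": "OPAL570",
    "FOXJ1": "OPAL780",
    "CROCC": "OPAL690",
    "NPHP4": "OPAL480",
    "DNAH9": "OPAL620",
}


def free_fluorophores(panel: dict) -> list:
    """Return the labels not occupied by any marker protein in the panel."""
    used = list(panel.values())
    if len(used) != len(set(used)):
        raise ValueError("two markers share the same fluorescent label")
    return sorted(OPAL_FLUOROPHORES - set(used))


print(free_fluorophores(CILIA_PANEL))  # ['OPAL520']
```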


Kidney panel

For kidney, an antibody panel was developed to characterize the spatial localization of kidney proteins, mainly in renal tubules but also in podocytes. An endothelial cell marker was also added to distinguish non-podocytes in the glomerular compartment. For each unknown protein, the antibody targeting the protein is labeled with the available TSA-fluorophore (OPAL 520) not occupied by the marker proteins.

Kidney panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Collecting ducts | AQP2 | HPA046834 | OPAL690 | Cyan
Distal tubules | CASR | HPA039686 | OPAL570 | Red
Proximal tubules | ACSM2A/B | HPA057699 | OPAL620 | White
Podocytes | PTPRO | HPA034525 | OPAL480 | Yellow
Endothelial cells | CD34 | HPA036722 | OPAL780 | Magenta
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Salivary gland panel

The antibody panel for salivary gland was generated to profile the different glandular tissues (serous and mucus glands) and ductal structures (small ducts, large ducts and ionocytes). For each unknown protein, the antibody targeting the protein is labeled with the available TSA-fluorophore (OPAL 520) not occupied by the marker proteins.

Salivary gland panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Serous acini | LPO | HPA028688 | OPAL480 | Cyan
Mucus acini | MUC5B | CAB009396 | OPAL780 | White
Small ducts | SLC13A2 | HPA014963 | OPAL690 | Magenta
Large ducts | ATP6V1B1 | HPA031847 | OPAL570 | Red
Ionocytes | FOXI1 | HPA071469 | OPAL620 | Yellow
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Testis panels

For testis, four panels have been developed with the aim of i) capturing the transition of spermatogonial stem cells to preleptotene spermatocytes (Spermatogonia panel), ii) identifying the expression of proteins during spermatocyte differentiation and meiosis (Spermatocytes panel), iii) characterizing the proteins during sperm transformation, a process called spermiogenesis (Spermatids panel), and iv) mapping out Sertoli-specific proteins (Sertoli cells panel). For each unknown protein, the antibody targeting the protein is labeled with the available TSA-fluorophore (OPAL 520) not occupied by the marker proteins.

Spermatogonia panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Spermatogonia 0 | UTF1 | CAB022384 | OPAL480 | Yellow
Spermatogonia 1 | IRF2BPL | HPA050862 | OPAL620 | White
Spermatogonia 2-3 | DMRT1 | HPA027850 | OPAL690 | Cyan
Spermatogonia 4 | CTCFL | HPA001472 | OPAL780 | Magenta
Spermatocytes 1 | BEND2 | HPA013142 | OPAL570 | Red
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Spermatocytes panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Spermatocytes 1 | HELLS | HPA063242 | OPAL480 | Yellow
Spermatocytes 2 | SCML1 | HPA035270 | OPAL690 | Cyan
Spermatocytes 3 | TCFL5 | HPA076419 | OPAL780 | Magenta
Spermatids early | SUN5 | HPA048529 | OPAL620 | White
Spermatids late | PRM1 | HPA055150 | OPAL570 | Red
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Spermatids panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Spermatids early 1 | LYAR | HPA035881 | OPAL780 | Magenta
Spermatids early 2 | OLAH | HPA037948 | OPAL690 | Cyan
Spermatids late 1 | C3 | HPA020432 | OPAL480 | Yellow
Spermatids late 2 | SPATA24 | HPA044000 | OPAL570 | Red
Spermatids late 3 | TPPP2 | HPA004120 | OPAL620 | White
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Sertoli cells panel

Cell type | Marker protein | Antibody | Fluorescent label | Pseudo-color
Sertoli cytoplasm | DIAPH2 | CAB015461 | OPAL570 | Red
Sertoli membrane | CD99 | CAB000020 | OPAL690 | White
Sertoli nuclei | HMGN5 | HPA000511 | OPAL780 | Magenta
Spermatogonia and spermatocytes | DDX4 | HPA037764 | OPAL620 | Cyan
Spermatids | SPACA1 | HPA043297 | OPAL480 | Yellow
Empty slot | Unknown protein of interest | - | OPAL520 | Green


Data reliability

For each antibody and protein, an internal reliability assessment is performed to ensure high quality data before release. The antibody staining pattern of the unknown protein is always reviewed against its corresponding conventional IHC staining pattern for reproducibility, and against available tissue and single-cell RNA-seq data and protein/gene characterization data. This assessment should not be confused with the Reliability scoring performed for the tissue-wide analysis. The reproducibility of the panel marker proteins is also assessed to ensure high quality of the annotation.


Immunohistochemistry/IF - mouse brain

As a complement to the immunohistochemically stained tissues, the Human Protein Atlas also includes the mouse brain atlas as a subcompartment of the normal tissue atlas, in which comprehensive profiles of the mouse brain are available. A selected set of targets has been analyzed by applying the antibodies to serial sections of mouse brain covering 129 areas and subfields of the brain, several of which are difficult to cover in the human brain. In addition, pituitary, retina and trigeminal ganglia are included in recent and future image series but are not yet annotated.

The tissue microarray method used within the Human Protein Atlas enabled the global mapping of proteins in the human body, including the brain. Currently, the human tissue atlas covers four areas of the human brain: cerebral cortex, hippocampus, caudate and cerebellum. Due to the heterogeneous structure of the brain, with many nuclei and cell types organized in complex networks, it is difficult to achieve a comprehensive overview in a 1 mm tissue sample. Analysis of more human brain samples, including smaller brain nuclei, is thus desirable in order to generate a more detailed map of protein distribution in the brain. We have therefore complemented the human brain atlas effort with a more comprehensive analysis of the mouse brain. A series of mouse brain sections is explored for protein expression and distribution in a large number of brain regions.

Antibodies are selected against proteins involved in normal brain physiology, brain development and neuropathological processes. A limit of 60% homology (human vs mouse) is used as the cut-off when comparing the PrEST sequence for the antibody targets.
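To make the 60% cut-off concrete, the sketch below computes percent identity between two pre-aligned sequences and compares it with the threshold. The text does not say how homology is calculated, so this is only one plausible reading: it assumes identity is counted over aligned columns, that gaps are written as "-", and that 60% is a minimum. The sequences are toy strings, not real PrEST sequences.

```python
HOMOLOGY_CUTOFF_PERCENT = 60.0  # threshold quoted in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same residue.

    Assumption: the inputs are already aligned to equal length, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def passes_cutoff(aligned_human: str, aligned_mouse: str) -> bool:
    """Assumption: the cut-off is a minimum (at least 60% is required)."""
    return percent_identity(aligned_human, aligned_mouse) >= HOMOLOGY_CUTOFF_PERCENT


# Toy sequences for illustration only.
human = "MKTAYIAKQR"
similar_mouse = "MKTAYLAKHR"     # 8 of 10 positions identical
divergent_mouse = "MRSAFLGKHR"   # 4 of 10 positions identical

print(percent_identity(human, similar_mouse), passes_cutoff(human, similar_mouse))
print(percent_identity(human, divergent_mouse), passes_cutoff(human, divergent_mouse))
```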

Selected antibodies are applied to test sections containing brain regions or cell types with known expression based on in situ hybridization (Allen Brain Atlas) and single cell RNA-seq data (Linnarsson Lab and Barres Lab). Staining patterns are evaluated based on consistency between staining patterns of multiple antibodies against the same target and on the match to transcriptomics data. Antibody immunoreactivity is visualized using tyramide signal amplification, shown in green. A nuclear reference staining (DAPI) is visualized in blue. The immunofluorescence protocol is standardized, though antibody concentration and incubation time vary depending on protein abundance and antibody affinity as determined during the test staining. The complete mouse brain profile is represented by serial coronal sections of adult mouse brain, 16 µm thick. Stained slides are then scanned and digitized before further processing.

Table 1. Brain regions. Abbreviations are based on The Mouse Brain in Stereotaxic Coordinates, Third Edition: The coronal plates and diagrams (ISBN: 9780123742445)

Region group | Subgroup | Region | Abbreviation | Allen Brain Atlas
cerebral cortex | cerebral cortex | frontal association cortex | fra | FRP
cerebral cortex | cerebral cortex | motor cortex | m | MO
cerebral cortex | cerebral cortex | cingulate cortex | cg | ACA
cerebral cortex | cerebral cortex | piriform cortex, L1 | pirl1 | PIR1
cerebral cortex | cerebral cortex | piriform cortex, L2 | pirl2 | PIR2
cerebral cortex | cerebral cortex | piriform cortex, L3 | pirl3 | PIR3
cerebral cortex | cerebral cortex | insular cortex | i | AI
cerebral cortex | cerebral cortex | somatosensory cortex | s | SS
cerebral cortex | cerebral cortex | retrosplenial granular cortex | rsg | RSP
cerebral cortex | cerebral cortex | parietal association cortex | p | PTLp
cerebral cortex | cerebral cortex | entorhinal cortex | ent | ENT
cerebral cortex | cerebral cortex | visual cortex | v | VIS
olfactory bulb | olfactory bulb | anterior olfactory nucleus | aon | AON
olfactory bulb | olfactory bulb | granule cell layer | gro | MOBgr
olfactory bulb | olfactory bulb | internal plexiform layer | ipl | MOBipl
olfactory bulb | olfactory bulb | mitral cell layer | mi | MOBmi
olfactory bulb | olfactory bulb | glomerular layer | gl | MOBgl
olfactory bulb | olfactory bulb | rostral migratory stream | rms | SEZ
olfactory bulb | olfactory bulb | external plexiform layer | epl | MOBopl
olfactory bulb | olfactory bulb | external plexiform layer of the accessory OB | epla |
olfactory bulb | olfactory bulb | granule cell layer of the accessory OB | gra | AOBgr
olfactory bulb | olfactory bulb | glomerular layer of the accessory OB | gla | AOBgl
hippocampal formation | hippocampus | polymorph layer of the dentate gyrus | podg | DG-po
hippocampal formation | hippocampus | molecular layer of the dentate gyrus | modg | DG-mo
hippocampal formation | hippocampus | granular dentate gyrus | grdg | DG-sg
hippocampal formation | hippocampus | CA1 - oriens layer | ca1or | CA1so
hippocampal formation | hippocampus | CA1 - pyramidal layer | ca1py | CA1sp
hippocampal formation | hippocampus | CA1 - radiatum layer | ca1ra | CA1sr
hippocampal formation | hippocampus | CA2 - oriens layer | ca2or | CA2so
hippocampal formation | hippocampus | CA2 - pyramidal layer | ca2py | CA2sp
hippocampal formation | hippocampus | CA2 - radiatum layer | ca2ra | CA2sr
hippocampal formation | hippocampus | CA3 - oriens layer | ca3or | CA3so
hippocampal formation | hippocampus | CA3 - pyramidal layer | ca3py | CA3sp
hippocampal formation | hippocampus | CA3 - radiatum layer | ca3ra | CA3sr
hippocampal formation | hippocampus | stratum lucidum | slu | CA3slu
hippocampal formation | hippocampus | lacunosum moleculare | lmol | CA1slm
hippocampal formation | hippocampus | subiculum | sub | SUB
amygdala | amygdala | nucleus of the lateral olfactory tract | lot | NLOT
amygdala | amygdala | basal medial amygdaloid nucleus | bma | BMA
amygdala | amygdala | basal lateral amygdala** | bla | BLA
amygdala | amygdala | basal lateral amygdaloid nucleus | bla | BLA
amygdala | amygdala | cortical amygdala | aco | COA
amygdala | amygdala | central amygdala | ce | CEA
amygdala | amygdala | medial amygdaloid nucleus | mea | MEA
thalamus | thalamus | medial geniculate nucleus | mg | MG
thalamus | thalamus | parafascicular thalamic nucleus | pf | PF
thalamus | thalamus | pregeniculate nucleus | pg | GENd
thalamus | thalamus | stria terminalis | st | st
thalamus | thalamus | zona incerta | zi | ZI
thalamus | thalamus | anterodorsal thalamic nucleus | ad | AD
thalamus | thalamus | reticular thalamic nucleus | rt | RT
thalamus | thalamus | ventral anterior thalamic nucleus | va | VAL
thalamus | thalamus | medial habenular nucleus | mhb | MH
thalamus | thalamus | laterodorsal thalamic area | ld | LD
thalamus | thalamus | paraventricular thalamic nucleus | pv | PVT
thalamus | thalamus | central medial thalamic area | cm | CM
thalamus | thalamus | ventral lateral thalamic area | vl | VP
thalamus | thalamus | ventral medial thalamic area | vm | VM
thalamus | thalamus | lateral habenular nucleus | lhb | LH
thalamus | thalamus | ventral posterior thalamus | vpt | VP
thalamus | thalamus | anterior pretectal nucleus | apt | PRT
thalamus | thalamus | retromammillary nucleus | rm | SUM
hypothalamus | hypothalamus | dorsal tuberomammillary nucleus | dtm | TMd
hypothalamus | hypothalamus | mammillary nucleus | mn | MBO
hypothalamus | hypothalamus | periventricular hypothalamic nucleus | pe | PVi
hypothalamus | hypothalamus | supraoptic nucleus | so | SO
hypothalamus | hypothalamus | tuberal nucleus | tu | TU
hypothalamus | hypothalamus | ventral tuberomammillary nucleus | vtm | TMv
hypothalamus | hypothalamus | lateral preoptic area | lpo | LPO
hypothalamus | hypothalamus | medial preoptic area | mpo | MEPO
hypothalamus | hypothalamus | suprachiasmatic nucleus | sch | SCH
hypothalamus | hypothalamus | paraventricular hypothalamic nucleus | pa | PVH
hypothalamus | hypothalamus | anterior hypothalamic area, central | ahc | AHN
hypothalamus | hypothalamus | ventral medial hypothalamic nucleus | vmh | VMH
hypothalamus | hypothalamus | ventral medial hypothalamic nucleus** | vmh | VMH
hypothalamus | hypothalamus | arcuate nucleus | arc | ARH
hypothalamus | hypothalamus | arcuate nucleus** | arc | ARH
hypothalamus | hypothalamus | peduncular part of lateral hypothalamus | plh | PH
hypothalamus | hypothalamus | dorsal medial hypothalamic nucleus | dm | DMH
hypothalamus | hypothalamus | stria terminalis** | st | st
cerebellum | cerebellum | molecular layer of the cerebellum | cemol | CBXmo
cerebellum | cerebellum | Purkinje layer of the cerebellum | cepur | CBXpu
cerebellum | cerebellum | granular layer of the cerebellum | cegr | CBXgr
circumventricular organs | circumventricular organs | subcommissural organ | sco |
circumventricular organs | circumventricular organs | subfornical organ | sfo | SFO
circumventricular organs | circumventricular organs | median eminence | me | ME
circumventricular organs | circumventricular organs | medulla | ap | AP
pons | pons | Koelliker-Fuse nucleus | kf | KF
pons | pons | motor trigeminal nucleus | 5n | V
pons | pons | parabrachial nucleus | pbp | PB
pons | pons | principal sensory trigeminal nucleus | pr5 | PSV
pons | pons | locus coeruleus | lc | LC
pons | pons | pontine nucleus | pn | PG
pons | pons | vestibular nucleus | ve | VNC
pons | pons | pontine reticular nucleus, oral | pno | PRNr
pons | pons | lateral lemniscus | ll | NLL
pons | pons | superior paraolivary nucleus | spo | POR
medulla oblongata | medulla oblongata | nucleus of the solitary tract | sol | NTS
medulla oblongata | medulla oblongata | raphe magnus nucleus | rmg | RM
medulla oblongata | medulla oblongata | cochlear nucleus | cn | CN
medulla oblongata | medulla oblongata | lateral paragigantocellular nucleus | lpg | PGRNl
medulla oblongata | medulla oblongata | raphe pallidus nucleus | rpa | RPA
medulla oblongata | medulla oblongata | facial nucleus | 7n | VII
medulla oblongata | medulla oblongata | hypoglossal nucleus | 12n | XII
medulla oblongata | medulla oblongata | ambiguus nucleus | amb | AMB
medulla oblongata | medulla oblongata | external cuneate nucleus | ecu | CU
medulla oblongata | medulla oblongata | inferior olivary nucleus | io | IO
medulla oblongata | medulla oblongata | raphe obscurus nucleus | rob | RO
medulla oblongata | medulla oblongata | dorsal motor nucleus of vagus | 10n | DMX

Annotation

The digitized images are processed (axel-adjusted and tissue edges defined) and regions of interest (ROIs) are then marked according to the table above. These ROIs are then used for image analysis, and the relative fluorescence intensity is listed for each region. The relative fluorescence is defined as the intensity of the annotated region relative to the intensity of the region with the highest intensity.
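The definition of relative fluorescence amounts to dividing each region's intensity by the maximum across regions. A minimal sketch follows; the function name is hypothetical and the intensity values are invented for the example, with region keys borrowed from the abbreviations in Table 1.

```python
def relative_fluorescence(region_intensity: dict) -> dict:
    """Scale each region's intensity by the highest intensity among all regions."""
    if not region_intensity:
        raise ValueError("at least one region is required")
    peak = max(region_intensity.values())
    if peak <= 0:
        raise ValueError("the highest intensity must be positive")
    return {region: value / peak for region, value in region_intensity.items()}


# Invented intensities, for illustration only.
measured = {"ca1py": 200.0, "pirl1": 100.0, "cepur": 50.0}
print(relative_fluorescence(measured))
# {'ca1py': 1.0, 'pirl1': 0.5, 'cepur': 0.25}
```

With this normalization the brightest region always has the value 1.0, so values are comparable between regions of one image series but not directly between different antibodies.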

The overview and preserved orientation in the mouse brain have enabled us to annotate additional cell classes (ependymal), glial subpopulations (microglia, oligodendrocytes, and astrocytes), and additional brain-specific subcellular locations (axon, dendrite, synapse, and glial endfeet) for each investigated protein.

All images of immunofluorescence-stained sections are manually annotated by specially trained personnel, followed by review and verification by a second qualified member of the staff. The cellular and subcellular location of the immunoreactivity is defined and a summarizing text is provided describing the general staining pattern.

Specificity is validated by comparing the data with in situ hybridization data (Allen Brain Atlas) and/or available literature. Support from other data leads to a Supported reliability score, while less characterized targets are scored as Uncertain and await further validation.

Reliability score

A reliability score is set for all genes and indicates the level of reliability of the analyzed protein expression pattern based on available protein/RNA/gene characterization data.

The reliability score of the antibodies in the mouse brain atlas is set to Supported or Uncertain depending on support from in situ hybridization data (Allen Brain Atlas) and/or previously published data (UniProtKB/Swiss-Prot database).

Immunocytochemistry/IF - cells

The subcellular resource revolves around high-resolution, multicolor images of proteins labeled by indirect immunocytochemistry/immunofluorescence (ICC-IF). This provides spatial information on protein localization in terms of the subcellular distribution of the protein in organelles and subcellular structures at single cell level.

Originally, three cell lines from different human tissues (U2OS, A-431 and U-251 MG) were chosen for the analysis of protein subcellular localization by ICC-IF. The cell line panel has since been expanded to cover more cell types and lineages, e.g. tumor cell lines from mesenchymal, epithelial and glial tumors, as well as cell lines that have been immortalized by introduction of telomerase. The selection was furthermore based on morphological characteristics and widespread use of these cell lines. Information regarding sex and age of the donor, cellular origin and source is listed here. In order to localize the whole human proteome on a subcellular level in one specific cell line, most proteins are stained in U2OS. Two additional cell lines are selected based on mRNA expression data. Some proteins have also been stained in one or more ciliated cell lines and/or in human sperm, originating from a single healthy donor. In addition to the human cells, many proteins have been stained in the mouse cell line NIH 3T3, given that the human and mouse genes are orthologous.

The standard immunostaining protocol for ICC can be found on the open access repository for science methods at protocols.io. For the great majority of antibodies, fixation is achieved with paraformaldehyde (PFA), but for a few antibodies, this is replaced by methanol in order to better preserve the morphology of certain cellular structures. For each gene, the use of PFA or methanol, as well as dilution factors for the antibodies, are stated in the Antibodies and Validation section. In order to facilitate the annotation of the subcellular localization of the protein targeted by the HPA antibody, the cells are also stained with reference markers: (i) DAPI for the nucleus, (ii) anti-tubulin antibody for microtubules, and (iii) anti-calreticulin or anti-KDEL for the endoplasmic reticulum (ER). For ciliated cell lines, an antibody targeting ARL13B has been used to mark primary cilia and an antibody targeting pericentrin (PCNT) has been used to mark basal bodies. In human sperm, an antibody targeting acetylated tubulin has been used as a marker for flagella and an antibody targeting citrate synthase (CS) has been used as a marker for mitochondria.

The resulting confocal images are single slice images representing one optical section of the cells, except for ciliated cell lines and human sperm, in which case z-stacks are shown. The microscope settings are standardized, but the detector gain is optimized for each sample. The different organelle probes are displayed as different channels in the multicolor images, with the HPA antibody staining shown in green, nucleus in blue, microtubules in red and ER in yellow.

Annotation

In order to provide an interpretation of the staining patterns, all images generated by ICC-IF are manually annotated. For each cell line and antibody, the staining is described in terms of subcellular location(s) and single-cell variability (SCV). The table below lists the subcellular locations used for annotation, with links to the cell structure dictionary entry and corresponding GO terms. SCVs within an immunofluorescence image are classified as intensity variation (variation in their expression level) or as spatial variation (variation in the spatial distribution).

Subcellular location | GO term
Acrosome | -
Actin filaments | GO:0015629
Aggresome | GO:0016235
Annulus | -
Basal body | GO:0036064
Calyx | -
Cell Junctions | GO:0030054
Centriolar satellite | GO:0034451
Centrosome | GO:0005813
Cleavage furrow | GO:0032154
Connecting piece | -
Cytokinetic bridge | GO:0045171
Cytoplasmic bodies | GO:0036464
Cytosol | GO:0005829
End piece | -
Endoplasmic reticulum | GO:0005783
Endosomes | GO:0005768
Equatorial segment | -
Flagellar centriole | -
Focal adhesion sites | GO:0005925
Golgi apparatus | GO:0005794
Intermediate filaments | GO:0045111
Kinetochore | GO:0000776
Lipid droplets | GO:0005811
Lysosomes | GO:0005764
Microtubule ends | GO:1990752
Microtubules | GO:0015630
Mid piece | -
Midbody | GO:0030496
Midbody ring | GO:0090543
Mitochondria | GO:0005739
Mitotic chromosome | GO:0005694
Mitotic spindle | GO:0072686
Nuclear bodies | GO:0016604
Nuclear membrane | GO:0031965
Nuclear speckles | GO:0016607
Nucleoli | GO:0005730
Nucleoli fibrillar center | GO:0001650
Nucleoli rim | GO:0005730
Nucleoplasm | GO:0005654
Perinuclear theca | -
Peroxisomes | GO:0005777
Plasma membrane | GO:0005886
Primary cilium | GO:0005929
Primary cilium tip | -
Primary cilium transition zone | -
Principal piece | -
Rods & Rings | -
Vesicles | GO:0043231

Knowledge-based annotation

The knowledge-based annotation aims to provide an interpretation of the detected subcellular localization of a protein. In the first step, stainings in different cell lines with the same antibody are reviewed and the results are compared with external experimental protein/gene characterization data for subcellular localization, available in the UniProtKB/Swiss-Prot database. In the second step, all antibodies targeting the same protein are taken into consideration for a final annotation of the subcellular distribution of the protein.

Reliability score

Each location is separately given one of four reliability scores (Enhanced, Supported, Approved, or Uncertain) based on available protein/RNA/gene characterization data from both HPA and the UniProtKB/Swiss-Prot database. The reliability score also encompasses several additional factors, including reproducibility of the antibody staining in different cell lines, correlation between staining intensity and RNA expression levels, assays for enhanced antibody validation, and experimental evidence for subcellular location described in literature. Enhanced validation is achieved by using antibodies binding to different epitopes on the same target protein (independent antibody validation), by assessing staining intensity upon knockdown/knockout of the target protein (genetic validation) and/or by matching of the signal with a GFP-tagged protein (recombinant expression validation). The individual location reliability scores are summarized in an overall gene reliability score.

There are four different reliability scores:

  • Enhanced - The antibody has enhanced validation and there is no contradicting data, such as literature describing experimental evidence for a different location.
  • Supported - There is no enhanced validation of the antibody, but the annotated localization is reported in literature.
  • Approved - The localization of the protein has not been previously described and was detected by only one antibody without additional antibody validation.
  • Uncertain - The antibody-staining pattern contradicts experimental data or expression is not detected at RNA level.

Protein array

All purified antibodies are analyzed on antigen microarrays. The specificity profile for each antibody is determined based on the interaction with 384 different antigens including its own target. The antigens present on the arrays are consecutively exchanged in order to correspond to the next set of 384 purified antibodies. Each microarray is divided into 21 replicated subarrays, enabling the analysis of 21 antibodies simultaneously. The antibodies are detected through a fluorescently labeled secondary antibody and a dual color system is used in order to verify the presence of the spotted proteins. A specificity profile plot is generated for each antibody, where the signal from the binding to its own antigen is compared to any off-target interactions with all the other antigens. The vast majority (86%) of antibodies are given a pass and the remainder are failed due to either low signal or low specificity.

Western blot

Western blot analysis of antibody specificity has been done using a routine sample setup composed of IgG/HSA-depleted human plasma and protein lysates from a limited number of human tissues and cell lines. Antibodies with an uncertain routine WB have been revalidated using an over-expression lysate (VERIFY Tagged Antigen(TM), OriGene Technologies, Rockville, MD) as a positive control. Antibody binding was visualized by chemiluminescence detection in a CCD-camera system using a peroxidase (HRP) labeled secondary antibody.

Antibodies included in the Human Protein Atlas have been analyzed without further efforts to optimize the procedure and therefore it cannot be excluded that certain observed binding properties are due to technical rather than biological reasons and that further optimization could result in a different outcome.

Transcriptomics

HPA RNA-seq data

In total, 1206 cell lines, 40 human tissues and 18 immune cell types as well as total peripheral blood mononuclear cells (PBMC) have been analyzed by RNA-seq to estimate the transcript abundance of each protein-coding gene. Additionally, 19 mouse tissue samples and 32 pig tissue samples collected from the brain and retina of the animals were sampled and analyzed by RNA-seq.

For normal tissue and blood samples, specimens were collected with consent from patients and all samples were anonymized in accordance with approval from the local ethics committee (ref #2011/473 and ref #2015/1552-32) and Swedish rules and legislation. All tissues were collected from the Uppsala Biobank and RNA samples were extracted from frozen tissue sections. Blood samples were enriched for PBMC and granulocytes, labeled with antibodies and separated into subpopulations by flow sorting. For cell lines, early-split samples were used as duplicates and total RNA was extracted using the Qiagen RNeasy mini kit. Information regarding cellular origin and the source of each cell line is listed here.

For mouse tissue, samples were collected and handled in accordance with Swedish laws and regulation, and all experiments were approved by the local ethical committee (Stockholms Norra Djurförsöksetiska Nämd N183/14). The animal experiments conformed to the European Communities Council Directive (86/609/EEC), and all efforts were made to minimize the suffering and the number of animals used. WT male (n = 2) and female (n = 2) C57BL/6J mice (2 months old) were obtained from Charles River Laboratories and maintained under standard conditions on a 12-hour day/night cycle, with water and food ad libitum. After washing out the blood, brains, pituitary gland, and spinal cord were quickly removed from the skull and spine and placed in ice-cold sterile PBS to make the tissue stiff and easier to dissect. The entire brain was carefully dissected into 17 sub-regions on an ice-cold surface. Retina samples were collected by separating the retina from the pigment layer in warm (37°C) PBS, pH 7.4. All dissected regions were placed in a 1.5 ml Eppendorf tube and snap-frozen in liquid nitrogen. Samples were stored at -80°C until further processing for the RNA extraction. Transcript expression of all brain regions, pituitary and retina was analysed. Tissue was homogenized mechanically using a TissueLyser LT (Qiagen) and total RNA was prepared using the RNeasy Mini isolation kit (Qiagen). This generated high-quality RNA, with 84% of the samples having RNA Integrity Number (RIN) values higher than 8.0 and only one sample removed due to a very low RIN value (less than 6.0). In total, 75 samples were subsequently used for library construction with Illumina TruSeq Stranded mRNA reagents. The Illumina HiSeq2500 platform was used for sequencing at a depth of approximately 20 million reads.

For a total of 141 HPA cell line samples, 186 normal tissue samples, and 109 immune cell samples, mRNA sequencing was performed on Illumina HiSeq2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2x100 bases. The RNA-seq data for the remaining cell lines was imported from the Cancer Cell Line Encyclopedia (CCLE). More information about the cell line data can be found here. Immune cell mRNA sequencing was performed on an Illumina NovaSeq 6000 System in four S4 lanes with a read length of 2x150 bases. Transcript abundance estimation was performed using Kallisto v0.48.0. The 18 immune cell types are classified into six different lineages including B-cells, T-cells, NK-cells, monocytes, granulocytes and dendritic cells. More information can be found here.

The HPA Human brain sample set consists of samples from the human brain. The analysis is a collaboration with the Human Brain Tissue Bank (HBTB; Semmelweis University, Budapest) in accordance with approval from the Committee of Science and Research Ethic of the Ministry of Health Hungary (ETT TUKEB: 189/KO/02.6008/2002/ETT) and the Semmelweis University Regional Committee of Science and Research Ethic (No. 32/1992/TUKEB) to remove human brain tissue samples, collect, store and use them for research. Samples were collected by Prof. Palkovits and RNA was extracted from frozen brain punches. The human brain dataset is based on 966 samples of 193 regions analyzed using the MGI DNBSEQ-T7 platform. The human prefrontal cortex dataset includes 165 samples from 3 male and 3 female donors, providing a detailed overview of protein expression in 17 subregions of the prefrontal cortex and 3 reference cortical regions, and was analyzed using the Illumina sequencing platform.

The pig tissue samples were collected and analyzed in collaboration with BGI. Pig brains used for mRNA analysis were collected and handled in accordance with national guidance for large experimental animals and under permission of the local ethical committee (ethical permission numbers No.44410500000078 and BGI-IRB18135) as well as conducted in line with European directives and regulations. The experimental minipigs (Chinese Bama Minipig) were provided by the Peral Lab Animal Sci & Tech Co.,Ltd (Permit number SYXK2017-0123). Male (n = 2) and female (n = 2) Chinese Bama minipigs (1 year old) were housed in a specific pathogen-free stable facility under standard conditions. The brain was cut in coronal slabs at the level of 1) frontal lobe/olfactory tract, 2) optic chiasm and 3) between hypothalamus and cerebral peduncle. Slabs were divided into 2 hemispheres, exposing all main brain structures. For mRNA analysis, pieces of cerebral cortex and cerebellum were collected, based on a sampling strategy collecting a representative sample that contained all cell layers. All other regions were dissected and collected completely. Two samples (somatosensory cortex and periaqueductal gray) are missing from female 1 because these two regions could not be identified with 100% certainty and were thus excluded. Duplicate samples were taken from the olfactory bulb of female 2, resulting in a total of 119 brain samples and 8 additional samples (retina and pituitary gland), 127 samples in all. All samples were stored at -80°C until RNA was extracted within one month.

GTEx RNA-seq data

The Genotype-Tissue Expression (GTEx) project collects and analyzes multiple human post mortem tissues. RNA-seq data from 36 of their tissue types was mapped based on RSEM v1.3.0 (v8) and the resulting TPM values have been included in the Human Protein Atlas for all corresponding genes that could be mapped from Gencode v26 to Ensembl version 109. The GTEx retina data are based on EyeGEx data from Ratnapriya et al., Nature Genetics 2019, and transcript abundance estimation was performed using Kallisto v0.48.0 with Ensembl version 109 as reference genome.

Tissue | GTEx tissue | Number of samples
Adipose tissue | Adipose - Subcutaneous | 663
Adipose tissue | Adipose - Visceral (Omentum) | 541
Adrenal gland | Adrenal Gland | 258
Amygdala | Brain - Amygdala | 152
Breast | Breast - Mammary Tissue | 459
Caudate | Brain - Caudate (basal ganglia) | 246
Cerebellum | Brain - Cerebellar Hemisphere | 215
Cerebellum | Brain - Cerebellum | 241
Cerebral cortex | Brain - Anterior cingulate cortex (BA24) | 176
Cerebral cortex | Brain - Cortex | 255
Cerebral cortex | Brain - Frontal Cortex (BA9) | 209
Cervix | Cervix - Ectocervix | 9
Cervix | Cervix - Endocervix | 10
Colon | Colon - Sigmoid | 373
Colon | Colon - Transverse | 406
Endometrium | Uterus - Endometrium | 16
Esophagus | Esophagus - Mucosa | 555
Fallopian tube | Fallopian Tube | 9
Heart muscle | Heart - Atrial Appendage | 429
Heart muscle | Heart - Left Ventricle | 432
Hippocampus | Brain - Hippocampus | 197
Hypothalamus | Brain - Hypothalamus | 202
Kidney | Kidney - Cortex | 85
Kidney | Kidney - Medulla | 4
Liver | Liver | 226
Lung | Lung | 578
Nucleus accumbens | Brain - Nucleus accumbens (basal ganglia) | 246
Ovary | Ovary | 180
Pancreas | Pancreas | 328
Pituitary gland | Pituitary | 283
Prostate | Prostate | 245
Putamen | Brain - Putamen (basal ganglia) | 205
Retina | Retina | 105
Salivary gland | Minor Salivary Gland | 162
Skeletal muscle | Muscle - Skeletal | 803
Skin | Skin - Not Sun Exposed (Suprapubic) | 604
Skin | Skin - Sun Exposed (Lower leg) | 701
Small intestine | Small Intestine - Terminal Ileum | 187
Spinal cord | Brain - Spinal cord (cervical c-1) | 159
Spleen | Spleen | 241
Stomach | Stomach | 359
Substantia nigra | Brain - Substantia nigra | 139
Testis | Testis | 361
Thyroid gland | Thyroid | 653
Urinary bladder | Bladder | 21
Vagina | Vagina | 156

FANTOM5 CAGE data

The Functional Annotation of Mammalian Genomes 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type specific transcriptomes using Cap Analysis of Gene Expression (CAGE) (Takahashi H et al. (2012)), which is based on a series of full-length cDNA technologies developed in RIKEN. CAGE data for 60 of their tissues was obtained from the FANTOM5 repository and mapped to Ensembl version 109.

Tissue | FANTOM5 tissue | Sample description | FANTOM5 sample id
Adipose tissue | Adipose tissue | 65,65,76 years, mixed | FF:10010-101C1
Amygdala | Amygdala | 76 years, female | FF:10151-102I7
Appendix | Appendix | 29 years, male | FF:10189-103D9
Breast | Breast | 77 years, female | FF:10080-102A8
Caudate | Caudate nucleus | 76 years, female | FF:10164-103B2
Cerebellum | Cerebellum | 22-68 years, mixed | FF:10083-102B2
Cerebellum | Cerebellum | 76 years, female | FF:10166-103B4
Cervix | Cervix | 40,46,57,65 years, female | FF:10013-101C4
Colon | Colon | 62,83,84 years, mixed | FF:10014-101C5
Corpus callosum | Corpus callosum | 24-68 years, mixed | FF:10042-101F6
Ductus deferens | Ductus deferens | 24 years, male | FF:10196-103E7
Endometrium | Uterus | 23-63 years, female | FF:10100-102D1
Epididymis | Epididymis | 24 years, male | FF:10197-103E8
Esophagus | Esophagus | 68,74,75 years, mixed | FF:10015-101C6
Frontal lobe | Frontal lobe | 32-61 years, mixed | FF:10040-101F4
Gallbladder | Gall bladder | 57 years, male | FF:10198-103E9
Globus pallidus | Globus pallidus | 76 years, female | FF:10161-103A8
Globus pallidus | Globus pallidus | 60 years, female | FF:10175-103C4
Heart muscle | Heart | 70,73,74 years, mixed | FF:10016-101C7
Heart muscle | Left ventricle | 73 years, female | FF:10078-102A6
Heart muscle | Left atrium | 40 years, male | FF:10079-102A7
Hippocampus | Hippocampus | 76 years, female | FF:10153-102I9
Hippocampus | Hippocampus | 60 years, female | FF:10169-103B7
Insular cortex | Insula | 20-68 years, mixed | FF:10039-101F3
Kidney | Kidney | 60,62,63 years, female | FF:10017-101C8
Liver | Liver | 64,69,70 years, mixed | FF:10018-101C9
Locus coeruleus | Locus coeruleus | 76 years, female | FF:10165-103B3
Locus coeruleus | Locus coeruleus | 60 years, female | FF:10182-103D2
Lung | Lung | 46,65,94 years, mixed | FF:10019-101D1
Lung | Lung - right lower lobe | 29 years, male | FF:10075-102A3
Lymph node | Lymph node | 30 years, male | FF:10077-102A5
Medial frontal gyrus | Medial frontal gyrus | 76 years, female | FF:10150-102I6
Medial temporal gyrus | Medial temporal gyrus | 76 years, female | FF:10156-103A3
Medial temporal gyrus | Medial temporal gyrus | 60 years, female | FF:10183-103D3
Medulla oblongata | Medulla oblongata | 18-64 years, mixed | FF:10038-101F2
Medulla oblongata | Medulla oblongata | 76 years, female | FF:10155-103A2
Medulla oblongata | Medulla oblongata | 60 years, female | FF:10174-103C3
Nucleus accumbens | Nucleus accumbens | 23-56 years, mixed | FF:10037-101F1
Occipital cortex | Occipital cortex | 76 years, female | FF:10163-103B1
Occipital lobe | Occipital lobe | 27 years, male | FF:10076-102A4
Occipital pole | Occipital pole | 22-68 years, mixed | FF:10036-101E9
Olfactory bulb | Olfactory region | 87 years, female | FF:10195-103E6
Ovary | Ovary | 47,75,84 years, female | FF:10020-101D2
Pancreas | Pancreas | 52 years, male | FF:10049-101G4
Paracentral gyrus | Paracentral gyrus | 22-69 years, mixed | FF:10035-101E8
Parietal lobe | Parietal lobe | 35-89 years, mixed | FF:10034-101E7
Parietal lobe | Parietal lobe | 76 years, female | FF:10157-103A4
Parietal lobe | Parietal lobe | 60 years, female | FF:10171-103B9
Pituitary gland | Pituitary gland | 76 years, female | FF:10162-103A9
Placenta | Placenta | female | FF:10021-101D3
Pons | Pons | 18-54 years, mixed | FF:10033-101E6
Postcentral gyrus | Postcentral gyrus | 44-52 years, mixed | FF:10032-101E5
Prostate | Prostate | 73,79,93 years, male | FF:10022-101D4
Putamen | Putamen | 60 years, female | FF:10176-103C5
Retina | Retina | 24-65 years, mixed | FF:10030-101E3
Salivary gland | Salivary gland | 16-60 years, mixed | FF:10093-102C3
Salivary gland | Parotid gland | 23 years, male | FF:10199-103F1
Salivary gland | Submaxillary gland | 24 years, male | FF:10202-103F4
Seminal vesicle | Seminal vesicle | 24 years, male | FF:10201-103F3
Skeletal muscle | Skeletal muscle | 55,79,79 years, mixed | FF:10023-101D5
Skeletal muscle | Skeletal muscle - soleus muscle | male | FF:10282-104F3
Small intestine | Small intestine | 15,40,85 years, mixed | FF:10024-101D6
Smooth muscle | Smooth muscle | 20-68 years, male | FF:10048-101G3
Spinal cord | Spinal cord | 76 years, female | FF:10159-103A6
Spinal cord | Spinal cord | 60 years, female | FF:10181-103D1
Spleen | Spleen | 39,50,70 years, male | FF:10025-101D7
Substantia nigra | Substantia nigra | 76 years, female | FF:10158-103A5
Temporal cortex | Temporal lobe | 32-61 years, mixed | FF:10031-101E4
Testis | Testis | 34,53,86 years, male | FF:10026-101D8
Testis | Testis | 14-64 years, male | FF:10096-102C6
Thalamus | Thalamus | 76 years, female | FF:10154-103A1
Thymus | Thymus | 0.5,0.5,0.83 years old infant years, male | FF:10027-101D9
Thyroid gland | Thyroid | 67,68,78 years, mixed | FF:10028-101E1
Tongue | Tongue | 28 years, male | FF:10203-103F5
Tonsil | Tonsil | 22-61 years, mixed | FF:10047-101G2
Urinary bladder | Bladder | 55,58,79 years, mixed | FF:10011-101C2
Vagina | Vagina | 68 years, female | FF:10204-103F6

Tissue Cell Type resource: Using GTEx bulk RNAseq data to profile gene cell type specificity

GTEx data was used in a correlation-based integrative network analysis to determine the cell type specificity of all protein coding genes within a given tissue type. For more details on this analysis and the classifications, see the Tissue Cell Type sub-section of the Single Cell resource Methods Summary.

scRNA-seq data

Inclusion criteria

The single cell RNA sequencing dataset is based on a meta-analysis of literature on single cell RNA sequencing and of single cell databases that include healthy human tissue. To avoid technical bias and to ensure that the single cell dataset can best represent the corresponding tissue, the following data selection criteria were applied: (1) Single cell transcriptomic datasets were limited to those based on the Chromium single cell gene expression platform from 10X Genomics (version 2 or 3); (2) Single cell RNA sequencing was performed on single cell suspension from tissues without pre-enrichment of cell types; (3) Only studies with >4,000 cells and 20 million read counts were included; (4) Only datasets whose pseudo-bulk transcriptomic expression profile is highly correlated with the transcriptomic expression profile of the corresponding HPA tissue bulk sample were included. It should be noted that exceptions were made for eye (~12.6 million reads), rectum (2,638 cells) and heart muscle (plate-based scRNA-seq) to include various cell types in the analysis.


Single cell transcriptomics datasets

In total, 31 different datasets were analyzed. These datasets were retrieved from the Single Cell Expression Atlas, the Human Cell Atlas, the Gene Expression Omnibus, the Allen Brain Map, the European Genome-phenome Archive and Tabula Sapiens. The complete list of references is shown here.


Clustering of single cell transcriptomics data

For each of the single cell transcriptomics datasets, the quantified raw sequencing data were downloaded from the corresponding depository database based on the accession number provided by the corresponding study in the available format. In more detail, SRA files were downloaded for colon, kidney, liver, PBMC and testis, and subsequently converted into raw fastq files by SRA Toolkit (v2.10.9). For the other 25 tissues, raw fastq files were downloaded directly, including adipose tissue, bone marrow, breast, bronchus, endometrium, esophagus, eye, fallopian tube, heart muscle, lung, lymph node, ovary, pancreas, placenta, prostate, rectum, salivary gland, skeletal muscle, skin, small intestine, spleen, stomach, thymus, tongue, and vasculature. For brain specifically, the quantified raw count data was downloaded.

The single cell RNA-seq data processing followed the same pipeline as the HPA project. To quantify the transcript levels, the sequencing data were mapped to the human reference GRCh38.p13 cDNA, while datasets generated by the droplet-based 10X Genomics Chromium (10X) approach were processed by Cell Ranger (v6.1.2), and datasets generated by the plate-based scRNA-seq were processed by STAR (v2.7.9a). Based on the annotation from Ensembl Archive Release 103 (from HPA v23, Ensembl gene IDs were mapped to Ensembl Archive Release 109), the transcript abundances were aggregated to gene level as read counts, and the count matrices from the same tissue were further aggregated into one matrix. This resulted in 31 count matrices for 31 tissues, respectively, with a total of 60,666 genes included for further analysis. The downstream analysis followed an in-house pipeline using Scanpy (v1.7.1) in Python 3.8.5. In the pipeline, the data were filtered using two criteria: a cell is considered valid if at least 200 genes are detected, and a gene is considered valid if it is expressed in at least 10% of the cells. For tissues containing more than 10,000 cells, 1000 cells were used as the cutoff. Subsequently, the cell counts were normalized to have a total count per cell of 10,000. For each dataset, the valid cells were then clustered using the Louvain clustering function within Single-Cell Analysis in Python (Scanpy). Default parameter values were used in clustering. In more detail, the features of cells were projected into a PCA space with 50 components using UMAP, and a k-nearest neighbours (KNN) graph was generated. 15 neighbours were used in the network for Louvain, and the resolution of clustering was set to 1.0. Finally, the total read count for each gene in each cluster was calculated by adding up the read counts of that gene in all cells belonging to the corresponding cluster.
The raw read counts were scaled to transcripts per million protein-coding genes (pTPM) for each of the single cell clusters and then normalized (nTPM) using Trimmed mean of M values (TMM) to allow for between-cluster comparisons. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell types in different datasets were then mean averaged to a single aggregated value. Only clusters with medium and high reliability were included, and clusters containing mixed cell types, Neutrophils and Platelets were excluded due to their low RNA content. Detailed calculation equations can be found in the single cell type method summary.
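
The cell filter, per-cell normalization and per-cluster summation described above can be sketched in plain Python as follows. This is only an illustration, not the HPA pipeline, which uses Scanpy; the function names and the dictionary layout (`{cell_id: {gene: count}}`) are assumptions made for this example, and the gene-level filter and the Louvain clustering itself are left out.

```python
from collections import defaultdict

MIN_GENES_PER_CELL = 200   # from the text: a cell is valid if at least 200 genes are detected
TARGET_SUM = 10_000        # from the text: total count per cell after normalization


def filter_cells(cells):
    """Keep cells in which at least MIN_GENES_PER_CELL genes have a non-zero count."""
    return {
        cell_id: counts
        for cell_id, counts in cells.items()
        if sum(1 for c in counts.values() if c > 0) >= MIN_GENES_PER_CELL
    }


def normalize_cell(counts, target_sum=TARGET_SUM):
    """Rescale one cell's counts so that they sum to target_sum."""
    total = sum(counts.values())
    if total == 0:
        return dict(counts)
    return {gene: c * target_sum / total for gene, c in counts.items()}


def sum_counts_per_cluster(cells, cluster_of):
    """Add up the raw read counts of each gene over all cells of each cluster.

    cluster_of maps cell_id -> cluster label (hypothetical output of a clustering step).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell_id, counts in cells.items():
        for gene, c in counts.items():
            totals[cluster_of[cell_id]][gene] += c
    return {cluster: dict(genes) for cluster, genes in totals.items()}
```

Note that the summation works on raw counts, as stated in the text; the normalized values are only used for clustering.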

Defining cell types

Each of the 557 different cell type clusters was manually annotated based on an extensive survey of >500 well-known tissue and cell type-specific markers, including both markers from the original publications and additional markers used in pathology diagnostics. For each cluster, one main cell type was chosen by taking into consideration the expression of different markers. For a few clusters, no main cell type could be selected, and these clusters were not used for gene classification. The most relevant markers are presented in a heatmap on the Cell Type Atlas, in order to clarify cluster annotation to visitors. For the brain single nuclei data, cluster types populated with fewer than 30 cells were considered low reliability.

Cell type dendrogram

The cell type dendrogram presented on the Single Cell Type resource shows the relationship between the single cell types based on genome-wide expression. The dendrogram is based on agglomerative clustering of 1 - Spearman's rho between cell types using Ward's criterion. The dendrogram was then transformed into a hierarchical graph, and link distances were normalized to emphasize graph connections rather than link distances. Link width is proportional to the distance from the root, and links are colored according to cell type group if only one cell type group is present among connected leaves.

Normalization of transcriptomics data

For both the HPA and GTEx transcriptomics datasets, the average TPM value of all individual samples for each human tissue or human cell type was used to estimate the gene expression level. To be able to combine the datasets into consensus transcript expression levels, a pipeline was set up to normalize the data for all samples. In brief, all TPM values per sample were scaled to a sum of 1 million TPM (denoted pTPM) to compensate for the non-coding transcripts that had been previously removed. Next, all TPM values of all samples within each data source (HPA + GTEx human tissues, HPA immune cell types, HPA cell lines) were normalized separately using Trimmed mean of M values (TMM) to allow for between-sample comparisons. The resulting normalized transcript expression values, denoted nTPM, were calculated for each gene in every sample. nTPM values below 0.1 are not visualized on the Atlas sections.
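
The pTPM rescaling and the per-tissue averaging can be illustrated with a short sketch. The function names and the `{gene: value}` layout are assumptions for this example; the TMM step that turns pTPM into nTPM is not shown here.

```python
def to_ptpm(tpm):
    """Rescale one sample's protein-coding TPM values so that they sum to 1,000,000."""
    total = sum(tpm.values())
    if total == 0:
        raise ValueError("sample has no expression signal")
    return {gene: value * 1_000_000 / total for gene, value in tpm.items()}


def mean_per_tissue(samples):
    """Average a list of {gene: value} dicts gene by gene (all samples of one tissue)."""
    genes = samples[0].keys()
    return {g: sum(s[g] for s in samples) / len(samples) for g in genes}


# A sample whose protein-coding genes sum to 400 TPM after non-coding removal.
sample = {"GENE_A": 300.0, "GENE_B": 100.0}
print(to_ptpm(sample))
```

After rescaling, GENE_A holds three quarters of the million and GENE_B one quarter, so the relative abundances within the sample are unchanged.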

For the brain dataset, an additional normalization was performed using linear regression to correct for inter-individual variation, using the removeBatchEffect function in the R package Limma with subject as the batch parameter. To reduce the technical variation between the MGI and Illumina platforms, 19 reference samples were included and run on both platforms. Intensity normalization based on these reference samples was conducted to minimize technical variation between the two platforms.

Consensus transcript expression levels for each gene were summarized in 50 human tissues based on transcriptomics data from the two sources HPA and GTEx. The consensus nTPM value for each gene and tissue type represents the maximum nTPM value based on HPA and GTEx. For tissues with multiple sub-tissues (brain regions, immune cells, lymphoid tissues and intestine) the maximum of all sub-tissues is used for the tissue type and the total number of tissue types in the human tissue consensus set is 36.
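
A minimal sketch of the maximum rule for one gene is given below. The function name, the input layout and the sub-tissue grouping in the usage example are hypothetical; only the "take the maximum across sources and across sub-tissues" logic comes from the text.

```python
def consensus_ntpm(hpa, gtex, tissue_of):
    """Combine per-tissue nTPM values of one gene from two sources.

    hpa, gtex : {sub_tissue: nTPM}
    tissue_of : {sub_tissue: consensus tissue}; sub-tissues missing from the
                mapping are treated as their own consensus tissue.
    """
    consensus = {}
    for source in (hpa, gtex):
        for sub_tissue, value in source.items():
            tissue = tissue_of.get(sub_tissue, sub_tissue)
            # Maximum across sources and across sub-tissues of the same tissue.
            consensus[tissue] = max(consensus.get(tissue, 0.0), value)
    return consensus


hpa = {"liver": 12.0, "amygdala": 3.0}
gtex = {"liver": 20.5, "hippocampus": 7.0}
groups = {"amygdala": "brain", "hippocampus": "brain"}
print(consensus_ntpm(hpa, gtex, groups))  # {'liver': 20.5, 'brain': 7.0}
```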

The FANTOM5 dataset was normalized separately on the sample level using TMM. The normalized Tags Per Million for each gene were calculated based on the average of all individual samples for each human tissue.

Mouse and pig transcriptomic data, generated by the HPA in collaboration with BGI, were normalized separately according to the same procedure used for human tissues and cell types; no Limma adjustment was performed on the mouse and pig data. Consensus transcript expression levels are summarized into 13 brain regions for mouse brain and 15 regions for pig brain, where sub-regional samples were combined and the maximum of the sub-regions was used for the brain region.

Single cell type clusters were normalized separately from other transcriptomics datasets using TMM. To generate expression values per cell type, clusters were aggregated per cell type by first calculating the weighted mean nTPM in all cells with the same cluster annotation within a dataset. The values for the same cell types in different data sets were then mean averaged to a single aggregated value. Only clusters with medium and high reliability were included and clusters containing mixed cell types, Neutrophils and Platelets were excluded.
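
The two-step aggregation (weighted mean within a dataset, plain mean across datasets) might look like the sketch below for a single gene. Weighting each cluster by its number of cells is an assumption, since the text does not name the weights, and the record layout is hypothetical. Clusters are assumed to be filtered for reliability and excluded cell types beforehand.

```python
from collections import defaultdict


def aggregate_cell_types(clusters):
    """clusters: list of dicts with keys 'dataset', 'cell_type', 'n_cells', 'ntpm'."""
    # Step 1: weighted mean nTPM per (dataset, cell type), weighted by cluster size.
    weighted = defaultdict(lambda: [0.0, 0])  # key -> [sum(ntpm * n_cells), sum(n_cells)]
    for c in clusters:
        acc = weighted[(c["dataset"], c["cell_type"])]
        acc[0] += c["ntpm"] * c["n_cells"]
        acc[1] += c["n_cells"]

    # Step 2: unweighted mean of the per-dataset values for each cell type.
    per_type = defaultdict(list)
    for (_dataset, cell_type), (num, den) in weighted.items():
        per_type[cell_type].append(num / den)
    return {ct: sum(values) / len(values) for ct, values in per_type.items()}
```

With two hepatocyte clusters in one dataset (100 cells at 10 nTPM and 300 cells at 20 nTPM) the dataset value is 17.5; averaging with a second dataset at 22.5 gives 20.0.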

Classification of transcriptomics data

The consensus transcriptomics data was used to classify all genes according to their tissue-specific, single cell type-specific, brain region-specific, immune cell-specific or cell line-specific expression into two different schemas: specificity category and distribution category. These are defined based on the total set of all nTPM values in 40 tissues, 81 single cell types, 13 main regions of each mammalian brain, 18 immune cell types or 1132 cell lines grouped into 28 cancer types, using a cutoff value of 1 nTPM as a limit for detection across all tissues or cell types.

Explanation of the specificity category

Category | Description
Enriched | nTPM in a particular tissue/region/cell type at least four times any other tissue/region/cell type
Group enriched | nTPM in a group (of 2-5 tissues, brain regions, single cell types or cell lines, or 2-10 immune cell types) at least four times any other tissue/region/cell line/immune cell type/cell type
Enhanced | nTPM in one or several tissues, brain regions, cell lines, immune cell types or single cell types that is at least four times the mean of all tissues/regions/cell types
Low specificity | nTPM ≥ 1 in at least one tissue/region/cell type but not elevated in any tissue/region/cell type
Not detected | nTPM < 1 in all tissues/regions/cell types


An additional category "elevated", containing all genes in the first three categories (tissue/cell line/cell type enriched, group enriched and tissue/cell line/cell type enhanced), has been used for some parts of the analysis. TS/CS-score (Tissue Specificity/Cell Specificity score) is calculated for “elevated” tissues/cell lines. TS/CS-score is calculated as the fold change from the tissue/cell line with highest RNA to the tissue/cell line with second highest RNA.
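
The specificity rules and the TS/CS-score can be sketched as a small classifier. This is one possible reading of the table, not the HPA implementation: in particular, "Group enriched" is interpreted here as every member of the group being at least four times the highest value outside the group, and all names are made up for the example.

```python
DETECTION_LIMIT = 1.0  # nTPM cutoff from the text
FOLD = 4.0             # "at least four times"
MAX_GROUP = 5          # groups of 2-5 (the text allows 2-10 for immune cell types)


def specificity_category(ntpm, max_group=MAX_GROUP):
    """ntpm: {tissue: nTPM} for one gene."""
    values = sorted(ntpm.values(), reverse=True)
    if values[0] < DETECTION_LIMIT:
        return "Not detected"
    if len(values) > 1 and values[0] >= FOLD * values[1]:
        return "Enriched"
    # Group of size k: lowest member values[k-1] vs. highest outsider values[k].
    for k in range(2, min(max_group, len(values) - 1) + 1):
        if values[k - 1] >= DETECTION_LIMIT and values[k - 1] >= FOLD * values[k]:
            return "Group enriched"
    mean = sum(values) / len(values)
    if values[0] >= FOLD * mean:
        return "Enhanced"
    return "Low specificity"


def ts_score(ntpm):
    """Fold change from the highest to the second highest nTPM."""
    first, second = sorted(ntpm.values(), reverse=True)[:2]
    return first / second if second > 0 else float("inf")


gene = {"liver": 40.0, "kidney": 5.0, "lung": 2.0}
print(specificity_category(gene), ts_score(gene))  # Enriched 8.0
```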

Explanation of the distribution category

Category | Description
Detected in single | Detected in a single tissue/region/cell type
Detected in some | Detected in more than one but less than one third of tissues/regions/cell types
Detected in many | Detected in at least a third but not all tissues/regions/cell types
Detected in all | Detected in all tissues/regions/cell types
Not detected | nTPM < 1 in all tissues/regions/cell types
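
As an illustration, the distribution categories reduce to counting in how many tissues a gene reaches the detection limit. The function name and input layout are assumptions; the thresholds follow the table.

```python
def distribution_category(ntpm, limit=1.0):
    """ntpm: {tissue: nTPM} for one gene; limit is the 1 nTPM detection cutoff."""
    n_total = len(ntpm)
    n_detected = sum(1 for v in ntpm.values() if v >= limit)
    if n_detected == 0:
        return "Not detected"
    if n_detected == n_total:
        return "Detected in all"
    if n_detected == 1:
        return "Detected in single"
    # Integer comparison avoids rounding issues around exactly one third.
    if 3 * n_detected < n_total:
        return "Detected in some"
    return "Detected in many"
```

With nine tissues, for example, two detected tissues fall under "Detected in some" and three under "Detected in many".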

External immune cell RNA-seq data

In addition to the immune cell type data from blood, generated within the Human Protein Atlas project, data from 15 immune cell types by Schmiedel et al. and 29 immune cell types as well as total PBMC by Monaco et al. have been incorporated into the Blood Atlas.

The Schmiedel dataset is available at the DICE (Database of Immune Cell Expression, Expression quantitative trait loci (eQTLs) and Epigenomics) database, which was established to address how genetic variants associated with risk for human diseases affect gene expression in various cell types. The TPM values per gene for 15 immune cell types were mapped to the corresponding genes in the Ensembl version used in the Human Protein Atlas.

The Monaco dataset contains data for 29 immune cell types within the peripheral blood mononuclear cell (PBMC) fraction of healthy donors using RNA-seq and flow cytometry. Raw data for 29 immune cells as well as total PBMC were analyzed using the same pipeline as for HPA-generated RNA-seq data and also normalized using TMM to allow for between-sample comparisons. Normalized gene expression values are reported as nTPM values.

Gene expression clustering of transcriptomics data

The RNA expression data has been used to classify protein-coding genes into expression clusters for tissues, single cell types, immune cells, and cell lines.

Clustering | Number of tissues, cell types or cell lines | Sample aggregation level
Tissue | 50 | Averaged expression per tissue type
Single cell | 557 | Averaged expression per cell type cluster
Cell lines | 1206 | Expression of individual cell line
Immune cells | 103 | Averaged expression per immune cell
Brain | 193 | Averaged expression per brain region


Pre-processing the data for clustering

For each dataset, genes detected at nTPM > 1 in at least one of the samples were selected, and the data was genewise scaled to z-scores to account for differences in dynamic ranges between genes across samples. After scaling, the expression data was projected into a lower dimensional space using Principal Component Analysis (PCA), where a number of components were selected to satisfy Kaiser’s rule and at least 80% of variance explained.
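The pre-processing steps above can be sketched as follows. The PCA itself is not shown; the sketch only covers gene filtering, genewise z-scoring, and the choice of the number of components from a list of PCA eigenvalues. Using the population standard deviation and taking the larger of the Kaiser count and the 80% variance count are assumptions, since the source does not specify either; all function names are hypothetical.

```python
from statistics import mean, pstdev


def filter_detected(expr, cutoff=1.0):
    """Keep genes with nTPM > cutoff in at least one sample."""
    return {gene: vals for gene, vals in expr.items() if max(vals) > cutoff}


def zscore_rows(matrix):
    """Scale each gene (row) to z-scores across samples."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)  # assumption: population SD
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scaled


def n_components(eigenvalues, min_variance=0.80):
    """Number of components meeting Kaiser's rule and the variance target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    kaiser = sum(1 for ev in ordered if ev > 1.0)  # eigenvalue > 1
    cumulative, by_variance = 0.0, 0
    for ev in ordered:
        cumulative += ev
        by_variance += 1
        if cumulative / total >= min_variance:
            break
    # Assumption: both criteria must hold, so take the larger count.
    return max(kaiser, by_variance)


print(n_components([3.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.5]))  # 5
```

In this example Kaiser's rule alone would keep a single component, but five are needed to reach 80% of the variance, so the variance criterion decides.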

Gene clustering

Gene-to-gene distances were calculated as the Spearman correlation of gene expression across samples, and transformed to Spearman distance (1 - Spearman correlation). The distances were transformed into a shared nearest neighbor graph and used for Louvain clustering to find clusters of genes with similar expression profiles within the graph. To account for stochasticity in the clustering process, each clustering was run 100 times and subsequently collapsed into a single consensus clustering. Confidence of the gene-to-cluster assignment was calculated as the fraction of times that the gene was assigned to the cluster.
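Two of the quantities above, the Spearman distance and the assignment confidence, are simple enough to sketch directly. The shared nearest neighbor graph and the Louvain step are not shown. The majority vote used in `consensus` is an assumption (the source does not describe how the 100 runs are collapsed), and the sketch assumes cluster labels have already been matched across runs.

```python
from collections import Counter
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    # Note: undefined (division by zero) if either vector is constant.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_distance(x, y):
    """1 - Spearman correlation of two expression profiles."""
    return 1.0 - pearson(ranks(x), ranks(y))


def consensus(labels_per_run):
    """Most frequent cluster label for a gene and the fraction of runs with it."""
    label, count = Counter(labels_per_run).most_common(1)[0]
    return label, count / len(labels_per_run)


print(spearman_distance([1, 2, 3, 4], [4, 3, 2, 1]))  # perfectly anti-correlated
print(consensus(["c1"] * 87 + ["c2"] * 13))
```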

Cluster annotation

The clustering generated for each of the datasets is manually annotated to assign a specificity and function to each cluster. The annotation is based on overrepresentation analysis towards biological databases, including Gene Ontology, Reactome, PanglaoDB, TRRUST, and KEGG, as well as HPA classifications including subcellular location, protein class, secretion location and classification, and specificity toward tissues, single cell types, immune cells, brain regions, and cell lines. A reliability score is manually set for each cluster indicating the confidence of specificity and function assignment.
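The source does not state which statistical test underlies the overrepresentation analysis. A one-sided hypergeometric test is a common choice for this kind of analysis, and the sketch below assumes it purely for illustration; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_annotated, n_cluster, n_overlap):
    """P(overlap >= n_overlap) when n_cluster genes are drawn at random
    from n_universe genes, of which n_annotated carry the term."""
    upper = min(n_annotated, n_cluster)
    total = comb(n_universe, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes in total, 5 annotated with a term,
# a cluster of 5 genes that contains all 5 annotated genes.
print(hypergeom_pvalue(10, 5, 5, 5))  # 1 / 252
```

In practice such p-values would also need correction for multiple testing across terms and clusters, which is omitted here.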

Clustering visualization

The clustering results are visualized in a UMAP. Colored polygons were generated to represent the main contiguous masses of genes corresponding to the same cluster. First, for each cluster, the two-dimensional density was estimated in the UMAP, and an area enveloping 95% of the total density was determined. The areas were moderated to include contiguous areas corresponding to at least 5% of the total area in the UMAP space. Finally, the contiguous areas were converted to two-dimensional polygons for each cluster.

TCGA RNA-seq data

The Cancer Genome Atlas (TCGA) project of Genomic Data Commons (GDC) collects and analyzes multiple human cancer samples. RNA-seq data from 17 cancer types representing 21 cancer subtypes with a corresponding major cancer type in the Human Pathology Atlas were included to allow for comparisons between the protein staining data from the Human Protein Atlas and RNA-seq from TCGA data.

The TCGA RNA-seq data was mapped using the Ensembl gene id available from TCGA, and the FPKMs (number of Fragments Per Kilobase of exon per Million reads) for each gene were subsequently used for quantification of expression with a detection threshold of 1 FPKM. Genes were categorized using the same classification as described above.

HPA cancer type | TCGA cancer | No. of samples in TCGA
Bladder Urothelial Carcinoma (TCGA) | Bladder Urothelial Carcinoma (BLCA) | 169
Breast Invasive Carcinoma (TCGA) | Breast Invasive Carcinoma (BRCA) | 1022
Breast Invasive Carcinoma (validation) | Breast Invasive Carcinoma (BRCA) | 50
Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (TCGA) | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) | 283
Colon Adenocarcinoma (TCGA) | Colon Adenocarcinoma (COAD) | 254
Colon Adenocarcinoma (validation) | Colon Adenocarcinoma (COAD) | 486
Glioblastoma Multiforme (TCGA) | Glioblastoma Multiforme (GBM) | 141
Glioblastoma Multiforme (validation) | Glioblastoma Multiforme (GBM) | 58
Head and Neck Squamous Cell Carcinoma (TCGA) | Head and Neck Squamous Cell Carcinoma (HNSC) | 492
Kidney Chromophobe (TCGA) | Kidney Chromophobe (KICH) | 64
Kidney Renal Clear Cell Carcinoma (TCGA) | Kidney Renal Clear Cell Carcinoma (KIRC) | 521
Kidney Renal Clear Cell Carcinoma (validation) | Kidney Renal Clear Cell Carcinoma (KIRC) | 100
Kidney Renal Papillary Cell Carcinoma (TCGA) | Kidney Renal Papillary Cell Carcinoma (KIRP) | 282
Liver Hepatocellular Carcinoma (TCGA) | Liver Hepatocellular Carcinoma (LIHC) | 362
Liver Hepatocellular Carcinoma (validation) | Liver Hepatocellular Carcinoma (LIHC) | 231
Lung Adenocarcinoma (TCGA) | Lung Adenocarcinoma (LUAD) | 497
Lung Adenocarcinoma (validation) | Lung Adenocarcinoma (LUAD) | 105
Lung Squamous Cell Carcinoma (TCGA) | Lung Squamous Cell Carcinoma (LUSC) | 489
Lung Squamous Cell Carcinoma (validation) | Lung Squamous Cell Carcinoma (LUSC) | 68
Ovary Serous Cystadenocarcinoma (TCGA) | Ovary Serous Cystadenocarcinoma (OV) | 349
Ovary Serous Cystadenocarcinoma (validation) | Ovary Serous Cystadenocarcinoma (OV) | 81
Pancreatic Adenocarcinoma (TCGA) | Pancreatic Adenocarcinoma (PAAD) | 176
Pancreatic Adenocarcinoma (validation) | Pancreatic Adenocarcinoma (PAAD) | 80
Prostate Adenocarcinoma (TCGA) | Prostate Adenocarcinoma (PRAD) | 480
Rectum Adenocarcinoma (TCGA) | Rectum Adenocarcinoma (READ) | 88
Rectum Adenocarcinoma (validation) | Rectum Adenocarcinoma (READ) | 207
Skin Cutaneous Melanoma (TCGA) | Skin Cutaneous Melanoma (SKCM) | 99
Stomach Adenocarcinoma (TCGA) | Stomach Adenocarcinoma (STAD) | 346
Testicular Germ Cell Tumor (TCGA) | Testicular Germ Cell Tumor (TGCT) | 133
Thyroid Carcinoma (TCGA) | Thyroid Carcinoma (THCA) | 495
Uterine Corpus Endometrial Carcinoma (TCGA) | Uterine Corpus Endometrial Carcinoma (UCEC) | 176

TCGA survival

Based on the FPKM value of each gene, patients were classified into two expression groups and the correlation between expression level and patient survival was examined. The prognosis of each group of patients was examined by Kaplan-Meier survival estimators, and the survival outcomes of the two groups were compared by log-rank tests. Both median and maximally separated Kaplan-Meier plots are presented in the Human Protein Atlas, and genes with log-rank P values less than 0.001 in the maximally separated Kaplan-Meier analysis were defined as prognostic genes. If the group of patients with high expression of a selected prognostic gene has more observed events than expected events, the gene is an unfavorable prognostic gene; otherwise, it is a favorable prognostic gene. Genes with a median expression below 1 FPKM were considered lowly expressed and classified as unprognostic in the database, even if they exhibited a significant prognostic effect in the survival analysis.
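The observed-versus-expected comparison that decides between favorable and unfavorable can be sketched with the standard log-rank bookkeeping. This is a minimal illustration, not the HPA pipeline: the p-value and the search for the maximally separating cutoff are omitted, assigning patients at the median to the low group is an assumption, and all names and data are hypothetical.

```python
from statistics import median


def median_split(fpkm):
    """True for patients in the high-expression group (above the median)."""
    cutoff = median(fpkm)
    return [v > cutoff for v in fpkm]  # assumption: ties go to the low group


def observed_expected(times, events, high):
    """Observed and expected event counts for the high-expression group.

    times: follow-up time per patient; events: 1 if the event (death) was
    observed, 0 if censored; high: True if the patient is in the high group.
    """
    observed = expected = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        at_risk_high = sum(1 for i in at_risk if high[i])
        deaths = [i for i in at_risk if times[i] == t and events[i]]
        observed += sum(1 for i in deaths if high[i])
        # Expected events in the high group if both groups had equal hazard.
        expected += len(deaths) * at_risk_high / len(at_risk)
    return observed, expected


def prognostic_direction(observed, expected):
    return "unfavorable" if observed > expected else "favorable"


# Hypothetical cohort: the three high-expression patients die first.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
high = [True, True, True, False, False, False]
obs, exp = observed_expected(times, events, high)
print(obs, round(exp, 2), prognostic_direction(obs, exp))
```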

Allen Mouse brain ISH dataset

The Allen Brain Atlas (ABA) is an open access database focusing on the brain, and includes both human and mouse expression data. The ABA is a part of the Allen Institute for Brain Science, which is one of the three branches of the Allen Institute. The Mouse brain In situ hybridization (ISH) data provides information on where in the adult mouse brain each gene is expressed (Lein ES et al. (2007)). We have imported the expression values available through the ABA API (© 2004 Allen Institute for Brain Science, Allen Mouse Brain Atlas) and show the regional expression grouped in the same manner as the other datasets visualized on the HPA Brain Atlas.

The Allen mouse brain ISH data was mapped to the mouse gene annotation of Ensembl version 109 using the probe nucleotide sequences provided through the Allen mouse brain API together with the BLAST program package. The mouse genes were then mapped to human genes using Ensembl orthologue data with a one-to-one restriction.
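The one-to-one restriction can be illustrated by discarding any mouse or human gene that appears in more than one orthologue pair. The sketch below assumes the orthologue data is available as a list of (mouse gene, human gene) pairs; the identifiers are placeholders, not real Ensembl ids.

```python
from collections import Counter


def one_to_one(pairs):
    """Keep only pairs where both the mouse and the human gene occur once."""
    unique_pairs = set(pairs)  # ignore exact duplicate rows
    mouse_counts = Counter(m for m, _ in unique_pairs)
    human_counts = Counter(h for _, h in unique_pairs)
    return {
        m: h
        for m, h in unique_pairs
        if mouse_counts[m] == 1 and human_counts[h] == 1
    }


pairs = [
    ("mouse_A", "human_A"),  # one-to-one: kept
    ("mouse_B", "human_B"),  # one-to-many: dropped
    ("mouse_B", "human_C"),
    ("mouse_D", "human_E"),  # many-to-one: dropped
    ("mouse_F", "human_E"),
]
print(one_to_one(pairs))  # {'mouse_A': 'human_A'}
```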

Evidence

Protein evidence is calculated for each gene based on three different sources: UniProt protein existence (UniProt evidence); neXtProt protein existence (neXtProt evidence); and a Human Protein Atlas antibody- or RNA based score (HPA evidence). In addition, for each gene, a protein evidence summary score is based on the maximum level of evidence in all three independent evidence scores (Evidence summary).

All scores are classified into the following categories:

  • Evidence at protein level
  • Evidence at transcript level
  • No evidence
  • Not available
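The evidence summary takes the maximum level across the three scores, which can be sketched by ranking the categories. Placing "Not available" below "No evidence" is an assumption; the source only lists the categories in this order.

```python
# Assumed ordering, from weakest to strongest.
RANK = {
    "Not available": 0,
    "No evidence": 1,
    "Evidence at transcript level": 2,
    "Evidence at protein level": 3,
}


def evidence_summary(uniprot, nextprot, hpa):
    """Highest evidence level among the three independent scores."""
    return max((uniprot, nextprot, hpa), key=RANK.__getitem__)


print(evidence_summary("No evidence", "Evidence at transcript level", "Not available"))
```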

UniProt evidence is based on UniProt protein existence data, which uses five types of evidence for the existence of a protein. All genes in the classes "Experimental evidence at protein level" or "Experimental evidence at transcript level" are classified into the first two evidence categories, whereas genes from the "Inferred from homology", "Predicted", or "Uncertain" classes are classified as "No evidence". Genes where the gene identifier could not be mapped to UniProt from Ensembl version 109 are classified as "Not available".

neXtProt evidence is based on neXtProt protein existence data, which uses five types of evidence for the existence of a protein. All genes in the classes "Experimental evidence at protein level" or "Experimental evidence at transcript level" are classified into the first two evidence categories, whereas genes from the "Inferred from homology", "Predicted", or "Uncertain" classes are classified as "No evidence". Genes where the gene identifier could not be mapped to neXtProt from Ensembl version 109 are classified as "Not available".

The HPA evidence is calculated based on the manual curation of Western blot, tissue profiling and subcellular location as well as transcript profiling. All genes with Data reliability "Supported" in one or both of the two methods immunohistochemistry and immunofluorescence, or standard validation "Supported" for the Western blot application (assays using over-expression lysates not included) are classified as "Evidence at protein level". For the remaining genes, all genes detected at nTPM > 1 in at least one of the HPA consensus, brain or immune cell sets used in the RNA-seq analysis based on HPA and GTEx are classified as "Evidence at transcript level". The remaining genes are classified as "No evidence".
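The HPA evidence rule reads as a two-step decision, sketched below. The parameter names are hypothetical, `max_ntpm` stands for the highest nTPM across the HPA consensus, brain and immune cell sets, and the Western blot value is assumed to already exclude assays using over-expression lysates.

```python
def hpa_evidence(ihc_reliability, if_reliability, wb_validation, max_ntpm):
    """Classify a gene following the HPA evidence rule described above.

    Use None for any assay that is not available for the gene.
    """
    if "Supported" in (ihc_reliability, if_reliability, wb_validation):
        return "Evidence at protein level"
    if max_ntpm is not None and max_ntpm > 1:
        return "Evidence at transcript level"
    return "No evidence"


print(hpa_evidence(None, "Supported", None, 0.3))
print(hpa_evidence("Uncertain", None, None, 12.5))
```

The first call returns "Evidence at protein level" because a single "Supported" assay is sufficient regardless of RNA level; the second falls through to the transcript-level rule.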

The Human Protein Atlas

The Human Protein Atlas project is funded
by the Knut & Alice Wallenberg Foundation.


contact@proteinatlas.org
