Human assembly and gene annotation

Assembly

This site provides a data set based on the December 2013 Homo sapiens high coverage assembly GRCh38 from the Genome Reference Consortium. This assembly is used by UCSC to create their hg38 database. The data set consists of gene models built from the genewise alignments of the human proteome as well as from alignments of human cDNAs using the cDNA2genome model of exonerate.

This release of the assembly has the following properties:

contig length total 3.4 Gb.
chromosome length total 3.1 Gb (excluding haplotypes).

It also includes 261 alt loci scaffolds, mainly in the LRC/KIR complex on chromosome 19 (35 alternate sequence representations) and the MHC region on chromosome 6 (7 alternate sequence representations).

Watch a video on YouTube about patches and haplotypes in the Human genome.

Patches

As the GRC maintains and improves the assembly, patches are being introduced. Currently, assembly patches are of two types:

Novel patch: new sequences that add alternative sequence at a locus and will remain as haplotypes in the next major assembly release by GRC.
Fix patch: sequences that correct the reference sequence and will replace the given region of the reference assembly at the next major assembly release by GRC.

Gene annotation

The Ensembl human gene annotations have been updated using Ensembl's automatic annotation pipeline. The updated annotation incorporates new protein and cDNA sequences which have become publicly available since the last GRCh38 genebuild (December 2013).

In the current release, we continue to display a joint gene set based on the merge between the automatic annotation from Ensembl and the manually curated annotation from Havana. See the statistics table for the corresponding GENCODE version number. The Consensus Coding Sequence (CCDS) identifiers have also been mapped to the annotations. More information about the CCDS project.

Updated manual annotation from Havana is merged into the Ensembl annotation every release. Transcripts from the two annotation sources are merged if they share the same internal exon-intron boundaries (i.e. have identical splicing pattern) with slight differences in the terminal exons allowed. Importantly, all Havana transcripts are included in the final Ensembl/Havana merged (GENCODE) gene set.
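The merge rule above can be illustrated with a minimal sketch. It is not the Ensembl/Havana merge code: the function names, the coordinate convention (1-based inclusive exon coordinates) and the example coordinates are assumptions made for illustration, and both transcripts are assumed to lie on the same chromosome and strand. Comparing the intron chain alone means the outer ends of the terminal exons are free to differ, which is one simple way to model "slight differences in the terminal exons allowed".

```python
from typing import List, Tuple

# Hypothetical representation: an exon is (start, end), 1-based inclusive.
Exon = Tuple[int, int]
Intron = Tuple[int, int]


def intron_chain(exons: List[Exon]) -> List[Intron]:
    """Return the introns implied by consecutive exons, in genomic order."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def same_splicing_pattern(a: List[Exon], b: List[Exon]) -> bool:
    """True if both transcripts share identical internal exon-intron boundaries.

    Assumption of this sketch: single-exon transcripts have no internal
    boundaries to compare, so they are reported as not matching.
    """
    chain_a = intron_chain(a)
    chain_b = intron_chain(b)
    return bool(chain_a) and chain_a == chain_b


# Made-up coordinates: the two models differ only at the outer ends
# of their first and last exons.
automatic_model = [(100, 200), (300, 400), (500, 650)]
manual_model = [(120, 200), (300, 400), (500, 600)]
print(same_splicing_pattern(automatic_model, manual_model))
```

A transcript whose internal exon ends at 410 instead of 400 would produce a different intron chain and would not be treated as the same splicing pattern in this sketch.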

Detailed information on genebuild (PDF)

The T2T-CHM13v2.0 assembly and annotation are available through Ensembl Rapid Release.

More information

General information about this species can be found in Wikipedia.

Statistics

Summary

Assembly	GRCh38.p14 (Genome Reference Consortium Human Build 38), INSDC Assembly GCA_000001405.29, Dec 2013
Base Pairs	3,099,750,718
Golden Path Length	3,099,750,718
Assembly provider	Genome Reference Consortium
Annotation provider	Ensembl
Annotation method	Full genebuild
Genebuild started	Jan 2014
Genebuild released	Jul 2014
Genebuild last updated/patched	May 2025
Database version	115.38
Gencode version	GENCODE 49
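
The assembly identifiers in the summary carry structure that is easy to pull apart: GRCh38.p14 is build GRCh38 at patch release 14, and GCA_000001405.29 is version 29 of the INSDC assembly accession. The sketch below shows one way to split them. The function names and regular expressions are assumptions made for this example, not part of any Ensembl or INSDC tool.

```python
import re
from typing import Tuple


def parse_assembly_name(name: str) -> Tuple[str, int]:
    """Split a name such as 'GRCh38.p14' into (build, patch level).

    A name without a '.pN' suffix is treated as patch level 0 (assumption).
    """
    match = re.fullmatch(r"(GRCh\d+)(?:\.p(\d+))?", name)
    if match is None:
        raise ValueError(f"unrecognised assembly name: {name!r}")
    patch = int(match.group(2)) if match.group(2) else 0
    return match.group(1), patch


def parse_accession(accession: str) -> Tuple[str, str, int]:
    """Split an accession such as 'GCA_000001405.29' into (prefix, number, version)."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.(\d+)", accession)
    if match is None:
        raise ValueError(f"unrecognised accession: {accession!r}")
    return match.group(1), match.group(2), int(match.group(3))


print(parse_assembly_name("GRCh38.p14"))
print(parse_accession("GCA_000001405.29"))
```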

Gene counts (Primary assembly)

Coding genes	19,869 (excl 664 readthrough)
Non coding genes	42,124
Small non coding genes	4,866
Long non coding genes	35,042 (excl 302 readthrough)
Misc non coding genes	2,216
Pseudogenes	15,204 (excl 1 readthrough)
Gene transcripts	509,650

Coding gene: a gene/transcript that contains an open reading frame (ORF).
Readthrough: a readthrough transcript has exons that overlap exons from transcripts belonging to two or more different loci (in addition to the locus to which the readthrough transcript itself belongs).
Pseudogene: a gene that has homology to known protein-coding genes but contains a frameshift and/or stop codon(s) which disrupt the ORF. Thought to have arisen through duplication followed by loss of function.
Transcript: the operational unit of a gene. In a genomic context, transcripts consist of one or more exons, with adjoining exons being separated by introns. The exons/introns are transcribed and then the introns spliced out. Transcripts may or may not encode a protein.

Gene counts (Alternative sequence)

Coding genes	3,302 (excl 34 readthrough)
Non coding genes	2,008
Small non coding genes	349
Long non coding genes	1,457 (excl 33 readthrough)
Misc non coding genes	202
Pseudogenes	2,061 (excl 1 readthrough)
Gene transcripts	24,090
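
As a quick consistency check on the two gene count tables, the small, long and misc non coding counts should add up to the non coding total in each table. The sketch below copies the figures from the tables above; the dictionary layout and function names are assumptions made for this example.

```python
from typing import Dict


def to_int(text: str) -> int:
    """Convert a count written with thousands separators, e.g. '42,124'."""
    return int(text.replace(",", ""))


# Figures copied from the gene count tables on this page.
GENE_COUNTS: Dict[str, Dict[str, str]] = {
    "Primary assembly": {
        "Non coding genes": "42,124",
        "Small non coding genes": "4,866",
        "Long non coding genes": "35,042",
        "Misc non coding genes": "2,216",
    },
    "Alternative sequence": {
        "Non coding genes": "2,008",
        "Small non coding genes": "349",
        "Long non coding genes": "1,457",
        "Misc non coding genes": "202",
    },
}

SUBCATEGORIES = (
    "Small non coding genes",
    "Long non coding genes",
    "Misc non coding genes",
)


def non_coding_breakdown_matches(counts: Dict[str, str]) -> bool:
    """True if the three subcategories sum to the non coding total."""
    parts = sum(to_int(counts[key]) for key in SUBCATEGORIES)
    return parts == to_int(counts["Non coding genes"])


for label, counts in GENE_COUNTS.items():
    print(label, non_coding_breakdown_matches(counts))
```

Both tables pass this check with the figures listed for release 115.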

Other

Genscan gene predictions	50,174
Short Variants	1,112,354,242
Structural variants	7,862,163

Ensembl release 115 - September 2025 © EMBL-EBI
http://asia.ensembl.org
