BGZF

From Wikipedia, the free encyclopedia

File format for block-based Gzip compression

BGZF (Blocked GNU Zip Format)
Filename extension	.gz
Internet media type	application/gzip
Magic number	`\x1f\x8b\x08\x04` (initial bytes: the standard gzip magic number, followed by the deflate method byte and a flag byte with the extra-field bit set; BGZF adds a specific extra field to the header of each block)
Developed by	SAMtools project / HTSlib
Initial release	c. 2009 (along with SAM/BAM format specification)
Type of format	Compressed file format, Indexed file format
Container for	Commonly used for bioinformatics data such as SAM, BAM, and VCF records
Extended from	Gzip
Standard	https://samtools.github.io/hts-specs/SAMv1.pdf#page=13.12
Website	www.htslib.org/doc/bgzip.html

Blocked GNU Zip Format (BGZF) is a variant of the gzip file format that uses block compression: data is compressed in independent blocks, each of which is a valid gzip file. The design is widely used in bioinformatics for genomic data compression.^[1] It provides efficient storage, random access through indexed queries,^[2]^[3] and parallel processing, which together enable large-scale data processing.^[4]

The format was developed as part of the SAM/BAM specification and SAMtools.^[5] It is a core component of the common BAM format (the binary version of the Sequence Alignment Map format) and is also used to compress and index Variant Call Format (VCF), FASTA, and BED files.^[6] Because each block is itself a valid gzip file, a BGZF file can be decompressed by any standard gzip-compatible tool, ensuring backward compatibility.^[7] bgzip, a general-purpose compression utility for producing BGZF files, is distributed with the HTSlib software library.^[6]

Uses

BGZF is widely used in bioinformatics to compress large datasets for which efficient random access is essential.^[1] Because next-generation sequencing data formats such as SAM files are large,^[8] they are converted to the binary BAM format, which uses BGZF compression.^[4]^[9]

For random access, an index file is created for a BGZF-compressed file, typically using Tabix.^[10] The index stores the file offsets of the compressed blocks alongside the corresponding genomic coordinates. A program can therefore seek directly to the blocks containing the queried data, decompress only those blocks, and retrieve the requested information without processing the entire file.^[10]

The format is also used extensively to compress Variant Call Format (VCF) files along with their associated Tabix indexes,^[10] and similarly for other large genomic data files such as BED, GFF/GTF, and occasionally FASTQ when indexed access is necessary.^[6] A broad range of bioinformatics software can read and write BGZF-compressed files, including SAMtools, HTSlib, BCF/VCFtools,^[11] Picard tools, the GATK, and libraries such as Biopython.^[12]^[13] The standard command-line utility for creating BGZF-compressed files and their corresponding .gzi indexes is bgzip, which is distributed as part of HTSlib.^[7]

The block-based design of BGZF has also been adapted to develop more efficient data-specific compression methods and algorithms.^[14]

Design schema

A BGZF file consists of a series of concatenated BGZF blocks. Each block, whether compressed or uncompressed, is limited to a maximum size of 64 kilobytes (65,536 bytes). Each BGZF block is itself a fully compliant gzip file that adheres to RFC 1952.^[15]
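The following is a minimal sketch of that property, not the HTSlib implementation. The helper name `make_bgzf_block` and the fixed compression level are assumptions made for illustration. It wraps each chunk in a gzip member with a BGZF-style extra field, concatenates the members, and decompresses the result with Python's standard `gzip` module.

```python
import gzip
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Wrap `data` in one gzip member with a 'BC' extra subfield (hypothetical helper)."""
    # Raw deflate stream: negative wbits means no zlib or gzip wrapper.
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = deflater.compress(data) + deflater.flush()
    # 12-byte gzip header + 6-byte extra subfield + payload + 8-byte trailer.
    total = 12 + 6 + len(cdata) + 8
    if total > 65536:
        raise ValueError("block would exceed the 64-kilobyte BGZF limit")
    # ID1, ID2, CM=8 (deflate), FLG=4 (extra field), MTIME, XFL, OS, XLEN=6.
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Subfield IDs 66 and 67 ('BC'), payload length 2, block size minus one.
    extra = struct.pack("<BBHH", 66, 67, 2, total - 1)
    # gzip trailer: CRC-32 and length of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


chunks = [b"first block\n", b"second block\n"]
first = make_bgzf_block(chunks[0])
bgzf_bytes = b"".join(make_bgzf_block(c) for c in chunks)

# A single block and the concatenation are both readable by a plain gzip reader.
assert gzip.decompress(first) == chunks[0]
assert gzip.decompress(bgzf_bytes) == b"".join(chunks)
```

A production writer would also split its input so that every block stays within the size limit, and would append the EOF marker block.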

Schema of a single compressed block that constitutes the basic unit of a BGZF compressed file.

Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:

The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little-endian format. This integer gives the size of the containing BGZF block minus one.
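The header fields listed above can be read with a few `struct` calls. The sketch below is illustrative only: the function name `bgzf_block_size` is an assumption, and the sample bytes are the EOF marker block quoted elsewhere in this article. It walks the gzip extra field, finds the 'BC' subfield, and returns the total block size.

```python
import struct


def bgzf_block_size(buf: bytes, offset: int = 0) -> int:
    """Return the total size in bytes of the BGZF block starting at `offset`."""
    # Fixed 12-byte gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", buf, offset
    )
    if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 0x04:
        raise ValueError("not a gzip member with the extra-field bit set")
    pos = offset + 12
    end = pos + xlen
    # The extra field is a sequence of subfields: two ID bytes, LEN, payload.
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if (si1, si2, slen) == (66, 67, 2):
            (bsize,) = struct.unpack_from("<H", buf, pos + 4)
            return bsize + 1  # the stored value is the block size minus one
        pos += 4 + slen
    raise ValueError("no 'BC' subfield found")


eof_block = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)
print(bgzf_block_size(eof_block))  # 28
```

Because the block size is available from the header alone, a reader can hop from block to block without decompressing anything.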

This block design allows use of an associated index file (storing offsets of each BGZF block) to fetch and decompress only the block of data that pertains to the query, thus avoiding the computational overhead of reading and decompressing all BGZF blocks.^[10]
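The sketch below illustrates the idea with a toy in-memory index that maps uncompressed positions to block offsets. It is not the Tabix or .gzi on-disk layout, and the names `make_bgzf_block`, `write_blocks`, and `read_at` are hypothetical. Only the block that holds the requested bytes is read and decompressed.

```python
import bisect
import gzip
import io
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Build one BGZF-style gzip member (hypothetical helper, fixed level)."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate
    cdata = deflater.compress(data) + deflater.flush()
    total = 12 + 6 + len(cdata) + 8
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + cdata + trailer


def write_blocks(chunks):
    """Return (file bytes, index), where index holds one
    (uncompressed_start, compressed_offset) pair per block."""
    out = io.BytesIO()
    index = []
    upos = 0
    for chunk in chunks:
        index.append((upos, out.tell()))
        out.write(make_bgzf_block(chunk))
        upos += len(chunk)
    return out.getvalue(), index


def read_at(fileobj, index, upos, n):
    """Read up to n bytes at uncompressed position upos.
    Assumes the requested range lies inside a single block."""
    starts = [start for start, _ in index]
    i = bisect.bisect_right(starts, upos) - 1
    ustart, coffset = index[i]
    fileobj.seek(coffset)
    header = fileobj.read(18)
    # Assumes 'BC' is the first extra subfield, as in blocks written above.
    (bsize,) = struct.unpack_from("<H", header, 16)
    block = header + fileobj.read(bsize + 1 - 18)
    data = gzip.decompress(block)  # decompress this one block only
    return data[upos - ustart : upos - ustart + n]


chunks = [b"AAAA", b"CCCCCC", b"GG"]
blob, index = write_blocks(chunks)
print(read_at(io.BytesIO(blob), index, 5, 3))  # b'CCC'
```

Real indexes such as those produced by Tabix key the block offsets by genomic coordinates instead of raw byte positions, but the seek-then-decompress step is the same in spirit.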

Random access

EOF marker

The end-of-file (EOF) marker for BGZF enables detection of erroneously truncated files, so that software can warn the user or raise an error. The EOF marker is an empty BGZF block (a data block of length zero) encoded with the default zlib compression level settings. It consists of the following 28 bytes, shown in hexadecimal:

1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00

The presence of an EOF marker block does not by itself signal the end of the file. However, an EOF marker at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it.^[15]
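A truncation check along these lines can be sketched in a few lines of Python. The function name `ends_with_bgzf_eof` is an assumption made for illustration; the byte string is the 28-byte marker quoted above. The check compares the last 28 bytes of a file with the marker, and the example also confirms that the marker decompresses to empty data.

```python
import gzip
import io

# The 28-byte EOF marker block, as listed in the text.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def ends_with_bgzf_eof(fileobj) -> bool:
    """Return True if the seekable binary file ends with the EOF marker."""
    size = fileobj.seek(0, io.SEEK_END)
    if size < len(BGZF_EOF):
        return False
    fileobj.seek(size - len(BGZF_EOF))
    return fileobj.read(len(BGZF_EOF)) == BGZF_EOF


# The marker is itself a valid gzip member holding zero bytes of data.
assert gzip.decompress(BGZF_EOF) == b""

complete = io.BytesIO(gzip.compress(b"payload") + BGZF_EOF)
truncated = io.BytesIO(complete.getvalue()[:-5])
print(ends_with_bgzf_eof(complete), ends_with_bgzf_eof(truncated))  # True False
```

With a file on disk, the same function can be called on an object returned by `open(path, "rb")`.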

References

[1] Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (2021-08-25). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. ISSN 1367-4811. PMC 8388020. PMID 33585897. Quote: "BGZF-block level indexing that is common in standard indexes of genomic file formats"
[2] Yamada, Taiju (2020-04-01). "7bgzf: Replacing samtools bgzip deflation for archiving and real-time compression". Computational Biology and Chemistry. 85 107207. doi:10.1016/j.compbiolchem.2020.107207. ISSN 1476-9271. PMID 32092548.
[3] Danecek, Petr; Bonfield, James K; Liddle, Jennifer; Marshall, John; Ohan, Valeriu; Pollard, Martin O; Whitwham, Andrew; Keane, Thomas; McCarthy, Shane A; Davies, Robert M; Li, Heng (2021-02-01). "Twelve years of SAMtools and BCFtools". GigaScience. 10 (2) giab008. doi:10.1093/gigascience/giab008. ISSN 2047-217X. PMC 7931819. PMID 33590861. Quote: "[..] both formats can be either plain (uncompressed) or block-compressed with BGZF for random access and compact size."
[4] Hernaez, Mikel; Pavlichin, Dmitri; Weissman, Tsachy; Ochoa, Idoia (2019-07-20). "Genomic Data Compression". Annual Review of Biomedical Data Science. 2: 19–37. doi:10.1146/annurev-biodatasci-072018-021229. ISSN 2574-3414.
[5] Li, Heng; Handsaker, Bob; Wysoker, Alec; Fennell, Tim; Ruan, Jue; Homer, Nils; Marth, Gabor; Abecasis, Goncalo; Durbin, Richard; et al. (1000 Genome Project Data Processing Subgroup) (2009-08-15). "The Sequence Alignment/Map format and SAMtools". Bioinformatics. 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. ISSN 1367-4803. PMC 2723002. PMID 19505943.
[6] Bonfield, James K; Marshall, John; Danecek, Petr; Li, Heng; Ohan, Valeriu; Whitwham, Andrew; Keane, Thomas; Davies, Robert M (2021-02-01). "HTSlib: C library for reading/writing high-throughput sequencing data". GigaScience. 10 (2) giab007. doi:10.1093/gigascience/giab007. ISSN 2047-217X. PMC 7931820. PMID 33594436.
[7] "bgzip(1) manual page". www.htslib.org. Retrieved 2025-06-03.
[8] Weeks, N. T. (2018). "Openmp task parallelism for faster genomic data processing" (PDF). Quote: "Reading, decoding, sorting, encoding, and writing large sequence alignment files (tens or hundreds of GBs) can be time-consuming and resource intensive."
[9] Sadikin, Rifki; Arisal, Andria; Omar, Rofithah; Mazni, Nur Hidayah (November 2017). "Processing next generation sequencing data in map-reduce framework using hadoop-BAM in a computer cluster". 2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE). pp. 421–425. doi:10.1109/ICITISEE.2017.8285542. ISBN 978-1-5386-0658-2.
[10] Li, Heng (2011-03-01). "Tabix: fast retrieval of sequence features from generic TAB-delimited files". Bioinformatics. 27 (5): 718–719. doi:10.1093/bioinformatics/btq671. ISSN 1367-4803. PMC 3042176. PMID 21208982.
[11] Danecek, Petr; Bonfield, James K; Liddle, Jennifer; Marshall, John; Ohan, Valeriu; Pollard, Martin O; Whitwham, Andrew; Keane, Thomas; McCarthy, Shane A; Davies, Robert M; Li, Heng (2021-02-01). "Twelve years of SAMtools and BCFtools". GigaScience. 10 (2) giab008. doi:10.1093/gigascience/giab008. ISSN 2047-217X. PMC 7931819. PMID 33590861.
[12] "Bio.bgzf module — Biopython 1.85 documentation". Biopython. Retrieved 2025-06-03.
[13] "BlockCompressedOutputStream (htsjdk 2.8.1 API)". SAMtools/HTSlib. Retrieved 2025-06-03.
[14] Li, Miaoxin; Li, Jiang; Li, Mulin Jun; Pan, Zhicheng; Hsu, Jacob Shujui; Liu, Dajiang J.; Zhan, Xiaowei; Wang, Junwen; Song, Youqiang; Sham, Pak Chung (2017-05-19). "Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework". Nucleic Acids Research. 45 (9): e75. doi:10.1093/nar/gkx019. ISSN 1362-4962. PMC 5435951. PMID 28115622.
[15] "HTS format specifications". samtools.github.io. Retrieved 2025-04-01.
