Decompressors for hyparquet
hyparam/hyparquet-compressors
This package provides decompressors for various compression codecs. It is designed to be used with hyparquet in order to provide full support for all parquet compression formats.
Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets. It supports a number of different compression formats, but most parquet files use snappy compression.
Hyparquet is a fast and lightweight parquet reader that is designed to work in both node.js and the browser.
By default, hyparquet only supports `uncompressed` and `snappy` compressed files (the most common parquet compression codecs). The `hyparquet-compressors` package extends support to all legal parquet compression formats.
`hyparquet-compressors` works in both node.js and the browser. It uses js and wasm packages, with no system dependencies.
To use `hyparquet-compressors` with `hyparquet`, simply pass the `compressors` object to the `parquetReadObjects` function.

    import { parquetReadObjects } from 'hyparquet'
    import { compressors } from 'hyparquet-compressors'

    const data = await parquetReadObjects({ file, compressors })

See the hyparquet repo for more info.
Parquet compression types supported with `hyparquet-compressors`:
- Uncompressed
- Snappy
- Gzip
- LZO
- Brotli
- LZ4
- ZSTD
- LZ4_RAW
Snappy compression uses hysnappy for fast snappy decompression using a minimal WASM module.
We load the wasm module synchronously from base64 in the js file. This avoids a network request, and greatly simplifies bundling and serving wasm.
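
The synchronous base64 loading described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not hysnappy's actual code: the `wasmBase64` string is a tiny hand-assembled module exporting a hypothetical `add` function, standing in for the real snappy decoder, and `instantiateFromBase64` is a name invented for this example.

```javascript
// Tiny wasm module (41 bytes) exporting add(a, b), embedded as base64.
// A stand-in for illustration only; the real hysnappy module is different.
const wasmBase64 = 'AGFzbQEAAAABBwFgAn9/AX8DAgEABwcBA2FkZAAACgkBBwAgACABags='

/**
 * Decode a base64 string and instantiate it as a wasm module, synchronously.
 * Hypothetical helper, not part of hysnappy or hyparquet-compressors.
 * @param {string} base64
 * @returns {WebAssembly.Instance}
 */
function instantiateFromBase64(base64) {
  // atob is available in browsers and in recent node.js versions
  const binary = atob(base64)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i)
  // Synchronous compile and instantiate: no fetch, no await
  const wasmModule = new WebAssembly.Module(bytes)
  return new WebAssembly.Instance(wasmModule)
}

const { add } = instantiateFromBase64(wasmBase64).exports
console.log(add(2, 3)) // 5
```

Because the bytes live inside the js file, a bundler sees only ordinary JavaScript. Some browsers limit synchronous compilation on the main thread to small modules, which is one reason this approach pairs well with a minimal wasm binary.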
New gzip implementation adapted from fflate. Includes modifications to handle repeated back-to-back gzip streams that sometimes occur in parquet files (but are not supported by fflate).
For gzip, the `output` buffer argument is optional:
- If `output` is defined, the decompressor will write to `output` until it is full.
- If `output` is undefined, the decompressor will allocate a new buffer, and expand it as needed to fit the uncompressed gzip data. Importantly, the caller should use the returned buffer.
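
The two cases above can be illustrated with a toy decoder. This sketch is not the gzip implementation: `decodeRuns` is a hypothetical run-length decoder invented for this example, used only to show the optional-output contract and why the returned buffer matters when the decoder has to grow its own.

```javascript
/**
 * Toy run-length decoder: input is pairs of [count, value] bytes.
 * Hypothetical stand-in for a real decompressor, to show the output contract.
 * @param {Uint8Array} input
 * @param {Uint8Array} [output] optional preallocated destination
 * @returns {Uint8Array} the buffer holding the decoded bytes
 */
function decodeRuns(input, output) {
  const fixed = output !== undefined
  let out = output ?? new Uint8Array(4) // small initial size to force growth
  let pos = 0
  for (let i = 0; i + 1 < input.length; i += 2) {
    const count = input[i]
    const value = input[i + 1]
    for (let j = 0; j < count; j++) {
      if (pos === out.length) {
        if (fixed) return out // caller's buffer is full: stop writing
        // grow by doubling; this replaces the buffer with a new one
        const grown = new Uint8Array(out.length * 2)
        grown.set(out)
        out = grown
      }
      out[pos++] = value
    }
  }
  // with a caller buffer, return it as-is; otherwise trim to the decoded length
  return fixed ? out : out.subarray(0, pos)
}

const input = new Uint8Array([3, 7, 6, 9]) // three 7s, then six 9s

// Case 1: caller-provided buffer, filled until it is full
const fixedOut = new Uint8Array(5)
decodeRuns(input, fixedOut)
console.log(Array.from(fixedOut)) // [7, 7, 7, 9, 9]

// Case 2: no buffer, so the decoder allocates one and grows it as needed
const result = decodeRuns(input)
console.log(result.length) // 9
```

In the second case the buffer is reallocated as it grows, so any reference taken before the call would be stale. Only the return value is guaranteed to hold the full output.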
Includes a minimal port of brotli.js. Our implementation uses gzip to pre-compress the brotli dictionary, in order to minimize the bundle size.
New LZ4 implementation includes support for legacy hadoop LZ4 frame format used on some old parquet files.
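
As background, and as an assumption drawn from how other parquet readers describe the format rather than from this package's source: the legacy hadoop framing prefixes each LZ4 block with two big-endian 32-bit integers, the decompressed size followed by the compressed size. The sketch below only walks those headers. The hypothetical `readHadoopFrames` helper does no LZ4 decoding, and returns `undefined` when the sizes do not line up, in which case a reader could fall back to treating the input as a plain LZ4 block.

```javascript
/**
 * Walk hadoop-style frames:
 *   [uint32 BE decompressedSize][uint32 BE compressedSize][compressed data]
 * Hypothetical helper for illustration; it does not decompress anything.
 * @param {Uint8Array} input
 * @returns {{ decompressedSize: number, data: Uint8Array }[] | undefined}
 */
function readHadoopFrames(input) {
  const view = new DataView(input.buffer, input.byteOffset, input.byteLength)
  const frames = []
  let offset = 0
  while (offset < input.length) {
    if (offset + 8 > input.length) return undefined // truncated header
    const decompressedSize = view.getUint32(offset) // DataView reads big-endian by default
    const compressedSize = view.getUint32(offset + 4)
    offset += 8
    if (offset + compressedSize > input.length) return undefined // sizes do not fit: not this framing
    frames.push({ decompressedSize, data: input.subarray(offset, offset + compressedSize) })
    offset += compressedSize
  }
  return frames
}

// One frame: 16 bytes decompressed, 3 bytes of (fake) compressed payload
const framed = new Uint8Array([0, 0, 0, 16, 0, 0, 0, 3, 0xaa, 0xbb, 0xcc])
const frames = readHadoopFrames(framed)
console.log(frames.length, frames[0].decompressedSize, frames[0].data.length) // 1 16 3
```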
Uses fzstd for Zstandard decompression.
File | Size |
---|---|
hyparquet-compressors.min.js | 116.4kb |
hyparquet-compressors.min.js.gz | 75.2kb |
- https://parquet.apache.org/docs/file-format/data-pages/compression/
- https://en.wikipedia.org/wiki/Brotli
- https://en.wikipedia.org/wiki/Gzip
- https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
- https://en.wikipedia.org/wiki/Snappy_(compression)
- https://en.wikipedia.org/wiki/Zstd
- https://github.com/101arrowz/fflate
- https://github.com/101arrowz/fzstd
- https://github.com/foliojs/brotli.js
- https://github.com/hyparam/hysnappy