hyparquet: parquet file parser for JavaScript
Dependency free since 2023!
Hyparquet is a lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.
Hyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.
Try hyparquet online: Drag and drop your parquet file onto hyperparam.app to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.
- Browser-native: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
- Performant: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
- TypeScript: Includes TypeScript definitions.
- Dependency-free: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.7kb min.gz!
- Highly Compliant: Supports all parquet encodings, compression codecs, and can open more parquet files than any other library.
Parquet is widely used in data engineering and data science for its efficient storage and processing of large datasets. What if you could use parquet files directly in the browser, without needing a server or backend infrastructure? That's what hyparquet enables.
Existing JavaScript-based parquet readers (like parquetjs) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size. Hyparquet is actively maintained and designed with modern web usage in mind.
Check out a minimal parquet viewer demo that shows how to integrate hyparquet into a React web application using HighTable.
- Live Demo: https://hyparam.github.io/demos/hyparquet/
- Demo Source Code: https://github.com/hyparam/demos/tree/master/hyparquet
To read the contents of a parquet file in a node.js environment, use `asyncBufferFromFile`:

```js
const { asyncBufferFromFile, parquetReadObjects } = await import('hyparquet')

const file = await asyncBufferFromFile(filename)
const data = await parquetReadObjects({ file })
```
Note: Hyparquet is published as an ES module, so a dynamic `import()` may be required on the command line.
In the browser, use `asyncBufferFromUrl` to wrap a URL for reading asynchronously over the network. It is recommended that you filter by row and column to limit fetch size:

```js
const { asyncBufferFromUrl, parquetReadObjects } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')

const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
const file = await asyncBufferFromUrl({ url }) // wrap url for async fetching
const data = await parquetReadObjects({
  file,
  columns: ['Breed Name', 'Lifespan'],
  rowStart: 10,
  rowEnd: 20,
})
```
You can read just the metadata, including schema and data statistics, using the `parquetMetadataAsync` function. To load parquet metadata in the browser from a remote server:

```js
import { asyncBufferFromUrl, parquetMetadataAsync, parquetSchema } from 'hyparquet'

const file = await asyncBufferFromUrl({ url })
const metadata = await parquetMetadataAsync(file)

// Get total number of rows (convert bigint to number)
const numRows = Number(metadata.num_rows)

// Get nested table schema
const schema = parquetSchema(metadata)

// Get top-level column header names
const columnNames = schema.children.map(e => e.element.name)
```
You can also read the metadata synchronously using `parquetMetadata` if you have an array buffer with the parquet footer:

```js
import { parquetMetadata } from 'hyparquet'

const metadata = parquetMetadata(arrayBuffer)
```
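For example, in node.js you could load the file into memory and parse its footer synchronously. This is a minimal sketch, assuming the whole file fits in memory (the parquet footer sits at the end of the buffer); the filename `example.parquet` is just a placeholder:

```js
import { readFileSync } from 'node:fs'
import { parquetMetadata } from 'hyparquet'

// Read the whole file into an ArrayBuffer; the parquet footer is at the end of the buffer
const buf = readFileSync('example.parquet')
const arrayBuffer = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength)

const metadata = parquetMetadata(arrayBuffer)
console.log(Number(metadata.num_rows))
```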
Hyparquet accepts a `file` argument of type `AsyncBuffer`, which is like a js `ArrayBuffer` except that the `slice` method can return `Promise<ArrayBuffer>`. You can pass an `ArrayBuffer` anywhere an `AsyncBuffer` is expected, if you have the entire file in memory.

```ts
type Awaitable<T> = T | Promise<T>

interface AsyncBuffer {
  byteLength: number
  slice(start: number, end?: number): Awaitable<ArrayBuffer>
}
```
You can define your own `AsyncBuffer` to create a virtual file that can be read asynchronously. In most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile`.
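As an illustration, a custom `AsyncBuffer` could be backed by HTTP range requests. The sketch below is not part of the hyparquet API; it assumes the server honors the `Range` header and that the total `contentLength` is known in advance:

```js
// Hypothetical helper: an AsyncBuffer that fetches byte ranges on demand
function rangeAsyncBuffer(url, contentLength) {
  return {
    byteLength: contentLength,
    async slice(start, end) {
      // HTTP Range is inclusive of the end byte; AsyncBuffer.slice end is exclusive
      const headers = { Range: `bytes=${start}-${(end ?? contentLength) - 1}` }
      const res = await fetch(url, { headers })
      return res.arrayBuffer()
    },
  }
}
```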
`parquetReadObjects` is a convenience wrapper around `parquetRead` that returns the complete rows as `Promise<Record<string, any>[]>`. This is the simplest way to read parquet files.

```ts
parquetReadObjects({ file }): Promise<Record<string, any>[]>
```
`parquetRead` is the "base" function for reading parquet files. It returns a `Promise<void>` that resolves when the file has been read, or rejects if an error occurs. Data is returned via `onComplete` or `onChunk` callbacks passed as arguments.
The reason for this design is that parquet is a column-oriented format, and returning data in row-oriented format requires transposing the column data. This is an expensive operation in JavaScript. If you don't pass in an `onComplete` argument to `parquetRead`, hyparquet will skip this transpose step and save memory.
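For reference, a minimal sketch of reading row-oriented data through `onComplete` (assuming `file` is an `AsyncBuffer` as above; by default each row arrives as an array of values, see `rowFormat` below):

```js
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  onComplete: rows => {
    // rows is the full transposed result: one array of values per row
    console.log(`read ${rows.length} rows`)
    console.log(rows[0])
  },
})
```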
The `onChunk` callback allows column-oriented data to be streamed back as it is read.

```ts
interface ColumnData {
  columnName: string
  columnData: ArrayLike<any>
  rowStart: number
  rowEnd: number
}

function onChunk(chunk: ColumnData): void {
  console.log(chunk)
}

await parquetRead({ file, onChunk })
```
Pass the `requestInit` option to `asyncBufferFromUrl` to provide authentication information to a remote web server. For example:

```js
const requestInit = { headers: { Authorization: 'Bearer my_token' } }
const file = await asyncBufferFromUrl({ url, requestInit })
```
By default, data returned by `parquetRead` in the `onComplete` function will be one array of column values per row. If you would like each row to be an object with each key the name of the column, set the option `rowFormat` to `'object'`.

```js
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  rowFormat: 'object',
  onComplete: data => console.log(data),
})
```
The `parquetReadObjects` function defaults to returning an array of objects.
The parquet format is a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures. Hyparquet supports all parquet encodings: plain, dictionary, rle, bit packed, delta, etc.

Hyparquet is the most compliant parquet parser on earth: hyparquet can open more files than pyarrow, rust, and duckdb.
By default, hyparquet supports uncompressed and snappy-compressed parquet files. To support the full range of parquet compression codecs (gzip, brotli, zstd, etc), use the hyparquet-compressors package.
| Codec | hyparquet | with hyparquet-compressors |
| --- | --- | --- |
| Uncompressed | ✅ | ✅ |
| Snappy | ✅ | ✅ |
| GZip | ❌ | ✅ |
| LZO | ❌ | ✅ |
| Brotli | ❌ | ✅ |
| LZ4 | ❌ | ✅ |
| ZSTD | ❌ | ✅ |
| LZ4_RAW | ❌ | ✅ |
For faster snappy decompression, try hysnappy, which uses WASM for a 40% speed boost on large parquet files.
You can include support for all parquet compressors, plus hysnappy, using the hyparquet-compressors package.

```js
import { asyncBufferFromFile, parquetReadObjects } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'

const file = await asyncBufferFromFile(filename)
const data = await parquetReadObjects({ file, compressors })
```
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing
- https://github.com/apache/thrift
- https://github.com/apache/arrow
- https://github.com/dask/fastparquet
- https://github.com/duckdb/duckdb
- https://github.com/google/snappy
- https://github.com/hyparam/hightable
- https://github.com/hyparam/hysnappy
- https://github.com/hyparam/hyparquet-compressors
- https://github.com/ironSource/parquetjs
- https://github.com/zhipeng-jia/snappyjs
Contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
Hyparquet development is supported by an open-source grant from Hugging Face 🤗