hyparquet: parquet file parser for JavaScript
Dependency free since 2023!
Hyparquet is a lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files. Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.
Hyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.
Try hyparquet online: Drag and drop your parquet file onto hyperparam.app to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.
- Browser-native: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.
- Performant: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.
- TypeScript: Includes TypeScript definitions.
- Dependency-free: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.7kb min.gz!
- Highly Compliant: Supports all parquet encodings, compression codecs, and can open more parquet files than any other library.
Parquet is widely used in data engineering and data science for its efficient storage and processing of large datasets. What if you could use parquet files directly in the browser, without needing a server or backend infrastructure? That's what hyparquet enables.
Existing JavaScript-based parquet readers (like parquetjs) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size. Hyparquet is actively maintained and designed with modern web usage in mind.
Check out a minimal parquet viewer demo that shows how to integrate hyparquet into a React web application using HighTable.
- Live Demo: https://hyparam.github.io/demos/hyparquet/
- Demo Source Code: https://github.com/hyparam/demos/tree/master/hyparquet
To read the contents of a parquet file in a node.js environment, use `asyncBufferFromFile`:

```js
const { asyncBufferFromFile, parquetReadObjects } = await import('hyparquet')

const file = await asyncBufferFromFile(filename)
const data = await parquetReadObjects({ file })
```
Note: Hyparquet is published as an ES module, so a dynamic `import()` may be required on the command line.
In the browser, use `asyncBufferFromUrl` to wrap a URL for reading asynchronously over the network. It is recommended that you filter by row and column to limit fetch size:

```js
const { asyncBufferFromUrl, parquetReadObjects } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')

const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
const file = await asyncBufferFromUrl({ url }) // wrap url for async fetching
const data = await parquetReadObjects({
  file,
  columns: ['Breed Name', 'Lifespan'],
  rowStart: 10,
  rowEnd: 20,
})
```
You can read just the metadata, including schema and data statistics, using the `parquetMetadataAsync` function. To load parquet metadata in the browser from a remote server:

```js
import { asyncBufferFromUrl, parquetMetadataAsync, parquetSchema } from 'hyparquet'

const file = await asyncBufferFromUrl({ url })
const metadata = await parquetMetadataAsync(file)

// Get total number of rows (convert bigint to number)
const numRows = Number(metadata.num_rows)

// Get nested table schema
const schema = parquetSchema(metadata)

// Get top-level column header names
const columnNames = schema.children.map(e => e.element.name)
```
You can also read the metadata synchronously using `parquetMetadata` if you have an array buffer with the parquet footer:

```js
import { parquetMetadata } from 'hyparquet'

const metadata = parquetMetadata(arrayBuffer)
```
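For example, in node.js you could load the file into memory and parse its footer synchronously. This is a minimal sketch, assuming the whole file fits in memory (the parquet footer sits at the end of the buffer); the filename `example.parquet` is just a placeholder:

```js
import { readFileSync } from 'node:fs'
import { parquetMetadata } from 'hyparquet'

// Read the whole file into an ArrayBuffer; the parquet footer is at the end of the buffer
const buf = readFileSync('example.parquet')
const arrayBuffer = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength)

const metadata = parquetMetadata(arrayBuffer)
console.log(Number(metadata.num_rows))
```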
Hyparquet accepts a `file` argument of type `AsyncBuffer`, which is like a js `ArrayBuffer` except that the `slice` method can return `Promise<ArrayBuffer>`. You can pass an `ArrayBuffer` anywhere an `AsyncBuffer` is expected, if you have the entire file in memory.

```ts
type Awaitable<T> = T | Promise<T>

interface AsyncBuffer {
  byteLength: number
  slice(start: number, end?: number): Awaitable<ArrayBuffer>
}
```
You can define your own `AsyncBuffer` to create a virtual file that can be read asynchronously. In most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile`.
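As an illustration, a custom `AsyncBuffer` could be backed by HTTP range requests. The sketch below is not part of the hyparquet API; it assumes the server honors the `Range` header and that the total `contentLength` is known in advance:

```js
// Hypothetical helper: an AsyncBuffer that fetches byte ranges on demand
function rangeAsyncBuffer(url, contentLength) {
  return {
    byteLength: contentLength,
    async slice(start, end) {
      // HTTP Range is inclusive of the end byte; AsyncBuffer.slice end is exclusive
      const headers = { Range: `bytes=${start}-${(end ?? contentLength) - 1}` }
      const res = await fetch(url, { headers })
      return res.arrayBuffer()
    },
  }
}
```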
`parquetReadObjects` is a convenience wrapper around `parquetRead` that returns the complete rows as `Promise<Record<string, any>[]>`. This is the simplest way to read parquet files.

```ts
parquetReadObjects({ file }): Promise<Record<string, any>[]>
```
`parquetRead` is the "base" function for reading parquet files. It returns a `Promise<void>` that resolves when the file has been read, or rejects if an error occurs. Data is returned via `onComplete` or `onChunk` callbacks passed as arguments.
The reason for this design is that parquet is a column-oriented format, and returning data in row-oriented format requires transposing the column data. This is an expensive operation in JavaScript. If you don't pass in an `onComplete` argument to `parquetRead`, hyparquet will skip this transpose step and save memory.
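For reference, a minimal sketch of reading row-oriented data through `onComplete` (assuming `file` is an `AsyncBuffer` as above; by default each row arrives as an array of values, see `rowFormat` below):

```js
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  onComplete: rows => {
    // rows is the full transposed result: one array of values per row
    console.log(`read ${rows.length} rows`)
    console.log(rows[0])
  },
})
```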
The `onChunk` callback allows column-oriented data to be streamed back as it is read.

```ts
interface ColumnData {
  columnName: string
  columnData: ArrayLike<any>
  rowStart: number
  rowEnd: number
}

function onChunk(chunk: ColumnData): void {
  console.log(chunk)
}

await parquetRead({ file, onChunk })
```
Pass the `requestInit` option to `asyncBufferFromUrl` to provide authentication information to a remote web server. For example:

```js
const requestInit = { headers: { Authorization: 'Bearer my_token' } }
const file = await asyncBufferFromUrl({ url, requestInit })
```
By default, data returned by `parquetRead` in the `onComplete` function will be one array of column values per row. If you would like each row to be an object with each key the name of the column, set the option `rowFormat` to `'object'`.

```js
import { parquetRead } from 'hyparquet'

await parquetRead({
  file,
  rowFormat: 'object',
  onComplete: data => console.log(data),
})
```
The `parquetReadObjects` function defaults to returning an array of objects.
The parquet format is a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures. Hyparquet supports all parquet encodings: plain, dictionary, rle, bit packed, delta, etc.

Hyparquet is the most compliant parquet parser on earth: hyparquet can open more files than pyarrow, rust, and duckdb.
By default, hyparquet supports uncompressed and snappy-compressed parquet files. To support the full range of parquet compression codecs (gzip, brotli, zstd, etc), use the hyparquet-compressors package.
| Codec | hyparquet | with hyparquet-compressors |
| --- | --- | --- |
| Uncompressed | ✅ | ✅ |
| Snappy | ✅ | ✅ |
| GZip | ❌ | ✅ |
| LZO | ❌ | ✅ |
| Brotli | ❌ | ✅ |
| LZ4 | ❌ | ✅ |
| ZSTD | ❌ | ✅ |
| LZ4_RAW | ❌ | ✅ |
For faster snappy decompression, try hysnappy, which uses WASM for a 40% speed boost on large parquet files.
You can include support for all parquet compressors, plus hysnappy, using the hyparquet-compressors package.

```js
import { asyncBufferFromFile, parquetReadObjects } from 'hyparquet'
import { compressors } from 'hyparquet-compressors'

const file = await asyncBufferFromFile(filename)
const data = await parquetReadObjects({ file, compressors })
```
- https://github.com/apache/parquet-format
- https://github.com/apache/parquet-testing
- https://github.com/apache/thrift
- https://github.com/apache/arrow
- https://github.com/dask/fastparquet
- https://github.com/duckdb/duckdb
- https://github.com/google/snappy
- https://github.com/hyparam/hightable
- https://github.com/hyparam/hysnappy
- https://github.com/hyparam/hyparquet-compressors
- https://github.com/ironSource/parquetjs
- https://github.com/zhipeng-jia/snappyjs
Contributions are welcome! If you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
Hyparquet development is supported by an open-source grant from Hugging Face 🤗