kylebarron/parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data.

Dual-licensed under Apache-2.0 (LICENSE_APACHE) and MIT (LICENSE_MIT).

WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow, using the Rust parquet and arrow crates.

This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library.

Including read and write support and all compression codecs, the brotli-compressed WASM bundle is 1.2 MB. Refer to Custom builds below for how to build a smaller bundle; a minimal read-only bundle without compression support can be as small as 456 KB brotli-compressed.

Install

parquet-wasm is published to NPM. Install with

yarn add parquet-wasm

or

npm install parquet-wasm

API

Parquet-wasm has both a synchronous and asynchronous API. The sync API is simpler but requires fetching the entire Parquet buffer in advance, which is often prohibitive.

Sync API

Refer to these functions: readParquet, which decodes a Parquet buffer to an Arrow table, and writeParquet, which encodes an Arrow table to a Parquet buffer (both appear in the sketch below and in the Example section).
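
A minimal sketch of the sync path (the URL is a placeholder; the same fetch-then-decode pattern appears again under Performance considerations below):

import initWasm, { readParquet } from "parquet-wasm";

// Initialize the WebAssembly module before calling any other API.
await initWasm();

// The sync API needs the complete Parquet buffer up front.
const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());

// Decode to an Arrow table held in WebAssembly memory.
const wasmTable = readParquet(parquetUint8Array);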

Async API

  • readParquetStream: Create a ReadableStream that emits Arrow RecordBatches from a Parquet file.
  • ParquetFile: A class for reading portions of a remote Parquet file. Use fromUrl to construct from a remote URL or fromFile to construct from a File handle. Note that when you're done using this class, you'll need to call free to release any memory held by the ParquetFile instance itself. Both entry points are shown in the sketch below.
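
A sketch of the async entry points, under stated assumptions: readParquetStream is assumed to accept a URL string, ParquetFile's read() method is an assumed name (check the API documentation for exact signatures), and iterating a ReadableStream with for await requires a runtime where ReadableStream is async-iterable (e.g. recent Node).

import initWasm, { readParquetStream, ParquetFile } from "parquet-wasm";

await initWasm();

// Stream Arrow RecordBatches as they are decoded (URL argument assumed).
const stream = await readParquetStream("https://example.com/file.parquet");
for await (const recordBatch of stream) {
  // ... process each RecordBatch as it arrives
}

// Read portions of a remote Parquet file.
const file = await ParquetFile.fromUrl("https://example.com/file.parquet");
const wasmTable = await file.read(); // assumed method name; see the docs
file.free(); // release the Wasm memory held by the ParquetFile itself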

Both sync and async functions return or accept a Table class, an Arrow table in WebAssembly memory. Refer to its documentation for moving data into/out of WebAssembly.

Entry Points

  • parquet-wasm, parquet-wasm/esm, or parquet-wasm/esm/parquet_wasm.js: ESM, to be used directly from the Web as an ES Module.
  • parquet-wasm/bundler: "Bundler" build, to be used in bundlers such as Webpack.
  • parquet-wasm/node: Node build, to be used with synchronous require in NodeJS.

ESM

The esm entry point is the primary entry point. It is the default export from parquet-wasm, and is also accessible at parquet-wasm/esm and parquet-wasm/esm/parquet_wasm.js (for symmetric imports directly from a browser).

Note that when using the esm bundles, you must manually initialize the WebAssembly module before using any APIs. Otherwise, you'll get an error like TypeError: Cannot read properties of undefined. There are multiple ways to initialize the WebAssembly code:

Asynchronous initialization

The primary way to initialize is by awaiting the default export.

import wasmInit, { readParquet } from "parquet-wasm";

await wasmInit();

Without any parameter, this will try to fetch a file named 'parquet_wasm_bg.wasm' at the same location as parquet-wasm (e.g. via this snippet: input = new URL('parquet_wasm_bg.wasm', import.meta.url);).

Note that you can also pass in a custom URL if you want to host the .wasm file on your own servers.

import wasmInit, { readParquet } from "parquet-wasm";

// Update this version to match the version you're using.
const wasmUrl =
  "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm";
await wasmInit(wasmUrl);

Synchronous initialization

The initSync named export allows for synchronous initialization when you already have the Wasm file's contents in memory:

import { initSync, readParquet } from "parquet-wasm";

// The contents of esm/parquet_wasm_bg.wasm in an ArrayBuffer
const wasmBuffer = new ArrayBuffer(...);

// Initialize the Wasm synchronously
initSync(wasmBuffer);

Async initialization should be preferred over downloading the Wasm buffer and then initializing it synchronously, as WebAssembly.instantiateStreaming is the most efficient way to both download and initialize Wasm code.
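
For illustration, a sketch that assumes wasmInit follows the usual wasm-bindgen convention of also accepting a Response promise (if it does not, pass a URL as shown above):

import wasmInit from "parquet-wasm";

// Passing the in-flight fetch lets the init code use
// WebAssembly.instantiateStreaming, so compilation overlaps the download.
await wasmInit(
  fetch("https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm")
);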

Bundler

The bundler entry point doesn't require manual initialization of the WebAssembly blob, but needs setup with whatever bundler you're using. Refer to the Rust Wasm documentation for more info.

Node

The node entry point can be loaded synchronously from Node.

const { readParquet } = require("parquet-wasm");

const wasmTable = readParquet(...);

Using directly from a browser

You can load the esm/parquet_wasm.js file directly from a CDN:

const parquet = await import(
  "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/+esm"
);
await parquet.default();

const wasmTable = parquet.readParquet(...);

This specific endpoint will minify the ESM before you receive it.

Debug functions

These functions are not present in normal builds to cut down on bundle size. To create a custom build, see Custom builds below.

setPanicHook

setPanicHook(): void

Sets console_error_panic_hook in Rust, which provides better debugging of panics by having more informative console.error messages. Initialize this first if you're getting errors such as RuntimeError: Unreachable executed.

The WASM bundle must be compiled with the console_error_panic_hook feature for this function to exist.
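
For example (a sketch assuming a custom build compiled with the debug feature, so that setPanicHook is exported):

import initWasm, { setPanicHook } from "parquet-wasm";

// Initialize the Wasm module, then install the panic hook before other calls
// so Rust panics surface as informative console.error messages.
await initWasm();
setPanicHook();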

Example

import * as arrow from "apache-arrow";
import initWasm, {
  Compression,
  readParquet,
  Table,
  writeParquet,
  WriterPropertiesBuilder,
} from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);
const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);
const rainfall = arrow.tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet
// wasmTable is an Arrow table in WebAssembly memory
const wasmTable = Table.fromIPCStream(arrow.tableToIPC(rainfall, "stream"));
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();
const parquetUint8Array = writeParquet(wasmTable, writerProperties);

// Read Parquet buffer back to Arrow Table
// arrowWasmTable is an Arrow table in WebAssembly memory
const arrowWasmTable = readParquet(parquetUint8Array);

// table is now an Arrow table in JS memory
const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>

Published examples

(These may use older versions of the library with a different API.)

Performance considerations

Tl;dr: When you have a Table object (resulting from readParquet), try the new Table.intoFFI API to move it to JavaScript memory. This API is less well tested than the Table.intoIPCStream API, but should be faster and have much less memory overhead (by a factor of 2). If you hit any bugs, please create a reproducible issue.

Under the hood, parquet-wasm first decodes a Parquet file into Arrow in WebAssembly memory. But then that WebAssembly memory needs to be copied into JavaScript for use by Arrow JS. The "normal" conversion APIs (e.g. Table.intoIPCStream) use the Arrow IPC format to get the data back to JavaScript. But this requires another memory copy inside WebAssembly to assemble the various arrays into a single buffer to be copied back to JS.

Instead, the new Table.intoFFI API uses Arrow's C Data Interface to be able to copy or view Arrow arrays from within WebAssembly memory without any serialization.

Note that this approach uses the arrow-js-ffi library to parse the Arrow C Data Interface definitions. This library has not yet been tested in production, so it may have bugs!

I wrote an interactive blog post on this approach and the Arrow C Data Interface if you want to read more!

Example

import * as arrow from "apache-arrow";
import { parseTable } from "arrow-js-ffi";
import initWasm, { wasmMemory, readParquet } from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// A reference to the WebAssembly memory object.
const WASM_MEMORY = wasmMemory();

const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());
const wasmArrowTable = readParquet(parquetUint8Array).intoFFI();

// Arrow JS table that was directly copied from Wasm memory
const table: arrow.Table = parseTable(
  WASM_MEMORY.buffer,
  wasmArrowTable.arrayAddrs(),
  wasmArrowTable.schemaAddr()
);

// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're
// done using it to release the Wasm memory.
// Note that any access to the pointers in this table is undefined behavior
// after this call. Calling any `wasmArrowTable` method will error.
wasmArrowTable.drop();

Compression support

The Parquet specification permits several compression codecs. This library currently supports the following (a write-time sketch follows the list):

  • Uncompressed
  • Snappy
  • Gzip
  • Brotli
  • ZSTD
  • LZ4_RAW
  • LZ4 (deprecated)
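
The codec is chosen at write time via WriterPropertiesBuilder, as in the Example section above. A minimal sketch (Compression.SNAPPY is an assumed enum member, by analogy with Compression.ZSTD used earlier; wasmTable stands for an existing parquet-wasm Table):

import initWasm, {
  Compression,
  writeParquet,
  WriterPropertiesBuilder,
} from "parquet-wasm";

await initWasm();

// Build writer properties selecting a codec your build was compiled with.
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.SNAPPY) // assumed member; ZSTD is shown above
  .build();

// wasmTable: a Table obtained e.g. via Table.fromIPCStream (see Example above).
// Encode the Wasm-memory Arrow table to a Parquet buffer.
const parquetUint8Array = writeParquet(wasmTable, writerProperties);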

LZ4 support in Parquet is a bit messy. As described here, there are two LZ4 compression options in Parquet (as of version 2.9.0). The original version, LZ4, is now deprecated; it used an undocumented framing scheme which made interoperability difficult. The specification now reads:

It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.

It's currently unknown how widespread ecosystem support is for LZ4_RAW. As of pyarrow v7, it writes LZ4_RAW by default and presumably has read support for it as well.

Custom builds

In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of parquet-wasm to keep bundle size at a minimum. If you install the Rust toolchain and wasm-pack (see Development), you can create a custom build with only the compression codecs you require.

The minimum supported Rust version in this project is 1.60. To upgrade your toolchain, use rustup update stable.

Example custom builds

Reader-only bundle with Snappy compression:

wasm-pack build --no-default-features --features snappy --features reader

Writer-only bundle with no compression support, targeting Node:

wasm-pack build --target nodejs --no-default-features --features writer

Bundle with reader and writer support, targeting Node, using the arrow and parquet crates with all their supported compressions, with console_error_panic_hook enabled:

wasm-pack build \
  --target nodejs \
  --no-default-features \
  --features reader \
  --features writer \
  --features all_compressions \
  --features debug

# Or, given that the default feature includes several of these features, a shorter version:
wasm-pack build --target nodejs --features debug

Refer to the wasm-pack documentation for more info on flags such as --release, --dev, and --target, and to the Cargo documentation for more info on how to use features.

Available features

By default, the all_compressions, reader, writer, and async features are enabled. Use --no-default-features to remove these defaults.

  • reader: Activate read support.
  • writer: Activate write support.
  • async: Activate asynchronous read support.
  • all_compressions: Activate all supported compressions.
  • brotli: Activate Brotli compression.
  • gzip: Activate Gzip compression.
  • snappy: Activate Snappy compression.
  • zstd: Activate ZSTD compression.
  • lz4: Activate LZ4_RAW compression.
  • debug: Expose the setPanicHook function for better error messages for Rust panics.

Node <20

On Node versions before 20, you'll have to polyfill the Web Cryptography API.
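
One common approach (a sketch and an assumption, not guidance from this README): expose Node's built-in webcrypto implementation on the global object before loading parquet-wasm.

// Run this before loading/initializing parquet-wasm on Node < 20.
if (typeof globalThis.crypto === "undefined") {
  // node:crypto's webcrypto object implements the Web Cryptography API.
  globalThis.crypto = require("node:crypto").webcrypto;
}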

Future work

  • Example of pushdown predicate filtering, to download only chunks that match a specific condition
  • Column filtering, to download only certain columns
  • More tests

Acknowledgements

A starting point of my work came from @my-liminal-space's read-parquet-browser (which is also dual-licensed MIT and Apache 2).

@domoritz's arrow-wasm was a very helpful reference for bootstrapping Rust-WASM bindings.

