zsv+lib: tabular data swiss-army knife CLI + world's fastest (simd) CSV parser
Playground (without `sheet` viewer command): https://liquidaty.github.io/zsv
zsv+lib is a fast CSV parser library and extensible command-line utility. It achieves high performance using SIMD operations, efficient memory use and other optimization techniques, and can also parse generic-delimited and fixed-width formats, as well as multi-row-span headers.
The ZSV CLI can be compiled to virtually any target, including WebAssembly, and offers features including `select`, `count`, direct CSV `sql`, `flatten`, `serialize`, `2json` conversion, `2db` sqlite3 conversion, `stack`, `pretty`, `2tsv`, `compare`, `paste`, `overwrite` and more.
The ZSV CLI also includes `sheet`, an in-console interactive grid viewer that includes basic navigation, filtering [[, data editing and pivot table with drill down]], and that supports custom extensions.

Pre-built CLI packages are available via `brew` and `nuget`.
A pre-built library package is available for Node (`npm install zsv-lib`). Please note, this package is still in alpha and currently only exposes a small subset of the zsv library capabilities. More to come.
An online playground is available as well (without the `sheet` feature due to browser limitations).
If you like zsv+lib, do not forget to give it a star! 🌟
Preliminary performance results compare favorably vs other CSV utilities (`xsv`, `tsv-utils`, `csvkit`, `mlr` (miller) etc). Below were results on a pre-M1 macOS MBA; on most platforms zsvlib was 2x faster, though in some cases the advantage was smaller (e.g. 15-25%). Below, mlr is not shown as it was about 25x slower.
** See 12/19 update re M1 processor at https://github.com/liquidaty/zsv/blob/main/app/benchmark/README.md
"CSV" is an ambiguous term. This library uses the same definition as Excel. In addition, it provides a row-level (as well as cell-level) API and provides "normalized" CSV output (e.g. input of `this"iscell1,"thisis,"cell2` becomes `"this""iscell1","thisis,cell2"`). Each of these three objectives (Excel compatibility, row-level API and normalized output) has a measurable performance impact; conversely, it is possible to achieve much faster parsing speeds (as a number of other CSV parsers do) if any of these requirements (especially Excel compatibility) are dropped.
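To make the "normalized" output idea concrete, here is a minimal sketch in plain C of how an already-parsed cell value could be written back out. It assumes the common convention that a cell is quoted only when it contains a comma, a double quote or a line break, and that embedded quotes are doubled. `normalize_cell` is a hypothetical helper for illustration; it is not part of the zsv API and may differ from what zsv does internally.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a zsv function): write `value` into `out` as a
 * normalized CSV cell. Returns 0 on success, or -1 if `out` might be too
 * small to hold the worst-case result. */
static int normalize_cell(const char *value, char *out, size_t out_size) {
  size_t len = strlen(value);
  size_t n = 0;
  /* Worst case: every byte is a quote (doubled), plus two enclosing quotes and a NUL. */
  if (out_size < 2 * len + 3)
    return -1;
  /* Assumption: quote only when the value contains a delimiter, quote or line break. */
  int needs_quotes = strpbrk(value, ",\"\r\n") != NULL;
  if (needs_quotes)
    out[n++] = '"';
  for (size_t i = 0; i < len; i++) {
    if (value[i] == '"')
      out[n++] = '"'; /* escape an embedded quote by doubling it */
    out[n++] = value[i];
  }
  if (needs_quotes)
    out[n++] = '"';
  out[n] = '\0';
  return 0;
}
```

With the two parsed cell values from the example above, `this"iscell1` and `thisis,cell2`, this helper produces `"this""iscell1"` and `"thisis,cell2"`, while a value such as `plain` is left unquoted.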
`zsv` is an extensible CSV utility, which uses zsvlib, for tasks such as slicing and dicing, querying with SQL, combining, serializing, flattening, converting between CSV/JSON/sqlite3 and more.
`zsv` is streamlined for easy development of custom dynamic extensions.
zsvlib and `zsv` are written in C, but since zsvlib is a library, and `zsv` extensions are just shared libraries, you can extend `zsv` with your own code in any programming language, so long as it has been compiled into a shared library that implements the expected interface.
- Available as BOTH a library and an application (coming soon: standalone zsvutil library for common helper functions such as csv writer)
- Open-source, permissively licensed
- Handles real-world CSV the same way that spreadsheet programs do (including edge cases). Gracefully handles (and can "clean") real-world data that may be "dirty".
- Runs on macOS (tested on clang/gcc), Linux (gcc), Windows (mingw), BSD (gcc-only) and in-browser (emscripten/wasm)
- Fastest (at least, vs all alternatives and on all platforms we've benchmarked where 256-bit SIMD operations are available). See app/benchmark/README.md
- Low memory usage (regardless of how big your data is) and size footprint for both lib (~20k) and CLI executable (< 1MB)
- Handles general delimited data (e.g. pipe-delimited) and fixed-width input (with specified widths or auto-detected widths)
- Handles multi-row headers
- Handles input from any stream, including caller-defined streams accessed via a single caller-defined `fread`-like function
- Easy to use as a library in a few lines of code, via either pull or push parsing
- Includes the `zsv` CLI with the following built-in commands:
  - `sheet`, an in-console interactive and extendable grid viewer
  - `select`, `count`, `sql` query, `desc`ribe, `flatten`, `serialize`, `2json`, `2db`, `stack`, `pretty`, `2tsv`, `paste`, `compare`, `overwrite`, `jq`, `prop`, `rm`
  - easily convert between CSV/JSON/sqlite3
  - compare multiple files
  - overwrite cells in files
- CLI is easy to extend/customize with a few lines of code via modular plug-in framework. Just write a few custom functions and compile into a distributable DLL that any existing zsv installation can use.
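The feature list above mentions input from caller-defined streams accessed through a single `fread`-like function. The sketch below illustrates that general pattern in plain C: a consumer that knows only a function pointer with `fread`'s shape and an opaque stream pointer. All names here (`read_fn`, `mem_stream`, `mem_read`, `count_newlines`) are hypothetical and chosen for illustration; they are not zsv's actual declarations, so consult the zsv headers for the real option names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type with the same shape as fread(). */
typedef size_t (*read_fn)(void *buf, size_t size, size_t nmemb, void *stream);

/* Hypothetical caller-defined stream: a read cursor over an in-memory buffer. */
struct mem_stream {
  const char *data;
  size_t len;
  size_t pos;
};

/* fread-like reader: copies up to nmemb items of `size` bytes each and
 * returns the number of whole items copied (0 once the data is exhausted). */
static size_t mem_read(void *buf, size_t size, size_t nmemb, void *stream) {
  struct mem_stream *s = stream;
  if (size == 0)
    return 0;
  size_t avail = (s->len - s->pos) / size;
  size_t items = nmemb < avail ? nmemb : avail;
  memcpy(buf, s->data + s->pos, items * size);
  s->pos += items * size;
  return items;
}

/* A consumer that depends only on the callback and the opaque stream.
 * It pulls the input in small chunks and counts line feeds. */
static size_t count_newlines(read_fn reader, void *stream) {
  char chunk[8];
  size_t n, count = 0;
  while ((n = reader(chunk, 1, sizeof chunk, stream)) > 0)
    for (size_t i = 0; i < n; i++)
      if (chunk[i] == '\n')
        count++;
  return count;
}
```

Because the consumer never touches a `FILE *`, the same loop works for sockets, decompressors or test fixtures, as long as the caller supplies a matching reader function.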
Download pre-built binaries and packages for macOS, Windows, Linux and BSD from the Releases page.
You can also download pre-built binaries and packages from Actions for the latest commits and PRs, but these are retained only for a limited number of days.
Important
For the musl libc static build, dynamic extensions are not supported!
Note
After `v0.3.9-alpha`, all package artifacts will be properly attested. To verify, you can use GitHub CLI like this:
    gh attestation verify <downloaded-artifact> --repo liquidaty/zsv
...via Homebrew:
    brew tap liquidaty/zsv
    brew install zsv
...via MacPorts:
    sudo port install zsv
For Linux (Debian/Ubuntu - `*.deb`):
    # Install
    sudo apt install ./zsv-amd64-linux-gcc.deb
    # Uninstall
    sudo apt remove zsv
For Linux (RHEL/CentOS - `*.rpm`):
    # Install
    sudo yum install ./zsv-amd64-linux-gcc.rpm
    # Uninstall
    sudo yum remove zsv
For Windows (`*.nupkg`), install with `nuget.exe`:
    # Install via nuget custom feed (requires absolute paths)
    md nuget-feed
    nuget.exe add zsv .\<path>\zsv-amd64-windows-mingw.nupkg -source <path>/nuget-feed
    nuget.exe install zsv -version <version> -source <path>/nuget-feed
    # Uninstall
    nuget.exe delete zsv <version> -source <path>/nuget-feed
For Windows (`*.nupkg`), install with `choco.exe`:
    # Install
    choco.exe install zsv --pre -source <directory containing .nupkg file>
    # Uninstall
    choco.exe uninstall zsv
The zsv parser library is available for node:
    npm install zsv-lib
Please note:
- This package is still in alpha and currently only exposes a small subset of the zsv library capabilities. More to come!
- The CLI is not yet available as a Node package
- If you'd like to use additional parser features, or use the CLI as a Node package, please feel free to post a request in an issue here.
`zsv` CLI is also available as a container image from Packages.
The container image is published on every release. In addition to the specific release tag, the image is also tagged as `latest`, i.e. `zsv:latest` always points to the latest released version.
Example:
    $ docker pull ghcr.io/liquidaty/zsv
    # ...
    $ cat worldcitiespop_mil.csv | docker run -i ghcr.io/liquidaty/zsv count
    1000000
For image details, see Dockerfile. You may use this as a baseline for your own use cases as needed.
In a GitHub Actions workflow, you can use `zsv/setup-action` to set up zsv+zsvlib:
    - name: Set up zsv+zsvlib
      uses: liquidaty/zsv/setup-action@main
See zsv/setup-action/README for more details.
See BUILD.md for more details.
Our objectives, which we were unable to find in a pre-existing project, are:
- Reasonably high performance
- Runs on any platform, including web assembly
- Available as both a library and a standalone executable / command-line interface utility (CLI)
- Memory-efficient, configurable resource limits
- Handles real-world CSV cases the same way that Excel does, including all edge cases (quote handling, newline handling (either `\n` or `\r`), embedded newlines, abnormal quoting e.g. aaa"aaa,bbb...)
- Handles other "dirty" data issues:
  - Assumes valid UTF8, but does not misbehave if input contains bad UTF8
  - Option to specify multi-row headers
  - Does not assume or stop working in case of inconsistent numbers of columns
- Easy to use library or extend/customize CLI
There are several excellent tools that achieve high performance. Among those we considered were xsv and tsv-utils. While they met our performance objective, both were designed primarily as a utility and not a library, and were not easy enough, for our needs, to customize and/or to support modular customizations that could be maintained (or licensed) independently of the related project. In addition, they were written in Rust and D, respectively, which happen to be languages with which we lacked deep experience, especially for WebAssembly targeting.
Others we considered were Miller (`mlr`), `csvkit` and Go (csv module), which did not meet our performance objective. We also considered various other libraries using SIMD for CSV parsing, but none that we tried met the "real-world CSV" objective.
Hence, zsv was created as a library and a versatile application, both optimized for speed and ease of development for extending and/or customizing to your needs.
`zsv` comes with several built-in commands:
- `sheet`: an in-console, interactive grid viewer
- `echo`: read CSV from stdin and write it back out to stdout. This is mostly useful for demonstrating how to use the API and also how to create a plug-in, and has several uses beyond that including adding/removing BOM, cleaning up bad UTF8, whitespace or blank column trimming, limiting output to a contiguous data block, skipping leading garbage, and even providing substitution values without modifying the underlying source
- `select`: re-shape CSV by skipping leading garbage, combining header rows into a single header, selecting or excluding specified columns, removing duplicate columns, sampling, converting from fixed-width input, searching and more
- `sql`: treat one or more CSV files like database tables and query with SQL
- `desc`: provide a quick description of your table data
- `pretty`: format for console (fixed-width) display, or convert to markdown format
- `2json`: convert CSV to JSON. Optionally, output in database schema
- `2tsv`: convert to TSV (tab-delimited) format
- `compare`: compare two or more tables of data and output the differences
- `paste` (alpha): horizontally paste two tables together (given inputs X and Y, output 1...N rows where each row contains the entire corresponding row in X followed by the entire corresponding row in Y)
- `serialize` (inverse of flatten): convert an NxM table to a single 3x (Nx(M-1)) table with columns: Row, Column Name, Column Value
- `flatten` (inverse of serialize): flatten a table by combining rows that share a common value in a specified identifier column
- `stack`: merge CSV files vertically
- `jq`: run a `jq` filter
- `2db`: convert from JSON to sqlite3 db
- `overwrite`: overwrite a cell value; changes will be reflected in any zsv command when the `--apply-overwrites` option is specified
- `prop`: view or save parsing options associated with a file, such as initial rows to ignore, or header row span. Saved options are applied by default when processing that file.
Each of these can also be built as an independent executable named `zsv_xxx` where `xxx` is the command name.
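As an illustration of the reshaping that `serialize` is described as performing, the sketch below turns an N x M table into N x (M-1) lines of Row, Column Name, Column Value. It is a plain C approximation, not zsv's implementation: `append` and `serialize_table` are hypothetical helpers, the first column is assumed to act as the row identifier, and cell values are assumed to need no CSV quoting.

```c
#include <stddef.h>
#include <string.h>

/* Append string s to buf (capacity cap, current length *len), keeping the
 * result NUL-terminated. Returns 0 on success, -1 if s does not fit. */
static int append(char *buf, size_t cap, size_t *len, const char *s) {
  size_t n = strlen(s);
  if (*len + n + 1 > cap)
    return -1;
  memcpy(buf + *len, s, n + 1);
  *len += n;
  return 0;
}

/* Hypothetical helper (not zsv code). `header` holds ncols column names and
 * `cells` holds nrows * ncols values in row-major order. Writes one output
 * line per (row, non-identifier column) pair into buf.
 * Returns the number of data lines written, or -1 if buf is too small. */
static long serialize_table(const char *const *header, const char *const *cells,
                            size_t nrows, size_t ncols, char *buf, size_t cap) {
  size_t len = 0;
  long written = 0;
  if (cap == 0)
    return -1;
  buf[0] = '\0';
  if (append(buf, cap, &len, "Row,Column Name,Column Value\n"))
    return -1;
  for (size_t r = 0; r < nrows; r++) {
    const char *const *row = cells + r * ncols;
    for (size_t c = 1; c < ncols; c++) { /* column 0 is the row identifier */
      if (append(buf, cap, &len, row[0]) || append(buf, cap, &len, ",") ||
          append(buf, cap, &len, header[c]) || append(buf, cap, &len, ",") ||
          append(buf, cap, &len, row[c]) || append(buf, cap, &len, "\n"))
        return -1;
      written++;
    }
  }
  return written;
}
```

For a table with header `id,name,pop` and two data rows, this yields four data lines (two rows times two non-identifier columns), which matches the N x (M-1) row count in the description of `serialize`.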
After installing, run `zsv help` to see usage details. The typical syntax is `zsv <command> <parameters>` e.g.
    zsv sql my_population_data.csv "select * from data where population > 100000"
Simple API usage examples include:
Pull parsing:
    zsv_parser parser = zsv_new(...);
    while (zsv_next_row(parser) == zsv_status_row) { // for each row
      // ...
      size_t cell_count = zsv_cell_count(parser);
      for (size_t i = 0; i < cell_count; i++) { // for each cell
        struct zsv_cell c = zsv_get_cell(parser, i);
        // "%.*s" expects an int precision, so cast the cell length
        fprintf(stderr, "Cell: %.*s\n", (int)c.len, c.str);
        // ...
      }
    }
Push parsing:
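    static void my_row_handler(void *ctx) {
      zsv_parser p = ctx; // the context was set to the parser itself in main()
      size_t cell_count = zsv_cell_count(p);
      for (size_t i = 0; i < cell_count; i++) { // for each cell
        // ...
      }
    }
    int main() {
      zsv_parser p = zsv_new(NULL);
      zsv_set_row_handler(p, my_row_handler);
      zsv_set_context(p, p);
      // parse chunk by chunk; the row handler is called for each row
      while (zsv_parse_more(p) == zsv_status_ok)
        ;
      return 0;
    }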
Full application code examples can be found at examples/lib/README.md.
An example of using the API, compiled to wasm and called via JavaScript, is in examples/js/README.md.
For more sophisticated (but at this time, only sporadically commented/documented) use cases, see the various CLI C source files in the `app` directory such as `app/serialize.c`.
You can extend `zsv` by providing a pre-compiled shared or static library that defines the functions specified in `extension_template.h` and which `zsv` loads in one of three ways:
- as a static library that is statically linked at compile time
- as a dynamic library that is linked at compile time and located in any library search path
- as a dynamic library that is located in the same folder as the `zsv` executable and loaded at runtime if/as/when the custom mode is invoked
You can build and run a sample extension by running `make test` from `app/ext_example`.
The easiest way to implement your own extension is to copy and customize the template files in `app/ext_template`.
This release does not yet implement the full range of core features that are planned for implementation prior to beta release. If you are interested in helping, please post an issue.
- optimize search; add search with hyperscan or re2 regex matching, possibly parallelize?
- optional OpenMP or other multi-threading for row processing
- auto-generated documentation, and better documentation in general
- Additional benchmarking. Would be great to use https://bitbucket.org/ewanhiggs/csv-game/src/master/ as a springboard to benchmarking a number of various tasks
- encoding conversion e.g. UTF16 to UTF8