# hck

A sharp `cut(1)` clone.

`hck` is a shortening of `hack`, a rougher form of `cut`.
A close-to-drop-in replacement for `cut` that can use a regex delimiter instead of a fixed string. Additionally, this tool allows specification of the order of the output columns using the same column selection syntax as `cut` (see below for examples).
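For instance, splitting on runs of whitespace, which needs a `tr` pre-pass with `cut`, works out of the box here. A quick illustration (the input is made up for this snippet):

```
# cut needs the space runs squeezed first; hck's default '\s+' delimiter does not
❯ printf 'a  b   c\n' | tr -s ' ' | cut -d' ' -f2
b
❯ printf 'a  b   c\n' | hck -f2
b
```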
No single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv`, or other such tools. Where `hck` excels is in making common things easy, such as reordering output fields or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets. Think of it as filling a gap between `cut` and `awk`.
`hck` is dual-licensed under MIT or the UNLICENSE.

## Features
- Reordering of output columns! i.e. if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`
- Delimiter treated as a regex, i.e. you can split on multiple spaces without an extra pipe to `tr`!
- Specification of the output delimiter
- Selection of columns by header string literal with the `-F` option, or by regex by setting the `-r` flag
- Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep). See Decompression below.
- Output can be gzip compressed using the multi-threaded compressors from `gzp` with the `-Z` flag
  - This gzipped output is in BGZF format and can be indexed and queried with `tabix` (see the sketch after this list)
- Exclude fields by index or by header.
- Speed
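As a sketch of that `-Z`-plus-`tabix` workflow (the file names here are hypothetical, and the data must already be coordinate-sorted for indexing to be meaningful):

```
# select chrom/start/end columns from a sorted CSV; -Z writes BGZF-compressed output
❯ hck -d, -f1,2,3 ./variants.csv -Z > variants.tsv.gz
# index and query with tabix (-s/-b/-e name the chrom/begin/end columns)
❯ tabix -s1 -b2 -e2 ./variants.tsv.gz
❯ tabix ./variants.tsv.gz chr1:1000-2000
```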
## Non-goals

- `hck` does not aim to be a complete CSV / TSV parser à la `xsv`, which respects quoting rules. It acts like `cut` in that it will split on the delimiter no matter where in the line it is (illustrated below).
- Delimiters cannot contain newlines... well, they can, they will just never be seen. `hck` will always be a line-by-line tool where newlines are the standard `\n` or `\r\n`.
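A hypothetical illustration of the quoting caveat: `hck`, like `cut`, splits inside quoted fields, where a CSV-aware tool such as `xsv` would not.

```
❯ printf 'name,desc\nfoo,"a, b"\n' | hck -d, -f2
desc
"a
```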
## Install

- Homebrew / Linuxbrew

```bash
brew tap sstadick/hck
brew install hck
```

- Conda

```bash
# Note, this version lags by about a day
conda install -c conda-forge hck
```

- MacPorts

```bash
# Note, version may lag behind latest
sudo port selfupdate
sudo port install hck
```

- Debian (Ubuntu)

```bash
curl -LO https://github.com/sstadick/hck/releases/download/<latest>/hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb
```
\* Built with profile guided optimizations

- With the Rust toolchain:

```bash
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
```

- From the releases page (the binaries have been built with profile guided optimizations)
Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native cpu features:

```bash
# Assumes you are on stable rust
# NOTE: this won't work on windows, see CI for linked issue
cargo install just
git clone https://github.com/sstadick/hck
cd hck
just install-native
```
- PRs are both welcome and encouraged for adding more packaging options and build types! I'd especially welcome PRs for the Windows family of package managers and, in general, for making sure things are Windows friendly.
## Examples

Splitting on a literal space and re-selecting columns:

```
❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4
# 🪓 hck
<p align="center">
<a src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build Status"></a>
```

Splitting with the default delimiter:

```
# note, '\s+' is the default
❯ ps aux | hck -f1-3,5- | head -n4
USER PID %CPU VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
root 2 0.0 0 0 ? S Jun21 0:00 [kthreadd]
root 3 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

Reordering output columns:

```
❯ ps aux | hck -f2,1,3- | head -n4
PID USER %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1 root 0.0 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
2 root 0.0 0.0 0 0 ? S Jun21 0:00 [kthreadd]
3 root 0.0 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

Excluding output columns by index:

```
❯ ps aux | hck -e3,5 | head -n4
USER PID %MEM RSS TTY STAT START TIME COMMAND
root 1 0.0 14408 ? Ss Jun21 0:27 /sbin/init splash
root 2 0.0 0 ? S Jun21 0:01 [kthreadd]
root 3 0.0 0 ? I< Jun21 0:00 [rcu_gp]
```

Excluding output columns by header regex:

```
❯ ps aux | hck -r -E"CPU" -E"^ST.*" | head -n4
USER PID %MEM VSZ RSS TTY TIME COMMAND
root 1 0.0 170224 14408 ? 0:27 /sbin/init splash
root 2 0.0 0 0 ? 0:01 [kthreadd]
root 3 0.0 0 0 ? 0:00 [rcu_gp]
```

Changing the output delimiter:

```
❯ ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0
```

Selecting columns by header regex:

```
# Note the order matches the order of the -F args
❯ ps aux | hck -r -F'^ST.*' -F'^USER$' | head -n4
STAT START USER
Ss Jun21 root
S Jun21 root
I< Jun21 root
```

Reading a compressed file:

```
❯ gzip ./README.md
❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
# 🪓 hck
<p align="center">
<a src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build Status"></a>
```

Splitting on a string literal vs. a regex:

```
# with string literal
❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ hck -Ld'$;$' -f3,4 ./test.txt
a test
3 four

# with an interesting regex
❯ printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
❯ hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a test
3 four
```
Mixing by-index and by-header selections requires some explaining. Basically, by-index and by-header selections each have their own "order", and then the orders are merged. For example:
```
❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F'b' -F'a'
b:c:a
2:3:1
```
In the by-index group, we've specified column 3 to be in output position 0. In the by-header group, we've specified that column `b` be in position 0. The by-index and by-header selections are then merged, and if two outputs are specified for the same output position, they are emitted in input order (input meaning the order of the columns in the input data).
This can lead to unexpected outcomes, such as the following example, where `a` now comes first in the output compared to the example above.
```
❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F'a'
a:c
1:3
```
Takeaway: be careful when a specific output order is desired and you are mixing and matching by-index and by-header field selections.
## Benchmarks

This set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. They are meant to capture real-world usage, so in the multi-space-delimiter benchmark for `gcut`, for example, we use `tr` to squeeze the space runs down to a single space and then pipe to `gcut`.
Note this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of the performance of different ways of accomplishing the same tasks.
Hardware: Ubuntu 20, AMD Ryzen 9 3950X 16-Core Processor, 64 GB DDR4 memory, 1 TB NVMe drive.

The all_train.csv data is used. This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter and using `\s\s\s` (a run of spaces) as the delimiter.
PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.
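The tables below follow hyperfine's markdown-export format; a sketch of reproducing a row or two, assuming `hyperfine` is installed and the benchmark data file is in place:

```bash
hyperfine --warmup 3 --export-markdown results.md \
    'hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null' \
    'cut -d, -f1,8,19 ./hyper_data.txt > /dev/null'
```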
Tools compared:

- `cut`
- `mawk`
- `xsv`
  - https://github.com/BurntSushi/xsv
  - v0.13.0 (compiled locally with optimizations)
- `tsv-utils`
  - https://github.com/eBay/tsv-utils
  - v2.2.0 (ldc2, compiled locally with optimizations)
- `choose`
  - https://github.com/theryangeary/choose
  - v1.3.2 (compiled locally with optimizations)
### Single character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.198 ± 0.015 | 1.185 | 1.215 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.349 ± 0.029 | 1.320 | 1.389 | 1.13 ± 0.03 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.649 ± 0.023 | 1.624 | 1.673 | 1.38 ± 0.03 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.869 ± 0.019 | 1.842 | 1.894 | 1.56 ± 0.02 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.702 ± 0.021 | 1.687 | 1.734 | 1.42 ± 0.02 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.285 ± 0.092 | 4.214 | 4.428 | 3.58 ± 0.09 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.693 ± 0.042 | 5.635 | 5.745 | 4.75 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 4.993 ± 0.029 | 4.959 | 5.030 | 4.17 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.541 ± 1.250 | 6.827 | 9.769 | 6.30 ± 1.05 |
### Multi-character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 1.718 ± 0.003 | 1.715 | 1.722 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.191 ± 0.072 | 2.135 | 2.291 | 1.28 ± 0.04 |
| `hck -d' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.180 ± 0.029 | 2.135 | 2.208 | 1.27 ± 0.02 |
| `hck -d' ' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.542 ± 0.014 | 2.529 | 2.565 | 1.48 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.597 ± 0.023 | 8.575 | 8.631 | 5.00 ± 0.02 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.890 ± 0.013 | 8.871 | 8.903 | 5.17 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.014 ± 0.247 | 9.844 | 10.449 | 5.83 ± 0.14 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.173 ± 0.035 | 10.111 | 10.193 | 5.92 ± 0.02 |
| `choose -f ' ' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 6.537 ± 0.148 | 6.452 | 6.799 | 3.80 ± 0.09 |
| `choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 10.656 ± 0.219 | 10.484 | 10.920 | 6.20 ± 0.13 |
| `choose -f '\s' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 37.238 ± 0.153 | 37.007 | 37.383 | 21.67 ± 0.10 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.673 ± 0.064 | 6.595 | 6.734 | 3.88 ± 0.04 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 5.947 ± 0.098 | 5.896 | 6.121 | 3.46 ± 0.06 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.080 ± 0.215 | 10.881 | 11.376 | 6.45 ± 0.13 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.471 ± 0.066 | 7.397 | 7.561 | 4.35 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.172 ± 0.068 | 6.071 | 6.235 | 3.59 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.171 ± 0.112 | 5.975 | 6.243 | 3.59 ± 0.07 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.202 ± 0.130 | 5.984 | 6.290 | 3.61 ± 0.08 |
## Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:
| Extension | Binary | Type |
|---|---|---|
| `*.gz` | Native | gzip |
| `*.tgz` | `gzip -d -c` | gzip |
| `*.bz2` | `bzip2 -d -c` | bzip2 |
| `*.tbz2` | `bzip2 -d -c` | bzip2 |
| `*.xz` | `xz -d -c` | xz |
| `*.txz` | `xz -d -c` | xz |
| `*.lz4` | `lz4 -d -c` | lz4 |
| `*.lzma` | `xz --format=lzma -d -c` | lzma |
| `*.br` | `brotli -d -c` | brotli |
| `*.zst` | `zstd -d -c` | zstd |
| `*.zstd` | `zstd -q -d -c` | zstd |
| `*.Z` | `uncompress -c` | uncompress |
When a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from that tool's output. If the binary can't be found, then `hck` will try to read the compressed file as is. See `grep_cli` for the source code. The end goal is to add a preprocessor similar to ripgrep's. Where there are multiple binaries for a given type, they are tried in the order listed above.
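For example, a small round trip through `zstd` (a hypothetical file; assumes a `zstd` binary is on your `PATH`):

```
❯ printf 'a\tb\tc\n1\t2\t3\n' > data.tsv
❯ zstd --rm data.tsv
# hck spots the .zst extension and shells out to `zstd -d -c`
❯ hck -z -f1,3 ./data.tsv.zst
a c
1 3
```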
## Profile Guided Optimization

See the `pgo*.sh` scripts for how to build `hck` with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; e.g. on macOS it seems to have a larger effect, and on the regex codepath it also seems to have a greater effect.
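For reference, the general shape of a Rust PGO build looks like the following. This is a minimal sketch with illustrative paths and training input; the `pgo*.sh` scripts are the authoritative recipe.

```bash
# 1. instrument, 2. run representative workloads, 3. merge profiles, 4. rebuild
rustup component add llvm-tools-preview
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
./target/release/hck -Ld, -f1,8,19 ./training_data.csv > /dev/null
# llvm-profdata ships with llvm-tools-preview (look under `rustc --print sysroot`)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```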
## TODO

- Add output compression detection when writing to a file
- Don't reparse fields / headers for each new file
- Figure out how to better reuse / share a vec
- Support indexing from the end (unlikely though)
- Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of `hck`)
- Move tests from main to core
- Add more tests all around
- Experiment with a parallel parser as described here; this should be very doable given we don't care about escaping quotes and such.
## References

- https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml