# hck

A sharp `cut(1)` clone.

`hck` is a shortening of `hack`, a rougher form of `cut`.
A close-to-drop-in replacement for `cut` that can use a regex delimiter instead of a fixed string. Additionally, this tool allows specification of the order of the output columns using the same column selection syntax as `cut` (see below for examples).
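For instance, splitting on runs of whitespace, which needs a `tr` pre-pass with `cut`, works out of the box here. A quick illustration (the input is made up for this snippet):

```
# cut needs the space runs squeezed first; hck's default '\s+' delimiter does not
❯ printf 'a  b   c\n' | tr -s ' ' | cut -d' ' -f2
b
❯ printf 'a  b   c\n' | hck -f2
b
```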
No single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv`, or other such tools. Where `hck` excels is in making common things easy, such as reordering output fields or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets. Think of it as filling a gap between `cut` and `awk`.
`hck` is dual-licensed under MIT or the UNLICENSE.

## Features
- Reordering of output columns! i.e. if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`
- Delimiter treated as a regex, i.e. you can split on multiple spaces without an extra pipe to `tr`!
- Specification of the output delimiter
- Selection of columns by header string literal with the `-F` option, or by regex by setting the `-r` flag
- Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep). See Decompression below.
- Output can be gzip compressed using the multi-threaded compressors from `gzp` with the `-Z` flag
  - This gzipped output is in BGZF format and can be indexed and queried with `tabix` (see the sketch after this list)
- Exclude fields by index or by header.
- Speed
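As a sketch of that `-Z`-plus-`tabix` workflow (the file names here are hypothetical, and the data must already be coordinate-sorted for indexing to be meaningful):

```
# select chrom/start/end columns from a sorted CSV; -Z writes BGZF-compressed output
❯ hck -d, -f1,2,3 ./variants.csv -Z > variants.tsv.gz
# index and query with tabix (-s/-b/-e name the chrom/begin/end columns)
❯ tabix -s1 -b2 -e2 ./variants.tsv.gz
❯ tabix ./variants.tsv.gz chr1:1000-2000
```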
## Non-goals

- `hck` does not aim to be a complete CSV / TSV parser à la `xsv`, which respects quoting rules. It acts like `cut` in that it will split on the delimiter no matter where in the line it is (illustrated below).
- Delimiters cannot contain newlines... well, they can, they will just never be seen. `hck` will always be a line-by-line tool where newlines are the standard `\n` or `\r\n`.
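A hypothetical illustration of the quoting caveat: `hck`, like `cut`, splits inside quoted fields, where a CSV-aware tool such as `xsv` would not.

```
❯ printf 'name,desc\nfoo,"a, b"\n' | hck -d, -f2
desc
"a
```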
## Install

- Homebrew / Linuxbrew

```bash
brew tap sstadick/hck
brew install hck
```

- Conda

```bash
# Note, this version lags by about a day
conda install -c conda-forge hck
```

- MacPorts

```bash
# Note, version may lag behind latest
sudo port selfupdate
sudo port install hck
```

- Debian (Ubuntu)

```bash
curl -LO https://github.com/sstadick/hck/releases/download/<latest>/hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb
```
\* Built with profile guided optimizations

- With the Rust toolchain:

```bash
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
```

- From the releases page (the binaries have been built with profile guided optimizations)
Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native cpu features:

```bash
# Assumes you are on stable rust
# NOTE: this won't work on windows, see CI for linked issue
cargo install just
git clone https://github.com/sstadick/hck
cd hck
just install-native
```
- PRs are both welcome and encouraged for adding more packaging options and build types! I'd especially welcome PRs for the Windows family of package managers and, in general, for making sure things are Windows friendly.
## Examples

Splitting on a literal space and re-selecting columns:

```
❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4
# 🪓 hck
<p align="center">
<a src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build Status"></a>
```

Splitting with the default delimiter:

```
# note, '\s+' is the default
❯ ps aux | hck -f1-3,5- | head -n4
USER PID %CPU VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
root 2 0.0 0 0 ? S Jun21 0:00 [kthreadd]
root 3 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

Reordering output columns:

```
❯ ps aux | hck -f2,1,3- | head -n4
PID USER %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1 root 0.0 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
2 root 0.0 0.0 0 0 ? S Jun21 0:00 [kthreadd]
3 root 0.0 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

Excluding output columns by index:

```
❯ ps aux | hck -e3,5 | head -n4
USER PID %MEM RSS TTY STAT START TIME COMMAND
root 1 0.0 14408 ? Ss Jun21 0:27 /sbin/init splash
root 2 0.0 0 ? S Jun21 0:01 [kthreadd]
root 3 0.0 0 ? I< Jun21 0:00 [rcu_gp]
```

Excluding output columns by header regex:

```
❯ ps aux | hck -r -E"CPU" -E"^ST.*" | head -n4
USER PID %MEM VSZ RSS TTY TIME COMMAND
root 1 0.0 170224 14408 ? 0:27 /sbin/init splash
root 2 0.0 0 0 ? 0:01 [kthreadd]
root 3 0.0 0 0 ? 0:00 [rcu_gp]
```

Changing the output delimiter:

```
❯ ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0
```

Selecting columns by header regex:

```
# Note the order matches the order of the -F args
❯ ps aux | hck -r -F'^ST.*' -F'^USER$' | head -n4
STAT START USER
Ss Jun21 root
S Jun21 root
I< Jun21 root
```

Reading a compressed file:

```
❯ gzip ./README.md
❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
# 🪓 hck
<p align="center">
<a src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build Status"></a>
```

Splitting on a string literal vs. a regex:

```
# with string literal
❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ hck -Ld'$;$' -f3,4 ./test.txt
a test
3 four

# with an interesting regex
❯ printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
❯ hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a test
3 four
```
Mixing by-index and by-header selections requires some explaining. Basically, by-index and by-header selections each have their own "order", and then the orders are merged. For example:
```
❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F'b' -F'a'
b:c:a
2:3:1
```
In the by-index group, we've specified column 3 to be in output position 0. In the by-header group, we've specified that column `b` be in position 0. The by-index and by-header selections are then merged, and if two outputs are specified for the same output position, they are emitted in input order (input meaning the order of the columns in the input data).
This can lead to unexpected outcomes, such as the following example, where `a` now comes first in the output compared to the example above.
```
❯ printf 'a,b,c,d,e\n1,2,3,4,5\n' | hck -d, -D: -f3 -F'a'
a:c
1:3
```
Takeaway: be careful when a specific output order is desired and you are mixing and matching by-index and by-header field selections.
## Benchmarks

This set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. They are meant to capture real-world usage, so in the multi-space-delimiter benchmark for `gcut`, for example, we use `tr` to squeeze the space runs down to a single space and then pipe to `gcut`.
Note this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of the performance of different ways of accomplishing the same tasks.
Hardware: Ubuntu 20, AMD Ryzen 9 3950X 16-Core Processor, 64 GB DDR4 memory, 1 TB NVMe drive.

The all_train.csv data is used. This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter and using `\s\s\s` (a run of spaces) as the delimiter.
PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.
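The tables below follow hyperfine's markdown-export format; a sketch of reproducing a row or two, assuming `hyperfine` is installed and the benchmark data file is in place:

```bash
hyperfine --warmup 3 --export-markdown results.md \
    'hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null' \
    'cut -d, -f1,8,19 ./hyper_data.txt > /dev/null'
```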
Tools compared:

- `cut`
- `mawk`
- `xsv`
  - https://github.com/BurntSushi/xsv
  - v0.13.0 (compiled locally with optimizations)
- `tsv-utils`
  - https://github.com/eBay/tsv-utils
  - v2.2.0 (ldc2, compiled locally with optimizations)
- `choose`
  - https://github.com/theryangeary/choose
  - v1.3.2 (compiled locally with optimizations)
### Single character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.198 ± 0.015 | 1.185 | 1.215 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.349 ± 0.029 | 1.320 | 1.389 | 1.13 ± 0.03 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.649 ± 0.023 | 1.624 | 1.673 | 1.38 ± 0.03 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.869 ± 0.019 | 1.842 | 1.894 | 1.56 ± 0.02 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.702 ± 0.021 | 1.687 | 1.734 | 1.42 ± 0.02 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.285 ± 0.092 | 4.214 | 4.428 | 3.58 ± 0.09 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.693 ± 0.042 | 5.635 | 5.745 | 4.75 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 4.993 ± 0.029 | 4.959 | 5.030 | 4.17 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.541 ± 1.250 | 6.827 | 9.769 | 6.30 ± 1.05 |
### Multi-character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 1.718 ± 0.003 | 1.715 | 1.722 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.191 ± 0.072 | 2.135 | 2.291 | 1.28 ± 0.04 |
| `hck -d' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.180 ± 0.029 | 2.135 | 2.208 | 1.27 ± 0.02 |
| `hck -d' ' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.542 ± 0.014 | 2.529 | 2.565 | 1.48 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.597 ± 0.023 | 8.575 | 8.631 | 5.00 ± 0.02 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 8.890 ± 0.013 | 8.871 | 8.903 | 5.17 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.014 ± 0.247 | 9.844 | 10.449 | 5.83 ± 0.14 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.173 ± 0.035 | 10.111 | 10.193 | 5.92 ± 0.02 |
| `choose -f ' ' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 6.537 ± 0.148 | 6.452 | 6.799 | 3.80 ± 0.09 |
| `choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 10.656 ± 0.219 | 10.484 | 10.920 | 6.20 ± 0.13 |
| `choose -f '\s' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null` | 37.238 ± 0.153 | 37.007 | 37.383 | 21.67 ± 0.10 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.673 ± 0.064 | 6.595 | 6.734 | 3.88 ± 0.04 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 5.947 ± 0.098 | 5.896 | 6.121 | 3.46 ± 0.06 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.080 ± 0.215 | 10.881 | 11.376 | 6.45 ± 0.13 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.471 ± 0.066 | 7.397 | 7.561 | 4.35 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.172 ± 0.068 | 6.071 | 6.235 | 3.59 ± 0.04 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.171 ± 0.112 | 5.975 | 6.243 | 3.59 ± 0.07 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.202 ± 0.130 | 5.984 | 6.290 | 3.61 ± 0.08 |
## Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:
| Extension | Binary | Type |
|---|---|---|
| `*.gz` | Native | gzip |
| `*.tgz` | `gzip -d -c` | gzip |
| `*.bz2` | `bzip2 -d -c` | bzip2 |
| `*.tbz2` | `bzip2 -d -c` | bzip2 |
| `*.xz` | `xz -d -c` | xz |
| `*.txz` | `xz -d -c` | xz |
| `*.lz4` | `lz4 -d -c` | lz4 |
| `*.lzma` | `xz --format=lzma -d -c` | lzma |
| `*.br` | `brotli -d -c` | brotli |
| `*.zst` | `zstd -d -c` | zstd |
| `*.zstd` | `zstd -q -d -c` | zstd |
| `*.Z` | `uncompress -c` | uncompress |
When a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from that tool's output. If the binary can't be found, then `hck` will try to read the compressed file as is. See `grep_cli` for the source code. The end goal is to add a preprocessor similar to ripgrep's. Where there are multiple binaries for a given type, they are tried in the order listed above.
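For example, a small round trip through `zstd` (a hypothetical file; assumes a `zstd` binary is on your `PATH`):

```
❯ printf 'a\tb\tc\n1\t2\t3\n' > data.tsv
❯ zstd --rm data.tsv
# hck spots the .zst extension and shells out to `zstd -d -c`
❯ hck -z -f1,3 ./data.tsv.zst
a c
1 3
```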
## Profile Guided Optimization

See the `pgo*.sh` scripts for how to build `hck` with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; e.g. on macOS it seems to have a larger effect, and on the regex codepath it also seems to have a greater effect.
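For reference, the general shape of a Rust PGO build looks like the following. This is a minimal sketch with illustrative paths and training input; the `pgo*.sh` scripts are the authoritative recipe.

```bash
# 1. instrument, 2. run representative workloads, 3. merge profiles, 4. rebuild
rustup component add llvm-tools-preview
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
./target/release/hck -Ld, -f1,8,19 ./training_data.csv > /dev/null
# llvm-profdata ships with llvm-tools-preview (look under `rustc --print sysroot`)
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```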
## TODO

- Add output compression detection when writing to a file
- Don't reparse fields / headers for each new file
- Figure out how to better reuse / share a vec
- Support indexing from the end (unlikely though)
- Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of `hck`)
- Move tests from main to core
- Add more tests all around
- Experiment with a parallel parser as described here; this should be very doable given we don't care about escaping quotes and such.
## References

- https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml