Like awk, but with SQL and table joins


Sqawk

Sqawk is an awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite.

An example

Sqawk is invoked like this:

sqawk -foo bar script baz=qux filename

where script is your SQL.

Here is an example of what it can do:

# List all login shells used on the system.
sqawk -ORS '\n' 'select distinct shell from passwd order by shell' FS=: columns=username,password,uid,gid,info,home,shell table=passwd /etc/passwd

or, equivalently,

# Do the same thing.
sqawk 'select distinct a7 from a order by a7' FS=: /etc/passwd

Sqawk lets you be verbose to better document your script but aims to provide good defaults that save you keystrokes in interactive use.

Skip down for more examples.

Installation

Sqawk requires Tcl 8.6 or newer, Tcllib, and SQLite version 3 bindings for Tcl installed.

To install these dependencies on Debian and Ubuntu, run the following command:

sudo apt install tcl tcllib libsqlite3-tcl

On Fedora, RHEL, and CentOS:

sudo dnf install tcl tcllib sqlite-tcl

On FreeBSD with pkgng:

sudo pkg install tcl86 tcllib tcl-sqlite3
sudo ln -s /usr/local/bin/tclsh8.6 /usr/local/bin/tclsh

On Windows 7 or later, install Magicsplat Tcl/Tk for Windows.

On macOS, use MacPorts:

sudo port install tcllib tcl-sqlite3

Once you have the dependencies installed on *nix, run

git clone https://github.com/dbohdan/sqawk
cd sqawk
make
make test
sudo make install

or on Windows,

git clone https://github.com/dbohdan/sqawk
cd sqawk
assemble.cmd
tclsh tests.tcl

Usage

sqawk [globaloptions] script [option=value ...] < filename

sqawk [globaloptions] script [option=value ...] filename1 [[option=value ...] filename2 ...]

One of the filenames can be - for standard input.

SQL

A Sqawk script consists of one or more statements in the SQLite version 3 dialect of SQL.

The default table names are a for the first input file, b for the second, c for the third, and so on. You can change the table name for any file with a file option. The table name is used as the prefix in the column names of the table. By default, the columns are named a1, a2, etc. in table a; b1, b2, etc. in b; and so on. For each record, a0 is the text of the whole record (one line of input with the default awk parser and the default record separator of \n). anr in a, bnr in b, and so on contains the record number and is the primary key of its respective table. anf, bnf, and so on contain the field count for a given record.
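The table layout described above can be modeled with a short Python sketch. This is not Sqawk's implementation (Sqawk is written in Tcl); the function name load_awk_style and the sample input are hypothetical. The sketch only assumes the naming scheme from the paragraph above and the default INTEGER column type mentioned under the datatypes option.

```python
import re
import sqlite3


def load_awk_style(conn, table, text, fs=r"[ \t]+", rs="\n"):
    """Load text into `table` with awk-like splitting (illustrative only)."""
    if text.endswith(rs):
        text = text[: -len(rs)]  # a trailing separator does not start a new record
    rows = [(record, re.split(fs, record)) for record in text.split(rs)]
    nf = max(len(fields) for _, fields in rows)
    p = table  # the column prefix defaults to the table name
    cols = [f"{p}nr INTEGER PRIMARY KEY", f"{p}nf INTEGER", f"{p}0 TEXT"]
    cols += [f"{p}{i} INTEGER" for i in range(1, nf + 1)]
    conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    marks = ", ".join("?" * (3 + nf))
    for nr, (record, fields) in enumerate(rows, start=1):
        padded = fields + [None] * (nf - len(fields))
        conn.execute(
            f"INSERT INTO {table} VALUES ({marks})",
            [nr, len(fields), record, *padded],
        )


conn = sqlite3.connect(":memory:")
load_awk_style(conn, "a", "alice 30\nbob 25 admin\n")
rows = conn.execute(
    "SELECT anr, anf, a0, a1, a2, a3 FROM a ORDER BY anr"
).fetchall()
for row in rows:
    print(row)
```

Each row carries the record number (anr), the field count (anf), the whole record (a0), and then one column per field. Fields a record does not have are stored as NULL in this sketch, which is an assumption.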

Options

Global options

These options affect all files.

-FS value

Example: -FS '[ \t]+'

The input field separator regular expression for the default awk parser (for all files).

-RS value

Example: -RS '\n'

The input record separator regular expression for the default awk parser (for all files).

-OFS value

Example: -OFS ' '

The output field separator string for the default awk serializer.

-ORS value

Example: -ORS '\n'

The output record separator string for the default awk serializer.

-NF value

Example: -NF 10

The maximum number of fields per record. The corresponding number of columns is added to the target table at the start (e.g., a0, a1, a2, ..., a10 for ten fields). Increase this if you run Sqawk with -MNF error and get errors like "table x has no column named x51".

-MNF value

Examples: -MNF expand, -MNF crop, -MNF error

The NF mode. This option tells Sqawk what to do if a record exceeds the maximum number of fields: expand, the default, increases NF automatically and adds columns to the table during import; crop truncates the record to NF fields (that is, the fields for which there aren't enough table columns are omitted); error makes Sqawk quit with an error message like "table x has no column named x11".
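The three NF modes can be summarized in a few lines of Python. This is a sketch of the behavior described above, not Sqawk's code; the function apply_mnf and its return shape are hypothetical.

```python
def apply_mnf(fields, nf, mode="expand"):
    """Return (fields_to_store, new_nf) for one record under an NF mode."""
    if len(fields) <= nf:
        return fields, nf                # the record fits; nothing to do
    if mode == "expand":
        return fields, len(fields)       # the table grows to fit the record
    if mode == "crop":
        return fields[:nf], nf           # fields without a column are dropped
    if mode == "error":
        raise ValueError(f"record has {len(fields)} fields but NF is {nf}")
    raise ValueError(f"unknown MNF mode: {mode}")


record = ["a", "b", "c", "d", "e"]
print(apply_mnf(record, 3, "expand"))
print(apply_mnf(record, 3, "crop"))
```

With expand, later records see the larger NF; with crop, NF never changes.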

-dbfile value

Example: -dbfile test.db

The SQLite database file in which Sqawk will store the parsed data. Defaults to the special filename :memory:, which instructs SQLite to hold the data in RAM only. Using an actual file instead of :memory: is slower but makes it possible to process larger datasets. The database file is opened if it exists and created if it doesn't. Once Sqawk creates the file, you can open it in other applications, including the sqlite3 CLI. If you run Sqawk more than once with the same database file, it reuses the tables each time. By default it uses a for the first file, b for the second, etc. For example, sqawk -dbfile test.db 'select 0' foo; sqawk -dbfile test.db 'select 1' bar inserts the data from both foo and bar into the table a in test.db; you can avoid this with sqawk -dbfile test.db 'select 0' table=foo foo; sqawk -dbfile test.db 'select 1' table=bar bar. If you want to, you can also insert the data from both files into the same table in one invocation: sqawk 'select * from a' foo table=a bar.
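The effect of reusing a database file can be imitated with Python's sqlite3 module. The sketch below does not run Sqawk; the helper run, the single-column table, and the sample lines are hypothetical and only model "the same table name in the same file accumulates rows across runs".

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.db")


def run(path, table, lines):
    """Model one invocation: append `lines` to `table` and return its row count."""
    conn = sqlite3.connect(path)  # opens the file, or creates it if missing
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({table}0 TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?)", [(line,) for line in lines])
    conn.commit()
    count = conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    conn.close()
    return count


first = run(path, "a", ["foo line 1", "foo line 2"])  # first run, default table a
second = run(path, "a", ["bar line 1"])               # second run lands in a as well
separate = run(path, "bar", ["bar line 1"])           # a distinct table name avoids this
print(first, second, separate)
```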

-noinput

Do not read from standard input if Sqawk is given no filename arguments.

-output value

Example: -output awk

The output format. See Output formats.

-v

Print the Sqawk version and exit.

-1

Do not split records into fields. The same as -FS 'x^'. (x^ is a regular expression that matches nothing.) Improves performance somewhat when you only want to operate on whole records (lines).
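Why x^ leaves records unsplit can be checked with any regular expression engine that treats ^ as a start anchor. Sqawk uses Tcl regular expressions; the Python sketch below is only an analogy and the sample line is made up.

```python
import re

line = "one two\tthree"

# The default-style separator splits the line into fields.
split_default = re.split(r"[ \t]+", line)

# "x^" asks for an "x" followed by the start of the input, which cannot occur,
# so the separator never matches and the record stays in one piece.
split_never = re.split(r"x^", line)
no_match = re.search(r"x^", "x^ xx")

print(split_default)
print(split_never)
print(no_match)
```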

Output formats

The following are the possible values for the command line option -output. Some formats have format options to further customize the output. The options are appended to the format name and separated from the format name and each other with commas, e.g., -output json,kv=1,pretty=1.

awk

Options: none

Example: -output awk

The default serializer, awk, mimics its namesake awk. When it is selected, the output consists of the rows returned by your query separated with the output record separator (-ORS). Each row in turn consists of columns separated with the output field separator (-OFS).

csv

Options: none

Example: -output csv

Output CSV.

json

Options: kv (default true), pretty (default false)

Example: -output json,pretty=0,kv=0

Output the result of the query as JSON. If kv (short for "key-value") is true, the result is an array of JSON objects with the column names as keys; if kv is false, the result is an array of arrays. The values are all represented as strings in either case. If pretty is true, each object (but not array) is indented for readability.
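The two JSON shapes are easy to picture with a small Python sketch. The function to_json and the sample rows are hypothetical; the sketch only reproduces the structure described above (objects versus arrays, all values as strings), not Sqawk's exact whitespace.

```python
import json

columns = ["PID", "TTY"]
rows = [(1171, "tty7"), (1631, "tty1")]


def to_json(columns, rows, kv=True):
    # Every value is emitted as a string in both modes.
    as_text = [[str(value) for value in row] for row in rows]
    if kv:
        return json.dumps([dict(zip(columns, row)) for row in as_text])
    return json.dumps(as_text)


print(to_json(columns, rows, kv=True))
print(to_json(columns, rows, kv=False))
```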

table

Options: alignments or align, margins, style

Examples: -output table,align=center left right and -output table,alignments=c l r

Output plain text tables. The table serializer uses Tabulate to format the output as a table using box-drawing characters. Note that the default Unicode table output does not display correctly in cmd.exe on Windows even after chcp 65001. Use style=loFi to draw tables with plain ASCII characters instead.

tcl

Options: kv (default false), pretty (default false)

Example: -output tcl,kv=1

Output raw Tcl data structures. With the tcl serializer Sqawk outputs a list of lists if kv is false and a list of dictionaries with the column names as keys if kv is true. If pretty is true, print every list or dictionary on a separate line.

Per-file options

These options are set before a filename and only affect one file.

columns

Examples: columns=id,name,sum and columns=id,a long name with spaces

Give custom names to the table columns for the next file. If there are more columns than custom names, the columns after the last with a custom name are named automatically in the same way as with the option header=1 (see below). Custom column names override names taken from the header. If you give a column an empty name, it is named automatically or retains its name from the header.

datatypes

Example: datatypes=integer,real,text

Set the datatypes for the columns, starting with the first (a1 if your table is a). The datatype for each column for which the datatype is not explicitly given is INTEGER. The datatype of a0 is always TEXT.
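The column datatype matters because of SQLite's type affinity: text that looks like an integer is stored as an integer in an INTEGER column but stays text in a TEXT column, which changes how values sort and compare. The following Python sketch shows the SQLite behavior directly; the table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (a1 INTEGER, a2 TEXT)")
# The same strings go into both columns, as they would when parsed from a file.
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [("9", "9"), ("10", "10"), ("100", "100")],
)

as_integer = [row[0] for row in conn.execute("SELECT a1 FROM a ORDER BY a1")]
as_text = [row[0] for row in conn.execute("SELECT a2 FROM a ORDER BY a2")]
print(as_integer)  # numeric order
print(as_text)     # lexicographic order
```

If a column holds identifiers with leading zeros or other text that must not be reinterpreted as a number, declaring it as text is the safer choice.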

format

Example: format=csv csvsep=;

Set the input format for the next file. See Input formats.

header

Example: header=1

Can be 0/false/no/off or 1/true/yes/on. Use the first row of the file as a source of column names. If the first row has five fields, then the first five columns will have custom names and all the following columns will have automatically generated names (e.g., name, surname, title, office, phone, a6, a7, ...).
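The precedence between automatic names, header names, and names given with columns can be written down as a small function. This is a sketch of the rules as described for the columns and header options, not Sqawk's code; column_names and its parameters are hypothetical.

```python
def column_names(count, prefix="a", header=None, custom=None):
    """Name `count` columns: custom name, else header name, else prefix + index."""
    header = header or []
    custom = custom or []
    names = []
    for i in range(1, count + 1):
        name = f"{prefix}{i}"                    # automatic name
        if i <= len(header) and header[i - 1]:
            name = header[i - 1]                 # name taken from the header row
        if i <= len(custom) and custom[i - 1]:
            name = custom[i - 1]                 # a non-empty custom name wins
        names.append(name)
    return names


print(column_names(7, header=["name", "surname", "title", "office", "phone"]))
print(column_names(4, custom=["id", "", "sum"]))
print(column_names(3, header=["x", "y", "z"], custom=["id", ""]))
```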

prefix

Example: prefix=x

The column name prefix in the table. Defaults to the table name. For example, with table=foo and prefix=bar you have columns named bar1, bar2, bar3, etc. in table foo.

table

Example: table=foo

The table name. By default, tables are named a, b, c, etc. Specifying, for example, table=foo for the second file only results in the tables having the names a, foo, c, ...

F0

Examples: F0=no, F0=1

Can be 0/false/no/off or 1/true/yes/on. Enable the zeroth column of the table that stores the whole record. Disabling this column lowers memory/disk usage.

NF

Example: NF=20

The same as the global option -NF but for one file (table).

MNF

Example: MNF=crop

The same as the global option -MNF but for one file (table).

Input formats

A format option (format=x) selects the input parser with which Sqawk parses the next file. Formats can have multiple synonymous names or multiple names that configure the parser in different ways. Selecting an input format can enable additional per-file options that only work for that format.

awk

Format options: FS, RS, trim, fields

Option examples: RS=\n, FS=:, trim=left, fields=1,2,3-5,auto

The default input parser. Splits the input first into records then into fields using regular expressions. The options FS and RS work the same as -FS and -RS respectively but only apply to one file. The option trim removes whitespace at the beginning of each line of input (trim=left), at its end (trim=right), both (trim=both), or neither (trim=none, default). The option fields configures how the fields of the input are mapped to the columns of the corresponding database table. This option lets you discard some of the fields, which can save memory, and merge the contents of others. For example, fields=1,2,3-5,auto tells Sqawk to insert the contents of the first field into the column a1 (assuming table a), the second field into a2, the third through the fifth field into a3, and the rest of the fields starting with the sixth into the columns a4, a5, and so on, one field per column. If you merge several fields, the whitespace between them is preserved.
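The fields mapping can be sketched in Python as follows. This is an illustration of the rules described above and of the N-end form used in the examples later in this document; it is not Sqawk's parser. The function map_fields is hypothetical, and it assumes the separator pattern contains no capturing groups of its own.

```python
import re


def map_fields(record, spec, fs=r"[ \t]+"):
    """Map the awk-style fields of `record` to column values according to `spec`."""
    # Splitting on a capturing group keeps the separators:
    # parts = [field1, sep1, field2, sep2, ..., fieldN]
    parts = re.split(f"({fs})", record)
    fields, seps = parts[0::2], parts[1::2]
    columns = []
    consumed = 0  # highest field number used so far
    for item in spec.split(","):
        if item == "auto":
            columns.extend(fields[consumed:])  # one column per remaining field
            consumed = len(fields)
            continue
        start, _, end = item.partition("-")
        lo = int(start)
        hi = len(fields) if end == "end" else int(end or start)
        hi = min(hi, len(fields))
        # Rejoin fields lo..hi with the separators originally between them.
        merged = fields[lo - 1] if lo <= len(fields) else ""
        for i in range(lo, hi):
            merged += seps[i - 1] + fields[i]
        columns.append(merged)
        consumed = max(consumed, hi)
    return columns


print(map_fields("a b c   d e f g", "1,2,3-5,auto"))
print(map_fields("11476 pts/3 00:00:00 ps -o pid", "1,2,3,4-end"))
```

Note how the three spaces between c and d survive the merge in the first call.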

csv, csv2, csvalt

Format options: csvsep, csvquote

Option example: format=csv csvsep=, 'csvquote="'

Parse the input as CSV. Using format=csv2 or format=csvalt enables the alternate mode meant for parsing CSV files exported by Microsoft Excel. csvsep sets the field separator; it defaults to a comma (,). csvquote selects the character with which the fields that contain the field separator are quoted; it defaults to a double quote ("). Note that some characters (like numbers and most letters) can't be used as csvquote.
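The roles of the separator and the quote character can be seen with Python's csv module, whose delimiter and quotechar parameters play the same part as csvsep and csvquote. This is an analogy only; Sqawk does not use Python, and the sample data is made up.

```python
import csv
import io

# A semicolon-separated file in which one field contains the separator itself.
data = 'id;note\n1;"semi;colon inside"\n2;plain\n'

reader = csv.reader(io.StringIO(data), delimiter=";", quotechar='"')
rows = list(reader)
for row in rows:
    print(row)
```

The quoted field stays in one piece because the separator inside the quotes is not treated as a field boundary.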

json

Format options: kv (default true), lines (default false)

Option example: format=json kv=false

Parse the input as JSON or JSON Lines. The value for kv and lines can be 0/false/no/off or 1/true/yes/on. If lines is false, the input is treated as a JSON array of either objects (kv=1) or arrays (kv=0). If lines is true, the input is treated as a text file with a JSON array or object (depending on kv) on every line.

When kv is false, each array becomes a record and each of its elements a field. If the table for the input file is a, its column a0 contains the concatenation of every element of the array, a1 contains the first element, a2 the second element, and so on. When kv is true, the first record contains every unique key found in all of the objects. This is intended for use with the file option header=1. The keys are in the same order they are in the first object of the input. (We treat JSON objects as ordered.) If some keys aren't in the first object but are in subsequent objects, they follow those that are in the first object in alphabetical order. Records from the second on contain the values of the input objects. These values are mapped to fields according to the order of the keys in the first record.

Every value in an object or an array is converted to text when parsed. JSON given to Sqawk should only have one level of nesting ([[],[],[]] or [{},{},{}]). What happens with more deeply nested JSON is undefined. Currently it is converted to text as Tcl dictionaries and lists.
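The key ordering for kv=1 input can be sketched in Python. The function json_objects_to_records is hypothetical, and two details are assumptions not stated above: a key missing from an object yields an empty field, and values are turned into text with Python's str.

```python
import json


def json_objects_to_records(text):
    """Turn a JSON array of objects into a header record plus value records."""
    objects = json.loads(text)  # Python dicts preserve the key order of the input
    first_keys = list(objects[0]) if objects else []
    # Keys absent from the first object follow in alphabetical order.
    extra = sorted({key for obj in objects for key in obj} - set(first_keys))
    header = first_keys + extra
    records = [header]
    for obj in objects:
        # Assumption: a missing key becomes an empty field.
        records.append([str(obj.get(key, "")) for key in header])
    return records


text = '[{"name": "x", "id": "1"}, {"id": "2", "zeta": "z", "alpha": "a"}]'
records = json_objects_to_records(text)
for record in records:
    print(record)
```

The first record is what header=1 would consume as column names; the remaining records line up with it field by field.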

tcl

Format options: kv (default false), lines (default false)

Option example: format=tcl kv=true

The value for kv can be 0/false/no/off or 1/true/yes/on. If lines is false, the input is treated as a Tcl list of either lists (kv=0) or dictionaries (kv=1). If lines is true, it is treated as a text file with a Tcl list or dictionary (depending on kv) on every line.

When kv is false, each list becomes a record and each of its elements a field. If the table for the file is a, its column a0 contains the full list, a1 contains the first element, a2 the second element, and so on. When kv is true, the first record contains every unique key found in all of the dictionaries. This is intended for use with the file option header=1. The keys are in the same order they are in the first dictionary of the input. (Tcl dictionaries are ordered.) If some keys aren't in the first dictionary but are in the subsequent ones, they follow those that are in the first dictionary in alphabetical order. Records from the second on contain the values of the input dictionaries. They are mapped to fields according to the order of the keys in the first record.

More examples

Sum up numbers

find . -iname '*.jpg' -type f -printf '%s\n' | sqawk 'select sum(a1)/1024.0/1024 from a'

Line count

sqawk -1 'select count(*) from a' < file.txt

Find lines that match a pattern

ls | sqawk -1 'select a0 from a where a0 like "%win%"'

Shuffle lines

sqawk -1 'select a1 from a order by random()' < file

Pretty-print data as a table

ps | sqawk -output table \
     'select a1,a2,a3,a4 from a' \
     trim=left \
     fields=1,2,3,4-end

Sample output

┌─────┬─────┬────────┬───────────────┐
│ PID │ TTY │  TIME  │      CMD      │
├─────┼─────┼────────┼───────────────┤
│11476│pts/3│00:00:00│       ps      │
├─────┼─────┼────────┼───────────────┤
│11477│pts/3│00:00:00│tclkit-8.6.3-mk│
├─────┼─────┼────────┼───────────────┤
│20583│pts/3│00:00:02│      zsh      │
└─────┴─────┴────────┴───────────────┘

Convert input to JSON objects

ps a | sqawk -output json,pretty=1 \
             'select PID, TTY, STAT, TIME, COMMAND from a' \
             trim=left \
             fields=1,2,3,4,5-end \
             header=1

Sample output

[{
    "PID"     : "1171",
    "TTY"     : "tty7",
    "STAT"    : "Rsl+",
    "TIME"    : "191:10",
    "COMMAND" : "/usr/lib/xorg/Xorg -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch"
},{
    "PID"     : "1631",
    "TTY"     : "tty1",
    "STAT"    : "Ss+",
    "TIME"    : "0:00",
    "COMMAND" : "/sbin/agetty --noclear tty1 linux"
}, <...>, {
    "PID"     : "26583",
    "TTY"     : "pts/1",
    "STAT"    : "R+",
    "TIME"    : "0:00",
    "COMMAND" : "ps a"
},{
    "PID"     : "26584",
    "TTY"     : "pts/1",
    "STAT"    : "R+",
    "TIME"    : "0:00",
    "COMMAND" : "tclsh /usr/local/bin/sqawk -output json,pretty=1 select PID, TTY, STAT, TIME, COMMAND from a trim=left fields=1,2,3,4,5-end header=1"
}]

Find duplicate lines

Print duplicate lines and how many times they are repeated.

sqawk -1 -OFS ' -- ' 'select a0, count(*) from a group by a0 having count(*) > 1' < file

Sample output

13 -- 2
16 -- 3
83 -- 2
100 -- 2

Remove blank lines

sqawk -1 -RS '[\n]+' 'select a0 from a' < file
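The blank-line trick works because a record separator of [\n]+ treats any run of newlines as a single separator. The Python sketch below shows the same idea with re.split; it is an analogy, the sample text is made up, and dropping the empty string left by a leading or trailing separator is an assumption about how such edges are handled.

```python
import re

text = "first\n\n\nsecond\nthird\n"

# A run of one or more newlines is one separator, so blank lines
# never become records of their own.
pieces = re.split(r"[\n]+", text)

# re.split leaves an empty string after the final newline; drop it.
records = [piece for piece in pieces if piece != ""]
print(records)
```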

Sum up numbers with the same key

sqawk -FS , -OFS , 'select a1, sum(a2) from a group by a1' data

This is the equivalent of the AWK code

awk 'BEGIN {FS = OFS = ","} {s[$1] += $2} END {for (key in s) {print key, s[key]}}' data

Input

1015,5
1015,4
1035,17
1035,11
1009,1
1009,4
1026,9
1004,5
1004,5
1009,1

Output

1004,10
1009,6
1015,9
1026,9
1035,28
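Since Sqawk hands the query to SQLite, the same aggregation can be reproduced with Python's sqlite3 module on the input shown above. This sketch loads the data by hand instead of going through Sqawk, and it adds ORDER BY a1 (not part of the original query) so that the row order is guaranteed.

```python
import sqlite3

data = """1015,5
1015,4
1035,17
1035,11
1009,1
1009,4
1026,9
1004,5
1004,5
1009,1"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (a1 INTEGER, a2 INTEGER)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    (line.split(",") for line in data.splitlines()),
)

result = conn.execute(
    "SELECT a1, sum(a2) FROM a GROUP BY a1 ORDER BY a1"
).fetchall()
for key, total in result:
    print(f"{key},{total}")
```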

Combine data from two files

Commands

This example joins the data from two metadata files generated from the happypenguin.com 2013 data dump. You do not need to download the data dump to try the query; MD5SUMS and du-bytes are included in the directory examples/hp/.

# Generate input files -- see below
cd happypenguin_dump/screenshots
md5sum * > MD5SUMS
du -b * > du-bytes
# Perform query
sqawk 'select a1, b1, a2 from a inner join b on a2 = b2 where b1 < 10000 order by b1' MD5SUMS du-bytes

Input files

MD5SUMS

d2e7d4d1c7587b40ef7e6637d8d777bc  0005.jpg
4e7cde72529efc40f58124f13b43e1d9  001.jpg
e2ab70817194584ab6fe2efc3d8987f6  0.0.6-settings.png
9d2cfea6e72d00553fb3d10cbd04f087  010_2.jpg
3df1ff762f1b38273ff2a158e3c1a6cf  0.10-planets.jpg
0be1582d861f9d047f4842624e7d01bb  012771602077.png
60638f91b399c78a8b2d969adeee16cc  014tiles.png
7e7a0b502cd4d63a7e1cda187b122b0b  017.jpg
[...]

du-bytes

136229  0005.jpg
112600  001.jpg
26651   0.0.6-settings.png
155579  010_2.jpg
41485   0.10-planets.jpg
2758972 012771602077.png
426774  014tiles.png
165354  017.jpg
[...]

Output

d50700db41035eb74580decf83f83184 615 z81.png
e1b64d03caf4615d54e9022d5b13a22d 677 init.png
a0fb29411c169603748edcc02c0e86e6 823 agendaroids.gif
3b0c65213e121793d4458e09bb7b1f58 970 screen01.gif
05f89f23756e8ea4bc5379c841674a6e 999 retropong.png
a49a7b5ac5833ec365ed3cb7031d1d84 1458 fncpong.png
80616256c790c2a831583997a6214280 1516 el2_small.jpg
[...]
1c8a3cb2811e9c20572e8629c513326d 9852 7.png
c53a88c68b73f3c1632e3cdc7a0b4e49 9915 choosing_building.PNG
bf60508db16a92a46bbd4107f15730cd 9946 glad_shot01.jpg
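The join itself is plain SQLite and can be tried in Python on the first three rows of each input file shown above. This is a sketch, not a Sqawk run: the tables are filled by hand, the column types are chosen for the example, and the size threshold of 120000 is hypothetical (none of the rows shown above is under 10000 bytes).

```python
import sqlite3

md5sums = """d2e7d4d1c7587b40ef7e6637d8d777bc  0005.jpg
4e7cde72529efc40f58124f13b43e1d9  001.jpg
e2ab70817194584ab6fe2efc3d8987f6  0.0.6-settings.png"""

du_bytes = """136229\t0005.jpg
112600\t001.jpg
26651\t0.0.6-settings.png"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (a1 TEXT, a2 TEXT)")     # hash, filename
conn.execute("CREATE TABLE b (b1 INTEGER, b2 TEXT)")  # size in bytes, filename
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    (line.split(None, 1) for line in md5sums.splitlines()),
)
conn.executemany(
    "INSERT INTO b VALUES (?, ?)",
    (line.split(None, 1) for line in du_bytes.splitlines()),
)

# Same shape as the query above, with a hypothetical threshold.
joined = conn.execute(
    "SELECT a1, b1, a2 FROM a INNER JOIN b ON a2 = b2 "
    "WHERE b1 < 120000 ORDER BY b1"
).fetchall()
for row in joined:
    print(*row)
```

The filename is the join key: it is the second field in both files, hence a2 = b2.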

License

MIT.

squawk.jpg photograph by Terry Foote at English Wikipedia. It is licensed under CC BY-SA 3.0.


Releases (6)

v0.24.0 (latest), May 10, 2024
