- Notifications
You must be signed in to change notification settings - Fork9
Parses SQL into Concrete Syntax Tree (CST)
License
nene/sql-parser-cst
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
SQL Parser which produces aConcrete Syntax Tree (CST).
Unlike a more usual parser which produces an Abstract Syntax Tree (AST),this parser preserves all the syntax elements present in the parsed source code,with the goal of being able to re-create the exact original source code.
Try it live inSQL Explorer.
- Detailed TypeScript types for the syntax tree
- Unified syntax tree for multiple SQL dialects
- Includes source code location data for all nodes
- Includes comments in the syntax tree
- Helpful error messages
- Fast
Supports the following SQL dialects:
- SQLite - full support (version 3.45)
- BigQuery - full support (as of 31 January 2024).
- MySQL - experimental (version 8)(see#7 for implementation progress).
- MariaDB - experimental (version 10)(see#32 for implementation progress).
- PostgreSQL - experimental (version 16)(see#40 for implementation progress).
Note: This software is in very active development.The syntax tree structure is mostly stable now,though there are bound to be changes as new SQL dialectsare added and they contain features that need to be accommodatedto the syntax tree.
npm install sql-parser-cst
import{parse,show,cstVisitor}from"sql-parser-cst";constcst=parse("select * from my_table",{dialect:"sqlite",// These are optional:includeSpaces:true,// Adds spaces/tabsincludeNewlines:true,// Adds newlinesincludeComments:true,// Adds commentsincludeRange:true,// Adds source code location data});// convert all keywords to uppercaseconsttoUpper=cstVisitor({keyword:(kw)=>{kw.text=kw.text.toUpperCase();},});toUpper(cst);// Serialize back to SQLshow(cst);// --> SELECT * FROM my_table
For example, given the following SQL:
/* My query*/SELECT ("first_name"||' jr.')as fname-- use important tableFROM persons;
An AST-parser might parse this to the following abstract syntax tree:
{"type":"select_stmt","columns": [ {"type":"alias","expr": {"type":"binary_expr","left": {"type":"column_ref","column":"first_name" },"operator":"||","right": {"type":"string","value":" jr." } },"alias":"fname" } ],"from": [{"type":"table_ref","table":"persons" }]}
Note that the above AST is missing the following information:
- comments
- whitespace (e.g. where the newlines are)
- case of keywords (e.g. whether
AS
oras
was written) - whether an identifier was quoted or not (and with what kind of quotes)
- whether an expression is wrapped in additional (unnecessary) parenthesis.
- whether the statement ends with a semicolon.
In contrast, this CST parser produces the following concrete syntax tree,which preserves all of this information:
{"type":"program","statements": [ {"type":"select_stmt","clauses": [ {"type":"select_clause","selectKw": {"type":"keyword","text":"SELECT","name":"SELECT" },"options": [],"columns": {"type":"list_expr","items": [ {"type":"alias","expr": {"type":"paren_expr","expr": {"type":"binary_expr","left": {"type":"identifier","text":"\"first_name\"","name":"first_name" },"operator":"||","right": {"type":"string_literal","text":"' jr.'","value":" jr." } } },"asKw": {"type":"keyword","text":"as","name":"AS" },"alias": {"type":"identifier","text":"fname","name":"fname" } } ] } }, {"type":"from_clause","fromKw": {"type":"keyword","text":"FROM","name":"FROM" },"expr": {"type":"identifier","text":"persons","name":"persons" },"leading": [ {"type":"newline","text":"\n" }, {"type":"line_comment","text":"-- use important table" }, {"type":"newline","text":"\n" } ] } ] }, {"type":"empty" } ],"leading": [ {"type":"block_comment","text":"/* My query */" }, {"type":"newline","text":"\n" } ]}
Note the following conventions:
- All keywords are preserved in
type: keyword
nodes, which are usuallystored in fields named likesomeNameKw
. - Parenthesis is represented by separate
type: paren_expr
node. - Comma-separated lists are represented by separate
type: list_expr
node. - Trailing semicolon is represented by
type: empty
node in the end. - The original source code representation of strings, identifiers, keywords, etcis preserved in
text
fields. - Each node can have
leading
andtrailing
fields,which store comments and newlines immediately before or after that node.These fields will also contain information about regular spaces/tabs(e.g.{"type": "space", "text": " \t"}
). This has been left out from thisexample for the sake of simplicity.
Parses SQL string and returns the CST tree. Takes the following options:
- dialect:
'sqlite' | 'bigquery' | 'mysql' | 'mariadb' | 'postgresql'
The SQL dialect to parse(required). - includeRange:
boolean
When enabled addsrange: [number, number]
field to all CST nodes,which contains the start and end locations of the node. - includeComments:
boolean
When enabled addsleading: Whitespace[]
and/ortrailing: Whitespace[]
to nodes which are preceded or followed by comments. - includeNewlines:
boolean
LikeincludeComments
, but includes newlines info to the same fields. - includeSpaces:
boolean
LikeincludeComments
, but includes horizontal whitespace info to the same fields. - paramTypes:
("?" | "?nr" | "$nr" | ":name" | "$name" | "@name" | "`@name`")[]
Determines the types of bound parameters supported by the parser.By default a query likeSELECT * FROM tbl WHERE id = ?
will result in parse error.To fix it, useparamTypes: ["?"]
config option. - filename:
string
Name of the SQL file. This is only used for error-reporting. - acceptUnsupportedGrammar:
boolean
When enabled, code that would otherwise fail to parse,gets parsed asunsupported_grammar_stmt
node. That will consume all text until the next semicolon.After the semicolon, parsing will resume as normal. This option is primarily intended as a workaroundfor using the parser with an SQL dialect that's not yet 100% supported.
When parsing fails with syntax error, it throwsFormattedSyntaxError
which contains a message like:
Syntax Error: Unexpected "WHERE"Was expecting to see: "!", "$", "(", "-", ":", "?", "@", "CASE", ...--> my_db.sql:2:33 |2 | SELECT * FROM my_table ORDER BY WHERE | ^
Converts CST back to string.
Important caveat: the CST has to contain whitespace data, meaning,it was generated withincludeComments
,includeNewlines
andincludeSpaces
options enabled.
For any valid SQL the following assertion will always hold:
constopts={dialect:"sqlite",includeComments:true,includeNewlines:true,includeSpaces:true,};show(parse(sql,opts))===sql;// always true
Generates a function that walks through the whole CST tree and callsa function inmap
whenever it encounters a node with that type.
For example the following code checks that all table and column aliasesuse the explicitAS
keyword:
constcheckAliases=cstVisitor({alias:(node)=>{if(!node.asKw){thrownewError("All alias definitions must use AS keyword!");}},});checkAliases(cst);
You can returnVisitorAction.SKIP
to avoid visiting all child nodes of a specific node:
lettopLevelSelects=0;constcountTopLevelSelects=cstVisitor({select_stmt:(node)=>{topLevelSelects++;returnVisitorAction.SKIP;},});countTopLevelSelects(cst);
Transforms the whole CST into some other typeT
. Themap
objectshould contain an entry for each of the CST node types it expects toencounter (this generally means all of them).
For example, the following implements atoString()
function thatserializes very basic SQL queries likeSELECT 1, 2, 3 + 4
:
consttoString=cstTransformer({program:(node)=>node.statements.map(toString).join(";"),select_statement:(node)=>node.clauses.map(toString).join(" "),select_clause:(node)=>"SELECT "+node.columns.map(toString).join(", "),binary_expr:(node)=>toString(node.left)+" "+node.operator+" "+toString(node.right),number_literal:(node)=>node.text,});
The builtinshow()
function is implemented as such a transform.
Additionally the parser exports lists ofreserved keywords for each supported SQL dialect:sqliteKeywords
,bigqueryKeywords
,mysqlKeywords
,mariadbKeywords
,postgresqlKeywords
.These are simple JavaScript objects, useful for doing lookups:
exportconstsqliteKeywords={ABORT:true,ACTION:true,ADD:true, ...};
yarn generate
will generate parser.
The testsuite contains two kinds of tests:
- tests applicable for all dialects
- tests applicable for only some specific dialects
When running the testsuite one always needs to pick a dialect.For exampleyarn test:sqlite
oryarn test:mysql
.Running one of these commands will run the testsuite against the parserof that dialect. It will execute all the generic tests plus testsapplicable for that dialect.
yarn test
will execute the testsuite for each supported dialect,covering all the possible combinations.
Start the parser-generator watch process in one terminal:
yarn watch:generate
and the tests watch process in another terminal:
yarn test:sqlite --watch
Note thatyarn test --watch
doesn't work.A separate watch process needs to be started manually for each dialect.
Generate new release withyarn publish
.
To generate a changelog use theyarn changelog
command:
VERSION=v0.27.0 yarn changelog
This started as a fork ofnode-sql-parser,which is based on@flora/sql-parser,which in turn was extracted from Alibaba'snquery module.
There's very little left of the original code though.
About
Parses SQL into Concrete Syntax Tree (CST)
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors2
Uh oh!
There was an error while loading.Please reload this page.