dingodb/dingo-libexprPublic

NotificationsYou must be signed in to change notification settings
Fork5
Star6

Cpp engine of DingoDB Expression.

License

Apache-2.0 license

6 stars 5 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
contrib		contrib
docs/images		docs/images
src		src
test		test
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Repository files navigation

Dingo Expression Coprocessor

This project is the coprocessor part ofDingo Expression, written in C++. The project is made standalone to be easily integrated into other C++ projects.

Getting Started

#include"expr/runner.h"usingnamespacedingodb::expr;

Assumebytes is an array of typeexpr::Byte[size], which contains the encoded expression

Runner runner;runner.Decode(bytes, size);runner.Run();Operand result = runner.Get();

Operand is a class to hold any supported data types. AnOperand object can be created from any supported typeT by the constructorOperand<T>(T value), and the value contained in anOperand object can be retrieved by

TOperand::GetValue()const;

Operand can also hold aNULL value, and can be tested by the overloaded "equals/non-equals" operator, likeoperand == nullptr oroperand != nullptr.

Variables

Dingo Expression Coprocessor supports variables. Internally, variables are indexed by integers, start from 0. If the expression contains variables, aTuple must be bind to theRunner before run

Tuple tuple{1,"Alice",100.0};runner.BindTuple(&tuple);

whereTuple is a type alias ofstd::vector<Operand>. The variables indexed by 0, 1, 2 would be assigned the 1st, 2nd, 3rd values of thetuple, respectively.

Note:

Therunner does not take ownership of thetuple, which should be released by the caller if necessary
The number of elemnets in thetuple is not checked when retrieving the value and it is the caller's duty to ensure the index of variables are valid

Relational Algebra

Dingo Expression Coprocessor has also limited implementation for relational algebra. To use it, add the following to the source code

#include"rel/rel_runner.h"usingnamespacedingodb::expr;usingnamespacedingodb::rel;

Assumebytes is an array of typeexpr::Byte[size], which contains the encoded relational algebra

auto *rel =new RelRunner();rel->Decode(buf, size);

ThenTuples can be put into theRelRunner

for (int i =0; i < tuples.size(); ++i) {constauto *output = rel->Put(tuples[i]);}

Theoutput returned is anotherTuple ornullptr. The former is a valid result and the latter means no result can be retrieved immediately for the input may be fitered out or cached.

If there are datum cached in theRelRunner (which is typically when the relational algebra contains any aggregation), the cached results must be taken out after all tuples are put in

while ((constauto *output = rel->Get()) != nullptr) {do_some_thing(output);};

Note:

TheRelRunner takes over the ownership of theTuple put in. The caller must not try to release it
If theoutput returned either byPut orGet is notnullptr, it must be released by the caller
The implementation ofRelRunner is not thread-safe

Implementations

Expression Evaluating

TheRunner class is for expression evaluating. The following figure illustrate the evaluating process, in which the expression is like2 + T[0] * 3

TheRunner contains an operand stack and an operator verctor. The operator vector is constructed byDecode method from the encoded bytes of an expression. Each operator can manipulate (push/pop) operands in the operand stack following pre-defined process. WhenRun method is called, operators in the vector are carried out one by one. By callingGet method, the top elemement in the stack is poped out as the returned result. Mostly, there is only one oprand left in the stack afterRun for a valid expression.

Encodings

Data Types

Data types are represented by 4 bit, as described in the following table

Type	Description	Decimal Code	Hex Code	Corresponding C++ Type
`INT32`	32 bit integer	`1`	`0x1`	`int32_t`
`INT64`	64 bit integer	`2`	`0x2`	`int64_t`
`BOOL`	boolean	`3`	`0x3`	`bool`
`FLOAT`	single precision floating point	`4`	`0x4`	`float`
`DOUBLE`	double precision floating point	`5`	`0x5`	`double`
`DECIMAL`	arbitrary-precision decimal number	`6`	`0x6`	Not supported yet
`STRING`	string	`7`	`0x7`	`std::shared_ptr<std::string>`

Note the C++ types are wrapped inOperand class.

Immediate Numbers

Immediate numbers are const values embedded in the encoded expressions, to implement const, index of variables, length of arrays, etc. The encoding of immediate numbers is described in the following table

Type of Immediate Number	Description of Encoding
`INT32`	See encoding of VarInt inProtocol Buffers
`INT64`	The same as above
`FLOAT`	4 bytes big-endian representation
`DOUBLE`	8 bytes big-endian representation
`DECIMAL`	To be defined
`STRING`	Encoding the byte length of it first, followed by all the types of it. The length is encoded as`INT32` type

End of Expression

If the expressions are used in relational algebra, a special byte may be needed to indicate the end of expression. Byte0x00 is used for that purpose. But byte0x00 is not always means the termination of expression, because there may be any bytes in the presentation of an immediate number.

Operators

The expressions Dingo Expression Coprocessor processed are represented in suffix style, i.e. the operator is put after its operands. Consts and variables are also treated as operator in this representation. All the operators are encoded into unique bytes representation. Operators may have one or more immediate numbers followed. Generally

For 1-byte representation, the HSB is set to0, and the lower 4 bits can be used to indicate a data type if needed
For 2-bytes representation, the HSB is set to1, and the higher 4 bits and the lower 4 bits of the 2nd byte can be used to indicate two data types individually if needed
No byte0x00 or0xFF is allowed in the representation except for immediate numbers

The encoding of operators is illustrated in the following figure

One-byte Operators

The encoding of one-byte operators is as follows (The template-likeT is any of theData Types)

Operator	Higher 4 Bits	Lower 4 Bits	Immediate Number	Description
`NULL<T>`	`0x0`	Encode type`T`	None	`NULL` value
`CONST<T>`	`0x1`	Encode type`T`	`T` type value	`T` type const,`T` !=`BOOL`
`CONST<BOOL>`	`0x1`	Encode type`BOOL`	None	`BOOL` value`true`
`CONST_N<T>`	`0x2`	Encode type`T`	`T` type value	`T` type const,`T` ==`INT32` or`T` ==`INT64`, the real value is the inverse of the encoded immediate number
`CONST_N<BOOL>`	`0x2`	Encode type`BOOL`	None	`BOOL` value`false`
`VAR<T>`	`0x3`	Encode type`T`	`INT32` type value	`T` type variable indexed by an integer
`VAR_S<T>`	`0x4`	Encode type`T`	`STRING` type value	`T` type variable indexed by a string.Not implemented yet
`NOT`	`0x5`	`0x1`	None	Unary`NOT`
`AND`	`0x5`	`0x2`	None	Binary`AND`
`OR`	`0x5`	`0x3`	None	Binary`OR`
`AND_FUN`	`0x5`	`0x4`	`INT32` type value	Variadic`AND`, the immediate number is the number of parameters.Not implemented yet
`OR_FUN`	`0x5`	`0x5`	`INT32` type value	Variadic`OR`, the immediate number is the number of parameters.Not implemented yet

Arrays

Operator	Higher 4 Bits	Lower 4 Bits	Immediate number 0	Imediate number 1..N	Description
`ARRAY<T>`	`0x6`	Encode type`T`	`INT32` type number`N`, the number of elements	encode`N` values of`T` type continously	Array of`T` type consts. Not implement yet
`ARRAY<AGG>`	None	None	`INT32` type number`N`, the number of elements	encode`N` aggregation functions continously	Array of aggregations, used in relational algebra
`ARRAY<CONST>`	None	None	`INT32` type number`N`, the number of elements	encode`NCONST`/`CONST_N` expressions continously	Tuple of consts.Not implemented yet

Map (Key-Value Data)

Operator	1st Byte	Higher 4 Bits of 2nd Byte	Lower 4 Bits of 2nd Byte	Immediate Number 0	Immediate Number 1..2N	Description
`MAP<K, V>`	`0x80`	Encode type`K`	Encode type`V`	`INT32` type number`N`, tye number of key-value pairs	encode`K` and`V` type consts alternately	Map of consts,`K` and`V` is the type of key and value.Not implemented yet

Two-bytes operators

The encoding of two-bytes operators is as follows (The template-likeT is any of theData Types)

Operator	1st Byte	Higher 4 Bits of 2nd Byte	Lower 4 Bits of 2nd Byte	Immediate Number	Description
`POS<T>`	`0x81`	`0x0`	Encode type`T`	None	Unary`+`
`SUM<T>`	`0x81`	`0x1`	Encode type`T`	`INT32` type value, the number of parameters	Varidiac`SUM`.Not implemented yet
`NEG<T>`	`0x82`	`0x0`	Encode type`T`	None	Unary`-`
`ADD<T>`	`0x83`	`0x0`	Encode type`T`	None	Binary`+`
`SUB<T>`	`0x84`	`0x0`	Encode type`T`	None	Binary`-`
`MUL<T>`	`0x85`	`0x0`	Encode type`T`	None	Binary`*`
`DIV<T>`	`0x86`	`0x0`	Encode type`T`	None	Binary`/`
`MOD<T>`	`0x87`	`0x0`	Encode type`T`	None	Binary`%`
`EQ<T>`	`0x91`	`0x0`	Encode type`T`	None	Binary`==`
`GE<T>`	`0x92`	`0x0`	Encode type`T`	None	Binary`>=`
`GT<T>`	`0x93`	`0x0`	Encode type`T`	None	Binary`>`
`LE<T>`	`0x94`	`0x0`	Encode type`T`	None	Binary`<=`
`LT<T>`	`0x95`	`0x0`	Encode type`T`	None	Binary`<`
`NE<T>`	`0x96`	`0x0`	Encode type`T`	None	Binary`<>`
`IS_NULL<T>`	`0xA1`	`0x0`	Encode type`T`	None	Unary`IS_NULL` function
`IS_TRUE<T>`	`0xA2`	`0x0`	Encode type`T`	None	Unary`IS_TRUE` function
`IS_FALSE<T>`	`0xA3`	`0x0`	Encode type`T`	None	Unary`IS_FALSE` function
`MIN<T>`	`0xB1`	`0x0`	Encode type`T`	None	Binary`MIN` function
`VARG_MIN<T>`	`0xB1`	`0x1`	Encode type`T`	`INT32` type value, the number of parameters	Variadic`MIN` function
`MAX<T>`	`0xB2`	`0x0`	Encode type`T`	None	Binary`MAX` function
`VARG_MAX<T>`	`0xB2`	`0x1`	Encode type`T`	`INT32` type value, the number of parameters	Variadic`MAX` function
`ABS<T>`	`0xB3`	`0x0`	Encode type`T`	None	Unary`ABS` function
`CAST<D, T>`	`0xF0`	Encode type`D`	Encode type`T`	None	Casting from`T` type to`D` type

Functions

The coding of functions is as follows

Operator	1st Byte	2nd Byte	Immediate Number	Description
`FUN`	`0xF1`	1-byte positive integer, the sequence number of the function	None	Pre-defined functions
`VARG_FUN`	`0xF2`	1-byte positive integer, the sequence number of the function	`INT32` type value, the number of parameters	Pre-defined varidiac functions.Not implemented yet

Sequence Number of Functions

Mathematical Functions

Function	Type of Parameters	Type of Return Value	Sequence Number	Description
`CEIL`	`DOUBLE`	`DOUBLE`	`0x01`	Round up to the smallest integer
`FLOOR`	`DOUBLE`	`DOUBLE`	`0x02`	Round down to the largest integer
`ROUND`	`INT64`,`INT32`	`INT64`	`0x03`	Round.Not implemented yet
`ROUND`	`DOUBLE`,`INT32`	`DOUBLE`	`0x04`	Round.Not implemented yet
`POW`	`DOUBLE`	`DOUBLE`	`0x05`	Power.Not implemented yet
`POW`	`INT64`	`INT64`	`0x06`	Power.Not implemented yet
`SIN`	`DOUBLE`	`DOUBLE`	`0x07`	Sine
`COS`	`DOUBLE`	`DOUBLE`	`0x08`	Cosine
`TAN`	`DOUBLE`	`DOUBLE`	`0x09`	Tangent
`ASIN`	`DOUBLE`	`DOUBLE`	`0x0A`	Arc sine
`ACOS`	`DOUBLE`	`DOUBLE`	`0x0B`	Arc cosine
`ATAN`	`DOUBLE`	`DOUBLE`	`0x0C`	Arc tangent
`SINH`	`DOUBLE`	`DOUBLE`	`0x0D`	Hyperbolic sine
`COSH`	`DOUBLE`	`DOUBLE`	`0x0E`	Hyperbolic cosine
`TANH`	`DOUBLE`	`DOUBLE`	`0x0F`	Hyperbolic tangent
`EXP`	`DOUBLE`	`DOUBLE`	`0x10`	Exponential
`LOG`	`DOUBLE`	`DOUBLE`	`0x11`	Logarithm

String Functions

Function	Type of Parameters	Type of Return Value	Sequence Number	Description
`CHAR_LENGTH`	`STRING`	`INT32`	`0x20`	Return the length of a string.Not implemented yet
`CONCAT`	`STRING`,`STRING`	`STRING`	`0x21`	Concatenate two strings
`LOWER`	`STRING`	`STRING`	`0x22`	Converts to lowercase
`UPPER`	`STRING`	`STRING`	`0x23`	Converts to uppercase
`LEFT`	`STRING`,`INT32`	`STRING`	`0x24`	Sub string from left
`RIGHT`	`STRING`,`INT32`	`STRING`	`0x25`	Sub string from right
`TRIM`	`STRING`	`STRING`	`0x26`	Trim whitespaces
`TRIM`	`STRING`,`STRING`	`STRING`	`0x27`	Trim specified string.Not implemented yet
`LTRIM`	`STRING`	`STRING`	`0x28`	Trim whitespaces from left
`LTRIM`	`STRING`,`STRING`	`STRING`	`0x29`	Trim specified string from left.Not implemented yet
`RTRIM`	`STRING`	`STRING`	`0x2A`	Trim whitespaces from right
`RTRIM`	`STRING`,`STRING`	`STRING`	`0x2B`	Trime specified string from right.Not implemented yet
`SUBSTR`	`STRING`,`INT32`,`INT32`	`STRING`	`0x2C`	Sub string from a "position" to another "position". The 1st "position" is`0`
`SUBSTR`	`STRING`,`INT32`	`STRING`	`0x2D`	Sub string from a "position" to the end. The 1st "position" is`0`
`MID`	`STRING`,`INT32`,`INT32`	`STRING`	`0x2E`	Sub string from a "position" to the specified length. The 1st "position" is`1`
`MID`	`STRING`,`INT32`	`STRING`	`0x2F`	Sub string from a "position" to the end. The 1st "position" is`1`
`REPEAT`	`STRING`,`INT32`	`STRING`	`0x30`	Repeat the string.Not implemented yet
`REVERSE`	`STRING`	`STRING`	`0x31`	Reverse the string.Not implemented yet
`REPLACE`	`STRING`,`STRING`,`STRING`	`STRING`	`0x32`	Search and replace.Not implemented yet
`LOCATE`	`STRING`,`STRING`,`INT32`	`INT32`	`0x33`	Find the "position" of a string in another string. The 1st "position" is`1`.Not implemented yet
`LOCATE`	`STRING`,`STRING`	`INT32`	`0x34`	Find the "position" of a string in another string. The 1st "position" is`1`.Not implemented yet
`FORMAT`	DOUBLE,`INT32`	`STRING`	`0x35`	Formatted output of a number.Not implemented yet

Aggregation Functions

Aggregation functions are used only in relational algebra. Complicated aggregations such asAVG is not supported here. Actually,AVG can be converted toSUM andCOUNT with a projection before encoded.

Encoding of aggregations does not follow the way to encode normal function and has no leading0xF1 byte, but is more like that of one-byte operators, as in the following table

Aggregation	Type of Parameter	Type of Return Value	Higher 4 Bits	Low 4 Bits	Immediate Number	Description
`COUNT_ALL`	None	`INT64`	`0x1`	`0x0`	None	Count all rows
`COUNT<T>`	`T`	`INT64`	`0x1`	Encode type`T`	`INT32` type value, the column index	Count only non-null values
`SUM<T>`	`T`	`T`	`0x2`	Encode type`T`	`INT32` type value, the column index	Sum the values
`MAX<T>`	`T`	`T`	`0x3`	Encode type`T`	`INT32` type value, the column index	Maximum of the values
`MIN<T>`	`T`	`T`	`0x4`	Encode type`T`	`INT32` type values, the column index	Minimum of the values

Note:

The column indices are all started from0
Some aggregation functions are requried to return0 if there are no input rows, such asCOUNT. For simplicity,NULL is returned for all aggregation functions here and it is in the final reducing stage to convertNULL to0

Relational Algebra Operators

Only "simple" (the operators have only one single input and one output) relational algebra expressions are supported. These operators are chained with one's output connected to another's input to form a whole relational algebra expression. They are encoded one by one from the input end to the output end.

The encoding of each algebra operators contains an unique "leading byte", a "payload" for the parameters and expressions in the operator and a "trailing byte" which is alwaysEOE (End Of Expression). The details are in the following table

Operator	Leading Byte	Payload	Trailing Byte
Filter	`0x71`	Encode the filter expression	`EOE`
Project	`0x72`	Encode the project expression one by one	`EOE`
Grouped Aggregation	`0x73`	Encode group indices as`ARRAY<INT32>` type, then the list of aggregation functions as`ARRAY<AGG>` type	`EOE`
Ungrouped Aggregation	`0x74`	Encode the list of aggregation functions as`ARRAY<AGG>` type	`EOE`

A "Project" operator may contains several expressions but they can be concatenated into one "huge" expression without any separator simplify the evaluating process. The "huge" expression is decoded by oneRunner, and after evaluating there will be several results left in the operand stack just as needed. These results can be taken out by multiple calls toGet method.

Used by

About

Cpp engine of DingoDB Expression.

Movatterモバイル変換

License

dingodb/dingo-libexpr

Folders and files

Latest commit

History

Repository files navigation

Dingo Expression Coprocessor

Getting Started

Variables

Relational Algebra

Implementations

Expression Evaluating

Encodings

Data Types

Immediate Numbers

End of Expression

Operators

One-byte Operators

Arrays

Map (Key-Value Data)

Two-bytes operators

Functions

Sequence Number of Functions

Mathematical Functions

String Functions

Aggregation Functions

Relational Algebra Operators

Used by

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Contributors3

Uh oh!

Languages

Packages