- Notifications
You must be signed in to change notification settings - Fork5
Cpp engine of DingoDB Expression.
License
dingodb/dingo-libexpr
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This project is the coprocessor part ofDingo Expression, written in C++. The project is made standalone to be easily integrated into other C++ projects.
#include"expr/runner.h"usingnamespacedingodb::expr;
Assumebytes is an array of typeexpr::Byte[size], which contains the encoded expression
Runner runner;runner.Decode(bytes, size);runner.Run();Operand result = runner.Get();
Operand is a class to hold any supported data types. AnOperand object can be created from any supported typeT by the constructorOperand<T>(T value), and the value contained in anOperand object can be retrieved by
TOperand::GetValue()const;
Operand can also hold aNULL value, and can be tested by the overloaded "equals/non-equals" operator, likeoperand == nullptr oroperand != nullptr.
Dingo Expression Coprocessor supports variables. Internally, variables are indexed by integers, start from 0. If the expression contains variables, aTuple must be bind to theRunner before run
Tuple tuple{1,"Alice",100.0};runner.BindTuple(&tuple);whereTuple is a type alias ofstd::vector<Operand>. The variables indexed by 0, 1, 2 would be assigned the 1st, 2nd, 3rd values of thetuple, respectively.
Note:
- The
runnerdoes not take ownership of thetuple, which should be released by the caller if necessary - The number of elemnets in the
tupleis not checked when retrieving the value and it is the caller's duty to ensure the index of variables are valid
Dingo Expression Coprocessor has also limited implementation for relational algebra. To use it, add the following to the source code
#include"rel/rel_runner.h"usingnamespacedingodb::expr;usingnamespacedingodb::rel;
Assumebytes is an array of typeexpr::Byte[size], which contains the encoded relational algebra
auto *rel =new RelRunner();rel->Decode(buf, size);
ThenTuples can be put into theRelRunner
for (int i =0; i < tuples.size(); ++i) {constauto *output = rel->Put(tuples[i]);}
Theoutput returned is anotherTuple ornullptr. The former is a valid result and the latter means no result can be retrieved immediately for the input may be fitered out or cached.
If there are datum cached in theRelRunner (which is typically when the relational algebra contains any aggregation), the cached results must be taken out after all tuples are put in
while ((constauto *output = rel->Get()) != nullptr) {do_some_thing(output);};
Note:
- The
RelRunnertakes over the ownership of theTupleput in. The caller must not try to release it - If the
outputreturned either byPutorGetis notnullptr, it must be released by the caller - The implementation of
RelRunneris not thread-safe
TheRunner class is for expression evaluating. The following figure illustrate the evaluating process, in which the expression is like2 + T[0] * 3
TheRunner contains an operand stack and an operator verctor. The operator vector is constructed byDecode method from the encoded bytes of an expression. Each operator can manipulate (push/pop) operands in the operand stack following pre-defined process. WhenRun method is called, operators in the vector are carried out one by one. By callingGet method, the top elemement in the stack is poped out as the returned result. Mostly, there is only one oprand left in the stack afterRun for a valid expression.
Data types are represented by 4 bit, as described in the following table
| Type | Description | Decimal Code | Hex Code | Corresponding C++ Type |
|---|---|---|---|---|
INT32 | 32 bit integer | 1 | 0x1 | int32_t |
INT64 | 64 bit integer | 2 | 0x2 | int64_t |
BOOL | boolean | 3 | 0x3 | bool |
FLOAT | single precision floating point | 4 | 0x4 | float |
DOUBLE | double precision floating point | 5 | 0x5 | double |
DECIMAL | arbitrary-precision decimal number | 6 | 0x6 | Not supported yet |
STRING | string | 7 | 0x7 | std::shared_ptr<std::string> |
Note the C++ types are wrapped inOperand class.
Immediate numbers are const values embedded in the encoded expressions, to implement const, index of variables, length of arrays, etc. The encoding of immediate numbers is described in the following table
| Type of Immediate Number | Description of Encoding |
|---|---|
INT32 | See encoding of VarInt inProtocol Buffers |
INT64 | The same as above |
FLOAT | 4 bytes big-endian representation |
DOUBLE | 8 bytes big-endian representation |
DECIMAL | To be defined |
STRING | Encoding the byte length of it first, followed by all the types of it. The length is encoded asINT32 type |
If the expressions are used in relational algebra, a special byte may be needed to indicate the end of expression. Byte0x00 is used for that purpose. But byte0x00 is not always means the termination of expression, because there may be any bytes in the presentation of an immediate number.
The expressions Dingo Expression Coprocessor processed are represented in suffix style, i.e. the operator is put after its operands. Consts and variables are also treated as operator in this representation. All the operators are encoded into unique bytes representation. Operators may have one or more immediate numbers followed. Generally
- For 1-byte representation, the HSB is set to
0, and the lower 4 bits can be used to indicate a data type if needed - For 2-bytes representation, the HSB is set to
1, and the higher 4 bits and the lower 4 bits of the 2nd byte can be used to indicate two data types individually if needed - No byte
0x00or0xFFis allowed in the representation except for immediate numbers
The encoding of operators is illustrated in the following figure
The encoding of one-byte operators is as follows (The template-likeT is any of theData Types)
| Operator | Higher 4 Bits | Lower 4 Bits | Immediate Number | Description |
|---|---|---|---|---|
NULL<T> | 0x0 | Encode typeT | None | NULL value |
CONST<T> | 0x1 | Encode typeT | T type value | T type const,T !=BOOL |
CONST<BOOL> | 0x1 | Encode typeBOOL | None | BOOL valuetrue |
CONST_N<T> | 0x2 | Encode typeT | T type value | T type const,T ==INT32 orT ==INT64, the real value is the inverse of the encoded immediate number |
CONST_N<BOOL> | 0x2 | Encode typeBOOL | None | BOOL valuefalse |
VAR<T> | 0x3 | Encode typeT | INT32 type value | T type variable indexed by an integer |
VAR_S<T> | 0x4 | Encode typeT | STRING type value | T type variable indexed by a string.Not implemented yet |
NOT | 0x5 | 0x1 | None | UnaryNOT |
AND | 0x5 | 0x2 | None | BinaryAND |
OR | 0x5 | 0x3 | None | BinaryOR |
AND_FUN | 0x5 | 0x4 | INT32 type value | VariadicAND, the immediate number is the number of parameters.Not implemented yet |
OR_FUN | 0x5 | 0x5 | INT32 type value | VariadicOR, the immediate number is the number of parameters.Not implemented yet |
| Operator | Higher 4 Bits | Lower 4 Bits | Immediate number 0 | Imediate number 1..N | Description |
|---|---|---|---|---|---|
ARRAY<T> | 0x6 | Encode typeT | INT32 type numberN, the number of elements | encodeN values ofT type continously | Array ofT type consts. Not implement yet |
ARRAY<AGG> | None | None | INT32 type numberN, the number of elements | encodeN aggregation functions continously | Array of aggregations, used in relational algebra |
ARRAY<CONST> | None | None | INT32 type numberN, the number of elements | encodeNCONST/CONST_N expressions continously | Tuple of consts.Not implemented yet |
| Operator | 1st Byte | Higher 4 Bits of 2nd Byte | Lower 4 Bits of 2nd Byte | Immediate Number 0 | Immediate Number 1..2N | Description |
|---|---|---|---|---|---|---|
MAP<K, V> | 0x80 | Encode typeK | Encode typeV | INT32 type numberN, tye number of key-value pairs | encodeK andV type consts alternately | Map of consts,K andV is the type of key and value.Not implemented yet |
The encoding of two-bytes operators is as follows (The template-likeT is any of theData Types)
| Operator | 1st Byte | Higher 4 Bits of 2nd Byte | Lower 4 Bits of 2nd Byte | Immediate Number | Description |
|---|---|---|---|---|---|
POS<T> | 0x81 | 0x0 | Encode typeT | None | Unary+ |
SUM<T> | 0x81 | 0x1 | Encode typeT | INT32 type value, the number of parameters | VaridiacSUM.Not implemented yet |
NEG<T> | 0x82 | 0x0 | Encode typeT | None | Unary- |
ADD<T> | 0x83 | 0x0 | Encode typeT | None | Binary+ |
SUB<T> | 0x84 | 0x0 | Encode typeT | None | Binary- |
MUL<T> | 0x85 | 0x0 | Encode typeT | None | Binary* |
DIV<T> | 0x86 | 0x0 | Encode typeT | None | Binary/ |
MOD<T> | 0x87 | 0x0 | Encode typeT | None | Binary% |
EQ<T> | 0x91 | 0x0 | Encode typeT | None | Binary== |
GE<T> | 0x92 | 0x0 | Encode typeT | None | Binary>= |
GT<T> | 0x93 | 0x0 | Encode typeT | None | Binary> |
LE<T> | 0x94 | 0x0 | Encode typeT | None | Binary<= |
LT<T> | 0x95 | 0x0 | Encode typeT | None | Binary< |
NE<T> | 0x96 | 0x0 | Encode typeT | None | Binary<> |
IS_NULL<T> | 0xA1 | 0x0 | Encode typeT | None | UnaryIS_NULL function |
IS_TRUE<T> | 0xA2 | 0x0 | Encode typeT | None | UnaryIS_TRUE function |
IS_FALSE<T> | 0xA3 | 0x0 | Encode typeT | None | UnaryIS_FALSE function |
MIN<T> | 0xB1 | 0x0 | Encode typeT | None | BinaryMIN function |
VARG_MIN<T> | 0xB1 | 0x1 | Encode typeT | INT32 type value, the number of parameters | VariadicMIN function |
MAX<T> | 0xB2 | 0x0 | Encode typeT | None | BinaryMAX function |
VARG_MAX<T> | 0xB2 | 0x1 | Encode typeT | INT32 type value, the number of parameters | VariadicMAX function |
ABS<T> | 0xB3 | 0x0 | Encode typeT | None | UnaryABS function |
CAST<D, T> | 0xF0 | Encode typeD | Encode typeT | None | Casting fromT type toD type |
The coding of functions is as follows
| Operator | 1st Byte | 2nd Byte | Immediate Number | Description |
|---|---|---|---|---|
FUN | 0xF1 | 1-byte positive integer, the sequence number of the function | None | Pre-defined functions |
VARG_FUN | 0xF2 | 1-byte positive integer, the sequence number of the function | INT32 type value, the number of parameters | Pre-defined varidiac functions.Not implemented yet |
| Function | Type of Parameters | Type of Return Value | Sequence Number | Description |
|---|---|---|---|---|
CEIL | DOUBLE | DOUBLE | 0x01 | Round up to the smallest integer |
FLOOR | DOUBLE | DOUBLE | 0x02 | Round down to the largest integer |
ROUND | INT64,INT32 | INT64 | 0x03 | Round.Not implemented yet |
ROUND | DOUBLE,INT32 | DOUBLE | 0x04 | Round.Not implemented yet |
POW | DOUBLE | DOUBLE | 0x05 | Power.Not implemented yet |
POW | INT64 | INT64 | 0x06 | Power.Not implemented yet |
SIN | DOUBLE | DOUBLE | 0x07 | Sine |
COS | DOUBLE | DOUBLE | 0x08 | Cosine |
TAN | DOUBLE | DOUBLE | 0x09 | Tangent |
ASIN | DOUBLE | DOUBLE | 0x0A | Arc sine |
ACOS | DOUBLE | DOUBLE | 0x0B | Arc cosine |
ATAN | DOUBLE | DOUBLE | 0x0C | Arc tangent |
SINH | DOUBLE | DOUBLE | 0x0D | Hyperbolic sine |
COSH | DOUBLE | DOUBLE | 0x0E | Hyperbolic cosine |
TANH | DOUBLE | DOUBLE | 0x0F | Hyperbolic tangent |
EXP | DOUBLE | DOUBLE | 0x10 | Exponential |
LOG | DOUBLE | DOUBLE | 0x11 | Logarithm |
| Function | Type of Parameters | Type of Return Value | Sequence Number | Description |
|---|---|---|---|---|
CHAR_LENGTH | STRING | INT32 | 0x20 | Return the length of a string.Not implemented yet |
CONCAT | STRING,STRING | STRING | 0x21 | Concatenate two strings |
LOWER | STRING | STRING | 0x22 | Converts to lowercase |
UPPER | STRING | STRING | 0x23 | Converts to uppercase |
LEFT | STRING,INT32 | STRING | 0x24 | Sub string from left |
RIGHT | STRING,INT32 | STRING | 0x25 | Sub string from right |
TRIM | STRING | STRING | 0x26 | Trim whitespaces |
TRIM | STRING,STRING | STRING | 0x27 | Trim specified string.Not implemented yet |
LTRIM | STRING | STRING | 0x28 | Trim whitespaces from left |
LTRIM | STRING,STRING | STRING | 0x29 | Trim specified string from left.Not implemented yet |
RTRIM | STRING | STRING | 0x2A | Trim whitespaces from right |
RTRIM | STRING,STRING | STRING | 0x2B | Trime specified string from right.Not implemented yet |
SUBSTR | STRING,INT32,INT32 | STRING | 0x2C | Sub string from a "position" to another "position". The 1st "position" is0 |
SUBSTR | STRING,INT32 | STRING | 0x2D | Sub string from a "position" to the end. The 1st "position" is0 |
MID | STRING,INT32,INT32 | STRING | 0x2E | Sub string from a "position" to the specified length. The 1st "position" is1 |
MID | STRING,INT32 | STRING | 0x2F | Sub string from a "position" to the end. The 1st "position" is1 |
REPEAT | STRING,INT32 | STRING | 0x30 | Repeat the string.Not implemented yet |
REVERSE | STRING | STRING | 0x31 | Reverse the string.Not implemented yet |
REPLACE | STRING,STRING,STRING | STRING | 0x32 | Search and replace.Not implemented yet |
LOCATE | STRING,STRING,INT32 | INT32 | 0x33 | Find the "position" of a string in another string. The 1st "position" is1.Not implemented yet |
LOCATE | STRING,STRING | INT32 | 0x34 | Find the "position" of a string in another string. The 1st "position" is1.Not implemented yet |
FORMAT | DOUBLE,INT32 | STRING | 0x35 | Formatted output of a number.Not implemented yet |
Aggregation functions are used only in relational algebra. Complicated aggregations such asAVG is not supported here. Actually,AVG can be converted toSUM andCOUNT with a projection before encoded.
Encoding of aggregations does not follow the way to encode normal function and has no leading0xF1 byte, but is more like that of one-byte operators, as in the following table
| Aggregation | Type of Parameter | Type of Return Value | Higher 4 Bits | Low 4 Bits | Immediate Number | Description |
|---|---|---|---|---|---|---|
COUNT_ALL | None | INT64 | 0x1 | 0x0 | None | Count all rows |
COUNT<T> | T | INT64 | 0x1 | Encode typeT | INT32 type value, the column index | Count only non-null values |
SUM<T> | T | T | 0x2 | Encode typeT | INT32 type value, the column index | Sum the values |
MAX<T> | T | T | 0x3 | Encode typeT | INT32 type value, the column index | Maximum of the values |
MIN<T> | T | T | 0x4 | Encode typeT | INT32 type values, the column index | Minimum of the values |
Note:
- The column indices are all started from
0 - Some aggregation functions are requried to return
0if there are no input rows, such asCOUNT. For simplicity,NULLis returned for all aggregation functions here and it is in the final reducing stage to convertNULLto0
Only "simple" (the operators have only one single input and one output) relational algebra expressions are supported. These operators are chained with one's output connected to another's input to form a whole relational algebra expression. They are encoded one by one from the input end to the output end.
The encoding of each algebra operators contains an unique "leading byte", a "payload" for the parameters and expressions in the operator and a "trailing byte" which is alwaysEOE (End Of Expression). The details are in the following table
| Operator | Leading Byte | Payload | Trailing Byte |
|---|---|---|---|
| Filter | 0x71 | Encode the filter expression | EOE |
| Project | 0x72 | Encode the project expression one by one | EOE |
| Grouped Aggregation | 0x73 | Encode group indices asARRAY<INT32> type, then the list of aggregation functions asARRAY<AGG> type | EOE |
| Ungrouped Aggregation | 0x74 | Encode the list of aggregation functions asARRAY<AGG> type | EOE |
A "Project" operator may contains several expressions but they can be concatenated into one "huge" expression without any separator simplify the evaluating process. The "huge" expression is decoded by oneRunner, and after evaluating there will be several results left in the operand stack just as needed. These results can be taken out by multiple calls toGet method.
About
Cpp engine of DingoDB Expression.
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors3
Uh oh!
There was an error while loading.Please reload this page.