Source column transformation#
From version 22.07.19.3 onwards, Replicant allows you to transform the data of source tables before it reaches the target.
Overview#
To achieve source column transformation, Replicant uses a configuration file that contains transformation logic for each individual table. As Replicant processes data from source tables, Replicant applies the transformation rules to the data, and then loads the transformed data into the destination tables. The column on the destination could either be a new column, or a source column with transformed values.
Source column transformation solves several business cases for data migration, while making it possible to build new features like data encryption and obfuscation.
Supported platforms#
Source column transformation is supported for the following pipelines:
Transformation configuration file#
The transformation conifiguration file contains all the transformation logic for each table. You can access some sample configuration files inside theconf/transformation/ directory of yourReplicant self-hosted CLI download folder.
About YAML syntax for transformation configuration file#
Transformation configuration file uses YAML syntax. If you’re new to YAML and want to learn more, seeLearn YAML in Y minutes.
You can access some sample transformation configuration inside theconf/transformation/ directory of yourReplicant self-hosted CLI download folder.
type#
The type of transformation. For source column transformation, set it toCOLUMN.
enable#
{true|false}.
Set this totrue to enable transformation.
per-table-config#
Useper-table-config to specify the the following:
- Thecatalog andschema
- Thetables under the catalog and schema
- Theoperations
For example, the following configuration specifies catalogtpch and schemapublic, and two tablesnation andlineitem.
type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:...lineitem:...per-table-config.catalog#
Specify the database catalog. Don’t specify this parameter if the source database doesn’t support it.
type:COLUMNenable:trueper-table-config: -catalog:tpchper-table-config.schema#
Specify the database schema. Don’t specify this parameter if the source database doesn’t support it.
type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publicper-table-config.tables#
Useper-table-config.tables to define the specific tables to apply transformation to and the transformation logic. You can define transformation logic for single or multiple tables, and have different transformation logic for each table. For example, the following defines transformation logic for two tablesnation andlineitem that both belong to catalogtpch and schemapublic.
type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-column:Dmod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"computed-column-key-type:SHARDKEYlineitem:operations: -modulo:enable:truesource-column:l_orderkeymod-by-column:l_partkeycomputed-column:l_orderkey1computed-column-datatype:"DECIMAL(15, 2)"per-table-config.tables.<TABLE_NAME>.operations#
Useper-table-config.tables.<TABLE_NAME>.operations to define the operations to perform on source column for the transformation.Arcion supports the following operation types:
concatmodulotrim
concat- Compute a column by concatenating source column and
concat.concat-by-columns.concat.enable#{true|false}.Enable or disable concatenation.
concat.source-column#The source column name.
concat.concat-by-columns#An array of string values and column names that you want to concatenate with
concat.source-column. Replicant performs concatenation in the same order as you specify in this list. For example, the following example concatenates the string"_"and the columnQto source columnp.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -concat:enable:truesource-column:pconcat-by-columns: ["_","column:Q"]concat.computed-column#The computed column that results from the concatenation. For example, in the following sample, the computed column
Sresults from the concatenationP + "_" + Q.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -concat:enable:truesource-column:Pconcat-by-columns: ["_","column:Q"]computed-column:Sconcat.computed-column-datatype#Specifies data type for the computed column. If
null, Replicant uses the data type ofsource-column.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -concat:enable:truesource-column:Pconcat-by-columns: ["_","column:Q"]computed-column:Scomputed-column-datatype:"" modulo- Compute a column by performing modulo operation on source column by either
modulo.mod-by-valueormodulo.mod-by-column.modulo.enable#{true|false}.Enable or disable modulo operation.
modulo.source-column#The source column name.
modulo.{mod-by-column|mod-by-value}#Use
modulo.mod-by-columnto specify the numeric value column to use to calculate the modulo of source column.Usemodulo.mod-by-valueto specify the numeric value to use to calculate the modulo of source column.Use either
modulo.mod-by-columnormodulo.mod-by-valuebut not both. In both bases, make sure that the data type is the same asmodulo.source-column.Example: Use
modulo.mod-by-column#In the following example, the modulo operation for the computed column
Freads asF = A % D.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-column:Dcomputed-column:FExample: Use
modulo.mod-by-value#In the following example, the modulo operation for the computed column
Freads asF = A % 5.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fmodulo.computed-column#The computed column that results from the modulo operation. This can be a new column that Replicant creates on the target, or the same as the source column, or any other column in the same table.
For example, in the following sample, the computed column
Fresults from the modulo operationA % 5.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fmodulo.computed-column-datatype#Specify the data type for the computed column. If
null, Replicant uses the data type ofsource-column. If you specifymodulo.computed-column-datatype, then Replicant tries to convert the operation result into that data type. The conversion might fail due to compatibility.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"modulo.computed-column-keytype#This parameter supports only
SHARDKEYas value. Specify this parameter to usemodulo.computed-columnas shard key for the destination table.In the following example, Replicant uses the computed column
Fas the shard key column on destination database.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"computed-column-key-type:SHARDKEY trim- Trim whitespaces from start and end of a source column value.
trim.enable#{true|false}.Enable or disable trim operation.
trim.source-column#The source column name.
trim.computed-column#The computed column that results from the trim operation. For example, in the following sample, if
X = "name ", thenY = "name".type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -trim:enable:truesource-column:Xcomputed-column:Ytrim.computed-column-datatype#Specify the data type for the computed column. If
null, Replicant uses the data type ofsource-column.type:COLUMNenable:trueper-table-config: -catalog:tpchschema:publictables:nation:operations: -trim:enable:truesource-column:Xcomputed-column:Ycomputed-column-datatype:""
Note:
- Computed column can be the same as source column, new column, or any other column in same table.
- If you don’t specify
computed-column-datatypeandcomputed-columndoes not exist in source table, then Replicant uses the data type ofsource-column.- If
computed-columnexists in source table, then Replicant uses the datatype ofsource-column.- If you specify
computed-column-datatype, then Replicant tries its best to covert operation result into that data type. The conversion might fail due to compatibility issue.We recommend that you do not use
computed-column-datatype. In that case, Replicant can fall back to using the data type ofsource-column.
Limitations#
- Column can be a part of
source-column,computed-columnin only one operation. For example, if columnFis part of modulo operation as asource-columnorcomputed-column, then you can’t use columnFin any other operation. - Certain operations support specific data types only. For example, the
modulooperation only works with numeric types. - Only supported for
snapshot,realtimeandfullmode.
Running Replicant#
To use source column transformation, run Replicant with the--transform argument and provide it the location to the transformation configuration file. For example:
./bin/replicant snapshot\ conf/conn/source_database_name_src.yaml\ conf/conn/target_database_name_dst.yaml\ --extractor conf/src/source_database_name.yaml\ --applier conf/dst/target_database_name.yaml\ --filter filter/source_database_name_filter.yaml\ --transform conf/transformation/column_transform.yaml\ --id repl2 --replace --overwrite