Movatterモバイル変換


[0]ホーム

URL:


✨ As of November 2023, Arcion has become a part of Databricks.Learn more here
Source Column Transformation

Source column transformation#

From version 22.07.19.3 onwards, Replicant allows you to transform the data of source tables before it reaches the target.

Overview#

To achieve source column transformation, Replicant uses a configuration file that contains transformation logic for each individual table. As Replicant processes data from source tables, Replicant applies the transformation rules to the data, and then loads the transformed data into the destination tables. The column on the destination could either be a new column, or a source column with transformed values.

Source column transformation solves several business cases for data migration, while making it possible to build new features like data encryption and obfuscation.

Supported platforms#

Source column transformation is supported for the following pipelines:

Transformation configuration file#

The transformation conifiguration file contains all the transformation logic for each table. You can access some sample configuration files inside theconf/transformation/ directory of yourReplicant self-hosted CLI download folder.

About YAML syntax for transformation configuration file#

Transformation configuration file uses YAML syntax. If you’re new to YAML and want to learn more, seeLearn YAML in Y minutes.

You can access some sample transformation configuration inside theconf/transformation/ directory of yourReplicant self-hosted CLI download folder.

type#

The type of transformation. For source column transformation, set it toCOLUMN.

enable#

{true|false}.

Set this totrue to enable transformation.

per-table-config#

Useper-table-config to specify the the following:

For example, the following configuration specifies catalogtpch and schemapublic, and two tablesnation andlineitem.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:...lineitem:...

per-table-config.catalog#

Specify the database catalog. Don’t specify this parameter if the source database doesn’t support it.

type:COLUMNenable:trueper-table-config:  -catalog:tpch

per-table-config.schema#

Specify the database schema. Don’t specify this parameter if the source database doesn’t support it.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:public

per-table-config.tables#

Useper-table-config.tables to define the specific tables to apply transformation to and the transformation logic. You can define transformation logic for single or multiple tables, and have different transformation logic for each table. For example, the following defines transformation logic for two tablesnation andlineitem that both belong to catalogtpch and schemapublic.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-column:Dmod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"computed-column-key-type:SHARDKEYlineitem:operations:          -modulo:enable:truesource-column:l_orderkeymod-by-column:l_partkeycomputed-column:l_orderkey1computed-column-datatype:"DECIMAL(15, 2)"

per-table-config.tables.<TABLE_NAME>.operations#

Useper-table-config.tables.<TABLE_NAME>.operations to define the operations to perform on source column for the transformation.Arcion supports the following operation types:

concat
Compute a column by concatenating source column andconcat.concat-by-columns.

concat.enable#

{true|false}.

Enable or disable concatenation.

concat.source-column#

The source column name.

concat.concat-by-columns#

An array of string values and column names that you want to concatenate withconcat.source-column. Replicant performs concatenation in the same order as you specify in this list. For example, the following example concatenates the string"_" and the columnQ to source columnp.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -concat:enable:truesource-column:pconcat-by-columns: ["_","column:Q"]

concat.computed-column#

The computed column that results from the concatenation. For example, in the following sample, the computed columnS results from the concatenationP + "_" + Q.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -concat:enable:truesource-column:Pconcat-by-columns: ["_","column:Q"]computed-column:S

concat.computed-column-datatype#

Specifies data type for the computed column. Ifnull, Replicant uses the data type ofsource-column.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -concat:enable:truesource-column:Pconcat-by-columns: ["_","column:Q"]computed-column:Scomputed-column-datatype:""
modulo
Compute a column by performing modulo operation on source column by eithermodulo.mod-by-value ormodulo.mod-by-column.

modulo.enable#

{true|false}.

Enable or disable modulo operation.

modulo.source-column#

The source column name.

modulo.{mod-by-column|mod-by-value}#

Usemodulo.mod-by-column to specify the numeric value column to use to calculate the modulo of source column.Usemodulo.mod-by-value to specify the numeric value to use to calculate the modulo of source column.

Use eithermodulo.mod-by-column ormodulo.mod-by-value but not both. In both bases, make sure that the data type is the same asmodulo.source-column.

Example: Usemodulo.mod-by-column#

In the following example, the modulo operation for the computed columnF reads asF = A % D.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-column:Dcomputed-column:F
Example: Usemodulo.mod-by-value#

In the following example, the modulo operation for the computed columnF reads asF = A % 5.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-value:5computed-column:F

modulo.computed-column#

The computed column that results from the modulo operation. This can be a new column that Replicant creates on the target, or the same as the source column, or any other column in the same table.

For example, in the following sample, the computed columnF results from the modulo operationA % 5.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-value:5computed-column:F

modulo.computed-column-datatype#

Specify the data type for the computed column. Ifnull, Replicant uses the data type ofsource-column. If you specifymodulo.computed-column-datatype, then Replicant tries to convert the operation result into that data type. The conversion might fail due to compatibility.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"

modulo.computed-column-keytype#

This parameter supports onlySHARDKEY as value. Specify this parameter to usemodulo.computed-column as shard key for the destination table.

In the following example, Replicant uses the computed columnF as the shard key column on destination database.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -modulo:enable:truesource-column:Amod-by-value:5computed-column:Fcomputed-column-datatype:"INTEGER"computed-column-key-type:SHARDKEY
trim
Trim whitespaces from start and end of a source column value.

trim.enable#

{true|false}.

Enable or disable trim operation.

trim.source-column#

The source column name.

trim.computed-column#

The computed column that results from the trim operation. For example, in the following sample, ifX = "name ", thenY = "name".

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -trim:enable:truesource-column:Xcomputed-column:Y

trim.computed-column-datatype#

Specify the data type for the computed column. Ifnull, Replicant uses the data type ofsource-column.

type:COLUMNenable:trueper-table-config:  -catalog:tpchschema:publictables:nation:operations:          -trim:enable:truesource-column:Xcomputed-column:Ycomputed-column-datatype:""

Note:

  • Computed column can be the same as source column, new column, or any other column in same table.
  • If you don’t specifycomputed-column-datatype andcomputed-column does not exist in source table, then Replicant uses the data type ofsource-column.
  • Ifcomputed-column exists in source table, then Replicant uses the datatype ofsource-column.
  • If you specifycomputed-column-datatype, then Replicant tries its best to covert operation result into that data type. The conversion might fail due to compatibility issue.

We recommend that you do not usecomputed-column-datatype. In that case, Replicant can fall back to using the data type ofsource-column.

Limitations#

Running Replicant#

To use source column transformation, run Replicant with the--transform argument and provide it the location to the transformation configuration file. For example:

./bin/replicant snapshot\ conf/conn/source_database_name_src.yaml\ conf/conn/target_database_name_dst.yaml\ --extractor conf/src/source_database_name.yaml\ --applier conf/dst/target_database_name.yaml\ --filter filter/source_database_name_filter.yaml\ --transform conf/transformation/column_transform.yaml\ --id repl2 --replace --overwrite

[8]ページ先頭

©2009-2026 Movatter.jp