
Destination Azure Databricks#

On this page, you’ll find step-by-step instructions on how to set up your Azure Databricks instance as the data target with Arcion. The extracted replicant-cli directory will be referred to as $REPLICANT_HOME in the following steps.

Prerequisites#

To create a storage account, see Create a storage account to use with Azure Data Lake Storage Gen2. To create a container, see Create a container.

I. Create a Databricks cluster#

Arcion supports both Databricks all-purpose clusters and SQL warehouses. The following sections describe how to set up each type of compute.

Set up all-purpose cluster#

If you want to connect to a Databricks all-purpose cluster, follow these instructions:

  1. Log in to your Databricks workspace.
  2. From the Databricks console, go to Data Science & Engineering > Compute > Create Compute.
  3. Enter a name for your cluster.
  4. Select the latest Databricks runtime version.
  5. Set up an external stage.
  6. Click Create Cluster.

Get connection details for a cluster#

To establish a connection between your Databricks instance and Arcion, you need to provide the connection details for your cluster. You provide these connection details to Replicant using the connection configuration file. To retrieve the connection details, follow these steps after you set up a cluster:

  1. Click the Advanced Options toggle.
  2. Click the JDBC/ODBC tab and take note of the following values:
    • Server Hostname
    • Port
    • JDBC URL
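
For reference, these values typically have the following shape. The hostname, workspace ID, and cluster ID below are placeholders, and the exact jdbc: prefix depends on the version of the JDBC driver you use:

    Server Hostname: adb-1234567890123456.7.azuredatabricks.net
    Port: 443
    JDBC URL: jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh;AuthMech=3;UID=token;PWD=<personal-access-token>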

Set up SQL warehouse#

If you want to connect to a SQL warehouse (SQL Compute), follow these instructions:

  1. Log in to your Databricks workspace.
  2. From the Databricks console, go to SQL > Review SQL warehouses > Create SQL Warehouse.
  3. Enter a name for your SQL warehouse.
  4. Choose a cluster size.
  5. Set up an external stage.
  6. Click Create.

Get connection details for a SQL warehouse#

To establish a connection between your Databricks instance and Arcion, you need to provide the connection details for your SQL warehouse. You provide these connection details to Replicant using the connection configuration file. To retrieve the connection details, follow these steps after you set up a SQL warehouse:

  1. Navigate to the Connection Details tab.

  2. Take note of the following values:

    • Server Hostname
    • Port
    • JDBC URL
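
The values for a SQL warehouse follow the same pattern, except that the HTTP path points at the warehouse. For example, with a placeholder hostname and warehouse ID:

    Server Hostname: adb-1234567890123456.7.azuredatabricks.net
    Port: 443
    JDBC URL: jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/abcdef1234567890;AuthMech=3;UID=token;PWD=<personal-access-token>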

II. Create a personal access token for the Databricks cluster#

To create a personal access token, see Generate a personal access token in the Databricks documentation. You need this token to configure Arcion Replicant for replication.

III. Configure ADLS container as stage#

To grant Databricks access to ADLS, follow one of these methods:

The preceding resources use Python to grant access. Instead of Python, you can use Spark configuration properties to access data in your Azure storage account.

Spark configuration for cluster#

  1. On the cluster configuration page, click the Advanced Options toggle.
  2. Click the Spark tab.
  3. In the Spark Config textbox, enter your configuration properties.

Spark configuration for SQL warehouse#

  1. Click your username in the top bar of the Databricks workspace and select Admin Console from the dropdown.
  2. Click the SQL Warehouse Settings tab.
  3. In the Data Access Configuration textbox, enter your configuration properties.

For example, to access data in an Azure storage account using a storage account key, enter the following Spark configuration:

fs.azure.account.key.STORAGE_ACCOUNT.dfs.core.windows.net STORAGE_ACCOUNT_KEY

Replace the following:

  • STORAGE_ACCOUNT: the name of your Azure storage account.
  • STORAGE_ACCOUNT_KEY: the access key of that storage account.
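
If you use SAS authentication instead of a storage account key, you can also supply a fixed SAS token through Spark configuration. The following is a sketch based on the Hadoop ABFS properties; STORAGE_ACCOUNT and SAS_TOKEN are placeholders, and the exact property set can vary with your Databricks Runtime version:

fs.azure.account.auth.type.STORAGE_ACCOUNT.dfs.core.windows.net SAS
fs.azure.sas.token.provider.type.STORAGE_ACCOUNT.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider
fs.azure.sas.fixed.token.STORAGE_ACCOUNT.dfs.core.windows.net SAS_TOKEN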

IV. Obtain the JDBC Driver for Databricks#

Replicant requires the Databricks JDBC Driver as a dependency. To obtain the appropriate driver, follow these steps:

  1. Go to the Databricks JDBC Driver download page and download the driver.
  2. From the downloaded ZIP, locate and extract the DatabricksJDBC42.jar file.
  3. Put the DatabricksJDBC42.jar file inside the $REPLICANT_HOME/lib directory.

V. Configure Replicant connection for Databricks#

In this step, you need to provide the Databricks connection details to Arcion. To do so, follow these steps:

  1. You can find a sample connection configuration file databricks.yaml in the $REPLICANT_HOME/conf/conn/ directory.

  2. The connection configuration file has the following two parts:

    • Parameters related to target Databricks server connection.
    • Parameters related to stage configuration.

    Parameters related to target Databricks server connection#

    Note: All communication with Databricks happens over port 443, the standard port for HTTPS, so all data is encrypted with SSL by default.
    You can store your connection credentials in a secrets management service and tell Replicant to retrieve the credentials. For more information, see Secrets management. Otherwise, you can put your credentials in plain form as in the following sample:

    type: DATABRICKS_DELTALAKE

    url: "JDBC_URL"
    username: USERNAME
    password: "PASSWORD"
    host: "HOSTNAME"
    port: "PORT_NUMBER"

    max-connections: 30

    Replace the following:

    • JDBC_URL: the JDBC URL you noted in the connection details.
    • USERNAME: the username that connects to your Databricks server.
    • PASSWORD: the password or personal access token associated with USERNAME.
    • HOSTNAME: the Server Hostname you noted in the connection details.
    • PORT_NUMBER: the Port you noted in the connection details.

    Change the value of max-connections as needed. It specifies the maximum number of connections Replicant can open in Databricks.
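
    As a hypothetical example, a filled-in configuration could look like the following. The hostname, JDBC URL, and token below are placeholders; a common setup uses the personal access token from step II as the password:

    type: DATABRICKS_DELTALAKE

    url: "jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh;AuthMech=3;UID=token;PWD=<personal-access-token>"
    username: USERNAME
    password: "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    host: "adb-1234567890123456.7.azuredatabricks.net"
    port: "443"

    max-connections: 30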

    For Databricks Unity Catalog, set the connection type to DATABRICKS_LAKEHOUSE. For more information, see Databricks Unity Catalog Support.

    Parameters related to stage configuration#

    You must use an external stage to hold the data files and load the data into the target database from there. The stage section contains the details Replicant needs to connect to and use a specific stage.

    • type[v21.06.14.1]: The stage type. For Azure Legacy Databricks, set type to AZURE.
    • root-dir: The directory under the ADLS container. Replicant uses this directory to stage bulk-load files.
    • conn-url[v21.06.14.1]: The name of the ADLS container.
    • account-name[v21.06.14.1]: The name of the ADLS storage account. account-name corresponds to the same storage account in the Configure ADLS container as stage section.
    • secret-key[v21.06.14.1]: If you want to authenticate ADLS using a storage account key, specify your storage account key here.
    • sas-token: If you use a shared access signature (SAS) token to authenticate ADLS, specify the SAS token here.

    The following illustrates two sample stage configurations for Azure Databricks. One sample uses a storage account key and the other uses a SAS token for authentication.

    stage:
      type: AZURE
      root-dir: "replicate-stage/databricks-stage"
      conn-url: "replicant-container"
      account-name: "replicant-storageaccount"
      secret-key: "YOUR_STORAGE_ACCOUNT_KEY"

    stage:
      type: AZURE
      root-dir: "replicate-stage/databricks-stage"
      conn-url: "replicant-container"
      account-name: "replicant-storageaccount"
      sas-token: "YOUR_SAS_TOKEN"

VI. Configure mapper file (optional)#

If you want to define data mapping from your source to Azure Databricks, specify the mapping rules in the mapper file. For more information on how to define the mapping rules and run Replicant CLI with the mapper file, see Mapper configuration and Mapper configuration in Databricks.

VII. Set up Applier configuration#

  1. From$REPLICANT_HOME, navigate to the applier configuration file:

    vi conf/dst/databricks.yaml
  2. The configuration file has two parts:

    • Parameters related to snapshot mode.
    • Parameters related to realtime mode.

    Parameters related to snapshot mode#

    For snapshot mode, make the necessary changes as follows:

    snapshot:
      threads: 16 # Maximum number of threads Replicant should use for writing to the target

      # If bulk-load is used, Replicant will use the native bulk-loading capabilities of the target database
      bulk-load:
        enable: true
        type: FILE
        serialize: true|false # Set to true if you want the generated files to be applied in serial/parallel fashion

    There are some additional parameters available that you can use in snapshot mode:

    snapshot:
      enable-optimize-write: true
      enable-auto-compact: true
      enable-unmanaged-delta-table: false
      unmanaged-delta-table-location:
      init-sk: false
      per-table-config:
        init-sk: false
        shard-key:
        enable-optimize-write: true
        enable-auto-compact: true
        enable-unmanaged-delta-table: false
        unmanaged-delta-table-location:

    These parameters are specific to Databricks as the destination. More details about these parameters follow:

    • enable-optimize-write: Databricks dynamically optimizes Apache Spark partition sizes based on the actual data, and attempts to write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset characteristics.

      Default: By default, this parameter is set totrue.

    • enable-auto-compact: After an individual write, Databricks checks if files can be compacted further. If so, it runs an OPTIMIZE job to further compact files for partitions that have the most number of small files. The job is run with 128 MB file sizes instead of the 1 GB file size used in the standard OPTIMIZE.

      Default: By default, this parameter is set totrue.

    • enable-unmanaged-delta-table: An unmanaged table is a Spark SQL table for which Spark manages only the metadata. The data is stored in the path provided by the user. So when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

      Default: By default, this parameter is set tofalse.

    • unmanaged-delta-table-location: The path where data for the unmanaged table is to be stored. It can be a Databricks DBFS path (for example, FileStore/tables), or an S3 path (for example, s3://replicate-stage/unmanaged-table-data) where the S3 bucket is accessible to Databricks.

    • init-sk: Replicant represents the partition key on the source table as a shard key. By default, the target table does not include this sharding information. If init-sk is true, Replicant adds the shard key/partition key to the target table's CREATE SQL. Shard-key replication is disabled by default because DML replication with partitioned tables in Databricks is very slow if the partition key has a high distinct count.

      Default: By default, this parameter is set tofalse.

    • per-table-config: This configuration allows you to specify various properties for target tables on a per-table basis, as shown in the sketch after this list.

      • init-sk: Replicant represents the partition key on the source table as a shard key. By default, the target table does not include this sharding information. If init-sk is true, Replicant adds the shard key/partition key to the target table's CREATE SQL. Shard-key replication is disabled by default because DML replication with partitioned tables in Databricks is very slow if the partition key has a high distinct count.

        Default: By default, this parameter is set tofalse.

      • shard-key: Shard key to be used for partitioning the target table.

      • enable-optimize-write: Databricks dynamically optimizes Apache Spark partition sizes based on the actual data, and attempts to write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset characteristics.

        Default: By default, this parameter is set totrue.

      • enable-auto-compact: After an individual write, Databricks checks if files can be compacted further. If so, it runs an OPTIMIZE job to further compact files for partitions that have the most number of small files. The job is run with 128 MB file sizes instead of the 1 GB file size used in the standard OPTIMIZE.

        Default: By default, this parameter is set totrue.

      • enable-unmanaged-delta-table: An unmanaged table is a Spark SQL table for which Spark manages only the metadata. The data is stored in the path provided by the user. So when you perform DROP TABLE <example-table>, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

        Default: By default, this parameter is set tofalse.

      • unmanaged-delta-table-location: The path where data for the unmanaged table is to be stored. It can be a Databricks DBFS path (for example, FileStore/tables), or an S3 path (for example, s3://replicate-stage/unmanaged-table-data) where the S3 bucket is accessible to Databricks.
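
    As an illustration only, a per-table-config entry in the snapshot section could look like the following sketch. The catalog, schema, table, and shard-key names are hypothetical, and the exact nesting of per-table-config should be confirmed against the Applier Reference:

    snapshot:
      per-table-config:
      - catalog: tpch            # hypothetical catalog name
        schema: public           # hypothetical schema name
        tables:
          lineitem:              # hypothetical table name
            shard-key: l_shipdate
            init-sk: true
            enable-unmanaged-delta-table: true
            unmanaged-delta-table-location: "FileStore/tables/lineitem"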

    Parameters related to realtime mode#

    If you want to operate in realtime mode, you can use the realtime section to specify your configuration. For example:

    realtime:
      threads: 4
      txn-size-rows: 100000
      _traceDBTasks: true
      skip-tables-on-failures: false
      enable-dependency-tracking: true
      reuse-temp-table: true

    Additional parameters#

    reuse-temp-table

    true or false.

    Enables reusing temporary tables instead of creating and dropping each time the Applier ingests CDC operations.

    The Applier creates a temporary table using the target table schema to ingest CDC operations into the target table. Creating and dropping temporary tables on each CDC replay can slow down CDC ingestion. Enabling this parameter makes CDC ingestion faster by telling the Applier to reuse the temporary tables.

    Default:true.

    parallel-transaction[v23.06.30.2]

    This configuration enables parallel batch processing, which handles large transactions by splitting them into multiple batches and processing the batches concurrently. This speeds up real-time replication for large transactions and improves overall performance. This feature is available for both Legacy Databricks and Unity Catalog.

    The following configuration options are available:

    • enable: Boolean {true|false}. Enables parallel batch processing.

      Default: true when txn-size-rows is greater than or equal to 2_000_000.

    • txn-rows-threshold: Integer. Sets the threshold limit for a transaction to qualify for splitting. If the transaction size hits this threshold limit, the Applier thread splits the transaction into multiple batches for parallel processing.

      Default: 1_000_000.

    • max-chunks-per-table: Integer. Sets the number of batches to split a transaction into for parallel processing.

      Default: 5.

    • threads: Integer. Sets the number of threads responsible for parallel batch processing. For example, if the maximum number of chunks for each table is 5 and threads is set to 10, the Applier processes only two transactions concurrently at a time. If new transactions come in simultaneously, the main Applier threads process them serially in one batch.

      Default: The number of processors available to the JVM.
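
    For example, assuming parallel-transaction nests under the realtime section like the other Applier parameters, a configuration that tunes parallel batch processing might look like this sketch:

    realtime:
      threads: 4
      txn-size-rows: 2000000
      parallel-transaction:
        enable: true
        txn-rows-threshold: 1000000
        max-chunks-per-table: 5
        threads: 8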

    Enable Type-2 CDC#

    From version 22.07.19.3 onwards, Arcion supports Type-2 CDC for Databricks as the target. For more information, see Type-2 CDC and cdc-metadata-type.

For a detailed explanation of configuration parameters in the Applier file, read Applier Reference.

Databricks Unity Catalog support#

From version 22.08.31.3 onwards, Arcion supports Databricks Unity Catalog.

As of now, note the following about the state of Arcion’s Unity Catalog support:

