Use data quality tasks
This document shows you how to create Dataplex Universal Catalog data quality tasks that let you schedule and run data quality checks for your built-in and external BigQuery tables.
For more information, see Data quality tasks overview.
Before you begin
This document assumes that you have an existing Dataplex Universal Catalog lake to create the data quality task in.
Enable Google APIs and services
Enable the Dataproc API.
Enable Private Google Access on the network and subnetwork that you plan to use with Dataplex Universal Catalog data quality tasks. If you don't specify a network or subnetwork when you create the Dataplex Universal Catalog data quality task, Dataplex Universal Catalog uses the default subnet. In that case, enable Private Google Access on the default subnet.
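If you prefer the command line, a minimal sketch of these two steps might look like the following. The subnet name default, the region, and the project are placeholders for your own values.

```bash
# Enable the Dataproc API in the current project.
gcloud services enable dataproc.googleapis.com

# Turn on Private Google Access for the subnet that the data quality task uses.
# Replace "default" and REGION with your subnet name and region.
gcloud compute networks subnets update default \
    --region=REGION \
    --enable-private-ip-google-access
```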
Create a specification file
Dataplex Universal Catalog uses open source CloudDQ as the driver program. Dataplex Universal Catalog data quality check requirements are defined within CloudDQ YAML specification files.
As input to the data quality task, you can have a single YAML file or a single zip archive containing one or more YAML files. It's recommended that you capture the data quality check requirements in separate YAML specification files, with one file for each section.
To prepare a specification file, do the following:
- Create one or more CloudDQ YAML specification files that define your data quality check requirements. For more information about the required syntax, see the About the specification file section of this document.
- Save the YAML specification file in .yml or .yaml format. If you create multiple YAML specification files, save all of the files in a single zip archive.
- Create a Cloud Storage bucket.
- Upload the specification file to the Cloud Storage bucket (see the example commands after this list).
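As an illustration, the following commands package multiple YAML files into a zip archive and upload it to a new Cloud Storage bucket. The file names, bucket name, and location are placeholders, not values required by CloudDQ.

```bash
# Package the specification files (skip this step if you use a single YAML file).
zip clouddq-configs.zip rules.yaml row_filters.yaml rule_bindings.yaml

# Create a bucket and upload the archive.
gcloud storage buckets create gs://my-clouddq-configs --location=us-central1
gcloud storage cp clouddq-configs.zip gs://my-clouddq-configs/
```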
About the specification file
Your CloudDQ YAML specification file needs to have the following sections:
- Rules (defined in the top-level rules YAML node): A list of rules to run. You can create these rules from predefined rule types, such as NOT_NULL and REGEX, or you can extend them with custom SQL statements such as CUSTOM_SQL_EXPR and CUSTOM_SQL_STATEMENT. The CUSTOM_SQL_EXPR statement flags any row where custom_sql_expr evaluated to False as a failure. The CUSTOM_SQL_STATEMENT statement flags any value returned by the whole statement as a failure.
- Row filters (defined in the top-level row_filters YAML node): SQL expressions returning a boolean value that define filters to fetch a subset of data from the underlying entity subject for validation.
- Rule bindings (defined in the top-level rule_bindings YAML node): Defines rules and rule filters to apply to the tables.
- Rule dimensions (defined in the rule_dimensions YAML node): Defines the allowed list of data quality rule dimensions that a rule can define in the corresponding dimension field. For example:

```yaml
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
```

The dimension field is optional for a rule. The rule dimensions section is mandatory if dimension is listed on any rule.
For more information, see the CloudDQ reference guide and the sample specification files.
Note: Due to a CloudDQ limitation, you cannot have more than one top node for each section (such as rules or rule_bindings) in a single YAML file, because CloudDQ only parses the last node of each type within a file.
Create a dataset to store the results
To store the results, create a BigQuery dataset.
The dataset must be in the same region as the tables that you run the data quality task on.
Dataplex Universal Catalog uses this dataset, and it creates or reuses a table of your choice to store the results.
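For example, you can create the dataset with the bq command-line tool. The dataset name, project, and location below are placeholders; use the same region as the tables that you validate.

```bash
# Create a BigQuery dataset for the data quality results in the same region
# as the tables being validated.
bq --location=us-central1 mk --dataset PROJECT_ID:clouddq_results
```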
Create a service account
Create a service account that has the following Identity and Access Management (IAM) roles and permissions (a sketch of the corresponding gcloud commands follows the list):
- Read access to the Cloud Storage path containing the YAML specifications. You can use the Storage Object Viewer role (roles/storage.objectViewer) on the Cloud Storage bucket.
- Read access to BigQuery datasets with data to be validated. You can use the BigQuery Data Viewer role (roles/bigquery.dataViewer).
- Write access to the BigQuery dataset to create a table (if necessary) and write the results to that table. You can use the BigQuery Data Editor role (roles/bigquery.dataEditor) at the dataset level.
- The BigQuery Job User role (roles/bigquery.jobUser) at the project level, in order to create BigQuery jobs in a project.
- The Dataplex Metadata Reader role (roles/dataplex.metadataReader) at the project or lake level.
- The Service Usage Consumer role (roles/serviceusage.serviceUsageConsumer) at the project level.
- The Dataproc Worker role (roles/dataproc.worker).
- The iam.serviceAccounts.actAs permission given to the user submitting the job.
- The Service Account User role granted to the Dataplex Universal Catalog lake service account. You can view the Dataplex Universal Catalog lake service account in the Google Cloud console.
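One way to set this up from the command line might look like the following sketch. The service account name, PROJECT_ID, and USER_EMAIL are placeholders, and the bucket-level and dataset-level grants (Storage Object Viewer, BigQuery Data Viewer, BigQuery Data Editor) aren't shown here.

```bash
# Create the service account (the name is illustrative).
gcloud iam service-accounts create clouddq-runner \
    --display-name="Dataplex data quality task runner"

SA="clouddq-runner@PROJECT_ID.iam.gserviceaccount.com"

# Grant the project-level roles listed above.
for ROLE in roles/bigquery.jobUser roles/dataplex.metadataReader \
            roles/serviceusage.serviceUsageConsumer roles/dataproc.worker; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:${SA}" --role="${ROLE}"
done

# Allow the user who submits the task to act as the service account
# (grants iam.serviceAccounts.actAs through the Service Account User role).
gcloud iam service-accounts add-iam-policy-binding "${SA}" \
    --member="user:USER_EMAIL" --role="roles/iam.serviceAccountUser"
```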
Use advanced settings
These steps are optional:
- BigQuery runs data quality checks in the current project by default. You can choose a different project to run the BigQuery jobs. Use the --gcp_project_id TASK_ARGS argument for the --execution-args property of the task.
- If the project ID specified to run BigQuery queries is different from the project in which the service account (specified by --execution-service-account) is created, ensure that the organization policy that disables cross-project service account usage (iam.disableCrossProjectServiceAccountUsage) is turned off. Also, ensure that the service account can access the BigQuery job schedule in the project where the BigQuery queries run.
Limitations
All the tables specified for a given data quality task must belong to the same Google Cloud region.
Schedule a data quality task
Console
- In the Google Cloud console, go to the Dataplex Universal Catalog Process page.
- Click Create task.
- On the Check Data Quality card, click Create task.
- For Dataplex lake, choose your lake.
- For ID, enter an ID.
- In the Data quality specification section, do the following:
  - In the Select GCS file field, click Browse.
  - Select your Cloud Storage bucket.
  - Click Select.
- In the Results table section, do the following:
  - In the Select BigQuery dataset field, click Browse.
  - Select the BigQuery dataset to store validation outcomes.
  - Click Select.
  - In the BigQuery table field, enter the name of the table to store the results. If the table doesn't exist, Dataplex Universal Catalog creates it for you. Don't use the name dq_summary because it is reserved for internal processing tasks.
- In the Service account section, select a service account from the User service account menu.
- Click Continue.
- In the Set schedule section, configure the schedule for running the data quality task.
- Click Create.
gcloud CLI
The following example runs a data quality task by using the Dataplex Universal Catalog tasks gcloud CLI command:
```bash
# The Cloud Storage path to your data quality YAML configurations.
export USER_CLOUDDQ_YAML_CONFIGS_GCS_PATH="USER_CLOUDDQ_YAML_CONFIGS_GCS_PATH"

# Google Cloud project where the Dataplex Universal Catalog task is created.
export GOOGLE_CLOUD_PROJECT="GOOGLE_CLOUD_PROJECT"

# Google Cloud region for the Dataplex Universal Catalog lake.
export DATAPLEX_REGION_ID="DATAPLEX_REGION_ID"

# Public Cloud Storage bucket containing the prebuilt data quality executable artifact.
# There is one bucket for each Google Cloud region.
export DATAPLEX_PUBLIC_GCS_BUCKET_NAME="dataplex-clouddq-artifacts-${DATAPLEX_REGION_ID}"

# The Dataplex Universal Catalog lake where your task is created.
export DATAPLEX_LAKE_ID="DATAPLEX_LAKE_ID"

# The service account used for running the task. Ensure that this service account
# has sufficient IAM permissions on your project, including BigQuery Data Editor,
# BigQuery Job User, Dataplex Universal Catalog Editor, Dataproc Worker, and
# Service Usage Consumer.
export DATAPLEX_TASK_SERVICE_ACCOUNT="DATAPLEX_TASK_SERVICE_ACCOUNT"

# If you want to use a different dataset for storing the intermediate data quality
# summary results and the BigQuery views associated with each rule binding,
# use the following:
export CLOUDDQ_BIGQUERY_DATASET="CLOUDDQ_BIGQUERY_DATASET"

# The BigQuery dataset where the final results of the data quality checks are stored.
# This could be the same as CLOUDDQ_BIGQUERY_DATASET.
export TARGET_BQ_DATASET="TARGET_BQ_DATASET"

# The BigQuery table where the final results of the data quality checks are stored.
export TARGET_BQ_TABLE="TARGET_BQ_TABLE"

# The unique identifier for the task.
export TASK_ID="TASK_ID"

gcloud dataplex tasks create \
    --location="${DATAPLEX_REGION_ID}" \
    --lake="${DATAPLEX_LAKE_ID}" \
    --trigger-type=ON_DEMAND \
    --execution-service-account="$DATAPLEX_TASK_SERVICE_ACCOUNT" \
    --spark-python-script-file="gs://${DATAPLEX_PUBLIC_GCS_BUCKET_NAME}/clouddq_pyspark_driver.py" \
    --spark-file-uris="gs://${DATAPLEX_PUBLIC_GCS_BUCKET_NAME}/clouddq-executable.zip","gs://${DATAPLEX_PUBLIC_GCS_BUCKET_NAME}/clouddq-executable.zip.hashsum","${USER_CLOUDDQ_YAML_CONFIGS_GCS_PATH}" \
    --execution-args=^::^TASK_ARGS="clouddq-executable.zip, ALL, ${USER_CLOUDDQ_YAML_CONFIGS_GCS_PATH}, --gcp_project_id='${GOOGLE_CLOUD_PROJECT}', --gcp_region_id='${DATAPLEX_REGION_ID}', --gcp_bq_dataset_id='${TARGET_BQ_DATASET}', --target_bigquery_summary_table='${GOOGLE_CLOUD_PROJECT}.${TARGET_BQ_DATASET}.${TARGET_BQ_TABLE}'," \
    "$TASK_ID"
```

| Parameter | Description |
|---|---|
| USER_CLOUDDQ_YAML_CONFIGS_GCS_PATH | The Cloud Storage path to your data quality YAML configurations input for the data quality task. You can have a single YAML file in .yml or .yaml format or a zip archive containing multiple YAML files. |
| GOOGLE_CLOUD_PROJECT | The Google Cloud project where the Dataplex Universal Catalog task and BigQuery jobs are created. |
| DATAPLEX_REGION_ID | The region of the Dataplex Universal Catalog lake where the data quality task is created. |
| SERVICE_ACCOUNT | The service account used for executing the task. Ensure that this service account has sufficient IAM permissions, as outlined in the Before you begin section. |
For --execution-args, the following arguments must be passed as positional arguments, in this order:
| Argument | Description |
|---|---|
| clouddq-executable.zip | A precompiled executable that was passed in spark-file-uris from a public Cloud Storage bucket. |
| ALL | Run all the rule bindings. Alternatively, you can provide specific rule bindings as a comma-separated list. For example, RULE_1,RULE_2. |
| gcp-project-id | Project ID that runs the BigQuery queries. |
| gcp-region-id | Region for running the BigQuery jobs for data quality validation. This region should be the same as the region for gcp-bq-dataset-id and target_bigquery_summary_table. |
| gcp-bq-dataset-id | BigQuery dataset that is used to store the rule_binding views and intermediate data quality summary results. |
| target-bigquery-summary-table | Table ID reference of the BigQuery table where the final results of the data quality checks are stored. Don't use the ID value dq_summary because it is reserved for internal processing tasks. |
| --summary_to_stdout | (Optional) When this flag is passed, all the validation result rows created in the dq_summary table in the last run are logged as JSON records to Cloud Logging and stdout. |
You can optionally pass --vpc-network-name or --vpc-sub-network-name. See how to set these arguments.
API
Replace the following:
```
PROJECT_ID = "Your Dataplex Universal Catalog Project ID"
REGION = "Your Dataplex Universal Catalog lake region"
LAKE_ID = "Your Dataplex Universal Catalog lake ID"
SERVICE_ACC = "Your service account used for reading the data"
DATAPLEX_TASK_ID = "Unique task ID for the data quality task"
BUCKET_NAME = "Your Cloud Storage bucket name containing the CloudDQ configs or YAML specification"
GCP_BQ_BILLING_PROJECT_ID = "Your BigQuery billing project"
GCP_BQ_REGION_ID = "Your BigQuery dataset region ID" # Optional
GCP_BQ_DATASET_ID = "Your BigQuery dataset to store the data quality summary results"
TARGET_TABLE_NAME = "Your target table name to store the results in the BigQuery dataset"
```
- Submit an HTTP POST request:
```
POST https://dataplex.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/lakes/${LAKE_ID}/tasks?task_id=${DATAPLEX_TASK_ID}

{
  "spark": {
    "python_script_file": "gs://dataplex-clouddq-artifacts-us-central1/clouddq_pyspark_driver.py",
    "file_uris": [
      "gs://dataplex-clouddq-artifacts-us-central1/clouddq-executable.zip",
      "gs://dataplex-clouddq-artifacts-us-central1/clouddq-executable.zip.hashsum",
      "gs://dataplex-clouddq-artifacts-us-central1/your-clouddq-configs.zip"
    ]
  },
  "execution_spec": {
    "args": {
      "TASK_ARGS": "clouddq-executable.zip, ALL, gs://BUCKET_NAME/your-clouddq-configs.zip, --gcp_project_id=${GCP_BQ_BILLING_PROJECT_ID}, --gcp_region_id=${GCP_BQ_REGION_ID}, --gcp_bq_dataset_id=${GCP_BQ_DATASET_ID}, --target_bigquery_summary_table=${GCP_BQ_BILLING_PROJECT_ID}.${GCP_BQ_DATASET_ID}.${TARGET_TABLE_NAME}"
    },
    "service_account": "SERVICE_ACC"
  },
  "trigger_spec": {
    "type": "ON_DEMAND"
  },
  "description": "${DATAPLEX_TASK_DESCRIPTION}"
}
```
See also Sample Airflow DAG for Dataplex Universal Catalog data quality task.
Monitor a scheduled data quality task
See how to monitor your task.
View the results
The results of the data quality validations are stored in the BigQuery dataset and summary table that you specified, as described in Create a dataset to store the results. The summary table contains the output summary for each combination of rule binding and rule for each validation run. The output in the summary table includes the following information:
| Column name | Description |
|---|---|
| dataplex_lake | (string) ID of the Dataplex Universal Catalog lake containing the table being validated. |
| dataplex_zone | (string) ID of the Dataplex Universal Catalog zone containing the table being validated. |
| dataplex_asset_id | (string) ID of the Dataplex Universal Catalog asset containing the table being validated. |
| execution_ts | (timestamp) Timestamp of when the validation query was run. |
| rule_binding_id | (string) ID of the rule binding for which validation results are reported. |
| rule_id | (string) ID of the rule under the rule binding for which validation results are reported. |
| dimension | (string) Data quality dimension of the rule_id. This value can only be one of the values specified in the rule_dimensions YAML node. |
| table_id | (string) ID of the entity for which validation results are reported. This ID is specified under the entity parameter of the respective rule binding. |
| column_id | (string) ID of the column for which validation results are reported. This ID is specified under the column parameter of the respective rule binding. |
| last_modified | (timestamp) The last modified timestamp of the table_id being validated. |
| metadata_json_string | (string) Key-value pairs of the metadata parameter content specified under the rule binding or during the data quality run. |
| configs_hashsum | (string) The hash sum of the JSON document containing the rule binding and all its associated rules, rule bindings, row filters, and entity configurations. configs_hashsum allows tracking when the content of a rule_binding ID or one of its referenced configurations has changed. |
| dq_run_id | (string) Unique ID of the record. |
| invocation_id | (string) ID of the data quality run. All data quality summary records generated within the same data quality execution instance share the same invocation_id. |
| progress_watermark | (boolean) Determines whether this particular record is considered by the data quality check to determine the high watermark for incremental validation. If FALSE, the respective record is ignored when establishing the high-watermark value. This information is useful when executing test data quality validations that should not advance the high watermark. Dataplex Universal Catalog populates this field with TRUE by default, but the value can be overridden if the --progress_watermark argument has a value of FALSE. |
| rows_validated | (integer) Total number of records validated after applying row_filters and any high-watermark filters on the incremental_time_filter_column_id column, if specified. |
| complex_rule_validation_errors_count | (float) Number of rows returned by a CUSTOM_SQL_STATEMENT rule. |
| complex_rule_validation_success_flag | (boolean) Success status of CUSTOM_SQL_STATEMENT rules. |
| success_count | (integer) Total number of records that passed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| success_percentage | (float) Percentage of the number of records that passed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_count | (integer) Total number of records that failed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_percentage | (float) Percentage of the number of records that failed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| null_count | (integer) Total number of records that returned null during validation. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| null_percentage | (float) Percentage of the number of records that returned null during validation within the total number of records validated. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| failed_records_query | For every rule that fails, this column stores a query that you can use to get failed records. In this document, see Troubleshoot failed rules with failed_records_query. |
For BigQuery entities, a view is created for every rule_binding containing the SQL validation logic of the latest execution. You can find these views in the BigQuery dataset specified in the argument --gcp-bq-dataset-id.
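For example, you can inspect the most recent results with a bq query. The project, dataset, and table names below are placeholders for the values that you passed in --gcp_project_id, --gcp_bq_dataset_id, and --target_bigquery_summary_table.

```bash
# List recent results per rule binding and rule (names are placeholders).
bq query --use_legacy_sql=false '
SELECT rule_binding_id, rule_id, dimension,
       rows_validated, success_percentage, failed_count, execution_ts
FROM `PROJECT_ID.TARGET_BQ_DATASET.TARGET_BQ_TABLE`
ORDER BY execution_ts DESC
LIMIT 20'
```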
Cost optimizations
Data quality tasks are executed as BigQuery jobs in your project. To control the cost of running data quality jobs, use BigQuery pricing on the project where your BigQuery jobs run. For more information, see BigQuery workload management.
Incremental validations
Often, you have tables that are updated routinely with new partitions (new rows). If you don't want to revalidate old partitions in every run, you can use incremental validations.
For incremental validations, you must have a TIMESTAMP or DATETIME type column in your table whose value monotonically increases. You can use the columns that your BigQuery table is partitioned on.
To specify incremental validation, set incremental_time_filter_column_id to the name of the TIMESTAMP or DATETIME column as part of a rule binding.
When you specify a column, the data quality task considers only rows with a TIMESTAMP value greater than the timestamp of the last data quality task that ran.
Example specification files
To use these samples, create a BigQuery dataset named sales. Then, create a fact table named sales_orders and add sample data by running a query with the following GoogleSQL statements:

```sql
CREATE OR REPLACE TABLE sales.sales_orders (
  id STRING NOT NULL,
  last_modified_timestamp TIMESTAMP,
  customer_id STRING,
  item_id STRING,
  amount NUMERIC,
  transaction_currency STRING
);

INSERT INTO sales.sales_orders
  (id, last_modified_timestamp, customer_id, item_id, amount, transaction_currency)
VALUES
  ("order1", CURRENT_TIMESTAMP(), "customer1", "ASDWQ123456789012345", 100, "USD"),
  ("order1", CURRENT_TIMESTAMP(), "customer2", "bad_item_id", -10, "XXX"),
  ("order2", CURRENT_TIMESTAMP(), "customer3", "INTNL987654321098765", 50, "GBP"),
  ("order3", CURRENT_TIMESTAMP(), "customer4", "INTNL932716428593847", 50, "GBP")
```

Sample 1
The following code sample creates data quality checks that validate these values:
- amount: values are zero or positive numbers.
- item_id: an alphanumeric string of 5 alphabetical characters, followed by 15 digits.
- transaction_currency: an allowed currency type, as defined by a static list. The static list of this sample allows GBP and JPY as currency types. This validation only applies to rows marked as international.
```yaml
# The following `NONE` row filter is required.
row_filters:
  NONE:
    filter_sql_expr: |-
      True
  # This filters for rows marked as international (INTNL).
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')

# Rule dimensions are optional but let you aggregate reporting.
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity

# Rules can apply to multiple tables or columns.
rules:
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}

  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')

# Rule bindings associate rules to columns within tables.
rule_bindings:
  TRANSACTION_AMOUNT_VALID:
    entity_uri: bigquery://projects/PROJECT_ID/datasets/DATASET_ID/tables/sales_orders
    column_id: amount
    row_filter_id: NONE
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE

  TRANSACTION_VALID_ITEM_ID:
    entity_uri: bigquery://projects/PROJECT_ID/datasets/DATASET_ID/tables/sales_orders
    column_id: item_id
    row_filter_id: NONE
    rule_ids:
      - VALID_ITEM_ID

  TRANSACTION_CURRENCY_VALID:
    entity_uri: bigquery://projects/PROJECT_ID/datasets/DATASET_ID/tables/sales_orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    rule_ids:
      - VALID_CURRENCY_ID
```

Replace the following:
- PROJECT_ID: your project ID.
- DATASET_ID: the dataset ID.
Sample 2
If the table to be checked is part of a Dataplex Universal Catalog lake, you can specify the tables using lake or zone notation. This lets you aggregate your results by lake or zone. For example, you can generate a zone-level score.
To use this sample, create a Dataplex Universal Catalog lake with the lake ID operations and the zone ID procurement. Then, add the table sales_orders as an asset to the zone.
```yaml
# This is a convenience section that allows you to shorten the entity_uri
metadata_registry_defaults:
  dataplex:
    projects: PROJECT_ID
    locations: REGION_ID
    lakes: operations
    zones: procurement

# You have to define a NONE row filter
row_filters:
  NONE:
    filter_sql_expr: |-
      True
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')

# Rule dimensions are optional but allow you to aggregate reporting.
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity

# Rules can be shared across tables or columns.
rules:
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}

  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')

# Rule bindings associate rules to {table, column}
rule_bindings:
  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://projects/PROJECT_ID/locations/REGION_ID/lakes/operations/zones/procurement/entities/sales_orders
    column_id: amount
    row_filter_id: NONE
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE

  TRANSACTION_VALID_ITEM_ID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    # Omitting projects/locations/lakes from the URI path uses the default values
    # specified in metadata_registry_defaults.
    column_id: item_id
    row_filter_id: NONE
    rule_ids:
      - VALID_ITEM_ID

  TRANSACTION_CURRENCY_VALID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    rule_ids:
      - VALID_CURRENCY_ID
```

Replace the following:
- PROJECT_ID: your project ID.
- REGION_ID: the region ID of the Dataplex Universal Catalog lake in which the table exists, such as us-central1.
Sample 3
This example enhances Sample 2 by adding a custom SQL check to see if the ID values are unique.
```yaml
# This is a convenience section that allows you to shorten the entity_uri
metadata_registry_defaults:
  dataplex:
    projects: PROJECT_ID
    locations: REGION_ID
    lakes: operations
    zones: procurement

# You have to define a NONE row filter
row_filters:
  NONE:
    filter_sql_expr: |-
      True
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')

# Rule dimensions are optional but allow you to aggregate reporting.
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity

# Rules can be shared across tables or columns.
rules:
  # This rule is parameterized with column_names as a parameter.
  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    dimension: duplication
    params:
      custom_sql_arguments:
        - column_names
      custom_sql_statement: |-
        select a.*
        from data a
        inner join (
          select
            $column_names
          from data
          group by $column_names
          having count(*) > 1
        ) duplicates
        using ($column_names)

  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}

  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')

# Rule bindings associate rules to {table, column}
rule_bindings:
  TRANSACTIONS_UNIQUE:
    entity_uri: dataplex://projects/PROJECT_ID/locations/REGION_ID/lakes/operations/zones/procurement/entities/sales_orders
    column_id: id
    row_filter_id: NONE
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "id"

  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    # Omitting projects/locations/lakes from the URI path uses the default values
    # specified in metadata_registry_defaults.
    column_id: amount
    row_filter_id: NONE
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE

  TRANSACTION_VALID_ITEM_ID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    column_id: item_id
    row_filter_id: NONE
    rule_ids:
      - VALID_ITEM_ID

  TRANSACTION_CURRENCY_VALID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    rule_ids:
      - VALID_CURRENCY_ID
```

Sample 4
This example enhances Sample 3 by adding incremental validations using the last_modified_timestamp column. You can add incremental validations for one or more rule bindings.
```yaml
# This is a convenience section that allows you to shorten the entity_uri
metadata_registry_defaults:
  dataplex:
    projects: PROJECT_ID
    locations: REGION_ID
    lakes: operations
    zones: procurement

# You have to define a NONE row filter
row_filters:
  NONE:
    filter_sql_expr: |-
      True
  INTERNATIONAL_ITEMS:
    filter_sql_expr: |-
      REGEXP_CONTAINS(item_id, 'INTNL')

# Rule dimensions are optional but allow you to aggregate reporting.
rule_dimensions:
  - consistency
  - correctness
  - duplication
  - completeness
  - conformance
  - integrity

# Rules can be shared across tables or columns.
rules:
  # This rule is parameterized with column_names as a parameter.
  NO_DUPLICATES_IN_COLUMN_GROUPS:
    rule_type: CUSTOM_SQL_STATEMENT
    dimension: duplication
    params:
      custom_sql_arguments:
        - column_names
      custom_sql_statement: |-
        select a.*
        from data a
        inner join (
          select
            $column_names
          from data
          group by $column_names
          having count(*) > 1
        ) duplicates
        using ($column_names)

  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: correctness
    params:
      custom_sql_expr: |-
        $column >= 0

  VALID_ITEM_ID:
    rule_type: REGEX
    dimension: conformance
    params:
      pattern: |-
        [A-Z]{5}[0-9]{15}

  VALID_CURRENCY_ID:
    rule_type: CUSTOM_SQL_EXPR
    dimension: integrity
    params:
      custom_sql_expr: |-
        $column in ('GBP', 'JPY')

# Rule bindings associate rules to {table, column}
rule_bindings:
  TRANSACTIONS_UNIQUE:
    entity_uri: dataplex://projects/PROJECT_ID/locations/REGION_ID/lakes/operations/zones/procurement/entities/sales_orders
    column_id: id
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - NO_DUPLICATES_IN_COLUMN_GROUPS:
          column_names: "id"

  TRANSACTION_AMOUNT_VALID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    # Omitting projects/locations/lakes from the URI path uses the default values
    # specified in metadata_registry_defaults.
    column_id: amount
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE

  TRANSACTION_VALID_ITEM_ID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    column_id: item_id
    row_filter_id: NONE
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALID_ITEM_ID

  TRANSACTION_CURRENCY_VALID:
    entity_uri: dataplex://zones/procurement/entities/sales_orders
    column_id: transaction_currency
    row_filter_id: INTERNATIONAL_ITEMS
    incremental_time_filter_column_id: last_modified_timestamp
    rule_ids:
      - VALID_CURRENCY_ID
```

Troubleshoot failed rules with failed_records_query
For every rule that fails, the summary table stores a query in the failed_records_query column that you can use to get failed records.
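For example, the following sketch retrieves the stored query for one rule binding. The project, dataset, and table names are placeholders, and TRANSACTION_AMOUNT_VALID is the rule binding from the samples in this document.

```bash
# Fetch the failed records query for the most recent run of one rule binding.
bq query --use_legacy_sql=false '
SELECT failed_records_query
FROM `PROJECT_ID.TARGET_BQ_DATASET.TARGET_BQ_TABLE`
WHERE rule_binding_id = "TRANSACTION_AMOUNT_VALID"
ORDER BY execution_ts DESC
LIMIT 1'
```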
To debug, you can also use reference columns in your YAML file, which lets you join the output of failed_records_query with the original data to get the entire record. For example, you can specify a primary_key column or a compound primary_key column as a reference column.
Note: Reference columns aren't included in failed records queries for CUSTOM_SQL_STATEMENT rules. The output of the failed records query depends on the query in CUSTOM_SQL_STATEMENT itself. For example, if you need a reference column in the failed records query for a CUSTOM_SQL_STATEMENT rule, you need to include the column in the SQL statement as well. See Use failed records queries for CUSTOM_SQL_STATEMENT rules for more information.
Specify reference columns
To generate reference columns, you can add the following to your YAML specification:
- The reference_columns section. In this section, you can create one or more reference column sets, with each set specifying one or more columns.
- The rule_bindings section. In this section, you can add a line to a rule binding that specifies a reference column ID (reference_columns_id) to use for the rules in that rule binding. It should be one of the reference column sets specified in the reference_columns section.
For example, the following YAML file specifies a reference_columns section and defines three columns, id, last_modified_timestamp, and item_id, as part of the ORDER_DETAILS_REFERENCE_COLUMNS set. The example uses the sample table sales_orders.
```yaml
reference_columns:
  ORDER_DETAILS_REFERENCE_COLUMNS:
    include_reference_columns:
      - id
      - last_modified_timestamp
      - item_id

rules:
  VALUE_ZERO_OR_POSITIVE:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: |-
        $column >= 0

row_filters:
  NONE:
    filter_sql_expr: |-
      True

rule_bindings:
  TRANSACTION_AMOUNT_VALID:
    entity_uri: bigquery://projects/PROJECT_ID/datasets/DATASET_ID/tables/sales_orders
    column_id: amount
    row_filter_id: NONE
    reference_columns_id: ORDER_DETAILS_REFERENCE_COLUMNS
    rule_ids:
      - VALUE_ZERO_OR_POSITIVE
```
Using the failed records query
The failed records query generates a row for every record that has a rule that failed. It includes the column name that triggered the failure, the value that triggered the failure, and the values for the reference columns. It also includes metadata that you can use to relate to the execution of the data quality task.
The following is an example of the output from a failed records query for the YAML file described in Specify reference columns. It shows a failure for column amount and a failed value of -10. It also records the corresponding values for the reference columns.
| _dq_validation_invocation_id | _dq_validation_rule_binding_id | _dq_validation_rule_id | _dq_validation_column_id | _dq_validation_column_value | _dq_validation_dimension | _dq_validation_simple_rule_row_is_valid | _dq_validation_complex_rule_validation_errors_count | _dq_validation_complex_rule_validation_success_flag | id | last_modified_timestamp | item_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10a25be9-8dfa-446c-a42c-75f6bb4a49d9 | TRANSACTION_AMOUNT_VALID | VALUE_ZERO_OR_POSITIVE | amount | -10 |  | FALSE |  |  | order1 | 2022-01-22T02:30:06.321Z | bad_item_id |
Use failed records queries for CUSTOM_SQL_STATEMENT rules
For CUSTOM_SQL_STATEMENT rules, failed records queries include the custom_sql_statement_validation_errors column. The custom_sql_statement_validation_errors column is a nested column with fields that match the output of your SQL statement. Reference columns are not included in failed records queries for CUSTOM_SQL_STATEMENT rules.
For example, your CUSTOM_SQL_STATEMENT rule might look like this:
```yaml
rules:
  TEST_RULE:
    rule_type: CUSTOM_SQL_STATEMENT
    custom_sql_arguments:
      - existing_id
      - replacement_id
    params:
      CUSTOM_SQL_STATEMENT: |-
        (SELECT product_name, product_key FROM data
        where $existing_id != $replacement_id)
```
The failed records query for this rule contains the custom_sql_statement_validation_errors column, with a row for every occurrence where existing_id != replacement_id. When rendered in JSON, the contents of a cell in this column might look like this:
{ "custom_sql_statement_valdation_errors" :{ "product_name"="abc" "product_key"="12345678" "_rule_binding_id"="your_rule_binding" }}You can join these results to the original table with a nested reference likejoin on custom_sql_statement_valdation_errors.product_key.
What's next
- Refer to the CloudDQ YAML specification reference.
- For sample data quality rules, see Simple rules and Advanced rules.
- See Sample Airflow DAG for Dataplex Universal Catalog data quality task.