Auto data quality overview

Dataplex Universal Catalog lets you define and measure the quality of the data in yourBigQuery tables. You can automate the data scanning, validate dataagainst defined rules, and log alerts if your data doesn't meet qualityrequirements. Auto data quality lets you manage data quality rules anddeployments as code, improving the integrity of data production pipelines.

To scan data for anomalies, seeDataplex Universal Catalog data profile scan.The scan can generate data quality rules. You can also use predefined qualityrules or build custom rules.

Dataplex Universal Catalog provides monitoring, troubleshooting, andCloud Logging alerting that's integrated with auto data quality.

Conceptual model

A data quality scan applies quality rules to table data to report results.

A data quality scan is a type ofDataplex Universal Catalog datascan thatvalidates your data against a set of predefined rules. A data scan is aDataplex Universal Catalog job that samples data from BigQuery andCloud Storage and infers various types of metadata. To measure thequality of a table using auto data quality, you create aDataScan object oftypedata quality. The scan runs on only one BigQuery table.The scan uses resources in a Googletenantproject, so you don't need to setup your own infrastructure.

Creating and using a data quality scan consists of the following steps:

Define data quality rules
Configure rule execution
Analyze data quality scan results
Set up monitoring and alerting
Troubleshoot data quality failures

Rule definition

Data quality rules associated with a data quality scan define dataexpectations. You can create data quality rules in the following ways:

Use recommendations fromDataplex Universal Catalog data profiling
Use the predefined rules
Create custom SQL rules

Predefined rules

Dataplex Universal Catalog supports the following categories of predefined rules:

Row-level

For row-level category rules, the expectation is appliedagainst each data row. Each row independently passes orfails the condition. For example,column_A_value < 1.

Row-level checks require you to specify a passing threshold. When thepercentage of rows passing the rule falls below the threshold value, the rulefails.

Aggregate

For aggregate rules, the expectation isapplied against a single value aggregated across the entire data. Forexample,Avg(someCol) >= 10. To pass, the check must evaluate to thebooleantrue. Aggregate rules don't provide an independent pass or failcount for each row.

For both of the rule categories, you can set the following parameters:

The column to which the rule applies
Adimension

The following table lists the supported row-level and aggregate rule types:

Rule type (Name in Google Cloud console)	Row-level or aggregate rule	Description	Supported column types	Rule-specific parameters
`RangeExpectation` (Range Check)	Row-level	Check if the value is between min and max.	All numeric, date, and timestamp type columns.	Required: Passing threshold percentage `min` or`max` values: Specify at least one value. Optional: Enable`strict min`: If enabled, the rule check uses ">" instead of ">=". Enable`strict max`: If enabled, the rule check uses "<" instead of "<=". Enable`ignore null`: If enabled, null values are ignored in the rule check.
`NonNullExpectation` (Null check)	Row-level	Validate that column values are not NULL.	All supported column types.	Required: Passing threshold percentage.
`SetExpectation` (Set check)	Row-level	Check if the values in a column are one of the specified values in a set.	All supported column types, except`Record` and`Struct`.	Required: Set of string values to check against. Passing threshold percentage. Optional: Enable`ignore null`: If enabled, null values are ignored in the rule check.
`RegexExpectation` (Regular expression check)	Row-level	Check the values against a specified regular expression.	String	Required: Regular expression pattern used to check. Passing threshold percentage. Note: GoogleSQL provides regular expression support using there2 library. See that documentation for its regular expression syntax. Optional: Enable`ignore null`: If enabled, null values are ignored in the rule check.
`Uniqueness` (Uniqueness Check)	Aggregate	Check if all the values in a column are unique.	All supported column types, except`Record` and`Struct`.	Required: Column and dimension from the supported parameters. Optional: Enable`ignore null`: If enabled, null values are ignored in the rule check.
`StatisticRangeExpectation` (Statistic check)	Aggregate	Check if the given statistical measure matches the range expectation.	All supported numeric column types.	Required: `mean`,`min`, or`max` values: Specify at least one value. Optional: Enable`strict min`: If enabled, the rule check uses ">" instead of ">=". Enable`strict max`: If enabled, the rule check uses "<" instead of "<=".

Supported custom SQL rule types

SQL rules provide flexibility to expand the validation with custom logic. Theserules come in the following types.

Rule type Row-level or aggregate rule Description Supported column types Rule-specific parameters Example

Row condition

Row-level

Rule type	Row-level or aggregate rule	Description	Supported column types	Rule-specific parameters	Example
Row condition	Row-level	Specify an expectation for every row by defining a SQL expression in a`WHERE` clause. The SQL expression should evaluate to`true` (pass) or`false` (fail) per row. Dataplex Universal Catalog computes the percentage of rows that pass this expectation and compares this value against the passing threshold percentage to determine the success or failure of the rule. The expression can include a reference to another table, for example, to create referential integrity checks.	All columns	Required: SQL condition to use Passing threshold percentage Dimension Optional: Column to associate this rule with.	`grossWeight` <=`netWeight`
Table condition (aggregate SQL expression)	Aggregate	These rules are executed once per table. Provide a SQL expression that evaluates to boolean`true` (pass) or`false` (fail). The SQL expression can include a reference to another table usingexpression subqueries.	All columns	Required: SQL condition to use Dimension Optional: Column to associate this rule with	Simple aggregate example: `avg(price) > 100` Using an expression subquery to compare values across a different table: (SELECT COUNT() FROM `example_project.example_dataset.different-table`) < COUNT()
SQL assertion	Aggregate	An assertion rule uses a data quality query to find rows that fail one or more conditions specified in the query. Provide a SQL statement that is evaluated to return rows that match the invalid state. If the query returns any rows, the rule fails. Omit the trailing semicolon from the SQL statement. The SQL statement can include a reference to another table usingexpression subqueries.	All columns	Required: SQL statement to check invalid state Dimension Optional: Column to associate this rule with.	Simple aggregate example to make sure that`discount_pct` is not greater than 100: `SELECT * FROM example_project.example_dataset.table WHERE discount_pct > 100` Using an expression subquery to compare values across a different table: SELECT * FROM `example_project.example_dataset.different-table` WHERE gross_weight > (SELECT avg(gross_weight) FROM `example_project.example_dataset.different-table`)

Specify an expectation for every row by defining a SQL expression in aWHERE clause. The SQL expression should evaluate totrue (pass) orfalse (fail) per row. Dataplex Universal Catalog computes the percentage of rows that pass this expectation and compares this value against the passing threshold percentage to determine the success or failure of the rule.

The expression can include a reference to another table, for example, to create referential integrity checks.

All columns

Required:

SQL condition to use
Passing threshold percentage
Dimension

Optional:

Column to associate this rule with.

grossWeight <=netWeight

Table condition
(aggregate SQL expression)

Aggregate

These rules are executed once per table. Provide a SQL expression that evaluates to booleantrue (pass) orfalse (fail).

The SQL expression can include a reference to another table usingexpression subqueries.

All columns

Required:

SQL condition to use
Dimension

Optional:

Column to associate this rule with

Simple aggregate example:
avg(price) > 100

Using an expression subquery to compare values across a different table:
(SELECT COUNT(*) FROM `example_project.example_dataset.different-table`) < COUNT(*)

SQL assertion

Aggregate

An assertion rule uses a data quality query to find rows that fail one or more conditions specified in the query. Provide a SQL statement that is evaluated to return rows that match the invalid state. If the query returns any rows, the rule fails.

Omit the trailing semicolon from the SQL statement.

The SQL statement can include a reference to another table usingexpression subqueries.

All columns

Required:

SQL statement to check invalid state
Dimension

Optional:

Column to associate this rule with.

Simple aggregate example to make sure thatdiscount_pct is not greater than 100:
SELECT * FROM example_project.example_dataset.table WHERE discount_pct > 100

Using an expression subquery to compare values across a different table:
SELECT * FROM `example_project.example_dataset.different-table` WHERE gross_weight > (SELECT avg(gross_weight) FROM `example_project.example_dataset.different-table`)

For example rules, seeauto data quality sample rules.

For SQL functions supported, seeGoogleSQL reference.

Dimensions

Dimensions let you aggregate the results of multiple data quality rules formonitoring and alerting. You must associate every data quality rule with adimension. Dataplex Universal Catalog provides the following dimensions:

Freshness: Freshness measures when the data was last updated. Having this informationcan help you determine whether the data is recent enough to be useful.
Volume: Volume measures whether all of the expected data is present.
Completeness: Completeness assesses whether the data contains all of the information that'srequired for its intended purpose.
Validity: Validity evaluates whether the data conforms to predefined standards forformat, acceptable ranges, or other criteria. For example, if a valid date needsto have the formatYYYY/mm/dd, then 08-12-2019 is invalid data. As anotherexample, if a valid sales price for an item is between $10 and $20, then a salesprice of $100 is invalid data.
Consistency: Consistency refers to having the same values for data across multipleinstances, such as tables and columns. Inconsistency in data arises when, forexample, the revenue of a product differs when it is read from a sales databaseor a usage database.
Accuracy: Accuracy reflects the correctness of the data. Note that data that is validisn't necessarily accurate. For example, a valid hair color might be brown, butif a person doesn't have brown hair, that is inaccurate data.
Uniqueness: Uniqueness measures whether the data is distinct with no duplicates.

Note: You can create a custom dimension name when you use the Google Cloud CLI orthe REST API to create a data scan. You can use dimension names that reflectyour organization's terminology. For example, instead of freshness, you mightuse timeliness.

Typed input in rules

All value parameters are passed as string values to the API.Dataplex Universal Catalog requires inputs to follow the BigQuery specified format.

Binary-typed parameters can be passed as a base64-encoded string.

Type	Supported formats	Examples
Binary	Base64 encoded value	YXBwbGU=
Timestamp	YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]] [time_zone] OR YYYY-[M]M-[D]D[( \|T)[H]H:[M]M:[S]S[.F]][time_zone_offset]	2014-09-27 12:30:00.45-08
Date	YYYY-M[M]-D[D]	2014-09-27
Time	[H]H:[M]M:[S]S[.DDDDDD]	12:30:00.45
DateTime	YYYY-[M]M-[D]D [[H]H:[M]M:[S]S[.DDDDDD]]	2014-09-27 12:30:00.45

Data reference parameter

When you create a custom SQL rule, you can refer to a data source table and allof its precondition filters by using the data reference parameter${data()} inthe rule, instead of explicitly mentioning the source table and its filters.Dataplex Universal Catalog interprets the parameter as a reference to the sourcetable and its filters. Examples of precondition filters include row filters,sampling percents, and incremental filters.

For example, say that you have a data source table calledmy_project_id.dim_dataset.dim_currency. You want to run an incremental dataquality scan that scans only on new daily data. A row filter that filters fortoday's entries,transaction_timestamp >= current_date(), is applied on thetable.

A custom SQL rule to find rows withdiscount_pct for today looks like this:

discount_pct IN (SELECT discount_pct FROM my_project_id.dim_dataset.dim_currency WHERE transaction_timestamp >= current_date())

If you use the data reference parameter, you can simplify the rule. Replace themention of the table and its precondition filters with the${data()}parameter:

discount_pct IN (SELECT discount_pct FROM ${data()})

Dataplex Universal Catalog interprets the${data()} parameter as areference to the data source table with today's entries,my_project_id.dim_dataset.dim_currency WHERE transaction_timestamp >= current_date().In this example, the data reference parameter refers only to the incrementaldata.

The${data()} parameter is case sensitive.

When you use an alias within a subquery to refer to columns in the source table,either use the data reference parameter to refer to the source table, or omitthe table reference. Don't refer to the columns in the source table by using adirect table reference in theWHERE clause.

Recommended:

Use the data reference parameter to refer to the source table:

discount_pct IN (SELECT discount_pct FROM`my_project_id.dim_dataset.dim_currency` AS temp-tableWHEREtemp-table.transaction_timestamp = ${data()}.timestamp)

Omit the table reference:

discount_pct IN (SELECT discount_pct FROM`another_project.another_dataset.another_table` AS temp-tableWHEREtemp-table.transaction_timestamp = timestamp

Not recommended:

Don't use a direct table reference to refer to columns in the source table:

discount_pct IN (SELECT discount_pct FROM`my_project_id.dim_dataset.dim_currency` AS temp-tableWHEREtemp-table.transaction_timestamp = `my_project_id.dim_dataset.dim_currency`.timestamp)

Valid use of different tables:

You can use a direct table reference when comparing columns from adifferent table:

discount_pct IN (SELECT discount_pct FROM`my_project_id.dim_dataset.dim_currency` AS temp-tableWHEREtemp-table.transaction_timestamp = `another_project.another_dataset.another_table`.timestamp)

Rule execution

You can schedule data quality scans to run at a specific interval, or you canrun a scan on demand.

When you run a data quality scan, Dataplex Universal Catalog creates a job. As partof the specification of a data quality scan, you can specify the scope of a jobto be one of the following:

Full Table: Each job validates the entire table.
Incremental: Each job validates incremental data. Todetermine increments, provide aDate /Timestamp column in thetable that can be used as a marker. Typically, this is the column on which thetable is partitioned.

Filter data

You can filter data to be scanned for data quality by using arow filter. Creating a row filter lets you focus on data within a specifictime period or specific segment, such as a certain region. Using filters canreduce the run time and cost. For example, you can filter out data with a timestampbefore a certain date.

Sample data

You can specify a percentage of records from your datato sample for running a data quality scan. Creating data quality scans on asmaller sample of data can reduce the run time and the cost relative toquerying the entire dataset.

Data quality scan results

The results of your data quality scans are available in Dataplex Universal Catalogand BigQuery.You can also review and analyze the scan results by using the following methods:

Export results to BigQuery
You can export the scan results to a BigQuery table for furtheranalysis. To customize reporting, you can connect the BigQuerytable data to a Looker dashboard. You can build an aggregated reportby using the same results table across multiple scans.
Publish results as Dataplex Universal Catalog metadata
You can publish the data quality scan results asDataplex Universal Catalog metadata. The latest results are saved to theDataplex Universal Catalog entry that represents the source table, under thedata-quality-scorecard system aspect type. You can view the results on thesource table's BigQuery and Dataplex Universal Catalog pages inthe Google Cloud console, on theData quality tab. You can also retrievethe results by using the API.
Note: If an existing data quality scan published the results to theBigQuery and Dataplex Universal Catalog pages in theGoogle Cloud console, and you instead want to publish future scan results asDataplex Universal Catalog metadata, you must edit the scan tore-enable publishing.
For more information about Dataplex Universal Catalog metadata, see About metadata management in Dataplex Universal Catalog.
Review data quality scores
Each scan result provides data quality scores that indicate the percentage ofrules that passed. The scores are reported at the overall job level, the columnlevel (if the rule is evaluated against a column), and the dimension level. Usethe data quality scores to normalize data quality across tables or columns,track trends, and identify data that doesn't meet quality requirements.

For more information, seeView the data quality scan results.

Monitoring and alerting

You can monitor and get alerts about data quality scans by using the followingmethods:

Set alerts in Cloud Logging
You can monitor the data quality jobs using thedata_scan anddata_quality_scan_rule_result logs in the Logs Explorer.
For each data quality job, thedata_scan log with thedata_scan_type fieldset toDATA_QUALITY contains the following information:
- Data source used for the data scan.
- Job execution details, such as creation time, start time, end time, andjob state.
- Result of the data quality job: pass or fail.
- Dimension level pass or fail.
Every succeeded job contains adata_quality_scan_rule_resultlog with the following detailed information about each rule in that job:
- Configuration information, such as rule name, rule type, evaluation type,and dimension.
- Result information, such as pass or failure, total row count, passing rowcount, null row count, and evaluated row count.
The information in the logs is available through the API andGoogle Cloud console. You can use this information to set up alerts. Formore information, seeSet alerts in Logging.
Note: For the latest successful job, this information is also available in theparent data quality scan.
Send email notification reports
You can send email notification reports to alert people about the status andresults of a data quality job. Notification reports are available for thefollowing scenarios:
You configure notification reports when you create a data quality scan.

Troubleshoot data quality failures

When a rule fails, Dataplex Universal Catalog produces a query to get the failedrecords. Run this query to see the records that did not match your rule. Formore information, seeTroubleshoot a data quality failure.

Limitations

Rule recommendations aren't supported in the gcloud CLI.
The choice of dimensions is fixed to one of the predefined seven dimensions.
The number of rules per data quality scan is limited to 1000.
Data quality scores that are reported at the column level are supportedonly in the API.

Pricing

For more information about pricing, seeDataplex Universal Catalog pricing.

What's next?

Learn how touse auto data quality.
Learn how tomanage your data quality rules as code.
Learn about the available Terraform resources for data profiling. See the following:
- Dataplex data scan resource in the Terraform registry.
- TheDataplex data scan resource documentation on GitHub, which supports YAML-based rule configuration.
Learn aboutdata profiling.
Learn how touse data profiling.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.

Movatterモバイル変換

Auto data quality overview Stay organized with collections Save and categorize content based on your preferences.

Conceptual model

Rule definition

Predefined rules

Supported custom SQL rule types

Dimensions

Typed input in rules

Data reference parameter

Rule execution

Filter data

Sample data

Data quality scan results

Monitoring and alerting

Troubleshoot data quality failures

Limitations

Pricing

What's next?

Auto data quality overview