Transformation Pushdown overview

For improved performance in your data pipelines, you can push some transformation operations to BigQuery instead of Apache Spark. Transformation Pushdown refers to a setting that lets an operation in a Cloud Data Fusion data pipeline be pushed to BigQuery as an execution engine. As a result, the operation and its data are transferred to BigQuery and the operation is performed there.

Transformation Pushdown improves the performance of pipelines that have multiple complex JOIN operations or other supported transformations. Executing some transformations in BigQuery can be faster than executing them in Spark.

Unsupported transformations and all preview transformations are executed in Spark.

Supported transformations

Transformation Pushdown is available in Cloud Data Fusion version 6.5.0 and later, but some of the following transformations are only supported in later versions.

Note: When you preview data in Cloud Data Fusion, all transformations are executed in Spark, not BigQuery.

JOIN operations

BigQuery Sink

Transformation Pushdown is available for the BigQuery Sink in Cloud Data Fusion version 6.7.0 and later.

When the BigQuery Sink follows a stage that has been executed in BigQuery, the operation that writes records into BigQuery is performed directly in BigQuery.

To improve performance with this sink, you need the following:

  • The service account must have permission to create and update tables in the dataset used by the BigQuery Sink.
  • The datasets used for Transformation Pushdown and the BigQuery Sink must be stored in the same location.
  • The operation must be one of the following:
    • Insert (the Truncate Table option is not supported)
    • Update
    • Upsert
Note: Without these requirements, the sink operation uses the existing workflow and doesn't execute a direct copy.
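As an illustration of the Upsert case, a direct-copy Upsert conceptually corresponds to a BigQuery MERGE statement. The following Python sketch builds such a statement for hypothetical table, key, and column names; the SQL that Cloud Data Fusion actually generates is managed internally and may differ:

```python
def build_upsert_merge(target: str, staging: str, keys: list, cols: list) -> str:
    """Build a BigQuery MERGE statement that upserts staged records into a
    target table. All table and column names here are placeholders."""
    on = " AND ".join(f"T.{k} = S.{k}" for k in keys)
    sets = ", ".join(f"T.{c} = S.{c}" for c in cols)
    ins_cols = ", ".join(keys + cols)
    ins_vals = ", ".join(f"S.{c}" for c in keys + cols)
    return (
        f"MERGE `{target}` T USING `{staging}` S ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({ins_cols}) VALUES ({ins_vals})"
    )

# Hypothetical tables: upsert staged orders keyed by order_id.
sql = build_upsert_merge("proj.ds.orders", "proj.ds.orders_staged",
                         keys=["order_id"], cols=["status", "total"])
```

Because the write happens inside BigQuery, no records leave the warehouse for this stage, which is what makes the direct copy cheaper than round-tripping through Spark.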

GROUP BY aggregations

Transformation Pushdown is available for GROUP BY aggregations in Cloud Data Fusion version 6.7.0 and later.

GROUP BY aggregations in BigQuery are available for the following operations:

  • Avg
  • Collect List (null values are removed from the output array)
  • Collect Set (null values are removed from the output array)
  • Concat
  • Concat Distinct
  • Count
  • Count Distinct
  • Count Nulls
  • Logical And
  • Logical Or
  • Max
  • Min
  • Standard Deviation
  • Sum
  • Sum of Squares
  • Corrected Sum of Squares
  • Variance
  • Shortest String
  • Longest String
Note: Conditional aggregations, which let you use a filter expression before aggregating records, are not supported. If a conditional aggregation is used, the GROUP BY operation is executed in Spark.
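Many of the aggregations above map directly onto BigQuery standard SQL functions. The sketch below builds a plausible GROUP BY statement for a few of them over a hypothetical table; the SQL that Cloud Data Fusion generates internally may differ:

```python
# Plausible BigQuery standard SQL equivalents for a few supported
# aggregations. Table and column names below are hypothetical.
AGG_SQL = {
    "Avg": "AVG(price) AS avg_price",
    "Sum": "SUM(price) AS total_price",
    "Count Distinct": "COUNT(DISTINCT user_id) AS unique_users",
    # Collect Set removes nulls, matching ARRAY_AGG with IGNORE NULLS.
    "Collect Set": "ARRAY_AGG(DISTINCT tag IGNORE NULLS) AS tags",
    "Logical And": "LOGICAL_AND(in_stock) AS all_in_stock",
}

sql = (
    "SELECT category, "
    + ", ".join(AGG_SQL.values())
    + " FROM `proj.ds.sales` GROUP BY category"
)
```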

GROUP BY aggregations are executed in BigQuery in the following cases:

  • It follows a stage that has already been pushed down.
  • You configured it to be executed in SQL Engine (see the Stages to force execution option).

Deduplicate aggregations

Transformation Pushdown is available for deduplicate aggregations in Cloud Data Fusion version 6.7.0 and later for the following filter operations:

  • No filter operation is specified
  • ANY (a non-null value for the desired field)
  • MIN (the minimum value for the specified field)
  • MAX (the maximum value for the specified field)

The following operations are not supported:

  • FIRST
  • LAST

Deduplicate aggregations are executed in the SQL engine in the following cases:

  • It follows a stage that has already been pushed down.
  • You configured it to be executed in SQL Engine (see the Stages to force execution option).
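A deduplicate aggregation with a MIN or MAX filter operation can be expressed in BigQuery as a window query that keeps one row per unique key. The sketch below is an assumed formulation using ROW_NUMBER(); the table and field names are placeholders, and the real generated SQL may differ:

```python
from typing import Optional

def build_dedup_sql(table: str, unique_fields: list,
                    filter_op: str = "ANY",
                    filter_field: Optional[str] = None) -> str:
    """Sketch: keep one row per unique key. For MIN/MAX, order each
    partition by the filter field so row 1 is the record that survives."""
    partition = ", ".join(unique_fields)
    order = ""
    if filter_field and filter_op in ("MIN", "MAX"):
        direction = "DESC" if filter_op == "MAX" else "ASC"
        order = f" ORDER BY {filter_field} {direction}"
    return (
        "SELECT * EXCEPT(rn) FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY {partition}{order}) AS rn "
        f"FROM `{table}`) WHERE rn = 1"
    )

# Keep the most recently updated row per user (hypothetical table).
sql = build_dedup_sql("proj.ds.events", ["user_id"], "MAX", "updated_at")
```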

BigQuery Source Pushdown

BigQuery Source Pushdown is available in Cloud Data Fusion versions 6.8.0 and later.

When a BigQuery Source follows a stage that's compatible with BigQuery pushdown, the pipeline can execute all compatible stages within BigQuery.

Cloud Data Fusion copies the records necessary to execute the pipeline within BigQuery.

When you use BigQuery Source Pushdown, the table partitioning and clustering properties are preserved, which lets you use these properties to optimize further operations, such as joins.

Additional requirements

To use BigQuery Source Pushdown, the following requirements must be in place:

  • The service account configured for BigQuery Transformation Pushdown must have permissions to read tables in the BigQuery Source's dataset.

  • The datasets used in the BigQuery Source and the dataset configured for Transformation Pushdown must be stored in the same location.

Window aggregations

Transformation Pushdown is available for Window aggregations in Cloud Data Fusion versions 6.9 and later. Window aggregations in BigQuery are supported for the following operations:

  • Rank
  • Dense Rank
  • Percent Rank
  • N tile
  • Row Number
  • Median
  • Continuous Percentile
  • Lead
  • Lag
  • First
  • Last
  • Cumulative distribution
  • Accumulate

Window aggregations are executed in BigQuery in the following cases:

  • It follows a stage that has already been pushed down.
  • You configured it to be executed in SQL Engine (see the Stages to force pushdown option).
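For instance, a Rank window aggregation corresponds to BigQuery's RANK() analytic function. The snippet below assembles an assumed equivalent query over a hypothetical table, just to illustrate the mapping:

```python
# A Rank window aggregation maps to BigQuery's RANK() analytic function.
# Table and column names below are hypothetical.
partition_field, order_clause = "department", "salary DESC"
sql = (
    "SELECT *, RANK() OVER ("
    f"PARTITION BY {partition_field} ORDER BY {order_clause}"
    ") AS salary_rank FROM `proj.ds.employees`"
)
```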

Wrangler Filter Pushdown

Wrangler Filter Pushdown is available in Cloud Data Fusion versions 6.9 and later.

When using the Wrangler plugin, you can push filters, known as Precondition operations, to be executed in BigQuery instead of Spark.

Filter pushdown is only supported with the SQL mode for Preconditions, which was also released in version 6.9. In this mode, the plugin accepts a precondition expression in ANSI-standard SQL.

If the SQL mode is used for preconditions, Directives and User Defined Directives are disabled for the Wrangler plugin, as they're not supported with preconditions in SQL mode.

SQL mode for preconditions is unsupported for Wrangler plugins with multiple inputs when Transformation Pushdown is enabled. If used with multiple inputs, this Wrangler stage with SQL filter conditions is executed in Spark.
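A precondition in SQL mode is a boolean expression over a record's fields. The hypothetical filter below shows the idea; the Python list comprehension only emulates its semantics locally for illustration, since in the pipeline BigQuery evaluates the expression itself:

```python
# Hypothetical precondition as it might be written in SQL mode:
precondition = "age >= 18 AND country = 'US'"

# Local emulation of the same boolean expression, for illustration only.
rows = [
    {"age": 25, "country": "US"},
    {"age": 16, "country": "US"},
    {"age": 30, "country": "CA"},
]
passed = [r for r in rows if r["age"] >= 18 and r["country"] == "US"]
```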

Filters are executed in BigQuery in the following cases:

  • It follows a stage that has already been pushed down.
  • You configured it to be executed in SQL Engine (see the Stages to force pushdown option).

Metrics

For more information about the metrics that Cloud Data Fusion provides for the part of the pipeline that's executed in BigQuery, see BigQuery pushdown pipeline metrics.

When to use Transformation Pushdown

Executing transformations in BigQuery involves the following:

  1. Writing records into BigQuery for supported stages in your pipeline.
  2. Executing supported stages in BigQuery.
  3. Reading records from BigQuery after the supported transformations are executed, unless they are followed by a BigQuery Sink.

Depending on the size of your datasets, there might be considerable network overhead, which can have a negative impact on overall pipeline execution time when Transformation Pushdown is enabled.

Due to the network overhead, we recommend Transformation Pushdown in the following cases:

  • Multiple supported operations are executed in sequence (with no steps between the stages).
  • Performance gains from BigQuery executing the transformations, relative to Spark, outweigh the latency of data movement into and possibly out of BigQuery.

How it works

When you run a pipeline that uses Transformation Pushdown, Cloud Data Fusion executes supported transformation stages in BigQuery. All other stages in the pipeline are executed in Spark.

When executing transformations:

  1. Cloud Data Fusion loads the input datasets into BigQuery by writing records into Cloud Storage, and then executing a BigQuery load job.

  2. JOIN operations and supported transformations are then executed as BigQuery jobs using SQL statements.

  3. If further processing is needed after the jobs are executed, records can be exported from BigQuery to Spark. However, if the Attempt direct copy to BigQuery sinks option is enabled and the BigQuery Sink follows a stage that was executed in BigQuery, records are written directly to the destination BigQuery Sink table.
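The three steps above can be sketched as a conceptual sequence of jobs. The bucket, dataset, and SQL below are placeholders, and the real jobs that Cloud Data Fusion submits are managed internally; this sketch only shows how the export step disappears when a direct copy to the sink applies:

```python
def pushdown_plan(gcs_uri, dataset, pushed_sql, direct_copy):
    """Assemble the conceptual job sequence: load staged records from
    Cloud Storage, run the pushed-down SQL, then export the result back
    to Spark unless a direct copy into a BigQuery Sink table applies."""
    plan = [
        {"job": "load", "source": gcs_uri, "dest": f"{dataset}.staging"},
        {"job": "query", "sql": pushed_sql},
    ]
    if not direct_copy:
        plan.append({"job": "export", "source": f"{dataset}.result",
                     "dest": "spark"})
    return plan

# Hypothetical names throughout.
plan = pushdown_plan("gs://my-bucket/records/*.avro", "proj.ds",
                     "SELECT * FROM `proj.ds.staging`", direct_copy=False)
```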

The following diagram shows how Transformation Pushdown executes supported transformations in BigQuery instead of Spark.

Transformation Pushdown to BigQuery in Cloud Data Fusion pipelines.

Best practices

Adjust cluster and executor sizes

To optimize the resource management in your pipeline, do the following:

  • Use the right number of cluster workers (nodes) for a workload. In other words, get the most out of the provisioned Dataproc cluster by fully using the available CPU and memory for your instance, while also benefiting from the execution speed of BigQuery for large jobs.

  • Improve the parallelism in your pipelines by using autoscaling clusters.

  • Adjust your resource configurations in the stages of your pipeline where records are pushed or pulled from BigQuery during your pipeline execution.

Recommended: Experiment with increasing the number of CPU cores for your executor resources (up to the number of CPU cores that your worker node uses). The executors optimize CPU use during the serialization and deserialization steps as data goes in and out of BigQuery. For more information, see Cluster sizing.

A benefit of executing transformations in BigQuery is that your pipelines can run on smaller Dataproc clusters. If joins are the most resource-intensive operations in your pipeline, you can experiment with smaller cluster sizes, as the heavy JOIN operations are now performed in BigQuery, allowing you to potentially reduce your overall compute costs.

Retrieve data faster with the BigQuery Storage Read API

After BigQuery executes the transformations, your pipeline might have additional stages to execute in Spark. In Cloud Data Fusion version 6.7.0 and later, Transformation Pushdown supports the BigQuery Storage Read API, which improves latency and results in faster read operations into Spark. It can reduce the overall pipeline execution time.

The API reads records in parallel, so we recommend adjusting executor sizes accordingly. If resource-intensive operations are executed in BigQuery, reduce the memory allocation for the executors to improve parallelism when the pipeline runs (see Adjust cluster and executor sizes).

The BigQuery Storage Read API is disabled by default. You can enable it in execution environments where Scala 2.12 is installed (including Dataproc 2.0 and Dataproc 1.5).

Note: The BigQuery Storage Read API incurs additional charges (see Pricing).

Consider the dataset size

Consider the sizes of the datasets in the JOIN operations. For JOIN operations that generate a substantial number of output records, such as something that resembles a cross JOIN operation, the resulting dataset size might be orders of magnitude larger than the input dataset. Also, consider the overhead of pulling these records back into Spark when additional Spark processing for these records occurs, such as a transformation or a sink, in the context of the overall pipeline performance.
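To make the size consideration concrete: a cross join's output cardinality is the product of the input cardinalities, so even modest inputs can explode. The row counts below are hypothetical:

```python
# Hypothetical input sizes for a cross-join-like operation.
left_rows, right_rows = 50_000, 20_000

# Every left row pairs with every right row.
cross_join_rows = left_rows * right_rows
growth_vs_left = cross_join_rows // left_rows

print(cross_join_rows)  # 1000000000 output rows
print(growth_vs_left)   # 20000x larger than the left input
```

Each of those billion rows may need to be pulled back into Spark if a later stage isn't pushed down, which is exactly the network overhead discussed above.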

Mitigate skewed data

JOIN operations for heavily skewed data might cause the BigQuery job to exceed the resource utilization limits, which causes the JOIN operation to fail. To prevent this, go to the Joiner plugin settings and identify the skewed input in the Skewed Input Stage field. This lets Cloud Data Fusion arrange the inputs in a way that reduces the risk of the BigQuery statement exceeding the limits.

In the Joiner plugin settings, identify skewed data in the Skewed Input Stage field.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.