Preprocess data with MLTransform

This page explains why and how to use the MLTransform feature to prepare your data for training machine learning (ML) models. By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing operations to your workflow.

For information about using MLTransform for embedding generation tasks, see Generate embeddings with MLTransform.

Diagram of the Dataflow ML workflow with the data processing step highlighted.

Figure 1. The complete Dataflow ML workflow. Use MLTransform in the preprocessing step of the workflow.

Benefits

The MLTransform class provides the following benefits:

  • Transform your data without writing complex code or managing underlying libraries.
  • Efficiently chain multiple types of processing operations with one interface.
  • Generate embeddings that you can use to push data into vector databases or to run inference.

    For more information about embedding generation, see Generate embeddings with MLTransform.

Support and limitations

The MLTransform class has the following limitations:

  • Available for pipelines that use the Apache Beam Python SDK versions 2.53.0 and later.
  • Pipelines must use default windows.

Data processing transforms that use TensorFlow Transform (TFT):

  • Support Python 3.9, 3.10, and 3.11.
  • Support batch pipelines.

Use cases

The example notebooks demonstrate how to use MLTransform for embeddings-specific use cases.

I want to compute a vocabulary from a dataset
Compute a unique vocabulary from a dataset and then map each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.
I want to scale my data to train my ML model
Scale your data so that you can use it to train your ML model. The Apache Beam MLTransform class includes multiple data scaling transforms.
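
To make the two use cases above concrete, the following plain-Python sketch shows the kind of computation each one describes. It is an illustration only and does not use Apache Beam or MLTransform. The function names, the sample records, and the vocabulary ordering (most frequent token first, ties broken alphabetically) are assumptions made for this example and are not taken from the MLTransform implementation.

```python
from collections import Counter


def compute_vocabulary(records, column):
    """Map each distinct token in `column` to a distinct integer index.

    Ordering (by descending frequency, then alphabetically) is an
    assumption of this sketch, not necessarily what MLTransform does.
    """
    counts = Counter(token for record in records for token in record[column])
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(records, column, vocabulary):
    """Replace every token in `column` with its vocabulary index."""
    return [
        {**record, column: [vocabulary[token] for token in record[column]]}
        for record in records
    ]


def scale_to_0_1(records, column):
    """Min-max scale the numeric values in `column` to the range [0, 1]."""
    values = [value for record in records for value in record[column]]
    low, high = min(values), max(values)
    span = high - low
    return [
        {
            **record,
            # If every value is identical, map everything to 0.0.
            column: [(v - low) / span if span else 0.0 for v in record[column]],
        }
        for record in records
    ]


# Hypothetical sample records, keyed by column name.
text_data = [
    {'x': ['I', 'love', 'pie']},
    {'x': ['I', 'love', 'going', 'to', 'the', 'park']},
]
vocabulary = compute_vocabulary(text_data, 'x')
encoded = apply_vocabulary(text_data, 'x', vocabulary)
print(encoded)

numeric_data = [{'x': [1, 2, 3]}, {'x': [4, 5, 7]}, {'x': [10, 5, 7]}]
scaled = scale_to_0_1(numeric_data, 'x')
print(scaled)
```

Both functions need a full pass over the dataset (to collect the tokens, or to find the minimum and maximum) before any single record can be transformed.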

For a full list of available transforms, see Transforms in the Apache Beam documentation.

Use MLTransform

To use the MLTransform class to preprocess data, include the following code in your pipeline:

    import apache_beam as beam
    from apache_beam.ml.transforms.base import MLTransform
    from apache_beam.ml.transforms.tft import TRANSFORM_NAME

    # Input records to transform.
    data = [
        {
            DATA
        },
    ]

    # Cloud Storage location passed to MLTransform as write_artifact_location.
    artifact_location = 'gs://BUCKET_NAME'

    # The transform to apply, configured for column 'x'.
    TRANSFORM_FUNCTION_NAME = TRANSFORM_NAME(columns=['x'])

    with beam.Pipeline() as p:
        transformed_data = (
            p
            | beam.Create(data)
            | MLTransform(write_artifact_location=artifact_location).with_transform(
                TRANSFORM_FUNCTION_NAME)
            | beam.Map(print))

Replace the following values:

  • TRANSFORM_NAME: the name of the transform to use
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • DATA: the input data to transform
  • TRANSFORM_FUNCTION_NAME: the name that you assign to your transform function in your code


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.