Preprocess data with MLTransform
This page explains why and how to use the MLTransform feature to prepare your data for training machine learning (ML) models. By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing operations to your workflow.
For information about using MLTransform for embedding generation tasks, see Generate embeddings with MLTransform.

Figure: MLTransform in the preprocessing step of the workflow.
Benefits
The MLTransform class provides the following benefits:
- Transform your data without writing complex code or managing underlying libraries.
- Efficiently chain multiple types of processing operations with one interface.
- Generate embeddings that you can use to push data into vector databases or to run inference.
  For more information about embedding generation, see Generate embeddings with MLTransform.
Support and limitations
The MLTransform class has the following limitations:
- Available for pipelines that use the Apache Beam Python SDK versions 2.53.0 and later.
- Pipelines must use default windows.
Data processing transforms that use TensorFlow Transform (TFT):
- Support Python 3.9, 3.10, and 3.11.
- Support batch pipelines.
Use cases
The example notebooks demonstrate how to use MLTransform for embeddings-specific use cases.
- I want to compute a vocabulary from a dataset
  Compute a unique vocabulary from a dataset and then map each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning tasks.
- I want to scale my data to train my ML model
  Scale your data so that you can use it to train your ML model. The Apache Beam MLTransform class includes multiple data scaling transforms.
For a full list of available transforms, see Transforms in the Apache Beam documentation.
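To make the two use cases above concrete, the following plain-Python sketch shows the kind of result each one produces. It is an illustration only and does not use Apache Beam or MLTransform. The helper names (compute_vocabulary, apply_vocabulary, scale_to_0_1) and the sample rows are hypothetical, and the index ordering (most frequent token first, ties broken alphabetically) and the handling of a constant column are assumptions of this sketch that might differ from what the actual transforms do.

```python
from collections import Counter


def compute_vocabulary(rows, column):
    """Build a token -> integer index mapping for one column."""
    counts = Counter(token for row in rows for token in row[column])
    # Assumption: most frequent tokens get the lowest indices;
    # ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(rows, column, vocab):
    """Replace every token in the column with its integer index."""
    return [{**row, column: [vocab[t] for t in row[column]]} for row in rows]


def scale_to_0_1(rows, column):
    """Min-max scale a numeric column to the range [0, 1]."""
    values = [v for row in rows for v in row[column]]
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Assumption: a constant column maps to 0.0 in this sketch.
        return [{**row, column: [0.0 for _ in row[column]]} for row in rows]
    return [
        {**row, column: [(v - low) / span for v in row[column]]}
        for row in rows
    ]


# Hypothetical sample data, one dictionary per row.
text_rows = [
    {'x': ['I', 'love', 'pie']},
    {'x': ['I', 'love', 'going', 'to', 'the', 'park']},
]
numeric_rows = [
    {'x': [1, 5, 3]},
    {'x': [4, 2, 8]},
]

vocab = compute_vocabulary(text_rows, 'x')
encoded = apply_vocabulary(text_rows, 'x', vocab)
scaled = scale_to_0_1(numeric_rows, 'x')

print(vocab)
print(encoded)
print(scaled)
```

Both helpers need a full pass over the dataset (to collect the vocabulary or the minimum and maximum) before any row can be transformed. In a pipeline, that two-phase work is the part that MLTransform handles for you.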
Use MLTransform
To use the MLTransform class to preprocess data, include the following code in your pipeline:

    import apache_beam as beam
    from apache_beam.ml.transforms.base import MLTransform
    from apache_beam.ml.transforms.tft import TRANSFORM_NAME

    # Input rows: each element is a dictionary keyed by column name.
    data = [
        {
            DATA
        },
    ]

    # Cloud Storage location where MLTransform writes its artifacts.
    artifact_location = 'gs://BUCKET_NAME'
    TRANSFORM_FUNCTION_NAME = TRANSFORM_NAME(columns=['x'])

    with beam.Pipeline() as p:
        transformed_data = (
            p
            | beam.Create(data)
            | MLTransform(write_artifact_location=artifact_location).with_transform(
                TRANSFORM_FUNCTION_NAME)
            | beam.Map(print))

Replace the following values:
- TRANSFORM_NAME: the name of the transform to use
- BUCKET_NAME: the name of your Cloud Storage bucket
- DATA: the input data to transform
- TRANSFORM_FUNCTION_NAME: the name that you assign to your transform function in your code
What's next
- For more details about MLTransform, see Preprocess data in the Apache Beam documentation.
- For more examples, see MLTransform for data processing in the Apache Beam transform catalog.
- Run an interactive notebook in Colab.
Last updated 2026-02-19 UTC.