Generate embeddings with MLTransform
This page explains why and how to use the MLTransform feature to prepare your data for training machine learning (ML) models. Specifically, this page shows you how to process data by generating embeddings using MLTransform.
By combining multiple data processing transforms in one class, MLTransform streamlines the process of applying Apache Beam ML data processing operations to your workflow.

Figure: MLTransform in the preprocessing step of the workflow.

Embeddings overview
Embeddings are essential for modern semantic search and Retrieval Augmented Generation (RAG) applications. Embeddings let systems understand and interact with information on a deeper, more conceptual level. In semantic search, embeddings transform queries and documents into vector representations that capture their underlying meaning and relationships. Consequently, you can find relevant results even when keywords don't directly match, a significant leap beyond standard keyword-based search. You can also use embeddings for product recommendations, including multimodal searches that use images and text, log analytics, and tasks such as deduplication.
Within RAG, embeddings play a crucial role in retrieving the most relevant context from a knowledge base to ground the responses of large language models (LLMs). By embedding both the user's query and the chunks of information in the knowledge base, RAG systems can efficiently identify and retrieve the most semantically similar pieces. This semantic matching ensures that the LLM has access to the information it needs to generate accurate and informative answers.
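To make "semantically similar" concrete, the following standalone sketch scores toy embedding vectors with cosine similarity, the comparison most vector databases use during retrieval. The vectors, document names, and the `cosine_similarity` helper are invented for illustration; they are not part of the Dataflow or Apache Beam APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for a query and two documents.
# Real embedding models produce vectors with hundreds of dimensions.
query = [0.9, 0.1, 0.0]
doc_returns = [0.8, 0.2, 0.1]   # semantically close to the query
doc_recipes = [0.0, 0.1, 0.9]   # unrelated topic

scores = {
    'returns_policy': cosine_similarity(query, doc_returns),
    'soup_recipe': cosine_similarity(query, doc_recipes),
}
best = max(scores, key=scores.get)  # retrieval picks the highest score
```

Here `best` resolves to `'returns_policy'`, even though the sketch never compares keywords: retrieval ranks purely on vector geometry.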
Ingest and process data for embeddings

For core embedding use cases, the key consideration is how to ingest and process knowledge, either in a batch or a streaming manner. The source of this knowledge can vary widely. For example, the information can come from files stored in Cloud Storage, or from streaming sources like Pub/Sub or Google Cloud Managed Service for Apache Kafka.
For streaming sources, the data itself might be the raw content (for example, plain text) or URIs pointing to documents. Regardless of the source, the first stage typically involves preprocessing the information. For raw text, this might be minimal, such as basic data cleaning. However, for larger documents or more complex content, a crucial step is chunking. Chunking breaks the source material into smaller, manageable units. The optimal chunking strategy isn't standardized and depends on the specific data and application. Platforms like Dataflow offer built-in capabilities to handle diverse chunking needs, simplifying this essential preprocessing stage.
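At its simplest, chunking is a sliding window over characters. The following standalone sketch (the `chunk_text` helper is hypothetical, not a Dataflow or Apache Beam API) illustrates fixed-size chunks with overlap, the same `chunk_size` and `chunk_overlap` parameters that text splitters such as LangChain's expose:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character chunks.

    Consecutive chunks share `overlap` characters so that sentences
    cut at a boundary still appear whole in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 250-character toy document.
doc = ''.join(str(i % 10) for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
# Produces 4 chunks; the last 20 characters of each chunk repeat
# as the first 20 characters of the next one.
```

Production strategies usually split on semantic boundaries (sentences, paragraphs, headings) rather than raw character counts, which is why the optimal approach depends on the data.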
Benefits
The MLTransform class provides the following benefits:
- Generate embeddings that you can use to push data into vector databases or to run inference.
- Transform your data without writing complex code or managing underlying libraries.
- Efficiently chain multiple types of processing operations with one interface.
Support and limitations
The MLTransform class has the following limitations:
- Available for pipelines that use the Apache Beam Python SDK versions 2.53.0 and later.
- Pipelines must use default windows.
Text embedding transforms:
- Support Python 3.8, 3.9, 3.10, 3.11, and 3.12.
- Support both batch and streaming pipelines.
- Support the Vertex AI text-embeddings API and the Hugging Face Sentence Transformers module.
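As a minimal configuration sketch, the two supported embedding backends are typically wired into MLTransform as follows. The model names are illustrative, and actually running the pipeline requires Vertex AI credentials or a local Sentence Transformers model download, so treat this as a shape reference rather than a copy-paste recipe:

```python
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.huggingface import (
    SentenceTransformerEmbeddings,
)
from apache_beam.ml.transforms.embeddings.vertex_ai import (
    VertexAITextEmbeddings,
)

# Option 1: Vertex AI text-embeddings API (model name is illustrative).
vertex_embedder = VertexAITextEmbeddings(
    model_name='textembedding-gecko@003',
    columns=['text'],  # dict keys whose values get embedded
)

# Option 2: Hugging Face Sentence Transformers (model name is illustrative).
hf_embedder = SentenceTransformerEmbeddings(
    model_name='all-MiniLM-L6-v2',
    columns=['text'],
)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'text': 'hello world'}])
        | MLTransform(
            write_artifact_location=tempfile.mkdtemp()
        ).with_transform(hf_embedder)  # or vertex_embedder
        | beam.Map(print)
    )
```

Either embedder plugs into the same `MLTransform(...).with_transform(...)` call, which is what lets you swap backends without restructuring the pipeline.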
Use cases
The example notebooks demonstrate how to use MLTransform for specific use cases.
- I want to generate text embeddings for my LLM by using Vertex AI

  Use the Apache Beam MLTransform class with the Vertex AI text-embeddings API to generate text embeddings. Text embeddings are a way to represent text as numerical vectors, which is necessary for many natural language processing (NLP) tasks.

- I want to generate text embeddings for my LLM by using Hugging Face

  Use the Apache Beam MLTransform class with Hugging Face Hub models to generate text embeddings. The Hugging Face SentenceTransformers framework uses Python to generate sentence, text, and image embeddings.

- I want to generate text embeddings and ingest them into AlloyDB for PostgreSQL

  Use Apache Beam, specifically its MLTransform class, with Hugging Face Hub models to generate text embeddings. Then, use the VectorDatabaseWriteTransform to load these embeddings and associated metadata into AlloyDB for PostgreSQL. This notebook demonstrates building scalable batch and streaming Beam data pipelines for populating an AlloyDB for PostgreSQL vector database. This includes handling data from various sources like Pub/Sub or existing database tables, making custom schemas, and updating data.

- I want to generate text embeddings and ingest them into BigQuery

  Use the Apache Beam MLTransform class with Hugging Face Hub models to generate text embeddings from application data, such as a product catalog. The Apache Beam HuggingfaceTextEmbeddings transform is used for this. This transform uses the Hugging Face SentenceTransformers framework, which provides models for generating sentence and text embeddings. These generated embeddings and their metadata are then ingested into BigQuery using the Apache Beam VectorDatabaseWriteTransform. The notebook also demonstrates vector similarity searches in BigQuery using the Enrichment transform.
For a full list of available transforms, see Transforms in the Apache Beam documentation.
Use MLTransform for embedding generation
To use the MLTransform class to chunk information and generate embeddings, include the following code in your pipeline:
```python
import tempfile
from typing import Any, Dict

import apache_beam as beam
from apache_beam.ml.rag.ingestion.base import VectorDatabaseWriteTransform
from apache_beam.ml.rag.types import Chunk, Content
from apache_beam.ml.transforms.base import MLTransform

def create_chunk(product: Dict[str, Any]) -> Chunk:
    return Chunk(
        content=Content(text=f"{product['name']}: {product['description']}"),
        id=product['id'],  # Use product ID as chunk ID
        metadata=product,  # Store all product info in metadata
    )

# [...] (elided: the pipeline below assumes products, huggingface_embedder,
# and alloydb_config are defined here)

with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.Create(products)
        | 'Convert to Chunks' >> beam.Map(create_chunk)
        | 'Generate Embeddings' >> MLTransform(
            write_artifact_location=tempfile.mkdtemp()
        ).with_transform(huggingface_embedder)
        | 'Write to AlloyDB' >> VectorDatabaseWriteTransform(alloydb_config)
    )
```

The previous example creates a single chunk per element, but you can also use LangChain to create chunks instead:
```python
import apache_beam as beam
from apache_beam.ml.rag.chunking.langchain import LangChainChunker
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
provider = LangChainChunker(
    document_field='content',
    metadata_fields=[],
    text_splitter=splitter,
)

# The pipeline below assumes products is a path to input text files.
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create Products' >> beam.io.textio.ReadFromText(products)
        | 'Convert to Chunks' >> provider.get_ptransform_for_processing()
    )
```

What's next
- Read the blog post "How to enable real time semantic search and RAG applications with Dataflow ML".
- For more details about MLTransform, see Preprocess data in the Apache Beam documentation.
- For more examples, see MLTransform for data processing in the Apache Beam transform catalog.
- Run an interactive notebook in Colab.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.