Use Vertex AI RAG Engine in Gemini Live API

Experimental

Some of the RAG features are Experimental offerings, subject to the "Pre-GA Offerings Terms" of the Google Cloud Service Specific Terms. Pre-GA products and features are available "as-is" and may have limited support, and changes to Pre-GA products and features may not be compatible with other Pre-GA versions. For more information, see the launch stage descriptions.

The VPC-SC security controls and CMEK are supported by Vertex AI RAG Engine. Data residency and AXT security controls aren't supported.

Retrieval-augmented generation (RAG) is a technique that's used to retrieve and provide relevant information to LLMs to generate verifiable responses. The information can include fresh information, a topic and context, or ground truth.

This page shows you how to use Vertex AI RAG Engine with the Gemini Live API, which lets you specify and retrieve information from the RAG corpus.

Prerequisites

The following prerequisites must be completed before you can use Vertex AI RAG Engine with the multimodal Live API:

  1. Enable the RAG API in Vertex AI.

  2. Create a RAG corpus. For an example, see Create a RAG corpus.

  3. To upload files to the RAG corpus, see the Import RAG files API example. A minimal end-to-end sketch of these steps follows this list.
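The following is a minimal sketch of these prerequisite steps using the Vertex AI SDK for Python. The project, location, display name, and Cloud Storage path are placeholder assumptions, not values from this guide; confirm the import_files arguments against the SDK version you have installed.

# Prerequisite sketch with placeholder values; not a definitive setup.
# Enable the Vertex AI API first, for example:
#   gcloud services enable aiplatform.googleapis.com
import vertexai
from vertexai import rag

vertexai.init(project="YOUR_PROJECT_ID", location="YOUR_LOCATION")

# 1. Create a RAG corpus.
rag_corpus = rag.create_corpus(display_name="live-api-rag-corpus")

# 2. Import files into the corpus (hypothetical Cloud Storage path).
rag.import_files(
    rag_corpus.name,
    paths=["gs://YOUR_BUCKET/YOUR_FILE.pdf"],
)

# The corpus resource name is what the Live API tool configuration references.
print(rag_corpus.name)  # projects/.../locations/.../ragCorpora/...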

Set up

You can use Vertex AI RAG Engine with the Live API by specifying it as a retrieval tool. The following code sample demonstrates how to do this:

Replace the following variables:

  • YOUR_PROJECT_ID: The ID of your Google Cloud project.
  • YOUR_CORPUS_ID: The ID of your corpus.
  • YOUR_LOCATION: The region to process the request.
PROJECT_ID = "YOUR_PROJECT_ID"
RAG_CORPUS_ID = "YOUR_CORPUS_ID"
LOCATION = "YOUR_LOCATION"

TOOLS = {
    "retrieval": {
        "vertex_rag_store": {
            "rag_resources": {
                "rag_corpus": f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}"
            }
        }
    }
}
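If you prefer the typed classes from the Google Gen AI SDK over a raw dictionary, the same tool can be expressed as shown below. This is a hedged sketch that mirrors the typed style used in the memory-corpus example later on this page; similarity_top_k is an optional tuning parameter added here as an assumption.

# Hedged sketch: the retrieval tool above expressed with google-genai types.
from google.genai import types

rag_store = types.VertexRagStore(
    rag_resources=[
        types.VertexRagStoreRagResource(
            rag_corpus=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}"
        )
    ],
    similarity_top_k=10,  # optional; assumed value for illustration
)
rag_tool = types.Tool(retrieval=types.Retrieval(vertex_rag_store=rag_store))
# Pass [rag_tool] as the tools list when connecting a Live API session.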

Use WebSocket for real-time communication

To enable real-time communication between a client and a server, you must use a WebSocket. These code samples demonstrate how to use a WebSocket with the Python API and the Python SDK.

Python API
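The sample below assumes that json, connect (from the websockets package), the IPython display helpers, and a bearer_token list are already defined, along with the TOOLS and LOCATION values from the previous section. A minimal setup sketch for those assumed names, using google-auth to obtain an access token:

# Setup sketch for the names assumed by the sample below.
import json

import google.auth
from google.auth.transport.requests import Request
from IPython.display import Markdown, display
# Newer websockets releases; older versions expose websockets.connect with extra_headers instead.
from websockets.asyncio.client import connect

# Fetch a bearer token for the Authorization header. The sample reads
# bearer_token[0], so the token is stored in a one-element list.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())
bearer_token = [credentials.token]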

CONFIG = {"response_modalities": ["TEXT"], "speech_config": {"language_code": "en-US"}}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

HOST = f"{LOCATION}-aiplatform.googleapis.com"
SERVICE_URL = f"wss://{HOST}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent"

MODEL = "gemini-2.0-flash-exp"

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:

    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": MODEL,
                    "generation_config": CONFIG,
                    # Setup RAG as a retrieval tool
                    "tools": TOOLS,
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "What are popular LLMs?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                display(Markdown(f"**parts >** {parts}"))
                responses.append(parts[0]["text"])

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            grounding_metadata = server_content.pop("groundingMetadata", None)
            if grounding_metadata is not None:
                grounding_chunks = grounding_metadata.pop("groundingChunks", None)
                if grounding_chunks is not None:
                    for chunk in grounding_chunks:
                        display(Markdown(f"**grounding_chunk >** {chunk}"))
            break

    # Print the server response
    display(Markdown(f"**Response >** {''.join(responses)}"))

Python SDK

To learn how to install the generative AI SDK, see Install a library:
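For quick reference, the google.genai imports in the next sample come from the google-genai package on PyPI; this is an assumption about your environment rather than part of the linked guide.

# The Google Gen AI SDK is distributed on PyPI as "google-genai".
# In a notebook cell:
#   %pip install --upgrade google-genai
# From a shell:
#   pip install --upgrade google-genai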

from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    LiveConnectConfig,
    HttpOptions,
    Modality,
    Part,
)
from IPython import display

MODEL = "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

async with client.aio.live.connect(
    model=MODEL,
    config=LiveConnectConfig(response_modalities=[Modality.TEXT], tools=TOOLS),
) as session:
    text_input = "What are core LLM techniques?"
    print("> ", text_input, "\n")

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue

Use Vertex AI RAG Engine as the context store

You can use Vertex AI RAG Engine as the context store for the Gemini Live API to store session context, forming and retrieving past contexts that are related to your conversation and enriching the current context for model generation. You can also take advantage of this feature to share contexts across your different Live API sessions.

Vertex AI RAG Engine supports storing and indexing the following forms of data from session contexts:

  • Text
  • Audio speech

Create a MemoryCorpus type corpus

To store and index conversation texts from the session context, you must create a RAG corpus of the MemoryCorpus type. You must also specify an LLM parser in your memory corpus configuration that's used to parse session contexts stored from the Live API to build memory for indexing.

The following code sample demonstrates how to create a memory corpus. Before you run it, replace the variables with your values.

# Currently supports Google first-party embedding models
EMBEDDING_MODEL = YOUR_EMBEDDING_MODEL  # Such as "publishers/google/models/text-embedding-005"
MEMORY_CORPUS_DISPLAY_NAME = YOUR_MEMORY_CORPUS_DISPLAY_NAME
LLM_PARSER_MODEL_NAME = YOUR_LLM_PARSER_MODEL_NAME  # Such as "projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.5-pro-preview-05-06"

memory_corpus = rag.create_corpus(
    display_name=MEMORY_CORPUS_DISPLAY_NAME,
    corpus_type_config=rag.RagCorpusTypeConfig(
        corpus_type_config=rag.MemoryCorpus(
            llm_parser=rag.LlmParserConfig(
                model_name=LLM_PARSER_MODEL_NAME,
            )
        )
    ),
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=rag.RagEmbeddingModelConfig(
            vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
                publisher_model=EMBEDDING_MODEL
            )
        )
    ),
)

Specify your memory corpus to store contexts

When using your memory corpus with the Live API, you must specify the memory corpus as a retrieval tool and then set store_context to true to allow the Live API to store the session contexts.

The following code sample demonstrates how to specify your memory corpus to store contexts. Before you run it, replace the variables with your values.

from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    LiveConnectConfig,
    HttpOptions,
    Modality,
    Part,
)
from IPython import display

PROJECT_ID = YOUR_PROJECT_ID
LOCATION = YOUR_LOCATION
TEXT_INPUT = YOUR_TEXT_INPUT
MODEL_NAME = YOUR_MODEL_NAME  # Such as "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

memory_store = types.VertexRagStore(
    rag_resources=[
        types.VertexRagStoreRagResource(rag_corpus=memory_corpus.name)
    ],
    store_context=True,
)

async with client.aio.live.connect(
    model=MODEL_NAME,
    config=LiveConnectConfig(
        response_modalities=[Modality.TEXT],
        tools=[types.Tool(retrieval=types.Retrieval(vertex_rag_store=memory_store))],
    ),
) as session:
    text_input = TEXT_INPUT
    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue
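Because store_context is set to true, the session's contexts are written to the memory corpus. A later Live API session that specifies the same memory corpus as its retrieval tool can then retrieve those stored memories, which is how contexts are shared across different Live API sessions.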
