RAG infrastructure for generative AI using Vertex AI and Vector Search
This document provides a reference architecture that you can use to design the infrastructure for a generative AI application with retrieval-augmented generation (RAG) by using Vector Search. Vector Search is a fully managed Google Cloud service that provides optimized serving infrastructure for very large-scale vector-similarity matching.
The intended audience for this document includes architects, developers, and administrators of generative AI applications. The document assumes a basic understanding of AI, machine learning (ML), and large language model (LLM) concepts. This document doesn't provide guidance about how to design and develop a generative AI application.
Note: Depending on your requirements for application hosting and AI infrastructure, you can consider other architectural approaches for your RAG-capable generative AI application. For more information, see the "Design alternatives" section later in this document.

Architecture
The following diagram shows a high-level view of the architecture that this document presents:
The architecture in the preceding diagram has two subsystems: data ingestion and serving.
- The data ingestion subsystem ingests data that's uploaded from external sources. The subsystem prepares the data for RAG and interacts with Vertex AI to generate embeddings for the ingested data and to build and update the vector index.
- The serving subsystem contains the generative AI application's frontend and backend services.
- The frontend service handles the query-response flow with application users and forwards queries to the backend service.
- The backend service uses Vertex AI to generate query embeddings, perform vector-similarity search, and apply Responsible AI safety filters and system instructions.
The following diagram shows a detailed view of the architecture:
The following sections describe the data flow within each subsystem of the preceding architecture diagram.
Data ingestion subsystem
The data ingestion subsystem ingests data from external sources and prepares the data for RAG. The following are the steps in the data-ingestion and preparation flow:
- Data is uploaded from external sources to a Cloud Storage bucket. The external sources might be applications, databases, or streaming services.
- When data is uploaded to Cloud Storage, a message is published to a Pub/Sub topic.
- When the Pub/Sub topic receives a message, it triggers a Cloud Run function.
- The Cloud Run function parses the raw data, formats it as required, and divides it into chunks.
- The function uses the Vertex AI Embeddings API to create embeddings of the chunks by using an embedding model that you specify. Vertex AI supports text and multimodal embedding models.
- The function then builds a Vector Search index of the embeddings and deploys the index.
When new data is ingested, the preceding steps are performed for the new data, and the index is updated by using streaming updates.
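To make this flow concrete, the following Python sketch shows how a Pub/Sub-triggered Cloud Run function might chunk an uploaded file, generate embeddings, and stream them into a Vector Search index. It's a minimal illustration, not the reference implementation: the project, region, and index names, the text-embedding-005 model, and the fixed-size chunking strategy are all assumptions, and error handling and the Embeddings API batching limits are omitted.

```python
import functions_framework
import vertexai
from google.cloud import aiplatform, storage
from google.cloud.aiplatform_v1.types import IndexDatapoint
from vertexai.language_models import TextEmbeddingModel

# Hypothetical values; replace with your own project, region, and index.
PROJECT_ID = "my-project"
REGION = "us-central1"
INDEX_NAME = "projects/my-project/locations/us-central1/indexes/123"
CHUNK_SIZE = 1000  # characters per chunk; tune for your data

vertexai.init(project=PROJECT_ID, location=REGION)

@functions_framework.cloud_event
def ingest(cloud_event):
    """Handles a Pub/Sub message that announces a new Cloud Storage object."""
    # Cloud Storage Pub/Sub notifications carry the bucket and object
    # names in the message attributes.
    attrs = cloud_event.data["message"]["attributes"]
    blob = storage.Client().bucket(attrs["bucketId"]).blob(attrs["objectId"])

    # 1. Parse the raw data and divide it into chunks.
    text = blob.download_as_text()
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

    # 2. Generate embeddings with a Vertex AI embedding model.
    #    (In practice, call get_embeddings in batches within the API limits.)
    model = TextEmbeddingModel.from_pretrained("text-embedding-005")
    embeddings = model.get_embeddings(chunks)

    # 3. Upsert the embeddings into the Vector Search index (streaming update).
    index = aiplatform.MatchingEngineIndex(index_name=INDEX_NAME)
    index.upsert_datapoints(
        datapoints=[
            IndexDatapoint(
                datapoint_id=f"{attrs['objectId']}-{i}",
                feature_vector=embedding.values,
            )
            for i, embedding in enumerate(embeddings)
        ]
    )
```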
When the serving subsystem processes user requests, it uses the Vector Search index for vector-similarity search. The next section describes the serving flow.
Serving subsystem
The serving subsystem handles the query-response flow between the generative AIapplication and its users. The following are the steps in the serving flow:
- A user submits a natural-language query to a Cloud Run service that provides a frontend interface (such as a chatbot) for the generative AI application.
- The frontend service forwards the user query to a backend Cloud Run service.
- The backend service processes the query by doing the following:
- Converts the query to embeddings by using the same embeddings model and parameters that the data ingestion subsystem uses to generate embeddings of the ingested data.
- Retrieves relevant grounding data by performing a vector-similarity search for the query embeddings in the Vector Search index.
- Constructs an augmented prompt by combining the original query with the grounding data.
- Sends the augmented prompt to an LLM that's deployed on Vertex AI.
- The LLM generates a response.
- For each prompt, Vertex AI applies the Responsible AI safety filters that you've configured and then sends the filtered response and AI safety scores to the Cloud Run backend service.
- The application sends the response to the user through the Cloud Run frontend service.
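The following Python sketch outlines how the backend service's steps might look with the Vertex AI SDK. It's a hedged sketch under stated assumptions, not the reference implementation: the endpoint, deployed-index ID, and model names are placeholders, and lookup_chunk is a hypothetical helper for fetching chunk text from your document store, because the index stores only IDs and vectors.

```python
import vertexai
from google.cloud import aiplatform
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingModel

# Hypothetical values; replace with your own resources.
PROJECT_ID = "my-project"
REGION = "us-central1"
ENDPOINT_NAME = "projects/my-project/locations/us-central1/indexEndpoints/987"
DEPLOYED_INDEX_ID = "rag_deployed_index"

vertexai.init(project=PROJECT_ID, location=REGION)

def lookup_chunk(chunk_id: str) -> str:
    """Hypothetical helper: fetch the chunk text for a datapoint ID."""
    return f"<text of chunk {chunk_id}>"

def answer(query: str) -> str:
    # 1. Embed the query with the same model that the ingestion subsystem used.
    model = TextEmbeddingModel.from_pretrained("text-embedding-005")
    query_embedding = model.get_embeddings([query])[0].values

    # 2. Retrieve grounding data with a vector-similarity search.
    endpoint = aiplatform.MatchingEngineIndexEndpoint(
        index_endpoint_name=ENDPOINT_NAME
    )
    neighbors = endpoint.find_neighbors(
        deployed_index_id=DEPLOYED_INDEX_ID,
        queries=[query_embedding],
        num_neighbors=5,
    )
    grounding = "\n".join(lookup_chunk(n.id) for n in neighbors[0])

    # 3. Build an augmented prompt and send it to an LLM on Vertex AI.
    prompt = (
        "Answer the question using only the following context.\n\n"
        f"Context:\n{grounding}\n\nQuestion: {query}"
    )
    llm = GenerativeModel("gemini-2.0-flash")
    return llm.generate_content(prompt).text
```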
You can store and view logs of the query-response activity in Cloud Logging, and you can set up logs-based monitoring by using Cloud Monitoring. You can also load the generated responses into BigQuery for offline analytics.
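As a sketch of the offline-analytics option, the following snippet streams a query-response record into a BigQuery table by using insert_rows_json. The table name and schema are assumptions; depending on your volume, you might prefer batch loads or a Log Router sink instead.

```python
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.rag_analytics.query_responses"  # hypothetical table

def log_interaction(query: str, response: str) -> None:
    """Streams one query-response record into BigQuery for offline analysis."""
    errors = client.insert_rows_json(
        TABLE_ID, [{"query": query, "response": response}]
    )
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```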
The Vertex AI prompt optimizer helps you improve prompts at scale, both during initial prompt design and for ongoing prompt tuning. The prompt optimizer evaluates your model's response to a set of sample prompts that ML engineers provide. The output of the evaluation includes the model's responses to the sample prompts, scores for metrics that the ML engineers specify, and a set of optimized system instructions that you can consider using.
Products used
This reference architecture uses the following Google Cloud products:
- Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
- Vector Search: A vector similarity-matching service that lets you store, index, and search semantically similar or related data.
- Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
- Cloud Run functions: A serverless compute platform that lets you run single-purpose functions directly in Google Cloud.
- Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
- Pub/Sub: An asynchronous and scalable messaging service that decouples services that produce messages from services that process those messages.
- Cloud Logging: A real-time log management system with storage, search, analysis, and alerting.
- Cloud Monitoring: A service that provides visibility into the performance, availability, and health of your applications and infrastructure.
- BigQuery: An enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.
Use cases
RAG is an effective technique to improve the quality of output that's generated from an LLM. This section provides examples of use cases for which you can use RAG-capable generative AI applications.
Personalized product recommendations
An online shopping site might use an LLM-powered chatbot to assist customers with finding products or getting shopping-related help. The questions from a user can be augmented by using historical data about the user's buying behavior and website interaction patterns. The data might include user reviews and feedback that's stored in an unstructured datastore or search-related metrics that are stored in a web analytics data warehouse. The augmented question can then be processed by the LLM to generate personalized responses that the user might find more appealing and compelling.
Clinical assistance systems
Doctors in hospitals need to quickly analyze and diagnose a patient's health condition to make decisions about appropriate care and medication. A generative AI application that uses a medical LLM like Med-PaLM can be used to assist doctors in their clinical diagnosis process. The responses that the application generates can be grounded in historical patient records by contextualizing the doctors' prompts with data from the hospital's electronic health record (EHR) database or from an external knowledge base like PubMed.
Efficient legal research
Generative AI-powered legal research lets lawyers quickly query large volumes of statutes and case law to identify relevant legal precedents or summarize complex legal concepts. The output of such research can be enhanced by augmenting a lawyer's prompts with data that's retrieved from the law firm's proprietary corpus of contracts, past legal communication, and internal case records. This design approach helps ensure that the generated responses are relevant to the legal domain that the lawyer specializes in.
Design alternatives
This section presents alternative design approaches that you can consider for your RAG-capable generative AI application in Google Cloud.
AI infrastructure alternatives
If you want to take advantage of the vector store capabilities of a fully managed Google Cloud database like AlloyDB for PostgreSQL or Cloud SQL for your RAG application, then see RAG infrastructure for generative AI using Vertex AI and AlloyDB for PostgreSQL.
If you want to rapidly build and deploy RAG-capable generative AI applications by using open source tools and models like Ray, Hugging Face, and LangChain, then see RAG infrastructure for generative AI using GKE and Cloud SQL.
Application hosting options
In the architecture that's shown in this document, Cloud Run is the host for the generative AI application and data processing. Cloud Run is a developer-focused, fully managed application platform. If you need greater configuration flexibility and control over the compute infrastructure, you can deploy your application to GKE clusters or to Compute Engine VMs.
The decision of whether to use Cloud Run, GKE, or Compute Engine as your application host involves trade-offs between configuration flexibility and management effort. With the serverless Cloud Run option, you deploy your application to a preconfigured environment that requires minimal management effort. With Compute Engine VMs and GKE containers, you're responsible for managing the underlying compute resources, but you have greater configuration flexibility and control. For more information about choosing an appropriate application hosting service, see the following documents:
- Is my app a good fit for Cloud Run?
- Select a managed container runtime environment
- Hosting Applications on Google Cloud
Other options
For information about other infrastructure options, supported models, and grounding techniques that you can use for generative AI applications in Google Cloud, see Choose models and infrastructure for your generative AI application.
Design considerations
This section describes design factors, best practices, and design recommendations that you should consider when you use this reference architecture to develop a topology that meets your specific requirements for security, reliability, cost, and performance.
The guidance in this section isn't exhaustive. Depending on the specific requirements of your application and the Google Cloud and third-party products and features that you use, there might be additional design factors and trade-offs that you should consider.
Security, compliance, and privacy
This section describes design considerations and recommendations to design a topology in Google Cloud that meets the security and compliance requirements of your workloads.
| Product | Design considerations and recommendations |
|---|---|
| Vertex AI | Security controls: Vertex AI supports Google Cloud security controls that you can use to meet your requirements for data residency, data encryption, network security, and access transparency. For more information, see Security controls for Vertex AI and Security controls for Generative AI. Model access: You can set up organization policies to limit the type and versions of LLMs that can be used in a Google Cloud project. For more information, see Control access to Model Garden models. Shared responsibility: Vertex AI secures the underlying infrastructure and provides tools and security controls to help you protect your data, code, and models. For more information, see Vertex AI shared responsibility. Data protection: Use the Cloud Data Loss Prevention API to discover and de-identify sensitive data, such as personally identifiable information (PII), in the prompts and responses and in log data. For more information, see this video: Protecting sensitive data in AI apps. |
| Cloud Run | Ingress security (frontend service): To control external access to the application, disable the default run.app URL of the frontend Cloud Run service and set up a regional external Application Load Balancer. Along with load-balancing incoming traffic to the application, the load balancer handles SSL certificate management. For added protection, you can use Google Cloud Armor security policies to provide request filtering, DDoS protection, and rate limiting for the service. Ingress security (backend service): The Cloud Run service for the application's backend in this architecture doesn't need access from the internet. To ensure that only internal clients can access the service, set the ingress setting of the service to internal. Data encryption: By default, Cloud Run encrypts data by using a Google-owned and Google-managed encryption key. To protect your containers by using a key that you control, you can use customer-managed encryption keys (CMEK). For more information, see Using customer managed encryption keys. Container image security: To ensure that only authorized container images are deployed to the Cloud Run services, you can use Binary Authorization. Data residency: Cloud Run helps you to meet data residency requirements. Cloud Run container instances run within the region that you select. For more guidance about container security, see General Cloud Run development tips. |
| Cloud Storage | Data encryption: By default, the data that's stored in Cloud Storage is encrypted using Google-owned and Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options. Access control: Cloud Storage supports two methods for controlling user access to your buckets and objects: Identity and Access Management (IAM) and access control lists (ACLs). In most cases, we recommend using IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control. Data protection: The data that you load into the data ingestion subsystem through Cloud Storage might include sensitive data. To protect such data, you can use Sensitive Data Protection to discover, classify, and de-identify the data. For more information, see Using Sensitive Data Protection with Cloud Storage. Network control: To mitigate the risk of data exfiltration from Cloud Storage, you can create a service perimeter by using VPC Service Controls. Data residency: Cloud Storage helps you to meet data residency requirements. Data is stored or replicated within the regions that you specify. |
| Pub/Sub | Data encryption: By default, Pub/Sub encrypts all messages, both at rest and in transit, by using Google-owned and Google-managed encryption keys. Pub/Sub supports the use of CMEKs for message encryption at the application layer. For more information, see Configure message encryption. Data residency: If you have data residency requirements, in order to ensure that message data is stored in specific locations, you can configure message storage policies. |
| Cloud Logging | Administrative activity audit: Logging of administrative activity is enabled by default for all of the Google Cloud services that are used in this reference architecture. You can access the logs through Cloud Logging and use the logs to monitor API calls or other actions that modify the configuration or metadata of Google Cloud resources. Data access audit: Logging of data access events is enabled by default for BigQuery. For the other services that are used in this architecture, you can enable Data Access audit logs. You can use these logs to monitor calls that read the configuration or metadata of resources, and calls that read or write user-provided data. Security of log data: Google doesn't access or use the data in Cloud Logging. Data residency: To help meet data residency requirements, you can configure Cloud Logging to store log data in the region that you specify. For more information, see Regionalize your logs. |
| All of the products in the architecture | Mitigate data exfiltration risk: To reduce the risk of data exfiltration, create a VPC Service Controls perimeter around the infrastructure. VPC Service Controls supports all of the services that are used in this reference architecture. Post-deployment optimization: After you deploy your application in Google Cloud, use the Active Assist service to get recommendations that can help you to further optimize the security of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist. Access control: Follow the principle of least privilege for every cloud service. |
For general guidance regarding security for AI and ML deployments in Google Cloud, see the following resources:
- (Blog) Introducing Google's Secure AI Framework
- (Documentation) AI and ML security perspective in the Google Cloud Well-Architected Framework
- (Documentation) Vertex AI shared responsibility
- (Whitepaper) Generative AI, Privacy, and Google Cloud
- (Video) Protecting sensitive data in AI apps
Reliability
This section describes design considerations and recommendations to build and operate reliable infrastructure for your deployment in Google Cloud.
| Product | Design considerations and recommendations |
|---|---|
| Vector Search | Query scaling: To make sure that the Vector Search index can handle increases in query load, you can configure autoscaling for the index endpoint. When the query load increases, the number of nodes is increased automatically up to the maximum that you specify. For more information, see Enable autoscaling. A minimal deployment sketch that enables autoscaling is shown after this table. |
| Cloud Run | Robustness to infrastructure outages: Cloud Run is a regional service. Data is stored synchronously across multiple zones within a region. Traffic is automatically load-balanced across the zones. If a zone outage occurs, Cloud Run continues to run and data isn't lost. If a region outage occurs, Cloud Run stops running until Google resolves the outage. |
| Cloud Storage | Data availability: You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. Data that's stored in regional buckets is replicated synchronously across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where data is replicated asynchronously across regions. |
| Pub/Sub | Rate control: To avoid errors during periods of transient spikes in message traffic, you can limit the rate of publish requests by configuring flow control in the publisher settings. Failure handling: To handle failed publish attempts, adjust the retry-request variables as necessary. For more information, see Retry requests. |
| BigQuery | Robustness to infrastructure outages: Data that you load into BigQuery is stored synchronously in two zones within the region that you specify. This redundancy helps to ensure that your data isn't lost when a zone outage occurs. For more information about reliability features in BigQuery, see Understand reliability. |
| All of the products in the architecture | Post-deployment optimization: After you deploy your application in Google Cloud, use the Active Assist service to get recommendations to further optimize the reliability of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist. |
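As referenced in the Vector Search row of the preceding table, the following sketch deploys an index to an endpoint with autoscaling enabled: setting both a minimum and maximum replica count lets Vector Search add nodes as query load grows. The resource names and machine type are assumptions.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical

endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/my-project/locations/us-central1/indexEndpoints/987"
)
index = aiplatform.MatchingEngineIndex(
    index_name="projects/my-project/locations/us-central1/indexes/123"
)

# Setting min and max replica counts enables autoscaling: nodes are added
# automatically as query load increases, up to max_replica_count.
endpoint.deploy_index(
    index=index,
    deployed_index_id="rag_deployed_index",
    machine_type="e2-standard-16",
    min_replica_count=2,
    max_replica_count=10,
)
```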
For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.
Cost optimization
This section provides guidance to optimize the cost of setting up and operating a Google Cloud topology that you build by using this reference architecture.
| Product | Design considerations and recommendations |
|---|---|
| Vector Search | Billing for Vector Search depends on the size of your index, queries per second (QPS), and the number and machine type of the nodes that you use for the index endpoint. For high-QPS workloads, batching the queries can help to reduce cost. For information about how you can estimate Vector Search cost, see Vector Search pricing examples. To improve the utilization of the compute nodes on which the Vector Search index is deployed, you can configure autoscaling for the index endpoint. When demand is low, the number of nodes is reduced automatically to the minimum that you specify. For more information, see Enable autoscaling. |
| Cloud Run | When you create Cloud Run services, you specify the amount of memory and CPU to be allocated to the container instance. To control costs, start with the default (minimum) CPU and memory allocations. To improve performance, you can increase the allocation by configuring the CPU limit and memory limit. For more information, see the Cloud Run documentation about configuring CPU limits and memory limits. If you can predict the CPU and memory requirements of your Cloud Run services, then you can save money by getting discounts for committed usage. For more information, see Cloud Run committed use discounts. |
| Cloud Storage | For the Cloud Storage bucket that you use to load data into the data ingestion subsystem, choose an appropriate storage class. When you choose the storage class, consider the data-retention and access-frequency requirements of your workloads. For example, to control storage costs, you can choose the Standard class and use Object Lifecycle Management. Doing so enables automatic downgrade of objects to a lower-cost storage class or deletion of objects based on conditions that you set. A sketch of such lifecycle rules is shown after this table. |
| Cloud Logging | To control the cost of storing logs, you can do the following: reduce the retention period of your log buckets, use exclusion filters to prevent the ingestion of log entries that you don't need, and stop storing logs that have no operational or analytical value. |
| BigQuery | BigQuery lets you estimate the cost of queries before you run them. To optimize query costs, you need to optimize storage and query computation. For more information, see Estimate and control costs. |
| All of the products in the architecture | After you deploy your application in Google Cloud, use the Active Assist service to get recommendations to further optimize the cost of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist. |
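The following sketch, referenced in the Cloud Storage row of the preceding table, applies lifecycle rules to the ingestion bucket by using the Python client library. The bucket name and the rule ages are assumptions; choose values that match your data-retention requirements.

```python
from google.cloud import storage

bucket = storage.Client().get_bucket("my-ingestion-bucket")  # hypothetical name

# Downgrade objects to a lower-cost class after 30 days, delete after a year.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration
```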
To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.
For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.
Performance optimization
This section describes design considerations and recommendations to design a topology in Google Cloud that meets the performance requirements of your workloads.
| Product | Design considerations and recommendations |
|---|---|
| Vector Search | When you create the index, set the shard size, distance measure type, and number of embeddings for each leaf node based on your performance requirements. For example, if your application is extremely sensitive to latency variability, we recommend a large shard size. For more information, see Configuration parameters that affect performance. When you configure the compute capacity of the nodes on which the Vector Search index is deployed, consider your requirements for performance. Choose an appropriate machine type and set the maximum number of nodes based on the query load that you expect. For more information, see Deployment settings that affect performance. Configure the query parameters for the Vector Search index based on your requirements for query performance, availability, and cost. For example, the approximateNeighborsCount parameter specifies the number of neighbors to retrieve through approximate search before exact reordering is performed. An index that's up-to-date helps to improve the accuracy of the generated responses. You can update your Vector Search index by using batch or streaming updates. Streaming updates let you perform near real-time queries on updated data. For more information, see Update and rebuild an active index. An index-creation sketch that shows these parameters appears after this table. |
| Cloud Run | By default, each Cloud Run container instance is allocated one CPU and 512 MiB of memory. Depending on the performance requirements, you can configure the CPU limit and the memory limit. For more information, see the Cloud Run documentation about configuring CPU limits and memory limits. To ensure optimal latency even after a period of no traffic, you can configure a minimum number of instances. When such instances are idle, the CPU and memory that are allocated to the instances are billed at a lower price. For more performance optimization guidance, see General Cloud Run development tips. |
| Cloud Storage | To upload large files, you can use a method called parallel composite uploads. With this strategy, the large file is split into chunks. The chunks are uploaded to Cloud Storage in parallel, and then the data is recomposed in the cloud. When network bandwidth and disk speed aren't limiting factors, parallel composite uploads can be faster than regular upload operations. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads. |
| BigQuery | BigQuery provides a query execution graph that you can use to analyze query performance and get performance insights for issues like slot contention and insufficient shuffle quota. For more information, see Get query performance insights. After you address the issues that you identify through query performance insights, you can further optimize queries by using techniques like reducing the volume of input and output data. For more information, see Optimize query computation. |
| All of the products in the architecture | After you deploy your application in Google Cloud, use the Active Assist service to get recommendations to further optimize the performance of your cloud resources. Review the recommendations and apply them as appropriate for your environment. For more information, see Find recommendations in Active Assist. |
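As referenced in the Vector Search row of the preceding table, the following sketch creates a tree-AH index and sets the parameters that affect performance. All values are illustrative assumptions; tune them for your own recall, latency, and cost requirements, and make sure that dimensions matches your embedding model.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag-index",
    contents_delta_uri="gs://my-bucket/embeddings/",  # initial embeddings
    dimensions=768,                       # must match the embedding model
    approximate_neighbors_count=150,      # neighbors found before reordering
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=1000,       # embeddings per leaf node
    leaf_nodes_to_search_percent=10,      # higher = better recall, more latency
    shard_size="SHARD_SIZE_LARGE",        # large shards for latency-sensitive apps
    index_update_method="STREAM_UPDATE",  # enable streaming updates
)
```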
For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.
Deployment
To deploy a topology that's based on this reference architecture, you can download and use the Terraform sample configuration that's available in a repository in GitHub. Follow the instructions in the README in the repository. The sample code isn't intended for production use cases.
What's next
- Choose models and infrastructure for your generative AI application
- RAG infrastructure for generative AI using Vertex AI and AlloyDB for PostgreSQL
- RAG infrastructure for generative AI using GKE and Cloud SQL
- RAG infrastructure for generative AI using Gemini Enterprise and Vertex AI
- GraphRAG infrastructure for generative AI using Vertex AI and Spanner Graph
- For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Well-Architected Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Contributors
Author: Kumar Dhanagopal | Cross-Product Solution Developer
Other contributors:
- Assaf Namer | Principal Cloud Security Architect
- Deepak Michael | Networking Specialist Customer Engineer
- Divam Anand | Product Strategy and Operations Lead
- Eran Lewis | Senior Product Manager
- Jerome Simms | Director, Product Management
- Katie McLaughlin | Senior Developer Relations Engineer
- Mark Schlagenhauf | Technical Writer, Networking
- Megan O'Keefe | Developer Advocate
- Nicholas McNamara | Product and Commercialization Strategy Principal
- Preston Holmes | Outbound Product Manager - App Acceleration
- Rob Edwards | Technology Practice Lead, DevOps
- Victor Moreno | Product Manager, Cloud Networking
- Wietse Venema | Developer Relations Engineer