RAG infrastructure for generative AI using GKE and Cloud SQL

Last reviewed 2024-12-11 UTC

This document provides a reference architecture that you can use to design the infrastructure to run a generative AI application with retrieval-augmented generation (RAG) using Google Kubernetes Engine (GKE), Cloud SQL, and open source tools like Ray, Hugging Face, and LangChain. To help you experiment with this reference architecture, a sample application and Terraform configuration are provided in GitHub.

This document is for developers who want to rapidly build and deploy RAG-capable generative AI applications by using open source tools and models. It assumes that you have experience with using GKE and Cloud SQL and that you have a conceptual understanding of AI, machine learning (ML), and large language models (LLMs). This document doesn't provide guidance about how to design and develop a generative AI application.

Architecture

The following diagram shows a high-level view of an architecture for a RAG-capable generative AI application in Google Cloud:

A high-level architecture for a RAG-capable generative AI application in Google Cloud.

The architecture contains a serving subsystem and an embedding subsystem.

  • The serving subsystem handles the request-response flow between the application and its users. The subsystem includes a frontend server, an inference server, and a responsible AI (RAI) service. The serving subsystem interacts with the embedding subsystem through a vector database.
  • The embedding subsystem enables the RAG capability in the architecture. This subsystem does the following:
    • Ingests data from data sources in Google Cloud, on-premises, and other cloud platforms.
    • Converts the ingested data to vector embeddings.
    • Stores the embeddings in a vector database.

The following diagram shows a detailed view of the architecture:

A detailed architecture for a RAG-capable generative AI application in Google Cloud.

As shown in the preceding diagram, the frontend server, inference server, and embedding service are deployed in a regional GKE cluster in Autopilot mode. Data for RAG is ingested through a Cloud Storage bucket. The architecture uses a Cloud SQL for PostgreSQL instance with the pgvector extension as the vector database to store embeddings and perform semantic searches. Vector databases are designed to efficiently store and retrieve high-dimensional vectors.
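
As an illustration of the storage layer, the following minimal sketch shows how an embeddings table might be defined with pgvector. The connection details, database name, and table name are hypothetical; the 384-dimension column matches the intfloat/multilingual-e5-small embedding model that's used in the embedding subsystem:

```python
# Minimal sketch: a pgvector table for embeddings (hypothetical names).
import psycopg  # psycopg 3

# Hypothetical connection details for the Cloud SQL for PostgreSQL instance.
DB_DSN = "host=10.0.0.3 dbname=rag user=rag-app password=..."

with psycopg.connect(DB_DSN) as conn:
    # One-time setup: enable the pgvector extension.
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # 384 dimensions matches the intfloat/multilingual-e5-small model.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT,
            embedding VECTOR(384)
        )
    """)
```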

Note: The preceding diagram doesn't show the architecture for networking resources. For guidance to design and configure networking for your GKE cluster, see Best practices for GKE networking.

The following sections describe the components and data flow within eachsubsystem of the architecture.

Embedding subsystem

The following is the flow of data in the embedding subsystem:

  1. Data from external and internal sources is uploaded to the Cloud Storage bucket by human users or programmatically. The uploaded data might be in files, databases, or streamed data.
  2. (Not shown in the architecture diagram.) The data upload activity triggers an event that's published to a messaging service like Pub/Sub. The messaging service sends a notification to the embedding service.
  3. When the embedding service receives a notification of a data upload event, it does the following:
    1. Retrieves data from the Cloud Storage bucket through the Cloud Storage FUSE CSI driver.
    2. Reads the uploaded data and preprocesses it using Ray Data. The preprocessing can include chunking the data and transforming it into a suitable format for embedding generation.
    3. Runs a Ray job to create vectorized embeddings of the preprocessed data by using an open-source model like intfloat/multilingual-e5-small that's deployed in the same cluster.
    4. Writes the vectorized embeddings to the Cloud SQL for PostgreSQL vector database. A minimal sketch of this pipeline follows this list.
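
The following sketch, which isn't part of the published sample code, shows one way to implement the preprocessing, embedding, and storage steps with Ray Data. The mount path, connection details, and table name are hypothetical, and the fixed-size chunking is deliberately simplistic:

```python
# Sketch of the embedding pipeline: read, chunk, embed, and store documents.
import psycopg
import ray
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

DB_DSN = "host=10.0.0.3 dbname=rag user=rag-app password=..."  # hypothetical
CHUNK_SIZE = 512  # characters per chunk; tune for your data

def chunk(row: dict) -> list[dict]:
    # Naive fixed-size chunking; production pipelines often split on
    # sentence or section boundaries instead.
    text = row["bytes"].decode("utf-8")
    return [{"text": text[i : i + CHUNK_SIZE]}
            for i in range(0, len(text), CHUNK_SIZE)]

class Embedder:
    def __init__(self):
        # The same open-source model that the architecture deploys in-cluster.
        self.model = SentenceTransformer("intfloat/multilingual-e5-small")

    def __call__(self, batch: dict) -> dict:
        # The e5 model card recommends a "passage: " prefix for documents.
        batch["embedding"] = self.model.encode(
            ["passage: " + t for t in batch["text"]]
        )
        return batch

def write_to_db(batch: dict) -> dict:
    with psycopg.connect(DB_DSN) as conn:
        register_vector(conn)  # lets psycopg send numpy arrays as vectors
        with conn.cursor() as cur:
            for text, emb in zip(batch["text"], batch["embedding"]):
                cur.execute(
                    "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                    (text, emb),
                )
    return batch

# The Cloud Storage FUSE CSI driver exposes the bucket at a local mount path.
ds = ray.data.read_binary_files("/data/uploads")  # hypothetical mount point
ds = ds.flat_map(chunk)                           # preprocess with Ray Data
ds = ds.map_batches(Embedder, concurrency=2)      # embedding actors
ds.map_batches(write_to_db).materialize()         # persist to pgvector
```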

As described in the following section, when the serving subsystem processes user requests, it uses the embeddings in the vector database to retrieve relevant domain-specific data.

Serving subsystem

The following is the request-response flow in the serving subsystem:

  1. A user submits a natural-language request to a frontend server through a web-based chat interface. The frontend server runs on GKE.
  2. The frontend server runs a LangChain process that does the following (a sketch of these steps follows this list):
    1. Converts the natural-language request to embeddings by using the same model and parameters that the embedding service uses.
    2. Retrieves relevant grounding data by performing a semantic search for the embeddings in the vector database. Semantic search helps find embeddings based on the intent of a prompt rather than its textual content.
    3. Constructs a contextualized prompt by combining the original request with the grounding data that was retrieved.
    4. Sends the contextualized prompt to the inference server, which runs on GKE.
  3. The inference server uses the Hugging Face TGI serving framework to serve an open-source LLM like Mistral-7B-Instruct or a Gemma open model.
  4. The LLM generates a response to the prompt, and the inference server sends the response to the frontend server.

    You can store and view logs of the request-response activity in Cloud Logging, and you can set up logs-based monitoring by using Cloud Monitoring. You can also load the generated responses into BigQuery for offline analytics.

  5. The frontend server invokes an RAI service to apply the required safety filters to the response. You can use tools like Sensitive Data Protection and Cloud Natural Language API to discover, filter, classify, and de-identify sensitive content in the responses.

  6. The frontend server sends the filtered response to the user.
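
The following sketch condenses steps 2 through 4 of this flow into a single function. The production design uses LangChain to orchestrate these steps; this sketch performs the equivalent operations directly, with a hypothetical in-cluster TGI service name, connection details, and table name:

```python
# Sketch of the serving flow: embed the query, retrieve grounding data,
# and generate a response with the TGI inference server.
import psycopg
from huggingface_hub import InferenceClient
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

DB_DSN = "host=10.0.0.3 dbname=rag user=rag-app password=..."  # hypothetical

# Must match the model and parameters that the embedding service uses.
encoder = SentenceTransformer("intfloat/multilingual-e5-small")
# Hypothetical in-cluster DNS name for the Hugging Face TGI server.
tgi = InferenceClient("http://tgi-service:8080")

def answer(question: str) -> str:
    # The e5 model card recommends a "query: " prefix for search queries.
    query_embedding = encoder.encode("query: " + question)

    # Semantic search: cosine-distance (<=>) nearest neighbors in pgvector.
    with psycopg.connect(DB_DSN) as conn:
        register_vector(conn)
        rows = conn.execute(
            "SELECT content FROM documents ORDER BY embedding <=> %s LIMIT 5",
            (query_embedding,),
        ).fetchall()
    grounding = "\n".join(row[0] for row in rows)

    # Contextualized prompt: the original request plus the grounding data.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{grounding}\n\nQuestion: {question}\nAnswer:"
    )
    return tgi.text_generation(prompt, max_new_tokens=512)
```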

Products used

The following is a summary of the Google Cloud and open-source products that the preceding architecture uses:

Google Cloud products

  • Google Kubernetes Engine (GKE): A Kubernetes service that you can use to deploy and operate containerized applications at scale using Google's infrastructure.
  • Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
  • Cloud SQL: A fully managed relational database service that helps you provision, operate, and manage your MySQL, PostgreSQL, and SQL Server databases on Google Cloud.

Open-source products

  • Ray: An open-source framework for scaling Python and AI workloads. The architecture uses Ray Data for preprocessing and Ray jobs for embedding generation.
  • Hugging Face Text Generation Inference (TGI): A toolkit for deploying and serving LLMs, used by the inference server.
  • LangChain: A framework for developing applications that are powered by LLMs, used by the frontend server for retrieval and prompt construction.
  • pgvector: A PostgreSQL extension for storing and searching vector embeddings.

Use cases

RAG is an effective technique to improve the quality of output that's generated from an LLM. This section provides examples of use cases for which you can use RAG-capable generative AI applications.

Personalized product recommendations

An online shopping site might use an LLM-powered chatbot to assist customers with finding products or getting shopping-related help. The questions from a user can be augmented by using historical data about the user's buying behavior and website interaction patterns. The data might include user reviews and feedback that's stored in an unstructured datastore or search-related metrics that are stored in a web analytics data warehouse. The augmented question can then be processed by the LLM to generate personalized responses that the user might find more appealing and compelling.

Clinical assistance systems

Doctors in hospitals need to quickly analyze and diagnose a patient's health condition to make decisions about appropriate care and medication. A generative AI application that uses a medical LLM like Med-PaLM can be used to assist doctors in their clinical diagnosis process. The responses that the application generates can be grounded in historical patient records by contextualizing the doctors' prompts with data from the hospital's electronic health record (EHR) database or from an external knowledge base like PubMed.

Efficient legal research

Generative AI-powered legal research lets lawyers quickly query large volumes of statutes and case law to identify relevant legal precedents or summarize complex legal concepts. The output of such research can be enhanced by augmenting a lawyer's prompts with data that's retrieved from the law firm's proprietary corpus of contracts, past legal communication, and internal case records. This design approach ensures that the generated responses are relevant to the legal domain that the lawyer specializes in.

Design alternatives

This section presents alternative design approaches that you can consider for your RAG-capable generative AI application in Google Cloud.

Fully managed vector search

If you need an architecture that uses a fully managed vector search product, you can use Vertex AI and Vector Search, which provides optimized serving infrastructure for very large-scale vector search. For more information, see RAG infrastructure for generative AI using Vertex AI and Vector Search.

Vector-enabled Google Cloud database

If you want to take advantage of the vector store capabilities of a fully managed Google Cloud database like AlloyDB for PostgreSQL or Cloud SQL for your RAG application, then see RAG infrastructure for generative AI using Vertex AI and AlloyDB for PostgreSQL.

Other options

For information about other infrastructure options, supported models, and grounding techniques that you can use for generative AI applications in Google Cloud, see Choose models and infrastructure for your generative AI application.

Design considerations

This section provides guidance to help you develop and run a GKE-hosted RAG-capable generative AI architecture that meets your specific requirements for security and compliance, reliability, cost, and performance. The guidance in this section isn't exhaustive. Depending on the specific requirements of your application and the Google Cloud products and features that you use, you might need to consider additional design factors and trade-offs.

For design guidance related to the open-source tools in this reference architecture, like Hugging Face TGI, see the documentation for those tools.

Security, privacy, and compliance

This section describes factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your security, privacy, and compliance requirements.

GKE

In the Autopilot mode of operation, GKE pre-configures your cluster and manages nodes according to security best practices, which lets you focus on workload-specific security.

To strengthen access control for your applications that run in GKE, you can use Identity-Aware Proxy (IAP). IAP integrates with the GKE Ingress resource and ensures that only authenticated users with the correct Identity and Access Management (IAM) role can access the applications. For more information, see Enabling IAP for GKE.

By default, your data in GKE is encrypted at rest and in transit using Google-owned and Google-managed encryption keys. As an additional layer of security for sensitive data, you can encrypt data at the application layer by using a key that you own and manage with Cloud KMS. For more information, see Encrypt secrets at the application layer.

If you use a Standard GKE cluster, then you can use additional data-encryption capabilities, such as Confidential GKE Nodes, which encrypt data in use.

Cloud SQL

The Cloud SQL instance in the architecture doesn't need to be accessible from the public internet. If external access to the Cloud SQL instance is necessary, you can encrypt external connections by using SSL/TLS or the Cloud SQL Auth Proxy connector. The Auth Proxy connector provides connection authorization by using IAM. The connector uses a TLS 1.3 connection with a 256-bit AES cipher to verify client and server identities and encrypt data traffic. For connections created by using Java, Python, Go, or Node.js, use the appropriate Language Connector instead of the Auth Proxy connector.
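
As an illustration, the following sketch connects through the Cloud SQL Python Connector. The instance connection name, database, and credentials are hypothetical:

```python
# Sketch: connect to Cloud SQL with the Python Connector (IAM-based
# authorization and TLS encryption), without exposing a public IP.
from google.cloud.sql.connector import Connector

connector = Connector()

def get_conn():
    return connector.connect(
        "my-project:us-central1:rag-vector-db",  # hypothetical instance name
        "pg8000",                                # PostgreSQL driver
        user="rag-app",
        password="...",    # or pass enable_iam_auth=True and an IAM principal
        db="rag",
    )

conn = get_conn()
```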

By default, Cloud SQL uses Google-owned and Google-managed data encryption keys (DEK) and key encryption keys (KEK) to encrypt data at rest. If you need to use KEKs that you control and manage, you can use customer-managed encryption keys (CMEKs).

To prevent unauthorized access to the Cloud SQL Admin API, you can create a service perimeter by using VPC Service Controls.

For information about configuring Cloud SQL to help meet data residency requirements, see Data residency overview.

Cloud Storage

By default, the data that's stored in Cloud Storage is encrypted using Google-owned and Google-managed encryption keys. If required, you can use CMEKs or your own keys that you manage by using an external management method like customer-supplied encryption keys (CSEKs). For more information, see Data encryption options.

Cloud Storage supports two methods for controlling user access to your buckets and objects: IAM and access control lists (ACLs). In most cases, we recommend using IAM, which lets you grant permissions at the bucket and project levels. For more information, see Overview of access control.

The data that you load into the data ingestion subsystem through Cloud Storage might include sensitive data. To protect such data, you can use Sensitive Data Protection to discover, classify, and de-identify the data. For more information, see Using Sensitive Data Protection with Cloud Storage.
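
For example, the following sketch, with a hypothetical project ID and info types chosen for illustration, de-identifies sensitive values in a text snippet before it's ingested:

```python
# Sketch: de-identify sensitive text with Sensitive Data Protection (DLP API).
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project ID

response = client.deidentify_content(
    request={
        "parent": parent,
        # Detect these info types; extend the list for your data.
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
        },
        # Replace each finding with its info type name, such as [EMAIL_ADDRESS].
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": "Contact jane@example.com or +1-555-0100."},
    }
)
print(response.item.value)  # "Contact [EMAIL_ADDRESS] or [PHONE_NUMBER]."
```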

To mitigate the risk of data exfiltration from Cloud Storage, you can create a service perimeter by using VPC Service Controls.

Cloud Storage helps you meet data residency requirements. Data is stored or replicated within the regions that you specify.

All of the products in this architecture

Admin Activity audit logs are enabled by default for all of the Google Cloud services that are used in this reference architecture. You can access the logs through Cloud Logging and use the logs to monitor API calls or other actions that modify the configuration or metadata of Google Cloud resources.

Data Access audit logs are also enabled by default for all of the Google Cloud services in this architecture. You can use these logs to monitor the following:

  • API calls that read the configuration or metadata of resources.
  • User requests to create, modify, or read user-provided resource data.
Google doesn't access or use the data in Cloud Logging.

For security principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Security in the Well-Architected Framework.

Reliability

This section describes design factors that you should consider to build and operate reliable infrastructure for a RAG-capable generative AI application in Google Cloud.

GKE

With the Autopilot mode of operation that's used in this architecture, GKE provides the following built-in reliability capabilities:

  • Your workload uses a regional GKE cluster. The control plane and worker nodes are spread across three different zones within a region. Your workloads are robust against zone outages. Regional GKE clusters have a higher uptime SLA than zonal clusters.
  • You don't need to create nodes or manage node pools. GKE automatically creates the node pools and scales them automatically based on the requirements of your workloads.

To ensure that sufficient GPU capacity is available when required for autoscaling the GKE cluster, you can create and use reservations. A reservation provides assured capacity in a specific zone for a specified resource. A reservation can be specific to a project, or shared across multiple projects. You incur charges for reserved resources even if the resources aren't provisioned or used. For more information, see Consuming reserved zonal resources.

Cloud SQL

To ensure that the vector database is robust against database failures and zone outages, use an HA-configured Cloud SQL instance. In the event of a failure of the primary database or a zone outage, Cloud SQL fails over automatically to the standby database in another zone. You don't need to change the IP address for the database endpoint.

To ensure that your Cloud SQL instances are covered by the SLA, follow the recommended operational guidelines. For example, ensure that CPU and memory are properly sized for the workload, and enable automatic storage increases. For more information, see Operational guidelines.

Cloud Storage

You can create Cloud Storage buckets in one of three location types: regional, dual-region, or multi-region. Data that's stored in regional buckets is replicated synchronously across multiple zones within a region. For higher availability, you can use dual-region or multi-region buckets, where data is replicated asynchronously across regions.

For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Well-Architected Framework.

Cost optimization

This section provides guidance to help you optimize the cost of setting up and operating a RAG-capable generative AI application in Google Cloud.

GKE

In Autopilot mode, GKE optimizes the efficiency of your cluster's infrastructure based on workload requirements. You don't need to constantly monitor resource utilization or manage capacity to control costs.

If you can predict the CPU, memory, and ephemeral storage usage of your GKE Autopilot cluster, then you can save money by getting discounts for committed usage. For more information, see GKE committed use discounts.

To reduce the cost of running your application, you can use Spot VMs for your GKE nodes. Spot VMs are priced lower than standard VMs, but provide no guarantee of availability. For information about the benefits of nodes that use Spot VMs, how they work in GKE, and how to schedule workloads on such nodes, see Spot VMs.

For more cost-optimization guidance, see Best practices for running cost-optimized Kubernetes applications on GKE.

Cloud SQL

A high availability (HA) configuration helps to reduce downtime for your Cloud SQL database when the zone or instance becomes unavailable. However, the cost of an HA-configured instance is higher than that of a standalone instance. If you don't need HA for the vector database, then you can reduce cost by using a standalone instance, which isn't robust against zone outages.

You can detect whether your Cloud SQL instance is over-provisioned and optimize billing by using Cloud SQL cost insights and recommendations powered by Active Assist. For more information, see Reduce over-provisioned Cloud SQL instances.

If you can predict the CPU and memory requirements of your Cloud SQL instance, then you can save money by getting discounts for committed usage. For more information, see Cloud SQL committed use discounts.

Cloud Storage

For the Cloud Storage bucket that you use to load data into the data ingestion subsystem, choose an appropriate storage class. When you choose the storage class, consider the data-retention and access-frequency requirements of your workloads. For example, to control storage costs, you can choose the Standard class and use Object Lifecycle Management. Doing so enables automatic downgrade of objects to a lower-cost storage class or deletion of objects based on conditions that you set.

To estimate the cost of your Google Cloud resources, use the Google Cloud Pricing Calculator.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.

Performance optimization

This section describes the factors that you should consider when you design and build a RAG-capable generative AI application in Google Cloud that meets your performance requirements.

GKE

Choose appropriate compute classes for your Pods based on the performance requirements of the workloads. For the Pods that run the inference server and the embedding service, we recommend that you use a GPU machine type like nvidia-l4.
Cloud SQL

To optimize the performance of your Cloud SQL instance, ensure that the CPU and memory that are allocated to the instance are adequate for the workload. For more information, see Optimize underprovisioned Cloud SQL instances.

To improve the response time for approximate nearest neighbor (ANN) vector search, use the Inverted File with Flat Compression (IVFFlat) index or the Hierarchical Navigable Small World (HNSW) index.
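
For illustration, the following sketch creates each index type on a hypothetical documents table; which index performs better depends on your data size and recall requirements:

```python
# Sketch: create an ANN index with pgvector (hypothetical table and column).
import psycopg

DB_DSN = "host=10.0.0.3 dbname=rag user=rag-app password=..."  # hypothetical

with psycopg.connect(DB_DSN) as conn:
    # HNSW: slower to build and uses more memory, but typically gives a
    # better speed-recall trade-off at query time.
    conn.execute(
        "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)"
    )
    # IVFFlat alternative: faster to build; the pgvector docs suggest
    # lists ≈ rows / 1000 for tables up to about a million rows.
    # conn.execute(
    #     "CREATE INDEX ON documents USING ivfflat "
    #     "(embedding vector_cosine_ops) WITH (lists = 100)"
    # )
```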

To help you analyze and improve the query performance of the databases, Cloud SQL provides a Query Insights tool. You can use this tool to monitor performance and trace the source of a problematic query. For more information, see Use Query insights to improve query performance.

To get an overview of the status and performance of your databases and to view detailed metrics such as peak connections and disk utilization, you can use the System Insights dashboard. For more information, see Use System insights to improve system performance.

Cloud Storage

To upload large files, you can use a method called parallel composite uploads. With this strategy, the large file is split into chunks. The chunks are uploaded to Cloud Storage in parallel and then the data is recomposed in the cloud. When network bandwidth and disk speed aren't limiting factors, parallel composite uploads can be faster than regular upload operations. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.
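
In Python, a close equivalent is the chunked parallel upload in the google-cloud-storage transfer manager, which uses the XML multipart upload API rather than object composition. A minimal sketch with a hypothetical bucket and file name:

```python
# Sketch: upload a large file in parallel chunks with the transfer manager.
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
bucket = client.bucket("my-rag-ingest-bucket")  # hypothetical bucket name
blob = bucket.blob("uploads/corpus.jsonl")

transfer_manager.upload_chunks_concurrently(
    "corpus.jsonl",               # local file, split and uploaded in parallel
    blob,
    chunk_size=32 * 1024 * 1024,  # 32 MiB chunks
    max_workers=8,
)
```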

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Well-Architected Framework.

Deployment

To deploy a topology that's based on this reference architecture, you can download and use the open-source sample code that's available in a repository in GitHub. The sample code isn't intended for production use cases. You can use the code to experiment with setting up AI infrastructure for a RAG-capable generative AI application.

The sample code does the following:

  1. Provisions a Cloud SQL for PostgreSQL instance to serve as the vector database.
  2. Deploys Ray, JupyterHub, and Hugging Face TGI to a GKE cluster that you specify.
  3. Deploys a sample web-based chatbot application to your GKE cluster to let you verify the RAG capability.
Note: The sample code doesn't set up a messaging component to notify the embedding service when data is uploaded to Cloud Storage. The code includes a Jupyter notebook that you use to upload data to Cloud Storage and trigger the Ray job to generate embeddings.

For instructions to use the sample code, see the README for the code. If any errors occur when you use the sample code, and if open GitHub issues don't exist for the errors, then create issues in GitHub.

The sample code deploys billable Google Cloud resources. When you finish using the code, remove any resources that you no longer need.

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer
