AI and ML perspective: Reliability

Last reviewed 2025-08-07 UTC

This document in the Google Cloud Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential to ensure customer satisfaction and achieve business goals. To meet the unique demands of both predictive ML and generative AI, you need AI and ML systems that are robust, reliable, and adaptable. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers purpose-built AI infrastructure that's aligned with site reliability engineering (SRE) principles and that provides a powerful foundation for reliable AI and ML systems.

The recommendations in this document are mapped to the following core principles:

Ensure that ML infrastructure is scalable and highly available

Reliable AI and ML systems in the cloud require scalable and highly available infrastructure. These systems have dynamic demands, diverse resource needs, and critical dependencies on model availability. Scalable architectures adapt to fluctuating loads and variations in data volume or inference requests. High availability (HA) helps to ensure resilience against failures at the component, zone, or region level.

To build scalable and highly available ML infrastructure, consider the following recommendations.

Implement automatic and dynamic scaling capabilities

AI and ML workloads are dynamic, with demand that fluctuates based on data arrival rates, training frequency, and the volume of inference traffic. Automatic and dynamic scaling adapts infrastructure resources seamlessly to demand fluctuations. Scaling your workloads effectively helps to prevent downtime, maintain performance, and optimize costs.

To autoscale your AI and ML workloads, use the following products and features in Google Cloud:

  • Data processing pipelines: Create data pipelines in Dataflow. Configure the pipelines to use Dataflow's horizontal autoscaling feature, which dynamically adjusts the number of worker instances based on CPU utilization, pipeline parallelism, and pending data. You can configure autoscaling parameters through pipeline options when you launch jobs.
  • Training jobs: Automate the scaling of training jobs by using Vertex AI custom training. You can define worker pool specifications such as the machine type, the type and number of accelerators, and the number of worker pools. For jobs that can tolerate interruptions and for jobs where the training code implements checkpointing, you can reduce costs by using Spot VMs.
  • Online inference: For online inference, use Vertex AI endpoints. To enable autoscaling, configure the minimum and maximum replica count. Specify a minimum of two replicas for HA. Vertex AI automatically adjusts the number of replicas based on traffic and the configured autoscaling metrics, such as CPU utilization and replica utilization. See the sketch after this list.
  • Containerized workloads in Google Kubernetes Engine: Configure autoscaling at the node and Pod levels. Configure the cluster autoscaler and node auto-provisioning to adjust the node count based on pending Pod resource requests like CPU, memory, GPU, and TPU. Use Horizontal Pod Autoscaler (HPA) for deployments to define scaling policies based on metrics like CPU and memory utilization. You can also scale based on custom AI and ML metrics, such as GPU or TPU utilization and prediction requests per second.
  • Serverless containerized services: Deploy the services in Cloud Run and configure autoscaling by specifying the minimum and maximum number of container instances. Use best practices to autoscale GPU-enabled instances by specifying the accelerator type. Cloud Run automatically scales instances between the configured minimum and maximum limits based on incoming requests. When there are no requests, it scales efficiently to zero instances. You can leverage the automatic, request-driven scaling of Cloud Run to deploy Vertex AI agents and to deploy third-party workloads like quantized models using Ollama, LLM model inference using vLLM, and Hugging Face Text Generation Inference (TGI).
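
The following minimal sketch shows how you might configure replica counts and an autoscaling target when you deploy a model to a Vertex AI endpoint by using the Vertex AI SDK for Python. The project, model resource name, machine configuration, and target values are placeholders, not recommendations.

```python
from google.cloud import aiplatform

# Placeholder project, region, and model; replace with your own resources.
aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# Deploy with a minimum of two replicas for HA, and let Vertex AI scale up
# to the maximum based on the configured utilization target.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=2,
    max_replica_count=10,
    autoscaling_target_accelerator_duty_cycle=60,  # scale on GPU utilization
)
```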

Design for HA and fault tolerance

For production-grade AI and ML workloads, it's crucial that you ensure continuous operation and resilience against failures. To implement HA and fault tolerance, you need to build redundancy and replication into your architecture on Google Cloud. This approach helps to ensure that a failure of an individual component doesn't cause a failure of the complete system.

Implement redundancy for critical AI and ML components in Google Cloud. The following are examples of products and features that let you implement resource redundancy:

  • Deploy GKE regional clusters across multiple zones.
  • Ensure data redundancy for datasets and checkpoints by using Cloud Storage multi-region or dual-region buckets.
  • Use Spanner for globally consistent, highly available storage of metadata.
  • Configure Cloud SQL read replicas for operational databases.
  • Ensure that vector databases for retrieval-augmented generation (RAG) are highly available and multi-zonal or multi-regional.

Manage resources proactively and anticipate requirements

Effective resource management is important to help you optimize costs, performance, and reliability. AI and ML workloads are dynamic, and demand for specialized hardware like GPUs and TPUs is high. Therefore, it's crucial that you apply proactive resource management and ensure resource availability.

Plan for capacity based on historical monitoring data, such as GPU or TPU utilization and throughput rates, from Cloud Monitoring and logs in Cloud Logging. Analyze this telemetry data by using BigQuery or Looker Studio, and forecast future demand for GPUs based on growth or new models. Analysis of resource usage patterns and trends helps you to predict when and where you need critical specialized accelerators.

  • Validate capacity estimates through rigorous load testing. Simulate traffic on AI and ML services like serving and pipelines by using tools like Apache JMeter or LoadView.
  • Analyze system behavior under stress.
    • To anticipate and meet increased workload demands in production, proactively identify resource requirements. Monitor latency, throughput, errors, and resource utilization, especially GPU and TPU utilization. Increase resource quotas as necessary.
    • For generative AI serving, test under high concurrent loads and identify the level at which accelerator availability limits performance.
  • Perform continuous monitoring for model queries and set up proactive alerts for agents.
    • Use the model observability dashboard to view metrics that are collected by Cloud Monitoring, such as model queries per second (QPS), token throughput, and first token latencies.
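
As a starting point for the capacity analysis described earlier, the following sketch reads historical accelerator utilization from Cloud Monitoring with the Python client library. The project ID and metric type are assumptions; substitute the GPU or TPU metric that your environment actually exports before loading the results into BigQuery for trend analysis.

```python
import time

from google.cloud import monitoring_v3

PROJECT = "my-project"  # assumed project ID
# Assumed metric type; replace with the accelerator metric that your workloads export.
METRIC_TYPE = "kubernetes.io/container/accelerator/duty_cycle"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 30 * 24 * 3600},  # last 30 days
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": f'metric.type = "{METRIC_TYPE}"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Flatten the time series into rows that you can load into BigQuery for forecasting.
for series in results:
    for point in series.points:
        print(dict(series.resource.labels), point.interval.end_time, point.value)
```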

Optimize resource availability and obtainability

Optimize costs and ensure resource availability by strategically selecting appropriate compute resources based on workload requirements.

  • For stable 24x7 inference or for training workloads with fixed or predictable capacity requirements, use committed use discounts (CUDs) for VMs and accelerators.
  • For GKE nodes and Compute Engine VMs, use Spot VMs and Dynamic Workload Scheduler (DWS) capabilities:

    • For fault-tolerant tasks such as evaluation and experimentation workloads, use Spot VMs. Spot VMs can be preempted, but they can help reduce your overall costs.
    • To manage preemption risk for high-demand accelerators, you can ensure better obtainability by using DWS.
      • For complex batch training that needs high-end GPUs to run for up to seven days, use the DWS Flex-Start mode.
      • For longer running workloads that run for up to three months, use the Calendar mode to reserve specific GPUs (H100 and H200) and TPUs (Trillium).
  • To optimize AI inference on GKE, you can run a vLLM engine that dynamically uses TPUs and GPUs to address fluctuating capacity and performance needs. For more information, see vLLM GPU/TPU Fungibility.

  • For advanced scenarios with complex resource and topology needs that involve accelerators, use tools to abstract resource management.

    • Cluster Director lets you deploy and manage accelerator groups with colocation and scheduling for multi-GPU training (A3 Ultra H200 and A4 B200). Cluster Director supports GKE and Slurm clusters.
    • Ray on Vertex AI abstracts distributed computing infrastructure. It enables applications to request resources for training and serving without the need for direct management of VMs and containers.

Distribute incoming traffic across multiple instances

Effective load balancing is crucial for AI applications that have fluctuating demands. Load balancing distributes traffic, optimizes resource utilization, provides HA and low latency, and helps to ensure a seamless user experience.

  • Inference with varying resource needs: Implement load balancing based on model metrics. GKE Inference Gateway lets you deploy models behind a load balancer with model-aware routing. The gateway prioritizes instances with GPU and TPU accelerators for compute-intensive tasks like generative AI and LLM inference. Configure detailed health checks to assess model status. Use serving frameworks like vLLM or Triton for LLM metrics, and integrate the metrics into Cloud Monitoring by using Google Cloud Managed Service for Prometheus.
  • Inference workloads that need GPUs or TPUs: To ensure that critical AI and ML inference workloads consistently run on machines that are suitable to the workloads' requirements, particularly when GPU and TPU availability is constrained, use GKE custom compute classes. You can define specific compute profiles with fallback policies for autoscaling. For example, you can define a profile that specifies a higher priority for reserved GPU or TPU instances. The profile can include a fallback to use cost-efficient Spot VMs if the reserved resources are temporarily unavailable.
  • Generative AI on diverse orchestration platforms: Use a centralized load balancer. For example, for cost and management efficiency, you can route requests that have low GPU needs to Cloud Run and route more complex, GPU-intensive tasks to GKE. For inter-service communication and policy management, implement a service mesh by using Cloud Service Mesh. Ensure consistent logging and monitoring by using Cloud Logging and Cloud Monitoring.
  • Global load distribution: To load balance traffic from global users who need low latency, use a global external Application Load Balancer. Configure geolocation routing to the closest region and implement failover. Establish regional endpoint replication in Vertex AI or GKE. Configure Cloud CDN for static assets. Monitor global traffic and latency by using Cloud Monitoring.
  • Granular traffic management: For requests that have diverse data types or complexity, and for long-running requests, implement granular traffic management.
    • Configure content-based routing to direct requests to specialized backends based on attributes like URL paths and headers. For example, direct requests to GPU-enabled backends for image or video models and to CPU-optimized backends for text-based models.
    • For long-running generative AI requests or batch workloads, use WebSockets or gRPC. Implement traffic management to handle timeouts and buffering. Configure request timeouts and retries, and implement rate limiting and quotas by using API Gateway or Apigee.

Use a modular and loosely coupled architecture

In a modular, loosely coupled AI and ML architecture, complex systems are divided into smaller, self-contained components that interact through well-defined interfaces. This architecture minimizes module dependencies, simplifies development and testing, enhances reproducibility, and improves fault tolerance by containing failures. The modular approach is crucial for managing complexity, accelerating innovation, and ensuring long-term maintainability.

To design a modular and loosely coupled architecture for AI and ML workloads, consider the following recommendations.

Implement small self-contained modules or components

Separate your end-to-end AI and ML system into small, self-contained modules or components. Each module or component is responsible for a specific function, such as data ingestion, feature transformation, model training, inference serving, or evaluation. A modular design provides several key benefits for AI and ML systems: improved maintainability, increased scalability, reusability, and greater flexibility and agility.

The following sections describe Google Cloud products, features, and tools that you can use to design a modular architecture for your AI and ML systems.

Containerized microservices on GKE

For complex AI and ML systems or intricate generative AI pipelines that need fine-grained orchestration, implement modules as microservices that are orchestrated by using GKE. Package each distinct stage as an individual microservice within Docker containers. These distinct stages include data ingestion that's tailored for diverse formats, specialized data preprocessing or feature engineering, distributed model training or fine-tuning of large foundation models, evaluation, or serving.

Deploy the containerized microservices on GKE and leverage automated scaling based on CPU and memory utilization or custom metrics like GPU utilization, rolling updates, and reproducible configurations in YAML manifests. Ensure efficient communication between the microservices by using GKE service discovery. For asynchronous patterns, use message queues like Pub/Sub.

The microservices-on-GKE approach helps you build scalable, resilient platforms for tasks like complex RAG applications where the stages can be designed as distinct services.

Serverless event-driven services

For event-driven tasks that can benefit from serverless, automatic scaling, use Cloud Run or Cloud Run functions. These services are ideal for asynchronous tasks like preprocessing or for smaller inference jobs. Trigger Cloud Run functions on events, such as a new data file that's created in Cloud Storage or model updates in Artifact Registry. For webhook tasks or services that need a container environment, use Cloud Run.

Cloud Run services and Cloud Run functions can scale up rapidly and scale down to zero, which helps to ensure cost efficiency for fluctuating workloads. These services are suitable for modular components in Vertex AI Agents workflows. You can orchestrate component sequences with Workflows or Application Integration.
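
As a minimal sketch of the event-driven pattern described above, the following function (written with the Functions Framework for Python) runs when a new file lands in a Cloud Storage bucket. The bucket, object handling, and downstream preprocessing logic are placeholders.

```python
import functions_framework
from cloudevents.http import CloudEvent
from google.cloud import storage


@functions_framework.cloud_event
def preprocess_new_file(event: CloudEvent) -> None:
    """Triggered by a Cloud Storage object finalized event."""
    data = event.data
    bucket_name = data["bucket"]
    object_name = data["name"]

    # Placeholder preprocessing: download the new object and hand it off to
    # your feature pipeline (for example, by publishing a Pub/Sub message).
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    raw_bytes = blob.download_as_bytes()
    print(f"Preprocessing {object_name} ({len(raw_bytes)} bytes) from {bucket_name}")
```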

Vertex AI managed services

Vertex AI services support modularity and help you simplify the development and deployment of your AI and ML systems. The services abstract the infrastructure complexities so that you can focus on the application logic.

  • To orchestrate workflows that are built from modular steps, use Vertex AI Pipelines.
  • To run custom AI and ML code, package the code in Docker containers that can run on managed services like Vertex AI custom training and Vertex AI prediction. The sketch after this list shows how you might submit such a containerized training job.
  • For modular feature engineering pipelines, use Vertex AI Feature Store.
  • For modular exploration and prototyping, use notebook environments like Vertex AI Workbench or Colab Enterprise. Organize your code into reusable functions, classes, and scripts.
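
The following sketch uses the Vertex AI SDK for Python to run a custom training container as a managed job. The project, container image URI, bucket, and machine configuration are placeholders; adjust them to your own artifacts and capacity.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# The training logic lives in a container image that you build and push to
# Artifact Registry; the URI below is a placeholder.
job = aiplatform.CustomContainerTrainingJob(
    display_name="churn-model-training",
    container_uri="us-central1-docker.pkg.dev/my-project/training/trainer:latest",
)

job.run(
    replica_count=1,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    base_output_dir="gs://my-bucket/training-output/",
)
```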

Agentic applications

For AI agents, Agent Development Kit (ADK) provides modular capabilities like Tools and State. To enable interoperability between frameworks like LangChain, LangGraph, LlamaIndex, and Vertex AI, you can combine ADK with the Agent2Agent (A2A) protocol and the Model Context Protocol (MCP). This interoperability lets you compose agentic workflows by using diverse components.

You can deploy agents on Vertex AI Agent Engine, which is a managed runtime that's optimized for scalable agent deployment. To run containerized agents, you can leverage the autoscaling capabilities in Cloud Run.

Design well-defined interfaces

To build robust and maintainable software systems, it's crucial to ensure that the components of a system are loosely coupled and modularized. This approach offers significant advantages because it minimizes the dependencies between different parts of the system. When modules are loosely coupled, changes in one module have minimal impact on other modules. This isolation enables independent updates and development workflows for individual modules.

The following sections provide guidance to help ensure seamless communication and integration between the modules of your AI and ML systems.

Protocol choice

  • For universal access, use HTTP APIs, adhere to RESTful principles, and use JSON for language-agnostic data exchange. Design the API endpoints to represent actions on resources.
  • For high-performance internal communication among microservices, use gRPC with Protocol Buffers (ProtoBuf) for efficient serialization and strict typing. Define data structures like ModelInput, PredictionResult, or ADK Tool data by using .proto files, and then generate language bindings.
  • For use cases where performance is critical, leverage gRPC streaming for large datasets or for continuous flows such as live text-to-speech or video applications. Deploy the gRPC services on GKE.

Standardized and comprehensive documentation

Regardless of the interface protocol that you choose, standardized documentation is crucial. The OpenAPI Specification describes RESTful APIs. Use OpenAPI to document your AI and ML APIs: paths, methods, parameters, request-response formats that are linked to JSON schemas, and security. Comprehensive API documentation helps to improve discoverability and client integration. For API authoring and visualization, use UI tools like Swagger Editor. To accelerate development and ensure consistency, you can generate client SDKs and server stubs by using AI-assisted coding tools like Gemini Code Assist. Integrate OpenAPI documentation into your CI/CD flow.

Interaction with Google Cloud managed services like Vertex AI

Choose between the higher abstraction of the Vertex AI SDK, which is preferred for development productivity, and the granular control that the REST API provides.

  • The Vertex AI SDK simplifies tasks and authentication. Use the SDK when you need to interact with Vertex AI.
  • The REST API is a powerful alternative, especially when interoperability is required between layers of your system. It's useful for tools in languages that don't have an SDK or when you need fine-grained control.

Use APIs to isolate modules and abstract implementation details

For security, scalability, and visibility, it's crucial that you implement robust API management for your AI and ML services. To implement API management for your defined interfaces, use the following products:

  • API Gateway: For APIs that are externally exposed and managed, API Gateway provides a centralized, secure entry point. It simplifies access to serverless backend services, such as prediction, training, and data APIs. API Gateway helps to consolidate access points, enforce API contracts, and manage security capabilities like API keys and OAuth 2.0. To protect backends from overload and ensure reliability, implement rate limiting and usage quotas in API Gateway.
  • Cloud Endpoints: To streamline API development and deployment on GKE and Cloud Run, use Cloud Endpoints, which offers a developer-friendly solution for generating API keys. It also provides integrated monitoring and tracing for API calls, and it automates the generation of OpenAPI specs, which simplifies documentation and client integration. You can use Cloud Endpoints to manage access to internal or controlled AI and ML APIs, such as to trigger training and manage feature stores.
  • Apigee: For enterprise-scale AI and ML, especially sophisticated generative AI APIs, Apigee provides advanced, comprehensive API management. Use Apigee for advanced security like threat protection and OAuth 2.0, for traffic management like caching, quotas, and mediation, and for analytics. Apigee can help you to gain deep insights into API usage patterns, performance, and engagement, which are crucial for understanding generative AI API usage.

Plan for graceful degradation

In production AI and ML systems, component failures are unavoidable, just like in other systems. Graceful degradation ensures that essential functions continue to operate, potentially with reduced performance. This approach prevents complete outages and improves overall availability. Graceful degradation is critical for latency-sensitive inference, distributed training, and generative AI.

The following sections describe techniques that you can use to plan and implement graceful degradation.

Fault isolation

  • To isolate faulty components in distributed architectures, implement the circuit breaker pattern by using resilience libraries, such as Resilience4j in Java and CircuitBreaker in Python. See the sketch after this list.
  • To prevent cascading failures, configure thresholds based on AI and ML workload metrics like error rates and latency, and define fallbacks like simpler models and cached data.
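
A minimal sketch of the circuit breaker pattern, assuming the open source circuitbreaker package for Python. The prediction endpoint URL, thresholds, and cached fallback are placeholders; the point is that a failing downstream model service degrades to a fallback response instead of cascading.

```python
import requests
from circuitbreaker import circuit

CACHED_FALLBACK = {"prediction": "unknown", "source": "cache"}  # placeholder fallback


@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=requests.RequestException)
def call_model(payload: dict) -> dict:
    """Call the primary prediction service; the URL is a placeholder."""
    response = requests.post("https://model.internal.example/predict", json=payload, timeout=2)
    response.raise_for_status()
    return response.json()


def predict_with_fallback(payload: dict) -> dict:
    try:
        return call_model(payload)
    except Exception:
        # The circuit is open or the call failed: degrade gracefully to a
        # cached or simpler response instead of failing the user request.
        return CACHED_FALLBACK
```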

Component redundancy

For critical components, implement redundancy and automatic failover. For example, use GKE multi-zone or regional clusters, and deploy Cloud Run services redundantly across different regions. To route traffic to healthy instances when unhealthy instances are detected, use Cloud Load Balancing.

Ensure data redundancy by using Cloud Storage multi-region buckets. For distributed training, implement asynchronous checkpointing to resume after failures. For resilient and elastic training, use Pathways.

Proactive monitoring

Graceful degradation helps to ensure system availability during failure, but you must also implement proactive measures for continuous health checks and comprehensive monitoring. Collect metrics that are specific to AI and ML, such as latency, throughput, and GPU utilization. Also, collect model performance degradation metrics like model and data drift by using Cloud Monitoring and Vertex AI Model Monitoring.

Health checks can trigger the need to replace faulty nodes, deploy more capacity, or automatically trigger continuous retraining or fine-tuning pipelines that use updated data. This proactive approach helps to prevent both accuracy-based degradation and system-level degradation, and it helps to enhance overall reliability.

SRE practices

To monitor the health of your systems, consider adopting SRE practices to implement service level objectives (SLOs). Alerts on error budget loss and burn rate can be early indicators of reliability problems with the system. For more information about SRE practices, see the Google SRE book.

Build an automated end-to-end MLOps platform

A robust, scalable, and reliable AI and ML system on Google Cloud requires an automated end-to-end MLOps platform for the model development lifecycle. The development lifecycle includes initial data handling, continuous model training, deployment, and monitoring in production. By automating these stages on Google Cloud, you establish repeatable processes, reduce manual toil, minimize errors, and accelerate the pace of innovation.

An automated MLOps platform is essential for establishing production-grade reliability for your applications. Automation helps to ensure model quality, guarantee reproducibility, and enable continuous integration and delivery of AI and ML artifacts.

To build an automated end-to-end MLOps platform, consider the following recommendations.

Automate the model development lifecycle

A core element of an automated MLOps platform is the orchestration of the entire AI and ML workflow as a series of connected, automated steps: from data preparation and validation to model training, evaluation, deployment, and monitoring.

  • Use Vertex AI Pipelines as your central orchestrator (see the sketch after this list):
    • Define end-to-end workflows with modular components for data processing, training, evaluation, and deployment.
    • Automate pipeline runs by using schedules or triggers like new data or code changes.
    • Implement automated parameterization and versioning for each pipeline run and create a version history.
    • Monitor pipeline progress and resource usage by using built-in logging and tracing, and integrate with Cloud Monitoring alerts.
  • Define your ML pipelines programmatically by using the Kubeflow Pipelines (KFP) SDK or the TensorFlow Extended SDK. For more information, see Interfaces for Vertex AI Pipelines.
  • Orchestrate operations by using Google Cloud services like Dataflow, Vertex AI custom training, Vertex AI Model Registry, and Vertex AI endpoints.
  • For generative AI workflows, orchestrate the steps for prompt management, batched inference, human-in-the-loop (HITL) evaluation, and coordinating ADK components.
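
The following minimal sketch defines a two-step pipeline with the KFP SDK and submits it to Vertex AI Pipelines. The component bodies, bucket paths, and project settings are placeholders; a real pipeline would add validation, evaluation, and deployment steps.

```python
from google.cloud import aiplatform
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def prepare_data(source_uri: str) -> str:
    # Placeholder: validate and transform the raw data, then return the processed URI.
    return source_uri


@dsl.component(base_image="python:3.11")
def train_model(dataset_uri: str) -> str:
    # Placeholder: train the model and return the artifact location.
    return "gs://my-bucket/models/latest"


@dsl.pipeline(name="training-pipeline")
def training_pipeline(source_uri: str):
    prepared = prepare_data(source_uri=source_uri)
    train_model(dataset_uri=prepared.output)


compiler.Compiler().compile(training_pipeline, package_path="training_pipeline.json")

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="training-pipeline",
    template_path="training_pipeline.json",
    parameter_values={"source_uri": "gs://my-bucket/raw/data.csv"},
)
job.submit()  # use job.run() to block until the run completes
```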

Manage infrastructure as code

Infrastructure as code (IaC) is crucial for managing AI and ML system infrastructure and for enabling reproducible, scalable, and maintainable deployments. The infrastructure needs of AI and ML systems are dynamic and complex. The systems often require specialized hardware like GPUs and TPUs. IaC helps to mitigate the risks of manual infrastructure management by ensuring consistency, enabling rollbacks, and making deployments repeatable.

To effectively manage your infrastructure resources as code, use the following techniques.

Automate resource provisioning

To effectively manage IaC on Google Cloud, define and provision your AI and ML infrastructure resources by using Terraform. The infrastructure might include resources such as the following:

  • GKE clusters that are configured with node pools. The node pools can be optimized based on workload requirements. For example, you can use A100, H100, H200, or B200 GPUs for training, and use L4 GPUs for inference.
  • Vertex AI endpoints that are configured for model serving, with defined machine types and scaling policies.
  • Cloud Storage buckets for data and artifacts.

Use configuration templates

Organize your Terraform configurations as modular templates. To accelerate the provisioning of AI and ML resources, you can use Cluster Toolkit. The toolkit provides example blueprints, which are Google-curated Terraform templates that you can use to deploy ready-to-use HPC, AI, and ML clusters in Slurm or GKE. You can customize the Terraform code and manage it in your version control system. To automate the resource provisioning and update workflow, you can integrate the code into your CI/CD pipelines by using Cloud Build.

Automate configuration changes

After you provision your infrastructure, manage the ongoing configuration changes declaratively:

  • In Kubernetes-centric environments, manage your Google Cloud resources as Kubernetes objects by using Config Connector.
  • Define and manage Vertex AI resources like datasets, models, and endpoints, Cloud SQL instances, Pub/Sub topics, and Cloud Storage buckets by using YAML manifests.
  • Deploy the manifests to your GKE cluster in order to integrate the application and infrastructure configuration.
  • Automate configuration updates by using CI/CD pipelines, and use templating to handle environment differences.
  • Implement configurations for Identity and Access Management (IAM) policies and service accounts by using IaC.

Integrate with CI/CD

  • Automate the lifecycle of the Google Cloud infrastructure resources by integrating IaC into CI/CD pipelines by using tools like Cloud Build and Infrastructure Manager.
  • Define triggers for automatic updates on code commits.
  • Implement automated testing and validation within the pipeline. For example, you can create a script to automatically run the Terraform validate and plan commands.
  • Store the configurations as artifacts and enable versioning.
  • Define separate environments, such as dev, staging, and prod, with distinct configurations in version control, and automate environment promotion.

Validate model behavior

To maintain model accuracy and relevance over time, automate the training and evaluation process within your MLOps platform. This automation, coupled with rigorous validation, helps to ensure that the models behave as expected with relevant data before they're deployed to production.

  • Set up continuous training pipelines, which are either triggered by new data and monitoring signals like data drift or that run on a schedule.
    • To manage automated training jobs, such as hyperparameter tuning trials and distributed training configurations for larger models, use Vertex AI custom training.
    • For fine-tuning foundation models, automate the fine-tuning process and integrate the jobs into your pipelines.
  • Implement automated model versioning and securely store trained model artifacts after each successful training run. You can store the artifacts in Cloud Storage or register them in Model Registry.
  • Define evaluation metrics and set clear thresholds, such as minimum accuracy, maximum error rate, and minimum F1 score.
    • Ensure that a model meets the thresholds to automatically pass the evaluation and be considered for deployment.
    • Automate evaluation by using services like model evaluation in Vertex AI.
    • Ensure that the evaluation includes metrics that are specific to the quality of generated output, factual accuracy, safety attributes, and adherence to specified style or format.
  • To automatically log and track the parameters, code versions, dataset versions, and results of each training and evaluation run, use Vertex AI Experiments (see the sketch after this list). This approach provides a history that's useful for comparison, debugging, and reproducibility.
  • To optimize hyperparameter tuning and automate searching for optimal model configurations based on your defined objective, use Vertex AI Vizier.
  • To visualize training metrics and to debug during development, use Vertex AI TensorBoard.
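
A minimal sketch of experiment tracking with the Vertex AI SDK for Python. The experiment name, parameters, and metric values are placeholders; the threshold check at the end illustrates gating a model on an evaluation metric before it's considered for deployment.

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="churn-model-experiments",  # placeholder experiment name
)

aiplatform.start_run("run-2025-01-15")
aiplatform.log_params({"learning_rate": 0.001, "epochs": 20, "dataset_version": "v42"})

# ... training and evaluation happen here ...
eval_metrics = {"accuracy": 0.94, "f1_score": 0.91}  # placeholder results

aiplatform.log_metrics(eval_metrics)
aiplatform.end_run()

# Gate promotion on an explicit threshold so that only models that pass
# evaluation are considered for deployment.
F1_THRESHOLD = 0.90
if eval_metrics["f1_score"] < F1_THRESHOLD:
    raise ValueError("Model did not meet the minimum F1 score; do not promote.")
```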

Validate inputs and outputs of AI and ML pipelines

To ensure the reliability and integrity of your AI and ML systems, you must validate data when it enters the systems and moves through the pipelines. You must also verify the inputs and outputs at the component boundaries. Robust validation of all inputs and outputs—raw data, processed data, configurations, arguments, and files—helps to prevent unexpected behavior and maintain model quality throughout the MLOps lifecycle. When you integrate this proactive approach into your MLOps platform, it helps detect errors before they are propagated throughout a system, and it saves time and resources.

To effectively validate the inputs and outputs of your AI and ML pipelines, use the following techniques.

Automate data validation

  • Implement automated data validation in your data ingestion and preprocessing pipelines by using TensorFlow Data Validation (TFDV), as shown in the sketch after this list.
    • For large-scale, SQL-based data quality checks, leverage scalable processing services like BigQuery.
    • For complex, programmatic validation on streaming or batch data, use Dataflow.
  • Monitor data distributions over time with TFDV capabilities.
    • Visualize trends by using tools that are integrated with Cloud Monitoring to detect data drift. You can automatically trigger model retraining pipelines when data patterns change significantly.
  • Store validation results and metrics in BigQuery for analysis and historical tracking, and archive validation artifacts in Cloud Storage.
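
A minimal TFDV sketch, assuming CSV training data and a new batch to validate. The file paths are placeholders, and a production setup would run these steps inside a pipeline component rather than a standalone script.

```python
import tensorflow_data_validation as tfdv

# Infer a schema from the statistics of the training data.
train_stats = tfdv.generate_statistics_from_csv(data_location="gs://my-bucket/data/train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a new batch of data against the schema and compare it with the
# training statistics to detect anomalies, drift, and skew.
new_stats = tfdv.generate_statistics_from_csv(data_location="gs://my-bucket/data/new_batch.csv")
anomalies = tfdv.validate_statistics(
    statistics=new_stats,
    schema=schema,
    previous_statistics=train_stats,
)

if anomalies.anomaly_info:
    # Fail the pipeline step (or trigger retraining) when anomalies are found.
    raise ValueError(f"Data validation failed: {dict(anomalies.anomaly_info)}")
```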

Validate pipeline configurations and input data

To prevent pipeline failures or unexpected behavior caused by incorrect settings, implement strict validation for all pipeline configurations and command-line arguments:

  • Define clear schemas for your configuration files like YAML or JSON by using schema validation libraries like jsonschema for Python. Validate configuration objects against these schemas before a pipeline run starts and before a component executes (see the sketch after this list).
  • Implement input validation for all command-line arguments and pipeline parameters by using argument-parsing libraries like argparse. Validation should check for correct data types, valid values, and required arguments.
  • Within Vertex AI Pipelines, define the expected types and properties of component parameters by using the built-in component input validation features.
  • To ensure reproducibility of pipeline runs and to maintain an audit trail, store validated, versioned configuration files in Cloud Storage or Artifact Registry.
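
A minimal sketch of configuration and argument validation, assuming a JSON config file and the jsonschema and argparse libraries mentioned above. The schema fields are illustrative.

```python
import argparse
import json

import jsonschema

# Illustrative schema; define the fields that your pipeline actually requires.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["model_name", "learning_rate", "epochs"],
    "properties": {
        "model_name": {"type": "string", "minLength": 1},
        "learning_rate": {"type": "number", "exclusiveMinimum": 0},
        "epochs": {"type": "integer", "minimum": 1},
    },
}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Training pipeline entry point.")
    parser.add_argument("--config", required=True, help="Path to the JSON config file.")
    parser.add_argument("--output-dir", required=True, help="Cloud Storage path for artifacts.")
    return parser.parse_args()


def load_config(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)
    # Raises jsonschema.ValidationError before the pipeline run starts.
    jsonschema.validate(instance=config, schema=CONFIG_SCHEMA)
    return config


if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    print(f"Validated config for {config['model_name']}; writing artifacts to {args.output_dir}")
```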

Validate input and output files

Validate input and output files such as datasets, model artifacts, and evaluation reports for integrity and format correctness:

  • Validate file formats like CSV, Parquet, and image types by using libraries.
  • For large files or critical artifacts, validate file sizes and checksums to detect corruption or incomplete transfers by using Cloud Storage data validation and change detection. A checksum sketch follows this list.
  • Perform file validation by using Cloud Run functions (for example, based on file upload events) or within Dataflow pipelines.
  • Store validation results in BigQuery for easier retrieval and analysis.
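
A minimal sketch of checksum validation against Cloud Storage object metadata, using the Python client library. The bucket, object, and local path are placeholders; Cloud Storage also exposes CRC32C hashes if you prefer them over MD5.

```python
import base64
import hashlib

from google.cloud import storage


def verify_gcs_md5(bucket_name: str, object_name: str, local_path: str) -> bool:
    """Compare a local file's MD5 digest with the hash stored in object metadata."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(object_name)
    if blob is None:
        raise FileNotFoundError(f"gs://{bucket_name}/{object_name} not found")

    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    local_md5 = base64.b64encode(md5.digest()).decode("utf-8")

    # blob.md5_hash is the base64-encoded MD5 that Cloud Storage computed on upload.
    return local_md5 == blob.md5_hash


# Example: fail a pipeline step if the downloaded artifact is corrupted or incomplete.
if not verify_gcs_md5("my-bucket", "models/churn/model.joblib", "/tmp/model.joblib"):
    raise ValueError("Checksum mismatch: artifact is corrupted or incomplete.")
```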

Automate deployment and implement continuous monitoring

Automated deployment and continuous monitoring of models in production help to ensure reliability, enable rapid updates, and detect issues promptly. These practices involve managing model versions, performing controlled deployments, automating deployment by using CI/CD, and comprehensive monitoring, as described in the following sections.

Manage model versions

Manage model iterations and associated artifacts by using versioning tools:

  • To track model versions and metadata and to link to underlying model artifacts, use Model Registry (see the sketch after this list).
  • Implement a clear versioning scheme (such as semantic versioning). For each model version, attach comprehensive metadata such as training parameters, evaluation metrics from validation pipelines, and the dataset version.
  • Store model artifacts such as model files, pretrained weights, and serving container images in Artifact Registry, and use its versioning and tagging features.
  • To meet security and governance requirements, define stringent access-control policies for Model Registry and Artifact Registry.
  • To programmatically register and manage versions and to integrate versions into automated CI/CD pipelines, use the Vertex AI SDK or API.
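
A minimal sketch that registers a new model version in Model Registry with the Vertex AI SDK for Python. The artifact URI, serving container, parent model, and labels are placeholders; labels are one lightweight way to attach metadata such as the dataset version or git commit.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-bucket/models/churn/2025-01-15/",  # placeholder artifact path
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
    # Pass the resource name of an existing model to register this upload as a new version.
    parent_model="projects/my-project/locations/us-central1/models/1234567890",
    version_aliases=["candidate"],
    labels={"dataset_version": "v42", "git_commit": "abc1234"},
)

print(model.resource_name, model.version_id)
```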

Perform controlled deployment

Control the deployment of model versions to endpoints by using your serving platform's traffic management capabilities.

  • Implement a rolling deployment by using the traffic splitting feature of Vertex AI endpoints (see the sketch after this list).
  • If you deploy your model to GKE, use advanced traffic management techniques like canary deployment:
    1. Route a small subset of the production traffic to a new model version.
    2. Continuously monitor performance and error rates through metrics.
    3. Establish that the model is reliable.
    4. Roll out the version to all traffic.
  • Perform A/B testing of AI agents:
    1. Deploy two different model-agent versions or entirely different models to the same endpoint.
    2. Split traffic across the deployments.
    3. Analyze the results against business objectives.
  • Implement automated rollback mechanisms that can quickly revert endpoint traffic to a previous stable model version if monitoring alerts are triggered or performance thresholds are missed.
  • Configure traffic splitting and deployment settings programmatically by using the Vertex AI SDK or API.
  • Use Cloud Monitoring to track performance and traffic across versions.
  • Automate deployment with CI/CD pipelines. You can use Cloud Build to build containers, version artifacts, and trigger deployment to Vertex AI endpoints.
  • Ensure that the CI/CD pipelines manage versions and pull from Artifact Registry.
  • Before you shift traffic, perform automated endpoint testing for prediction correctness, latency, throughput, and API function.
  • Store all configurations in version control.
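
A minimal sketch of a canary-style rollout on a Vertex AI endpoint with the Vertex AI SDK for Python. The endpoint and model resource names are placeholders, and the traffic percentages are illustrative; shift the remaining traffic only after monitoring confirms that the new version is healthy.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)
new_model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/9876543210@2"  # version 2
)

# Deploy the new version and send it 10% of traffic; the currently deployed
# version keeps the remaining 90%.
endpoint.deploy(
    model=new_model,
    machine_type="n1-standard-8",
    min_replica_count=2,
    max_replica_count=10,
    traffic_percentage=10,
)

# After monitoring confirms that the canary is healthy, shift all traffic to the
# new deployed model (look up its ID in the endpoint's traffic split), for example:
#   endpoint.update(traffic_split={"<new-deployed-model-id>": 100})
```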

Monitor continuously

  • Use Model Monitoring to automatically detect performance degradation, data drift (changes in input distribution compared to training), and prediction drift (changes in model outputs).
    • Configure drift detection jobs with thresholds and alerts.
    • Monitor real-time performance: prediction latency, throughput, and error rates.
  • Define custom metrics in Cloud Monitoring for business KPIs (see the sketch after this list).
  • Integrate Model Monitoring results and custom metrics with Cloud Monitoring for alerts and dashboards.
  • Configure notification channels like email, Slack, or PagerDuty, and configure automated remediation.
  • To debug prediction logs, use Cloud Logging.
  • Integrate monitoring with incident management.
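
A minimal sketch that writes a custom business metric to Cloud Monitoring with the Python client library. The project ID, metric name, and value are placeholders; once the metric exists, you can build alerting policies and dashboards on it.

```python
import time

from google.cloud import monitoring_v3

PROJECT = "my-project"  # placeholder project ID

client = monitoring_v3.MetricServiceClient()

series = monitoring_v3.TimeSeries()
# Custom metrics must use the custom.googleapis.com/ prefix; the name is a placeholder.
series.metric.type = "custom.googleapis.com/genai/harmful_response_rate"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": 0.0007}}  # placeholder measurement
)
series.points = [point]

client.create_time_series(name=f"projects/{PROJECT}", time_series=[series])
```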

For generative AI endpoints, monitor output characteristics like toxicity and coherence:

  • Monitor feature serving for drift.
  • Implement granular prediction validation: validate outputs against expected ranges and formats by using custom logic.
  • Monitor prediction distributions for shifts.
  • Validate the output schema.
  • Configure alerts for unexpected outputs and shifts.
  • Track and respond to real-time validation events by using Pub/Sub.

Ensure that the output of comprehensive monitoring feeds back into continuous training.

Maintain trust and control through data and model governance

AI and ML reliability extends beyond technical uptime. It includes trust and robust data and model governance. AI outputs might be inaccurate, biased, or outdated. Such issues erode trust and can cause harm. Comprehensive traceability, strong access control, automated validation, and transparent practices help to ensure that AI outputs are reliable, trustworthy, and meet ethics standards.

To maintain trust and control through data and model governance, consider the following recommendations.

Establish data and model catalogs for traceability

To facilitate comprehensive tracing, auditing, and understanding the lineage of your AI and ML assets, maintain a robust, centralized record of data and model versions throughout their lifecycle. A reliable data and model catalog serves as the single source of truth for all of the artifacts that are used and produced by your AI and ML pipelines, from raw data sources and processed datasets to trained model versions and deployed endpoints.

Use the following products, tools, and techniques to create and maintain catalogs for your data assets:

  • Build an enterprise-wide catalog of your data assets by using Dataplex Universal Catalog. To automatically discover and build inventories of the data assets, integrate Dataplex Universal Catalog with your storage systems, such as BigQuery, Cloud Storage, and Pub/Sub.
  • Ensure that your data is highly available and durable by storing it in Cloud Storage multi-region or dual-region buckets. Data that you upload to these buckets is stored redundantly across at least two separate geographic locations. This redundancy provides built-in resilience against regional outages, and it helps to ensure data integrity.
  • Tag and annotate your datasets with relevant business metadata, ownership information, sensitivity levels, and lineage details. For example, link a processed dataset to its raw source and to the pipeline that created the dataset.
  • Create a central repository for model versions by using Model Registry. Register each trained model version and link it to the associated metadata. The metadata can include the following:
    • Training parameters.
    • Evaluation metrics from validation pipelines.
    • Dataset version that was used for training, with lineage traced back to the relevant Dataplex Universal Catalog entry.
    • Code version that produced the dataset.
    • Details about the framework or foundation model that was used.
  • Before you import a model into Model Registry, store model artifacts like model files and pretrained weights in a service like Cloud Storage. Store custom container images for serving or custom training jobs in a secure repository like Artifact Registry.
  • To ensure that data and model assets are automatically registered and updated in the respective catalogs upon creation or modification, implement automated processes within your MLOps pipelines. This comprehensive cataloging provides end-to-end traceability from raw data to prediction, which lets you audit the inputs and processes that led to a specific model version or prediction. The auditing capability is vital for debugging unexpected behavior, ensuring compliance with data usage policies, and understanding the impact of data or model changes over time.
  • For generative AI and foundation models, your catalog must also track details about the specific foundation model used, fine-tuning parameters, and evaluation results that are specific to the quality and safety of the generated output.

Implement robust access controls and audit trails

To maintain trust and control in your AI and ML systems, it's essential that you protect sensitive data and models from unauthorized access and ensure accountability for all changes.

  • Implement strict access controls and maintain detailed audit trails across all components of your AI and ML systems in Google Cloud.
  • Define granular permissions in IAM for users, groups, and service accounts that interact with your AI and ML resources.
  • Follow the principle of least privilege rigorously.
  • Grant only the minimum necessary permissions for specific tasks. For example, a training service account needs read access to training data and write access for model artifacts, but the service might not need write access to production serving endpoints.

Apply IAM policies consistently across all relevant assets and resources in your AI and ML systems, including the following:

  • Cloud Storage buckets that contain sensitive data or model artifacts.
  • BigQuery datasets.
  • Vertex AI resources, such as model repositories, endpoints, pipelines, and Feature Store resources.
  • Compute resources, such as GKE clusters and Cloud Run services.

Use auditing and logs to capture, monitor, and analyze access activity:

  • Enable Cloud Audit Logs for all of the Google Cloud services that are used by your AI and ML system.
  • Configure audit logs to capture detailed information about API calls, data access events, and configuration changes made to your resources. Monitor the logs for suspicious activity, unauthorized access attempts, or unexpected modifications to critical data or model assets.
  • For real-time analysis, alerting, and visualization, stream the audit logs to Cloud Logging.
  • For cost-effective long-term storage and retrospective security analysis or compliance audits, export the logs to BigQuery.
  • For centralized security monitoring, integrate audit logs with your security information and event management (SIEM) systems. Regularly review access policies and audit trails to ensure that they align with your governance requirements and to detect potential policy violations.
  • For applications that handle sensitive data, such as personally identifiable information (PII) for training or inference, use Sensitive Data Protection checks within pipelines or on data storage.
  • For generative AI and agentic solutions, use audit trails to help track who accessed specific models or tools, what data was used for fine-tuning or prompting, and what queries were sent to production endpoints. The audit trails help you to ensure accountability, and they provide crucial data for you to investigate misuse of data or policy violations.

Address bias, transparency, and explainability

To build trustworthy AI and ML systems, you need to address potential biases that are inherent in data and models, strive for transparency in system behavior, and provide explainability for model outputs. It's especially crucial to build trustworthy systems in sensitive domains or when you use complex models like those that are typically used for generative AI applications.

  • Implement proactive practices to identify and mitigate bias throughout the MLOps lifecycle.
  • Analyze training data for bias by using tools that detect skew in feature distributions across different demographic groups or sensitive attributes.
  • Evaluate the overall model performance and the performance across predefined slices of data. Such evaluation helps you to identify disparate performance or bias that affects specific subgroups.

For model transparency and explainability, use tools that help users and developers understand why a model made a particular prediction or produced a specific output.

  • For tabular models that are deployed on Vertex AI endpoints, generate feature attributions by using Vertex Explainable AI. Feature attributions indicate the input features that contributed most to the prediction.
  • Interactively explore model behavior and potential biases on a dataset by using model-agnostic tools like the What-If Tool, which integrates with TensorBoard.
  • Integrate explainability into your monitoring dashboards. In situations where understanding the model's reasoning is important for trust or decision-making, provide explainability data directly to end users through your application interfaces.
  • For complex models like the LLMs that are used in generative AI applications, explain the process that an agent followed, such as by using trace logs. Explainability is relatively challenging for such models, but it's still vital.
  • In RAG applications, provide citations for retrieved information. You can also use techniques like prompt engineering to guide the model to provide explanations or show its reasoning steps.
  • Detect shifts in model behavior or outputs that might indicate emerging bias or unfairness by implementing continuous monitoring in production. Document model limitations, intended use cases, and known potential biases as part of the model's metadata in the Model Registry.

Implement holistic AI and ML observability and reliability practices

Holistic observability is essential for managing complex AI and ML systems in production. It's also essential for measuring the reliability of complex AI and ML systems, especially for generative AI, due to its complexity, resource intensity, and potential for unpredictable outputs. Holistic observability involves observing infrastructure, application code, data, and model behavior to gain insights for proactive issue detection, diagnosis, and response. This observability ultimately leads to high-performance, reliable systems. To achieve holistic observability, you need to do the following:

  • Adopt SRE principles.
  • Define clear reliability goals.
  • Track metrics across system layers.
  • Use insights from observability for continuous improvement and proactivemanagement.

To implement holistic observability and reliability practices for AI and ML workloads in Google Cloud, consider the following recommendations.

Establish reliability goals and business metrics

Identify the key performance indicators (KPIs) that your AI and ML system directly affects. The KPIs might include revenue that's influenced by AI recommendations, customer churn that the AI systems predicted or mitigated, and user engagement and conversion rates that are driven by generative AI features.

For each KPI, define the corresponding technical reliability metrics that affect the KPI. For example, if the KPI is "customer satisfaction with a conversational AI assistant," then the corresponding reliability metrics can include the following:

  • The success rate of user requests.
  • The latency of responses: time to first token (TTFT) and token streaming for LLMs.
  • The rate of irrelevant or harmful responses.
  • The rate of successful task completion by the agent.

For AI and ML training, reliability metrics can include model FLOPS utilization (MFU), iterations per second, tokens per second, and tokens per device.

To effectively measure and improve AI and ML reliability, begin by setting clear reliability goals that are aligned with the overarching business objectives. Adopt the SRE approach by defining SLOs that quantify acceptable levels of reliability and performance for your AI and ML services from the users' perspective. Quantify these technical reliability metrics with specific SLO targets.

The following are examples of SLO targets:

  • 99.9% of API calls must return a successful response.
  • 95th percentile inference latency must be below 300 ms.
  • TTFT must be below 500 ms for 99% of requests.
  • Rate of harmful output must be below 0.1%.

Aligning SLOs directly with business needs ensures that reliability efforts are focused on the most critical system behavior that affects users and the business. This approach helps to transform reliability into a measurable and actionable engineering property.

Monitor infrastructure and application performance

Track infrastructure metrics across all of the resources that are used by your AI and ML systems. The metrics include processor usage (CPU, GPU, and TPU), memory usage, network throughput and latency, and disk I/O. Track the metrics for managed environments like Vertex AI training and serving and for self-managed resources like GKE nodes and Cloud Run instances.

Monitor the four golden signals for your AI and ML applications:

  • Latency: Time to respond to requests.
  • Traffic: Volume of requests or workload.
  • Error rate: Rate of failed requests or operations.
  • Saturation: Utilization of critical resources like CPU, memory, and GPU or TPU accelerators, which indicates how close your system is to capacity limits.

Perform monitoring by using the following techniques:

  • Collect, store, and visualize the infrastructure and application metrics by using Cloud Monitoring. You can use pre-built dashboards for Google Cloud services and create custom dashboards that are tailored based on your workload's specific performance indicators and infrastructure health.
  • Collect detailed logs from your AI and ML applications and the underlying infrastructure by using Cloud Logging. These logs are essential for troubleshooting and performance analysis. They provide context around events and errors.
  • Pinpoint latency issues and understand request flows across distributed AI and ML microservices by using Cloud Trace. This capability is crucial for debugging complex Vertex AI Agents interactions or multi-component inference pipelines.
  • Identify performance bottlenecks within function blocks in application code by using Cloud Profiler. Identifying performance bottlenecks can help you optimize resource usage and execution time.
  • Gather specific accelerator-related metrics, like detailed GPU utilization per process, memory usage per process, and temperature, by using tools like NVIDIA Data Center GPU Manager (DCGM).

Implement data and model observability

Reliable generative AI systems require robust data and model observability, which starts with end-to-end pipeline monitoring.

  • Track data ingestion rates, processed volumes, and transformation latencies by using services like Dataflow.
  • Monitor job success and failure rates within your MLOps pipelines, including pipelines that are managed by Vertex AI Pipelines.

Continuous assessment of data quality is crucial.

  • Manage and govern data by using Dataplex Universal Catalog:
    • Evaluate accuracy by validating against ground truth or by tracking outlier detection rates.
    • Monitor freshness based on the age of data and the frequency of updates against SLAs.
    • Assess completeness by tracking null-value percentages and required field-fill rates.
    • Ensure validity and consistency through checks for schema adherence and duplication.
  • Proactively detect anomalies by using Cloud Monitoring alerting and through clear data lineage for traceability.
  • For RAG systems, examine the relevance of the retrieved context and the groundedness (attribution to source) of the responses.
  • Monitor the throughput of vector database queries.

Key model observability metrics include input-output token counts and model-specific error rates, such as hallucination or query resolution failures. To track these metrics, use Model Monitoring.

  • Continuously monitor the toxicity scores of the output and user-feedback ratings.
  • Automate the assessment of model outputs against defined criteria by using the Gen AI evaluation service.
  • Ensure sustained performance by systematically monitoring for data and concept drift with comprehensive error-rate metrics.

To track model metrics, you can use TensorBoard or MLflow. For deep analysis and profiling to troubleshoot performance issues, you can use PyTorch XLA profiling or NVIDIA Nsight.

