AI and ML perspective: Operational excellence
This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to build and operate robust AI and ML systems on Google Cloud. These recommendations help you set up foundational elements like observability, automation, and scalability. The recommendations in this document align with the operational excellence pillar of the Google Cloud Well-Architected Framework.
Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the AI and ML systems and pipelines that help drive your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that your operations remain aligned with business goals.
The recommendations in this document are mapped to the following core principles:
- Build a robust foundation for model development
- Automate the model development lifecycle
- Implement observability
- Build a culture of operational excellence
- Design for scalability
Build a robust foundation for model development
To develop and deploy scalable, reliable AI systems that help you achieve your business goals, a robust model-development foundation is essential. Such a foundation enables consistent workflows, automates critical steps in order to reduce errors, and ensures that the models can scale with demand. A strong model-development foundation ensures that your ML systems can be updated, improved, and retrained seamlessly. The foundation also helps you to align your models' performance with business needs, deploy impactful AI solutions quickly, and adapt to changing requirements.
To build a robust foundation to develop your AI models, consider the following recommendations.
Define the problems and the required outcomes
Before you start any AI or ML project, you must have a clear understanding of the business problems to be solved and the required outcomes. Start with an outline of the business objectives and break the objectives down into measurable key performance indicators (KPIs). To organize and document your problem definitions and hypotheses in a Jupyter notebook environment, use tools like Vertex AI Workbench. To implement versioning for code and documents and to document your projects, goals, and assumptions, use tools like Git. To develop and manage prompts for generative AI applications, you can use Vertex AI Studio.
Collect and preprocess the necessary data
To implement data preprocessing and transformation, you can use Dataflow (for Apache Beam), Dataproc (for Apache Spark), or BigQuery if a SQL-based process is appropriate. To validate schemas and detect anomalies, use TensorFlow Data Validation (TFDV) and take advantage of automated data quality scans in BigQuery where applicable.
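For example, the following sketch shows one way to infer a baseline schema from training data with TFDV and validate a new batch against it. The Cloud Storage paths are placeholders.

```python
# A minimal TFDV sketch: infer a schema from training data and validate a
# new batch of data against it. The bucket paths are placeholders.
import tensorflow_data_validation as tfdv

# Compute summary statistics for the training data.
train_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://your-bucket/train.csv"
)

# Infer a baseline schema from the training statistics.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a new batch of data against the baseline schema.
new_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://your-bucket/new_batch.csv"
)
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```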
For generative AI, data quality includes accuracy, relevance, diversity, and alignment with the required output characteristics. In cases where real-world data is insufficient or imbalanced, you can generate synthetic data to help improve model robustness and generalization. To create synthetic datasets based on existing patterns or to augment training data for better model performance, use BigQuery DataFrames and Gemini. Synthetic data is particularly valuable for generative AI because it can help improve prompt diversity and overall model robustness. When you build datasets for fine-tuning generative AI models, consider using the synthetic data generation capabilities in Vertex AI.
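The following sketch illustrates one way to generate synthetic examples with Gemini through the Vertex AI SDK. The project, model version, and prompt are illustrative assumptions that you would adapt to your own dataset and use case.

```python
# A minimal sketch of generating synthetic training examples with Gemini
# through the Vertex AI SDK. The project, location, model version, and
# prompt are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-flash")  # assumed model version

prompt = (
    "Generate 5 short customer-support questions about delayed orders, "
    "one per line, in a neutral tone."
)
response = model.generate_content(prompt)

# Split the response into individual synthetic examples.
synthetic_examples = [line for line in response.text.splitlines() if line.strip()]
print(synthetic_examples)
```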
For generative AI tasks like fine-tuning or reinforcement learning from human feedback (RLHF), ensure that labels accurately reflect the quality, relevance, and safety of the generated outputs.
Select an appropriate ML approach
When you design your model and parameters, consider the model's complexity and computational needs. Depending on the task (such as classification, regression, or generation), consider using Vertex AI custom training for custom model building or AutoML for simpler ML tasks. For common applications, you can also access pretrained models through Vertex AI Model Garden. You can experiment with a variety of state-of-the-art foundation models for various use cases, such as generating text, images, and code.
You might want to fine-tune a pretrained foundation model to achieve optimal performance for your specific use case. For high-performance requirements in custom training, configure Cloud Tensor Processing Units (TPUs) or GPU resources to accelerate the training and inference of deep-learning models, like large language models (LLMs) and diffusion models.
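As an illustration, the following sketch submits a Vertex AI custom training job that requests GPU accelerators. The script path, container image, and machine configuration are placeholders, not prescriptions.

```python
# A minimal sketch of a Vertex AI custom training job that requests a GPU.
# The project, bucket, script, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",
    location="us-central1",
    staging_bucket="gs://your-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="fine-tune-demo",
    script_path="train.py",  # your training script
    container_uri="us-docker.pkg.dev/your-registry/training/pytorch-gpu:latest",  # placeholder image
)

# Run the job on a single GPU-backed replica.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```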
Set up version control for code, models, and data
To manage and deploy code versions effectively, use tools like GitHub or GitLab. These tools provide robust collaboration features, branching strategies, and integration with CI/CD pipelines to ensure a streamlined development process.
Use appropriate solutions to manage each artifact of your ML system, like the following examples:
- For code artifacts like container images and pipeline components, Artifact Registry provides a scalable storage solution that can help improve security. Artifact Registry also includes versioning and can integrate with Cloud Build and Cloud Deploy.
- To manage data artifacts, like datasets used for training and evaluation, use solutions like BigQuery or Cloud Storage for storage and versioning.
- To store metadata and pointers to data locations, use your version control system or a separate data catalog.
To maintain the consistency and versioning of your feature data, use Vertex AI Feature Store. To track and manage model artifacts, including binaries and metadata, use Vertex AI Model Registry, which lets you store, organize, and deploy model versions seamlessly.
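For example, the following sketch registers a new version of a trained model in Vertex AI Model Registry. The artifact URI, serving container image, and parent model ID are placeholders.

```python
# A minimal sketch of registering a new model version in Vertex AI
# Model Registry. Resource names and URIs are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://your-bucket/models/churn/v2/",
    serving_container_image_uri="us-docker.pkg.dev/your-registry/serving/sklearn-cpu:latest",  # placeholder image
    # Uploading with the same parent model registers this upload as a new version.
    parent_model="projects/your-project-id/locations/us-central1/models/1234567890",  # placeholder ID
    is_default_version=True,
)
print(model.resource_name, model.version_id)
```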
To ensure model reliability, implement Vertex AI Model Monitoring. Detect data drift, track performance, and identify anomalies in production. For generative AI systems, monitor shifts in output quality and safety compliance.
Automate the model-development lifecycle
Automation helps you to streamline every stage of the AI and ML lifecycle. Automation reduces manual effort and standardizes processes, which leads to enhanced operational efficiency and a lower risk of errors. Automated workflows enable faster iteration, consistent deployment across environments, and more reliable outcomes, so your systems can scale and adapt seamlessly.
To automate the development lifecycle of your AI and ML systems, consider the following recommendations.
Use a managed pipeline orchestration system
Use Vertex AI Pipelines to automate every step of the ML lifecycle, from data preparation to model training, evaluation, and deployment. To accelerate deployment and promote consistency across projects, automate recurring tasks with scheduled pipeline runs, monitor workflows with execution metrics, and develop reusable pipeline templates for standardized workflows. These capabilities extend to generative AI models, which often require specialized steps like prompt engineering, response filtering, and human-in-the-loop evaluation. For generative AI, Vertex AI Pipelines can automate these steps, including the evaluation of generated outputs against quality metrics and safety guidelines. To improve prompt diversity and model robustness, automated workflows can also include data augmentation techniques.
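The following sketch outlines a minimal two-step pipeline that is defined with the Kubeflow Pipelines (KFP) v2 SDK and run on Vertex AI Pipelines. The component logic, project, and bucket names are placeholders.

```python
# A minimal sketch of a two-step pipeline defined with the KFP v2 SDK and
# submitted to Vertex AI Pipelines. Step logic, project, and bucket names
# are placeholders.
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.11")
def preprocess(raw_data_uri: str) -> str:
    # Placeholder for real preprocessing logic.
    return raw_data_uri + "/processed"


@dsl.component(base_image="python:3.11")
def train(processed_data_uri: str) -> str:
    # Placeholder for real training logic; returns a model artifact URI.
    return "gs://your-bucket/models/latest"


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_data_uri: str = "gs://your-bucket/raw"):
    processed = preprocess(raw_data_uri=raw_data_uri)
    train(processed_data_uri=processed.output)


# Compile the pipeline definition, then submit it as a Vertex AI pipeline run.
compiler.Compiler().compile(
    pipeline_func=training_pipeline, package_path="training_pipeline.json"
)

aiplatform.init(project="your-project-id", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="demo-training-pipeline",
    template_path="training_pipeline.json",
    pipeline_root="gs://your-bucket/pipeline-root",
)
job.submit()
```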
Implement CI/CD pipelines
To automate the building, testing, and deployment of ML models, use Cloud Build. This service is particularly effective when you run test suites for application code, which ensures that the infrastructure, dependencies, and model packaging meet your deployment requirements.
ML systems often need additional steps beyond code testing. For example, you need to stress test the models under varying loads, perform bulk evaluations to assess model performance across diverse datasets, and validate data integrity before retraining. To simulate realistic workloads for stress tests, you can use tools like Locust, Grafana k6, or Apache JMeter. To identify bottlenecks, monitor key metrics like latency, error rate, and resource utilization through Cloud Monitoring. For generative AI, the testing must also include evaluations that are specific to the type of generated content, such as text quality, image fidelity, or code functionality. These evaluations can involve automated metrics like perplexity for language models or human-in-the-loop evaluation for more nuanced aspects like creativity and safety.
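For example, a minimal Locust load test for a prediction endpoint might look like the following sketch. The host, path, payload, and authentication details are assumptions that you would adapt to your own serving setup.

```python
# A minimal Locust sketch for load testing a prediction endpoint.
# The request path and payload are placeholders; add authentication as needed.
from locust import HttpUser, task, between


class PredictionUser(HttpUser):
    # Simulated think time between requests for each virtual user.
    wait_time = between(1, 3)

    @task
    def predict(self):
        self.client.post(
            "/predict",  # placeholder path for your serving endpoint
            json={"instances": [[0.1, 0.5, 0.9]]},
        )
```

You can then run the test with a command like `locust -f locustfile.py --host=https://your-endpoint.example.com` and ramp up the number of simulated users while you watch latency, error-rate, and utilization metrics in Cloud Monitoring.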
To implement testing and evaluation tasks, you can integrate Cloud Build with other Google Cloud services. For example, you can use Vertex AI Pipelines for automated model evaluation, BigQuery for large-scale data analysis, and Dataflow pipelines for feature validation.
You can further enhance your CI/CD pipeline by using Vertex AI for continuous training to enable automated retraining of models on new data. Specifically for generative AI, to keep the generated outputs relevant and diverse, the retraining might involve automatically updating the models with new training data or prompts. You can use Vertex AI Model Garden to select the latest base models that are available for tuning. This practice ensures that the models remain current and optimized for your evolving business needs.
Implement safe and controlled model releases
To minimize risks and ensure reliable deployments, implement a model release approach that lets you detect issues early, validate performance, and roll back quickly when required.
To package your ML models and applications into container images and deploy them, use Cloud Deploy. You can deploy your models to Vertex AI endpoints.
Implement controlled releases for your AI applications and systems by using strategies like canary releases. For applications that use managed models like Gemini, we recommend that you gradually release new application versions to a subset of users before the full deployment. This approach lets you detect potential issues early, especially when you use generative AI models where outputs can vary.
To release fine-tuned models, you can use Cloud Deploy to manage the deployment of the model versions, and use the canary release strategy to minimize risk. With managed models and fine-tuned models, the goal of controlled releases is to test changes with a limited audience before you release the applications and models to all users.
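As an example, the following sketch performs a canary-style rollout on a Vertex AI endpoint by routing a small share of traffic to a new model version. The endpoint and model resource names are placeholders.

```python
# A minimal sketch of a canary rollout on a Vertex AI endpoint: the new model
# version initially receives a small share of traffic. Resource names are
# placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/us-central1/endpoints/1234567890"
)
new_model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/9876543210"
)

# Deploy the new version and route 10% of traffic to it; the previously
# deployed version keeps the remaining 90%.
endpoint.deploy(
    model=new_model,
    machine_type="n1-standard-4",
    traffic_percentage=10,
)
```

If the canary version behaves as expected, you can gradually increase its traffic share; otherwise, you can undeploy it and let the stable version continue to serve all traffic.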
For robust validation, use Vertex AI Experiments to compare new models against existing ones, and use Vertex AI model evaluation to assess model performance. Specifically for generative AI, define evaluation metrics that align with the intended use case and the potential risks. You can use the Gen AI evaluation service in Vertex AI to assess metrics like toxicity, coherence, factual accuracy, and adherence to safety guidelines.
To ensure deployment reliability, you need a robust rollback plan. For traditional ML systems, use Vertex AI Model Monitoring to detect data drift and performance degradation. For generative AI models, you can track relevant metrics and set up alerts for shifts in output quality or the emergence of harmful content by using Vertex AI model evaluation along with Cloud Logging and Cloud Monitoring. Configure alerts based on generative AI-specific metrics to trigger rollback procedures when necessary. To track model lineage and revert to the most recent stable version, use insights from Vertex AI Model Registry.
Implement observability
The behavior of AI and ML systems can change over time due to changes in the data or environment and updates to the models. This dynamic nature makes observability crucial to detect performance issues, biases, or unexpected behavior. This is especially true for generative AI models because the outputs can be highly variable and subjective. Observability lets you proactively address unexpected behavior and ensure that your AI and ML systems remain reliable, accurate, and fair.
To implement observability for your AI and ML systems, consider the following recommendations.
Monitor performance continuously
Use metrics and success criteria for ongoing evaluation of models after deployment.
You can use Vertex AI Model Monitoring to proactively track model performance, identify training-serving skew and prediction drift, and receive alerts to trigger necessary model retraining or other interventions. To effectively monitor for training-serving skew, construct a golden dataset that represents the ideal data distribution, and use TFDV to analyze your training data and establish a baseline schema.
Configure Model Monitoring to compare the distribution of input data against the golden dataset for automatic skew detection. For traditional ML models, focus on metrics like accuracy, precision, recall, F1-score, AUC-ROC, and log loss. Define custom thresholds for alerts in Model Monitoring. For generative AI, use the Gen AI evaluation service to continuously monitor model output in production. You can also enable automatic evaluation metrics for response quality, safety, instruction adherence, grounding, writing style, and verbosity. To assess the generated outputs for quality, relevance, safety, and adherence to guidelines, you can incorporate human-in-the-loop evaluation.
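For example, the following sketch uses TFDV to compare serving data against a golden training dataset and to flag skew on a specific feature. The paths, feature name, and threshold are illustrative assumptions.

```python
# A minimal sketch of training-serving skew detection with TFDV, using a
# golden (training) dataset as the baseline. Paths, the feature name, and the
# threshold are placeholders.
import tensorflow_data_validation as tfdv

train_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://your-bucket/golden.csv"
)
serving_stats = tfdv.generate_statistics_from_csv(
    data_location="gs://your-bucket/serving_sample.csv"
)

schema = tfdv.infer_schema(statistics=train_stats)

# Flag skew on a feature when the L-infinity distance between the training
# and serving distributions exceeds the threshold.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats,
)
tfdv.display_anomalies(skew_anomalies)
```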
Create feedback loops to automatically retrain models with Vertex AI Pipelines when Model Monitoring triggers an alert. Use these insights to improve your models continuously.
Evaluate models during development
Before you deploy your LLMs and other generative AI models, thoroughly evaluate them during the development phase. Use Vertex AI model evaluation to achieve optimal performance and to mitigate risk. Use Vertex AI rapid evaluation to let Google Cloud automatically run evaluations based on the dataset and prompts that you provide.
You can also define and integrate custom metrics that are specific to your use case. For feedback on generated content, integrate human-in-the-loop workflows by using Vertex AI Model Evaluation.
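The following sketch shows one way to run such evaluations with the Gen AI evaluation service through the Vertex AI SDK. The dataset, metric names, experiment name, and model version are assumptions; check the SDK version that you use for the exact module path and supported metrics.

```python
# A minimal sketch of the Gen AI evaluation service through the Vertex AI SDK.
# The dataset, metric names, experiment name, and model version are
# illustrative assumptions.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize our refund policy in two sentences.",
            "Explain how to reset a password.",
        ],
        "reference": [
            "Refunds are issued within 14 days of purchase...",
            "Open account settings, select Security, then Reset password...",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency", "safety", "groundedness"],
    experiment="gen-ai-eval-demo",
)

# Generate responses with the model and score them against the metrics.
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-flash"))
print(result.summary_metrics)
```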
Use adversarial testing to identify vulnerabilities and potential failure modes. To identify and mitigate potential biases, use techniques like subgroup analysis and counterfactual generation. Use the insights gathered from the evaluations that were completed during the development phase to define your model monitoring strategy in production. Prepare your solution for continuous monitoring as described in the Monitor performance continuously section of this document.
Monitor for availability
To gain visibility into the health and performance of your deployed endpoints and infrastructure, use Cloud Monitoring. For your Vertex AI endpoints, track key metrics like request rate, error rate, latency, and resource utilization, and set up alerts for anomalies. For more information, see Cloud Monitoring metrics for Vertex AI.
Monitor the health of the underlying infrastructure, which can include Compute Engine instances, Google Kubernetes Engine (GKE) clusters, and TPUs and GPUs. Get automated optimization recommendations from Active Assist. If you use autoscaling, monitor the scaling behavior to ensure that autoscaling responds appropriately to changes in traffic patterns.
Track the status of model deployments, including canary releases and rollbacks, by integrating Cloud Deploy with Cloud Monitoring. In addition, monitor for potential security threats and vulnerabilities by using Security Command Center.
Set up custom alerts for business-specific thresholds
For timely identification and rectification of anomalies and issues, set up custom alerting based on thresholds that are specific to your business objectives. Examples of Google Cloud products that you can use to implement a custom alerting system include the following:
- Cloud Logging: Collect, store, and analyze logs from all components of your AI and ML system.
- Cloud Monitoring: Create custom dashboards to visualize key metrics and trends, and define custom metrics based on your needs, as shown in the sketch after this list. Configure alerts to get notifications about critical issues, and integrate the alerts with your incident management tools like PagerDuty or Slack.
- Error Reporting: Automatically capture and analyze errors and exceptions.
- Cloud Trace: Analyze the performance of distributed systems and identify bottlenecks. Tracing is particularly useful for understanding latency between different components of your AI and ML pipeline.
- Cloud Profiler: Continuously analyze the performance of your code in production and identify performance bottlenecks in CPU or memory usage.
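For example, the following sketch writes a custom, business-specific metric to Cloud Monitoring so that you can build dashboards and alerting thresholds around it. The metric type and label are illustrative assumptions.

```python
# A minimal sketch of writing a custom business metric to Cloud Monitoring
# with the monitoring_v3 client. The metric type and label are placeholders.
import time

from google.cloud import monitoring_v3

project_id = "your-project-id"
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/genai/toxicity_rate"  # assumed metric name
series.metric.labels["model_version"] = "v3"
series.resource.type = "global"
series.resource.labels["project_id"] = project_id

# Write a single data point for the current time.
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 0.02}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```

You can then define an alerting policy on this metric, for example to notify the on-call channel when the measured rate crosses a business-specific threshold.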
Build a culture of operational excellence
Shift the focus from just building models to building sustainable, reliable, and impactful AI solutions. Empower teams to continuously learn, innovate, and improve, which leads to faster development cycles, reduced errors, and increased efficiency. By prioritizing automation, standardization, and ethical considerations, you can ensure that your AI and ML initiatives consistently deliver value, mitigate risks, and promote responsible AI development.
To build a culture of operational excellence for your AI and ML systems, consider the following recommendations.
Champion automation and standardization
To emphasize efficiency and consistency, embed automation and standardized practices into every stage of the AI and ML lifecycle. Automation reduces manual errors and frees teams to focus on innovation. Standardization ensures that processes are repeatable and scalable across teams and projects.
Prioritize continuous learning and improvement
Foster an environment where ongoing education and experimentation are core principles. Encourage teams to stay up-to-date with AI and ML advancements, and provide opportunities to learn from past projects. A culture of curiosity and adaptation drives innovation and ensures that teams are equipped to meet new challenges.
Cultivate accountability and ownership
Build trust and alignment with clearly defined roles, responsibilities, and metrics for success. Empower teams to make informed decisions within these boundaries, and establish transparent ways to measure progress. A sense of ownership motivates teams and ensures collective responsibility for outcomes.
Embed AI ethics and safety considerations
Prioritize considerations for ethics in every stage of development. Encourage teams to think critically about the impact of their AI solutions, and foster discussions on fairness, bias, and societal impact. Clear principles and accountability mechanisms ensure that your AI systems align with organizational values and promote trust.
Design for scalability
To accommodate growing data volumes and user demands and to maximize the value of AI investments, your AI and ML systems need to be scalable. The systems must adapt and perform optimally to avoid performance bottlenecks that hinder effectiveness. When you design for scalability, you ensure that the AI infrastructure can handle growth and maintain responsiveness. Use scalable infrastructure, plan for capacity, and employ strategies like horizontal scaling and managed services.
To design your AI and ML systems for scalability, consider the following recommendations.
Plan for capacity and quotas
Assess future growth, and plan your infrastructure capacity and resource quotas accordingly. Work with business stakeholders to understand the projected growth, and then define the infrastructure requirements.
Use Cloud Monitoring to analyze historical resource utilization, identify trends, and project future needs. Conduct regular load testing to simulate workloads and identify bottlenecks.
Familiarize yourself with Google Cloud quotas for the services that you use, such as Compute Engine, Vertex AI, and Cloud Storage. Proactively request quota increases through the Google Cloud console, and justify the increases with data from forecasting and load testing. Monitor quota usage and set up alerts to get notifications when the usage approaches the quota limits.
To optimize resource usage based on demand, rightsize your resources, use Spot VMs for fault-tolerant batch workloads, and implement autoscaling.
Prepare for peak events
Ensure that your system can handle sudden spikes in traffic or workload during peak events. Document your peak event strategy and conduct regular drills to test your system's ability to handle increased load.
To aggressively scale up resources when the demand spikes, configure autoscaling policies in Compute Engine and GKE. For predictable peak patterns, consider using predictive autoscaling. To trigger autoscaling based on application-specific signals, use custom metrics in Cloud Monitoring.
Distribute traffic across multiple application instances by using Cloud Load Balancing. Choose an appropriate load balancer type based on your application's needs. For geographically distributed users, you can use global load balancing to route traffic to the nearest available instance. For complex microservices-based architectures, consider using Cloud Service Mesh.
Cache static content at the edge of Google's network by using Cloud CDN. To cache frequently accessed data, you can use Memorystore, which offers a fully managed in-memory service for Redis, Valkey, or Memcached.
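For example, the following sketch caches generated responses in a Memorystore for Redis instance by using the standard redis client. The instance IP address, key scheme, and TTL are assumptions.

```python
# A minimal sketch of caching model responses in a Memorystore for Redis
# instance with the standard redis client. Host, key scheme, and TTL are
# placeholders.
import hashlib

import redis

# Private IP of your Memorystore for Redis instance (placeholder).
cache = redis.Redis(host="10.0.0.3", port=6379)


def get_cached_response(prompt: str, generate_fn, ttl_seconds: int = 3600) -> str:
    """Return a cached response for the prompt, or generate and cache one."""
    key = "genai:response:" + hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()
    response = generate_fn(prompt)
    cache.setex(key, ttl_seconds, response)
    return response
```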
Decouple the components of your system by using Pub/Sub for real-time messaging and Cloud Tasks for asynchronous task execution.
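The following sketch illustrates the decoupling pattern with Pub/Sub: a producer publishes prediction requests that a downstream worker consumes asynchronously. The project and topic names are placeholders.

```python
# A minimal sketch of decoupling components with Pub/Sub: publish a
# prediction request for asynchronous processing. Names are placeholders.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "prediction-requests")

message = {"request_id": "abc-123", "input_uri": "gs://your-bucket/inputs/item.json"}
future = publisher.publish(topic_path, data=json.dumps(message).encode("utf-8"))
print("Published message ID:", future.result())
```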
Scale applications for production
To ensure scalable serving in production, you can use managed services like Vertex AI distributed training and Vertex AI Inference. Vertex AI Inference lets you configure the machine types for your prediction nodes when you deploy a model to an endpoint or request batch predictions. For some configurations, you can add GPUs. Choose the appropriate machine type and accelerators to optimize latency, throughput, and cost.
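For example, the following sketch deploys a registered model to a Vertex AI endpoint with a GPU accelerator and replica-based autoscaling. The resource names and machine configuration are placeholders that you would size from your own latency, throughput, and cost requirements.

```python
# A minimal sketch of deploying a model to a Vertex AI endpoint with a GPU
# and replica-based autoscaling. Resource names and machine settings are
# placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/1234567890"
)

endpoint = model.deploy(
    deployed_model_display_name="image-classifier-v2",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=4,  # scales out under load, within your quota limits
)
print(endpoint.resource_name)
```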
To scale complex AI and Python applications and custom workloads across distributed computing resources, you can use Ray on Vertex AI. This feature can help optimize performance and enables seamless integration with Google Cloud services. Ray on Vertex AI simplifies distributed computing by handling cluster management, task scheduling, and data transfer. It integrates with other Vertex AI services like training, prediction, and pipelines. Ray provides fault tolerance and autoscaling, and helps you adapt the infrastructure to changing workloads. It offers a unified framework for distributed training, hyperparameter tuning, reinforcement learning, and model serving. Use Ray for distributed data preprocessing with Dataflow or Dataproc, accelerated model training, scalable hyperparameter tuning, reinforcement learning, and parallelized batch prediction.
Contributors
Authors:
- Charlotte Gistelinck, PhD | Partner Engineer
- Sannya Dang | AI Solution Architect
- Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
Other contributors:
- Gary Harmson | Principal Architect
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Ryan Cox | Principal Architect
- Stef Ruinard | Generative AI Field Solutions Architect