Well-Architected Framework: AI and ML perspective

Last reviewed 2025-02-14 UTC

This page provides a one-page view of all of the pages in the AI and ML perspective of the Well-Architected Framework. You can print this page or save it in PDF format by using your browser's print function.

This page doesn't have a table of contents. You can't use the links on this page to navigatewithin the page.

This document in theGoogle Cloud Well-Architected Framework describes principles and recommendations to help you to design, build, andmanage AI and ML workloads in Google Cloud that meet youroperational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects,administrators, developers, and operators who design, build, deploy, andmaintain AI and ML workloads in Google Cloud.

The following pages describe principles and recommendations that are specificto AI and ML, for each pillar of the Well-Architected Framework:

Contributors

Authors:

Benjamin Sadik | AI and ML Specialist Customer Engineer
Charlotte Gistelinck, PhD | Partner Engineer
Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
Isaac Lo | AI Business Development Manager
Kamilla Kurta | GenAI/ML Specialist Customer Engineer
Mohamed Fawzi | Benelux Security and Compliance Lead
Rick (Rugui) Chen | AI Infrastructure Field Solutions Architect
Sannya Dang | AI Solution Architect

Other contributors:

Daniel Lees | Cloud Security Architect
Gary Harmson | Principal Architect
Jose Andrade | Customer Engineer, SRE Specialist
Kumar Dhanagopal | Cross-Product Solution Developer
Marwan Al Shawi | Partner Customer Engineer
Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
Radhika Kanakam | Program Lead, Google Cloud Well-Architected Framework
Ryan Cox | Principal Architect
Samantha He | Technical Writer
Stef Ruinard | Generative AI Field Solutions Architect
Wade Holmes | Global Solutions Director
Zach Seils | Networking Specialist

AI and ML perspective: Operational excellence

This document in theWell-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to build and operaterobust AI and ML systems on Google Cloud. These recommendations help you set upfoundational elements like observability, automation, and scalability. Therecommendations in this document align with theoperational excellence pillar of the Google Cloud Well-Architected Framework.

Operational excellence within the AI and ML domain is the ability to seamlesslydeploy, manage, and govern the AI and ML systems and pipelines that help driveyour organization's strategic objectives. Operational excellence lets yourespond efficiently to changes, reduce operational complexity, and ensure thatyour operations remain aligned with business goals.

The recommendations in this document are mapped to the following coreprinciples:

Build a robust foundation for model development

To develop and deploy scalable, reliable AI systems that help you achieve yourbusiness goals, a robust model-development foundation is essential. Such afoundation enables consistent workflows, automates critical steps in order toreduce errors, and ensures that the models can scale with demand. A strongmodel-development foundation ensures that your ML systems can be updated,improved, and retrained seamlessly. The foundation also helps you to align yourmodels' performance with business needs, deploy impactful AI solutions quickly,and adapt to changing requirements.

To build a robust foundation to develop your AI models, consider the followingrecommendations.

Define the problems and the required outcomes

Before you start any AI or ML project, you must have a clear understanding ofthe business problems to be solved and the required outcomes. Start with anoutline of the business objectives and break the objectives down into measurablekey performance indicators (KPIs). To organize and document your problemdefinitions and hypotheses in a Jupyter notebook environment, use tools likeVertex AI Workbench.To implement versioning for code and documents and to document your projects,goals, and assumptions, use tools like Git. To develop and manage prompts forgenerative AI applications, you can useVertex AI Studio.

Collect and preprocess the necessary data

To implement data preprocessing and transformation, you can useDataflow (for Apache Beam),Dataproc (for Apache Spark), orBigQuery if an SQL-based process is appropriate. To validate schemas and detectanomalies, useTensorFlow Data Validation (TFDV) and take advantage ofautomated data quality scans in BigQuery where applicable.

For generative AI, data quality includes accuracy, relevance, diversity, andalignment with the required output characteristics. In cases where real-worlddata is insufficient or imbalanced, you can generate synthetic data to helpimprove model robustness and generalization. To create synthetic datasets basedon existing patterns or to augment training data for better model performance,useBigQuery DataFrames and Gemini.Synthetic data is particularly valuable for generative AI because it can helpimprove prompt diversity and overall model robustness. When you build datasetsfor fine-tuning generative AI models, consider using the synthetic datageneration capabilities in Vertex AI.

For generative AI tasks like fine-tuning or reinforcement learning from humanfeedback (RLHF), ensure that labels accurately reflect the quality, relevance,and safety of the generated outputs.

Select an appropriate ML approach

When you design your model and parameters, consider the model's complexity andcomputational needs. Depending on the task (such as classification, regression,or generation), consider usingVertex AI custom training for custom model building or AutoML for simpler ML tasks. Forcommon applications, you can also access pretrained models throughVertex AI Model Garden.You can experiment with a variety of state-of-the-art foundation models forvarious use cases, such as generating text, images, and code.

You might want to fine-tune a pretrained foundation model to achieve optimalperformance for your specific use case. For high-performance requirements incustom training, configureCloud Tensor Processing Units (TPUs) orGPU resources to accelerate the training and inference of deep-learning models, like largelanguage models (LLMs) and diffusion models.

Set up version control for code, models, and data

To manage and deploy code versions effectively, use tools like GitHub orGitLab. These tools provide robust collaboration features, branching strategies,and integration with CI/CD pipelines to ensure a streamlined developmentprocess.

Use appropriate solutions to manage each artifact of your ML system, like thefollowing examples:

For code artifacts like container images and pipeline components,Artifact Registry provides a scalable storage solution that can help improve security. Artifact Registryalso includes versioning and can integrate withCloud Build andCloud Deploy.
To manage data artifacts, like datasets used for training and evaluation, usesolutions like BigQuery orCloud Storage for storage and versioning.
To store metadata and pointers to data locations, use your version controlsystem or a separate data catalog.

To maintain the consistency and versioning of your feature data, useVertex AI Feature Store.To track and manage model artifacts, including binaries and metadata, useVertex AI Model Registry,which lets you store, organize, and deploy model versions seamlessly.

To ensure model reliability, implementVertex AI Model Monitoring.Detect data drift, track performance, and identify anomalies in production. Forgenerative AI systems, monitor shifts in output quality and safety compliance.

Automate the model-development lifecycle

Automation helps you to streamline every stage of the AI and ML lifecycle.Automation reduces manual effort and standardizes processes, which leads toenhanced operational efficiency and a lower risk of errors. Automated workflowsenable faster iteration, consistent deployment across environments, and morereliable outcomes, so your systems can scale and adapt seamlessly.

To automate the development lifecycle of your AI and ML systems, consider thefollowing recommendations.

Use a managed pipeline orchestration system

UseVertex AI Pipelines to automate every step of the ML lifecycle—from data preparation to modeltraining, evaluation, and deployment. To accelerate deployment and promoteconsistency across projects, automate recurring tasks withscheduled pipeline runs,monitor workflows withexecution metrics,and developreusable pipeline templates for standardized workflows. These capabilities extend to generative AI models,which often require specialized steps likeprompt engineering,response filtering, andhuman-in-the-loop evaluation. For generative AI, Vertex AI Pipelines can automate thesesteps, including the evaluation of generated outputs against quality metrics andsafety guidelines. To improve prompt diversity and model robustness, automatedworkflows can also include data augmentation techniques.

Implement CI/CD pipelines

To automate the building, testing, and deployment of ML models, useCloud Build.This service is particularly effective when you run test suites for applicationcode, which ensures that the infrastructure, dependencies, and model packagingmeet your deployment requirements.

ML systems often need additional steps beyond code testing. For example, youneed to stress test the models under varying loads, perform bulk evaluations toassess model performance across diverse datasets, and validate data integritybefore retraining. To simulate realistic workloads for stress tests, you can usetools likeLocust,Grafana k6,orApache JMeter.To identify bottlenecks, monitor key metrics like latency,error rate, and resource utilization throughCloud Monitoring.For generative AI, the testing must also include evaluations that are specificto the type of generated content, such as text quality, image fidelity, or codefunctionality. These evaluations can involve automated metrics likeperplexity for language models or human-in-the-loop evaluation for more nuanced aspectslike creativity and safety.

To implement testing and evaluation tasks, you can integrateCloud Build with other Google Cloud services. For example, you canuse Vertex AI Pipelines for automated model evaluation,BigQuery for large-scale data analysis, and Dataflow pipeline validation forfeature validation.

You can further enhance your CI/CD pipeline by usingVertex AI for continuous training to enable automated retraining of models on new data. Specifically forgenerative AI, to keep the generated outputs relevant and diverse, theretraining might involve automatically updating the models with new trainingdata or prompts. You can useVertex AI Model Garden to select the latest base models that are available for tuning. This practiceensures that the models remain current and optimized for your evolving businessneeds.

Implement safe and controlled model releases

To minimize risks and ensure reliable deployments, implement a model releaseapproach that lets you detect issues early, validate performance, and roll backquickly when required.

To package your ML models and applications into container images and deploythem, useCloud Deploy.You can deploy your models toVertex AI endpoints.

Implement controlled releases for your AI applications and systems by usingstrategies likecanary releases.For applications that use managed models like Gemini, we recommend thatyou gradually release new application versions to a subset of users before thefull deployment. This approach lets you detect potential issues early,especially when you use generative AI models where outputs can vary.

To releasefine-tuned models, you can use Cloud Deploy to manage the deployment ofthe model versions, and use the canary release strategy to minimize risk. Withmanaged models and fine-tuned models, the goal of controlled releases is to testchanges with a limited audience before you release the applications and modelsto all users.

For robust validation, useVertex AI Experiments to compare new models against existing ones, and useVertex AI model evaluation to assess model performance. Specifically for generative AI, define evaluationmetrics that align with the intended use case and the potential risks. You canuse the Gen AI evaluation service in Vertex AI to assess metrics liketoxicity, coherence, factual accuracy, and adherence to safety guidelines.

To ensure deployment reliability, you need a robust rollback plan. Fortraditional ML systems, useVertex AI Model Monitoring to detect data drift and performance degradation. For generative AI models, youcan track relevant metrics and set up alerts for shifts in output quality or theemergence of harmful content by using Vertex AI model evaluationalong with Cloud Logging and Cloud Monitoring. Configure alerts based ongenerative AI-specific metrics to trigger rollback procedures when necessary. Totrack model lineage and revert to the most recent stable version, use insightsfromVertex AI Model Registry.

Implement observability

The behavior of AI and ML systems can change over time due to changes in thedata or environment and updates to the models. This dynamic nature makesobservability crucial to detect performance issues, biases, or unexpectedbehavior. This is especially true for generative AI models because the outputscan be highly variable and subjective. Observability lets you proactivelyaddress unexpected behavior and ensure that your AI and ML systems remainreliable, accurate, and fair.

To implement observability for your AI and ML systems, consider the followingrecommendations.

Monitor performance continuously

Use metrics and success criteria for ongoing evaluation of models afterdeployment.

You can useVertex AI Model Monitoring to proactively track model performance, identify training-serving skew andprediction drift, and receive alerts to trigger necessary model retraining orother interventions. To effectively monitor for training-serving skew, constructagolden dataset that represents the ideal data distribution, and useTFDV to analyze your training data and establish a baseline schema.

Configure Model Monitoring to compare the distribution ofinput data against the golden dataset for automatic skew detection. Fortraditional ML models, focus onmetrics like accuracy, precision, recall, F1-score, AUC-ROC, and log loss. Define customthresholds for alerts in Model Monitoring. For generativeAI, use theGen AI evaluation service to continuously monitor model output in production. You can also enableautomatic evaluation metrics for response quality, safety, instructionadherence, grounding, writing style, and verbosity. To assess the generatedoutputs for quality, relevance, safety, and adherence to guidelines, you canincorporatehuman-in-the-loop evaluation.

Create feedback loops to automatically retrain models withVertex AI Pipelines when Model Monitoring triggersan alert. Use these insights to improve your models continuously.

Evaluate models during development

Before you deploy your LLMs and other generative AI models, thoroughly evaluatethem during the development phase. Use Vertex AI model evaluation toachieve optimal performance and to mitigate risk. UseVertex AI rapid evaluation to let Google Cloud automatically run evaluations based on the dataset andprompts that you provide.

You can also define and integrate custom metrics that are specific to your usecase. For feedback on generated content, integrate human-in-the-loop workflowsby using Vertex AI Model Evaluation.

Useadversarial testing to identify vulnerabilities and potential failure modes. To identify andmitigate potential biases, use techniques like subgroup analysis andcounterfactual generation. Use the insights gathered from the evaluations thatwere completed during the development phase to define your model monitoringstrategy in production. Prepare your solution for continuous monitoring asdescribed in theMonitor performance continuously section of this document.

Monitor for availability

To gain visibility into the health and performance of your deployed endpointsand infrastructure, useCloud Monitoring.For your Vertex AI endpoints, track key metrics like requestrate, error rate, latency, and resource utilization, and set up alerts foranomalies. For more information, seeCloud Monitoring metrics for Vertex AI.

Monitor the health of the underlying infrastructure, which can includeCompute Engine instances, Google Kubernetes Engine (GKE) clusters, and TPUs andGPUs. Get automated optimization recommendations fromActive Assist.If you use autoscaling, monitor the scaling behavior to ensure that autoscalingresponds appropriately to changes in traffic patterns.

Track the status of model deployments, including canary releases and rollbacks,by integrating Cloud Deploy with Cloud Monitoring. In addition, monitorfor potential security threats and vulnerabilities by usingSecurity Command Center.

Set up custom alerts for business-specific thresholds

For timely identification and rectification of anomalies and issues, set upcustom alerting based on thresholds that are specific to your businessobjectives. Examples of Google Cloud products that you can use toimplement a custom alerting system include the following:

Cloud Logging:Collect, store, and analyze logs from all components of your AI and ML system.
Cloud Monitoring: Create custom dashboards to visualize key metrics andtrends, and definecustom metrics based on your needs. Configure alerts to get notificationsabout critical issues, and integrate the alerts with your incidentmanagement tools like PagerDuty or Slack.
Error Reporting:Automatically capture and analyze errors and exceptions.
Cloud Trace:Analyze the performance of distributed systems and identify bottlenecks.Tracing is particularly useful for understanding latency between differentcomponents of your AI and ML pipeline.
Cloud Profiler:Continuously analyze the performance of your code in production andidentify performance bottlenecks in CPU or memory usage.

Build a culture of operational excellence

Shift the focus from just building models to building sustainable, reliable, andimpactful AI solutions. Empower teams to continuously learn, innovate, andimprove, which leads to faster development cycles, reduced errors, and increasedefficiency. By prioritizing automation, standardization, and ethicalconsiderations, you can ensure that your AI and ML initiatives consistentlydeliver value, mitigate risks, and promote responsible AI development.

To build a culture of operational excellence for your AI and ML systems,consider the following recommendations.

Champion automation and standardization

To emphasize efficiency and consistency, embed automation and standardizedpractices into every stage of the AI and ML lifecycle. Automation reduces manualerrors and frees teams to focus on innovation. Standardization ensures thatprocesses are repeatable and scalable across teams and projects.

Prioritize continuous learning and improvement

Foster an environment where ongoing education and experimentation are coreprinciples. Encourage teams to stay up-to-date with AI and ML advancements, andprovide opportunities to learn from past projects. A culture of curiosity andadaptation drives innovation and ensures that teams are equipped to meet newchallenges.

Cultivate accountability and ownership

Build trust and alignment with clearly defined roles, responsibilities, andmetrics for success. Empower teams to make informed decisions within theseboundaries, and establish transparent ways to measure progress. A sense ofownership motivates teams and ensures collective responsibility for outcomes.

Embed AI ethics and safety considerations

Prioritize considerations for ethics in every stage of development. Encourageteams to think critically about the impact of their AI solutions, and fosterdiscussions on fairness, bias, and societal impact. Clear principles andaccountability mechanisms ensure that your AI systems align with organizationalvalues and promote trust.

Design for scalability

To accommodate growing data volumes and user demands and to maximize the valueof AI investments, your AI and ML systems need to be scalable. The systems mustadapt and perform optimally to avoid performance bottlenecks that hindereffectiveness. When you design for scalability, you ensure that the AIinfrastructure can handle growth and maintain responsiveness. Use scalableinfrastructure, plan for capacity, and employ strategies like horizontal scalingand managed services.

To design your AI and ML systems for scalability, consider the followingrecommendations.

Plan for capacity and quotas

Assess future growth, and plan your infrastructure capacity and resource quotasaccordingly. Work with business stakeholders to understand the projected growthand then define the infrastructure requirements accordingly.

UseCloud Monitoring to analyze historical resource utilization, identify trends, and project futureneeds. Conduct regular load testing to simulate workloads and identifybottlenecks.

Familiarize yourself withGoogle Cloud quotas for the services that you use, such as Compute Engine, Vertex AI, andCloud Storage. Proactively request quota increases throughthe Google Cloud console, and justify the increases with data from forecastingand load testing.Monitor quota usage and set up alerts to get notifications when the usage approaches the quotalimits.

To optimize resource usage based on demand, rightsize your resources, useSpot VMs for fault-tolerant batch workloads, and implement autoscaling.

Prepare for peak events

Ensure that your system can handle sudden spikes in traffic or workload duringpeak events. Document your peak event strategy and conduct regular drills totest your system's ability to handle increased load.

To aggressively scale up resources when the demand spikes, configure autoscalingpolicies inCompute Engine andGKE.For predictable peak patterns, consider usingpredictive autoscaling.To trigger autoscaling based on application-specific signals, use custom metricsin Cloud Monitoring.

Distribute traffic across multiple application instances by usingCloud Load Balancing.Choose an appropriate load balancer type based on your application's needs. Forgeographically distributed users, you can use global load balancing to routetraffic to the nearest available instance. For complex microservices-basedarchitectures, consider usingCloud Service Mesh.

Cache static content at the edge of Google's network by usingCloud CDN.To cache frequently accessed data, you can useMemorystore,which offers a fully managed in-memory service for Redis, Valkey, or Memcached.

Decouple the components of your system by usingPub/Sub for real-time messaging and Cloud Tasks for asynchronous taskexecution

Scale applications for production

To ensure scalable serving in production, you can use managed services likeVertex AI distributed training andVertex AI Inference.Vertex AI Inference lets you configure themachine types for your prediction nodes when you deploy a model to an endpoint or requestbatch predictions. For some configurations, you can add GPUs. Choose theappropriate machine type and accelerators to optimize latency, throughput, andcost.

To scale complex AI and Python applications and custom workloads acrossdistributed computing resources, you can useRay on Vertex AI.This feature can help optimize performance and enables seamless integration withGoogle Cloud services. Ray on Vertex AI simplifies distributedcomputing by handling cluster management, task scheduling, and data transfer. Itintegrates with other Vertex AI services like training,prediction, and pipelines. Ray provides fault tolerance and autoscaling, andhelps you adapt the infrastructure to changing workloads. It offers a unifiedframework for distributed training, hyperparameter tuning, reinforcementlearning, and model serving. Use Ray for distributed data preprocessing withDataflow orDataproc,accelerated model training, scalable hyperparameter tuning, reinforcementlearning, and parallelized batch prediction.

Contributors

Authors:

Charlotte Gistelinck, PhD | Partner Engineer
Sannya Dang | AI Solution Architect
Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist

Other contributors:

Gary Harmson | Principal Architect
Kumar Dhanagopal | Cross-Product Solution Developer
Marwan Al Shawi | Partner Customer Engineer
Ryan Cox | Principal Architect
Stef Ruinard | Generative AI Field Solutions Architect

AI and ML perspective: Security

This document in theWell-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AIand ML deployments meet the security and compliance requirements of yourorganization. The recommendations in this document align with thesecurity pillar of the Google Cloud Well-Architected Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularlyin enterprise environments. To meet this requirement, you need to adopt aholistic security approach that starts from the initial conceptualization ofyour AI and ML solutions and extends to development, deployment, and ongoingoperations. Google Cloud offers robust tools and services that are designed tohelp secure your AI and ML workloads.

The recommendations in this document are mapped to the following coreprinciples:

For more information about AI security, you can also review the followingresources:

Google Cloud's Secure AI Framework (SAIF) provides a comprehensive guide for building secure and responsible AIsystems. It outlines key principles and best practices for addressingsecurity and compliance considerations throughout the AI lifecycle.
To learn more about Google Cloud's approach to trust in AI, seeourcompliance resource center.

Define clear goals and requirements

Effective AI and ML security is a core component of your overarching businessstrategy. It's easier to integrate the required security and compliance controlsearly in your design and development process, instead of adding controls afterdevelopment.

From the start of your design and development process, make decisions that areappropriate for your specific risk environment and your specific businesspriorities. For example, overly restrictive security measures might protect databut also impede innovation and slow down development cycles. However, a lack ofsecurity can lead to data breaches, reputational damage, and financial losses,which are detrimental to business goals.

To define clear goals and requirements, consider the followingrecommendations.

Align AI and ML security with business goals

To align your AI and ML security efforts with your business goals, use astrategic approach that integrates security into every stage of the AIlifecycle. To follow this approach, do the following:

Define clear business objectives and security requirements:
- Identify key business goals: Define clear business objectives thatyour AI and ML initiatives are designed to achieve. For example, yourobjectives might be to improve customer experience, optimize operations, ordevelop new products.
- Translate goals into security requirements: When you clarify yourbusiness goals, define specific security requirements to support thosegoals. For example, your goal might be to use AI to personalize customerrecommendations. To support that goal, your security requirements might beto protect customer data privacy and prevent unauthorized access torecommendation algorithms.
Balance security with business needs:
- Conduct risk assessments: Identify potential security threats andvulnerabilities in your AI systems.
- Prioritize security measures: Base the priority of these securitymeasures upon their potential impact on your business goals.
- Analyze the costs and benefits: Ensure that you invest in the mosteffective solutions. Consider the costs and benefits of different securitymeasures.
- Shift left on security: Implement security best practices early inthe design phase, and adapt your safety measures as business needs changeand threats emerge.

Identify potential attack vectors and risks

Consider potential attack vectors that could affect your AI systems, such asdata poisoning, model inversion, or adversarial attacks. Continuously monitorand assess the evolving attack surface as your AI system develops, and keeptrack of new threats and vulnerabilities. Remember that changes in your AIsystems can also introduce changes to their attack surface.

To mitigate potential legal and reputational risks, you also need to addresscompliance requirements related to data privacy, algorithmic bias, and otherrelevant regulations.

To anticipate potential threats and vulnerabilities early and make designchoices that mitigate risks, adopt asecure by design approach.

Google Cloud provides a comprehensive suite of tools and services to helpyou implement a secure by design approach:

Cloud posture management:Use Security Command Center to identify potential vulnerabilities andmisconfigurations in your AI infrastructure.
Attack exposure scores and attack paths:Refine and use the attack exposure scores and attack paths thatSecurity Command Center generates.
Google Threat Intelligence:Stay informed about new threats and attack techniques that emerge to targetAI systems.
Logging and Monitoring:Track the performance and security of your AI systems, and detect anyanomalies or suspicious activities. Conduct regular security audits toidentify and address potential vulnerabilities in your AI infrastructureand models.
Vulnerability management:Implement a vulnerability management process to track and remediatesecurity vulnerabilities in your AI systems.

For more information, seeSecure by Design at Google andImplement security by design.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data securityhelps you to maintain user trust, support your business objectives, and meetyour compliance requirements.

To help keep your data secure, consider the following recommendations.

Adhere to data minimization principles

To ensure data privacy, adhere to the principle of data minimization. Tominimize data, don't collect, keep, or use data that's not strictly necessaryfor your business goals. Where possible, use synthetic or fully anonymized data.

Data collection can help drive business insights and analytics, but it's crucialto exercise discretion in the data collection process. If you collect personallyidentifiable information (PII) about your customer, reveal sensitive information,or create bias or controversy, then you might build biased ML models.

You can use Google Cloud features to help you improve data minimizationand data privacy for various use cases:

To de-identify your data and also preserve its utility, applytransformation methods like pseudonymization, de-identification, andgeneralization such as bucketing. To implement these methods, you can useSensitive Data Protection.
To enrich data and mitigate potential bias, you can use aVertex AI data labeling job. The data labelingprocess adds informative and meaningful tags to raw data, which transformsit into structured training data for ML models. Data labeling addsspecificity to the data and reduces ambiguity.
To help protect resources from prolonged access or manipulation, useCloud Storage features tocontrol data lifecycles.

For best practices about how to implement data encryption, seedata encryption at rest and in transit in the Well-Architected Framework.

Monitor data collection, storage, and transformation

Your AI application's training data poses the largest risks for theintroduction of bias and data leakage. To stay compliant and manage data acrossdifferent teams, establish a data governance layer to monitor data flows,transformations, and access. Maintain logs for data access and manipulationactivities. The logs help you audit data access, detect unauthorized accessattempts, and prevent unwanted access.

You can use Google Cloud features to help you implement data governancestrategies:

To establish an organization-wide or department-wide data governanceplatform, useDataplex Universal Catalog.A data governance platform can help you to centrally discover, manage,monitor, and govern data and AI artifacts across your data platforms. Thedata governance platform also provides access to trusted users. You canperform the following tasks with Dataplex Universal Catalog:
- Manage data lineage. BigQuery can also provide column-levellineage.
- Manage data quality checks and data profiles.
- Manage data discovery, exploration, and processing acrossdifferent data marts.
- Manage feature metadata and model artifacts.
- Create a business glossary to manage metadata and establish astandardized vocabulary.
- Enrich the metadata with context through aspects and aspect types.
- Unify data governance across BigLake and open-format tableslike Iceberg and Delta.
- Build adata mesh to decentralize data ownership among data owners from different teamsor domains. This practice adheres to data security principles and it canhelp improve data accessibility and operational efficiency.
- Inspect and send sensitive data results from BigQuery to Dataplex Universal Catalog.
To build a unified openlakehouse that is well-governed, integrate your data lakes and warehouses with managedmetastore services likeDataproc Metastore andBigLake metastore.An open lakehouse uses open table formats that are compatible withdifferent data processing engines.
To schedule the monitoring of features and feature groups, useVertex AI Feature Store.
To scan your Vertex AI datasets at the organization,folder, or project level, useSensitive data discovery for Vertex AI.You can alsoanalyze the data profiles that are stored in BigQuery.
To capture real-time logs and collect metrics related to data pipelines,useCloud Logging andCloud Monitoring.To collect audit trails of API calls, useCloud Audit Logs.Don't log PII or confidential data in experiments or in different logservers.

Implement role-based access controls with least privilege principles

Implement role-based access controls (RBAC) to assign different levels of accessbased on user roles. Users must have only the minimum permissions that arenecessary to let them perform their role activities. Assign permissions based ontheprinciple of least privilege so that users have only the access that they need, such as no-access, read-only,or write.

RBAC with least privilege is important for security when yourorganization uses sensitive data that resides in data lakes, in feature stores,or in hyperparameters for model training. This practice helps you to preventdata theft, preserve model integrity, and limit the surface area for accidentsor attacks.

To help you implement these access strategies, you can use the followingGoogle Cloud features:

To implement access granularity, consider the following options:
- Map the IAM roles of different products to auser, group, or service account to allow granular access. Map theseroles based on your project needs, access patterns, ortags.
- SetIAM policies with conditions to manage granular access to your data, model, and modelconfigurations, such as code, resource settings, and hyperparameters.
- Explore application-level granular access that helps you securesensitive data that you audit and share outside of your team.
  - Cloud Storage: Set IAM policies onbuckets and managed folders.
  - BigQuery: UseIAM roles and permissions for datasets and resources within datasets. Also, restrict access at therow-level andcolumn-level in BigQuery.
To limit access to certain resources, you can useprincipal access boundary (PAB) policies.You can also usePrivileged Access Manager to control just-in-time, temporary privilege elevation for selectprincipals. Later, you can view the audit logs for this Privileged Access Manageractivity.
To restrict access to resources based on the IP address and end userdevice attributes, you canextend Identity-Aware Proxy (IAP) access policies.
To create access patterns for different user groups, you can useVertex AI access control with IAM to combine the predefined or custom roles.
To protect Vertex AI Workbench instances by using context-aware accesscontrols, useAccess Context Manager andChrome Enterprise Premium.With this approach, access is evaluated each time a user authenticates tothe instance.

Implement security measures for data movement

Implement secure perimeters and other measures like encryption and restrictionson data movement. These measures help you to prevent data exfiltration and dataloss, which can cause financial losses, reputational damage, legal liabilities,and a disruption to business operations.

To help prevent data exfiltration and loss on Google Cloud, you can use acombination of security tools and services.

To implement encryption, consider the following:

To gain more control over encryption keys, use customer-managedencryption keys (CMEKs) in Cloud KMS. When you use CMEKs, thefollowing CMEK-integrated services encrypt data at rest for you:
To help protect your data in Cloud Storage, use server-sideencryption to store your CMEKs. If you manage CMEKs on your own servers,server-side encryption can help protect your CMEKs and associated data, evenif your CMEK storage system is compromised.
To encrypt data in transit, use HTTPS for all of your API calls to AI andML services. To enforce HTTPS for your applications and APIs, useHTTPS load balancers.

For more best practices about how to encrypt data, seeEncrypt data at rest and in transit in the security pillar of the Well-Architected Framework.

To implement perimeters, consider the following:

To create a security boundary around your AI and ML resources andprevent data exfiltration from your Virtual Private Cloud (VPC), useVPC Service Controls to define a service perimeter. Include your AI and ML resources andsensitive data in the perimeter. To control data flow, configure ingressand egress rules for your perimeter.
To restrict inbound and outbound traffic to your AI and ML resources,configure firewall rules.Implement policies that deny all traffic by default and explicitly allowonly the traffic that meets your criteria. For a policy example, seeExample: Deny all external connections except to specific ports.

To implement restrictions on data movement, consider the following:

To share data and to scale across privacy boundaries in a secureenvironment, useBigQuery sharing andBigQuery data clean rooms,which provide a robust security and privacy framework.
To share data directly into built-in destinations from businessintelligence dashboards, useLooker Action Hub,which provides a secure cloud environment.

Guard against data poisoning

Data poisoning is a type of cyberattack in which attackers inject maliciousdata into training datasets to manipulate model behavior or to degradeperformance. This cyberattack can be a serious threat to ML training systems.To protect the validity and quality of the data, maintain practices that guardyour data. This approach is crucial for consistent unbiasedness, reliability,and integrity of your model.

To track inconsistent behavior, transformation, or unexpected access to yourdata, set up comprehensivemonitoring and alerting for data pipelines and ML pipelines.

Google Cloud features can help you implement more protections against datapoisoning:

To validate data integrity, consider the following:
- Implement robust data validation checks before you use the datafor training. Verify data formats, ranges, and distributions. You canuse the automatic data quality capabilities inDataplex Universal Catalog.
- Use Sensitive Data Protection with Model Armor to takeadvantage of comprehensive data loss prevention capabilities. For moreinformation, seeModel Armor key concepts.Sensitive Data Protection with Model Armor lets youdiscover, classify, and protect sensitive data such as intellectualproperty. These capabilities can help you prevent the unauthorizedexposure of sensitive data in LLM interactions.
- To detect anomalies in your training data that might indicatedata poisoning, useanomaly detection in BigQuery with statistical methods or ML models.
To prepare for robust training, do the following:
- Employensemble methods to reduce the impact of poisoned datapoints. Train multiple models on different subsets of the data withhyperparameter tuning.
- Use data augmentation techniques to balance the distribution ofdata across datasets. This approach can reduce the impact of datapoisoning and lets you add adversarial examples.
To incorporate human review for training data or model outputs, do thefollowing:
- Analyze model evaluation metrics todetect potential biases, anomalies, or unexpected behavior that mightindicate data poisoning. For details, seeModel evaluation in Vertex AI.
- Take advantage of domain expertise to evaluate the model orapplication and identify suspicious patterns or data points thatautomated methods might not detect. For details, seeGen AI evaluation service overview.

For best practices about how to create data platforms that focus oninfrastructure and data security, see theImplement security by design principle in the Well-Architected Framework.

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Codethat isn't secured can be tampered with, which can lead to data leaks,compliance failure, and disruption of critical business activities. Keeping yourAI and ML code secure helps to ensure the integrity and value of your models andmodel outputs.

To keep AI code and pipelines secure, consider the following recommendations.

Use secure coding practices

To prevent vulnerabilities, use secure coding practices when you develop yourmodels. We recommend that you implement AI-specific input and outputvalidation, manage all of your software dependencies, and consistently embedsecure coding principles into your development. Embed security into every stageof the AI lifecycle, from data preprocessing to your final application code.

To implement rigorous validation, consider the following:

To prevent model manipulation or system exploits, validate and sanitizeinputs and outputs in your code.
- Use Model Armor or fine-tuned LLMs to automaticallyscreen prompts and responses for common risks.
- Implement data validation within your data ingestion andpreprocessing scripts for data types, formats, and ranges. ForVertex AI Pipelines or BigQuery, you can use Python to implement this data validation.
- Use coding assistant LLM agents, likeCodeMender,to improve code security. Keep a human in the loop to validate itsproposed changes.
To manage and secure your AI model API endpoints, useApigee,which includes configurable features like request validation, trafficcontrol, and authentication.
To help mitigate risk throughout the AI lifecycle, you can useAI Protection to do the following:
- Discover AI inventory in your environment.
- Assess the inventory for potential vulnerabilities.
- Secure AI assets with controls, policies, and protections.
- Manage AI systems with detection, investigation, and responsecapabilities.

To help secure the code and artifact dependencies in your CI/CD pipeline,consider the following:

To address the risks that open-source library dependencies canintroduce to your project, useArtifact Analysis withArtifact Registry to detect known vulnerabilities. Use and maintain the approved versions oflibraries. Store your custom ML packages and vetted dependencies in aprivate Artifact Registry repository.
To embed dependency scanning into your Cloud Build MLOps pipelines,useBinary Authorization.Enforce policies that allow deployments only if your code's containerimages pass the security checks.
To get security information about your software supply chain, usedashboards in the Google Cloud console that provide details about sources, builds,artifacts, deployments, and runtimes. This information includesvulnerabilities in build artifacts, build provenance, and Software Bill ofMaterials (SBOM) dependency lists.
To assess the maturity level of your software supply chain security, usetheSupply chain Levels for Software Artifacts (SLSA) framework.

To consistently embed secure coding principles into every stage of development,consider the following:

To prevent the exposure of sensitive data from model interactions, useLogging with Sensitive Data Protection. When you usethese products together, you can control what data your AI applications andpipeline components log, and hide sensitive data.
To implement the principle of least privilege, ensure that the serviceaccounts that you use for your Vertex AI custom jobs,pipelines, and deployed models have only the minimum requiredIAM permissions. For more information, seeImplement role-based access controls with least privilege principles.
To help secure and protect your pipelines and build artifacts,understand the security configurations (VPC andVPC Service Controls) in the environment your code runs in.

Protect pipelines and model artifacts from unauthorized access

Your model artifacts and pipelines are intellectual property, and theirtraining data also contains proprietary information. To protect model weights,files, and deployment configurations from tampering and vulnerabilities, storeand access these artifacts with improved security. Implement different accesslevels for each artifact based on user roles and needs.

To help secure your model artifacts, consider the following:

To protect model artifacts and other sensitive files, encrypt them withCloud KMS. This encryption helps to protect data at rest and intransit, even if the underlying storage becomes compromised.
To help secure access to your files, store them in Cloud Storageand configure access controls.
To track any incorrect or inadequate configurations and any drift fromyour defined standards, use Security Command Center to configuresecurity postures.
To enable fine-grained access control and encryption at rest, store yourmodel artifacts inVertex AI Model Registry. For additional security, createa digital signature for packages and containers that are produced duringthe approved build processes.
To benefit from Google Cloud's enterprise-grade security, use modelsthat are available in Model Garden. Model Gardenprovides Google's proprietary models and it offersthird-party models from featured partners.
To enforce central management for all user and group lifecycles and toenforce the principle of least privilege, use IAM.
- Create and use dedicated, least-privilege service accounts foryour MLOps pipelines. For example, a training pipeline's serviceaccount has the permissions to read data from only a specificCloud Storage bucket and to write model artifacts toModel Registry.
- Use IAM Conditionsto enforce conditional, attribute-based access control. For example, acondition allows a service account to trigger aVertex AI pipeline only if the request originates from atrusted Cloud Build trigger.

To help secure your deployment pipelines, consider the following:

To manage MLOps stages on Google Cloud services and resources,use Vertex AI Pipelines,which can integrate with other services and provide low-level accesscontrol. When you re-execute the pipelines, ensure that you performVertex Explainable AI andresponsible AI checks before you deploy the model artifacts. These checks can help youdetect or prevent the following security issues:
- Unauthorized changes, which can indicate model tampering.
- Cross-site scripting (XSS), which can indicate compromisedcontainer images or dependencies.
- Insecure endpoints, which can indicate misconfigured servinginfrastructure.
To help secure model interactions during inference, use privateendpoints based onPrivate Service Connect withprebuilt containers orcustom containers.Create model signatures with a predefined input and output schema.
To automate code change tracking, use Git for source code management,and integrate version control with robust CI/CD pipelines.

For more information, seeSecuring the AI Pipeline.

Enforce lineage and tracking

To help meet the regulatory compliance requirements that you might have, enforcelineage and tracking of your AI and ML assets. Data lineage and trackingprovides extensive change records for data, models, and code. Model provenanceprovides transparency and accountability throughout the AI and ML lifecycle.

To effectively enforce lineage and tracking in Google Cloud, consider thefollowing tools and services:

To track the lineage of models, datasets, and artifacts that areautomatically encrypted at rest, useVertex ML Metadata.Log metadata about data sources, transformations, model parameters, andexperiment results.
To track thelineage of pipeline artifacts from Vertex AI Pipelines, and to search for model and datasetresources, you can use Dataplex Universal Catalog. Track individual pipeline artifactswhen you want to perform debugging, troubleshooting, or a root cause analysis.To track your entire MLOps pipeline, which includes the lineage of pipelineartifacts, useVertex ML Metadata.Vertex ML Metadata also lets you analyze the resources and runs.Model Registry applies and manages the versions ofeach model that you store.
To track API calls and administrative actions, enableaudit logs for Vertex AI.Analyze audit logs with Log Analytics to understand who accessed or modified data andmodels, and when. You can alsoroute logs to third-party destinations.

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment. This environmentmust have a robust access control system and provide security assurances for thetools and artifacts that you deploy.

To deploy your code on secure systems, consider the followingrecommendations.

Train and deploy models in a secure environment

To maintain system integrity, confidentiality, and availability for your AI andML systems, implement stringent access controls that prevent unauthorizedresource manipulation. This defense helps you to do the following:

Mitigate model tampering that could produce unexpected or conflictingresults.
Protect your training data from privacy violations.
Maintain service uptime.
Maintain regulatory compliance.
Build user trust.

To train your ML models in an environment with improved security, use managedservices in Google Cloud like Cloud Run,GKE, and Dataproc. You can also useVertex AI serverless training.

This section provides recommendations to help you further help secure yourtraining and deployment environment.

To help secure your environment and perimeters, consider the following:

When youimplement security measures,as described earlier, consider the following:
- To isolate training environments and limit access, usededicated projects or VPCs for training.
- To protect sensitive data and code during execution, useShielded VMs orconfidential computing for training workloads.
- To help secure your network infrastructure and to control access to your deployed models, use VPCs, firewalls,and security perimeters.
When you use Vertex AI training, you can use thefollowing methods to help secure your compute infrastructure:
- To train custom jobs that privately communicate with otherauthorized Google Cloud services and that aren't exposed to publictraffic,set up a Private Service Connect interface.
- For increased network security and lower network latency thanwhat you get with a public IP address, use a private IP address toconnect to your training jobs. For details, seeUse a private IP for custom training.
When you use GKE or Cloud Run to setup a custom environment, consider the following options:
- To secure your GKE cluster, use the appropriate networkpolicies, pod security policies, and access controls. Usetrusted and verified container images for your training workloads. Toscan container images for vulnerabilities, use Artifact Analysis.
- To protect your environment fromcontainer escapes and other attacks, implementruntime security measures for Cloud Run functions. To further protect your environment,useGKE Sandbox andworkload isolation.
- To help secure yourGKE workloads, follow the best practices in theGKE security overview.
- To help meet your security requirements in Cloud Run, seethesecurity design overview}.
When you use Dataproc for model training, follow theDataproc security best practices.

To help secure your deployment, consider the following:

When you deploy models, use Model Registry.If you deploy models in containers, use GKE Sandbox andContainer-Optimized OS toenhance security and isolate workloads.Restrict access to models from Model Garden according to user roles andresponsibilities.
To help secure your model APIs, use Apigee orAPI Gateway.To prevent abuse, implement API keys, authentication,authorization, and rate limiting. To control access to model APIs, useAPI keys and authentication mechanisms.
To help secure access to models during prediction, useVertex AI Inference.To prevent data exfiltration,use VPC Service Controls perimeters to protectprivate endpoints and govern access to the underlying models. You use private endpointsto enable access to the models within a VPC network.IAM isn't directly applied to the private endpoint, but the target service usesIAM to manage access to the models. For online prediction, werecommend that you use Private Service Connect.
To track API calls that are related to model deployment, enableCloud Audit Logs for Vertex AI.Relevant API calls include activities such as endpoint creation, modeldeployment, and configuration updates.
To extend Google Cloud infrastructure to edge locations, considerGoogle Distributed Cloud solutions.For a fully disconnected solution, you can useDistributed Cloud air-gapped,which doesn't require connectivity to Google Cloud.
To help standardize deployments and to help ensure compliance withregulatory and security needs, useAssured Workloads.

Follow SLSA guidelines for AI artifacts

Follow the standardSupply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.

SLSA is a security framework that's designed to help you improve the integrityof software artifacts and help prevent tampering. When you adhere to the SLSAguidelines, you can enhance the security of your AI and ML pipeline and theartifacts that the pipeline produces. SLSA adherence can provide the followingbenefits:

Increased trust in your AI and ML artifacts: SLSA helps to ensurethat tampering doesn't occur to your models and software packages. Userscan also trace models and software packages back to their source, whichincreases users' confidence in the integrity and reliability of theartifacts.
Reduced risk of supply chain attacks: SLSA helps to mitigate therisk of attacks that exploit vulnerabilities in the software supply chain,like attacks that inject malicious code or that compromise build processes.
Enhanced security posture: SLSA helps you to strengthen the overallsecurity posture of your AI and ML systems. This implementation can helpreduce the risk of attacks and protect your valuable assets.

To implement SLSA for your AI and ML artifacts on Google Cloud, dothe following:

Understand SLSA levels: Familiarize yourself with the differentSLSA levels and their requirements. As the levels increase, the integrity that theyprovide also increases.
Assess your current level: Evaluate your current practices againstthe SLSA framework to determine your current level and to identify areas forimprovement.
Set your target level: Determine the appropriate SLSA level totarget based on your risk tolerance, security requirements, and thecriticality of your AI and ML systems.
Implement SLSA requirements: To meet your target SLSA level,implement the necessary controls and practices,which could include the following:
- Source control: Use a version control systemlike Git to track changes to your code and configurations.
- Build process: Use a service that helps to secure yourbuilds, likeCloud Build,and ensure that your build process is scripted or automated.
- Provenance generation: Generate provenance metadata thatcaptures details about how your artifacts were built, including thebuild process, source code, and dependencies. For details, seeTrack Vertex ML Metadata andTrack executions and artifacts.
- Artifact signing: Sign your artifacts to verify theirauthenticity and integrity.
- Vulnerability management: Scan your artifacts anddependencies for vulnerabilities on a regular basis. Use tools likeArtifact Analysis.
- Deployment security: Implement deployment practices thathelp to secure your systems, such as the practices that are described inthis document.
Continuous improvement: Monitor and improve your SLSA implementationto address new threats and vulnerabilities, and strive for higher SLSAlevels.

Use validated prebuilt container images

To prevent a single point of failure for your MLOps stages, isolate the tasksthat require different dependency management into different containers. Forexample, use separate containers for feature engineering, training orfine-tuning, and inference tasks. This approach also gives ML engineers theflexibility to control and customize their environment.

To promote MLOps consistency across your organization, use prebuilt containers.Maintain acentral repository of verified and trusted base platform images with the following bestpractices:

Maintain a centralized platform team in your organization that builds andmanages standardized base containers.
Extend the prebuilt container images that Vertex AIprovides specifically for AI and ML. Manage the container images in acentral repository within your organization.

Vertex AI provides a variety ofprebuilt containers for training and inference, and it also lets you usecustom containers.For smaller models, you can reduce latency for inference if youload models in containers.

To improve the security of your container management, consider the followingrecommendations:

UseArtifact Registry to create, store, and manage repositories of container images with differentformats. Artifact Registry handles access control with IAM,and it has integrated observability andvulnerability assessment features. Artifact Registry lets you enable container security features,scan container images, and investigate vulnerabilities.
Runcontinuous integration steps andbuild container images with Cloud Build. Dependency issues can be highlighted at thisstage. If you want to deploy only the images that are built byCloud Build, you can useBinary Authorization.To help prevent supply chain attacks,deploy the images built by Cloud Build in Artifact Registry.Integrate automated testing tools such as SonarQube, PyLint, or OWASP ZAP.
Use a container platform like GKE orCloud Run, which are optimized for GPU or TPU for AI and MLworkloads. Consider thevulnerability scanning options for containers in GKE clusters.

Consider Confidential Computing for GPUs

To protectdata in use,you can use Confidential Computing.Conventional security measures protect data at rest and in transit, butConfidential Computing encrypts data during processing. When you useConfidential Computing for GPUs, you help to protect sensitive trainingdata and model parameters from unauthorized access. You can also help to preventunauthorized access from privileged cloud users or potential attackers who mightgain access to the underlying infrastructure.

To determine whether you need Confidential Computing for GPUs, consider thesensitivity of the data, regulatory requirements, and potential risks.

If you set up Confidential Computing, consider the following options:

For general-purpose AI and ML workloads, useConfidential VM instances with NVIDIA T4 GPUs. These VM instances offer hardware-based encryption ofdata in use.
For containerized workloads, useConfidential GKE Nodes.These nodes provide a secure and isolated environment for your pods.
To ensure that your workload is running in a genuine and secure enclave,verify the attestation reports that Confidential VM provides.
To track performance, resource utilization, and security events, monitoryour Confidential Computing resources and yourConfidential GKE Nodes by using Monitoring andLogging.

Verify and protect inputs

Treat all of the inputs to your AI systems as untrusted, regardless of whetherthe inputs are from end users or other automated systems. To help keep your AIsystems secure and to ensure that they operate as intended, you must detect andsanitize potential attack vectors early.

To verify and protect your inputs, consider the following recommendations.

Implement practices that help secure generative AI systems

Treat prompts as a critical application component that has the same importanceto security as code does. Implement adefense-in-depthstrategy that combines proactive design, automated screening, and disciplinedlifecycle management.

To help secure your generative AI prompts, you must design them for security,screen them before use, and manage them throughout their lifecycle.

To improve the security of your prompt design and engineering, consider thefollowing practices:

Structure prompts for clarity: Design and test all of your promptsby usingVertex AI Studio prompt management capabilities.Prompts need to have a clear, unambiguous structure. Define a role,include few-shot examples,and give specific, bounded instructions. These methods reduce the risk thatthe model might misinterpret a user's input in a way that creates asecurity loophole.
Test the inputs for robustness and grounding: Test all of yoursystems proactively against unexpected, malformed, and malicious inputs inorder to prevent crashes or insecure outputs. Use red team testing tosimulate real-world attacks. As a standard step in yourVertex AI Pipelines, automate your robustness tests. You can usethe following testing techniques:
- Fuzz testing.
- Test directly against PII, sensitive inputs, and SQL injections.
- Scan multimodal inputs that can contain malware or violateprompt policies.
Implement a layered defense: Use multiple defenses and never relyon a single defensive measure. For example, for an application based onretrieval-augmented generation (RAG), use a separate LLM to classifyincoming user intent and check for malicious patterns. Then, that LLM canpass the request to the more-powerful primary LLM that generates the finalresponse.
Sanitize and validate inputs: Before you incorporate external input oruser-provided input into a prompt, filter and validate all of the input inyour application code. This validation is important to help you preventindirect prompt injection.

For automated prompt and response screening, consider the followingpractices:

Use comprehensive security services: Implement a dedicated,model-agnostic security service like Model Armoras a mandatory protection layer for your LLMs. Model Armorinspects prompts and responses for threats like prompt injection,jailbreak attempts,and harmful content. To help ensure that your models don't leaksensitive training data or intellectual property in their responses, usethe Sensitive Data Protection integrationwith Model Armor. For details, seeModel Armor filters.
Monitor and log interactions: Maintain detailed logs for all of theprompts and responses for your model endpoints. UseLogging to audit these interactions, identify patterns of misuse, and detect attackvectors that might emerge against your deployed models.

To help secure prompt lifecycle management, consider the following practices:

Implement versioning for prompts: Treat all of your productionprompts like application code. Use a version control system like Git tocreate a complete history of changes, enforce collaboration standards, andenable rollbacks to previous versions. This core MLOps practice can helpyou to maintain stable and secure AI systems.
Centralize prompt management: Use a central repository to store,manage, and deploy all of your versioned prompts. This strategy enforcesconsistency across environments and it enables runtime updates without theneed for a full application redeployment.
Conduct regular audits and red team testing: Test your system'sdefenses continuously against known vulnerabilities, such as those listedin theOWASP Top 10 for LLM Applications.As an AI engineer, you must be proactive and red-team test your ownapplication to discover and remediate weaknesses before an attacker canexploit them.

Prevent malicious queries to your AI systems

Along with authentication and authorization, which this document discussedearlier, you can take further measures to help secure your AI systems againstmalicious inputs. You need to prepare your AI systems forpost-authenticationscenarios in which attackers bypass both the authentication and authorizationprotocols, and then attempt to attack the system internally.

To implement a comprehensive strategy that can help protect your system frompost-authentication attacks, apply the following requirements:

Secure network and application layers: Establish a multi-layereddefense for all of your AI assets.
- To create a security perimeter that prevents data exfiltrationof models from Model Registry or of sensitive datafrom BigQuery, useVPC Service Controls.Always usedry run mode to validate the impact of a perimeter before you enforce it.
- To help protect web-based tools such as notebooks, useIAP.
- To help secure all of the inference endpoints, useApigee for enterprise-grade security and governance. You can also useAPI Gateway for straightforward authentication.
Watch for query pattern anomalies: For example, an attacker thatprobes a system for vulnerabilities might send thousands of slightlydifferent, sequential queries. Flag abnormal query patterns that don'treflect normal user behavior.
Monitor the volume of requests: A sudden spike in query volumestrongly indicates a denial-of-service (DoS) attack or a model theftattack, which is an attempt to reverse-engineer the model. Use ratelimiting and throttling to control the volume of requests from a single IPaddress or user.
Monitor and set alerts for geographic and temporal anomalies:Establish a baseline for normal access patterns. Generate alerts for suddenactivity from unusual geographic locations or at odd hours. For example, amassive spike in logins from a new country at 3 AM.

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize,or automate human decision-making. To maintain the integrity and trustworthinessof your AI systems and applications, ensure that the outputs are secure andwithin the expected parameters. You also need a plan to respond to incidents.

To maintain your outputs, consider the following recommendations.

Evaluate model performance with metrics and security measures

To ensure that your AI models meet performance benchmarks, meet securityrequirements, and adhere to fairness and compliance standards, thoroughlyevaluate the models. Conduct evaluations before deployment, and then continue toevaluate the models in production on a regular basis. To minimize risks andbuild trustworthy AI systems, implement a comprehensive evaluation strategy thatcombines performance metrics with specific AI security assessments.

To evaluate model robustness and security posture, consider the followingrecommendations:

Implement model signing and verification in your MLOps pipeline.
- For containerized models, useBinary Authorization to verify signatures.
- For models that are deployed directly to Vertex AIendpoints, use custom checks in your deployment scripts for verification.
- For any model, use Cloud Build formodel signing.
Assess your model's resilience to unexpected or adversarial inputs.
- For all of your models, test your model for common datacorruptions and any potentially malicious data modifications. Toorchestrate these tests, you can use Vertex AI trainingor Vertex AI Pipelines.
- For security-critical models, conduct adversarial attacksimulations to understand the potential vulnerabilities.
- For models that are deployed in containers, useArtifact Analysis in Artifact Registry to scan the baseimages for vulnerabilities.
Use Vertex AI Model Monitoring to detect drift and skew fordeployed models. Then, feed these insights back into the re-evaluation orretraining cycles.
Use model evaluations from Vertex AI as a pipelinecomponent with Vertex AI Pipelines. You can run the modelevaluation component by itself or with other pipeline components. Comparethe model versions against your defined metrics and datasets. Log theevaluation results to Vertex ML Metadata for lineage andtracking.
Use or build upon theGen AI evaluation service to evaluate your chosen models or to implement custom human-evaluationworkflows.

To assess fairness, bias, explainability, and factuality, consider thefollowing recommendations:

Define fairness measures that match your use cases, and then evaluate your models for potentialbiases across different data slices.
Understand which features drive model predictions in order to ensurethat the features, and the predictions that result, align with domainknowledge and ethical guidelines.
UseVertex Explainable AI to get feature attributions for your models.
Use the Gen AI evaluation service to compute metrics. During the sourceverification phase of testing, the service's grounding metric checks forfactuality against the source text that's provided.
Enablegrounding for your model's output in order to facilitate a second layer of sourceverification at the user level.
Review ourAI principles and adapt them for your AI applications.

Monitor AI and ML model outputs in production

Continuously monitor your AI and ML models and their supporting infrastructurein production. It's important to promptly identify and diagnose degradations inmodel output quality or performance, security vulnerabilities that emerge, anddeviations from compliance mandates. This monitoring helps you sustain systemsafety, reliability, and trustworthiness.

To monitor AI system outputs for anomalies, threats, and quality degradation,consider the following recommendations:

Use Model Monitoring for your model outputs to trackunexpected shifts in prediction distributions or spikes in low-confidencemodel predictions. Actively monitor your generative AI model outputs forgenerated content that's unsafe, biased, off-topic, or malicious. You canalso use Model Armor to screen all of your model outputs.
Identify specific error patterns, capture quality indicators, or detectharmful or non-compliant outputs at the application level. To find theseissues, use custom monitoring in Monitoring dashboards anduse log-based metrics from Logging.

To monitor outputs for security-specific signals and unauthorized changes,consider the following recommendations:

Identify unauthorized access attempts to AI models, datasets inCloud Storage or BigQuery, or MLOps pipelinecomponents. In particular, identify unexpected or unauthorized changes inIAM permissions for AI resources. To track these activitiesand review them for suspicious patterns, use the Admin Activity audit logsand Data Access audit logs in Cloud Audit Logs.Integrate the findings from Security Command Center, which can flag securitymisconfigurations and flag potential threats that are relevant to your AIassets.
Monitor outputs for high volumes of requests or requests from suspicioussources, which might indicate attempts to reverse engineer models orexfiltrate data. You can also use Sensitive Data Protection tomonitor for the exfiltration of potentially sensitive data.
Integrate logs into your security operations. Use Google Security Operationsto help you detect, orchestrate, and respond to any cyber threats from yourAI systems.

To track the operational health and performance of the infrastructure thatserves your AI models, consider the following recommendations:

Identify operational issues that can impact service delivery or modelperformance.
Monitor Vertex AI endpoints for latency, error rates, andtraffic patterns.
Monitor MLOps pipelines for execution status and errors.
Use Monitoring, which provides ready-made metrics. You canalso create custom dashboards to help you identify issues like endpointoutages or pipeline failures.

Implement alerting and incident response procedures

When you identify any potential performance, security, or compliance issues, aneffective response is critical. To ensure timely notifications to theappropriate teams, implement robust alerting mechanisms. Establish andoperationalize comprehensive, AI-aware incident response procedures to manage,contain, and remediate these issues efficiently.

To establish robust alerting mechanisms for AI issues that you identify,consider the following recommendations:

Configure actionable alerts to notify the relevant teams, based on themonitoring activities of your platform. For example, configure alerts totrigger when Model Monitoring detects significantdrift, skew, or prediction anomalies. Or, configure alerts to trigger whenModel Armor or custom Monitoring rules flagmalicious inputs or unsafe outputs.
Define clear notification channels, which can include Slack, email, orSMS through Pub/Sub integrations. Customize the notificationchannels for your alert severities and the responsible teams.

Develop and operationalize an AI-aware incident response plan. A structuredincident response plan is vital to minimize any potential impacts and ensurerecovery. Customize this plan to address AI-specific risks such as modeltampering, incorrect predictions due to drift, prompt injection, or unsafeoutputs from generative models. To create an effective plan, include thefollowing key phases:

Preparation: Identify assets and their vulnerabilities, developplaybooks, and ensure that your teams have appropriate privileges. Thisphase includes the following tasks:
- Identify critical AI assets, such as models, datasets, andspecific Vertex AI resources like endpoints orVertex AI Feature Store instances.
- Identify the assets' potential failure modes or attack vectors.
- Develop AI-specific playbooks for incidents that match yourorganization's threat model. For example, playbooks might include thefollowing:
  - A model rollback that uses versioning inModel Registry.
  - An emergency retraining pipeline onVertex AI training.
  - The isolation of a compromised data source inBigQuery or Cloud Storage.
- Use IAM to ensure that response teams have the necessaryleast-privilege access to tools that are required during an incident.
Identification and triage: Use configured alerts to detect andvalidate potential incidents. Establish clear criteria and thresholds forhow your organization investigates or declares an AI-related incident. Fordetailed investigation and evidence collection, use Loggingfor application logs and service logs, and use Cloud Audit Logs foradministrative activities and data access patterns. Security teams can useGoogle SecOps for deeper analyses of security telemetry.
Containment: Isolate affected AI systems or components to preventfurther impact or data exfiltration. This phase might include the followingtasks:
- Disable a problematic Vertex AI endpoint.
- Revoke specific IAM permissions.
- Update firewall rules or Cloud Armor policies.
- Pause a Vertex AI pipeline that's misbehaving.
Eradication: Identify and remove the root cause of the incident.This phase might include the following tasks:
- Patch the vulnerable code in a custom model container.
- Remove the identified malicious backdoors from a model.
- Sanitize the poisoned data before you initiate a secureretraining job on Vertex AI training.
- Update any insecure configurations.
- Refine the input validation logic to block specific prompt-injectiontechniques.
Recovery and secure redeployment: Restore the affected AI systems toa known good and secure operational state. This phase might include thefollowing tasks:
- Deploy a previously validated and trusted model version fromModel Registry.
- Ensure that you find and apply all of the security patches forvulnerabilities that might be present in your code or system.
- Reset the IAM permissions to the principle ofleast privilege.
Post-incident activity and lessons learned: After you resolve thesignificant AI incidents, conduct a thorough post-incident review. Thisreview involves all of the relevant teams, such as the AI and ML, MLOps,security, and data science teams. Understand the full lifecycle of theincident. Use these insights to refine the AI system design, updatesecurity controls, improve Monitoring configurations, andenhance the AI incident response plan and playbooks.

Integrate the AI incident response with the broader organizational frameworks,such as IT and security incident management, for a coordinated effort. To alignyour AI-specific incident response with your organizational frameworks, considerthe following:

Escalation: Define clear paths for how you escalate significant AIincidents to central SOC, IT, legal, or relevant business units.
Communication: Use established organizational channels for allinternal and external incident reports and updates.
Tooling and processes: Use existing enterprise incident managementand ticketing systems for AI incidents to ensure consistent tracking andvisibility.
Collaboration: Pre-define collaboration protocols between AI and ML,MLOps, data science, security, legal, and compliance teams for effective AIincident responses.

Contributors

Authors:

Kamilla Kurta | GenAI/ML Specialist Customer Engineer
Vidhi Jain | Cloud Engineer, Analytics and AI
Mohamed Fawzi | Benelux Security and Compliance Lead
Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist

Other contributors:

Lauren Anthony | Customer Engineer, Security Specialist
Daniel Lees | Cloud Security Architect
John Bacon | Partner Solutions Architect
Kumar Dhanagopal | Cross-Product Solution Developer
Marwan Al Shawi | Partner Customer Engineer
Mónica Carranza | Senior Generative AI Threat Analyst
Tarun Sharma | Principal Architect
Wade Holmes | Global Solutions Director

AI and ML perspective: Reliability

This document in theGoogle Cloud Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operatereliable AI and ML systems on Google Cloud. It explores how to integrateadvanced reliability practices and observability into your architecturalblueprints. The recommendations in this document align with thereliability pillar of the Google Cloud Well-Architected Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential inorder to ensure customer satisfaction and achieve business goals. To meet theunique demands of both predictive ML and generative AI, you need AI and MLsystems that are robust, reliable, and adaptable. To handle the complexities ofMLOps—fromdevelopment to deployment and continuous improvement—you need to use areliability-first approach. Google Cloud offers a purpose-built AIinfrastructure that's aligned withsite reliability engineering (SRE) principles and that provides a powerful foundation for reliable AI and MLsystems.

The recommendations in this document are mapped to the following coreprinciples:

Ensure that ML infrastructure is scalable and highly available

Reliable AI and ML systems in the cloud require scalable and highlyavailable infrastructure. These systems have dynamic demands, diverse resourceneeds, and critical dependencies on model availability. Scalable architecturesadapt to fluctuating loads and variations in data volume or inference requests.High availability (HA) helps to ensure resilience against failures at thecomponent, zone, or region level.

To build scalable and highly available ML infrastructure, consider the followingrecommendations.

Implement automatic and dynamic scaling capabilities

AI and ML workloads are dynamic, with demand that fluctuates based on dataarrival rates, training frequency, and the volume of inference traffic.Automatic and dynamic scaling adapts infrastructure resources seamlessly todemand fluctuations. Scaling your workloads effectively helps to preventdowntime, maintain performance, and optimize costs.

To autoscale your AI and ML workloads, use the following products and featuresin Google Cloud:

Data processing pipelines: Create data pipelines inDataflow.Configure the pipelines to useDataflow's horizontal autoscaling feature, which dynamically adjusts the number of worker instances based onCPU utilization, pipeline parallelism, and pending data. You can configureautoscaling parameters through pipeline options when you launch jobs.
Training jobs: Automate the scaling of training jobs by usingVertex AI custom training.You can define worker pool specifications such as the machine type, thetype and number of accelerators, and the number of worker pools. For jobsthat can tolerate interruptions and for jobs where the training codeimplements checkpointing, you can reduce costs by usingSpot VMs.
Online inference: For online inference, useVertex AI endpoints.To enable autoscaling, configure the minimum and maximum replica count.Specify a minimum of two replicas for HA. Vertex AIautomatically adjusts the number of replicas based on traffic and theconfigured autoscaling metrics, such as CPU utilization and replicautilization.
Containerized workloads inGoogle Kubernetes Engine:Configure autoscaling at the node and Pod levels. Configure thecluster autoscaler andnode auto-provisioning to adjust the node count based on pending Pod resourcerequests like CPU, memory, GPU, and TPU. UseHorizontal Pod Autoscaler (HPA) for deployments to define scaling policies based on metrics like CPU andmemory utilization. You can also scale based on custom AI and ML metrics,such as GPU or TPU utilization and prediction requests per second.
Serverless containerized services: Deploy the services inCloud Run and configure autoscaling by specifying the minimum and maximum number ofcontainer instances. Usebest practices to autoscale GPU-enabled instances by specifying the accelerator type.Cloud Run automatically scales instances between the configuredminimum and maximum limits based on incoming requests. When there are norequests, it scales efficiently to zero instances. You can leverage theautomatic, request-driven scaling of Cloud Run to deployVertex AI agents and to deploy third-party workloads likequantized models using Ollama,LLM model inference using vLLM,andHuggingface Text Generation Inference (TGI).

Design for HA and fault tolerance

For production-grade AI and ML workloads, it's crucial that you ensurecontinuous operation and resilience against failures. To implement HA and faulttolerance, you need to build redundancy and replication into your architectureon Google Cloud. This approach helps to ensure that a failure of anindividual component doesn't cause a failure of the complete system.

For HA and low latency in model serving, particularly for real-timeinference and generative AI models, distribute your deployments acrossmultiple locations.
For global availability and resilience, deploy the models to multipleVertex AI endpoints across Google Cloud regions or use theglobal endpoint.
Use global load balancing to route traffic.
For training on GKE or Compute Engine MIGs,implement monitoring forXid errors.When you identify Xid errors, take appropriate remedial action. Forexample,reset GPUs,reset Compute Engine instances,or trigger hardware replacement by using thegcloud CLI report faulty host command.
Explore fault-tolerant or elastic and resilient training solutions likerecipes to use the Google Resiliency Library or integration of theResilient training with Pathways logic for TPU workloads.

Implement redundancy for critical AI and ML components in Google Cloud.The following are examples of products and features that let you implementresource redundancy:

DeployGKE regional clusters across multiple zones.
Ensure data redundancy for datasets and checkpoints by usingCloud Storage multi-regional or dual-region buckets.
Use Spanner for globally consistent, highly available storageof metadata.
ConfigureCloud SQL read replicas for operational databases.
Ensure thatvector databases for retrieval augmented generation (RAG) are highly available andmulti-zonal or multi-regional.

Manage resources proactively and anticipate requirements

Effective resource management is important to help you optimize costs,performance, and reliability. AI and ML workloads are dynamic and there's highdemand for specialized hardware like GPUs and TPUs. Therefore, it's crucial thatyou apply proactive resource management and ensure resource availability.

Plan for capacity based on historical monitoring data, such as GPU or TPUutilization and throughput rates, fromCloud Monitoring and logs inCloud Logging.Analyze this telemetry data by usingBigQuery orLooker Studio and forecast future demand for GPUs based on growth or new models. Analysis ofresource usage patterns and trends helps you to predict when and where you needcritical specialized accelerators.

Validate capacity estimates through rigorous load testing. Simulatetraffic on AI and ML services like serving and pipelines by using toolslikeApache JMeter orLoadView.
Analyze system behavior under stress.
- To anticipate and meet increased workload demands inproduction, proactively identify resource requirements. Monitorlatency, throughput, errors, and resource utilization, especially GPUand TPU utilization. Increase resource quotas as necessary.
- For generative AI serving, test under high concurrent loads andidentify the level at which accelerator availability limits performance.
Perform continuous monitoring for model queries and set up proactivealerts for agents.
- Use themodel observability dashboard to view metrics that are collected by Cloud Monitoring, such asmodel queries per second (QPS), token throughput, and first tokenlatencies.

Optimize resource availability and obtainability

Optimize costs and ensure resource availability by strategically selectingappropriate compute resources based on workload requirements.

For stable 24x7 inference or for training workloads with fixed orpredictable capacity requirements, usecommitted use discounts (CUDs) for VMs and accelerators.
For GKE nodes and Compute Engine VMs, useSpot VMs and Dynamic Workload Scheduler (DWS) capabilities:
- For fault-tolerant tasks such as evaluation and experimentationworkloads, useSpot VMs.Spot VMs can be preempted, but they can help reduce youroverall costs.
- To manage preemption risk for high-demand accelerators, you canensure better obtainability by usingDWS.
  - For complex batch training that needs high-end GPUs torun up to seven days, use the DWS Flex-Start mode.
  - For longer running workloads that run up to threemonths, use the Calendar mode to reserve specific GPUs (H100 andH200) and TPUs (Trillium).
To optimize AI inference on GKE, you can run avLLM engine that dynamically uses TPUs and GPUs to address fluctuating capacityand performance needs. For more information, seevLLM GPU/TPU Fungibility.
For advanced scenarios with complex resource and topology needsthat involve accelerators, use tools to abstract resource management.
- Cluster Director lets you deploy and manage accelerator groups with colocation andscheduling for multi-GPU training (A3 Ultra H200 and A4 B200). ClusterDirector supports GKE and Slurm clusters.
- Ray on Vertex AI abstracts distributed computing infrastructure. It enables applicationsto request resources for training and serving without the need fordirect management of VMs and containers.

Distribute incoming traffic across multiple instances

Effective load balancing is crucial for AI applications that have fluctuatingdemands. Load balancing distributes traffic, optimizes resource utilization,provides HA and low latency, and helps to ensure a seamless user experience.

Inference with varying resource needs: Implement load balancingbased on model metrics.GKE Inference Gateway lets you deploy models behind a load balancer with model-aware routing. Thegateway prioritizes instances with GPU and TPU accelerators forcompute-intensive tasks like generative AI and LLM inference. Configuredetailed health checks to assess model status. Use serving frameworks likevLLM or Triton for LLM metrics and integrate the metrics intoCloud Monitoring by usingGoogle Cloud Managed Service for Prometheus.
Inference workloads that need GPUs or TPUs: To ensure that criticalAI and ML inference workloads consistently run on machines that aresuitable to the workloads' requirements, particularly when GPU and TPUavailability is constrained, useGKE custom compute classes.You can define specific computeprofiles with fallback policies forautoscaling. For example, you can define a profile that specifies a higherpriority for reserved GPU or TPU instances. The profile can include afallback to use cost-efficient Spot VMs if the reservedresources are temporarily unavailable.
Generative AI on diverse orchestration platforms: Use a centralizedload balancer. For example, for cost and management efficiency, you canroute requests that have low GPU needs to Cloud Runand route more complex, GPU-intensive tasks to GKE.For inter-service communication and policy management, implement a servicemesh by usingCloud Service Mesh.Ensure consistent logging and monitoring by using Cloud Logging andCloud Monitoring.
Global load distribution: To load balance traffic from global userswho need low latency, use aglobal external Application Load Balancer.Configure geolocation routing to the closest region and implement failover.Establish regional endpoint replication in Vertex AI orGKE. Configure Cloud CDN for static assets.Monitor global traffic and latency by using Cloud Monitoring.
Granular traffic management: For requests that have diverse datatypes or complexity and long-running requests, implement granular trafficmanagement.
- Configurecontent-based routing to direct requests to specialized backends based on attributes like URLpaths and headers. For example, direct requests to GPU-enabled backendsfor image or video models and to CPU-optimized backends for text-basedmodels.
- For long-running generative AI requests or batch workloads, useWebSockets or gRPC. Implement traffic management to handle timeouts andbuffering. Configure request timeouts and retries and implement ratelimiting and quotas by using API Gateway orApigee.

Use a modular and loosely coupled architecture

In a modular, loosely coupled AI and ML architecture, complex systems aredivided into smaller, self-contained components that interact throughwell-defined interfaces. This architecture minimizes module dependencies,simplifies development and testing, enhances reproducibility, and improves faulttolerance by containing failures. The modular approach is crucial for managingcomplexity, accelerating innovation, and ensuring long-term maintainability.

To design a modular and loosely coupled architecture for AI and ML workloads,consider the following recommendations.

Implement small self-contained modules or components

Separate your end-to-end AI and ML system into small, self-contained modules orcomponents. Each module or component is responsible for a specificfunction, such as data ingestion, feature transformation, model training,inference serving, or evaluation. A modular design provides several key benefitsfor AI and ML systems: improved maintainability, increased scalability,reusability, and greater flexibility and agility.

The following sections describe Google Cloud products, features, and toolsthat you can use to design a modular architecture for your AI and ML systems.

Containerized microservices on GKE

For complex AI and ML systems or intricate generative AI pipelines that needfine-grained orchestration, implement modules as microservices that areorchestrated by using GKE. Package each distinct stage as anindividual microservice within Docker containers. These distinct stages includedata ingestion that's tailored for diverse formats, specialized datapreprocessing or feature engineering, distributed model training or fine tuningof large foundation models, evaluation, or serving.

Deploy the containerized microservices on GKE and leverageautomated scaling based on CPU and memory utilization or custom metrics like GPUutilization, rolling updates, and reproducible configurations in YAML manifests.Ensure efficient communication between the microservices by usingGKE service discovery. For asynchronous patterns, usemessage queues likePub/Sub.

The microservices-on-GKE approach helps you build scalable,resilient platforms for tasks like complex RAG applications where the stages canbe designed as distinct services.

Serverless event-driven services

For event-driven tasks that can benefit from serverless, automatic scaling, useCloud Run orCloud Run functions.These services are ideal for asynchronous tasks like preprocessing or forsmaller inference jobs. Trigger Cloud Run functions on events, such asa new data file that's created in Cloud Storage or model updates inArtifact Registry. For web-hook tasks or services that need a container environment,use Cloud Run.

Cloud Run services and Cloud Run functions canscale up rapidly and scale down to zero, which helps to ensure cost efficiencyfor fluctuating workloads. These services are suitable for modular components inVertex AI Agents workflows. You can orchestrate componentsequences withWorkflows or Application Integration.

Vertex AI managed services

Vertex AI services support modularity and help you simplify thedevelopment and deployment of your AI and ML systems. The services abstract theinfrastructure complexities so that you can focus on the application logic.

To orchestrate workflows that are built from modular steps, useVertex AI Pipelines.
To run custom AI and ML code, package the code in Docker containers that canrun on managed services like Vertex AI custom training andVertex AI prediction.
For modular feature engineering pipelines, useVertex AI Feature Store.
For modular exploration and prototyping, use notebook environments likeVertex AI Workbench orColab Enterprise.Organize your code into reusable functions, classes, and scripts.

Agentic applications

For AI agents,Agent Development Kit (ADK) provides modular capabilities likeTools andState.To enable interoperability between frameworks likeLangChain,LangGraph,LlamaIndex,and Vertex AI, you can combine ADK with theAgent2Agent (A2A) protocol and theModel Context Protocol (MCP).This interoperability lets you compose agentic workflows by using diversecomponents.

You can deploy agents on Vertex AI Agent Engine, which is a managedruntime that's optimized for scalable agent deployment. To run containerizedagents, you can leverage the autoscaling capabilities inCloud Run.

Design well-defined interfaces

To build robust and maintainable software systems, it's crucial to ensure thatthe components of a system are loosely coupled and modularized. This approachoffers significant advantages, because it minimizes the dependencies betweendifferent parts of the system. When modules are loosely coupled, changes in onemodule have minimal impact on other modules. This isolation enables independentupdates and development workflows for individual modules.

The following sections provide guidance to help ensure seamless communicationand integration between the modules of your AI and ML systems.

Protocol choice

For universal access, use HTTP APIs, adhere to RESTful principles, anduse JSON for language-agnostic data exchange. Design the API endpoints torepresent actions on resources.
For high-performance internal communication among microservices, usegRPC with Protocol Buffers (ProtoBuf) for efficient serialization and strict typing. Define data structures likeModelInput, PredictionResult, or ADK Tool data by using.proto files, andthen generate language bindings.
For use cases where performance is critical, leverage gRPC streaming forlarge datasets or for continuous flows such as live text-to-speech or videoapplications. Deploy the gRPC services on GKE.

Standardized and comprehensive documentation

Regardless of the interface protocol that you choose, standardizeddocumentation is crucial. TheOpenAPI Specification describes RESTful APIs. Use OpenAPI to document your AI and ML APIs: paths,methods, parameters, request-response formats that are linked to JSON schemas,and security. Comprehensive API documentation helps to improve discoverabilityand client integration. For API authoring and visualization, use UI tools likeSwagger Editor.To accelerate development and ensure consistency, you can generate client SDKsand server stubs by using AI-assisted coding tools likeGemini Code Assist.Integrate OpenAPI documentation into your CI/CD flow.

Interaction with Google Cloud managed services like Vertex AI

Choose between the higher abstraction of the Vertex AI SDK,which is preferred for development productivity, and the granular control thatthe REST API provides.

The Vertex AI SDK simplifies tasks and authentication.Use the SDK when you need to interact withVertex AI.
The REST API is a powerful alternative especially when interoperabilityis required between layers of your system. It's useful for tools inlanguages that don't have an SDK or when you need fine-grained control.

Use APIs to isolate modules and abstract implementation details

For security, scalability, and visibility, it's crucial that you implementrobust API management for your AI and ML services. To implement API managementfor your defined interfaces, use the following products:

API Gateway:For APIs that are externally exposed and managed, API Gatewayprovides a centralized, secure entry point. It simplifies access toserverless backend services, such as prediction, training, and data APIs.API Gateway helps to consolidate access points, enforce APIcontracts, and manage security capabilities like API keys andOAuth 2.0. To protect backends from overload and ensure reliability,implement rate limiting and usage quotas in API Gateway.
Cloud Endpoints:To streamline API development and deployment on GKEand Cloud Run, use Cloud Endpoints, which offers adeveloper-friendly solution for generating API keys. It alsoprovides integrated monitoring and tracing for API calls and it automatesthe generation of OpenAPI specs, which simplifies documentation and clientintegration. You can use Cloud Endpoints to manage access to internal orcontrolled AI and ML APIs, such as to trigger training and manage featurestores.
Apigee:For enterprise-scale AI and ML, especially sophisticated generative AIAPIs, Apigee provides advanced, comprehensive APImanagement. Use Apigee for advanced security like threatprotection and OAuth 2.0, for traffic management like caching, quotas, andmediation, and for analytics. Apigee can help you to gaindeep insights into API usage patterns, performance, and engagement, whichare crucial for understanding generative AI API usage.

Plan for graceful degradation

In production AI and ML systems, component failures are unavoidable, just likein other systems. Graceful degradation ensures that essential functions continueto operate, potentially with reduced performance. This approach preventscomplete outages and improves overall availability. Graceful degradation iscritical for latency-sensitive inference, distributed training, and generativeAI.

The following sections describe techniques that you use to plan and implementgraceful degradation.

Fault isolation

To isolate faulty components in distributed architectures, implementthecircuit breaker pattern by using resilience libraries, such asResilience4j in Java andCircuitBreaker in Python.
To prevent cascading failures, configure thresholds based on AI and MLworkload metrics like error rates and latency and define fallbacks likesimpler models and cached data.

Component redundancy

For critical components, implement redundancy and automatic failover. Forexample, use GKE multi-zone clusters or regional clustersand deploy Cloud Run services redundantly across differentregions. To route traffic to healthy instances when unhealthy instances aredetected, use Cloud Load Balancing.

Ensure data redundancy by using Cloud Storage multi-regional buckets.For distributed training, implement asynchronous checkpointing to resume afterfailures. For resilient and elastic training, usePathways.

Proactive monitoring

Graceful degradation helps to ensure system availability during failure, butyou must also implement proactive measures for continuous health checks andcomprehensive monitoring. Collect metrics that are specific to AI and ML, suchas latency, throughput, and GPU utilization. Also, collect model performancedegradation metrics like model and data drift by using Cloud Monitoring andVertex AI Model Monitoring.

Health checks can trigger the need to replace faulty nodes, deploy morecapacity, or automatically trigger continuous retraining or fine-tuning ofpipelines that use updated data. This proactive approach helps to prevent bothaccuracy-based degradation and system-level graceful degradation and it helps toenhance overall reliability.

SRE practices

To monitor the health of your systems, consider adopting SRE practices toimplementservice level objectives (SLOs).Alerts on error budget loss and burn rate can be early indicators of reliabilityproblems with the system. For more information about SRE practices, see theGoogle SRE book.

Build an automated end-to-end MLOps platform

A robust, scalable, and reliable AI and ML system on Google Cloudrequires an automated end-to-end MLOps platform for the model developmentlifecycle. The development lifecycle includes initial data handling, continuousmodel training, deployment, and monitoring in production. By automating thesestages on Google Cloud, you establish repeatable processes, reduce manualtoil, minimize errors, and accelerate the pace of innovation.

An automated MLOps platform is essential for establishing production-gradereliability for your applications. Automation helps to ensure modelquality, guarantee reproducibility, and enable continuous integration anddelivery of AI and ML artifacts.

To build an automated end-to-end MLOps platform, consider the followingrecommendations.

Automate the model development lifecycle

A core element of an automated MLOps platform is the orchestration of theentire AI and ML workflow as a series of connected, automated steps: from datapreparation and validation to model training, evaluation, deployment, andmonitoring.

UseVertex AI Pipelines as your central orchestrator:
- Define end-to-end workflows with modular components for dataprocessing, training, evaluation, and deployment.
- Automate pipeline runs by using schedules or triggers like newdata or code changes.
- Implement automated parameterization and versioning for eachpipeline run and create a version history.
- Monitor pipeline progress and resource usage by using built-inlogging and tracing, and integrate withCloud Monitoring alerts.
Define your ML pipelines programmatically by using theKubeflow Pipelines (KFP) SDK or TensorFlow Extended SDK. For more information, seeInterfaces for Vertex AI Pipelines.
Orchestrate operations by using Google Cloud services likeDataflow,Vertex AI custom training,Vertex AI Model Registry,and Vertex AI endpoints.
For generative AI workflows, orchestrate the steps for promptmanagement, batched inference, human-in-the-loop (HITL) evaluation, andcoordinating ADK components.

Manage infrastructure as code

Infrastructure as code (IaC) is crucial for managing AI and ML systeminfrastructure and for enabling reproducible, scalable, and maintainabledeployments. The infrastructure needs of AI and ML systems are dynamic andcomplex. The systems often require specialized hardware like GPUs and TPUs. IaChelps to mitigate the risks of manual infrastructure management by ensuringconsistency, enabling rollbacks, and making deployments repeatable.

To effectively manage your infrastructure resources as code, use the followingtechniques.

Automate resource provisioning

To effectively manage IaC on Google Cloud, define and provision your AIand ML infrastructure resources by usingTerraform.The infrastructure might include resources such as the following:

GKE clusters that are configured with node pools. The node pools can beoptimized based on workload requirements. For example, you can use A100,H100, H200, or B200 GPUs for training, and use L4 GPUs for inference.
Vertex AI endpoints that are configured for model serving, with defined machine types andscaling policies.
Cloud Storage buckets for data and artifacts.

Use configuration templates

Organize your Terraform configurations as modular templates. To accelerate theprovisioning of AI and ML resources, you can useCluster Toolkit.The toolkit providesexample blueprints,which are Google-curated Terraform templates that you can use to deployready-to-use HPC, AI, and ML clusters in Slurm or GKE. Youcan customize the Terraform code and manage it in your version control system.To automate the resource provisioning and update workflow, you can integrate thecode into your CI/CD pipelines by usingCloud Build.

Automate configuration changes

After you provision your infrastructure, manage the ongoing configurationchanges declaratively:

In Kubernetes-centric environments, manage your Google Cloudresources as Kubernetes objects by usingConfig Connector.
Define and manage Vertex AI resources like datasets,models, and endpoints, Cloud SQL instances, Pub/Subtopics, and Cloud Storage buckets by using YAML manifests.
Deploy the manifests to your GKE cluster in order to integrate theapplication and infrastructure configuration.
Automate configuration updates by using CI/CD pipelines and usetemplating to handle environment differences.
Implement configurations for Identity and Access Management (IAM) policies andservice accounts by using IaC.

Integrate with CI/CD

Automate the lifecycle of the Google Cloud infrastructureresources by integrating IaC into CI/CD pipelines by using tools likeCloud Build andInfrastructure Manager.
Define triggers for automatic updates on code commits.
Implement automated testing and validation within the pipeline. Forexample, you can create a script to automatically run the Terraformvalidate andplan commands.
Store the configurations as artifacts and enable versioning.
Define separate environments, such as dev, staging, and prod, withdistinct configurations in version control and automate environmentpromotion.

Validate model behavior

To maintain model accuracy and relevance over time, automate the training andevaluation process within your MLOps platform. This automation, coupled withrigorous validation, helps to ensure that the models behave as expected withrelevant data before they're deployed to production.

Set up continuous training pipelines, which are either triggered by newdata and monitoring signals like data drift or that run on a schedule.
- To manage automated training jobs, such as hyperparameter tuning trialsand distributed training configurations for larger models, useVertex AI custom training.
- For fine-tuning foundation models, automate the fine-tuningprocess and integrate the jobs into your pipelines.
Implement automated model versioning and securely store trained modelartifacts after each successful training run. You can store the artifactsin Cloud Storage or register them inModel Registry.
Define evaluation metrics and set clear thresholds, such as minimumaccuracy, maximum error rate, and minimum F1 score.
- Ensure that a model meets the thresholds to automatically passthe evaluation and be considered for deployment.
- Automate evaluation by using services likemodel evaluation in Vertex AI.
- Ensure that the evaluation includes metrics that are specific tothe quality of generated output, factual accuracy, safety attributes,and adherence to specified style or format.
To automatically log and track the parameters, code versions, datasetversions, and results of each training and evaluation run, useVertex AI Experiments.This approach provides a history that's useful for comparison, debugging,and reproducibility.
To optimizehyperparameter tuning and automate searching for optimal model configurations based on yourdefined objective, useVertex AI Vizier.
To visualize training metrics and to debug during development, useVertex AI TensorBoard.

Validate inputs and outputs of AI and ML pipelines

To ensure the reliability and integrity of your AI and ML systems, you mustvalidate data when it enters the systems and moves through the pipelines. Youmust also verify the inputs and outputs at the component boundaries. Robustvalidation of all inputs and outputs—raw data, processed data, configurations,arguments, and files—helps to prevent unexpected behavior and maintain modelquality throughout the MLOps lifecycle. When you integrate this proactiveapproach into your MLOps platform, it helps detect errors before they arepropagated throughout a system and it saves time and resources.

To effectively validate the inputs and outputs of your AI and ML pipelines, usethe following techniques.

Automate data validation

Implement automated data validation in your data ingestion andpreprocessing pipelines by usingTensorFlow Data Validation (TFDV).
- For large-scale, SQL-based data quality checks, leveragescalable processing services likeBigQuery.
- For complex, programmatic validation on streaming or batch data,useDataflow.
Monitor data distributions over time with TFDV capabilities.
- Visualize trends by using tools that are integrated withCloud Monitoring to detect data drift. You can automatically trigger model retrainingpipelines when data patterns change significantly.
Store validation results and metrics in BigQuery foranalysis and historical tracking and archive validation artifacts inCloud Storage.

Validate pipeline configurations and input data

To prevent pipeline failures or unexpected behavior caused by incorrectsettings, implement strict validation for all pipeline configurations andcommand-line arguments:

Define clear schemas for your configuration files like YAML or JSON byusing schema validation libraries likejsonschema for Python. Validate configuration objects against these schemas before apipeline run starts and before a component executes.
Implement input validation for all command-line arguments and pipelineparameters by using argument-parsing libraries likeargparse.Validation should check for correct data types, valid values, and requiredarguments.
WithinVertex AI Pipelines,define the expected types and properties of component parameters by usingthe built-in component input validation features.
To ensure reproducibility of pipeline runs and to maintain an audittrail, store validated, versioned configuration files inCloud Storage or Artifact Registry.

Validate input and output files

Validate input and output files such as datasets, model artifacts, andevaluation reports for integrity and format correctness:

Validate file formats like CSV,Parquet,and image types by using libraries.
For large files or critical artifacts, validate file sizes and checksumsto detect corruption or incomplete transfers by usingCloud Storagedata validation and change detection.
Perform file validation by usingCloud Run functions (for example, based on file upload events) or withinDataflow pipelines.
Store validation results inBigQuery for easier retrieval and analysis.

Automate deployment and implement continuous monitoring

Automated deployment and continuous monitoring of models in production helps toensure reliability, perform rapid updates, and detect issues promptly. Thisinvolves managing model versions, controlled deployment, automated deploymentusing CI/CD, and comprehensive monitoring as described in the followingsections.

Manage model versions

Manage model iterations and associated artifacts by using versioning tools:

To track model versions and metadata and to link to underlying modelartifacts, useModel Registry.
Implement a clear versioning scheme (such as, semantic versioning). For eachmodel version, attach comprehensive metadata such as training parameters,evaluation metrics from validation pipelines, and dataset version.
Store model artifacts such as model files, pretrained weights, andserving container images inArtifact Registry and use its versioning and tagging features.
To meet security and governance requirements, define stringentaccess-control policies for Model Registry andArtifact Registry.
To programmatically register and manage versions and to integrateversions into automated CI/CD pipelines, use the Vertex AISDK or API.

Perform controlled deployment

Control the deployment of model versions to endpoints by using your servingplatform's traffic management capabilities.

Implement arolling deployment by using the traffic splitting feature of Vertex AIendpoints.
If you deploy your model to GKE, use advancedtraffic management techniques likecanary deployment:
1. Route a small subset of the production traffic to a new model version.
2. Continuously monitor performance and error rates through metrics.
3. Establish that the model is reliable.
4. Roll out the version to all traffic.
PerformA/B testing of AI agents:
1. Deploy two different model-agent versions or entirely differentmodels to the same endpoint.
2. Split traffic across the deployments.
3. Analyze the results against business objectives.
Implement automated rollback mechanisms that can quickly revert endpointtraffic to a previous stable model version if monitoring alerts aretriggered or performance thresholds are missed.
Configure traffic splitting and deployment settings programmatically byusing the Vertex AI SDK or API.
UseCloud Monitoring to track performance and traffic across versions.
Automate deployment with CI/CD pipelines. You can useCloud Build to build containers, version artifacts, and trigger deployment toVertex AI endpoints.
Ensure that the CI/CD pipelines manage versions and pull fromArtifact Registry.
Before you shift traffic, perform automated endpoint testing forprediction correctness, latency, throughput, and API function.
Store all configurations in version control.

Monitor continuously

UseModel Monitoring to automatically detect performance degradation, data drift (changes ininput distribution compared to training), and prediction drift (changes inmodel outputs).
- Configure drift detection jobs with thresholds and alerts.
- Monitor real-time performance: prediction latency, throughput,error rates.
Define custom metrics inCloud Monitoring for business KPIs.
Integrate Model Monitoring results and custommetrics with Cloud Monitoring for alerts and dashboards.
Configure notification channels like email, Slack, or PagerDuty andconfigure automated remediation.
To debug prediction logs, useCloud Logging.
Integrate monitoring with incident management.

For generative AI endpoints, monitor output characteristics like toxicity andcoherence:

Monitor feature serving for drift.
Implement granular prediction validation: validate outputs againstexpected ranges and formats by using custom logic.
Monitor prediction distributions for shifts.
Validate output schema.
Configure alerts for unexpected outputs and shifts.
Track and respond to real-time validation events by usingPub/Sub.

Ensure that the output of comprehensive monitoring feeds back into continuoustraining.

Maintain trust and control through data and model governance

AI and ML reliability extends beyond technical uptime. It includes trust androbust data and model governance. AI outputs might be inaccurate, biased, oroutdated. Such issues erode trust and can cause harm. Comprehensivetraceability, strong access control, automated validation, and transparentpractices help to ensure that AI outputs are reliable, trustworthy, and meetethics standards.

To maintain trust and control through data and model governance, consider thefollowing recommendations.

Establish data and model catalogs for traceability

To facilitate comprehensive tracing, auditing, and understanding the lineage ofyour AI and ML assets, maintain a robust, centralized record of data and modelversions throughout their lifecycle. A reliable data and model catalog serves asthe single source of truth for all of the artifacts that are used and producedby your AI and ML pipelines–from raw data sources and processed datasets totrained model versions and deployed endpoints.

Use the following products, tools, and techniques to create and maintaincatalogs for your data assets:

Build an enterprise-wide catalog of your data assets by usingDataplex Universal Catalog.To automatically discover and build inventories of the data assets,integrate Dataplex Universal Catalog with your storage systems, such asBigQuery,Cloud Storage,andPub/Sub.
Ensure that your data is highly available and durable by storing it inCloud Storagemulti-region or dual-region buckets.Data that you upload to these buckets is stored redundantly across at leasttwo separate geographic locations. This redundancy provides built-inresilience against regional outages and it helps to ensure data integrity.
Tag and annotate your datasets with relevant business metadata,ownership information, sensitivity levels, and lineage details. Forexample, link a processed dataset to its raw source and to the pipelinethat created the dataset.
Create a central repository for model versions by usingModel Registry.Register each trained model version and link it to the associated metadata.The metadata can include the following:
- Training parameters.
- Evaluation metrics from validation pipelines.
- Dataset version that was used for training, with lineage tracedback to the relevant Dataplex Universal Catalog entry.
- Code version that produced the dataset.
- Details about the framework or foundation model that was used.
Before you import a model into Model Registry,store model artifacts like model files and pretrained weights in a servicelike Cloud Storage. Store custom container images for serving orcustom training jobs in a secure repository likeArtifact Registry.
To ensure that data and model assets are automatically registered andupdated in the respective catalogs upon creation or modification, implementautomated processes within your MLOps pipelines. This comprehensivecataloging provides end-to-end traceability from raw data to prediction,which lets you audit the inputs and processes that led to a specific modelversion or prediction. The auditing capability is vital for debuggingunexpected behavior, ensuring compliance with data usage policies, andunderstanding the impact of data or model changes over time.
For Generative AI and foundation models, your catalog must also trackdetails about the specific foundation model used, fine-tuning parameters,and evaluation results that are specific to the quality and safety of thegenerated output.

Implement robust access controls and audit trails

To maintain trust and control in your AI and ML systems, it's essential thatyou protect sensitive data and models from unauthorized access and ensureaccountability for all changes.

Implement strict access controls and maintain detailed audit trailsacross all components of your AI and ML systems in Google Cloud.
Define granular permissions in IAM for users, groups, andservice accounts that interact with your AI and ML resources.
Follow the principle of least privilege rigorously.
Grant only the minimum necessary permissions for specific tasks. Forexample, a training service account needs read access to training data andwrite access for model artifacts, but the service might not need writeaccess to production serving endpoints.

Apply IAM policies consistently across all relevant assets andresources in your AI and ML systems, including the following:

Cloud Storage buckets that contain sensitive data or modelartifacts.
BigQuery datasets.
Vertex AI resources, such as model repositories,endpoints, pipelines, and Feature Store resources.
Compute resources, such as GKE clusters and Cloud Runservices.

Use auditing and logs to capture, monitor, and analyze access activity:

EnableCloud Audit Logs for all of the Google Cloud services that are used by your AI and MLsystem.
Configure audit logs to capture detailed information about API calls,data access events, and configuration changes made to your resources.Monitor the logs for suspicious activity, unauthorized access attempts, orunexpected modifications to critical data or model assets.
For real-time analysis, alerting, and visualization, stream the auditlogs toCloud Logging.
For cost-effective long-term storage and retrospective security analysisor compliance audits, export the logs to BigQuery.
For centralized security monitoring, integrate audit logs with yoursecurity information and event management (SIEM) systems. Regularly reviewaccess policies and audit trails to ensure they align with your governancerequirements and detect potential policy violations.
For applications that handle sensitive data, such as personallyidentifiable information (PII) for training or inference, useSensitive Data Protection checks within pipelines or on data storage.
For generative AI and agentic solutions, use audit trails to help trackwho accessed specific models or tools, what data was used for fine-tuningor prompting, and what queries were sent to production endpoints. The audittrails help you to ensure accountability and they provide crucial data foryou to investigate misuse of data or policy violations.

Address bias, transparency, and explainability

To build trustworthy AI and ML systems, you need to address potential biasesthat are inherent in data and models, strive for transparency in systembehavior, and provide explainability for model outputs. It's especially crucialto build trustworthy systems in sensitive domains or when you use complex modelslike those that are typically used for generative AI applications.

Implement proactive practices to identify and mitigate bias throughoutthe MLOps lifecycle.
Analyze training data for bias by using tools that detect skew infeature distributions across different demographic groups or sensitiveattributes.
Evaluate the overall model performance and the performance acrosspredefined slices of data. Such evaluation helps you to identify disparateperformance or bias that affects specific subgroups.

For model transparency and explainability, use tools that help users anddevelopers understand why a model made a particular prediction or produced aspecific output.

For tabular models that are deployed onVertex AI endpoints,generate feature attributions by usingVertex Explainable AI.Feature attributions indicate the input features that contributed most tothe prediction.
Interactively explore model behavior and potential biases on a datasetby using model-agnostic tools like theWhat-If Tool,which integrates with TensorBoard.
Integrate explainability into your monitoring dashboards. In situationswhere understanding the model's reasoning is important for trust ordecision-making, provide explainability data directly to end users throughyour application interfaces.
For complex models like LLMs that are used for generative AI models,explain theprocess that an agent followed, such as by usingtrace logs.Explainability is relatively challenging for such models, but it's stillvital.
In RAG applications, provide citations for retrieved information. Youcan also use techniques like prompt engineering to guide the model toprovide explanations or show its reasoning steps.
Detect shifts in model behavior or outputs that might indicate emergingbias or unfairness by implementing continuous monitoring in production.Document model limitations, intended use cases, and known potential biasesas part of the model's metadata in theModel Registry.

Implement holistic AI and ML observability and reliability practices

Holistic observability is essential for managing complex AI and ML systems inproduction. It's also essential for measuring the reliability of complex AI andML systems, especially for generative AI, due to its complexity, resourceintensity, and potential for unpredictable outputs. Holistic observabilityinvolves observing infrastructure, application code, data, and model behavior togain insights for proactive issue detection, diagnosis, and response. Thisobservability ultimately leads to high-performance, reliable systems. To achieveholistic observability you need to do the following:

Adopt SRE principles.
Define clear reliability goals.
Track metrics across system layers.
Use insights from observability for continuous improvement and proactivemanagement.

To implement holistic observability and reliability practices for AI and MLworkloads in Google Cloud, consider the following recommendations.

Establish reliability goals and business metrics

Identify the key performance indicators (KPIs) that your AI and ML systemdirectly affects. The KPIs might include revenue that's influenced by AIrecommendations, customer churn that the AI systems predicted or mitigated, anduser engagement and conversion rates that are driven by generative AI features.

For each KPI, define the corresponding technical reliability metrics that affectthe KPI. For example, if the KPI is "customer satisfaction with a conversationalAI assistant," then the corresponding reliability metrics can include thefollowing:

The success rate of user requests.
The latency of responses: time to first token (TTFT) and token streamingfor LLMs.
The rate of irrelevant or harmful responses.
The rate of successful task completion by the agent.

For AI and ML training, reliability metrics can include model FLOPSutilization (MFU), iterations per second, tokens per second, and tokens perdevice.

To effectively measure and improve AI and ML reliability, begin by setting clearreliability goals that are aligned with the overarching business objectives.Adopt the SRE approach by defining SLOs that quantify acceptable levels ofreliability and performance for your AI and ML services from the users'perspective. Quantify these technical reliability metrics with specific SLOtargets.

The following are examples of SLO targets:

99.9% of API calls must return a successful response.
95th percentile inference latency must be below 300 ms.
TTFT must be below 500 ms for 99% of requests.
Rate of harmful output must be below 0.1%.

Aligning SLOs directly with business needs ensures that reliability efforts arefocused on the most critical system behavior that affects users and thebusiness. This approach helps to transform reliability into a measurable andactionable engineering property.

Monitor infrastructure and application performance

Track infrastructure metrics across all of the resources that are used by yourAI and ML systems. The metrics include processor usage (CPU, GPU, and TPU),memory usage, network throughput and latency, and disk I/O. Track the metricsfor managed environments like Vertex AI training and serving andfor self-managed resources like GKE nodes andCloud Run instances.

Monitor thefour golden signals for your AI and ML applications:

Latency: Time to respond to requests.
Traffic: Volume of requests or workload.
Error rate: Rate of failed requests or operations.
Saturation: Utilization of critical resources like CPU, memory, andGPU or TPU accelerators, which indicates how close your system is tocapacity limits.

Perform monitoring by using the following techniques:

Collect, store, and visualize the infrastructure and applicationmetrics by usingCloud Monitoring.You can use pre-built dashboards for Google Cloud services and createcustom dashboards that are tailored based on your workload's specificperformance indicators and infrastructure health.
- Collect and integrate metrics from specialized servingframeworks likevLLM orNVIDIA Triton Inference Server into Cloud Monitoring by usingGoogle Cloud Managed Service for Prometheus.
- Create dashboards and configure alerts for metrics that arerelated to custom training, endpoints, and performance, and for metricsthatVertex AI exports to Cloud Monitoring.
Collect detailed logs from your AI and ML applications and theunderlying infrastructure by usingCloud Logging.These logs are essential for troubleshooting and performance analysis. Theyprovide context around events and errors.
Pinpoint latency issues and understand request flows across distributedAI and ML microservices by usingCloud Trace.This capability is crucial for debugging complexVertex AI Agents interactions or multi-componentinference pipelines.
Identify performance bottlenecks within function blocks in applicationcode by usingCloud Profiler.Identifying performance bottlenecks can help you optimize resource usageand execution time.
Gather specific accelerator-related metrics like detailed GPUutilization per process, memory usage per process, and temperature, byusing tools likeNVIDIA Data Center GPU Manager (DCGM).

Implement data and model observability

Reliable generative AI systems require robust data and model observability,which starts with end-to-end pipeline monitoring.

Track data ingestion rates, processed volumes, and transformationlatencies by using services likeDataflow.
Monitor job success and failure rates within your MLOps pipelines,including pipelines that are managed byVertex AI Pipelines.

Continuous assessment of data quality is crucial.

Manage and govern data by usingDataplex Universal Catalog:
- Evaluate accuracy by validating against ground truth or bytracking outlier detection rates.
- Monitor freshness based on the age of data and frequency ofupdates against SLAs.
- Assess completeness by tracking null-value percentages andrequired field-fill rates.
- Ensure validity and consistency through checks forschema-adherence and duplication.
Proactively detect anomalies by using Cloud Monitoring alerting andthrough clear data lineage for traceability.
For RAG systems, examine the relevance of the retrieved context and thegroundedness (attribution to source) of the responses.
Monitor the throughput of vector database queries.

Key model observability metrics include input-output token counts andmodel-specific error rates, such as hallucination or query resolution failures.To track these metrics, useModel Monitoring.

Continuously monitor the toxicity scores of the output anduser-feedback ratings.
Automate the assessment of model outputs against defined criteria byusing theGen AI evaluation service.
Ensure sustained performance by systematically monitoring for data andconcept drift with comprehensive error-rate metrics.

To track model metrics, you can useTensorBoard orMLflow.For deep analysis and profiling to troubleshoot performance issues, you can usePyTorch XLA profiling orNVIDIA Nsight.

Contributors

Authors:

Rick (Rugui) Chen | AI Infrastructure Field Solutions Architect
Stef Ruinard | Generative AI Field Solutions Architect

Other contributors:

Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
Hossein Sarshar | AI Infrastructure Field Solution Architect
Jose Andrade | Customer Engineer, SRE Specialist
Kumar Dhanagopal | Cross-Product Solution Developer
Laura Hyatt | Customer Engineer, FSI
Olivier Martin | AI Infrastructure Field Solution Architect
Radhika Kanakam | Program Lead, Google Cloud Well-Architected Framework

AI and ML perspective: Cost optimization

This document inWell-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost ofyour AI systems throughout the ML lifecycle. By adopting a proactive andinformed cost management approach, your organization can realize the fullpotential of AI and ML systems and also maintain financial discipline. Therecommendations in this document align with thecost optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems can help you unlock valuable insights and predictivecapabilities from data. For example, you canreduce friction in internal processes,improve user experiences,andgain deeper customer insights.The cloud offers vast amounts of resources and quick time-to-value without largeup-front investments for AI and ML workloads. To maximize business value and toalign the spending with your business goals, you need to understand the costdrivers, proactively optimize costs, set up spending controls, and adoptFinOps practices.

The recommendations in this document are mapped to the following coreprinciples:

Define and measure costs and returns

To effectively manage AI and ML costs in Google Cloud, you must define andmeasure the cloud resource costs and the business value of your AI and MLinitiatives. To help you track expenses granularly, Google Cloud providescomprehensive billing and cost management tools, such as the following:

Cloud Billing reports and tables
Looker Studio dashboards, budgets, and alerts
Cloud Monitoring
Cloud Logging

To make informed decisions about resource allocation and optimization, considerthe following recommendations.

Establish business goals and KPIs

Align the technical choices in your AI and ML projects with business goals andkey performance indicators (KPIs).

Define strategic objectives and ROI-focused KPIs

Ensure that AI and ML projects are aligned with strategic objectives likerevenue growth, cost reduction, customer satisfaction, and efficiency. Engagestakeholders to understand the business priorities. Define AI and ML objectivesthat are specific, measurable, attainable, relevant, and time-bound (SMART). Forexample, a SMART objective is: "Reduce chat handling time for customer supportby 15% in 6 months by using an AI chatbot".

To make progress towards your business goals and to measure the return oninvestment (ROI), define KPIs for the following categories of metrics:

Costs for training, inference, storage, and network resources, includingspecific unit costs (such as the cost per inference, data point, or task).These metrics help you gain insights into efficiency and cost optimizationopportunities. You can track these costs by usingCloud Billing reports andCloud Monitoring dashboards.
Business value metrics like revenue growth, cost savings, customersatisfaction, efficiency, accuracy, and adoption. You can track thesemetrics by usingBigQuery analytics andLooker dashboards.
Industry-specific metrics like the following:
- Retail industry: measure revenue lift and churn
- Healthcare industry: measure patient time and patient outcomes
- Finance industry: measure fraud reduction
Project-specific metrics. You can track these metrics by usingVertex AI Experiments andevaluation.
- Predictive AI: measure accuracy and precision
- Generative AI: measure adoption, satisfaction, and content quality
- Computer vision AI: measure accuracy

Foster a culture of cost awareness and continuous optimization

AdoptFinOps principles to ensure that each AI and ML project hasestimated costs and has ways to measure and track actual costs throughout its lifecycle. Ensurethat the costs and business benefits of your projects have assigned owners andclear accountability.

For more information, seeFoster a culture of cost awareness in the Cost Optimization pillar of the Google Cloud Well-Architected Framework.

Drive value and continuous optimization through iteration and feedback

Map your AI and ML applications directly to your business goals and measure theROI.

To validate your ROI hypotheses, start with pilot projects and use the followingiterative optimization cycle:

Monitor continuously and analyze data: Monitor KPIs and costs toidentify deviations and opportunities for optimization.
Make data-driven adjustments: Optimize strategies, models,infrastructure, and resource allocation based on data insights.
Refine iteratively: Adapt business objectives and KPIs based onthe things you learned and the evolving business needs. This iterationhelps you maintain relevance and strategic alignment.
Establish a feedback loop: Review performance, costs, and value withstakeholders to inform ongoing optimization and future project planning.

Manage billing data with Cloud Billing and labels

Effective cost optimization requires visibility into the source of each costelement. The recommendations in this section can help you use Google Cloudtools to get granular insights into your AI and ML costs. You can also attributecosts to specific AI and ML projects, teams, and activities. These insights laythe groundwork for cost optimization.

Organize and label Google Cloud resources

Structure your projects and resources in a hierarchy that reflects yourorganizational structure and your AI and ML workflows. To track and analyzecosts at different levels, organize your Google Cloud resources byusing organizations, folders, and projects. For more information, seeDecide a resource hierarchy for your Google Cloud landing zone.
Apply meaningfullabels to your resources. You can use labels thatindicate the project, team, environment, model name, dataset, use case, andperformance requirements. Labels provide valuable context for your billingdata and enable granular cost analysis.
Maintain consistency in your labeling conventions across all of your AIand ML projects. Consistent labeling conventions ensure that your billingdata is organized and can be readily analyzed.

Use billing-related tools

To facilitate detailed analysis and reporting,export the billing data to BigQuery.BigQuery has powerful query capabilities that let youanalyze the billing data to help you understand your costs.
To aggregate costs by labels, projects, or specific time periods, youcan write custom SQL queries in BigQuery. Such queries letyou attribute costs to specific AI and ML activities, such as modeltraining, hyperparameter tuning, or inference.
To identify cost anomalies or unexpected spending spikes, use theanalytic capabilities in BigQuery. This approach can helpyou detect potential issues or inefficiencies in your AI and ML workloads.
To identify and manage unexpected costs, use theanomaly detection dashboard in Cloud Billing.
To distribute costs across different teams or departments based onresource usage, use Google Cloud'scost allocation feature. Cost allocation promotes accountability and transparency.
To gain insights into spending patterns, explore the prebuiltCloud Billing reports.You can filter and customize these reports to focus on specific AI and MLprojects or services.

Monitor resources continuously with dashboards, alerts, and reports

To create a scalable and resilient way to track costs, you need continuousmonitoring and reporting. Dashboards, alerts, and reports constitute thefoundation for effective cost tracking. This foundation lets you maintainconstant access to cost information, identify areas of optimization, andensure alignment between business goals and costs.

Create a reporting system

Create scheduled reports and share them with appropriate stakeholders.

UseCloud Monitoring to collect metrics from various sources, including your applications,infrastructure, and Google Cloud services like Compute Engine,Google Kubernetes Engine (GKE), and Cloud Run functions. To visualizemetrics and logs in real time, you can use the prebuilt Cloud Monitoringdashboard or create custom dashboards. Custom dashboards let you define and addmetrics to track specific aspects of your systems, like model performance, APIcalls, or business-level KPIs.

UseCloud Logging for centralized collection and storage of logs from your applications, systems,and Google Cloud services. Use the logs for the following purposes:

Track costs and utilization of resources like CPU, memory, storage, andnetwork.
Identify cases of over-provisioning (where resources aren't fullyutilized) and under-provisioning (where there are insufficient resources).Over-provisioning results in unnecessary costs. Under-provisioning slowstraining times and might cause performance issues.
Identify idle or underutilized resources, such as VMs and GPUs, and takesteps to shut down or rightsize them to optimize costs.
Identify cost spikes to detect sudden and unexpected increases inresource usage or costs.

UseLooker orLooker Studio to create interactive dashboards and reports. Connect the dashboards and reportsto various data sources, including BigQuery andCloud Monitoring.

Set alert thresholds based on key KPIs

For your KPIs, determine the thresholds that should trigger alerts. Meaningfulalert thresholds can help you avoid alert fatigue. Createalerting policies in Cloud Monitoring to get notifications related to your KPIs. For example,you can get notifications when accuracy drops below a certain threshold orlatency exceeds a defined limit. Alerts based on log data can notify you aboutpotential cost issues in real time. Such alerts let you take corrective actionspromptly and prevent further financial loss.

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, youmust optimize resource allocation. To help you avoid unnecessary expenses andensure that your workloads have the resources that they need to performoptimally, align resource allocation with the needs of your workloads.

To optimize the allocation of cloud resources to AI and ML workloads, considerthe following recommendations.

Use autoscaling to dynamically adjust resources

Use Google Cloud services that support autoscaling, which automaticallyadjusts resource allocation to match the current demand. Autoscaling providesthe following benefits:

Cost and performance optimization: You avoid paying for idleresources. At the same time, autoscaling ensures that your systems have thenecessary resources to perform optimally, even at peak load.
Improved efficiency: You free up your team to focus on other tasks.
Increased agility: You can respond quickly to changing demands andmaintain high availability for your applications.

The following table summarizes the techniques that you can use to implementautoscaling for different stages of your AI projects.

Stage	Autoscaling techniques
Training	Use managed services likeVertex AI orGKE, which offer built-in autoscaling capabilities for training jobs. Configure autoscaling policies to scale the number of traininginstances based on metrics like CPU utilization, memory usage, and jobqueue length. Use custom scaling metrics to fine-tune autoscaling behavior foryour specific workloads.
Inference	Deploy your models on scalable platforms likeVertex AI Inference,GPUs on GKE, orTPUs on GKE. Use autoscaling features to adjust the number of replicas based onmetrics like request rate, latency, and resource utilization. Implement load balancing to distribute traffic evenly acrossreplicas and ensure high availability.

Start with small models and datasets

To help reduce costs, test ML hypotheses at a small scale when possible and usean iterative approach. This approach, with smaller models and datasets, providesthe following benefits:

Reduced costs from the start: Less compute power, storage, andprocessing time can result in lower costs during the initialexperimentation and development phases.
Faster iteration: Less training time is required, which lets youiterate faster, explore alternative approaches, and identify promisingdirections more efficiently.
Reduced complexity: Simpler debugging, analysis, and interpretationof results, which leads to faster development cycles.
Efficient resource utilization: Reduced chance of over-provisioningresources. You provision only the resources that are necessary for thecurrent workload.

Consider the following recommendations:

Use sample data first: Train your models on a representative subsetof your data. This approach lets you assess the model's performance andidentify potential issues without processing the entire dataset.
Experiment by using notebooks: Start with smaller instances and scaleas needed. You can useVertex AI Workbench,a managed Jupyter notebook environment that's well suited forexperimentation with different model architectures and datasets.
Start with simpler or pre-trained models: UseVertex AI Model Garden to discover and explore the pre-trained models. Such models require fewercomputational resources. Gradually increase thecomplexity as needed based on performance requirements.
- Use pre-trained models for tasks like image classification andnatural language processing. To save on training costs, you canfine-tune the models on smaller datasets initially.
- UseBigQuery ML for structured data. BigQuery ML lets you create and deploymodels directly within BigQuery. This approach can becost-effective for initial experimentation, because you can take advantageof the pay-per-query pricing model for BigQuery.
Scale for resource optimization: Use Google Cloud's flexibleinfrastructure to scale resources as needed. Start with smaller instancesand adjust their size or number when necessary.

Discover resource requirements through experimentation

Resource requirements for AI and ML workloads can vary significantly. Tooptimize resource allocation and costs, you must understand the specific needsof your workloads through systematic experimentation. To identify the mostefficient configuration for your models, test different configurations andanalyze their performance. Then, based on the requirements, right-size theresources that you used for training and serving.

We recommend the following approach for experimentation:

Start with a baseline: Begin with a baseline configuration based onyour initial estimates of the workload requirements. To create a baseline,you can use the cost estimator for new workloads or use an existing billingreport. For more information, seeUnlock the true cost of enterprise AI on Google Cloud.
Understand your quotas: Before launching extensive experiments,familiarize yourself with your Google Cloud projectquotas for the resources and APIs that you plan to use. The quotas determine therange of configurations that you can realistically test. By becomingfamiliar with quotas, you can work within the available resource limitsduring the experimentation phase.
Experiment systematically: Adjust parameters like the number ofCPUs, amount of memory, number and type of GPUs and TPUs, and storagecapacity.Vertex AI training andVertex AI predictions let you experiment with different machine types and configurations.
Monitor utilization, cost, and performance: Track the resourceutilization, cost, and key performance metrics such as training time,inference latency, and model accuracy, for each configuration that youexperiment with.
- To track resource utilization and performance metrics, you canuse the Vertex AI console.
- To collect and analyze detailed performance metrics, useCloud Monitoring.
- To view costs, useCloud Billing reports andCloud Monitoring dashboards.
- To identify performance bottlenecks in your models and optimizeresource utilization, use profiling tools likeVertex AI TensorBoard.
Analyze costs: Compare the cost and performance of eachconfiguration to identify the most cost-effective option.
Establish resource thresholds and improvement targets based on quotas:Define thresholds for when scaling begins to yielddiminishing returns in performance, such as minimal reduction in trainingtime or latency for a significant cost increase. Consider project quotaswhen setting these thresholds. Determine the point where the cost andpotential quota implications of further scaling are no longer justified byperformance gains.
Refine iteratively: Repeat the experimentation process withrefined configurations based on your findings. Always ensure that theresource usage remains within your allocated quotas and aligns withestablished cost-benefit thresholds.

Use MLOps to reduce inefficiencies

As organizations increasingly use ML to drive innovation and efficiency,managing the ML lifecycle effectively becomes critical. ML operations (MLOps) isa set of practices that automate and streamline the ML lifecycle, from modeldevelopment to deployment and monitoring.

Align MLOps with cost drivers

To take advantage of MLOps for cost efficiency, identify the primary costdrivers in the ML lifecycle. Then, you can adopt and implement MLOps practicesthat are aligned with the cost drivers. Prioritize and adopt the MLOps featuresthat address the most impactful cost drivers. This approach helps ensure amanageable and successful path to significant cost savings.

Implement MLOps for cost optimization

The following are common MLOps practices that help to reduce cost:

Version control: Tools like Git can help you to track versions ofcode, data, and models. Version control ensures reproducibility,facilitates collaboration, and prevents costly rework that can be caused byversioning issues.
Continuous integration and continuous delivery (CI/CD):Cloud Build andArtifact Registry let you implement CI/CD pipelines to automate building, testing, anddeployment of your ML models. CI/CD pipelines ensure efficient resourceutilization and minimize the costs associated with manual interventions.
Observability:Cloud Monitoring andCloud Logging let you track model performance in production, identify issues, and triggeralerts for proactive intervention. Observability lets you maintain modelaccuracy, optimize resource allocation, and prevent costly downtime orperformance degradation.
Model retraining:Vertex AI Pipelines simplifies the processes for retraining models periodically or whenperformance degrades. When you use Vertex AI Pipelines forretraining, it helps ensure that your models remain accurate and efficient,which can prevent unnecessary resource consumption and maintain optimalperformance.
Automated testing and evaluation:Vertex AI helps you accelerate and standardize model evaluation. Implement automatedtests throughout the ML lifecycle to ensure the quality and reliability ofyour models. Such tests can help you catch errors early, prevent costlyissues in production, and reduce the need for extensive manual testing.

For more information, seeMLOps: Continuous delivery and automation pipelines in machine learning.

Enforce data management and governance practices

Effective data management and governance practices are critical to costoptimization. Well organized data can encourage teams to reuse datasets, avoidneedless duplication, and reduce the effort to obtain high quality data. Byproactively managing data, you can reduce storage costs, enhance data quality,and ensure that your ML models are trained on the most relevant and valuabledata.

To implement data management and governance practices, consider the followingrecommendations.

Establish and adopt a data governance framework

The growing prominence of AI and ML has made data the most valuable asset fororganizations that are undergoing digital transformation. A robust framework fordata governance is a crucial requirement for managing AI and ML workloads cost-effectively atscale. A data governance framework with clearly defined policies, procedures,and roles provides a structured approach for managing data throughout itslifecycle. Such a framework helps to improve data quality, enhance security,improve utilization, and reduce redundancy.

Establish a data governance framework

There are many pre-existing frameworks for data governance, such as theframeworks published by theEDM Council,with options available for different industries and organization sizes. Chooseand adapt a framework that aligns with your specific needs and priorities.

Implement the data governance framework

Google Cloud provides the following services and tools to help you implement arobust data governance framework:

Dataplex Universal Catalog is an intelligent data fabric that helps you unify distributed data andautomate data governance without the need to consolidate data sets in oneplace. This helps to reduce the cost to distribute and maintain data,facilitate data discovery, and promote reuse.
- To organize data, use Dataplex Universal Catalog abstractions andset uplogical data lakes and zones.
- To administer access to data lakes and zones, useGoogle Groups andDataplex Universal Catalog roles.
- To streamline data quality processes, enableauto data quality.
Dataplex Universal Catalogis also a fully managed and scalable metadata management service. Thecatalog provides a foundation that ensures that data assets are accessible andreusable.
- Metadata from thesupported Google Cloud sources is automatically ingested into the universal catalog. For data sourcesoutside of Google Cloud,create custom entries.
- To improve the discoverability and management of data assets,enrich technical metadata with business metadata by usingaspects.
- Ensure that data scientists and ML practitioners have sufficientpermissions to access Dataplex Universal Catalog and use thesearch function.
BigQuery sharing lets you efficiently and securely exchange data assets across yourorganizations to address challenges of data reliability and cost.
- Set updata exchanges and ensure that curated data assets can be viewed aslistings.
- Usedata clean rooms to securely manage access to sensitive data and efficiently partnerwith external teams and organizations on AI and ML projects.
- Ensure that data scientists and ML practitioners have sufficientpermissions to view and publish datasets to BigQuery sharing.

Make datasets and features reusable throughout the ML lifecycle

For significant efficiency and cost benefits, reuse datasets and featuresacross multiple ML projects. When you avoid redundant data engineering andfeature development efforts, your organization can accelerate model development,reduce infrastructure costs, and free up valuable resources for other criticaltasks.

Google Cloud provides the following services and tools to help you reusedatasets and features:

Data and ML practitioners can publishdata products to maximize reuse across teams. The data products can then be discoveredand used through Dataplex Universal Catalog andBigQuery sharing.
For tabular and structured datasets, you can useVertex AI Feature Store to promote reusability and streamline feature management throughBigQuery.
You can store unstructured data in Cloud Storage and govern the databy usingBigQuery object tables and signed URLs.
You can manage vector embeddings by including metadata in yourVector Search indexes.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs fortechnology and personnel. Automation helps you avoid the duplication of MLactivities and reduce the workload for data scientists and ML engineers.

To automate and streamline ML development with MLOps, consider the followingrecommendations.

Automate and standardize data collection and processing

To help reduce ML development effort and time, automate and standardize yourdata collection and processing technologies.

Automate data collection and processing

This section summarizes the products, tools, and techniques that you can use toautomate data collection and processing.

Identify and choose the relevant data sources for your AI and ML tasks:

Database options such asCloud SQL,Spanner,AlloyDB for PostgreSQL,Firestore,andBigQuery.Your choice depends on your requirements, such as latency on write access(static or dynamic), data volume (high or low), and data format(structured, unstructured, or semi-structured). For more information, seeGoogle Cloud databases.
Data lakes such as Cloud Storage withBigLake.
Dataplex Universal Catalog for governing data across sources.
Streaming events platforms such asPub/Sub,Dataflow,orApache Kafka.
External APIs.

For each of your data sources, choose an ingestion tool:

Dataflow: For batch and stream processing of data fromvarious sources, with ML-component integration. For an event-drivenarchitecture, you can combine Dataflow withEventarc to efficiently process data for ML. To enhance MLOps and ML job efficiency,use GPU and right-fitting capabilities.
Cloud Run functions:For event-driven data ingestion that gets triggered by changes in datasources for real-time applications.
BigQuery: For classical tabular data ingestion withfrequent access.

Choose tools for data transformation and loading:

Use tools such asDataflow orDataform to automate data transformations like feature scaling, encoding categoricalvariables, and creating new features in batch, streaming, or real time. Thetools that you select depend upon your requirements and chosen services.
UseVertex AI Feature Store to automate feature creation and management. You can centralize featuresfor reuse across different models and projects.

Standardize data collection and processing

To discover, understand, and manage data assets, use metadata managementservices likeDataplex Universal Catalog.It helps you standardize data definitions and ensure consistency across yourorganization.

To enforce standardization and avoid the cost of maintaining multiple customimplementations, use automated training pipelines and orchestration. For moreinformation, see the next section.

Automate training pipelines and reuse existing assets

To boost efficiency and productivity in MLOps, automated training pipelines arecrucial. Google Cloud offers a robust set of tools and services to buildand deploy training pipelines, with a strong emphasis on reusing existingassets. Automated training pipelines help to accelerate model development,ensure consistency, and reduce redundant effort.

Automate training pipelines

The following table describes the Google Cloud services and features thatyou can use to automate the different functions of a training pipeline.

Function	Google Cloud services and features
Orchestration: Define complex ML workflowsconsisting of multiple steps and dependencies. You can define eachstep as a separate containerized task, which helps you manageand scale individual tasks with ease.	To create and orchestrate pipelines, useVertex AI Pipelines or Kubeflow Pipelines. These tools support simple data transformation, model training, model deployment, and pipeline versioning. They let you define dependencies between steps, manage data flow, and automate the execution of the entire workflow. For complex operational tasks with heavy CI/CD and extract, transform, and load (ETL) requirements, useCloud Composer. If you prefer Airflow for data orchestration, Cloud Composer is a compatible managed service that's built on Airflow. For pipelines that are managed outside of Vertex AI Pipelines, useWorkflows for infrastructure-focused tasks like starting and stopping VMs or integrating with external systems. To automate your CI/CD process, useCloud Build withPub/Sub. You can set up notifications and automatic triggers for when new code is pushed or when a new model needs to be trained. For a fully-managed, scalable solution for pipeline management, useCloud Data Fusion.
Versioning: Track and control different versions of pipelines and components to ensure reproducibility and auditability.	StoreKubeflow pipeline templates in a Kubeflow Pipelines repository inArtifact Registry.
Reusability: Reuse existing pipeline components and artifacts, such as prepared datasets and trained models, to accelerate development.	Store your pipeline templates inCloud Storage and share them across your organization.
Monitoring: Monitor pipeline execution to identify and address any issues.	Use Cloud Logging and Cloud Monitoring. For more information, seeMonitor resources continuously with dashboards, alerts, and reports.

Expand reusability beyond pipelines

Look for opportunities to expand reusability beyond training pipelines. Thefollowing are examples of Google Cloud capabilities that let you reuse MLfeatures, datasets, models, and code.

Vertex AI Feature Store provides a centralized repository for organizing, storing, and serving MLfeatures. It lets you reuse features across different projects and models,which can improve consistency and reduce feature engineering effort. Youcan store, share, and access features for both online and offline use cases.
Vertex AI datasets enable teams to create and manage datasets centrally, so your organizationcan maximize reusability and reduce data duplication. Your teams can searchand discover the datasets by usingDataplex Universal Catalog.
Vertex AI Model Registry lets you store, manage, and deploy your trained models.Model Registry letsyou reuse the models in subsequent pipelines or for online prediction,which helps you take advantage of previous training efforts.
Custom containers let you package your training code and dependencies into containers andstore the containers in Artifact Registry. Custom containers let youprovide consistent and reproducible training environments across differentpipelines and projects.

Use Google Cloud services for model evaluation and tuning

Google Cloud offers a powerful suite of tools and services to streamlineand automate model evaluation and tuning. These tools and services can help youreduce your time to production and reduce the resources required for continuoustraining and monitoring. By using these services, your AI and ML teams canenhance model performance with fewer expensive iterations, achieve fasterresults, and minimize wasted compute resources.

Use resource-efficient model evaluation and experimentation

Begin an AI project with experiments before you scale up your solution. In yourexperiments, track various metadata such as dataset version, model parameters,and model type. For further reproducibility and comparison of the results, usemetadata tracking in addition to code versioning, similar to the capabilities inGit. To avoid missing information or deploying the wrong version in production,useVertex AI Experiments before you implement full-scale deployment or training jobs.

Vertex AI Experiments lets you do the following:

Streamline and automate metadata tracking and discovery through a userfriendly UI and API for production-ready workloads.
Analyze the model's performance metrics and compare metrics acrossmultiple models.

After the model is trained, continuously monitor the performance and data driftover time for incoming data. To streamline this process, useVertex AI Model Monitoring to directly access the created models inModel Registry.Model Monitoring also automates monitoring for data andresults through online and batch predictions. You can export the results toBigQuery for further analysis and tracking.

Choose optimal strategies to automate training

For hyperparameter tuning, we recommend the following approaches:

To automate the process of finding the optimal hyperparameters for yourmodels, useVertex AI hyperparameter tuning.Vertex AI uses advanced algorithms to explore thehyperparameter space and identify the best configuration.
For efficient hyperparameter tuning, consider usingBayesian optimization techniques, especially when you deal with complex models and large datasets.

For distributed training, we recommend the following approaches:

For large datasets and complex models, use the distributed traininginfrastructure of Vertex AI. This approach lets you trainyour models on multiple machines, which helps to significantly reducetraining time and associated costs. Use tools like the following:
- Vertex AI tuning to perform supervised fine-tuning of Gemini, Imagen, and other models.
- Vertex AI training orRay on Vertex AI for custom distributed training.
Choose optimized ML frameworks, like Keras and PyTorch, that supportdistributed training and efficient resource utilization.

Use explainable AI

It's crucial to understand why a model makes certain decisions and to identifypotential biases or areas for improvement. UseVertex Explainable AI to gain insights into your model's predictions. Vertex Explainable AI offers a wayto automate feature-based and example-based explanations that are linked to yourVertex AI experiments.

Feature-based: To understand which features are most influential inyour model's predictions, analyzefeature attributions.This understanding can guide feature-engineering efforts and improve modelinterpretability.
Example-based:To return a list of examples (typically from the training set) that aremost similar to the input, Vertex AI uses nearest neighborsearch. Because similar inputs generally yield similar predictions, you canuse these explanations to explore and explain a model's behavior.

Use managed services and pre-trained models

Adopt an incremental approach to model selection and model development. Thisapproach helps you avoid excessive costs that are associated with startingafresh every time. To control costs, use ML frameworks, managed services, andpre-trained models.

To get the maximum value from managed services and pre-trained models, considerthe following recommendations.

Use notebooks for exploration and experiments

Notebook environments are crucial for cost-effective ML experimentation. A notebookprovides an interactive and collaborative space for data scientists andengineers to explore data, develop models, share knowledge, and iterateefficiently. Collaboration and knowledge sharing through notebooks significantlyaccelerates development, code reviews, and knowledge transfer. Notebooks helpstreamline workflows and reduce duplicated effort.

Instead of procuring and managing expensive hardware for your developmentenvironment, you can use the scalable and on-demand infrastructure ofVertex AI Workbench and Colab Enterprise.

Vertex AI Workbench is a Jupyter notebook development environment for the entire data scienceworkflow. You can interact with Vertex AI and other Google Cloudservices from within an instance's Jupyter notebook.Vertex AI Workbench integrations and features help you do thefollowing:
- Access and explore data from a Jupyter notebook by usingBigQuery and Cloud Storage integrations.
- Automate recurring updates to a model by using scheduledexecutions of code that runs on Vertex AI.
- Process data quickly by running a notebook on aDataproc cluster.
- Run a notebook as a step in a pipeline by usingVertex AI Pipelines.
Colab Enterprise is a collaborative, managed notebook environment that has the security andcompliance capabilities of Google Cloud.Colab Enterprise is ideal if your project's prioritiesinclude collaborative development and reducing the effort to manageinfrastructure. Colab Enterprise integrates withGoogle Cloud services and AI-powered assistance that usesGemini. Colab Enterprise lets you do the following:
- Work in notebooks without the need to manage infrastructure.
- Share a notebook with a single user, Google group, orGoogle Workspace domain. You can control notebook access throughIdentity and Access Management (IAM).
- Interact with features built into Vertex AI andBigQuery.

To track changes and revert to previous versions when necessary, you canintegrate your notebooks with version control tools like Git.

Start with existing and pre-trained models

Training complex models from scratch, especially deep-learning models, requiressignificant computational resources and time. To accelerate your model selectionand development process, start with existing and pre-trained models. Thesemodels, which are trained on vast datasets, eliminate the need to train modelsfrom scratch and significantly reduce cost and development time.

Reduce training and development costs

Select an appropriate model or API for each ML task and combine them to createan end-to-end ML development process.

Vertex AI Model Garden offers a vast collection of pre-trained models for tasks such as imageclassification, object detection, and natural language processing. The modelsare grouped into the following categories:

Google models like the Gemini family of models and Imagen for image generation.
Open-source models like Gemma and Llama.
Third-party models from partners like Anthropic and Mistral AI.

Google Cloud providesAI and ML APIs that let developers integrate powerful AI capabilities into applications withoutthe need to build models from scratch.

Cloud Vision API lets you derive insights from images. This API is valuable for applicationslike image analysis, content moderation, and automated data entry.
Cloud Natural Language API lets you analyze text to understand its structure and meaning. This API isuseful for tasks like customer feedback analysis, content categorization,and understanding social media trends.
Speech-to-Text API converts audio to text. This API supports a wide range of languages anddialects.
Video Intelligence API analyzes video content to identify objects, scenes, and actions. Use thisAPI for video content analysis, content moderation, and video search.
Document AI API processes documents to extract, classify, and understand data. This APIhelps you automate document processing workflows.
Dialogflow API enables the creation of conversational interfaces, such as chatbots andvoice assistants. You can use this API to create customer service bots andvirtual assistants.
Gemini API in Vertex AI provides access to Google's most capable and general-purpose AI model.

Reduce tuning costs

To help reduce the need for extensive data and compute time, fine-tune yourpre-trained models on specific datasets. We recommend the followingapproaches:

Learning transfer: Use the knowledge from a pre-trained model for anew task, instead of starting from scratch. This approach requires lessdata and compute time, which helps to reduce costs.
Adapter tuning (parameter-efficient tuning):Adapt models to new tasks or domains without full fine-tuning. This approachrequires significantly lower computational resources and a smaller dataset.
Supervised fine tuning:Adapt model behavior with a labeled dataset. This approachsimplifies the management of the underlying infrastructure and thedevelopment effort that's required for a custom training job.

Explore and experiment by using Vertex AI Studio

Vertex AI Studio lets you rapidly test, prototype, and deploy generative AI applications.

Integration with Model Garden: Provides quick access tothe latest models and lets you efficiently deploy the models to save timeand costs.
Unified access to specialized models: Consolidates access to a widerange of pre-trained models and APIs, including those for chat, text,media, translation, and speech. This unified access can help you reduce thetime spent searching for and integrating individual services.

Use managed services to train or serve models

Managed services can help reduce the cost of model training and simplify theinfrastructure management, which lets you focus on model development andoptimization. This approach can result in significant cost benefits andincreased efficiency.

Reduce operational overhead

To reduce the complexity and cost of infrastructure management, use managedservices such as the following:

Vertex AI training provides a fully managed environment for training your models at scale. Youcan choose from various prebuilt containers with popular ML frameworks oruse your own custom containers. Google Cloud handles infrastructureprovisioning, scaling, and maintenance, so you incur lower operational overhead.
Vertex AI predictions handles infrastructure scaling, load balancing, and request routing. Youget high availability and performance without manual intervention.
Ray on Vertex AI provides a fully managed Ray cluster. You can use the cluster to runcomplex custom AI workloads that perform many computations (hyperparametertuning, model fine-tuning, distributed model training, and reinforcementlearning from human feedback) without the need to manage your owninfrastructure.

Use managed services to optimize resource utilization

For details about efficient resource utilization, seeOptimize resource utilization.

Contributors

Authors:

Isaac Lo | AI Business Development Manager
Anastasia Prokaeva | Field Solutions Architect, Generative AI
Amy Southwood | Technical Solutions Consultant, Data Analytics & AI

Other contributors:

Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
Kumar Dhanagopal | Cross-Product Solution Developer
Marwan Al Shawi | Partner Customer Engineer
Nicolas Pintaux | Customer Engineer, Application Modernization Specialist

AI and ML perspective: Performance optimization

This document in theGoogle Cloud Well-Architected Framework: AI and ML perspective provides principles and recommendations to help you optimize the performance ofyour AI and ML workloads on Google Cloud. The recommendations inthis document align with theperformance optimization pillar of the Well-Architected Framework.

AI and ML systems enable advanced automation and decision-making capabilitiesfor your organization. The performance of these systems can directly affectimportant business drivers like revenue, costs, and customer satisfaction. Torealize the full potential of AI and ML systems, you must optimize theirperformance based on your business goals and technical requirements. Theperformance optimization process often involves trade-offs. For example, adesign choice that provides the required performance might lead to higher costs.The recommendations in this document prioritize performance over otherconsiderations.

To optimize AI and ML performance, you need to make decisions regarding factorslike the model architecture, parameters, and training strategy. When you makethese decisions, consider the entire lifecycle of the AI and ML systems andtheir deployment environment. For example, very large LLMs can be highlyperformant on massive training infrastructure, but might not perform well incapacity-constrained environments like mobile devices.

The recommendations in this document are mapped to the following coreprinciples:

Establish performance objectives and evaluation methods

Your business strategy and goals are the foundation for leveraging AI and MLtechnologies. Translate your business goals into measurable key performanceindicators (KPIs). Examples of KPIs include total revenue, costs, conversionrate, retention or churn rate, customer satisfaction, and employee satisfaction.

Define realistic objectives

According tosite reliability engineering (SRE) best practices,the objectives of a service must reflect a performance level that satisfies therequirements of typical customers. This means that service objectives must berealistic in terms of scale and feature performance.

Unrealistic objectives can lead to wasted resources for minimal performancegains. Models that provide the highest performance might not lead tooptimal business outcomes. Such models might require more time and cost to trainand run.

When you define objectives, distinguish between and prioritize quality andperformance objectives:

Quality refers to inherent characteristics that determine the value of anentity. It helps you assess whether the entity meets your expectations andstandards.
Performance refers to how efficiently and effectively an entity functionsor carries out its intended purpose.

ML engineers can improve the performance metrics of a model during the trainingprocess. Vertex AI provides anevaluation service that ML engineers can use to implement standardized and repeatable tracking ofquality metrics. Theprediction efficiency of a model indicates how well amodel performs in production or at inference time. To monitor performance, useCloud Monitoring andVertex AI Model Monitoring.To select appropriate models and decide how to train them, you must translatebusiness goals into technical requirements that determine quality andperformance metrics.

To understand how to set realistic objectives and identify appropriateperformance metrics, consider the following example for an AI-powered frauddetection system:

Business objective: For a fraud detection system, an unrealisticbusiness objective is to detect 100% of fraudulent transactions accuratelywithin one nanosecond at a peak traffic of 100 billion transactions persecond. A more realistic objective is to detect fraudulent transactionswith 95% accuracy in 100 milliseconds for 90% of online predictionsduring US working hours at a peak volume of one million transactions persecond.
Performance metrics: Detecting fraud is a classification problem. Youcan measure the quality of a fraud detection system by using metrics likerecall,F1 score,andaccuracy.To track system performance or speed, you can measure inference latency.Detecting potentially fraudulent transactions might be more valuable thanaccuracy. Therefore, a realistic goal might be a high recall with a p90latency that's less than 100 milliseconds.

Monitor performance at all stages of the model lifecycle

During experimentation and training and after model deployment, monitor yourKPIs and observe any deviations from the business objectives. A comprehensivemonitoring strategy helps you make critical decisions about model quality andresource utilization, such as the following:

Decide when to stop a training job.
Determine whether a model's performance is degrading in production.
Improve the cost and time-to-market for new models.

Monitoring during experimentation and training

The objective of the experimentation stage is to find the optimal overallapproach, model architecture, and hyperparameters for a specific task.Experimentation helps you iteratively determine the configuration that providesoptimal performance and how to train the model. Monitoring helps you efficientlyidentify potential areas of improvement.

To monitor a model's quality and training efficiency, ML engineers must do thefollowing:

Visualize model quality and performance metrics for each trial.
Visualize model graphs and metrics, such as histograms of weights andbiases.
Visually represent training data.
Profile training algorithms on different hardware.

To monitor experimentation and training, consider the followingrecommendations:

Monitoring aspect	Recommendation
Model quality	To visualize and track experiment metrics like accuracy and to visualize model architecture or training data, useTensorBoard. TensorBoard is an open-source suite of tools that's compatible with ML frameworks like the following: XGBoost JAX Flax PyTorch PyTorch with accelerated linear algebra (XLA) for TPU training TensorFlow
Experiment tracking	Vertex AI Experiments integrates with managed enterprise-gradeVertex AI TensorBoard instances to support experiment tracking. This integration enables reliable storage and sharing of logs and metrics. To let multiple teams and individuals track experiments, we recommend that you use theprinciple of least privilege.
Training and experimentation efficiency	Vertex AI exports metrics toMonitoring and collects telemetry data and logs by using an observability agent. You can visualize the metrics in the Google Cloud console. Alternatively, create dashboards or alerts based on these metrics by using Monitoring. For more information, seeMonitoring metrics for Vertex AI.
NVIDIA GPUs	TheOps Agent enables GPU monitoring for Compute Engine and forother products that Ops Agent supports. You can also use theNVIDIA Data Center GPU Manager (DCGM), which is a suite of tools for managing and monitoring NVIDIA GPUs in cluster environments. Monitoring NVIDIA GPUs is particularly useful for training and serving deep learning models.
Deep debugging	To debug problems with the training code or the configuration of a Vertex AI Training job, you can inspect the training container by using aninteractive shell session.

Monitoring during serving: Streaming prediction

After you train a model and export it toVertex AI Model Registry,you can create aVertex AI endpoint. This endpoint provides an HTTP endpoint for the model.

Model Monitoring helps you identify large changes in the distribution of input or outputfeatures. You can also monitorfeature attributions in production when compared to a baseline distribution. The baselinedistribution can be the training set or it can be based on past distributions ofproduction traffic. A change in the serving distribution might imply a reductionin predictive performance compared to training.

Choose a monitoring objective: Depending on the sensitivity of a usecase to changes in the data that's provided to the model, you canmonitor different types of objectives:input feature drift, output drift, and feature attribution.Model Monitoring v2 lets you monitor models that you deploy on a managed serving platform likeVertex AI and also on self-hosted services like Google Kubernetes Engine (GKE). In addition, for granular performance tracking, you can monitorparameters at the model level rather than for an endpoint.
Monitor generative AI model serving: To ensure stability andminimize latency, particularly for LLM endpoints, set up a robust monitoringstack.Gemini models providebuilt-in metrics,like time to first token (TTFT), which you can access directly inMetrics Explorer.To monitor throughput, latency, and error rates across allGoogle Cloud models, use themodel observability dashboard.

Monitoring during serving: Batch prediction

To monitor batch prediction, you can run standard evaluation jobs in theVertex AI evaluation service.Model Monitoring supports monitoring ofbatch inferences.If you use Batch to run your serving workload, you can monitor resourceconsumption by using themetrics in Metrics Explorer.

Automate evaluation for reproducibility and standardization

To transition models from prototypes to reliable production systems, you need astandardized evaluation process. This process helps you track progress acrossiterations, compare different models, detect and mitigate bias, and ensure thatyou meet regulatory requirements. To ensure reproducibility and scalability, youmust automate the evaluation process.

To standardize and automate the evaluation process for ML performance, completethe following steps:

Define quantitative and qualitative indicators.
Choose appropriate data sources and techniques.
Standardize the evaluation pipeline.

These steps are described in the following sections.

1. Define quantitative and qualitative indicators

Computation-based metrics are calculated by using numeric formulas.Remember thattraining loss metrics might differ from the evaluation metrics that are relevant to businessgoals. For example, a model that's used for supervised fraud detection might usecross-entropy loss for training. However, to evaluate inference performance, a more relevant metricmight berecall,which indicates the coverage of fraudulent transactions.Vertex AI provides an evaluation service for metrics like recall,precision, and area under the precision-recall curve (AuPRC). For moreinformation, seeModel evaluation in Vertex AI.

Qualitative indicators, such as the fluency or entertainment value ofgenerated content, can't be objectively computed. To evaluate these indicators,you can use theLLM-as-a-judge strategy or human labeling services likeLabelbox.

2. Choose appropriate data sources and techniques

An evaluation is statistically significant when it runs on a certain minimumvolume of varied examples. Choose the datasets and techniques that you use forevaluations by using approaches such as the following:

Golden dataset: Use trusted, consistent, and accurate data samplesthat reflect the probability distribution of a model in production.
LLM-as-a-judge: Evaluate the output of a generative model by usingan LLM. This approach is relevant only to tasks where an LLM can evaluate amodel.
User feedback: To guide future improvements, capture direct userfeedback as a part of production traffic.

Depending on the evaluation technique, size and type of evaluation data, andfrequency of evaluation, you can use BigQuery or Cloud Storage as datasources, including for the Vertex AI evaluation service.

BigQuery lets you use SQL commands to run inference taskslike natural language processing and machine translation. For moreinformation, seeTask-specific solutions overview.
Cloud Storage provides acost-efficient storage solutionfor large datasets.

3. Standardize the evaluation pipeline

To automate the evaluation process, consider the following services andtools:

Vertex AI evaluation service:Provides ready-to-use primitives to track model performance as a part of theML lifecycle on Vertex AI.
Gen AI evaluation service:Lets you evaluate any generative model or application and benchmark theevaluation results against your own judgment and evaluation criteria. Thisservice also helps you perform specialized tasks like prompt engineering,retrieval-augmented generation (RAG), and AI agent optimization.
Vertex AI automatic side-by-side (AutoSxS) tool:Supports pairwise model-based evaluation.
Kubeflow:Provides special components to run model evaluations.

Run and track frequent experiments

To effectively optimize ML performance, you need a dedicated, powerful, andinteractive platform for experimentation. The platform must have the followingcapabilities:

Facilitate iterative development, so that teams can move from idea tovalidated results with speed, reliability, and scale.
Let teams efficiently discover optimal configurations that they can use totrigger training jobs.
Provide controlled access to relevant data, features, and tools toperform and track experiments.
Support reproducibility and data lineage tracking.

Treat data as a service

Isolate experimental workloads from production systems and set up appropriatesecurity controls for your data assets by using the following techniques:

Technique	Description	Benefits
Resource isolation	Isolate the resources for different environments in separate Google Cloud projects. For example, provision the resources for development, staging, and production environments in separate projects like`ml-dev`,`ml-staging`, and`ml-prod`.	Resource isolation helps to prevent experimental workloads from consuming resources that production systems need. For example, if you use a single project for experimentsand production, an experiment might consume all of the available NVIDIA A100 GPUs for Vertex AI Training. This might cause interruptions in the retraining of a critical production model.
Identity and access control	Apply the principles ofzero trust and least privilege and use workload-specific service accounts. Grant access by using predefined Identity and Access Management (IAM) roles likeVertex AI User (`roles/aiplatform.user`).	This approach helps to prevent accidental or malicious actions that might corrupt experiments.
Network security	Isolate network traffic by usingVirtual Private Cloud (VPC) networks and enforce security perimeters by usingVPC Service Controls.	This approach helps to protect sensitive data and prevents experimental traffic from affecting production services.
Data isolation	Store experimental data in separate Cloud Storage buckets and BigQuery datasets.	Data isolation prevents accidental modification of production data. For example, without data isolation, an experiment might inadvertently alter the feature values in a shared BigQuery table, which might lead to a significant degradation in model accuracy in the production environment.

Equip teams with appropriate tools

To establish a curated set of tools to accelerate the entire experimentationlifecycle—from data exploration to model training and analysis—use thefollowing techniques:

Interactive prototyping: For rapid data exploration, hypothesistesting, and code prototyping, use Colab Enterprise or managedJupyterLab instances on Vertex AI Workbench. For more information, seeChoose a notebook solution.
Scalable model training: Run training jobs on a managed service thatsupports distributed training and scalable compute resources like GPUs andTPUs. This approach helps to reduce training time from days to hours andenables more parallel experimentation. For more information, seeUse specialized components for training.
In-database ML: Train models directly inBigQuery ML using SQL.This technique helps to eliminate data movement and acceleratesexperimentation for analysts and SQL-centric users.
Tracking experiments: Create a searchable and comparable history ofexperiment data by logging parameters, metrics, and artifacts for everyexperiment run. For more information, seeBuild a data and model lineage system.
Optimizing generative AI: To optimize the performance of generativeAI applications, you must experiment with prompts, model selection, andfine-tuning. For rapid prompt engineering, useVertex AI Studio.To experiment with foundation models (like Gemini) and find amodel that's appropriate for your use case and business goals, useModel Garden.

Standardize for reproducibility and efficiency

To ensure that experiments consume resources efficiently and produce consistentand trustworthy results, standardize and automate experiments by using thefollowing approaches:

Ensure consistent environments by using containers: Package yourtraining code and dependencies as Docker containers. Manage and serve thecontainers by usingArtifact Registry.This approach lets you reproduce issues on different machines by repeatingexperiments in identical environments. Vertex AI providesprebuilt containers for serverless training.
Automate ML workflows as pipelines: Orchestrate the end-to-end MLworkflow as a codified pipeline by usingVertex AI Pipelines.This approach helps to enforce consistency, ensure reproducibility, andautomatically track all of the artifacts and metadata inVertex ML Metadata.
Automate provisioning with infrastructure-as-code (IaC): Define anddeploy standardized experimentation environments by using IaC tools likeTerraform.To ensure that every project adheres to a standardized set of configurationsfor security, networking, and governance, useTerraform modules.

Build and automate training and serving infrastructure

To train and serve AI models, set up a robust platform that supports efficientand reliable development, deployment, and serving. This platform lets your teamsefficiently improve the quality and performance of training and serving in thelong run.

Use specialized components for training

A reliable training platform helps to accelerate performance and provides astandardized approach to automate repeatable tasks in the ML lifecycle–from datapreparation to model validation.

Data collection and preparation: For effective model training, youneed to collect and prepare the data that's necessary for training,testing, and validation. The data might come from different sources and beof different data types. You also need to reuse relevant data acrosstraining runs and share features across teams. To improve the repeatabilityof the data collection and preparation phase, consider the followingrecommendations:
- Improve data discoverability withDataplex Universal Catalog.
- Centralize feature engineering inVertex AI Feature Store.
- Preprocess data by usingDataflow.
Training execution: When you train a model, you use data to create amodel object. To do this, you need to set up the necessary infrastructureand dependencies of the training code. You also need to decide how topersist the training models, track training progress, evaluate the model,and present the results. To improve training repeatability, consider thefollowing recommendations:
- Follow the guidelines inAutomate evaluation for reproducibility and standardization.
- To submit training jobs on one or more worker nodes without managing theunderlying infrastructure provisioning or dependencies, use theCustomJob resource in Vertex AI Training. You can also use theCustomJob resource forhyperparameter tuning jobs.
- Optimize scheduling goodput by leveragingDynamic Workload Scheduler orreservations for Vertex AI Training.
- Register your models toVertex AI Model Registry.

Training orchestration: Deploy training workloads as stages of apipeline by usingVertex AI Pipelines. This service provides managedKubeflow andTFX services. Define each step of the pipeline as acomponent that runs in a container. Each component works like a function, with inputparameters and output artifacts that become inputs to the subsequentcomponents in the pipeline. To optimize the efficiency of your pipeline,consider the recommendations in the following table:

Goal	Recommendations
Implement core automation.	UseKubeflow Pipelines components. UseControl Flows for advanced pipeline designs, such as conditional gates. Integrate thepipeline compilation as a part of your CI/CD flow. Execute the pipeline withCloud Scheduler.
Increase speed and cost efficiency.	To increase the speed of iterations and reduce cost, use theexecution cache of Vertex AI Pipelines. Specify the machine configuration for each component based on the resource requirements of each step.
Increase robustness.	To increase robustness against temporary issues without requiring manual intervention, configureretries. Tofail fast and iterate efficiently, configure afailure policy. To handle failures,configure email notifications.
Implement governance and tracking.	To experiment with different model configurations or training configurations,add pipeline runs to experiments. Monitorlogs andmetrics for Vertex AI Pipelines. Follow the recommendations inMonitor performance at all stages of the model lifecycle.

Use specialized infrastructure for prediction

To reduce the toil of managing infrastructure and model deployments, automatethe repeatable task flows. A service-oriented approach lets you focus on speedand fastertime to value (TTV). Consider thefollowing recommendations:

Recommendation	Techniques
Implement automatic deployment.	After a model is trained, use automation to deploy the model. Vertex AI Pipelines providesspecialized components for deploying models to Vertex AI endpoints. You can manage automatic deployment by usingKubeflow Control Flows. To decide whether to deploy a model based on an evaluation against a validation set, take advantage of theevaluation components of the pipeline. You can automate deployment for batch inference by using specializedbatch prediction components in Vertex AI.
Take advantage of managed scaling features.	Vertex AI provides batch and online inference capabilities that let you configure thescaling behavior. Cloud Run supportsautoscaling GPUs.
Optimize latency and throughput on Vertex AI endpoints.	Choose an appropriatemachine type. Choose an appropriateendpoint type for your models.Dedicated private endpoints might help to reduce latency. If you use Vertex AI Feature Store, choose an appropriate online serving type. Optimized online serving with Private Service Connect might provide low latency. For more information, seeOnline service types.
Optimize resource utilization.	To optimize resource utilization, you canshare resources across model deployments. Instead of using a separate VM for each model (default behavior), you can deploy several models on a single VM. To assure or optimize resource availability, usereservations,Dynamic Workload Scheduler, orSpot VMs.
Optimize model deployment.	To avoid breaking changes, you can perform gradual migration and implement A/B testing by automaticallydeploying several models to the same endpoint. Vertex AI decouples the endpoint layer from the models, which lets you split traffic between model versions. To replace a deployed model, use arolling deployment strategy.
Monitor performance.	To improve serving performance in production, observelogs andinference metrics for Vertex AI endpoints. To understand the behavior of models in production, you need tools to monitor quality and performance.Model Monitoring v2 lets you monitor several versions of a model and iterate rapidly.

Match design choices to performance requirements

When you make design choices to improve performance, assess whether the choicessupport your business requirements or are wasteful and counterproductive. Tochoose appropriate infrastructure, models, and configurations, identifyperformance bottlenecks and assess how they're linked to performance metrics.For example, even on very powerful GPU accelerators, training tasks canexperience performance bottlenecks. These bottlenecks can be caused by data I/Oissues in the storage layer or by performance limitations of the model.

Focus on holistic performance of the ML flow

As training requirements grow in terms of model size and cluster size, thefailure rate and infrastructure costs might increase. Therefore, the cost offailure might increase quadratically. You can't rely solely on conventionalresource-efficiency metrics likemodel FLOPs utilization (MFU).To understand why MFU might not be a sufficient indicator of overall trainingperformance, examine the lifecycle of a typical training job. The lifecycleconsists of the following cyclical flow:

Cluster creation: Worker nodes are provisioned.
Initialization: Training is initialized on the worker nodes.
Training execution: Resources are used for forward or backwardpropagation.
Interruption: The training process is interrupted during modelcheckpointing or due to worker-node preemptions.

After each interruption, the preceding flow is repeated.

The training execution step constitutes a fraction of the lifecycle of an MLjob. Therefore, the utilization of worker nodes for the training execution stepdoesn't indicate the overall efficiency of the job. For example, even if thetraining execution step runs at 100% efficiency, the overall efficiency mightbe low if interruptions occur frequently or if it takes a long time to resumetraining after interruptions.

Adopt and track goodput metrics

To ensure holistic performance measurement and optimization, shift your focusfrom conventional resource-efficiency metrics like MFU togoodput.Goodput considers the availability and utilization of your clusters and computeresources and it helps to measure resource efficiency across multiple layers.

The focus of goodput metrics is theoverall progress of a job rather thanwhether the jobappears to be busy. Goodput metrics help you optimizetraining jobs for tangible overall gains in productivity and performance.

Goodput gives you a granular understanding of potential losses in efficiencythrough the following metrics:

Scheduling goodput is the fraction of time when all of the resources thatare required for training or serving are available for use.
Runtime goodput represents the proportion of useful training stepsthat are completed during a given period.
Program goodput is the peak hardware performance or MFU that atraining job can extract from the accelerator. It depends on efficientutilization of the underlying compute resources during training.

Optimize scheduling goodput

To optimize scheduling goodput for a workload, you must identify the specificinfrastructure requirements of the workload. For example, batch inference,streaming inference, and training have different requirements:

Batch inference workloads might accommodate some interruptions anddelays in resource availability.
Streaming inference workloads require stateless infrastructure.
Training workloads require longer term infrastructure commitments.

Choose appropriate obtainability modes

In cloud computing,obtainability is the ability to provision resources whenthey're required. Google Cloud provides the following obtainability modes:

On-demand VMs: You provision Compute Engine VMs when they'reneeded and run your workloads on the VMs. The provisioning request issubject to the availability of resources, such as GPUs. If a sufficientquantity of a requested resource type isn't available, the request fails.
Spot VMs:You create VMs by using unused compute capacity. Spot VMs arebilled at a discounted price when compared to on-demand VMs, butGoogle Cloud mightpreempt Spot VMs at any time. We recommend that you useSpot VMs for stateless workloads that can fail gracefullywhen the host VMs are preempted.
Reservations:You reserve capacity as a pool of VMs. Reservations are ideal for workloadsthat need capacity assurance. Use reservations to maximize schedulinggoodput by ensuring that resources are available when required.
Dynamic Workload Scheduler:This provisioning mechanism queues requests for GPU-powered VMs in adedicated pool. Dynamic Workload Scheduler helps you avoid the constraints of theother obtainability modes:
- Stock-out situations in the on-demand mode.
- Statelessness constraint and preemption risk of Spot VMs.
- Cost and availability implications of reservations.

The following table summarizes the obtainability modes for Google Cloudservices and provides links to relevant documentation:

Product	On-demand VMs	Spot VMs	Reservations	Dynamic Workload Scheduler
Compute Engine	Create and start a Compute Engine instance	About Spot VMs	About reservations	Create a MIG with GPU VMs
GKE	Add and manage node pools	About Spot VMs in GKE	Consuming reserved zonal resources	GPU, TPU, and H4D consumption with flex-start provisioning
Cloud Batch	Create and run a job	Batch job with Spot VMs	Ensure resource availability using VM reservations	Use GPUs and flex-start VMs
Vertex AI Training	Create a serverless training job	Use Spot VMs for training jobs	Use reservations for training jobs	Schedule training jobs based on resource availability
Vertex AI	Getbatch inferences andonline inferences from custom-trained models.	Use Spot VMs for inference	Use reservations for online inference	Use flex-start VMs for inference

Plan for maintenance events

You can improve scheduling goodput by anticipating and planning forinfrastructure maintenance and upgrades.

GKE lets you control when automatic cluster maintenance canbe performed on your clusters. For more information, seeMaintenance windows and exclusions.
Compute Engine provides the following capabilities:
- To keep an instance running during a host event, such asplanned maintenance for the underlying hardware, Compute Engineperforms alive migration of the instance to another host in the samezone. For more information, seeLive migration process during maintenance events.
- To control how an instance responds when the underlying hostrequires maintenance or has an error, you canset a host maintenance policy for the instance.
For information about planning for host events that are related to largetraining clusters in AI Hypercomputer, seeManage host events across compute instances.

Optimize runtime goodput

The model training process is frequently interrupted by events like modelcheckpointing and resource preemption. To optimize runtime goodput, you mustensure that the system resumes training and inference efficiently after therequired infrastructure is ready and after any interruption.

During model training, AI researchers use checkpointing to track progress andminimize the learning lost due to resource preemptions. Larger model sizes makecheckpointing interruptions longer, which further affects overall efficiency.After interruptions, the training application must be restarted on every node inthe cluster. These restarts can take some time because the necessary artifactsmust be reloaded.

To optimize runtime goodput, use the following techniques:

Technique	Description
Implement automatic checkpointing.	Frequent checkpointing lets you track the progress of training at a granular level. However, the training process is interrupted for each checkpoint, which reduces runtime goodput. To minimize interruptions, you can set up automatic checkpointing, where the host'sSIGTERM signal triggers the creation of a checkpoint. This approach limits checkpointing interruptions to when the host needs maintenance. Remember that some hardware failures might not trigger SIGTERM signals; therefore, you must find a suitable balance between automatic checkpointing and SIGTERM events. You can set up automatic checkpointing by using the following techniques: Preserve training progress on TPUs by usingautocheckpoint. Use the built-in automatic checkpointing in theMaxText framework. Configure host maintenance notifications forGPUs andTPUs. UseOrbax for ML using JAX. Implementmulti-tier checkpointing in GKE.
Use appropriate container-loading strategies.	In a GKE cluster, before nodes can resume training jobs, it might take some time to complete loading the required artifacts like data or model checkpoints. To reduce the time that's required to reload data and resume training, use the following techniques: Preload data or container images by usingsecondary boot disks. Pull container images by usingImage streaming. Uselarger persistent disks or provision ephemeral storage by usingLocal SSD. Build and decompress images by usingZstandard. Preload container images by usingDaemonSets. For more information about how to reduce the data reload time, seeTips and tricks to reduce cold-start latency on GKE.
Use the compilation cache.	If training requires a compilation-based stack, check whether you can use a compilation cache. When you use a compilation cache, the computation graph isn't recompiled after each training interruption. The resulting reductions in time and cost are particularly beneficial when you use TPUs.JAX lets you store the compilation cache in a Cloud Storage bucket and then use the cached data in the case of interruptions.

Technique

Description

Implement automatic checkpointing.

Frequent checkpointing lets you track the progress of training at a granular level. However, the training process is interrupted for each checkpoint, which reduces runtime goodput. To minimize interruptions, you can set up automatic checkpointing, where the host'sSIGTERM signal triggers the creation of a checkpoint. This approach limits checkpointing interruptions to when the host needs maintenance. Remember that some hardware failures might not trigger SIGTERM signals; therefore, you must find a suitable balance between automatic checkpointing and SIGTERM events.

You can set up automatic checkpointing by using the following techniques:

Preserve training progress on TPUs by usingautocheckpoint.
Use the built-in automatic checkpointing in theMaxText framework.
Configure host maintenance notifications forGPUs andTPUs.
UseOrbax for ML using JAX.
Implementmulti-tier checkpointing in GKE.

Use appropriate container-loading strategies.

In a GKE cluster, before nodes can resume training jobs, it might take some time to complete loading the required artifacts like data or model checkpoints. To reduce the time that's required to reload data and resume training, use the following techniques:

Preload data or container images by usingsecondary boot disks.
Pull container images by usingImage streaming.
Uselarger persistent disks or provision ephemeral storage by usingLocal SSD.
Build and decompress images by usingZstandard.
Preload container images by usingDaemonSets.

For more information about how to reduce the data reload time, seeTips and tricks to reduce cold-start latency on GKE.

Use the compilation cache.

If training requires a compilation-based stack, check whether you can use a compilation cache. When you use a compilation cache, the computation graph isn't recompiled after each training interruption. The resulting reductions in time and cost are particularly beneficial when you use TPUs.JAX lets you store the compilation cache in a Cloud Storage bucket and then use the cached data in the case of interruptions.

Optimize program goodput

Program goodput represents peak resource utilization during training, which isthe conventional way to measure training and serving efficiency. To improveprogram goodput, you need an optimized distribution strategy, efficientcompute-communication overlap, optimized memory access, and efficientpipelines.

To optimize program goodput, use the following strategies:

Strategy	Description
Use framework-level customization options.	Frameworks or compilers likeAccelerated Linear Algebra (XLA) provide many key components of program goodput. To further optimize performance, you can customize fundamental components of the computation graph. For example,Pallas supports custom kernels for TPUs and GPUs.
Offload memory to the host DRAM.	For large-scale training, which requires significantly high memory from accelerators, you can offload some memory usage to the host DRAM. For example, XLA lets youoffload model activations from the forward pass to the host memory instead of using the accelerator's memory. With this strategy, you can improve training performance by increasing the model capacity or the batch size.
Leveragequantization during training.	You can improve training efficiency and program goodput by leveraging model quantization during training. This strategy reduces the precision of the gradients or weights during certain steps of the training; therefore, program goodput improves. However, this strategy might require additional engineering effort during model development. For more information, see the following resources: Quantization-aware training on TensorFlow Quantization-aware training on PyTorch Qwix: a quantization library for JAX
Implement parallelism.	To increase the utilization of the available compute resources, you can use parallelism strategies at the model level during training and when loading data. For information about model parallelism, see the following: Pipeline parallelism on PyTorch Pipeline parallelism on GPUs with JAX Pipeline parallelism in DeepSpeed To achieve data parallelism, you can use tools like the following: FSDP on PyTorch Distributed training onTensorFlow orKeras Distributed data loading with JAX

Focus on workload-specific requirements

To ensure that your performance optimization efforts are effective and holistic,you must match optimization decisions to the specific requirements of yourtraining and inference workloads. Choose appropriate AI models and use relevantprompt optimization strategies. Select appropriate frameworks and tools based onthe requirements of your workloads.

Identify workload-specific requirements

Evaluate the requirements and constraints of your workloads across the followingareas:

Area	Description
Task and quality requirements	Define the core task of the workload and the performance baseline. Answer questions like the following: Does the task involve regression, classification, or generation? What is the minimum acceptable quality: for example, accuracy, precision, or recall?
Serving context	Analyze the operational environment where you plan to deploy the model. The serving context often has a significant impact on design decisions. Consider the following factors: Latency: Does the system need real-time, synchronous predictions, or is an asynchronous, batch process acceptable? Connectivity: Do you need to run the model offline or on an edge device? Volume: What are the expected average and peak levels for prediction load? The load levels influence your decisions regarding scaling strategies and infrastructure.
Team skills and economics	Assess the business value of buying the solution against the cost and complexity of building and maintaining it. Determine whether your team has the specialized skills that are required for custom model development or whether a managed service might provide faster time to value.

Area

Description

Task and quality requirements

Define the core task of the workload and the performance baseline. Answer questions like the following:

Does the task involve regression, classification, or generation?
What is the minimum acceptable quality: for example, accuracy, precision, or recall?

Serving context

Analyze the operational environment where you plan to deploy the model. The serving context often has a significant impact on design decisions. Consider the following factors:

Latency: Does the system need real-time, synchronous predictions, or is an asynchronous, batch process acceptable?
Connectivity: Do you need to run the model offline or on an edge device?
Volume: What are the expected average and peak levels for prediction load? The load levels influence your decisions regarding scaling strategies and infrastructure.

Team skills and economics

Assess the business value of buying the solution against the cost and complexity of building and maintaining it. Determine whether your team has the specialized skills that are required for custom model development or whether a managed service might provide faster time to value.

Choose an appropriate model

If an API or an open model can deliver the required performance and quality, usethat API or model.

For modality-specific tasks like optical character recognition (OCR),labeling, and content moderation, choose ML APIs like the following:
For generative AI applications, consider Google models likeGemini,Imagen,andVeo.
ExploreModel Garden and choose from a curated collection of foundation and task-specific Googlemodels. Model Garden also provides open models likeGemma and third-party models, which you can run inVertex AI or deploy on runtimes like GKE.
If a task can be completed by using either an ML API or a generative AImodel, consider the complexity of the task. For complex tasks, large modelslike Gemini might provide higher performance than smallermodels.

Improve quality through better prompting

To improve the quality of your prompts at scale, use theVertex AI prompt optimizer.You don't need to manually rewrite system instructions and prompts. The promptoptimizer supports the following approaches:

Zero-shot optimization: A low-latency approach that improves asingle prompt or system instruction in real time.
Data-driven optimization: An advanced approach that improvesprompts by evaluating a model's responses to sample prompts againstspecific evaluation metrics.

For more prompt-optimization guidelines, seeOverview of prompting strategies.

Improve performance for ML and generative AI endpoints

To improve the latency or throughput (tokens per second) for ML and generativeAI endpoints, consider the following recommendations:

Cache results by usingMemorystore.
For generative AI APIs, use the following techniques:
- Context caching:Achieve lower latency for requests that contain repeated content.
- Provisioned Throughput:Improve the throughput for Gemini endpoints by reserving throughput capacity.

For self-hosted models, consider the following optimized inferenceframeworks:

Framework	Notebooks and guides
OptimizedvLLM containers in Model Garden	Deploy open models with pre-built containers
GPU-based inference on GKE	Serve an LLM with Ray Serve Gemma open models with Hugging Face TGI Serve Gemma open models with vLLM Serve Gemma open models with TensorRT-LLM Serve scalable LLMs with TorchServe
TPU-based inference	Serve open models using TPUs on GKE with Optimum TPU Serve open models using vLLM TPU on Cloud TPU

Use low-code solutions and tuning

If pre-trained models don't meet your requirements, you can improve theirperformance for specific domains by using the solutions:

AutoML is a low-code solution to improve inference results with minimal technicaleffort for a wide range of tasks. AutoML lets you createmodels that are optimized on several dimensions: architecture, performance,and training stage (through checkpointing).
Tuning helps you achieve higher quality, more stable generation and lower latencywith shorter prompts and without a lot of data. We recommend that you starttuning by using thedefault values for hyperparameters.For more information, seeSupervised Fine Tuning for Gemini: A best practices guide.

Optimize self-managed training

In some cases, you might decide to retrain a model or fully manage a fine-tuningjob. This approach requires advanced skills and additional time depending on themodel, framework, and resources that you use.

Take advantage of performance-optimized framework options, such as thefollowing:

Use deep learningimages orcontainers,which include the latest software dependencies andGoogle Cloud-specific libraries.
Run model training with Ray on Google Cloud:
- Ray on Vertex AI lets you bring Ray's distributed training framework toCompute Engine or GKE and it simplifies theoverhead of managing the framework.
- You can self-manage Ray on GKE with KubeRay by deployingtheRay operator on an existing cluster.
Deploy training workloads on a compute cluster that you provision byusing the open-sourceCluster Toolkit.To efficiently provision performance-optimized clusters, use YAML-basedblueprints. Manage the clustersby using schedulers likeSlurm and GKE.
Train standard model architectures by usingGPU-optimized recipes.

Build training architectures and strategies that optimize performance by usingthe following techniques:

Implementdistributed training on Vertex AI or on the frameworks described earlier.Distributed training enables model parallelism and data parallelism, whichcan help to increase the training dataset size and model size, and help toreduce training time.
For efficient model training and to explore different performanceconfigurations, run checkpointing at appropriate intervals. For moreinformation, seeOptimize runtime goodput.

Optimize self-managed serving

For self-managed serving, you need efficient inference operations and a highthroughput (number of inferences per unit of time).

To optimize your model for inference, consider the following approaches:

Quantization:Reduce the model size by representing its parameters in a lower precisionformat. This approach helps to reduce memory consumption and latency.However, quantizationafter training might change model quality. Forexample, quantization after training might cause a reduction in accuracy.
- Post-training quantization (PTQ) is a repeatable task. Major MLframeworks likePyTorch andTensorFlow support PTQ.
- You can orchestrate PTQ by using a pipeline onVertex AI Pipelines.
- To stabilize model performance and benefit from reductions inmodel size, you can useQwix.
Tensor parallelism:Improve inference throughput by distributing computational load acrossmultiple GPUs.
Memory optimization:Increase throughput and optimize attention caching, batch sizes, and inputsizes.

Use inference-optimized frameworks, such as the following:

For generative models, use an open framework likeMaxText,MaxDiffusion,orvLLM.
Runprebuilt container images on Vertex AI for predictions and explanations. If you choose TensorFlow,then use theoptimized TensorFlow runtime.This runtime enables more efficient and lower cost inference than prebuiltcontainers that use open-source TensorFlow.
Runmulti-host inferencing with large models on GKE by using theLeaderWorkerSet (LWS) API.
Leverage theNVIDIA Triton inference server for Vertex AI.
Simplify the deployment of inference workloads on GKE byusing LLM-optimized configurations. For more information, seeAnalyze model serving performance and costs with GKE Inference Quickstart.

Optimize resource consumption based on performance goals

Resource optimization helps to accelerate training, iterate efficiently, improvemodel quality, and increase the serving capacity.

Choose appropriate processor types

Your choice of compute platform can have a significant impact on the trainingefficiency of a model.

Deep learning models perform well on GPUs and TPUs because such modelsrequire large amounts of memory and parallel matrix computation. For moreinformation about workloads that are well suited to CPUs, GPUs, and TPUs,seeWhen to use TPUs.
Compute-optimized VMs are ideal for HPC workloads.

Optimize training and serving on GPUs

To optimize the performance of training and inference workloads that aredeployed on GPUs, consider the following recommendations:

Recommendation	Description
Select appropriate memory specifications.	When you chooseGPU machine types, select memory specifications based on the following factors: Model capacity: The total memory size (memory footprint) of the model's trainable parameters and gradients. Workload type: Training requires more memory than serving. Training batch size: For larger batches, more activations are stored and the memory requirement is higher. Data type: Workloads that process high-quality images or that use high-precision arithmetics need machine types with larger memory specifications.
Assess core and memory bandwidth requirements.	In addition to memory size, consider other requirements like the number of Tensor cores and memory bandwidth. These factors influence the speed of data access and computations on the chip.
Choose appropriate GPU machine types.	Training and serving might need different GPU machine types. Training jobs need one or more GPUs, or even multiple nodes, with significantly large memory and bandwidth. Inference workloads require relatively less memory and fewer high-performance GPUs. We recommend that you use large machine types for training and smaller, cost-effective machine types for inference. To detect resource utilization issues, use monitoring tools like the NVIDIA DCGM agent and adjust resources appropriately.
Leverage GPU sharing on GKE.	Dedicating a full GPU to a single container might be an inefficient approach in some cases. To help you overcome this inefficiency, GKE supports the followingGPU-sharing strategies: Multi-instance GPU: Allocate a single GPU to multiple containers on a node. GPU time-sharing: Allocate GPU time across multiple containers. Each container gets a time slice. NVIDIA multi-process service (MPS): Run multi-process workloads concurrently on a single GPU by using a binary-compatibleCUDA implementation. To maximize resource utilization, we recommend that you use an appropriate combination of these strategies. For example, when you virtualize a large H100 GPU by using the GPU time-sharingand multi-instance GPU strategies, the serving platform can scale up and down based on traffic. GPU resources are repurposed in real time based on the load on the model containers.
Optimize routing and load balancing.	When you deploy multiple models on a cluster, you can useGKE Inference Gateway for optimized routing and load balancing. Inference Gateway extends the routing mechanisms of theKubernetes Gateway API by using the following capabilities: Request body routing logic and endpoint selection. These features enable optimization strategies like using the same base model formultiple LoRA adapters. Generative AI-specificsecurity extensions like Model Armor.
Share resources for Vertex AI endpoints.	You can configure multiple Vertex AI endpoints to use a common pool of resources. For more information about this feature and its limitations, seeShare resources across deployments.

Optimize training and serving on TPUs

TPUs are Google chips that help solvemassive scale challenges for ML algorithms. These chips provide optimal performance for AItraining and inference workloads. When compared to GPUs, TPUs provide higherefficiency for deep learning training and serving. For information about the usecases that are suitable for TPUs, seeWhen to use TPUs.TPUs are compatible with ML frameworks likeTensorFlow,PyTorch,andJAX.

To optimize TPU performance, use the following techniques, which are describedin theCloud TPU performance guide:

Maximize the batch size for each TPU memory unit.
Ensure that TPUs aren't idle. For example, implement parallel data reads.
Optimize the XLA compiler. Adjust tensor dimensions as required and avoidpadding. XLA automatically optimizes for graph execution performance byusing tools like fusion and broadcasting.

Optimize training on TPUs and serving on GPUs

TPUs support efficient training. GPUs provide versatility and wider availabilityfor inference workloads. To combine the strengths of TPUs and GPUs, you cantrain models on TPUs and serve them on GPUs. This approach can help to reduceoverall costs and accelerate development, particularly for large models. Forinformation about the locations where TPU and GPU machine types are available,seeTPU regions and zones andGPU locations.

Optimize the storage layer

The storage layer of your training and serving infrastructure is critical toperformance. Training jobs and inferencing workloads involve the followingstorage-related activities:

Loading and processing data.
Checkpointing the model during training.
Reloading binaries to resume training after node preemptions.
Loading the model efficiently to handle inferencing at scale.

The following factors determine your requirements for storage capacity,bandwidth, and latency:

Model size
Volume of the training dataset
Checkpointing frequency
Scaling patterns

If your training data is in Cloud Storage, you can reduce the data loadinglatency by usingfile caching in Cloud Storage FUSE. Cloud Storage FUSE lets you mount aCloud Storagebucket on compute nodes that have Local SSD disks. For information aboutimproving the performance of Cloud Storage FUSE, seePerformance tuning best practices.

APyTorch connector to Cloud Storage provides high performance for data reads and writes. Thisconnector is particularly beneficial for training with large datasets and forcheckpointing large models.

Compute Engine supports variousPersistent Disk types. WithGoogle Cloud Hyperdisk ML,you can provision the required throughput and IOPS based on training needs. Tooptimize disk performance, start by resizing the disks and then considerchanging the machine type. For more information, seeOptimize Persistent Disk performance.To load test the read-write performance and latency at the storage layer, youcan use tools likeFlexible I/O tester (FIO).

For more information about choosing and optimizing storage services for your AIand ML workloads, see the following documentation:

Optimize the network layer

To optimize the performance of AI and ML workloads, configure yourVPC networks to provide adequate bandwidth and maximum throughputwith minimum latency. Consider the following recommendations:

Recommendation	Suggested techniques for implementation
Optimize VPC networks.	Configure a largemaximum transmission unit (MTU) size. UsePremium Tier networking. For improved connectivity on VMs with NVIDIA ConnectX machine types (A3 Ultra and later), configure the VPC network with theRemote Direct Memory Access (RDMA) network profile.
Place VMs closer to each other.	Definecompact placement policies in Compute Engine. UseTopology Aware Scheduling in GKE.
Configure VMs to support higher network speeds.	UseGoogle Virtual NICs (gVNICs). Configureper VM Tier_1 networking. For information about testing Tier_1 networking performance, seeBenchmark higher network bandwidth for VM instances. For multi-GPU training, optimize GPU-to-GPU network performance by using NVIDIA-specificconnectivity configurations. InstallNVIDIA GPUDirect and configure theNVIDIA collective communication library (NCCL). For more information, seeInstall the GPUDirect binary and configure NCCL. To test the performance of NCCL operations, you can use standardNCCL tests.

Link performance metrics to design and configuration choices

To innovate, troubleshoot, and investigate performance issues, you mustestablish a clear link between design choices and performance outcomes. You needa reliable record of the lineage of ML assets, deployments, model outputs, andthe corresponding configurations and inputs that produced the outputs.

Build a data and model lineage system

To reliably improve performance, you need the ability to trace every modelversion back to the exact data, code, and configurations that were used toproduce the model. As you scale a model, such tracing becomes difficult. Youneed a lineage system that automates the tracing process and creates a recordthat's clear and can be queried for every experiment. This system lets yourteams efficiently identify and reproduce the choices that lead to the optimallyperforming models.

To view and analyze the lineage of pipeline artifacts for workloads inVertex AI, you can useVertex ML Metadata orDataplex Universal Catalog.Both options let you register events or artifacts to meet governancerequirements and to query the metadata and retrieve information when needed.This section provides an overview of the two options. For detailed informationabout the differences between Vertex ML Metadata andDataplex Universal Catalog, seeTrack the lineage of pipeline artifacts.

Default implementation: Vertex AI ML Metadata

Your first pipeline run orexperiment in Vertex AIcreates a default Vertex ML Metadata service. The parameters andartifact metadata that the pipeline consumes and generates are automaticallyregistered to a Vertex ML Metadata store. Thedata model that's used to organize and connect the stored metadata contains the followingelements:

Context: A group of artifacts and executions that represents anexperimentation run.
Execution: A step in a workflow, like data validation or model training.
Artifact: An input or output entity, object, or piece of data that aworkflow produces and consumes.
Event: A relationship between an artifact and an execution.

By default, Vertex ML Metadata captures andtracks all input and output artifacts of a pipeline run. It integrates these artifactswithVertex AI Experiments,Model Registry,andVertex AI managed datasets.

Autologging is a built-in feature in Vertex AI Training to automatically logdata to Vertex AI Experiments. To efficiently track experiments foroptimizing performance, use the built-in integrations betweenVertex AI Experiments and the associatedVertex ML Metadata service.

Vertex ML Metadata provides a filtering syntax and operators to runqueries about artifacts, executions, and contexts. When required, your teams canefficiently retrieve information about a model's registry link and its datasetor evaluation for a specific experiment run. This metadata can help toaccelerate the discovery of choices that optimize performance. For example, youcancompare pipeline runs,compare models,andcompare experiment runs.For more information, including example queries, seeAnalyze Vertex ML Metadata.

Alternative implementation: Dataplex Universal Catalog

Dataplex Universal Catalog discovers metadata fromGoogle Cloud resources,including Vertex AI artifacts. You can also integrate acustom data source.

Dataplex Universal Catalog can read metadata across multipleregions and organization-wide stores, whereas Vertex ML Metadatais a project-specific resource. When compared to Vertex ML Metadata,Dataplex Universal Catalog involves more setup effort. However,Dataplex Universal Catalog might be appropriate when you needintegration with your wider data portfolio in Google Cloud and withorganization-wide stores.

Dataplex Universal Catalog discovers and harvests metadata forprojects where the Data Lineage API is enabled. The metadata in the catalogis organized by using adata model that consists of projects, entry groups, entries, and aspects.Dataplex Universal Catalog provides aspecific syntax that you can use to discover artifacts. If required, you canmap Vertex ML Metadata artifacts to Dataplex Universal Catalog.

Use explainability tools

The behavior of an AI model is based on data that was used to train the model.This behavior is encoded asparametersin mathematical functions. Understanding exactlywhy a model performs in acertain way can be difficult. However, this knowledge is critical forperformance optimization.

For example, consider an image classification model where the training datacontains images of only red cars. The model might learn to identify the "car"label based on the color of the object rather than the object's spatial andshape attributes. When the model is tested with images that show cars ofdifferent colors, the performance of the model might degrade. The followingsections describe tools that you can use to identify and diagnose such problems.

Detect data biases

In theexploratory data analysis (EDA) phase of an ML project, you identify issues with the data, such asclass-imbalanced datasets andbiases.

In production systems, you often retrain models and run experiments withdifferent datasets. To standardize data and compare across experiments, werecommend a systematic approach to EDA that includes the followingcharacteristics:

Automation: As a training set grows in size, the EDA process must runautomatically in the background.
Wide coverage: When you add new features, the EDA must reveal insightsabout the new features.

Many EDA tasks are specific to the data type and the business context. Toautomate the EDA process, use BigQuery or a managed dataprocessing service like Dataflow. For more information, seeClassification on imbalanced data andData bias metrics for Vertex AI.

Understand model characteristics and behavior

In addition to understanding the distribution of data in the training andvalidation sets and their biases, you need to understand a model'scharacteristics and behavior at prediction time. To understand model behavior,use the following tools:

Tool	Description	Purposes
Example-based explanations	You can useexample-based explanations in Vertex Explainable AI to understand a prediction by finding the most similar examples from the training data. This approach is based on the principle that similar inputs yield similar outputs.	Identify and fix gaps in the training data by finding similar examples for an incorrect prediction. Classify inputs that the model was not originally trained to recognize, by usingneighbors from a reference set. Identify anomalies by finding inputs that are significantly different from all of the known training examples. Make data collection more efficient by identifying ambiguous cases, like when neighbors have mixed labels. Prioritize such cases for human review.
Feature-based explanations	For predictions that are based on tabular data or images, feature-based explanations show how much eachfeature affects a prediction when it's compared to a baseline. Vertex AI provides differentfeature attribution methods depending on the model type and task. The methods typically rely on sampling and sensitivity analysis to measure how much the output changes in response to changes in an input feature.	Identify biases that a validation step might miss. Optimize performance by identifying the features that are most important for a prediction. To increase a model's quality and performance, ML engineers can intentionally add, remove, or engineer features.
What-If Tool	The What-If Tool was developed by Google'sPeople + AI Research (PAIR) initiative to help you understand and visualize the behavior of image and tabular models. For examples of using the tool, seeWhat-If Tool Web Demos.	Debug and find the root cause of incorrect predictions. Investigate how a model performs across different subsets of data by identifying biases through a fairness analysis. Understand a model's behavior, particularly the relationship between a model's predictions and input features. Compare predictions by using a visual comparison tool that requires either two models or a ground truth baseline.

Tool

Description

Purposes

Example-based explanations

You can useexample-based explanations in Vertex Explainable AI to understand a prediction by finding the most similar examples from the training data. This approach is based on the principle that similar inputs yield similar outputs.

Identify and fix gaps in the training data by finding similar examples for an incorrect prediction.
Classify inputs that the model was not originally trained to recognize, by usingneighbors from a reference set.
Identify anomalies by finding inputs that are significantly different from all of the known training examples.
Make data collection more efficient by identifying ambiguous cases, like when neighbors have mixed labels. Prioritize such cases for human review.

Feature-based explanations

For predictions that are based on tabular data or images, feature-based explanations show how much eachfeature affects a prediction when it's compared to a baseline.

Vertex AI provides differentfeature attribution methods depending on the model type and task. The methods typically rely on sampling and sensitivity analysis to measure how much the output changes in response to changes in an input feature.

Identify biases that a validation step might miss.
Optimize performance by identifying the features that are most important for a prediction. To increase a model's quality and performance, ML engineers can intentionally add, remove, or engineer features.

What-If Tool

The What-If Tool was developed by Google'sPeople + AI Research (PAIR) initiative to help you understand and visualize the behavior of image and tabular models. For examples of using the tool, seeWhat-If Tool Web Demos.

Debug and find the root cause of incorrect predictions.
Investigate how a model performs across different subsets of data by identifying biases through a fairness analysis.
Understand a model's behavior, particularly the relationship between a model's predictions and input features.
Compare predictions by using a visual comparison tool that requires either two models or a ground truth baseline.

Contributors

Authors:

Benjamin Sadik | AI and ML Specialist Customer Engineer
Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist

Other contributors:

Daniel Lees | Cloud Security Architect
Kumar Dhanagopal | Cross-Product Solution Developer

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-02-14 UTC.

Movatterモバイル変換

Well-Architected Framework: AI and ML perspective Stay organized with collections Save and categorize content based on your preferences.

Contributors

AI and ML perspective: Operational excellence

Build a robust foundation for model development

Define the problems and the required outcomes

Collect and preprocess the necessary data

Select an appropriate ML approach

Set up version control for code, models, and data

Automate the model-development lifecycle

Use a managed pipeline orchestration system

Implement CI/CD pipelines

Implement safe and controlled model releases

Implement observability

Monitor performance continuously

Evaluate models during development

Monitor for availability

Set up custom alerts for business-specific thresholds

Build a culture of operational excellence

Champion automation and standardization

Prioritize continuous learning and improvement

Cultivate accountability and ownership

Embed AI ethics and safety considerations

Design for scalability

Plan for capacity and quotas

Prepare for peak events

Scale applications for production

Contributors

AI and ML perspective: Security

Define clear goals and requirements

Align AI and ML security with business goals

Identify potential attack vectors and risks

Keep data secure and prevent loss or mishandling

Adhere to data minimization principles

Monitor data collection, storage, and transformation

Implement role-based access controls with least privilege principles

Implement security measures for data movement

Guard against data poisoning

Keep AI pipelines secure and robust against tampering

Use secure coding practices

Protect pipelines and model artifacts from unauthorized access

Enforce lineage and tracking

Deploy on secure systems with secure tools and artifacts

Train and deploy models in a secure environment

Follow SLSA guidelines for AI artifacts

Use validated prebuilt container images

Consider Confidential Computing for GPUs

Verify and protect inputs

Implement practices that help secure generative AI systems

Prevent malicious queries to your AI systems

Monitor, evaluate, and prepare to respond to outputs

Evaluate model performance with metrics and security measures

Monitor AI and ML model outputs in production

Implement alerting and incident response procedures

Contributors

AI and ML perspective: Reliability

Ensure that ML infrastructure is scalable and highly available

Implement automatic and dynamic scaling capabilities

Design for HA and fault tolerance

Manage resources proactively and anticipate requirements

Optimize resource availability and obtainability

Distribute incoming traffic across multiple instances

Use a modular and loosely coupled architecture

Implement small self-contained modules or components

Containerized microservices on GKE

Serverless event-driven services

Vertex AI managed services

Agentic applications

Design well-defined interfaces

Protocol choice

Standardized and comprehensive documentation

Interaction with Google Cloud managed services like Vertex AI

Use APIs to isolate modules and abstract implementation details

Plan for graceful degradation

Fault isolation

Component redundancy

Proactive monitoring

SRE practices

Build an automated end-to-end MLOps platform

Automate the model development lifecycle

Well-Architected Framework: AI and ML perspective