Google Cloud Well-Architected Framework

Last reviewed 2026-01-28 UTC
This page provides a one-page view of all of the pages in theGoogle Cloud Well-Architected Framework. You can print this page or save it in PDF format by using your browser's print function.

This page doesn't have a table of contents. You can't use the links on this page to navigatewithin the page.

The Well-Architected Framework provides recommendations to helparchitects, developers, administrators, and other cloud practitioners design andoperate a cloud topology that's secure, efficient, resilient, high-performing,cost-effective, and sustainable.

A cross-functional team of experts at Google validates the recommendations inthe Well-Architected Framework. The team curates the Well-Architected Framework toreflect the expanding capabilities of Google Cloud, industry best practices,community knowledge, and feedback from you. For a summary of the significantchanges to the Well-Architected Framework, seeWhat's new.

The Well-Architected Framework is relevant to applications built for the cloudandfor workloads migrated from on-premises to Google Cloud, hybrid clouddeployments, and multi-cloud environments.

Well-Architected Framework pillars and perspectives

The recommendations in the Well-Architected Framework are organized into pillars andcross-pillar perspectives, as shown in the following diagram.

Well-Architected Framework.Well-Architected Framework.

  • Apillar in the Well-Architected Framework providesprinciples and recommendations for a specific non-functional focus area:security, reliability, performance, cost, operations, or sustainability.

  • Aperspective in the Well-Architected Frameworkis a cross-pillar view of recommendations for a specific technology or industry.The recommendations in a perspective align with the general principles andrecommendations in the pillars.

    For example, the financial services industry (FSI) perspectiverecommends a disaster recovery strategy that meets regulatory requirements fordata residency. This FSI-specific recommendation aligns with the reliabilitypillar's principle about realistic targets, because the data residencyrequirements influence the choice of failover region and, consequently, therecovery objectives.

Pillars

Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design forprivacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.
ecoSustainability
Build and manage cloud workloads that are environmentally sustainable.

Cross-pillar perspectives

AI and ML
A cross-pillar view of technology-specific recommendations for AI and MLworkloads.
Financial services industry (FSI)
A cross-pillar view of industry-specific recommendations for FSI workloads.

Core principles

Before you explore the recommendations in each pillar of the Well-Architected Framework,review the following core principles:

Design for change

No system is static. The needs of its users, the goals of the team that buildsthe system, and the system itself are constantly changing. With the need for changein mind, build a development and production process that enables teams toregularly deliver small changes and get fast feedback on those changes.Consistently demonstrating the ability to deploy changes helps to build trustwith stakeholders, including the teams responsible for the system, and the usersof the system. UsingDORA's software delivery metrics can help your team monitor the speed, ease, and safety of making changes to thesystem.

Document your architecture

When you start to move your workloads to the cloud or build your applications,lack of documentation about the system can be a major obstacle. Documentationis especially important for correctly visualizing the architecture of yourcurrent deployments.

Quality documentation isn't achieved by producing a specificamount of documentation, but by how clear content is, how useful it is, and howit's maintained as the system changes.

A properly documented cloud architecture establishes a common language andstandards, which enable cross-functional teams to communicate and collaborateeffectively. The documentation also provides the information that's necessary toidentify and guide future design decisions. Documentation should be written withyour use cases in mind, to provide context for the design decisions.

Over time, your design decisions will evolve and change. The change historyprovides the context that your teams require to align initiatives, avoidduplication, and measure performance changes effectively over time. Change logsare particularly valuable when you onboard a new cloud architect who is not yetfamiliar with your current design, strategy, or history.

Analysis by DORA has found a clear link between documentation quality and organizationalperformance — the organization's ability to meet their performance andprofitability goals.

Simplify your design and use fully managed services

Simplicity is crucial for design. If your architecture is too complex tounderstand, it will be difficult to implement the design and manage it overtime. Where feasible, use fully managed services to minimize the risks, time,and effort associated with managing and maintaining baseline systems.

If you're already running your workloads in production, test with managedservices to see how they might help to reduce operational complexities. Ifyou're developing new workloads, then start simple, establish a minimal viableproduct (MVP), and resist the urge to over-engineer. You can identifyexceptional use cases, iterate, and improve your systems incrementally overtime.

Decouple your architecture

Research from DORA shows that architecture is an important predictor for achieving continuousdelivery. Decoupling is a technique that's used to separate your applicationsand service components into smaller components that can operate independently.For example, you might separate a monolithic application stack into individualservice components. In a loosely coupled architecture, an application can runits functions independently, regardless of the various dependencies.

A decoupled architecture gives you increased flexibility to do the following:

  • Apply independent upgrades.
  • Enforce specific security controls.
  • Establish reliability goals for each subsystem.
  • Monitor health.
  • Granularly control performance and cost parameters.

You can start the decoupling process early in your design phase or incorporateit as part of your system upgrades as you scale.

Use a stateless architecture

A stateless architecture can increase both the reliability and scalability ofyour applications.

Stateful applications rely on various dependencies to perform tasks, such aslocal caching of data. Stateful applications often require additional mechanismsto capture progress and restart gracefully. Stateless applications can performtasks without significant local dependencies by using shared storage or cachedservices. A stateless architecture enables your applications to scale up quicklywith minimum boot dependencies. The applications can withstand hard restarts,have lower downtime, and provide better performance for end users.

Well-Architected Framework: Operational excellence pillar

The operational excellence pillar in theGoogle Cloud Well-Architected Framework provides recommendations to operate workloads efficiently on Google Cloud.Operational excellence in the cloud involves designing, implementing, andmanaging cloud solutions that provide value, performance, security, andreliability. The recommendations in this pillar help you to continuously improveand adapt workloads to meet the dynamic and ever-evolving needs in the cloud.

The operational excellence pillar is relevant to the following audiences:

  • Managers and leaders: A framework to establish and maintainoperational excellence in the cloud and to ensure that cloud investmentsdeliver value and support business objectives.
  • Cloud operations teams: Guidance to manage incidents and problems,plan capacity, optimize performance, and manage change.
  • Site reliability engineers (SREs): Best practices that help you toachieve high levels of service reliability, including monitoring, incidentresponse, and automation.
  • Cloud architects and engineers: Operational requirements and bestpractices for the design and implementation phases, to help ensure thatsolutions are designed for operational efficiency and scalability.
  • DevOps teams: Guidance about automation, CI/CD pipelines, and changemanagement, to help enable faster and more reliable software delivery.

To achieve operational excellence, you should embrace automation,orchestration, and data-driven insights. Automation helps to eliminate toil. Italso streamlines and builds guardrails around repetitive tasks. Orchestrationhelps to coordinate complex processes. Data-driven insights enableevidence-based decision-making. By using these practices, you can optimize cloudoperations, reduce costs, improve service availability, and enhance security.

Operational excellence in the cloud goes beyond technical proficiency in cloudoperations. It includes a cultural shift that encourages continuous learning andexperimentation. Teams must be empowered to innovate, iterate, and adopt agrowth mindset. A culture of operational excellence fosters a collaborativeenvironment where individuals are encouraged to share ideas, challengeassumptions, and drive improvement.

For operational excellence principles and recommendations that are specific to AI and ML workloads, seeAI and ML perspective: Operational excellencein the Well-Architected Framework.

Core principles

The recommendations in the operational excellence pillar of the Well-Architected Frameworkare mapped to the following core principles:

Contributors

Authors:

Other contributors:

Ensure operational readiness and performance using CloudOps

This principle in the operational excellence pillar of theGoogle Cloud Well-Architected Framework helps you to ensure operational readiness and performance of your cloudworkloads. It emphasizes establishing clear expectations and commitments forservice performance, implementing robust monitoring and alerting, conductingperformance testing, and proactively planning for capacity needs.

Principle overview

Different organizations might interpret operational readiness differently.Operational readiness is howyour organization prepares to successfullyoperate workloads on Google Cloud. Preparing to operate a complex,multilayered cloud workload requires careful planning for both go-live andday-2 operations. These operations are often calledCloudOps.

Focus areas of operational readiness

Operational readiness consists of four focus areas. Each focus area consists ofa set of activities and components that are necessary to prepare to operate acomplex application or environment in Google Cloud. The following tablelists the components and activities of each focus area:

Note: The recommendations in the operational excellence pillar of theWell-Architected Framework are relevant to one or more of theseoperational-readiness focus areas.
Focus area of operational readinessActivities and components
Workforce
  • Defining clear roles and responsibilities for the teams that manage and operate the cloud resources.
  • Ensuring that team members have appropriate skills.
  • Developing a learning program.
  • Establishing a clear team structure.
  • Hiring the required talent.
Processes
  • Observability.
  • Managing service disruptions.
  • Cloud delivery.
  • Core cloud operations.
ToolingTools that are required to support CloudOps processes.
Governance
  • Service levels and reporting.
  • Cloud financials.
  • Cloud operating model.
  • Architectural review and governance boards.
  • Cloud architecture and compliance.

Recommendations

To ensure operational readiness and performance by using CloudOps, consider therecommendations in the following sections. Each recommendation in this documentis relevant to one or more of thefocus areas of operational readiness.

Define SLOs and SLAs

A core responsibility of the cloud operations team is to define service levelobjectives (SLOs) and service level agreements (SLAs) for all of the criticalworkloads. This recommendation is relevant to the governancefocus area of operational readiness.

SLOs must be specific, measurable, achievable, relevant, and time-bound (SMART),and they must reflect the level of service and performance that you want.

  • Specific: Clearly articulates the required level of service andperformance.
  • Measurable: Quantifiable and trackable.
  • Achievable: Attainable within the limits of your organization'scapabilities and resources.
  • Relevant: Aligned with business goals and priorities.
  • Time-bound: Has a defined timeframe for measurement and evaluation.

For example, an SLO for a web application might be "99.9% availability" or"average response time less than 200 ms." Such SLOs clearly define the requiredlevel of service and performance for the web application, and the SLOs can bemeasured and tracked over time.

SLAs outline the commitments to customers regarding service availability,performance, and support, including any penalties or remedies for noncompliance.SLAs must include specific details about the services that are provided, thelevel of service that can be expected, the responsibilities of both the serviceprovider and the customer, and any penalties or remedies for noncompliance. SLAsserve as a contractual agreement between the two parties, ensuring that bothhave a clear understanding of the expectations and obligations that areassociated with the cloud service.

Google Cloud provides tools likeCloud Monitoring and service level indicators (SLIs) to help you define and track SLOs.Cloud Monitoring provides comprehensive monitoring and observabilitycapabilities that enable your organization to collect and analyze metrics thatare related to the availability, performance, and latency of cloud-basedapplications and services. SLIs are specific metrics that you can use to measureand track SLOs over time. By utilizing these tools, you can effectively monitorand manage cloud services, and ensure that they meet the SLOs and SLAs.

Clearly defining and communicating SLOs and SLAs for all of your critical cloudservices helps to ensure reliability and performance of your deployedapplications and services.

Implement comprehensive observability

To get real-time visibility into the health and performance of your cloudenvironment, we recommend that you use a combination ofGoogle Cloud Observability toolsandthird-party solutions. This recommendation is relevant to thesefocus areas of operational readiness:processes and tooling.

Implementing a combination of observability solutions provides you with acomprehensive observability strategy that covers various aspects of your cloudinfrastructure and applications. Google Cloud Observability is a unified platform forcollecting, analyzing, and visualizing metrics, logs, and traces from variousGoogle Cloud services, applications, and external sources. By usingCloud Monitoring, you can gain insights into resource utilization,performance characteristics, and overall health of your resources.

To ensure comprehensive monitoring, monitor important metrics that align withsystem health indicators such as CPU utilization, memory usage, network traffic,disk I/O, and application response times. You must also considerbusiness-specific metrics. By tracking these metrics, you can identify potentialbottlenecks, performance issues, and resource constraints. Additionally, you canset up alerts to notify relevant teams proactively about potential issues oranomalies.

To enhance your monitoring capabilities further, you can integrate third-partysolutions with Google Cloud Observability. These solutions can provide additionalfunctionality, such as advanced analytics, machine learning-powered anomalydetection, and incident management capabilities. This combination ofGoogle Cloud Observability tools and third-party solutions lets you create a robust andcustomizable monitoring ecosystem that's tailored to your specific needs. Byusing this combination approach, you can proactively identify and addressissues, optimize resource utilization, and ensure the overall reliability andavailability of your cloud applications and services.

Implement performance and load testing

Performing regular performance testing helps you to ensure that yourcloud-based applications and infrastructure can handle peak loads and maintainoptimal performance. Load testing simulates realistic traffic patterns. Stresstesting pushes the system to its limits to identify potential bottlenecks andperformance limitations. This recommendation is relevant to thesefocus areas of operational readiness:processes and tooling.

Tools likeCloud Load Balancing andload testing services can help you to simulate real-world traffic patterns and stress-test yourapplications. These tools provide valuable insights into how your system behavesunder various load conditions, and can help you to identify areas that requireoptimization.

Based on the results of performance testing, you can make decisions to optimizeyour cloud infrastructure and applications for optimal performance andscalability. This optimization might involve adjusting resource allocation,tuning configurations, or implementing caching mechanisms.

For example, if you find that your application is experiencing slowdowns duringperiods of high traffic, you might need to increase the number of virtualmachines or containers that are allocated to the application. Alternatively, youmight need to adjust the configuration of your web server or database to improveperformance.

By regularly conducting performance testing and implementing the necessaryoptimizations, you can ensure that your cloud-based applications andinfrastructure always run at peak performance, and deliver a seamless andresponsive experience for your users. Doing so can help you to maintain acompetitive advantage and build trust with your customers.

Plan and manage capacity

Proactively planning for future capacity needs—both organic or inorganic—helpsyou to ensure the smooth operation and scalability of your cloud-based systems.This recommendation is relevant to the processesfocus area of operational readiness.

Planning for future capacity includes understanding and managingquotas for various resources like compute instances, storage, and API requests. Byanalyzing historical usage patterns, growth projections, and businessrequirements, you can accurately anticipate future capacity requirements. Youcan use tools likeCloud Monitoring andBigQuery to collect and analyze usage data, identify trends, and forecast future demand.

Historical usage patterns provide valuable insights into resource utilizationover time. By examining metrics like CPU utilization, memory usage, and networktraffic, you can identify periods of high demand and potential bottlenecks.Additionally, you can help to estimate future capacity needs by making growthprojections based on factors like growth in the user base, new products andfeatures, and marketing campaigns. When you assess capacity needs, you shouldalso consider business requirements like SLAs and performance targets.

When you determine the resource sizing for a workload, consider factors that canaffect utilization of resources. Seasonal variations like holiday shoppingperiods or end-of-quarter sales can lead to temporary spikes in demand. Plannedevents like product launches or marketing campaigns can also significantlyincrease traffic. To make sure that your primary and disaster recovery (DR)system can handle unexpected surges in demand, plan for capacity that cansupport graceful failover during disruptions like natural disasters andcyberattacks.

Autoscaling is an important strategy for dynamically adjusting your cloudresources based on workload fluctuations. By using autoscaling policies, you canautomatically scale compute instances, storage, and other resources in responseto changing demand. This ensures optimal performance during peak periods whileminimizing costs when resource utilization is low. Autoscaling algorithms usemetrics like CPU utilization, memory usage, and queue depth to determine when toscale resources.

Continuously monitor and optimize

To manage and optimize cloud workloads, you must establish a process forcontinuously monitoring and analyzing performance metrics. This recommendationis relevant to thesefocus areas of operational readiness:processes and tooling.

To establish a process for continuous monitoring and analysis, you track,collect, and evaluate data that's related to various aspects of your cloudenvironment. By using this data, you can proactively identify areas forimprovement, optimize resource utilization, and ensure that your cloudinfrastructure consistently meets or exceeds your performance expectations.

An important aspect of performance monitoring is regularly reviewing logs andtraces. Logs provide valuable insights into system events, errors, and warnings.Traces provide detailed information about the flow of requests through yourapplication. By analyzing logs and traces, you can identify potential issues,identify the root causes of problems, and get a better understanding of how yourapplications behave under different conditions. Metrics like the round-trip timebetween services can help you to identify and understand bottlenecks that are inyour workloads.

Further, you can use performance-tuning techniques to significantly enhanceapplication response times and overall efficiency. The following are examples oftechniques that you can use:

  • Caching: Store frequently accessed data in memory to reduce theneed for repeated database queries or API calls.
  • Database optimization: Use techniques like indexing and queryoptimization to improve the performance of database operations.
  • Code profiling: Identify areas of your code that consume excessiveresources or cause performance issues.

By applying these techniques, you can optimize your applications and ensurethat they run efficiently in the cloud.

Manage incidents and problems

This principle in the operational excellence pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you manage incidents and problems related toyour cloud workloads. It involves implementing comprehensive monitoring andobservability, establishing clear incident response procedures, conductingthorough root cause analysis, and implementing preventive measures. Many of thetopics that are discussed in this principle are covered in detail in theReliability pillar.

Principle overview

Incident management and problem management are important components of afunctional operations environment. How you respond to, categorize, and solveincidents of differing severity can significantly affect your operations. Youmust also proactively and continuously make adjustments to optimize reliabilityand performance. An efficient process for incident and problem management relieson the following foundational elements:

  • Continuous monitoring: Identify and resolve issues quickly.
  • Automation: Streamline tasks and improve efficiency.
  • Orchestration: Coordinate and manage cloud resources effectively.
  • Data-driven insights: Optimize cloud operations and makeinformed decisions.

These elements help you to build a resilient cloud environment that can handlea wide range of challenges and disruptions. These elements can also help toreduce the risk of costly incidents and downtime, and they can help you toachieve greater business agility and success. These foundational elements arespread across thefour focus areas of operational readiness:Workforce, Processes, Tooling, and Governance.

Note: TheGoogle SRE Book defines many of the terms and concepts that are described in this document. Werecommend the Google SRE Book as supplemental reading to support therecommendations that are described in this document.

Recommendations

To manage incidents and problems effectively, consider the recommendations inthe following sections. Each recommendation in this document is relevant to oneor more of thefocus areas of operational readiness.

Establish clear incident response procedures

Clear roles and responsibilities are essential to ensure effective andcoordinated response to incidents. Additionally, clear communication protocolsand escalation paths help to ensure that information is shared promptly andeffectively during an incident. This recommendation is relevant to thesefocus areas of operational readiness:workforce, processes, and tooling.

To establish incident response procedures, you need to define the roles andexpectations of each team member, such as incident commanders, investigators,communicators, and technical experts. Establishing communication and escalationpaths includes identifying important contacts, setting up communicationchannels, and defining the process for escalating incidents to higher levels ofmanagement when necessary. Regular training and preparation helps to ensure thatteams are equipped with the knowledge and skills to respond to incidentseffectively.

By documenting incident response procedures in a runbook or playbook, you canprovide a standardized reference guide for teams to follow during an incident.The runbook must outline the steps to be taken at each stage of the incidentresponse process, including communication, triage, investigation, andresolution. It must also include information about relevant tools and resourcesand contact information for important personnel. You must regularly review andupdate the runbook to ensure that it remains current and effective.

Centralize incident management

For effective tracking and management throughout the incident lifecycle,consider using a centralized incident management system. This recommendation isrelevant to thesefocus areas of operational readiness:processes and tooling.

A centralized incident management system provides the following advantages:

  • Improved visibility: By consolidating all incident-related data ina single location, you eliminate the need for teams to search in variouschannels or systems for context. This approach saves time and reducesconfusion, and it gives stakeholders a comprehensive view of the incident,including its status, impact, and progress.
  • Better coordination and collaboration: A centralized system providesa unified platform for communication and task management. It promotesseamless collaboration between the different departments and functions thatare involved in incident response. This approach ensures that everyone hasaccess to up-to-date information and it reduces the risk ofmiscommunication and misalignment.
  • Enhanced accountability and ownership: A centralized incidentmanagement system enables your organization to allocate tasks to specificindividuals or teams and it ensures that responsibilities are clearlydefined and tracked. This approach promotes accountability and encouragesproactive problem-solving because team members can easily monitor theirprogress and contributions.

A centralized incident management system must offer robust features forincident tracking, task assignment, and communication management. These featureslet you customize workflows, set priorities, and integrate with other systems,such as monitoring tools and ticketing systems.

By implementing a centralized incident management system, you can optimize yourorganization's incident response processes, improve collaboration, and enhancevisibility. Doing so leads to faster incident resolution times, reduceddowntime, and improved customer satisfaction. It also helps foster a culture ofcontinuous improvement because you can learn from past incidents and identifyareas for improvement.

Conduct thorough post-incident reviews

After an incident occurs, you must conduct a detailed post-incident review(PIR), which is also known as apostmortem,to identify the root cause, contributing factors, and lessons learned. Thisthorough review helps you to prevent similar incidents in the future. Thisrecommendation is relevant to thesefocus areas of operational readiness:processes and governance.

The PIR process must involve a multidisciplinary team that has expertise invarious aspects of the incident. The team must gather all of the relevantinformation through interviews, documentation review, and site inspections. Atimeline of events must be created to establish the sequence of actions that ledup to the incident.

After the team gathers the required information, they must conduct a root causeanalysis to determine the factors that led to the incident. This analysis mustidentify both the immediate cause and the systemic issues that contributed tothe incident.

Along with identifying the root cause, the PIR team must identify any othercontributing factors that might have caused the incident. These factors couldinclude human error, equipment failure, or organizational factors likecommunication breakdowns and lack of training.

The PIR report must document the findings of the investigation, including thetimeline of events, root cause analysis, and recommended actions. The report isa valuable resource for implementing corrective actions and preventingrecurrence. The report must be shared with all of the relevant stakeholders andit must be used to develop safety training and procedures.

To ensure a successful PIR process, your organization must foster a blamelessculture that focuses on learning and improvement rather than assigning blame.This culture encourages individuals to report incidents without fear ofretribution, and it lets you address systemic issues and make meaningfulimprovements.

By conducting thorough PIRs and implementing corrective measures based on thefindings, you can significantly reduce the risk of similar incidents occurringin the future. This proactive approach to incident investigation and preventionhelps to create a safer and more efficient work environment for everyoneinvolved.

Maintain a knowledge base

A knowledge base of known issues, solutions, and troubleshooting guides isessential for incident management and resolution. Team members can use theknowledge base to quickly identify and address common problems. Implementing aknowledge base helps to reduce the need for escalation and it improves overallefficiency. This recommendation is relevant to thesefocus areas of operational readiness:workforce and processes.

A primary benefit of a knowledge base is that it lets teams learn from pastexperiences and avoid repeating mistakes. By capturing and sharing solutions toknown issues, teams can build a collective understanding of how to resolvecommon problems and best practices for incident management. Use of a knowledgebase saves time and effort, and helps to standardize processes and ensureconsistency in incident resolution.

Along with helping to improve incident resolution times, a knowledge basepromotes knowledge sharing and collaboration across teams. With a centralrepository of information, teams can easily access and contribute to theknowledge base, which promotes a culture of continuous learning and improvement.This culture encourages teams to share their expertise and experiences, leadingto a more comprehensive and valuable knowledge base.

To create and manage a knowledge base effectively, use appropriate tools andtechnologies. Collaboration platforms likeGoogle Workspace are well-suited for this purpose because they let you easily create, edit, andshare documents collaboratively. These tools also support version control andchange tracking, which ensures that the knowledge base remains up-to-date andaccurate.

Make the knowledge base easily accessible to all relevant teams. You can achievethis by integrating the knowledge base with existing incident management systemsor by providing a dedicated portal or intranet site. A knowledge base that'sreadily available lets teams quickly access the information that they need toresolve incidents efficiently. This availability helps to reduce downtime andminimize the impact on business operations.

Regularly review and update the knowledge base to ensure that it remainsrelevant and useful. Monitor incident reports, identify common issues andtrends, and incorporate new solutions and troubleshooting guides into theknowledge base. An up-to-date knowledge base helps your teams resolve incidentsfaster and more effectively.

Automate incident response

Automation helps to streamline your incident response and remediationprocesses. It lets you address security breaches and system failurespromptly and efficiently. By using Google Cloud products likeCloud Run functions orCloud Run,you can automate various tasks that are typically manual and time-consuming.This recommendation is relevant to thesefocus areas of operational readiness:processes and tooling.

Automated incident response provides the following benefits:

  • Reduction in incident detection and resolution times: Automatedtools can continuously monitor systems and applications, detect suspiciousor anomalous activities in real time, and notify stakeholders or respondwithout intervention. This automation lets you identify potential threatsor issues before they escalate into major incidents. When an incident isdetected, automated tools can trigger predefined remediation actions, suchas isolating affected systems, quarantining malicious files, or rollingback changes to restore the system to a known good state.
  • Reduced burden on security and operations teams: Automated incidentresponse lets the security and operations teams focus on more strategictasks. By automating routine and repetitive tasks, such as collectingdiagnostic information or triggering alerts, your organization can free uppersonnel to handle more complex and critical incidents. This automationcan lead to improved overall incident response effectiveness and efficiency.
  • Enhanced consistency and accuracy of the remediation process:Automated tools can ensure that remediation actions are applied uniformlyacross all affected systems, minimizing the risk of human error orinconsistency. This standardization of the remediation process helps tominimize the impact of incidents on users and the business.

Manage and optimize cloud resources

This principle in the operational excellence pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you manage and optimize the resources that areused by your cloud workloads. It involves right-sizing resources based on actualusage and demand, using autoscaling for dynamic resource allocation,implementing cost optimization strategies, and regularly reviewing resourceutilization and costs. Many of the topics that are discussed in this principleare covered in detail in theCost optimization pillar.

Principle overview

Cloud resource management and optimization play a vital role in optimizingcloud spending, resource usage, and infrastructure efficiency. It includesvarious strategies and best practices aimed at maximizing the value and returnfrom your cloud spending.

This pillar's focus on optimization extends beyond cost reduction. It emphasizesthe following goals:

  • Efficiency: Using automation and data analytics to achieve peakperformance and cost savings.
  • Performance:Scaling resources effortlessly to meet fluctuating demands and deliveroptimal results.
  • Scalability: Adapting infrastructure and processes toaccommodate rapid growth and diverse workloads.

By focusing on these goals, you achieve a balance between cost andfunctionality. You can make informed decisions regarding resource provisioning,scaling, and migration. Additionally, you gain valuable insights into resourceconsumption patterns, which lets you proactively identify and address potentialissues before they escalate.

Recommendations

To manage and optimize resources, consider the recommendations in the followingsections. Each recommendation in this document is relevant to one or more ofthefocus areas of operational readiness.

Right-size resources

Continuously monitoring resource utilization and adjusting resource allocationto match actual demand are essential for efficient cloud resource management.Over-provisioning resources can lead to unnecessary costs, andunder-provisioning can cause performance bottlenecks that affect applicationperformance and user experience. To achieve an optimal balance, you must adopt aproactive approach to right-sizing cloud resources. This recommendation isrelevant to the governancefocus area of operational readiness.

Cloud Monitoring andRecommender can help you to identify opportunities for right-sizing. Cloud Monitoringprovides real-time visibility into resource utilization metrics. This visibilitylets you track resource usage patterns and identify potential inefficiencies.Recommender analyzes resource utilization data to make intelligentrecommendations for optimizing resource allocation. By using these tools, youcan gain insights into resource usage and make informed decisions aboutright-sizing the resources.

In addition to Cloud Monitoring and Recommender, considerusing custom metrics to trigger automated right-sizing actions. Custom metricslet you track specific resource utilization metrics that are relevant to yourapplications and workloads. You can also configure alerts to notifyadministrators when predefined thresholds are met. The administrators can thentake necessary actions to adjust resource allocation. This proactive approachensures that resources are scaled in a timely manner, which helps to optimizecloud costs and prevent performance issues.

Use autoscaling

Autoscaling compute and other resources helps to ensure optimal performance andcost efficiency of your cloud-based applications. Autoscaling lets youdynamically adjust the capacity of your resources based on workloadfluctuations, so that you have the resources that you need when you need themand you can avoid over-provisioning and unnecessary costs. This recommendationis relevant to the processesfocus area of operational readiness.

To meet the diverse needs of different applications and workloads,Google Cloud offers various autoscaling options, including the following:

  • Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. WithMIGs, you can define autoscaling policies that specify the minimum andmaximum number of VMs to maintain in the group, and the conditions thattrigger autoscaling. For example, you can configure a policy to add VMs ina MIG when the CPU utilization reaches a certain threshold and to removeVMs when the utilization drops below a different threshold.
  • Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application'sneeds. It offers the following tools:

    • Cluster Autoscaler adds or removes nodes based on Pod resource demands.
    • Horizontal Pod Autoscaler changes the number of Pod replicas based onCPU, memory, or custom metrics.
    • Vertical Pod Autoscaler fine-tunes Pod resource requests and limitsbased on usage patterns.
    • Node Auto-Provisioning automatically creates optimized node pools foryour workloads.

    These tools work together to optimize resource utilization, ensureapplication performance, and simplify cluster management.

  • Cloud Run is a serverless platform that lets you run code without having to manageinfrastructure. Cloud Run offers built-in autoscaling, whichautomatically adjusts the number of instances based on the incomingtraffic. When the volume of traffic increases, Cloud Runscales up the number of instances to handle the load. When trafficdecreases, Cloud Run scales down the number of instances toreduce costs.

By using these autoscaling options, you can ensure that your cloud-basedapplications have the resources that they need to handle varying workloads,while avoiding overprovisioning and unnecessary costs. Using autoscaling canlead to improved performance, cost savings, and more efficient use of cloudresources.

Leverage cost optimization strategies

Optimizing cloud spending helps you to effectively manage your organization'sIT budgets. This recommendation is relevant to the governancefocus area of operational readiness.

Google Cloud offers several tools and techniques to help you optimizecloud costs. By using these tools and techniques, you can get the best valuefrom your cloud spending. These tools and techniques help you to identify areaswhere costs can be reduced, such as identifying underutilized resources orrecommending more cost-effective instance types. Google Cloud options tohelp optimize cloud costs include the following:

Pricing models might change over time, and new features might be introduced thatoffer better performance or lower cost compared to existing options. Therefore,you should regularly review pricing models and consider alternative features. Bystaying informed about the latest pricing models and features, you can makeinformed decisions about your cloud architecture to minimize costs.

Google Cloud'sCost Management tools, such as budgets and alerts, provide valuable insights into cloudspending. Budgets and alerts let users set budgets and receive alerts when thebudgets are exceeded. These tools help users track their cloud spending andidentify areas where costs can be reduced.

Track resource usage and costs

You can use tagging and labeling to track resource usage and costs. Byassigning tags and labels to your cloud resources like projects, departments, orother relevant dimensions, you can categorize and organize the resources. Thislets you monitor and analyze spending patterns for specific resources andidentify areas of high usage or potential cost savings. This recommendation isrelevant to thesefocus areas of operational readiness:governance and tooling.

Tools like Cloud Billing and Cost Management help you to get a comprehensiveunderstanding of your spending patterns. These tools provide detailed insightsinto your cloud usage and they let you identify trends, forecast costs, and makeinformed decisions. By analyzing historical data and current spending patterns,you can identify the focus areas for your cost-optimization efforts.

Custom dashboards and reports help you to visualize cost data and gain deeperinsights into spending trends. By customizing dashboards with relevant metricsand dimensions, you can monitor key performance indicators (KPIs) and trackprogress towards your cost optimization goals. Reports offer deeper analyses ofcost data. Reports let you filter the data by specific time periods or resourcetypes to understand the underlying factors that contribute to your cloudspending.

Regularly review and update your tags, labels, and cost analysis tools to ensurethat you have the most up-to-date information on your cloud usage and costs. Bystaying informed and conducting cost postmortems or proactive cost reviews, youcan promptly identify any unexpected increases in spending. Doing so lets youmake proactive decisions to optimize cloud resources and control costs.

Establish cost allocation and budgeting

Accountability and transparency in cloud cost management are crucial foroptimizing resource utilization and ensuring financial control. Thisrecommendation is relevant to the governancefocus area of operational readiness.

To ensure accountability and transparency, you need to have clear mechanismsfor cost allocation and chargeback. By allocating costs to specific teams,projects, or individuals, your organization can ensure that each of theseentities is responsible for its cloud usage. This practice fosters a sense ofownership and encourages responsible resource management. Additionally,chargeback mechanisms enable your organization to recover cloud costs frominternal customers, align incentives with performance, and promote fiscaldiscipline.

Establishing budgets for different teams or projects is another essentialaspect of cloud cost management. Budgets enable your organization to definespending limits and track actual expenses against those limits. This approachlets you make proactive decisions to prevent uncontrolled spending. By settingrealistic and achievable budgets, you can ensure that cloud resources are usedefficiently and aligned with business objectives. Regular monitoring of actualspending against budgets helps you to identify variances and address potentialoverruns promptly.

To monitor budgets, you can use tools likeCloud Billing budgets and alerts.These tools provide real-time insights into cloud spending and they notifystakeholders of potential overruns. By using these capabilities, you can trackcloud costs and take corrective actions before significant deviations occur.This proactive approach helps to prevent financial surprises and ensures thatcloud resources are used responsibly.

Automate and manage change

This principle in the operational excellence pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you automate and manage change for your cloudworkloads. It involves implementing infrastructure as code (IaC), establishingstandard operating procedures, implementing a structured change managementprocess, and using automation and orchestration.

Principle overview

Change management and automation play a crucial role in ensuring smooth andcontrolled transitions within cloud environments. For effective changemanagement, you need to use strategies and best practices that minimizedisruptions and ensure that changes are integrated seamlessly with existingsystems.

Effective change management and automation include the following foundationalelements:

  • Change governance: Establish clear policies and procedures forchange management, including approval processes and communication plans.
  • Risk assessment: Identify potential risks associated withchanges and mitigate them through risk management techniques.
  • Testing and validation: Thoroughly test changes to ensure that theymeet functional and performance requirements and mitigate potential regressions.
  • Controlled deployment: Implement changes in a controlled manner,ensuring that users are seamlessly transitioned to the new environment, withmechanisms to seamlessly roll back if needed.

These foundational elements help to minimize the impact of changes and ensurethat changes have a positive effect on business operations. These elements arerepresented by the processes, tooling, and governancefocus areas of operational readiness.

Recommendations

To automate and manage change, consider the recommendations in the followingsections. Each recommendation in this document is relevant to one or more of thefocus areas of operational readiness.

Adopt IaC

Infrastructure as code (IaC) is a transformative approach for managing cloudinfrastructure. You can define and manage cloud infrastructure declaratively byusing tools likeTerraform.IaC helps you achieve consistency, repeatability, and simplified changemanagement. It also enables faster and more reliable deployments. Thisrecommendation is relevant to thesefocus areas of operational readiness:processes and tooling.

The following are the main benefits of adopting the IaC approach for your clouddeployments:

  • Human-readable resource configurations: With the IaC approach, youcan declare your cloud infrastructure resources in a human-readable format,like JSON or YAML. Infrastructure administrators and operators can easilyunderstand and modify the infrastructure and collaborate with others.
  • Consistency and repeatability: IaC enables consistency andrepeatability in your infrastructure deployments. You can ensure that yourinfrastructure is provisioned and configured the same way every time,regardless of who is performing the deployment. This approach helps toreduce errors and ensures that your infrastructure is always in a known state.
  • Accountability and simplified troubleshooting: The IaC approachhelps to improve accountability and makes it easier to troubleshoot issues.By storing your IaC code in a version control system, you can trackchanges, and identify when changes were made and by whom. If necessary, youcan easily roll back to previous versions.

Implement version control

A version control system like Git is a key component of the IaC process. Itprovides robust change management and risk mitigation capabilities, which is whyit's widely adopted, either through in-house development or SaaS solutions. Thisrecommendation is relevant to thesefocus areas of operational readiness:governance and tooling.

By tracking changes to IaC code and configurations, version control providesvisibility into the evolution of the code, making it easier to understand theimpact of changes and identify potential issues. This enhanced visibilityfosters collaboration among team members who work on the same IaC project.

Most version control systems let you easily roll back changes if needed. Thiscapability helps to mitigate the risk of unintended consequences or errors. Byusing tools like Git in your IaC workflow, you can significantly improve changemanagement processes, foster collaboration, and mitigate risks, which leads to amore efficient and reliable IaC implementation.

Build CI/CD pipelines

Continuous integration and continuous delivery (CI/CD) pipelines streamline theprocess of developing and deploying cloud applications. CI/CD pipelines automatethe building, testing, and deployment stages, which enables faster and morefrequent releases with improved quality control. This recommendation is relevantto the toolingfocus area of operational readiness.

CI/CD pipelines ensure that code changes are continuously integrated into acentral repository, typically a version control system like Git. Continuousintegration facilitates early detection and resolution of issues, and it reducesthe likelihood of bugs or compatibility problems.

To create and manage CI/CD pipelines for cloud applications, you can use toolslikeCloud BuildandCloud Deploy.

  • Cloud Build is a fully managed build service that lets developersdefine and execute build steps in a declarative manner. It integratesseamlessly with popular source-code management platforms and it can betriggered by events like code pushes and pull requests.
  • Cloud Deploy is a serverless deployment service that automates theprocess of deploying applications to various environments, such as testing,staging, and production. It provides features like blue-green deployments,traffic splitting, and rollback capabilities, making it easier to manageand monitor application deployments.

Integrating CI/CD pipelines with version control systems and testing frameworkshelps to ensure the quality and reliability of your cloud applications. Byrunning automated tests as part of the CI/CD process, development teams canquickly identify and fix any issues before the code is deployed to theproduction environment. This integration helps to improve the overall stabilityand performance of your cloud applications.

Use configuration management tools

Tools like Puppet, Chef, Ansible, andVM Manager help you to automate the configuration and management of cloud resources. Usingthese tools, you can ensure resource consistency and compliance across yourcloud environments. This recommendation is relevant to the toolingfocus area of operational readiness.

Automating the configuration and management of cloud resources provides thefollowing benefits:

  • Significant reduction in the risk of manual errors: When manualprocesses are involved, there is a higher likelihood of mistakes due tohuman error. Configuration management tools reduce this risk by automatingprocesses, so that configurations are applied consistently and accuratelyacross all cloud resources. This automation can lead to improved reliabilityand stability of the cloud environment.
  • Improvement in operational efficiency: By automating repetitivetasks, your organization can free up IT staff to focus on more strategicinitiatives. This automation can lead to increased productivity and costsavings and improved responsiveness to changing business needs.
  • Simplified management of complex cloud infrastructure: As cloudenvironments grow in size and complexity, managing the resources can becomeincreasingly difficult. Configuration management tools provide acentralized platform for managing cloud resources. The tools make it easierto track configurations, identify issues, and implement changes. Usingthese tools can lead to improved visibility, control, and security of yourcloud environment.

Automate testing

Integrating automated testing into your CI/CD pipelines helps to ensure thequality and reliability of your cloud applications. By validating changes beforedeployment, you can significantly reduce the risk of errors and regressions,which leads to a more stable and robust software system. This recommendation isrelevant to thesefocus areas of operational readiness:processes and tooling.

The following are the main benefits of incorporating automated testing into yourCI/CD pipelines:

  • Early detection of bugs and defects: Automated testing helps todetect bugs and defects early in the development process, before they cancause major problems in production. This capability saves time andresources by preventing the need for costly rework and bug fixes at laterstages in the development process.
  • High quality and standards-based code: Automated testing can helpimprove the overall quality of your code by ensuring that the code meetscertain standards and best practices. This capability leads to moremaintainable and reliable applications that are less prone to errors.

You can use various types of testing techniques in CI/CD pipelines. Each testtype serves a specific purpose.

  • Unit testing focuses on testing individual units of code, such asfunctions or methods, to ensure that they work as expected.
  • Integration testing tests the interactions between differentcomponents or modules of your application to verify that they work properlytogether.
  • End-to-end testing is often used along with unit and integrationtesting. End-to-end testing simulates real-world scenarios to test theapplication as a whole, and helps to ensure that the application meets therequirements of your end users.

To effectively integrate automated testing into your CI/CD pipelines, you mustchoose appropriate testing tools and frameworks. There are many differentoptions, each with its own strengths and weaknesses. You must also establish aclear testing strategy that outlines the types of tests to be performed, thefrequency of testing, and the criteria for passing or failing a test. Byfollowing these recommendations, you can ensure that your automated testingprocess is efficient and effective. Such a process provides valuable insightsinto the quality and reliability of your cloud applications.

Continuously improve and innovate

This principle in the operational excellence pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you continuously optimize cloud operations anddrive innovation.

Principle overview

To continuously improve and innovate in the cloud, you need to focus oncontinuous learning, experimentation, and adaptation. This helps you to explorenew technologies and optimize existing processes and it promotes a culture ofexcellence that enables your organization to achieve and maintain industryleadership.

Through continuous improvement and innovation, you can achieve the followinggoals:

  • Accelerate innovation: Explore new technologies and services toenhance capabilities and drive differentiation.
  • Reduce costs: Identify and eliminate inefficiencies throughprocess-improvement initiatives.
  • Enhance agility: Adapt rapidly to changing market demands andcustomer needs.
  • Improve decision making: Gain valuable insights from data andanalytics to make data-driven decisions.

Organizations that embrace the continuous improvement and innovation principlecan unlock the full potential of the cloud environment and achieve sustainablegrowth. This principle maps primarily to the Workforcefocus area of operational readiness.A culture of innovation lets teams experiment with new tools and technologies toexpand capabilities and reduce costs.

Recommendations

To continuously improve and innovate your cloud workloads, consider therecommendations in the following sections. Each recommendation in this documentis relevant to one or more of thefocus areas of operational readiness.

Foster a culture of learning

Encourage teams to experiment, share knowledge, and learn continuously. Adopt ablameless culture where failures are viewed as opportunities for growth andimprovement. This recommendation is relevant to the workforcefocus area of operational readiness.

When you foster a culture of learning, teams can learn from mistakes anditerate quickly. This approach encourages team members to take risks, experimentwith new ideas, and expand the boundaries of their work. It also creates apsychologically safe environment where individuals feel comfortable sharingfailures and learning from them. Sharing in this way leads to a more open andcollaborative environment.

To facilitate knowledge sharing and continuous learning, create opportunitiesfor teams to share knowledge and learn from each other. You can do this throughinformal and formal learning sessions and conferences.

By fostering a culture of experimentation, knowledge sharing, and continuouslearning, you can create an environment where teams are empowered to take risks,innovate, and grow. This environment can lead to increased productivity,improved problem-solving, and a more engaged and motivated workforce. Further,by promoting a blameless culture, you can create a safe space for employees tolearn from mistakes and contribute to the collective knowledge of the team. Thisculture ultimately leads to a more resilient and adaptable workforce that isbetter equipped to handle challenges and drive success in the long run.

Conduct regular retrospectives

Retrospectives give teams an opportunity to reflect on their experiences,identify what went well, and identify what can be improved. By conductingretrospectives after projects or major incidents, teams can learn from successesand failures, and continuously improve their processes and practices. Thisrecommendation is relevant to thesefocus areas of operational readiness:processes and governance.

An effective way to structure a retrospective is to use theStart-Stop-Continue model:

  • Start: In theStart phase of the retrospective, team membersidentify new practices, processes, and behaviors that they believe canenhance their work. They discuss why the changes are needed and how theycan be implemented.
  • Stop: In theStop phase, team members identify and eliminatepractices, processes, and behaviors that are no longer effective or thathinder progress. They discuss why these changes are necessary and how theycan be implemented.
  • Continue: In theContinue phase, team members identify practices,processes, and behaviors that work well and must be continued. They discusswhy these elements are important and how they can be reinforced.

By using a structured format like the Start-Stop-Continue model, teams canensure that retrospectives are productive and focused. This model helps tofacilitate discussion, identify the main takeaways, and identify actionablesteps for future enhancements.

Stay up-to-date with cloud technologies

To maximize the potential of Google Cloud services, you must keep up withthe latest advancements, features, and best practices. This recommendation isrelevant to the workforcefocus area of operational readiness.

Participating in relevant conferences, webinars, and training sessions is avaluable way to expand your knowledge. These events provide opportunities tolearn from Google Cloud experts, understand new capabilities, and engagewith industry peers who might face similar challenges. By attending thesesessions, you can gain insights into how to use new features effectively,optimize your cloud operations, and drive innovation within your organization.

To ensure that your team members keep up with cloud technologies, encouragethem to obtain certifications and attend training courses. Google Cloudoffers a wide range ofcertifications that validate skills and knowledge in specific cloud domains. Earning thesecertifications demonstrates commitment to excellence and provides tangibleevidence of proficiency in cloud technologies. The training courses that areoffered by Google Cloud and our partners delve deeper into specifictopics. They provide direct experience and practical skills that can beimmediately applied to real-world projects. By investing in the professionaldevelopment of your team, you can foster a culture of continuous learning andensure that everyone has the necessary skills to succeed in the cloud.

Actively seek and incorporate feedback

Collect feedback from users, stakeholders, and team members. Use the feedbackto identify opportunities to improve your cloud solutions. This recommendationis relevant to the workforcefocus area of operational readiness.

The feedback that you collect can help you to understand the evolving needs,issues, and expectations of the users of your solutions. This feedback serves asa valuable input to drive improvements and prioritize future enhancements. Youcan use various mechanisms to collect feedback:

  • Surveys are an effective way to gather quantitative data from alarge number of users and stakeholders.
  • User interviews provide an opportunity for in-depth qualitative datacollection. Interviews let you understand the specific challenges andexperiences of individual users.
  • Feedback forms that are placed within the cloud solutions offer aconvenient way for users to provide immediate feedback on their experience.
  • Regular meetings with team members can facilitate the collection offeedback on technical aspects and implementation challenges.

The feedback that you collect through these mechanisms must be analyzed andsynthesized to identify common themes and patterns. This analysis can help youprioritize future enhancements based on the impact and feasibility of thesuggested improvements. By addressing the needs and issues that are identifiedthrough feedback, you can ensure that your cloud solutions continue to meet theevolving requirements of your users and stakeholders.

Measure and track progress

Key performance indicators (KPIs) and metrics are crucial for tracking progressand measuring the effectiveness of your cloud operations. KPIs are quantifiablemeasurements that reflect the overall performance. Metrics are specific datapoints that contribute to the calculation of KPIs. Review the metrics regularlyand use them to identify opportunities for improvement and measure progress.Doing so helps you to continuously improve and optimize your cloud environment.This recommendation is relevant to thesefocus areas of operational readiness:governance and processes.

A primary benefit of using KPIs and metrics is that they enable yourorganization to adopt a data-driven approach to cloud operations. By trackingand analyzing operational data, you can make informed decisions about how toimprove the cloud environment. This data-driven approach helps you to identifytrends, patterns, and anomalies that might not be visible without the use ofsystematic metrics.

To collect and analyze operational data, you can use tools likeCloud Monitoring andBigQuery.Cloud Monitoring enables real-time monitoring of cloud resources andservices. BigQuery lets you store and analyze the data that yougather through monitoring. Using these tools together, you can create customdashboards to visualize important metrics and trends.

Operational dashboards can provide a centralized view of the most importantmetrics, which lets you quickly identify any areas that need attention. Forexample, a dashboard might include metrics like CPU utilization, memory usage,network traffic, and latency for a particular application or service. Bymonitoring these metrics, you can quickly identify any potential issues and takesteps to resolve them.

Well-Architected Framework: Security, privacy, and compliance pillar

The Security, Privacy and Compliance pillar in theGoogle Cloud Well-Architected Framework provides recommendations to help you design, deploy, and operate cloud workloadsthat meet your requirements for security, privacy, and compliance.

This document is designed to offer valuable insights and meet the needs of arange of security professionals and engineers. The following table describesthe intended audiences for this document:

AudienceWhat this document provides
Chief information security officers (CISOs), business unit leaders,and IT managersA general framework to establish and maintain security excellence inthe cloud and to ensure a comprehensive view of security areas tomake informed decisions about security investments.
Security architects and engineersKey security practices for the design and operational phases to helpensure that solutions are designed for security, efficiency, andscalability.
DevSecOps teamsGuidance to incorporate overarching security controls to planautomation that enables secure and reliable infrastructure.
Compliance officers and risk managersKey security recommendations to follow a structured approach to riskmanagement with safeguards that help to meet complianceobligations.

To ensure that your Google Cloud workloads meet your security, privacy,and compliance requirements, all of the stakeholders in your organization mustadopt a collaborative approach. In addition, you must recognize that cloudsecurity is a shared responsibility between you and Google. For moreinformation, seeShared responsibilities and shared fate on Google Cloud.

The recommendations in this pillar are grouped into core security principles.Each principle-based recommendation is mapped to one or more of thefocus areas of cloud security that might be critical to your organization. Eachrecommendation highlights guidance about the use and configuration ofGoogle Cloud products and capabilities to help improve your organization'ssecurity posture.

Core principles

The recommendations in this pillar are grouped within the following coreprinciples of security. Every principle in this pillar is important. Dependingon the requirements of your organization and workload, you might choose toprioritize certain principles.

  • Implement security by design:Integrate cloud security and network security considerations starting fromthe initial design phase of your applications and infrastructure.Google Cloud provides architecture blueprints and recommendations tohelp you apply this principle.
  • Implement zero trust:Use anever trust, always verify approach, where access to resources isgranted based on continuous verification of trust. Google Cloudsupports this principle through products like Chrome Enterprise Premium andIdentity-Aware Proxy (IAP).
  • Implement shift-left security:Implement security controls early in the software development lifecycle.Avoid security defects before system changes are made. Detect and fixsecurity bugs early, fast, and reliably after the system changes arecommitted. Google Cloud supports this principle through products likeCloud Build, Binary Authorization, and Artifact Registry.
  • Implement preemptive cyber defense:Adopt a proactive approach to security by implementing robust fundamentalmeasures like threat intelligence. This approach helps you build afoundation for more effective threat detection and response.Google Cloud'sapproach to layered security controls aligns with this principle.
  • Use AI securely and responsibly:Develop and deploy AI systems in a responsible and secure manner. Therecommendations for this principle are aligned with guidance in theAI and ML perspective of the Well-Architected Framework and in Google'sSecure AI Framework (SAIF).
  • Use AI for security:Use AI capabilities to improve your existing security systems and processesthroughGemini in Security and overall platform-security capabilities. Use AI as a tool to increasethe automation of remedial work and ensure security hygiene to make othersystems more secure.
  • Meet regulatory, compliance, and privacy needs:Adhere to industry-specific regulations, compliance standards, and privacyrequirements. Google Cloud helps you meet these obligations throughproducts like Assured Workloads, Organization Policy Service, and ourcompliance resource center.

Organizational security mindset

A security-focused organizational mindset is crucial for successful cloudadoption and operation. This mindset should be deeply ingrained in yourorganization's culture and reflected in its practices, which are guided by coresecurity principles as described earlier.

An organizational security mindset emphasizes that you think about securityduring system design, assume zero trust, and integrate security featuresthroughout your development process. In this mindset, you also think proactivelyabout cyber-defense measures, use AI securely and for security, and consideryour regulatory, privacy, and compliance requirements. By embracing theseprinciples, your organization can cultivate a security-first culture thatproactively addresses threats, protects valuable assets, and helps to ensureresponsible technology usage.

Focus areas of cloud security

This section describes the areas for you to focus on when you plan,implement, and manage security for your applications, systems, and data. Therecommendations in each principle of this pillar are relevant to one or more ofthese focus areas. Throughout the rest of this document, the recommendationsspecify the corresponding security focus areas to provide further clarity andcontext.

Focus areaActivities and componentsRelated Google Cloud products, capabilities, and solutions
Infrastructure security
  • Secure network infrastructure.
  • Encrypt data in transit and at rest.
  • Control traffic flow.
  • Secure IaaS and PaaS services.
  • Protect against unauthorized access.
Identity and access management
  • Use authentication, authorization, and access controls.
  • Manage cloud identities.
  • Manage identity and access management policies.
Data security
  • Store data in Google Cloud securely.
  • Control access to the data.
  • Discover and classify the data.
  • Design necessary controls, such as encryption, access controls, and data loss prevention.
  • Protect data at rest, in transit, and in use.
AI and ML security
  • Apply security controls at different layers of the AI and ML infrastructure and pipeline.
  • Ensure model safety.
Security operations (SecOps)
  • Adopt a modern SecOps platform and set of practices, for effective incident management, threat detection, and response processes.
  • Monitor systems and applications continuously for security events.
Application security
  • Secure applications against software vulnerabilities and attacks.
Cloud governance, risk, and compliance
  • Establish policies, procedures, and controls to manage cloud resources effectively and securely.
Logging, auditing, and monitoring
  • Analyze logs to identify potential threats.
  • Track and record system activities for compliance and security analysis.

Contributors

Authors:

  • Wade Holmes | Global Solutions Director
  • Hector Diaz | Cloud Security Architect
  • Carlos Leonardo Rosario | Google Cloud Security Specialist
  • John Bacon | Partner Solutions Architect
  • Sachin Kalra | Global Security Solution Manager

Other contributors:

Implement security by design

This principle in the security pillar of theGoogle Cloud Well-Architected Framework provides recommendations to incorporate robust security features, controls, andpractices into the design of your cloud applications, services, and platforms.From ideation to operations, security is more effective when it's embedded as anintegral part of every stage of your design process.

Principle overview

As explained inAn Overview of Google's Commitment to Secure by Design,secure by default andsecure by design are often used interchangeably, butthey represent distinct approaches to building secure systems. Both approachesaim to minimize vulnerabilities and enhance security, but they differ in scopeand implementation:

  • Secure by default: focuses on ensuring that a system's defaultsettings are set to a secure mode, minimizing the need for users oradministrators to take actions to secure the system. This approach aims toprovide a baseline level of security for all users.
  • Secure by design: emphasizes proactively incorporating securityconsiderations throughout a system's development lifecycle. This approachis about anticipating potential threats and vulnerabilities early andmaking design choices that mitigate risks. This approach involves usingsecure coding practices, conducting security reviews, and embeddingsecurity throughout the design process. The secure-by-design approach is anoverarching philosophy that guides the development process and helps to ensurethat security isn't an afterthought but is an integral part of a system's design.

Recommendations

To implement the secure by design principle for your cloud workloads, considerthe recommendations in the following sections:

Choose system components that help to secure your workloads

This recommendation is relevant to all of thefocus areas.

A fundamental decision for effective security is the selection of robust systemcomponents—including both hardware and software components—that constituteyour platform, solution, or service. To reduce the security attack surface andlimit potential damage, you must also carefully consider the deployment patternsof these components and their configurations.

In your application code, we recommend that you use straightforward, safe, andreliable libraries, abstractions, and application frameworks in order to eliminateclasses of vulnerabilities. To scan for vulnerabilities in software libraries,you can use third-party tools. You can also useAssured Open Source Software,which helps to reduce risks to your software supply chain by using open sourcesoftware (OSS) packages that Google uses and secures.

Your infrastructure must use networking, storage, and compute options thatsupport safe operation and align with your security requirements and riskacceptance levels. Infrastructure security is important for both internet-facingand internal workloads.

For information about other Google solutions that support this recommendation,seeImplement shift-left security.

Build a layered security approach

This recommendation is relevant to the followingfocus areas:

  • AI and ML security
  • Infrastructure security
  • Identity and access management
  • Data security

We recommend that you implement security at each layer of your application andinfrastructure stack by applying a defense-in-depth approach.

Use the security features in each component of your platform. To limit accessand identify the boundaries of the potential impact (that is, theblast radius)in the event of a security incident, do the following:

  • Simplify your system's design to accommodate flexibility where possible.
  • Document the security requirements of each component.
  • Incorporate a robust secured mechanism to address resiliency and recoveryrequirements.

When you design the security layers, perform a risk assessment to determine thesecurity features that you need in order to meet internal security requirementsand external regulatory requirements. We recommend that you use anindustry-standard risk assessment framework that applies to cloud environmentsand that is relevant to your regulatory requirements. For example, the CloudSecurity Alliance (CSA) provides theCloud Controls Matrix (CCM).Your risk assessment provides you with a catalog of risks and correspondingsecurity controls to mitigate them.

When you perform the risk assessment, remember that you have a sharedresponsibility arrangement with your cloud provider. Therefore, your risks in acloud environment differ from your risks in an on-premises environment. Forexample, in an on-premises environment, you need to mitigate vulnerabilities toyour hardware stack. In contrast, in a cloud environment, the cloud providerbears these risks. Also, remember that the boundaries of shared responsibilitiesdiffer between IaaS, PaaS, and SaaS services for each cloud provider.

After you identify potential risks, you must design and create a mitigationplan that uses technical, administrative, and operational controls, as well ascontractual protections and third-party attestations. In addition, a threatmodeling method, such as theOWASP application threat modeling method,helps you to identify potential gaps and suggest actions to address the gaps.

Use hardened and attested infrastructure and services

This recommendation is relevant to all of thefocus areas.

A mature security program mitigates new vulnerabilities as described insecurity bulletins. The security program should also provide remediation to fixvulnerabilities in existing deployments and secure your VM and containerimages. You can use hardening guides that are specific to the OS and applicationof your images, as well as benchmarks like the one provided by theCenter of Internet Security (CIS).

If you use custom images for your Compute Engine VMs, you need to patch theimages yourself. Alternatively, you can use Google-providedcurated OS images,which are patched regularly. To run containers on Compute Engine VMs, useGoogle-curatedContainer-optimized OS images.Google regularly patches and updates these images.

If you use GKE, we recommend that you enablenode auto-upgrades so that Google updates your cluster nodes with the latest patches. Googlemanages GKE control planes, which are automatically updatedand patched. To further reduce the attack surface of your containers, you canusedistroless images.Distroless images are ideal for security-sensitive applications, microservices,and situations where minimizing the image size and attack surface isparamount.

For sensitive workloads, useShielded VM,which prevents malicious code from being loaded during the VM boot cycle.Shielded VM instances provide boot security, monitor integrity, and usetheVirtual Trusted Platform Module (vTPM).

To help secure SSH access,OS Login lets your employees connect to your VMs by usingIdentity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. Therefore, you don't needto manage SSH keys throughout your organization. OS Login ties anadministrator's access to their employee lifecycle, so when employees changeroles or leave your organization, their access is revoked with their account. OSLogin also supportsGoogle two-factor authentication,which adds an extra layer of security against account takeover attacks.

In GKE, application instances run within Dockercontainers. To enable a defined risk profile and to restrict employees frommaking changes to containers, ensure that your containers are stateless andimmutable. The immutability principle means that your employees don't modify thecontainer or access it interactively. If the container must be changed, youbuild a new image and redeploy that image. Enable SSH access to the underlyingcontainers only in specific debugging scenarios.

To help globally secure configurations across your environment, you can useorganization policies to set constraints or guardrails on resources that affect the behavior of yourcloud assets. For example, you can define the following organization policiesand apply them either globally across a Google Cloud organization orselectively at the level of a folder or project:

  • Disable external IP address allocation to VMs.
  • Restrict resource creation to specific geographical locations.
  • Disable the creation of Service Accounts or their keys.

Encrypt data at rest and in transit

This recommendation is relevant to the followingfocus areas:

  • Infrastructure security
  • Data security

Data encryption is a foundational control to protect sensitive information, andit's a key part ofdata governance.An effective data protection strategy includes access control, data segmentationand geographical residency, auditing,and encryption implementation that'sbased on a careful assessment of requirements.

By default, Google Cloudencrypts customer data that's stored at rest,with no action required from you. In addition to default encryption,Google Cloud provides options for envelope encryption and encryption keymanagement. You must identify the solutions that best fit your requirements forkey generation, storage, and rotation, whether you're choosing the keys for yourstorage, for compute, or for big data workloads. For example,Customer-managed encryption keys (CMEKs) can be created inCloud Key Management Service (Cloud KMS).The CMEKs can be either software-based orHSM-protected to meet your regulatory or compliance requirements, such as the need to rotateencryption keys regularly. Cloud KMS Autokey lets you automate theprovisioning and assignment of CMEKs. In addition, you can bring your own keysthat are sourced from a third-party key management system by usingCloud External Key Manager (Cloud EKM).

We strongly recommend that data be encrypted in-transit.Google encrypts and authenticates data in transit at one or more network layers when data moves outside physical boundaries thataren't controlled by Google or on behalf of Google. All VM-to-VM traffic withina VPC network and between peered VPC networks is encrypted. You can useMACsec for encryption of traffic over Cloud Interconnect connections. IPsec providesencryption for traffic overCloud VPN connections. You can protect application-to-application traffic in the cloud byusing security features likeTLS and mTLS configurations in Apigee andCloud Service Mesh for containerized applications.

By default, Google Cloud encrypts data at rest and data in transit acrossthe network. However, data isn't encrypted by default while it's in use inmemory. If your organization handles confidential data, you need to mitigate anythreats that undermine the confidentiality and integrity of either theapplication or the data in system memory. To mitigate these threats, you can useConfidential Computing, which provides a trusted execution environment for yourcompute workloads. For more information, seeConfidential VM overview.

Implement zero trust

This principle in the security pillar of theGoogle Cloud Well-Architected Framework helps you ensure comprehensive security across your cloud workloads. The principleof zero trust emphasizes the following practices:

  • Eliminating implicit trust
  • Applying the principle of least privilege to access control
  • Enforcing explicit validation of all access requests
  • Adopting anassume-breach mindset to enable continuous verification andsecurity posture monitoring

Principle overview

Thezero-trust model shifts the security focus from perimeter-based securityto an approach where no user or device is considered to be inherentlytrustworthy. Instead, every access request must be verified, regardless of itsorigin. This approach involves authenticating and authorizing every user anddevice, validating their context (location and device posture), and grantingleast privilege access to only the necessary resources.

Implementing the zero-trust model helps your organization enhance itssecurity posture by minimizing the impact of potential breaches and protectingsensitive data and applications against unauthorized access. The zero-trustmodel helps you ensure confidentiality, integrity, and availability of data andresources in the cloud.

Recommendations

To implement the zero-trust model for your cloud workloads, consider therecommendations in the following sections:

Secure your network

This recommendation is relevant to the followingfocus area:Infrastructure security.

Transitioning from conventional perimeter-based security to a zero-trust modelrequires multiple steps. Your organization might have already integrated certainzero-trust controls into its security posture. However, a zero-trust modelisn't a singular product or solution. Instead, it's a holistic integration ofmultiple security layers and best practices. This section describesrecommendations and techniques to implement zero trust for network security.

  • Access control: Enforce access controls based on user identity andcontext by using solutions likeChrome Enterprise Premium andIdentity-Aware Proxy (IAP).By doing this, you shift security from the network perimeter to individualusers and devices. This approach enables granular access control andreduces the attack surface.
  • Network security: Secure network connections between youron-premises, Google Cloud, and multicloud environments.
  • Network design: Prevent potential security risks by deleting defaultnetworks in existing projects and disabling the creation of defaultnetworks in new projects.
    • To avoid conflicts, plan your network and IP address allocationcarefully.
    • To enforce effective access control, limitthe number of Virtual Private Cloud (VPC) networks per project.
  • Segmentation: Isolate workloads but maintain centralized networkmanagement.
    • To segment your network, useShared VPC.
    • Define firewall policies and rules at the organization, folder, and VPCnetwork levels.
    • To preventdata exfiltration,establish secure perimeters around sensitive data and services by usingVPC Service Controls.
  • Perimeter security: Protect against DDoS attacks and web applicationthreats.
    • To protect against threats, useGoogle Cloud Armor.
    • Configure security policies to allow, deny, or redirect traffic at theGoogle Cloud edge.
  • Automation: Automate infrastructure provisioning by embracinginfrastructure as code (IaC) principles and by using tools like Terraform,Jenkins, andCloud Build.IaC helps to ensure consistent security configurations, simplifieddeployments, and rapid rollbacks in case of issues.
  • Secure foundation: Establish a secure application environment byusing theEnterprise foundations blueprint.This blueprint provides prescriptive guidance and automation scripts tohelp you implement security best practices and configure yourGoogle Cloud resources securely.

Verify every access attempt explicitly

This recommendation is relevant to the followingfocus areas:

  • Identity and access management
  • Security operations (SecOps)
  • Logging, auditing, and monitoring

Implement strong authentication and authorization mechanisms for any user,device, or service that attempts to access your cloud resources. Don't rely onlocation or network perimeter as a security control. Don't automatically trustany user, device, or service, even if they are already inside the network.Instead, every attempt to access resources must be rigorously authenticated andauthorized. You must implement strong identity verification measures, such asmulti-factor authentication (MFA). You must also ensure that access decisionsare based on granular policies that consider various contextual factors likeuser role, device posture, and location.

To implement this recommendation, use the following methods, tools, andtechnologies:

  • Unified identity management: Ensure consistent identity managementacross your organization by using a single identity provider (IdP).
    • Google Cloud supports federation with most IdPs, includingon-premisesActive Directory.Federation lets you extend your existing identity management infrastructureto Google Cloud and enable single sign-on (SSO) for users.
    • If you don't have an existing IdP, consider usingCloud Identity Premium orGoogle Workspace.
  • Limited service account permissions: Useservice accounts carefully, and adhere to the principle of least privilege.
    • Grant only thenecessary permissions required for each service account to perform itsdesignated tasks.
    • UseWorkload Identity Federation for applications that run on Google Kubernetes Engine (GKE) or run outsideGoogle Cloud to access resources securely.
  • Robust processes: Update your identity processes to align with cloudsecurity best practices.
    • To help ensure compliance with regulatoryrequirements, implement identity governance to track access, risks, andpolicy violations.
    • Review and update your existing processes for grantingand auditing access-control roles and permissions.
  • Strong authentication: Implement SSO for user authentication andimplement MFA for privileged accounts.
    • Google Cloud supports variousMFA methods, includingTitan Security Keys,for enhanced security.
    • For workload authentication, use OAuth 2.0 or signed JSON Web Tokens (JWTs).
  • Least privilege: Minimize the risk of unauthorized access and databreaches by enforcing the principles of least privilege and separation ofduties.
    • Avoid overprovisioning user access.
    • Consider implementing just-in-time privileged access for sensitiveoperations.
  • Logging: Enable audit logging for administrator and data accessactivities.

Monitor and maintain your network

This recommendation is relevant to the followingfocus areas:

  • Logging, auditing, and monitoring
  • Application security
  • Security operations (SecOps)
  • Infrastructure security

When you plan and implement security measures, assume that an attacker isalready inside your environment. This proactive approach involves using the followingmultiple tools and techniques to provide visibility into your network:

  • Centralized logging and monitoring: Collect andanalyze security logs from all of your cloud resources through centralizedlogging and monitoring.

    • Establish baselines for normal network behavior, detect anomalies, andidentify potential threats.
    • Continuously analyze network traffic flows to identify suspicious patternsand potential attacks.
  • Insights into network performance and security: Use tools likeNetwork Analyzer.Monitor traffic for unusual protocols, unexpected connections, or sudden spikesin data transfer, which could indicate malicious activity.

  • Vulnerability scanning and remediation: Regularly scan your network and applications for vulnerabilities.

    • UseWeb Security Scanner,which can automatically identify vulnerabilities in your Compute Engineinstances, containers, and GKE clusters.
    • Prioritize remediation based on the severity of vulnerabilities and theirpotential impact on your systems.
  • Intrusion detection: Monitor network traffic for malicious activity andautomatically block or get alerts for suspicious events by usingCloud IDS andCloud NGFW intrusion prevention service.

  • Security analysis: Consider implementingGoogle SecOps to correlate security events from various sources, provide real-time analysis ofsecurity alerts, and facilitate incident response.

  • Consistent configurations: Ensure that you have consistent securityconfigurations across your network by using configuration management tools.

Implement shift-left security

This principle in the security pillar of theGoogle Cloud Well-Architected Framework helps you identify practical controls that you can implement early in thesoftware development lifecycle to improve your security posture. It providesrecommendations that help you implement preventive security guardrails andpost-deployment security controls.

Principle overview

Shift-left security means adopting security practices early in the softwaredevelopment lifecycle. This principle has the following goals:

  • Avoid security defects before system changes are made. Implementpreventive security guardrails and adopt practices such as infrastructureas code (IaC), policy as code, and security checks in the CI/CD pipeline.You can also use other platform-specific capabilities likeOrganization Policy Service andhardened GKE clusters in Google Cloud.
  • Detect and fix security bugs early, fast, and reliably after any systemchanges are committed. Adopt practices like code reviews, post-deploymentvulnerability scanning, and security testing.

TheImplement security by design and shift-left security principles are related but they differ in scope.The security-by-design principle helps you to avoid fundamental design flaws that wouldrequire re-architecting the entire system. For example, a threat-modelingexercise reveals that the current design doesn't include an authorizationpolicy, and all users would have the same level of access without it. Shift-leftsecurity helps you to avoid implementation defects (bugs and misconfigurations)before changes are applied, and it enables fast, reliable fixes after deployment.

Recommendations

To implement the shift-left security principle for your cloud workloads,consider the recommendations in the following sections:

Adopt preventive security controls

This recommendation is relevant to the followingfocus areas:

  • Identity and access management
  • Cloud governance, risk, and compliance

Preventive security controls are crucial for maintaining a strong securityposture in the cloud. These controls help you proactively mitigate risks. Youcan prevent misconfigurations and unauthorized access to resources, enabledevelopers to work efficiently, and help ensure compliance with industrystandards and internal policies.

Preventive security controls are more effective when they're implemented byusing infrastructure as code (IaC). With IaC, preventive security controls caninclude more customized checks on the infrastructure code before changes aredeployed. When combined with automation, preventive security controls can run aspart of your CI/CD pipeline's automatic checks.

The following products and Google Cloud capabilities can help you implementpreventive controls in your environment:

IAM lets you authorizewho can act on specific resources basedon permissions. For more information, seeAccess control for organization resources with IAM.

Organization Policy Service lets you set restrictions on resources to specify how they canbe configured. For example, you can use an organization policy to do thefollowing:

In addition to using organizational policies, you can restrict access toresources by using the following methods:

  • Tags with IAM:assign a tag to a set of resources and then set the access definition forthe tag itself, rather than defining the access permissions on each resource.
  • IAM Conditions:define conditional, attribute-based access control for resources.
  • Defense in depth: use VPC Service Controls to further restrictaccess to resources.

For more information about resource management, seeDecide a resource hierarchy for your Google Cloud landing zone.

Automate provisioning and management of cloud resources

This recommendation is relevant to the followingfocus areas:

  • Application security
  • Cloud governance, risk, and compliance

Automating the provisioning and management of cloud resources and workloads ismore effective when you also adopt declarative IaC, as opposed to imperativescripting. IaC isn't a security tool or practice on its own, but it helps youto improve the security of your platform. Adopting IaC lets you createrepeatable infrastructure and provides your operations team with a known goodstate. IaC also improves the efficiency of rollbacks, audit changes, andtroubleshooting.

When combined with CI/CD pipelines and automation, IaC also gives you theability to adopt practices such aspolicy as code with tools like OPA. You canaudit infrastructure changes over time and run automatic checks on theinfrastructure code before changes are deployed.

To automate the infrastructure deployment, you can use tools likeConfig Controller,Terraform, Jenkins, andCloud Build.To help you build a secure application environment using IaC and automation,Google Cloud provides theenterprise foundations blueprint.This blueprint is Google's opinionated design that follows all of ourrecommended practices and configurations. The blueprint provides step-by-stepinstructions to configure and deploy your Google Cloud topology by usingTerraform and Cloud Build.

You can modify the scripts of theenterprise foundations blueprint to configure an environment that follows Googlerecommendations and meets your own security requirements. You can further buildon the blueprint with additional blueprints or design your own automation. TheGoogle Cloud Architecture Center provides other blueprints that can beimplemented on top of the enterprise foundations blueprint. The following are afew examples of these blueprints:

Automate secure application releases

This recommendation is relevant to the followingfocus area:Application security.

Without automated tools, it can be difficult to deploy, update, and patchcomplex application environments to meet consistent security requirements. Werecommend that you build automated CI/CD pipelines for your software developmentlifecycle (SDLC). Automated CI/CD pipelines help you to remove manual errors,provide standardized development feedback loops, and enable efficient productiterations. Continuous delivery is one of the best practices that theDORA framework recommends.

Automating application releases by using CI/CD pipelines helps to improve yourability to detect and fix security bugs early, fast, and reliably. For example,you can scan for security vulnerabilities automatically when artifacts arecreated, narrow the scope of security reviews, and roll back to a known and safeversion. You can also define policies for different environments (such asdevelopment, test, or production environments) so that only verified artifactsare deployed.

To help you automate application releases and embed security checks in yourCI/CD pipeline, Google Cloud provides multiple tools includingCloud Build,Cloud Deploy,Web Security Scanner,andBinary Authorization.

To establish a process that verifies multiple security requirements in your SDLC,use theSupply-chain Levels for Software Artifacts (SLSA) framework, which has been defined by Google. SLSA requires securitychecks for source code, build process, and code provenance. Many of theserequirements can be included in an automated CI/CD pipeline. To understand howGoogle applies these practices internally, seeGoogle Cloud's approach to change.

Ensure that application deployments follow approved processes

This recommendation is relevant to the followingfocus area:Application security.

If an attacker compromises your CI/CD pipeline, your entire application stackcan be affected. To help secure the pipeline, you should enforce an establishedapproval process before you deploy the code into production.

If you use Google Kubernetes Engine (GKE) orCloud Run, you can establish an approval process by usingBinary Authorization.Binary Authorization attaches configurable signatures to container images.These signatures (also calledattestations) help to validate the image. Atdeployment time, Binary Authorization uses these attestations to determinewhether a process was completed. For example, you can use Binary Authorizationto do the following:

  • Verify that a specific build system or CI pipeline created a containerimage.
  • Validate that a container image is compliant with a vulnerabilitysigning policy.
  • Verify that a container image passes the criteria for promotion to thenext deployment environment, such as from development to QA.

By using Binary Authorization, you can enforce that only trusted code runs onyour target platforms.

Scan for known vulnerabilities before application deployment

This recommendation is relevant to the followingfocus area:Application security.

We recommend that you use automated tools that can continuously performvulnerability scans on application artifacts before they're deployed toproduction.

For containerized applications, useArtifact Analysis to automatically run vulnerability scans for container images.Artifact Analysis scans new images when they're uploaded toArtifact Registry.The scan extracts information about the system packages in the container. Afterthe initial scan, Artifact Analysis continuously monitors the metadataof scanned images in Artifact Registry for new vulnerabilities. WhenArtifact Analysis receives new and updated vulnerability informationfromvulnerability sources,it does the following:

  • Updates the metadata of the scanned images to keep them up to date.
  • Creates new vulnerability occurrences for new notes.
  • Deletes vulnerability occurrences that are no longer valid.

Monitor your application code for known vulnerabilities

This recommendation is relevant to the followingfocus area:Application security.

Use automated tools to constantly monitor your application code for knownvulnerabilities such as theOWASP Top 10.For more information about Google Cloud products and features that supportOWASP Top 10 mitigation techniques, seeOWASP Top 10 mitigation options on Google Cloud.

UseWeb Security Scanner to help identify security vulnerabilities in your App Engine,Compute Engine, and GKE web applications. The scannercrawls your application, follows all of the links within the scope of yourstarting URLs, and attempts to exercise as many user inputs and event handlersas possible. It can automatically scan for and detect common vulnerabilities,includingcross-site scripting,code injection,mixed content,and outdated or insecure libraries. Web Security Scanner provides earlyidentification of these types of vulnerabilities without distracting you withfalse positives.

In addition, if you use GKE to manage fleets ofKubernetes clusters, the security posture dashboard shows opinionated,actionable recommendations to help improve your fleet's security posture.

Implement preemptive cyber defense

This principle in the security pillar of theGoogle Cloud Well-Architected Framework provides recommendations to build robust cyber-defense programs as part of youroverall security strategy.

This principle emphasizes the use of threat intelligence to proactively guideyour efforts across the core cyber-defense functions, as defined inThe Defender's Advantage: A guide to activating cyber defense.

Principle overview

When you defend your system against cyber attacks, you have a significant,underutilized advantage against attackers. Asthe founder of Mandiant states,"You should know more about your business, your systems, your topology, yourinfrastructure than any attacker does. This is an incredible advantage." Tohelp you use this inherent advantage, this document provides recommendationsabout proactive and strategic cyber-defense practices that are mapped to theDefender's Advantage framework.

Recommendations

To implement preemptive cyber defense for your cloud workloads, consider therecommendations in the following sections:

Integrate the functions of cyber defense

This recommendation is relevant to all of thefocus areas.

The Defender's Advantage framework identifies six critical functions ofcyber defense:Intelligence,Detect,Respond,Validate,Hunt, andMission Control. Each function focuses on a unique part of thecyber-defense mission, but these functions must be well-coordinated andwork together to provide an effective defense. Focus on building a robustand integrated system where each function supports the others. If you needa phased approach for adoption, consider the following suggested order.Depending on your current cloud maturity, resource topology, and specificthreat landscape, you might want to prioritize certain functions.

  1. Intelligence: The Intelligence function guides all the otherfunctions. Understanding the threat landscape—including the most likelyattackers, their tactics, techniques, and procedures (TTPs), and thepotential impact—is critical to prioritizing actions across the entireprogram. The Intelligence function is responsible for stakeholderidentification, definition of intelligence requirements, data collection,analysis and dissemination, automation, and the creation of a cyberthreat profile.
  2. Detect and Respond: These functions make up the core of activedefense, which involves identifying and addressing malicious activity.These functions are necessary to act on the intelligence that's gathered bythe intelligence function. The Detect function requires a methodicalapproach that aligns detections to attacker TTPs and ensures robustlogging. The Respond function must focus on initial triage, datacollection, and incident remediation.
  3. Validate: The Validate function is a continuous process thatprovides assurance that your security control ecosystem is up-to-date andoperating as designed. This function ensures that your organizationunderstands the attack surface, knows where vulnerabilities exist, andmeasures the effectiveness of controls. Security validation is also animportant component of the detection engineering lifecycle and must be usedto identify detection gaps and create new detections.
  4. Hunt: The Hunt function involves proactively searching for activethreats within an environment. This function must be implemented when yourorganization has a baseline level of maturity in the Detect and Respondfunctions. The Hunt function expands the detection capabilities and helpsto identify gaps and weaknesses in controls. The Hunt function must bebased on specific threats. This advanced function benefits from afoundation of robust intelligence, detection, and response capabilities.
  5. Mission Control: The Mission Control function acts as the centralhub that connects all of the other functions. This function isresponsible for strategy, communication, and decisive action across yourcyber-defense program. It ensures that all of the functions are workingtogether and that they're aligned with your organization's business goals.You must focus on establishing a clear understanding of the purpose of theMission Control function before you use it to connect the other functions.

Use the Intelligence function in all aspects of cyber defense

This recommendation is relevant to all of thefocus areas.

This recommendation highlights the Intelligence function as a core part of a strongcyber-defense program. Threat intelligence provides knowledge about threatactors, their TTPs, and indicators of compromise (IOCs). This knowledge shouldinform and prioritize actions across all cyber-defense functions. Anintelligence-driven approach helps you align defenses to meet the threats thatare most likely to affect your organization. This approach also helps withefficient allocation and prioritization of resources.

The following Google Cloud products and features help you take advantageof threat intelligence to guide your security operations. Use these features toidentify and prioritize potential threats, vulnerabilities, and risks, and thenplan and implement appropriate actions.

  • Google Security Operations (Google SecOps) helps you store and analyze security data centrally. UseGoogle SecOps to map logs into a common model, enrich thelogs, and link the logs to timelines for a comprehensive view of attacks.You can also create detection rules, set up IoC matching, and performthreat-hunting activities. The platform also provides curated detections,which are predefined and managed rules to help identify threats.Google SecOps can also integrate withMandiant frontline intelligence.Google SecOps uniquely integrates industry-leading AI, alongwiththreat intelligence from Mandiant andGoogle VirusTotal.This integration is critical for threat evaluation and understanding who istargeting your organization and the potential impact.

  • Security Command Center Enterprise, which is powered by Google AI, enables security professionals toefficiently assess, investigate, and respond to security issues acrossmultiple cloud environments. The security professionals who can benefitfrom Security Command Center include security operations center (SOC) analysts,vulnerability and posture analysts, and compliance managers. Security Command CenterEnterprise enriches security data, assesses risk, and prioritizesvulnerabilities. This solution provides teams with the information that theyneed to address high-risk vulnerabilities and to remediate active threats.

  • Chrome Enterprise Premium offers threat and data protection, which helps to protect users fromexfiltration risks and prevents malware from getting ontoenterprise-managed devices. Chrome Enterprise Premium also provides visibility intounsafe or potentially unsafe activity that can happen within the browser.

  • Network monitoring, through tools likeNetwork Intelligence Center,provides visibility into network performance. Network monitoring can alsohelp you detect unusual traffic patterns or detect data transfer amountsthat might indicate an attack or data exfiltration attempt.

Understand and capitalize on your defender's advantage

This recommendation is relevant to all of thefocus areas.

As mentioned earlier, you have an advantage over attackers when you havea thorough understanding of your business, systems, topology, andinfrastructure. To capitalize on this knowledge advantage, utilize this dataabout your environments during cyberdefense planning.

Google Cloud provides the following features to help you proactively gainvisibility to identify threats, understand risks, and respond in a timely mannerto mitigate potential damage:

  • Chrome Enterprise Premium helps you enhance security for enterprise devices byprotecting users from exfiltration risks. It extendsSensitive Data Protection services into the browser, and prevents malware. It also offers features likeprotection against malware and phishing to help prevent exposure to unsafecontent. In addition, it gives you control over the installation ofextensions to help prevent unsafe or unvetted extensions. Thesecapabilities help you establish a secure foundation for your operations.

  • Security Command Center Enterprise provides a continuousrisk engine that offers comprehensive and ongoing risk analysis and management. The riskengine feature enriches security data, assesses risk, and prioritizesvulnerabilities to help fix issues quickly. Security Command Center enables yourorganization to proactively identify weaknesses and implement mitigations.

  • Google SecOps centralizes security data and providesenriched logs with timelines. This enables defenders to proactivelyidentify active compromises and adapt defenses based on attackers' behavior.

  • Network monitoring helps identify irregular network activity that mightindicate an attack and it provides early indicators that you can use to takeaction. To help proactively protect your data from theft, continuously monitorfor data exfiltration and use the provided tools.

Validate and improve your defenses continuously

This recommendation is relevant to all of thefocus areas.

This recommendation emphasizes the importance of targeted testing andcontinuous validation of controls to understand strengths and weaknessesacross the entire attack surface. This includes validating theeffectiveness of controls, operations, and staff through methods like thefollowing:

You must also actively search for threats and use the results to improvedetection and visibility. Use the following tools to continuously test andvalidate your defenses against real-world threats:

  • Security Command Center Enterprise provides a continuous risk engine to evaluatevulnerabilities and prioritize remediation, which enables ongoingevaluation of your overall security posture. By prioritizing issues,Security Command Center Enterprise helps you to ensure that resources are used effectively.

  • Google SecOps offers threat-hunting and curateddetections that let you proactively identify weaknesses in your controls.This capability enables continuous testing and improvement of your abilityto detect threats.

  • Chrome Enterprise Premium provides threat and data protection features thatcan help you to address new and evolving threats, and continuously updateyour defenses against exfiltration risks and malware.

  • Cloud Next Generation Firewall (Cloud NGFW) provides network monitoringand data-exfiltration monitoring. These capabilities can help you to validatethe effectiveness of your current security posture and identify potentialweaknesses. Data-exfiltration monitoring helps you to validate the strengthof your organization's data protection mechanisms and make proactiveadjustments where necessary. When you integrate threat findings fromCloud NGFW with Security Command Center and Google SecOps,you can optimize network-based threat detection, optimize threat response,and automate playbooks. For more information about this integration, seeUnifying Your Cloud Defenses: Security Command Center & Cloud NGFW Enterprise.

Manage and coordinate cyber-defense efforts

This recommendation is relevant to all of thefocus areas.

As described earlier inIntegrate the functions of cyber defense,the Mission Control function interconnects the other functions of thecyber-defense program. This function enables coordination and unifiedmanagement across the program. It also helps you coordinate with other teamsthat don't work on cybersecurity. The Mission Control function promotesempowerment and accountability, facilitates agility and expertise, and drivesresponsibility and transparency.

The following products and features can help you implement the Mission Controlfunction:

  • Security Command Center Enterprise acts as a central hub for coordinating andmanaging your cyber-defense operations. It brings tools, teams, and datatogether, along with the built-in Google SecOps responsecapabilities. Security Command Center provides clear visibility into yourorganization's security state and enables the identification of securitymisconfigurations across different resources.
  • Google SecOps provides a platform for teams to respond tothreats by mapping logs and creating timelines. You can also definedetection rules and search for threats.
  • Google Workspace and Chrome Enterprise Premium help you to manage and control end-user access tosensitive resources. You can define granular access controls based on useridentity and the context of a request.
  • Network monitoring provides insights into the performance of networkresources. You can import network monitoring insights into Security Command Centerand Google SecOps for centralized monitoring and correlationagainst other timeline based data points. This integration helps you todetect and respond to potential network usage changes caused by nefariousactivity.
  • Data-exfiltration monitoring helps to identify possible data lossincidents. With this feature, you can efficiently mobilize an incidentresponse team, assess damages, and limit further data exfiltration. You canalso improve current policies and controls to ensure data protection.

Product summary

The following table lists the products and features that are described in thisdocument and maps them to the associated recommendations and securitycapabilities.

Google Cloud productApplicable recommendations
Google SecOpsUse the Intelligence function in all aspects of cyber defense: Enables threat hunting and IoC matching, and integrates with Mandiant for comprehensive threat evaluation.

Understand and capitalize on your defender's advantage: Provides curated detections and centralizes security data for proactive compromise identification.

Validate and improve your defenses continuously: Enables continuous testing and improvement of threat detection capabilities.

Manage and coordinate cyber-defense efforts through Mission Control: Provides a platform for threat response, log analysis, and timeline creation.

Security Command Center EnterpriseUse the Intelligence function in all aspects of cyber defense: Uses AI to assess risk, prioritize vulnerabilities, and provide actionable insights for remediation.

Understand and capitalize on your defender's advantage: Offers comprehensive risk analysis, vulnerability prioritization, and proactive identification of weaknesses.

Validate and improve your defenses continuously: Provides ongoing security posture evaluation and resource prioritization.

Manage and coordinate cyber-defense efforts through Mission Control: Acts as a central hub for managing and coordinating cyber-defense operations.

Chrome Enterprise PremiumUse the Intelligence function in all aspects of cyber defense: Protects users from exfiltration risks, prevents malware, and provides visibility into unsafe browser activity.

Understand and capitalize on your defender's advantage: Enhances security for enterprise devices through data protection, malware prevention, and control over extensions.

Validate and improve your defenses continuously: Addresses new and evolving threats through continuous updates to defenses against exfiltration risks and malware.

Manage and coordinate cyber-defense efforts through Mission Control: Manage and control end-user access to sensitive resources, including granular access controls.

Google WorkspaceManage and coordinate cyber-defense efforts through Mission Control: Manage and control end-user access to sensitive resources, including granular access controls.
Network Intelligence CenterUse the Intelligence function in all aspects of cyber defense: Provides visibility into network performance and detects unusual traffic patterns or data transfers.
Cloud NGFWValidate and improve your defenses continuously: Optimizes network-based threat detection and response through integration with Security Command Center and Google SecOps.

Use AI securely and responsibly

This principle in the security pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you secure your AI systems. Theserecommendations are aligned with Google'sSecure AI Framework (SAIF),which provides a practical approach to address the security and risk concerns ofAI systems. SAIF is a conceptual framework that aims to provide industry-widestandards for building and deploying AI responsibly.

Principle overview

To help ensure that your AI systems meet your security, privacy, and compliancerequirements, you must adopt a holistic strategy that starts with the initialdesign and extends to deployment and operations. You can implement this holisticstrategy by applying thesix core elements of SAIF.

Google uses AI to enhance security measures, such as identifying threats,automating security tasks, and improving detection capabilities, while keepinghumans in the loop for critical decisions.

Google emphasizes a collaborative approach to advancing AI security. Thisapproach involves partnering with customers, industries, and governments toenhance the SAIF guidelines and offer practical, actionable resources.

The recommendations to implement this principle are grouped within thefollowing sections:

Recommendations to use AI securely

To use AI securely, you need both foundational security controls andAI-specific security controls. This section provides an overview ofrecommendations to ensure that your AI and ML deployments meet the security,privacy, and compliance requirements of your organization.For an overview of architectural principles and recommendations that are specific to AIand ML workloads in Google Cloud, see theAI and ML perspectivein the Well-Architected Framework.

Define clear goals and requirements for AI usage

This recommendation is relevant to the followingfocus areas:

  • Cloud governance, risk, and compliance
  • AI and ML security

This recommendation aligns with the SAIF element about contextualizing AIsystem risks in the surrounding business processes. When you design and evolveAI systems, it's important to understand your specific business goals, risks, andcompliance requirements.

Keep data secure and prevent loss or mishandling

This recommendation is relevant to the followingfocus areas:

  • Infrastructure security
  • Identity and access management
  • Data security
  • Application security
  • AI and ML security

This recommendation aligns with the following SAIF elements:

  • Expand strong security foundations to the AI ecosystem. This elementincludes data collection, storage, access control, and protection againstdata poisoning.
  • Contextualize AI system risks. Emphasize data security to supportbusiness objectives and compliance.

Keep AI pipelines secure and robust against tampering

This recommendation is relevant to the followingfocus areas:

  • Infrastructure security
  • Identity and access management
  • Data security
  • Application security
  • AI and ML security

This recommendation aligns with the following SAIF elements:

  • Expand strong security foundations to the AI ecosystem. As a keyelement of establishing a secure AI system, secure your code and modelartifacts.
  • Adapt controls for faster feedback loops. Because it's important formitigation and incident response, track your assets and pipeline runs.

Deploy apps on secure systems using secure tools and artifacts

This recommendation is relevant to the followingfocus areas:

  • Infrastructure security
  • Identity and access management
  • Data security
  • Application security
  • AI and ML security

Using secure systems and validated tools and artifacts in AI-based applicationsaligns with the SAIF element about expanding strong security foundations to theAI ecosystem and supply chain. This recommendation can be addressed through thefollowing steps:

Protect and monitor inputs

This recommendation is relevant to the followingfocus areas:

  • Logging, auditing, and monitoring
  • Security operations
  • AI and ML security

This recommendation aligns with the SAIF element about extending detection andresponse to bring AI into an organization's threat universe. To prevent issues,it's critical to manage prompts for generative AI systems, monitor inputs, andcontrol user access.

Recommendations for AI governance

All of the recommendations in this section are relevant to the followingfocus area:Cloud governance, risk, and compliance.

Google Cloud offers a robust set of tools and services that you can useto build responsible and ethical AI systems. We also offer a framework ofpolicies, procedures, and ethical considerations that can guide the development,deployment, and use of AI systems.

As reflected in our recommendations, Google's approach for AI governance isguided by the following principles:

  • Fairness
  • Transparency
  • Accountability
  • Privacy
  • Security

Use fairness indicators

Vertex AI can detect bias during the data collection or post-training evaluation process.Vertex AI providesmodel evaluation metrics likedata bias andmodel bias to help you evaluate your model for bias.

These metrics are related to fairness across different categories like race,gender, and class. However, interpreting statistical deviations isn't astraightforward exercise, because differences across categories might not be aresult of bias or a signal of harm.

Use Vertex Explainable AI

To understand how the AI models make decisions, use Vertex Explainable AI. Thisfeature helps you to identify potential biases that might be hidden in themodel's logic.

This explainability feature is integrated withBigQuery ML andVertex AI,which provide feature-based explanations. You can either perform explainabilityin BigQuery ML orregister your model in Vertex AI and perform explainability inVertex AI.

Track data lineage

Track the origin and transformation of data that's used in your AI systems.This tracking helps you understand the data's journey and identify potentialsources of bias or error.

Data lineage is a Dataplex Universal Catalog feature that lets you track how data moves through yoursystems: where it comes from, where it's passed to, and what transformations areapplied to it.

Establish accountability

Establish clear responsibility for the development, deployment, and outcomes ofyour AI systems.

UseCloud Logging to log key events and decisions made by your AI systems. The logs provide anaudit trail to help you understand how the system is performing and identifyareas for improvement.

UseError Reporting to systematically analyze errors made by the AI systems. This analysis canreveal patterns that point to underlying biases or areas where the model needsfurther refinement.

Implement differential privacy

During model training,add noise to the data in order to make it difficult to identify individual data points butstill enable the model to learn effectively. WithSQL in BigQuery,you can transform the results of a query with differentially privateaggregations.

Use AI for security

This principle in the security pillar of theGoogle Cloud Well-Architected Framework provides recommendations to use AI to help you improve the security of yourcloud workloads.

Because of the increasing number and sophistication of cyber attacks, it'simportant to take advantage of AI's potential to help improve security. AI canhelp to reduce the number of threats, reduce the manual effort required by securityprofessionals, and help compensate for the scarcity of experts in the cyber-securitydomain.

Principle overview

Use AI capabilities to improve your existing security systems and processes.You can useGemini in Security as well as the intrinsic AI capabilities that are built into Google Cloud services.

These AI capabilities can transform security by providing assistance acrossevery stage of the security lifecycle. For example, you can use AI to do thefollowing:

  • Analyze and explain potentially malicious code without reverse engineering.
  • Reduce repetitive work for cyber-security practitioners.
  • Use natural language to generate queries and interact with securityevent data.
  • Surface contextual information.
  • Offer recommendations for quick responses.
  • Aid in the remediation of events.
  • Summarize high-priority alerts for misconfigurations andvulnerabilities, highlight potential impacts, and recommend mitigations.

Levels of security autonomy

AI and automation can help you achieve better security outcomes when you'redealing with ever-evolving cyber-security threats. By using AI for security, youcan achieve greater levels of autonomy to detect and prevent threats and improveyour overall security posture. Google defines fourlevels of autonomy when you use AI for security, and they outline the increasing role of AI inassisting and eventually leading security tasks:

  1. Manual: Humans run all of the security tasks (prevent, detect,prioritize, and respond) across the entire security lifecycle.
  2. Assisted: AI tools, like Gemini, boost humanproductivity by summarizing information, generating insights, and makingrecommendations.
  3. Semi-autonomous: AI takes primary responsibility for many securitytasks and delegates to humans only when required.
  4. Autonomous: AI acts as a trusted assistant that drives the securitylifecycle based on your organization's goals and preferences, with minimalhuman intervention.

Recommendations

The following sections describe the recommendations for using AI for security.The sections also indicate how the recommendations align with Google's Secure AIFramework (SAIF)core elements and how they're relevant to thelevels of security autonomy.

Note: For more information about Google Cloud's overall vision forusing Gemini across our products to accelerate AI for security,see the whitepaperGoogle Cloud's Product Vision for AI-Powered Security.

Enhance threat detection and response with AI

This recommendation is relevant to the followingfocus areas:

  • Security operations (SecOps)
  • Logging, auditing, and monitoring

AI can analyze large volumes of security data, offer insights into threat actorbehavior, and automate the analysis of potentially malicious code. Thisrecommendation is aligned with the following SAIF elements:

  • Extend detection and response to bring AI into your organization'sthreat universe.
  • Automate defenses to keep pace with existing and new threats.

Depending on your implementation, this recommendation can be relevant to thefollowing levels of autonomy:

  • Assisted: AI helps with threat analysis and detection.
  • Semi-autonomous: AI takes on more responsibility for the security task.

Google Threat Intelligence,which uses AI to analyze threat actor behavior and malicious code, can help youimplement this recommendation.

Simplify security for experts and non-experts

This recommendation is relevant to the followingfocus areas:

  • Security operations (SecOps)
  • Cloud governance, risk, and compliance

AI-powered tools can summarize alerts and recommend mitigations, and thesecapabilities can make security more accessible to a wider range of personnel.This recommendation is aligned with the following SAIF elements:

  • Automate defenses to keep pace with existing and new threats.
  • Harmonize platform-level controls to ensure consistent security acrossthe organization.

Depending on your implementation, this recommendation can be relevant to thefollowing levels of autonomy:

  • Assisted: AI helps you to improve the accessibility of securityinformation.
  • Semi-autonomous: AI helps to make security practices more effectivefor all users.

Gemini inSecurity Command Center can provide summaries of alerts for misconfigurations and vulnerabilities.

Automate time-consuming security tasks with AI

This recommendation is relevant to the followingfocus areas:

  • Infrastructure security
  • Security operations (SecOps)
  • Application security

AI can automate tasks such as analyzing malware, generating security rules, andidentifying misconfigurations. These capabilities can help to reduce theworkload on security teams and accelerate response times. This recommendation isaligned with the SAIF element about automating defenses to keep pace withexisting and new threats.

Depending on your implementation, this recommendation can be relevant to thefollowing levels of autonomy:

  • Assisted: AI helps you to automate tasks.
  • Semi-autonomous: AI takes primary responsibility for security tasks,and only requests human assistance when needed.

Gemini inGoogle SecOps can help to automate high-toil tasks by assisting analysts, retrieving relevantcontext, and making recommendations for next steps.

Incorporate AI into risk management and governance processes

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

You can use AI to build a model inventory and risk profiles. You can also use AIto implement policies for data privacy, cyber risk, and third-party risk. Thisrecommendation is aligned with the SAIF element about contextualizing AI systemrisks in surrounding business processes.

Depending on your implementation, this recommendation can be relevant to thesemi-autonomous level of autonomy. At this level, AI can orchestrate securityagents that run processes to achieve your custom security goals.

Implement secure development practices for AI systems

This recommendation is relevant to the followingfocus areas:

  • Application security
  • AI and ML security

You can use AI for secure coding, cleaning training data, and validating toolsand artifacts. This recommendation is aligned with the SAIF element aboutexpanding strong security foundations to the AI ecosystem.

This recommendation can be relevant to all levels of security autonomy, becausea secure AI system needs to be in place before AI can be used effectively forsecurity. The recommendation is most relevant to the assisted level, wheresecurity practices are augmented by AI.

To implement this recommendation, follow theSupply-chain Levels for Software Artifacts (SLSA) guidelines for AI artifacts and use validated container images.

Meet regulatory, compliance, and privacy needs

This principle in the security pillar of theGoogle Cloud Well-Architected Framework helps you identify and meet regulatory, compliance, and privacy requirements forcloud deployments. These requirements influence many of the decisions that you needto make about the security controls that must be used for your workloads inGoogle Cloud.

Principle overview

Meeting regulatory, compliance, and privacy needs is an unavoidable challengefor all businesses. Cloud regulatory requirements depend on several factors,including the following:

  • The laws and regulations that apply to your organization's physical locations
  • The laws and regulations that apply to your customers' physical locations
  • Your industry's regulatory requirements

Privacy regulations define how you can obtain, process, store, and manage yourusers' data. You own your own data, including the data that you receive from yourusers. Therefore, many privacy controls are your responsibility, including controlsfor cookies, session management, and obtaining user permission.

The recommendations to implement this principle are grouped within the followingsections:

Recommendations to address organizational risks

This section provides recommendations to help you identify and address risks toyour organization.

Identify risks to your organization

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Before you create and deploy resources on Google Cloud, complete a riskassessment. This assessment should determine the security features that you needto meet your internal security requirements and external regulatoryrequirements.

Your risk assessment provides you with a catalog of organization-specific risks,and informs you about your organization's capability to detect and counteractsecurity threats. You must perform a risk analysis immediately after deploymentand whenever there are changes in your business needs, regulatory requirements,or threats to your organization.

As mentioned in theImplement security by design principle, your security risks in a cloud environment differ from on-premisesrisks. This difference is due to the shared responsibility model in the cloud,which varies by service (IaaS, PaaS, or SaaS) and your usage. Use acloud-specific risk assessment framework like theCloud Controls Matrix (CCM).Use threat modeling, likeOWASP application threat modeling,to identify and address vulnerabilities. For expert help with risk assessments,contact your Google account representative or consult Google Cloud'spartner directory.

After you catalog your risks, you must determine how to address them—that is,whether you want to accept, avoid, transfer, or mitigate the risks. Formitigation controls that you can implement, see the next section about mitigatingyour risks.

Mitigate your risks

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

When you adopt new public cloud services, you can mitigate risks by usingtechnical controls, contractual protections, and third-party verifications orattestations.

Technical controls are features and technologies that you use to protect yourenvironment. These include built-in cloud security controls like firewalls andlogging. Technical controls can also include using third-party tools toreinforce or support your security strategy. There are two categories oftechnical controls:

  • You can implement Google Cloud's security controls to help youmitigate the risks that apply to your environment. For example, you cansecure the connection between your on-premises networks and your cloudnetworks by usingCloud VPN andCloud Interconnect.
  • Google has robust internal controls and auditing to protect againstinsider access to customer data. Our audit logs provide you with nearreal-time logs ofGoogle administrator access on Google Cloud.

Contractual protections refer to the legal commitments made by us regardingGoogle Cloud services. Google is committed to maintaining and expandingour compliance portfolio. TheCloud Data Processing Addendum (CDPA) describes our commitments with regard to the processing and security of yourdata. The CDPA also outlines the access controls that limit Google supportengineers' access to customers' environments, and it describes our rigorouslogging and approval process. We recommend that you review Google Cloud'scontractual controls with your legal and regulatory experts, and verify thatthey meet your requirements. If you need more information,contact your technical account representative.

Third-party verifications or attestations refer to having a third-partyvendor audit the cloud provider to ensure that the provider meets compliancerequirements. For example, to learn about Google Cloud attestations withregard to the ISO/IEC 27017 guidelines, seeISO/IEC 27017 - Compliance.To view the current Google Cloud certifications and letters ofattestation, seeCompliance resource center.

Recommendations to address regulatory and compliance obligations

A typical compliance journey has three stages: assessment, gap remediation, andcontinual monitoring. This section provides recommendations that you can useduring each of these stages.

Assess your compliance needs

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Compliance assessment starts with a thorough review of all of your regulatoryobligations and how your business is implementing them. To help you with yourassessment of Google Cloud services, use theCompliance resource center.This site provides information about the following:

  • Service support for various regulations
  • Google Cloud certifications and attestations

To better understand the compliance lifecycle at Google and how yourrequirements can be met, you cancontact sales to request help from a Google compliance specialist. Or, you can contact yourGoogle Cloud account manager to request a compliance workshop.

For more information about tools and resources that you can use to managesecurity and compliance for Google Cloud workloads, seeAssuring Compliance in the Cloud.

Automate implementation of compliance requirements

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

To help you stay in compliance with changing regulations, determine whether youcan automate how you implement compliance requirements. You can use bothcompliance-focused capabilities that Google Cloud provides and blueprintsthat use recommended configurations for a particular compliance regime.

Assured Workloads builds on the controls within Google Cloud to help you meet yourcompliance obligations. Assured Workloads lets you do the following:

  • Select your compliance regime. Then, the tool automatically sets thebaseline personnel access controls for the selected regime.
  • Set the location for your data by using organization policies so thatyour data at rest and your resources remain only in thatregion.
  • Select the key-management option (such as the key rotation period) thatbest meets your security and compliance requirements.
  • Select the access criteria for Google support personnel to meet certainregulatory requirements such as FedRAMP Moderate. For example, you canselect whether Google support personnel have completed the appropriatebackground checks.
  • Use Google-owned and Google-managed encryption keys that are FIPS-140-2compliant and support FedRAMP Moderate compliance. For an added layer ofcontrol and for the separation of duties, you can use customer-managedencryption keys (CMEK). For more information about keys, seeEncrypt data at rest and in transit.

In addition to Assured Workloads, you can use Google Cloudblueprints that are relevant to your compliance regime. You can modify theseblueprints to incorporate your security policies into your infrastructuredeployments.

To help you build an environment that supports your compliance requirements,Google's blueprints and solution guides include recommended configurations andprovide Terraform modules. The following table lists blueprints that addresssecurity and alignment with compliance requirements.

RequirementBlueprints and solution guides
FedRAMP
HIPAA

Monitor your compliance

This recommendation is relevant to the followingfocus areas:

  • Cloud governance, risk, and compliance
  • Logging, monitoring, and auditing

Most regulations require that you monitor particular activities, which includeaccess-related activities. To help with your monitoring, you can use thefollowing:

  • Access Transparency:View near real-time logs when Google Cloud administrators access yourcontent.
  • Firewall Rules Logging:Record TCP and UDP connections inside a VPC network for any rules that youcreate. These logs can be useful for auditing network access or forproviding early warning that the network is being used in an unapproved manner.
  • VPC Flow Logs:Record network traffic flows that are sent or received by VM instances.
  • Security Command Center Premium:Monitor for compliance with various standards.
  • OSSEC (or another open source tool): Log the activity of individuals who haveadministrator access to your environment.
  • Key Access Justifications:View the reasons for a key-access request.
  • Security Command Center notifications:Get alerts when noncompliance issues occur.For example, get alerts when users disable two-step verification or whenservice accounts are over-privileged.You can also set up automatic remediation for specific notifications.

Recommendations to manage your data sovereignty

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Data sovereignty provides you with a mechanism to prevent Google from accessingyour data. You approve access only for provider behaviors that you agree arenecessary. For example, you can manage your data sovereignty in the followingways:

Manage your operational sovereignty

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Operational sovereignty provides you with assurances that Google personnel can'tcompromise your workloads. For example, you can manage operational sovereigntyin the following ways:

Manage software sovereignty

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Software sovereignty provides you with assurances that you can control theavailability of your workloads and run them wherever you want. Also, you canhave this control without being dependent or locked in with a single cloudprovider. Software sovereignty includes the ability to survive events thatrequire you to quickly change where your workloads are deployed and what levelof outside connection is allowed.

For example, to help you manage your software sovereignty, Google Cloudsupportshybrid and multicloud deployments.If you choose on-premises deployments for datasovereignty reasons,Google Distributed Cloud is a combination of hardware and software that brings Google Cloud intoyour data center.

Recommendations to address privacy requirements

Google Cloud includes the following controls that promote privacy:

  • Default encryption of all data when it's at rest, when it's in transit,and while it's being processed.
  • Safeguards against insider access.
  • Support for numerous privacy regulations.

The following recommendations address additional controls that you canimplement. For more information, seePrivacy Resource Center.

Control data residency

This recommendation is relevant to the followingfocus area:Cloud governance, risk, and compliance.

Data residency describes where your data is stored at rest. Data residencyrequirements vary based on system design objectives, industry regulatoryconcerns, national law, tax implications, and even culture.

Controlling data residency starts with the following:

  • Understand your data type and its location.
  • Determine what risks exist for your data and which laws and regulationsapply.
  • Control where your data is stored or where it goes.

To help you comply with data residency requirements, Google Cloud letsyou control where your data is stored, how it's accessed, and how it'sprocessed. You can useresource location policies to restrict where resources are created and to limit where data is replicatedbetween regions. You can use the location property of a resource to identifywhere the service is deployed and who maintains it. For more information, seeResource locations supported services.

Classify your confidential data

This recommendation is relevant to the followingfocus area:Data security.

You must define what data is confidential, and then ensure that the confidentialdata is properly protected. Confidential data can include credit card numbers,addresses, phone numbers, and other personally identifiable information (PII).UsingSensitive Data Protection,you can set up appropriate classifications. You can then tag and tokenize yourdata before you store it in Google Cloud. Additionally,Dataplex Universal Catalog offers acatalog service that provides a platform for storing, managing, and accessing yourmetadata. For more information and an example of data classification andde-identification, seeDe-identification and re-identification of PII using Sensitive Data Protection.

Lock down access to sensitive data

This recommendation is relevant to the followingfocus areas:

  • Data security
  • Identity and access management

Place sensitive data in its own service perimeter by usingVPC Service Controls.VPC Service Controls improves your ability to mitigate the risk of unauthorizedcopying or transferring of data (data exfiltration) from Google-managedservices. With VPC Service Controls, you configure security perimeters around theresources of your Google-managed services to control the movement of data acrossthe perimeter. SetGoogle Identity and Access Management (IAM) access controls for that data. Configuremultifactor authentication (MFA) for all users who require access to sensitive data.

Shared responsibilities and shared fate on Google Cloud

This document describes the differences between the shared responsibility modeland shared fate in Google Cloud. It discusses the challenges and nuancesof the shared responsibility model. This document describes what shared fate isand how we partner with our customers to address cloud security challenges.

Understanding the shared responsibility model is important when determining howto best protect your data and workloads on Google Cloud. The sharedresponsibility model describes the tasks that you have when it comes to securityin the cloud and how these tasks are different for cloud providers.

Understanding shared responsibility, however, can be challenging. The modelrequires an in-depth understanding of each service you utilize, theconfiguration options that each service provides, and what Google Clouddoes to secure the service. Every service has a different configuration profile,and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops shortof helping cloud customers achieve better security outcomes. Instead of sharedresponsibility, we believe inshared fate.

Shared fate includes us building and operating a trusted cloud platform foryour workloads. We provide best practice guidance and secured, attestedinfrastructure code that you can use to deploy your workloads in a secure way.We release solutions that combine various Google Cloud services to solvecomplex security problems and we offer innovative insurance options to help youmeasure and mitigate the risks that you must accept. Shared fate involves usmore closely interacting with you as you secure your resources onGoogle Cloud.

Shared responsibility

You're the expert in knowing the security and regulatory requirements for yourbusiness, and knowing the requirements for protecting your confidential data andresources. When you run your workloads on Google Cloud, you must identifythe security controls that you need to configure in Google Cloud to helpprotect your confidential data and each workload. To decide which securitycontrols to implement, you must consider the following factors:

  • Your regulatory compliance obligations
  • Your organization's security standards and risk management plan
  • Security requirements of your customers and your vendors

Defined by workloads

Traditionally, responsibilities are defined based on the type of workload thatyou're running and the cloud services that you require. Cloud services includethe following categories:

Cloud serviceDescription
Infrastructure as a service (IaaS)IaaS services includeCompute Engine,Cloud Storage, and networkingservices such asCloud VPN,Cloud Load Balancing, andCloud DNS.

IaaS provides compute, storage, and network services on demand withpay-as-you-go pricing. You can use IaaS if you plan on migrating anexisting on-premises workload to the cloud using lift-and-shift, orif you want to run your application on particular VMs, using specificdatabases or network configurations.

In IaaS, the bulk of the security responsibilities are yours, and ourresponsibilities are focused on the underlying infrastructure andphysical security.

Platform as a service (PaaS)PaaS services includeApp Engine,Google Kubernetes Engine (GKE), andBigQuery.

PaaS provides the runtime environment that you can develop and runyour applications in. You can use PaaS if you're building anapplication (such as a website), and want to focus on developmentnot on the underlying infrastructure.

In PaaS, we're responsible for more controls than in IaaS. Typically, this will vary by the services and features that you use. You share responsibility with us forapplication-level controls and IAM management. Youremain responsible for your data security and client protection.

Software as a service (SaaS)SaaS applications includeGoogle Workspace,Google Security Operations, andthird-party SaaS applications that are available inGoogle Cloud Marketplace.

SaaS provides online applications that you can subscribe to or payfor in some way. You can use SaaS applications when your enterprisedoesn't have the internal expertise or business requirement to buildthe application themselves, but does require the ability to processworkloads.

In SaaS, we own the bulk of the security responsibilities. You remainresponsible for your access controls and the data that you choose tostore in the application.

Function as a service (FaaS) or serverless

FaaS provides the platform for developers to run small,single-purpose code (calledfunctions) that run in responseto particular events. You would use FaaS when you want particularthings to occur based on a particular event. For example, you mightcreate a function that runs whenever data is uploaded toCloud Storage so that it can be classified.

FaaS has a similar shared responsibility list as SaaS.Cloud Run functions is a FaaSapplication.

The following diagram shows the cloud services and defines how responsibilitiesare shared between the cloud provider and customer.

Shared security responsibilities

As the diagram shows, the cloud provider always remains responsible for theunderlying network and infrastructure, and customers always remain responsiblefor their access policies and data.

Defined by industry and regulatory framework

Various industries have regulatory frameworks that define the security controlsthat must be in place. When you move your workloads to the cloud, you mustunderstand the following:

  • Which security controls are your responsibility
  • Which security controls are available as part of the cloud offering
  • Which default security controls are inherited

Inherited security controls (such as ourdefault encryption andinfrastructure controls)are controls that you can provide as part of your evidence of your securityposture to auditors and regulators. For example, the Payment Card Industry DataSecurity Standard (PCI DSS) defines regulations for payment processors. When youmove your business to the cloud, these regulations are shared between you andyour CSP. To understand how PCI DSS responsibilities are shared between you andGoogle Cloud, seeGoogle Cloud: PCI DSS Shared Responsibility Matrix.

As another example, in the United States, the Health Insurance Portability andAccountability Act (HIPAA) has set standards for handling electronic personalhealth information (PHI). These responsibilities are also shared between the CSPand you. For more information on how Google Cloud meets ourresponsibilities under HIPAA, seeHIPAA - Compliance.

Other industries (for example, finance or manufacturing) also have regulationsthat define how data can be gathered, processed, and stored. For moreinformation about shared responsibility related to these, and howGoogle Cloud meets our responsibilities, seeCompliance resource center.

Defined by location

Depending on your business scenario, you might need to consider yourresponsibilities based on the location of your business offices, your customers,and your data. Different countries and regions have created regulations thatinform how you can process and store your customer's data. For example, if yourbusiness has customers who reside in the European Union, your business mightneed to abide by the requirements that are described in theGeneral Data Protection Regulation (GDPR), and you might be obligated to keep your customer data in the EU itself.In this circumstance, you are responsible for ensuring that the data that youcollect remains in theGoogle Cloud regions in the EU.For more information about how we meet our GDPR obligations, seeGDPR and Google Cloud.

For information about the requirements related to your region, seeCompliance offerings.If your scenario is particularly complicated, we recommend speaking with oursales team or one of ourpartners to help you evaluate your security responsibilities.

Challenges for shared responsibility

Though shared responsibility helps define the security roles that you or thecloud provider has, relying on shared responsibility can still createchallenges. Consider the following scenarios:

  • Most cloud security breaches are the direct result of misconfiguration(listed asnumber 3 in the Cloud Security Alliance's Pandemic 11 Report)and this trend is expected to increase. Cloud products are constantlychanging, and new ones are constantly being launched. Keeping up withconstant change can seem overwhelming. Customers need cloud providers toprovide them with opinionated best practices to help keep up with thechange, starting with best practices by default and having a baselinesecure configuration.
  • Though dividing items by cloud services is helpful, many enterpriseshave workloads that require multiple cloud services types. In thiscircumstance, you must consider how various security controls for theseservices interact, including whether they overlap between and acrossservices. For example, you might have an on-premises application thatyou're migrating to Compute Engine, use Google Workspacefor corporate email, and also run BigQuery to analyze datato improve your products.
  • Your business and markets are constantly changing; as regulationschange, as you enter new markets, or as you acquire other companies. Yournew markets might have different requirements, and your new acquisitionmight host their workloads on another cloud. To manage the constantchanges, you must constantly re-assess your risk profile and be able toimplement new controls quickly.
  • How and where to manage your data encryption keys is an importantdecision that ties with your responsibilities to protect your data. Theoption that you choose depends on your regulatory requirements, whetheryou're running a hybrid cloud environment or still have an on-premisesenvironment, and the sensitivity of the data that you're processing andstoring.
  • Incident management is an important, and often overlooked, area whereyour responsibilities and the cloud provider responsibilities aren't easilydefined. Many incidents require close collaboration and support from thecloud provider to help investigate and mitigate them. Other incidents canresult from poorly configured cloud resources or stolen credentials, andensuring that you meet the best practices for securing your resources andaccounts can be quite challenging.
  • Advanced persistent threats (APTs) and new vulnerabilities can impactyour workloads in ways that you might not consider when you start yourcloud transformation. Ensuring that you remain up-to-date on the changinglandscape, and who is responsible for threat mitigation is difficult,particularly if your business doesn't have a large security team.

Shared fate

We developed shared fate in Google Cloud to start addressing thechallenges that the shared responsibility model doesn't address. Shared fatefocuses on how all parties can better interact to continuously improve security.Shared fate builds on the shared responsibility model because it views therelationship between cloud provider and customer as an ongoing partnership toimprove security.

Shared fate is about us taking responsibility for making Google Cloudmore secure. Shared fate includes helping you get started with a securedlanding zone and being clear, opinionated, and transparent about recommended securitycontrols, settings, and associated best practices. It includes helping youbetter quantify and manage your risk with cyber-insurance, using our RiskProtection Program. Using shared fate, we want to evolve from the standardshared responsibility framework to a better model that helps you secure yourbusiness and build trust in Google Cloud.

The following sections describe various components of shared fate.

Help getting started

A key component of shared fate is the resources that we provide to help you getstarted, in a secure configuration in Google Cloud. Starting with a secureconfiguration helps reduce the issue of misconfigurations which is the rootcause of most security breaches.

Our resources include the following:

  • Enterprise foundations blueprint that discuss top security concerns and our top recommendations.
  • Secure blueprints that let you deploy and maintain secure solutions usinginfrastructure ascode (IaC). Blueprints have our security recommendations enabled bydefault. Many blueprints are created by Google security teams and managedas products. This support means that they're updated regularly, go througha rigorous testing process, and receive attestations from third-partytesting groups. Blueprints include theenterprise foundations blueprint and thesecured data warehouse blueprint.

  • Google Cloud Well-Architected Framework recommendations for building security into your designs.

  • Landing zone navigation guides that step you through the top decisions that you need to make to build asecure foundation for your workloads, including resource hierarchy,identity onboarding, security and key management, and network structure.

Risk Protection Program

Shared fate also includes theRisk Protection Program (currently in preview), which helps you use the power of Google Cloud as aplatform to manage risk, rather than just seeing cloud workloads as anothersource of risk that you need to manage. The Risk Protection Program is acollaboration between Google Cloud and two leading cyber insurancecompanies, Munich Re and Allianz Global & Corporate Speciality.

The Risk Protection Program includesCyber Insurance Hub,which provides data-driven insights that you can use to better understand yourcloud security posture. If you're looking for cyber insurance coverage, you canshare these insights from Cyber Insurance Hub directly with our insurancepartners to obtain a quote. For more information, seeGoogle Cloud Risk Protection Program now in Preview.

Help with deployment and governance

Shared fate also helps with your continued governance of your environment. Forexample, we focus efforts on products such as the following:

Putting shared responsibility and shared fate into practice

As part of your planning process, consider the following actions to help youunderstand and implement appropriate security controls:

  • Create a list of the type of workloads that you will host inGoogle Cloud, and whether they require IaaS, PaaS, and SaaS services.You can use theshared responsibility diagram as a checklist to ensure that you know the security controls that you needto consider.
  • Create a list of regulatory requirements that you must comply with, andaccess resources in theCompliance resource center that relate to those requirements.
  • Review the list of available blueprints and architectures in theArchitecture Center for the security controls that you require for your particular workloads.The blueprints provide a list of recommended controls and the IaC code thatyou require to deploy that architecture.
  • Use thelanding zone documentation and the recommendations in theenterprise foundations guide to design a resource hierarchy and network architecture that meets yourrequirements. You can use the opinionated workload blueprints, like thesecured data warehouse, to accelerate your development process.
  • After you deploy your workloads, verify that you're meeting yoursecurity responsibilities using services such as theCyber Insurance Hub, Assured Workloads,Policy Intelligence tools, and Security Command Center Premium.

For more information, see theCISO's Guide to Cloud Transformation paper.

What's next

Well-Architected Framework: Reliability pillar

The reliability pillar in theGoogle Cloud Well-Architected Framework provides principles and recommendations to help you design, deploy, and managereliable workloads in Google Cloud.

This document is intended for cloud architects, developers, platform engineers,administrators, and site reliability engineers.

Reliability is a system's ability to consistently perform its intendedfunctions within the defined conditions and maintain uninterrupted service. Bestpractices for reliability include redundancy, fault-tolerant design, monitoring,and automated recovery processes.

As a part of reliability,resilience is the system's ability to withstand andrecover from failures or unexpected disruptions, while maintaining performance.Google Cloud features, likemulti-regional deployments,automated backups, and disaster recovery solutions, can help you improve yoursystem's resilience.

Reliability is important to your cloud strategy for many reasons, including thefollowing:

  • Minimal downtime: Downtime can lead to lost revenue, decreasedproductivity, and damage to reputation. Resilient architectures can helpensure that systems can continue to function during failures or recoverefficiently from failures.
  • Enhanced user experience: Users expect seamless interactions withtechnology. Resilient systems can help maintain consistent performance andavailability, and they provide reliable service even during high demand orunexpected issues.
  • Data integrity: Failures can cause data loss or data corruption.Resilient systems implement mechanisms such as backups, redundancy, andreplication to protect data and ensure that it remains accurate and accessible.
  • Business continuity: Your business relies on technology for criticaloperations. Resilient architectures can help ensure continuity after acatastrophic failure, which enables business functions to continue withoutsignificant interruptions and supports a swift recovery.
  • Compliance: Many industries have regulatory requirements for systemavailability and data protection. Resilient architectures can help you tomeet these standards by ensuring systems remain operational and secure.
  • Lower long-term costs: Resilient architectures require upfrontinvestment, but resiliency can help to reduce costs over time by preventingexpensive downtime, avoiding reactive fixes, and enabling more efficientresource use.

Organizational mindset

To make your systems reliable, you need a plan and an established strategy.This strategy must include education and the authority to prioritize reliabilityalongside other initiatives.

Set a clear expectation that the entire organization is responsible forreliability, including development, product management, operations, platformengineering, andsite reliability engineering (SRE).Even the business-focused groups, like marketing and sales, can influencereliability.

Every team must understand the reliability targets and risks of theirapplications. The teams must be accountable to these requirements. Conflictsbetween reliability and regular product feature development must be prioritizedand escalated accordingly.

Plan and manage reliability holistically, across all your functions and teams.Consider setting up a Cloud Centre of Excellence (CCoE) that includes areliability pillar. For more information, seeOptimize your organization's cloud journey with a Cloud Center of Excellence.

Focus areas for reliability

The activities that you perform to design, deploy, and manage a reliable systemcan be categorized in the following focus areas. Each of the reliabilityprinciples and recommendations in this pillar is relevant to one of these focusareas.

  • Scoping: To understand your system, conduct a detailed analysis ofits architecture. You need to understand the components, how they work andinteract, how data and actions flow through the system, and what could gowrong. Identify potential failures, bottlenecks, and risks, which helps youto take actions to mitigate those issues.
  • Observation: To help prevent system failures, implementcomprehensive and continuous observation and monitoring. Through thisobservation, you can understand trends and identify potential problemsproactively.
  • Response: To reduce the impact of failures, respond appropriatelyand recover efficiently. Automated responses can also help reduce theimpact of failures. Even with planning and controls, failures can still occur.
  • Learning: To help prevent failures from recurring, learn from eachexperience, and take appropriate actions.

Core principles

The recommendations in the reliability pillar of the Well-Architected Framework aremapped to the following core principles:

Note: To learn about the building blocks of infrastructure reliability inGoogle Cloud, seeGoogle Cloud infrastructure reliability guide.

Contributors

Authors:

Other contributors:

Define reliability based on user-experience goals

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework helps you to assess your users' experience, and then map the findings toreliability goals and metrics.

This principle is relevant to thescopingfocus area of reliability.

Principle overview

Observability tools provide large amounts of data, but not all of the datadirectly relates to the impacts on the users. For example, you might observehigh CPU usage, slow server operations, or even crashed tasks. However, if theseissues don't affect the user experience, then they don't constitute an outage.

To measure the user experience, you need to distinguish between internal systembehavior and user-facing problems. Focus on metrics like the success ratio ofuser requests. Don't rely solely on server-centric metrics, like CPU usage,which can lead to misleading conclusions about your service's reliability. Truereliability means that users can consistently and effectively use yourapplication or service.

Recommendations

To help you measure user experience effectively, consider the recommendationsin the following sections.

Measure user experience

To truly understand your service's reliability, prioritize metrics that reflectyour users' actual experience. For example, measure the users' query successratio, application latency, and error rates.

Ideally, collect this data directly from the user's device or browser. If thisdirect data collection isn't feasible, shift your measurement pointprogressively further away from the user in the system. For example, you can usethe load balancer or frontend service as the measurement point. This approachhelps you identify and address issues before those issues can significantlyimpact your users.

Analyze user journeys

To understand how users interact with your system, you can use tracing toolslikeCloud Trace.By following a user's journey through your application, you can find bottlenecksand latency issues that might degrade the user's experience. Cloud Tracecaptures detailed performance data for eachhop in your service architecture.This data helps you identify and address performance issues more efficiently,which can lead to a more reliable and satisfying user experience.

Set realistic targets for reliability

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework helps you define reliability goals that are technically feasible for yourworkloads in Google Cloud.

This principle is relevant to thescopingfocus area of reliability.

Principle overview

Design your systems to be just reliable enough for user happiness. It mightseem counterintuitive, but a goal of 100% reliability is often not the mosteffective strategy. Higher reliability might result in a significantly highercost, both in terms of financial investment and potential limitations oninnovation. If users are already happy with the current level of service, thenefforts to further increase happiness might yield a low return on investment.Instead, you can better spend resources elsewhere.

You need to determine the level of reliability at which your users are happy,and determine the point where the cost of incremental improvements begin tooutweigh the benefits. When you determine this level ofsufficientreliability, you can allocate resources strategically and focus on features andimprovements that deliver greater value to your users.

Recommendations

To set realistic reliability targets, consider the recommendations in thefollowing subsections.

Accept some failure and prioritize components

Aim for high availability such as 99.99% uptime, but don't set a target of 100%uptime. Acknowledge that some failures are inevitable.

The gap between 100% uptime and a 99.99% target is the allowance for failure.This gap is often called theerror budget. The error budget can help you takerisks and innovate, which is fundamental to any business to stay competitive.

Prioritize the reliability of the most critical components in the system.Accept that less critical components can have a higher tolerance for failure.

Balance reliability and cost

To determine the optimal reliability level for your system, conduct thoroughcost-benefit analyses.

Consider factors like system requirements, the consequences of failures, andyour organization's risk tolerance for the specific application. Remember toconsider yourdisaster recovery metrics,such as the recovery time objective (RTO) and recovery point objective (RPO).Decide what level of reliability is acceptable within the budget and otherconstraints.

Look for ways to improve efficiency and reduce costs without compromisingessential reliability features.

Build highly available systems through resource redundancy

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to plan, build, and manage resource redundancy, whichcan help you to avoid failures.

This principle is relevant to thescopingfocus area of reliability.

Principle overview

After youdecide the level of reliability that you need, you must design your systems to avoid anysingle points of failure.Every critical component in the system must be replicated across multiplemachines, zones, andregions.For example, a critical database can't be located in only one region, and ametadata server can't be deployed in only one single zone or region. In thoseexamples, if the sole zone or region has an outage, the system has a globaloutage.

Recommendations

To build redundant systems, consider the recommendations in the followingsubsections.

Identify failure domains and replicate services

Map out your system'sfailure domains,from individual VMs to regions, and design for redundancy across the failuredomains.

To ensure high availability, distribute and replicate your services andapplications across multiple zones and regions. Configure the system forautomatic failover to make sure that the services and applications continue tobe available in the event of zone or region outages.

For examples of multi-zone and multi-region architectures, seeDesign reliable infrastructure for your workloads in Google Cloud.

Detect and address issues promptly

Continuously track the status of your failure domains to detect and addressissues promptly.

You can monitor the current status of Google Cloud services in all regionsby using theGoogle Cloud Service Health dashboard.You can also view incidents relevant to your project by usingPersonalized Service Health.You can use load balancers to detect resource health and automatically routetraffic to healthy backends. For more information, seeHealth checks overview.

Test failover scenarios

Like a fire drill, regularly simulate failures to validate the effectiveness ofyour replication and failover strategies.

For more information, seeSimulate a zone outage for a regional MIG andSimulate a zone failure in GKE regional clusters.

Take advantage of horizontal scalability

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you use horizontal scalability. By usinghorizontal scalability, you can help ensure that your workloads inGoogle Cloud can scale efficiently and maintain performance.

This principle is relevant to thescopingfocus area of reliability.

Principle overview

Re-architect your system to a horizontal architecture. To accommodate growth intraffic or data, you can add more resources. You can also remove resources whenthey're not in use.

To understand the value of horizontal scaling, consider the limitations ofvertical scaling.

A common scenario for vertical scaling is to use a MySQL database as theprimary database with critical data. As database usage increases, more RAM andCPU is required. Eventually, the database reaches the memory limit on the hostmachine, and needs to be upgraded. This process might need to be repeatedseveral times. The problem is that there are hard limits on how much a databasecan grow. VM sizes are not unlimited. The database can reach a point when it'sno longer possible to add more resources.

Even if resources were unlimited, a large VM can become a single point offailure. Any problem with the primary database VM can cause error responses orcause a system-wide outage that affects all users. Avoid single points offailure, as described inBuild highly available systems through resource redundancy.

Besides these scaling limits, vertical scaling tends to be more expensive. Thecost can increase exponentially as machines with greater amounts of computepower and memory are acquired.

Horizontal scaling, by contrast, can cost less. The potential for horizontalscaling is virtually unlimited in a system that's designed to scale.

Recommendations

To transition from a single VM architecture to a horizontal multiple-machinearchitecture, you need to plan carefully and use the right tools. To help youachieve horizontal scaling, consider the recommendations in the followingsubsections.

Use managed services

Managed services remove the need to manually manage horizontal scaling. Forexample, with Compute Engine managed instance groups (MIGs), you can add orremove VMs to scale your application horizontally. For containerizedapplications, Cloud Run is a serverless platform that can automaticallyscale your stateless containers based on incoming traffic.

Promote modular design

Modular components and clear interfaces help you scale individual components asneeded, instead of scaling the entire application. For more information, seePromote modular design in the performance optimization pillar.

Implement a stateless design

Design applications to be stateless, meaning no locally stored data. This letsyou add or remove instances without worrying about data consistency.

Detect potential failures by using observability

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you proactively identify areas where errors andfailures might occur.

This principle is relevant to theobservationfocus area of reliability.

Principle overview

To maintain and improve the reliability of your workloads inGoogle Cloud, you need to implement effective observability by usingmetrics, logs, and traces.

  • Metrics are numerical measurements of activities that you want to trackfor your application at specific time intervals. For example, you mightwant to track technical metrics like request rate and error rate, which canbe used as service-level indicators (SLIs). You might also need to trackapplication-specific business metrics like orders placed and payments received.
  • Logs are time-stamped records of discrete events that occur within anapplication or system. The event could be a failure, an error, or a changein state. Logs might include metrics, and you can also use logs for SLIs.
  • A trace represents the journey of a single user or transaction through anumber of separate applications or the components of an application. Forexample, these components could be microservices. Traces help you to trackwhat components were used in the journeys, where bottlenecks exist, and howlong the journeys took.

Metrics, logs, and traces help you monitor your system continuously.Comprehensive monitoring helps you find out where and why errors occurred. Youcan also detect potential failures before errors occur.

Recommendations

To detect potential failures efficiently, consider the recommendations in thefollowing subsections.

Gain comprehensive insights

To track key metrics like response times and error rates, useCloud Monitoring andCloud Logging.These tools also help you to ensure that the metrics consistently meet the needsof your workload.

To make data-driven decisions, analyze default service metrics to understandcomponent dependencies and their impact on overall workload performance.

To customize your monitoring strategy, create and publish your own metrics byusing the Google Cloud SDK.

Perform proactive troubleshooting

Implement robust error handling and enable logging across all of the componentsof your workloads in Google Cloud. Activate logs likeCloud Storage access logs andVPC Flow Logs.

When you configure logging, consider the associatedcosts.To control logging costs, you can configureexclusion filters on the log sinks to exclude certain logs from being stored.

Optimize resource utilization

Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detectunder-provisioned and over-provisioned resources in services likeGKE, Compute Engine, and Dataproc. For acomplete list of supported services, seeCloud Monitoring overview.

Prioritize alerts

For alerts, focus on critical metrics, set appropriate thresholds to minimizealert fatigue, and ensure timely responses to significant issues. This targetedapproach lets you proactively maintain workload reliability. For moreinformation, seeAlerting overview.

Design for graceful degradation

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you to design your Google Cloud workloadsto fail gracefully.

This principle is relevant to theresponsefocus area of reliability.

Principle overview

Graceful degradation is a design approach where a system that experiences ahigh load continues to function, possibly with reduced performance or accuracy.Graceful degradation ensures continued availability of the system and preventscomplete failure, even if the system's work isn't optimal. When the load returnsto a manageable level, the system resumes full functionality.

For example, during periods of high load, Google Search prioritizes resultsfrom higher-ranked web pages, potentially sacrificing some accuracy. When theload decreases, Google Search recomputes the search results.

Recommendations

To design your systems for graceful degradation, consider the recommendationsin the following subsections.

Implement throttling

Ensure that your replicas can independently handle overloads and can throttleincoming requests during high-traffic scenarios. This approach helps you toprevent cascading failures that are caused by shifts in excess traffic betweenzones.

Use tools likeApigee to control the rate of API requests during high-traffic times. You can configurepolicy rules to reflect how you want to scale back requests.

Drop excess requests early

Configure your systems to drop excess requests at the frontend layer to protectbackend components. Dropping some requests prevents global failures and enablesthe system to recover more gracefully.With this approach, some users mightexperience errors. However, you can minimize the impact of outages, in contrastto an approach likecircuit-breaking, whereall traffic is dropped during anoverload.

Handle partial errors and retries

Build your applications to handle partial errors and retries seamlessly. Thisdesign helps to ensure that as much traffic as possible is served duringhigh-load scenarios.

Test overload scenarios

To validate that the throttle and request-drop mechanisms work effectively,regularly simulate overload conditions in your system. Testing helps ensure thatyour system is prepared for real-world traffic surges.

Monitor traffic spikes

Use analytics and monitoring tools to predict and respond to traffic surgesbefore they escalate into overloads. Early detection and response can helpmaintain service availability during high-demand periods.

Perform testing for recovery from failures

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery in theevent of failures.

This principle is relevant to thelearningfocus area of reliability.

Principle overview

To be sure that your system can recover from failures, you must periodicallyrun tests that include regional failovers, release rollbacks, and datarestoration from backups.

This testing helps you to practice responses to events that pose major risks toreliability, such as the outage of an entireregion.This testing also helps you verify that your system behaves as intended during adisruption.

In the unlikely event of an entire region going down, you need to fail over alltraffic to another region. During normal operation of your workload, when datais modified, it needs to be synchronized from the primary region to the failoverregion. You need to verify that the replicated data is always very recent, sothat users don't experience data loss or session breakage. The load balancingsystem must also be able to shift traffic to the failover region at any timewithout service interruptions. To minimize downtime after a regional outage,operations engineers also need to be able to manually and efficiently shift usertraffic away from a region, in as less time as possible. This operation issometimes calleddraining a region, which means you stop the inbound trafficto the region and move all the traffic elsewhere.

Recommendations

When you design and run tests for failure recovery, consider therecommendations in the following subsections.

Define the testing objectives and scope

Clearly define what you want to achieve from the testing. For example, yourobjectives can include the following:

  • Validate the recovery time objective (RTO) and the recovery pointobjective (RPO). For details, seeBasics of DR planning.
  • Assess system resilience and fault tolerance under various failurescenarios.
  • Test the effectiveness of automated failover mechanisms.

Decide which components, services, or regions are in the testing scope. Thescope can include specific application tiers like the frontend, backend, anddatabase, or it can include specific Google Cloud resources likeCloud SQL instances or GKE clusters. The scope must also specify anyexternal dependencies, such as third-party APIs or cloud interconnections.

Prepare the environment for testing

Choose an appropriate environment, preferably a staging or sandbox environmentthat replicates your production setup. If you conduct the test in production,ensure that you have safety measures ready, like automated monitoring and manualrollback procedures.

Create a backup plan. Take snapshots or backups of critical databases andservices to prevent data loss during the test. Ensure that your team is preparedto do manual interventions if the automated failover mechanisms fail.

To prevent test disruptions, ensure that your IAM roles,policies, and failover configurations are correctly set up. Verify that thenecessary permissions are in place for the test tools and scripts.

Inform stakeholders, including operations, DevOps, and application owners,about the test schedule, scope, and potential impact. Provide stakeholders withan estimated timeline and the expected behaviors during the test.

Simulate failure scenarios

Plan and execute failures by using tools likeChaos Monkey.You can use custom scripts to simulate failures of critical services such as ashutdown of a primary node in a multi-zone GKE cluster or a disabledCloud SQL instance. You can also use scripts to simulate a region-widenetwork outage by using firewall rules or API restrictions based on your scopeof test. Gradually escalate the failure scenarios to observe system behaviorunder various conditions.

Introduce load testing alongside failure scenarios to replicate real-worldusage during outages. Test cascading failure impacts, such as how frontendsystems behave when backend services are unavailable.

To validate configuration changes and to assess the system's resilience againsthuman errors, test scenarios that involve misconfigurations. For example, runtests with incorrect DNS failover settings or incorrect IAMpermissions.

Monitor system behavior

Monitor how load balancers, health checks, and other mechanisms reroutetraffic. Use Google Cloud tools like Cloud Monitoring andCloud Logging to capture metrics and events during the test.

Observe changes in latency, error rates, and throughput during and after thefailure simulation, and monitor the overall performance impact. Identify anydegradation or inconsistencies in the user experience.

Ensure that logs are generated and alerts are triggered for key events, such asservice outages or failovers. Use this data to verify the effectiveness of youralerting and incident response systems.

Verify recovery against your RTO and RPO

Measure how long it takes for the system to resume normal operations after afailure, and then compare this data with the defined RTO and document anygaps.

Ensure that data integrity and availability align with the RPO. To testdatabase consistency, compare snapshots or backups of the database before andafter a failure.

Evaluate service restoration and confirm that all services are restored to afunctional state with minimal user disruption.

Document and analyze results

Document each test step, failure scenario, and corresponding system behavior.Include timestamps, logs, and metrics for detailed analyses.

Highlight bottlenecks, single points of failure, or unexpected behaviorsobserved during the test. To help prioritize fixes, categorize issues byseverity and impact.

Suggest improvements to the system architecture, failover mechanisms, ormonitoring setups. Based on test findings, update any relevant failover policiesand playbooks. Present a postmortem report to stakeholders. The report shouldsummarize the outcomes, lessons learned, and next steps. For more information,seeConduct thorough postmortems.

Iterate and improve

To validate ongoing reliability and resilience, plan periodic testing (forexample, quarterly).

Run tests under different scenarios, including infrastructure changes, softwareupdates, and increased traffic loads.

Automate failover tests by using CI/CD pipelines to integrate reliabilitytesting into your development lifecycle.

During the postmortem, use feedback from stakeholders and end users to improvethe test process and system resilience.

Perform testing for recovery from data loss

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you design and run tests for recovery from dataloss.

This principle is relevant to thelearningfocus area of reliability.

Principle overview

To ensure that your system can recover from situations where data is lost orcorrupted, you need to run tests for those scenarios. Instances of data lossmight be caused by a software bug or some type of natural disaster. After suchevents, you need to restore data from backups and bring all of the services backup again by using the freshly restored data.

We recommend that you use three criteria to judge the success or failure of thistype of recovery test: data integrity, recovery time objective (RTO), andrecovery point objective (RPO). For details about the RTO and RPO metrics, seeBasics of DR planning.

The goal of data restoration testing is to periodically verify that yourorganization can continue to meet business continuity requirements. Besidesmeasuring RTO and RPO, a data restoration test must include testing of theentire application stack and all the critical infrastructure services with therestored data. This is necessary to confirm that the entire deployed applicationworks correctly in the test environment.

Recommendations

When you design and run tests for recovering from data loss, consider therecommendations in the following subsections.

Verify backup consistency and test restoration processes

You need to verify that your backups contain consistent and usable snapshots ofdata that you can restore to immediately bring applications back into service.To validate data integrity, set up automated consistency checks to run aftereach backup.

To test backups, restore them in a non-production environment. To ensure yourbackups can be restored efficiently and that the restored data meets applicationrequirements, regularly simulate data recovery scenarios. Document the steps fordata restoration, and train your teams to execute the steps effectively during afailure.

Schedule regular and frequent backups

To minimize data loss during restoration and to meet RPO targets, it'sessential to have regularly scheduled backups. Establish a backup frequency thataligns with your RPO. For example, if your RPO is 15 minutes, schedule backupsto run at least every 15 minutes. Optimize the backup intervals to reduce therisk of data loss.

Use Google Cloud tools like Cloud Storage, Cloud SQLautomated backups, or Spanner backups to schedule and manage backups.For critical applications, use near-continuous backup solutions likepoint-in-time recovery (PITR) for Cloud SQL or incremental backups for large datasets.

Define and monitor RPO

Set a clear RPO based on your business needs, and monitor adherence to the RPO.If backup intervals exceed the defined RPO, use Cloud Monitoring to set upalerts.

Monitor backup health

UseGoogle Cloud Backup and DR service or similar tools to track the health of your backups and confirm that they arestored in secure and reliable locations. Ensure that the backups are replicatedacross multipleregions for added resilience.

Plan for scenarios beyond backup

Combine backups with disaster recovery strategies like active-active failoversetups or cross-region replication for improved recovery time in extreme cases.For more information, seeDisaster recovery planning guide.

Conduct thorough postmortems

This principle in the reliability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you conduct effective postmortems afterfailures and incidents.

This principle is relevant to thelearningfocus area of reliability.

Principle overview

A postmortem is a written record of an incident, its impact, the actions takento mitigate or resolve the incident, the root causes, and the follow-up actionsto prevent the incident from recurring. The goal of a postmortem is to learnfrom mistakes and not assign blame.

The following diagram shows the workflow of a postmortem:

The workflow of a postmortem.

The workflow of a postmortem includes the following steps:

  • Create postmortem
  • Capture the facts
  • Identify and analyze the root causes
  • Plan for the future
  • Execute the plan

Conduct postmortem analyses after major events and non-major events like thefollowing:

  • User-visible downtimes or degradations beyond a certain threshold.
  • Data losses of any kind.
  • Interventions from on-call engineers, such as a release rollback orrerouting of traffic.
  • Resolution times above a defined threshold.
  • Monitoring failures, which usually imply manual incident discovery.

Recommendations

Define postmortem criteria before an incident occurs so that everyone knowswhen a post mortem is necessary.

To conduct effective postmortems, consider the recommendations in the followingsubsections.

Conduct blameless postmortems

Effective postmortems focus on processes, tools, and technologies, and don'tplace blame on individuals or teams. The purpose of a postmortem analysis is toimprove your technology and future, not to find who is guilty. Everyone makesmistakes. The goal should be to analyze the mistakes and learn from them.

The following examples show the difference between feedback that assigns blameand blameless feedback:

  • Feedback that assigns blame: "We need to rewrite the entirecomplicated backend system! It's been breaking weekly for the last threequarters and I'm sure we're all tired of fixing things piecemeal.Seriously, if I get paged one more time I'll rewrite it myself…"
  • Blameless feedback: "An action item to rewrite the entire backendsystem might actually prevent these pages from continuing to happen. Themaintenance manual for this version is quite long and really difficult tobe fully trained up on. I'm sure our future on-call engineers will thank us!"

Make the postmortem report readable by all the intended audiences

For each piece of information that you plan to include in the report, assesswhether that information is important and necessary to help the audienceunderstand what happened. You can move supplementary data and explanations to anappendix of the report. Reviewers who need more information can request it.

Avoid complex or over-engineered solutions

Before you start to explore solutions for a problem, evaluate the importance ofthe problem and the likelihood of a recurrence. Adding complexity to the systemto solve problems that are unlikely to occur again can lead to increasedinstability.

Share the postmortem as widely as possible

To ensure that issues don't remain unresolved, publish the outcome of thepostmortem to a wide audience and get support from management. The value of apostmortem is proportional to the learning that occurs after the postmortem.When more people learn from incidents, the likelihood of similar failuresrecurring is reduced.

Well-Architected Framework: Cost optimization pillar

The cost optimization pillar in theGoogle Cloud Well-Architected Framework describes principles and recommendations to optimize the cost of your workloadsin Google Cloud.

The intended audience includes the following:

  • CTOs, CIOs, CFOs, and other executives who are responsible for strategiccost management.
  • Architects, developers, administrators, and operators who make decisionsthat affect cost at all the stages of an organization's cloud journey.

The cost models for on-premises and cloud workloads differ significantly.On-premises IT costs include capital expenditure (CapEx) and operationalexpenditure (OpEx). On-premises hardware and software assets are acquired andthe acquisition costs aredepreciated over the operating life of the assets. In the cloud, the costs for most cloudresources are treated as OpEx, where costs are incurred when the cloud resourcesare consumed. This fundamental difference underscores the importance of thefollowing core principles of cost optimization.

Note: You might be able to classify the cost of some Google Cloud services (likeCompute Engine sole-tenant nodes) as capital expenditure. For moreinformation, seeSole-tenancy accounting FAQ.

For cost optimization principles and recommendations that are specific to AI and ML workloads, seeAI and ML perspective: Cost optimizationin the Well-Architected Framework.

Core principles

The recommendations in the cost optimization pillar of the Well-Architected Frameworkare mapped to the following core principles:

  • Align cloud spending with businessvalue:Ensure that your cloud resources deliver measurable business value byaligning IT spending with business objectives.
  • Foster a culture of costawareness:Ensure that people across your organization consider the cost impact oftheir decisions and activities, and ensure that they have access to the costinformation required to make informed decisions.
  • Optimize resourceusage:Provision only the resources that you need, and pay only for the resourcesthat you consume.
  • Optimizecontinuously:Continuously monitor your cloud resource usage and costs, and proactivelymake adjustments as needed to optimize your spending. This approach involvesidentifying and addressing potential cost inefficiencies before they becomesignificant problems.

These principles are closely aligned with the core tenets ofcloud FinOps.FinOps is relevant to any organization, regardless of its size or maturity inthe cloud. By adopting these principles and following the relatedrecommendations, you can control and optimize costs throughout your journey inthe cloud.

Contributors

Author:Nicolas Pintaux | Customer Engineer, Application Modernization Specialist

Other contributors:

Align cloud spending with business value

This principle in the cost optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to align your use of Google Cloud resources withyour organization's business goals.

Principle overview

To effectively manage cloud costs, you need to maximize the business value thatthe cloud resources provide and minimize thetotal cost of ownership (TCO).When you evaluate the resource options for your cloudworkloads, consider not only the cost of provisioning and using the resources,but also the cost of managing them. For example, virtual machines (VMs) onCompute Engine might be a cost-effective option for hosting applications.However, when you consider the overhead to maintain, patch, and scale the VMs,the TCO can increase. On the other hand, serverless services likeCloud Run can offer greaterbusiness value. The lower operational overhead lets your team focus on coreactivities and helps to increase agility.

To ensure that your cloud resources deliver optimal value, evaluate the followingfactors:

  • Provisioning and usage costs: The expenses incurred when you purchase,provision, or consume resources.
  • Management costs: The recurring expenses for operating and maintainingresources, including tasks like patching, monitoring and scaling.
  • Indirect costs: The costs that you might incur to manage issues likedowntime, data loss, or security breaches.
  • Business impact: The potential benefits from the resources, likeincreased revenue, improved customer satisfaction, and faster time to market.

By aligning cloud spending with business value, you get the following benefits:

  • Value-driven decisions: Your teams are encouraged to prioritize solutionsthat deliver the greatest business value and to consider both short-term andlong-term cost implications.
  • Informed resource choice: Your teams have the information and knowledgethat they need to assess the business value and TCO of various deploymentoptions, so they choose resources that are cost-effective.
  • Cross-team alignment: Cross-functional collaboration between business,finance, and technical teams ensures that cloud decisions are aligned withthe overall objectives of the organization.

Recommendations

To align cloud spending with business objectives, consider the following recommendations.

Prioritize managed services and serverless products

Whenever possible, choose managed services andserverless products to reduce operational overhead and maintenance costs. This choice lets your teamsconcentrate on their core business activities. They can accelerate the deliveryof new features and functionalities, and help drive innovation and value.

The following are examples of how you can implement this recommendation:

  • To run PostgreSQL, MySQL, or Microsoft SQL Server server databases, useCloud SQL instead of deploying those databases on VMs.
  • To run and manage Kubernetes clusters, useGoogle Kubernetes Engine (GKE) Autopilot instead of deploying containers on VMs.
  • For your Apache Hadoop or Apache Spark processing needs, useDataproc andDataproc Serverless.Per-second billing can help to achieve significantlylower TCO when compared to on-premises data lakes.

Balance cost efficiency with business agility

Controlling costs and optimizing resource utilization are important goals.However, you must balance these goals with the need for flexible infrastructurethat lets you innovate rapidly, respond quickly to changes, and deliver valuefaster. The following are examples of how you can achieve this balance:

  • AdoptDORA metrics for software delivery performance. Metrics like change failure rate (CFR),time to detect (TTD), and time to restore (TTR) can help to identify and fixbottlenecks in your development and deployment processes. By reducing downtimeand accelerating delivery, you can achieve both operational efficiency andbusiness agility.
  • FollowSite Reliability Engineering (SRE) practices to improve operational reliability. SRE's focus on automation,observability, and incident response can lead to reduced downtime, lowerrecovery time, and higher customer satisfaction. By minimizing downtime andimproving operational reliability, you can prevent revenue loss and avoidthe need to overprovision resources as a safety net to handle outages.

Enable self-service optimization

Encourage a culture of experimentation and exploration by providing your teamswith self-service cost optimization tools, observability tools, and resourcemanagement platforms. Enable them to provision, manage, and optimize their cloudresources autonomously. This approach helps to foster a sense of ownership,accelerate innovation, and ensure that teams can respond quickly to changing needs while being mindful of cost efficiency.

Adopt and implement FinOps

Adopt FinOps to establish a collaborative environment where everyone is empoweredto make informed decisions that balance cost and value. FinOps fosters financialaccountability and promotes effective cost optimization in the cloud.

Promote a value-driven and TCO-aware mindset

Encourage your team members to adopt a holistic attitude toward cloud spending,with an emphasis on TCO and not just upfront costs. Use techniques likevalue stream mapping to visualize and analyze the flow of value through your software delivery processand to identify areas for improvement. Implementunit costing for your applications and services to gain a granular understanding of costdrivers and discover opportunities for cost optimization. For more information,seeMaximize business value with cloud FinOps.

Foster a culture of cost awareness

This principle in the cost optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to promote cost awareness across your organization andensure that team members have the cost information that they need to makeinformed decisions.

Conventionally, the responsibility for cost management might be centralized to afew select stakeholders and primarily focused on initial project architecturedecisions. However, team members across all cloud user roles (analyst, architect,developer, or administrator) can help to reduce the cost of your resources inGoogle Cloud. By sharing cost data appropriately, you can empower teammembers to make cost-effective decisions throughout their development anddeployment processes.

Principle overview

Stakeholders across various roles – product owners, developers, deploymentengineers, administrators, and financial analysts – need visibility into relevantcost data and its relationship to business value. When provisioning and managingcloud resources, they need the following data:

  • Projected resource costs: Cost estimates at the time of design anddeployment.
  • Real-time resource usage costs: Up-to-date cost data that can be used forongoing monitoring and budget validation.
  • Costs mapped to business metrics: Insights into how cloud spending affectskey performance indicators (KPIs), to enable teams to identify cost-effectivestrategies.

Every individual might not need access to raw cost data. However, promoting costawareness across all roles is crucial because individual decisions can affectcosts.

By promoting cost visibility and ensuring clear ownership of cost managementpractices, you ensure that everyone is aware of the financial implications oftheir choices and everyone actively contributes to the organization's costoptimization goals. Whether through a centralized FinOps team or a distributedmodel, establishing accountability is crucial for effective cost optimizationefforts.

Recommendations

To promote cost awareness and ensure that your team members have the costinformation that they need to make informed decisions, consider the followingrecommendations.

Provide organization-wide cost visibility

To achieve organization-wide cost visibility, the teams that are responsible forcost management can take the following actions:

  • Standardize cost calculation and budgeting: Use a consistent method todetermine the full costs of cloud resources, after factoring in discounts andshared costs. Establish clear and standardized budgeting processes that alignwith your organization's goals and enable proactive cost management.
  • Use standardized cost management and visibility tools: Use appropriatetools that provide real-time insights into cloud spending and generateregular (for example, weekly) cost progression snapshots. These tools enableproactive budgeting, forecasting, and identification of optimizationopportunities. The tools could be cloud provider tools(like theGoogle Cloud Billing dashboard),third-party solutions, or open-source solutions like theCost Attribution solution.
  • Implement a cost allocation system: Allocate a portion of the overallcloud budget to each team or project. Such an allocation gives the teams asense of ownership over cloud spending and encourages them to makecost-effective decisions within their allocated budget.
  • Promote transparency: Encourage teams to discuss cost implications duringthe design and decision-making processes. Create a safe and supportiveenvironment for sharing ideas and concerns related to cost optimization.Some organizations use positive reinforcement mechanisms like leaderboardsor recognition programs. If your organization has restrictions on sharingraw cost data due to business concerns, explore alternative approaches forsharing cost information and insights. For example, consider sharingaggregated metrics (like the total cost for an environment or feature) orrelative metrics (like the average cost per transaction or user).

Understand how cloud resources are billed

Pricing for Google Cloud resources might vary acrossregions.Some resources are billed monthly at a fixed price, and others might be billedbased on usage.To understand how Google Cloud resources are billed, use theGoogle Cloud pricing calculator and product-specific pricing information (for example,Google Kubernetes Engine (GKE) pricing).

Understand resource-based cost optimization options

For each type of cloud resource that you plan to use, explore strategies tooptimize utilization and efficiency. The strategies include rightsizing,autoscaling, and adopting serverless technologies where appropriate. The followingare examples of cost optimization options for a few Google Cloud products:

  • Cloud Run lets you configurealways-allocated CPUs to handle predictable traffic loads at a fraction of the price of the defaultallocation method (that is, CPUs allocated only during request processing).
  • You can purchaseBigQuery slot commitments to save money on data analysis.
  • GKE provides detailed metrics to help you understand cost optimization options.
  • Understand hownetwork pricing can affect the cost of data transfers and how you can optimize costs forspecific networking services. For example, you can reduce the data transfercosts for external Application Load Balancers by using Cloud CDN or Google Cloud Armor.For more information, seeWays to lower external Application Load Balancer costs.

Understand discount-based cost optimization options

Familiarize yourself with the discount programs that Google Cloud offers,such as the following examples:

  • Committed use discounts (CUDs):CUDs are suitable for resources that have predictable and steady usage. CUDslet you get significant reductions in price in exchange for committing tospecific resource usage over a period (typically one to three years). Youcan also useCUD auto-renewal to avoid having to manually repurchase commitments when they expire.
  • Sustained use discounts:For certain Google Cloud products like Compute Engine andGKE, you can get automatic discount credits aftercontinuous resource usage beyond specific duration thresholds.
  • Spot VMs:For fault-tolerant and flexible workloads, Spot VMs can help toreduce your Compute Engine costs. The cost of Spot VMs issignificantly lower than regular VMs. However, Compute Engine mightpreemptively stop or delete Spot VMs to reclaim capacity.Spot VMs are suitable for batch jobs that can tolerate preemptionand don't have high availability requirements.
  • Discounts for specific product options: Some managed services likeBigQuery offerdiscounts when you purchase dedicated or autoscaling query processing capacity.

Evaluate and choose the discounts options that align with your workloadcharacteristics and usage patterns.

Incorporate cost estimates into architecture blueprints

Encourage teams to develop architecture blueprints that include cost estimatesfor different deployment options and configurations. This practice empowers teamsto compare costs proactively and make informed decisions that align with bothtechnical and financial objectives.

Use a consistent and standard set of labels for all your resources

You can uselabels to track costs and to identify and classify resources. Specifically, you can uselabels to allocate costs to different projects, departments, or cost centers.Defining aformal labeling policy that aligns with the needs of the main stakeholders in your organization helpsto make costs visible more widely. You can also use labels to filter resourcecost and usage data based on target audience.

Use automation tools like Terraform to enforce labeling on every resource thatis created. To enhance cost visibility and attribution further, you can use thetools provided by the open-sourcecost attribution solution.

Share cost reports with team members

By sharing cost reports with your team members, you empower them to takeownership of their cloud spending. This practice enables cost-effective decisionmaking, continuous cost optimization, and systematic improvements to your costallocation model.

Cost reports can be of several types, including the following:

  • Periodic cost reports: Regular reports inform teams about their currentcloud spending. Conventionally, these reports might be spreadsheet exports.More effective methods include automated emails and specialized dashboards.To ensure that cost reports provide relevant and actionable information without overwhelming recipients with unnecessary detail, the reports must be tailored to the target audiences. Setting up tailored reports is a foundational step toward more real-time and interactive cost visibility and management.
  • Automated notifications: You can configure cost reports to proactivelynotify relevant stakeholders (for example, through email or chat) about costanomalies, budget thresholds, or opportunities for cost optimization. Byproviding timely information directly to those who can act on it, automatedalerts encourage prompt action and foster a proactive approach to costoptimization.
  • Google Cloud dashboards: You can use thebuilt-in billing dashboards in Google Cloud to get insights into cost breakdowns and to identifyopportunities for cost optimization. Google Cloud also providesFinOps hub to help you monitor savings and get recommendations for cost optimization.An AI engine powers the FinOps hub to recommend cost optimizationopportunities for all the resources that are currently deployed. To controlaccess to these recommendations, you can implement role-based access control(RBAC).
  • Custom dashboards: You can create custom dashboards by exporting costdata to an analytics database, likeBigQuery.Use a visualization tool likeLooker Studio to connect to the analytics database to build interactive reports and enablefine-grained access control through role-based permissions.
  • Multicloud cost reports: For multicloud deployments, you need aunified view of costs across all the cloud providers to ensure comprehensiveanalysis, budgeting, and optimization. Use tools like BigQueryto centralize and analyze cost data from multiple cloud providers, and useLooker Studio to build team-specific interactive reports.

Optimize resource usage

This principle in the cost optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you plan and provision resources to match the requirementsand consumption patterns of your cloud workloads.

Principle overview

To optimize the cost of your cloud resources, you need to thoroughly understandyour workloads resource requirements and load patterns. This understanding isthe basis for a well defined cost model that lets you forecast the total cost ofownership (TCO) and identify cost drivers throughout your cloud adoption journey.By proactively analyzing and forecasting cloud spending, you can make informedchoices about resource provisioning, utilization, and cost optimization. Thisapproach lets you control cloud spending, avoid overprovisioning, and ensure thatcloud resources are aligned with the dynamic needs of your workloads andenvironments.

Recommendations

To effectively optimize cloud resource usage, consider the following recommendations.

Choose environment-specific resources

Each deployment environment has different requirements for availability,reliability and scalability. For example, developers might prefer an environmentthat lets them rapidly deploy and run applications for short durations, but mightnot need high availability. On the other hand, a production environment typicallyneeds high availability. To maximize the utilization of your resources, defineenvironment-specific requirements based on your business needs. The followingtable lists examples of environment-specific requirements.

Note: The requirements that are listed in this table are not exhaustive orprescriptive. They're meant to serve as examples to help you understand howrequirements can vary based on the environment type.
EnvironmentRequirements
Production
  • High availability
  • Predictable performance
  • Operational stability
  • Security with robust resources
Development and testing
  • Cost efficiency
  • Flexible infrastructure withburstable capacity
  • Ephemeral infrastructure when data persistence is not necessary
Other environments (like staging and QA)
  • Tailored resource allocation based on environment-specific requirements

Choose workload-specific resources

Each of your cloud workloads might have different requirements for availability,scalability, security, and performance. To optimize costs, you need to alignresource choices with the specific requirements of each workload. For example,a stateless application might not require the same level of availability orreliability as a stateful backend. The following table lists more examples ofworkload-specific requirements.

Note: The requirements that are listed in this table are not exhaustive orprescriptive. They're meant to serve as examples to help you understand howrequirements can vary based on the workload type.
Workload typeWorkload requirementsResource options
Mission-criticalContinuous availability, robust security, and high performancePremium resources and managed services likeSpanner for high availability and global consistency of data.
Non-criticalCost-efficient and autoscaling infrastructureResources with basic features and ephemeral resources likeSpot VMs.
Event-drivenDynamic scaling based on the current demand for capacity and performanceServerless services likeCloud Run andCloud Run functions.
Experimental workloadsLow cost and flexible environment for rapid development, iteration, testing, and innovationResources with basic features, ephemeral resources likeSpot VMs, and sandbox environments with defined spending limits.

A benefit of the cloud is the opportunity to take advantage of the mostappropriate computing power for a given workload. Some workloads are developedto take advantage of processor instruction sets, and others might not be designedin this way. Benchmark and profile your workloads accordingly. Categorize yourworkloads and make workload-specific resource choices (for example, chooseappropriatemachine families for Compute Engine VMs). This practice helpsto optimize costs, enable innovation, and maintain the level of availability andperformance that your workloads need.

The following are examples of how you can implement this recommendation:

  • For mission-critical workloads that serve globally distributed users, considerusingSpanner. Spanner removes the need for complex database deployments byensuring reliability and consistency of data in allregions.
  • For workloads with fluctuating load levels, use autoscaling to ensure thatyou don't incur costs when the load is low and yet maintain sufficientcapacity to meet the current load. You can configure autoscaling for manyGoogle Cloud services, includingCompute Engine VMs,Google Kubernetes Engine (GKE) clusters,andCloud Run. When you set upautoscaling, you can configure maximum scaling limits to ensure that costsremain within specified budgets.

Select regions based on cost requirements

For your cloud workloads, carefully evaluate the available Google Cloudregions and choose regions that align with your cost objectives. The region withlowest cost might not offer optimal latency or it might not meet yoursustainability requirements. Make informed decisions about where to deploy yourworkloads to achieve the desired balance. You can use theGoogle Cloud Region Picker to understand the trade-offs between cost, sustainability, latency, and otherfactors.

Use built-in cost optimization options

Google Cloud products provide built-in features to help you optimizeresource usage and control costs. The following table lists examples of costoptimization features that you can use in some Google Cloud products:

ProductCost optimization feature
Compute Engine
GKE
  • Automatically adjust the size of GKE clusters based on the current loadby usingclusterautoscaler.
  • Automatically create and manage node pools based on workload requirementsand ensure optimal resource utilization by usingnode auto-provisioning.
Cloud Storage
  • Automatically transition data to lower-cost storage classes based on the ageof data or based on access patterns by usingObject Lifecycle Management.
  • Dynamically move data to the most cost-effective storage class based on usagepatterns by usingAutoclass.
BigQuery
  • Reduce query processing costs for steady-state workloads by usingcapacity-based pricing.
  • Optimize query performance and costs by using partitioning and clustering techniques.
Google Cloud VMware Engine

Optimize resource sharing

To maximize the utilization of cloud resources, you can deploy multipleapplications or services on the same infrastructure, while still meeting thesecurity and other requirements of the applications. For example, in developmentand testing environments, you can use the same cloud infrastructure to test allthe components of an application. For the production environment, you can deployeach component on a separate set of resources to limit the extent of impact incase of incidents.

The following are examples of how you can implement this recommendation:

  • Use a singleCloud SQL instance for multiple non-production environments.
  • Enable multiple development teams to share a GKE cluster by using thefleet team management feature in GKEwith appropriate access controls.
  • UseGKE Autopilot to take advantage ofcost-optimization techniques like bin packing and autoscaling thatGKE implements by default.
  • For AI and ML workloads, save GPU costs by usingGPU-sharing strategies like multi-instance GPUs, time-sharing GPUs, and NVIDIA MPS.

Develop and maintain reference architectures

Create and maintain a repository of reference architectures that are tailored tomeet the requirements of different deployment environments and workload types.To streamline the design and implementation process for individual projects, theblueprints can be centrally managed by a team like aCloud Center of Excellence (CCoE). Project teamscan choose suitable blueprints based on clearly defined criteria, to ensurearchitectural consistency and adoption of best practices. For requirements thatare unique to a project, the project team and the central architecture team shouldcollaborate to design new reference architectures. You can share the referencearchitectures across the organization to foster knowledge sharing and expand therepository of available solutions. This approach ensures consistency, acceleratesdevelopment, simplifies decision-making, and promotes efficient resourceutilization.

Review thereference architectures provided by Google for various use cases andtechnologies. These reference architectures incorporate best practices forresource selection, sizing, configuration, and deployment. By using thesereference architectures, you can accelerate your development process and achievecost savings from the start.

Enforce cost discipline by using organization policies

Consider usingorganization policiesto limit the available Google Cloud locations and products that teammembers can use. These policies help to ensure that teams adhere to cost-effectivesolutions and provision resources in locations that are aligned with your costoptimization goals.

Estimate realistic budgets and set financial boundaries

Develop detailed budgets for each project, workload, and deployment environment.Make sure that the budgets cover all aspects of cloud operations, includinginfrastructure costs, software licenses, personnel, and anticipated growth. Toprevent overspending and ensure alignment with your financial goals, establishclear spending limits or thresholds for projects, services, or specific resources.Monitor cloud spending regularly against these limits. You can useproactive quota alerts to identify potential cost overruns early and take timely corrective action.

In addition to setting budgets, you can usequotas and limits to help enforce cost discipline andprevent unexpected spikes in spending. You can exercise granular control overresource consumption by setting quotas at various levels, including projects,services, and even specific resource types.

The following are examples of how you can implement this recommendation:

  • Project-level quotas: Set spending limits or resource quotas at theproject level to establish overall financial boundaries and control resourceconsumption across all the services within the project.
  • Service-specific quotas: Configure quotas for specific Google Cloudservices like Compute Engine or BigQuery to limit thenumber of instances, CPUs, or storage capacity that can be provisioned.
  • Resource type-specific quotas: Apply quotas to individual resource typeslike Compute Engine VMs, Cloud Storage buckets,Cloud Run instances, or GKE nodes torestrict their usage and prevent unexpected cost overruns.
  • Quota alerts: Get notifications when your quota usage (at the projectlevel) reaches a percentage of the maximum value.

By using quotas and limits in conjunction with budgeting and monitoring, you cancreate a proactive and multi-layered approach to cost control. This approachhelps to ensure that your cloud spending remains within defined boundaries andaligns with your business objectives. Remember, these cost controls are notpermanent or rigid. To ensure that the cost controls remain aligned with currentindustry standards and reflect your evolving business needs, you must review thecontrols regularly and adjust them to include new technologies and best practices.

Optimize continuously

This principle in the cost optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you optimize the cost of your cloud deploymentsbased on constantly changing and evolving business goals.

As your business grows and evolves, your cloud workloads need to adapt to changesin resource requirements and usage patterns. To derive maximum value from yourcloud spending, you must maintain cost-efficiency while continuing to supportbusiness objectives. This requires a proactive and adaptive approach that focuseson continuous improvement and optimization.

Principle overview

To optimize cost continuously, you must proactively monitor and analyze yourcloud environment and make suitable adjustments to meet current requirements.Focus your monitoring efforts on key performance indicators (KPIs) that directlyaffect your end users' experience, align with your business goals, and provideinsights for continuous improvement. This approach lets you identify and addressinefficiencies, adapt to changing needs, and continuously align cloud spendingwith strategic business goals. To balance comprehensive observability with costeffectiveness, understand the costs and benefits of monitoring resource usageand use appropriate process-improvement and optimization strategies.

Recommendations

To effectively monitor your Google Cloud environment and optimize costcontinuously, consider the following recommendations.

Focus on business-relevant metrics

Effective monitoring starts with identifying the metrics that are most importantfor your business and customers. These metrics include the following:

  • User experience metrics: Latency, error rates, throughput, and customersatisfaction metrics are useful for understanding your end users' experiencewhen using your applications.
  • Business outcome metrics: Revenue, customer growth, and engagement canbe correlated with resource usage to identify opportunities for costoptimization.
  • DevOps Research & Assessment (DORA) metrics: Metricslike deployment frequency, lead time for changes, change failure rate, andtime to restore provide insights into the efficiency and reliability of yoursoftware delivery process. By improving these metrics, you can increaseproductivity, reduce downtime, and optimize cost.
  • Site Reliability Engineering (SRE) metrics: Errorbudgets help teams to quantify and manage the acceptable level of servicedisruption. By establishing clear expectations for reliability, error budgetsempower teams to innovate and deploy changes more confidently, knowing theirsafety margin. This proactive approach promotes a balance between innovationand stability, helping prevent excessive operational costs associated withmajor outages or prolonged downtime.

Use observability for resource optimization

The following are recommendations to use observability to identify resourcebottlenecks and underutilized resources in your cloud deployments:

  • Monitor resource utilization: Use resource utilization metrics to identifyGoogle Cloud resources that are underutilized. For example, use metricslike CPU and memory utilization to identifyidle VM resources.For Google Kubernetes Engine (GKE), you can view a detailedbreakdown of costs andcost-related optimization metrics.For Google Cloud VMware Engine,review resource utilization to optimize CUDs, storage consumption, and ESXi right-sizing.
  • Use cloud recommendations:Active Assist is a portfolio of intelligent tools that help you optimize your cloudoperations. These tools provide actionable recommendations to reduce costs,increase performance, improve security and even make sustainability-focuseddecisions. For example,VM rightsizing insights can help to optimize resource allocation and avoid unnecessary spending.
  • Correlate resource utilization with performance: Analyze the relationshipbetween resource utilization and application performance to determine whetheryou can downgrade to less expensive resources without affecting the userexperience.

Balance troubleshooting needs with cost

Detailed observability data can help with diagnosing and troubleshooting issues.However, storing excessive amounts of observability data or exporting unnecessarydata to external monitoring tools can lead to unnecessary costs. For efficienttroubleshooting, consider the following recommendations:

  • Collect sufficient data for troubleshooting: Ensure that your monitoringsolution captures enough data to efficiently diagnose and resolve issues whenthey arise. This data might include logs, traces, and metrics at variouslevels of granularity.
  • Use sampling and aggregation: Balance the need for detailed data withcost considerations by using sampling and aggregation techniques. This approachlets you collect representative data without incurring excessive storage costs.
  • Understand the pricing models of your monitoring tools and services: Evaluatedifferent monitoring solutions and choose options that align with yourproject's specific needs, budget, and usage patterns. Consider factors likedata volume, retention requirements, and the required features whenmaking your selection.
  • Regularly review your monitoring configuration: Avoid collecting excessivedata by removing unnecessary metrics or logs.

Tailor data collection to roles and set role-specific retention policies

Consider the specific data needs of different roles. For example, developersmight primarily need access to traces and application-level logs, whereas ITadministrators might focus on system logs and infrastructure metrics. By tailoringdata collection, you can reduce unnecessary storage costs and avoid overwhelmingusers with irrelevant information.

Additionally, you can define retention policies based on the needs of each roleand any regulatory requirements. For example, developers might need access todetailed logs for a shorter period, while financial analysts might requirelonger-term data.

Consider regulatory and compliance requirements

In certain industries, regulatory requirements mandate data retention. To avoidlegal and financial risks, you need to ensure that your monitoring and dataretention practices help you adhere to relevant regulations. At the same time,you need to maintain cost efficiency. Consider the following recommendations:

  • Determine the specific data retention requirements for your industry or region,and ensure that your monitoring strategy meets the requirements of thoserequirements.
  • Implement appropriate data archival and retrieval mechanisms to meet auditand compliance needs while minimizing storage costs.

Implement smart alerting

Alerting helps to detect and resolve issues in a timely manner. However, abalance is necessary between an approach that keeps you informed, and one thatoverwhelms you with notifications. By designing intelligent alerting systems,you can prioritize critical issues that have higher business impact. Considerthe following recommendations:

  • Prioritize issues that affect customers: Design alerts that triggerrapidly for issues that directly affect the customer experience, like websiteoutages, slow response times, or transaction failures.
  • Tune for temporary problems: Use appropriate thresholds and delaymechanisms to avoid unnecessary alerts for temporary problems or self-healingsystem issues that don't affect customers.
  • Customize alert severity: Ensure that the most urgent issues receiveimmediate attention by differentiating between critical and noncriticalalerts.
  • Use notification channels wisely: Choose appropriate channels for alertnotifications (email, SMS, or paging) based on the severity and urgency ofthe alerts.

Well-Architected Framework: Performance optimization pillar

This pillar in theGoogle Cloud Well-Architected Framework provides recommendations to optimize the performance of workloads inGoogle Cloud.

This document is intended for architects, developers, and administrators whoplan, design, deploy, and manage workloads in Google Cloud.

The recommendations in this pillar can help your organization to operateefficiently, improve customer satisfaction, increase revenue, and reduce cost.For example, when the backend processing time of an application decreases, usersexperience faster response times, which can lead to higher user retention andmore revenue.

The performance optimization process can involve a trade-off betweenperformance and cost. However, optimizing performance can sometimes help youreduce costs. ​​For example, when the load increases, autoscaling can help toprovide predictable performance by ensuring that the system resources aren'toverloaded. Autoscaling also helps you to reduce costs by removing unusedresources during periods of low load.

Performance optimization is a continuous process, not a one-time activity. Thefollowing diagram shows the stages in the performance optimization process:

Performance optimization process

The performance optimization process is an ongoing cycle that includes thefollowing stages:

  1. Define requirements: Define granular performance requirements foreach layer of the application stack before you design and develop yourapplications. To planresource allocation, consider the key workload characteristics and performanceexpectations.
  2. Design and deploy: Use elastic and scalable design patterns that canhelp you meet your performance requirements.
  3. Monitor and analyze: Monitor performance continually by using logs,tracing, metrics, and alerts.
  4. Optimize: Consider potential redesigns as your applications evolve.Rightsize cloud resources and use new features to meet changing performancerequirements.

    As shown in the preceding diagram, continue the cycle of monitoring,re-assessing requirements, and adjusting the cloud resources.

For performance optimization principles and recommendations that are specific to AI and ML workloads, seeAI and ML perspective: Performance optimizationin the Well-Architected Framework.

Core principles

The recommendations in the performance optimization pillar of the Well-Architected Frameworkare mapped to the following core principles:

Contributors

Authors:

Other contributors:

Plan resource allocation

This principle in the performance optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you plan resources for your workloads inGoogle Cloud. It emphasizes the importance of defining granularrequirements before you design and develop applications for cloud deployment ormigration.

Principle overview

To meet your business requirements, it's important that you define the performancerequirements for your applications, before design and development. Define theserequirements as granularly as possible for the application as a whole and foreach layer of the application stack. For example, in the storage layer, youmust consider the throughput and I/O operations per second (IOPS) that theapplications need.

From the beginning, plan application designs with performance and scalability inmind. Consider factors such as the number of users, data volume, and potentialgrowth over time.

Performance requirements for each workload vary and depend on the type ofworkload. Each workload can contain a mix of component systems and services thathave unique sets of performance characteristics. For example, a system that'sresponsible for periodic batch processing of large datasets has differentperformance demands than an interactive virtual desktop solution.Your optimization strategies must address the specific needs of each workload.

Select services and features that align with the performance goals of eachworkload. For performance optimization, there's no one-size-fits-all solution. When youoptimize each workload, the entire system can achieve optimal performance andefficiency.

Consider the following workload characteristics that can influence yourperformance requirements:

  • Deployment archetype: Thedeployment archetype that you select for an application can influence your choice of products and features,which then determine the performance that you can expect from your application.
  • Resource placement: When you select a Google Cloudregion for your application resources, we recommend that you prioritize low latency for endusers, adhere to data-locality regulations, and ensure the availability ofrequired Google Cloud products and services.
  • Network connectivity: Choose networking services that optimize dataaccess and content delivery. Take advantage of Google Cloud's globalnetwork, high-speed backbones, interconnect locations, and caching services.
  • Application hosting options: When you select a hosting platform, youmust evaluate the performance advantages and disadvantages of each option.For example, consider bare metal, virtual machines, containers, and serverlessplatforms.
  • Storage strategy: Choose anoptimal storage strategy that's based on your performance requirements.
  • Resource configurations: The machine type, IOPS, and throughput canhave a significant impact on performance. Additionally, early in the designphase, you must consider appropriate security capabilities and their impact onresources. When you plan security features, be prepared to accommodate thenecessary performance trade-offs to avoid any unforeseen effects.

Recommendations

To ensure optimal resource allocation, consider the recommendations in thefollowing sections.

Configure and manage quotas

Ensure that your application uses only the necessary resources, such as memory,storage, and processing power. Over-allocation can lead to unnecessary expenses,while under-allocation might result in performance degradation.

To accommodate elastic scaling and to ensure that adequate resources areavailable, regularly monitor the capacity of your quotas. Additionally, trackquota usage to identify potential scaling constraints or over-allocation issues,and then make informed decisions about resource allocation.

Educate and promote awareness

Inform your users about the performance requirements and provideeducational resources about effective performance management techniques.

To evaluate progress and to identify areas for improvement, regularly document thetarget performance and the actual performance. Load test your application to findpotential breakpoints and to understand how you can scale the application.

Monitor performance metrics

UseCloud Monitoring to analyze trends in performance metrics, to analyze the effects of experiments,to define alerts for critical metrics, and to perform retrospective analyses.

Active Assist is a set of tools that can provide insights and recommendations to help optimizeresource utilization. These recommendations can help you to adjust resourceallocation and improve performance.

Take advantage of elasticity

This principle in the performance optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you incorporate elasticity, which is the abilityto adjust resources dynamically based on changes in workload requirements.

Elasticity allows different components ofa system to scale independently. This targeted scaling can help improve performance andcost efficiency by allocating resources precisely where they're needed, withoutover provisioning or under provisioning your resources.

Principle overview

The performance requirements of a system directly influence when and how thesystem scales vertically or scales horizontally. You need to evaluate the system'scapacity and determine the load that the system is expected to handle at baseline.Then, you need to determine how you want the system to respond to increases and decreasesin the load.

When the load increases, the system must scale out horizontally, scale upvertically, or both. For horizontal scaling, add replica nodes to ensure thatthe system has sufficient overall capacity to fulfill the increased demand. Forvertical scaling, replace the application's existing components with componentsthat contain more capacity, more memory, and more storage.

When the load decreases, the system must scale down (horizontally, vertically,or both).

Define thecircumstances in which the system scales up or scales down. Plan tomanually scale up systems for known periods of high traffic. Use tools likeautoscaling, which responds to increases or decreases in the load.

Recommendations

To take advantage of elasticity, consider the recommendations in the followingsections.

Plan for peak load periods

You need to plan an efficient scaling path for known events, such as expectedperiods of increased customer demand.

Consider scaling up your system ahead of known periods of high traffic. Forexample, if you're a retail organization, you expect demand to increase duringseasonal sales. We recommend that you manually scale up or scale out your systems beforethose sales to ensure that your system can immediately handle the increased loador immediately adjust existing limits. Otherwise, the system might take several minutes toadd resources in response to real-time changes. Your application's capacitymight not increase quickly enough and cause some users to experience delays.

For unknown or unexpected events, such as a sudden surge in demand or traffic,you can use autoscaling features to trigger elastic scaling that's based onmetrics. These metrics can include CPU utilization, load balancer servingcapacity, latency, and even custom metrics that you define inCloud Monitoring.

For example, consider an application that runs on aCompute Engine managed instance group (MIG). This application has a requirement that each instance performsoptimally until the average CPU utilization reaches 75%. In this example, youmight define anautoscaling policy that creates more instances when the CPU utilization reaches the threshold.These newly-created instances help absorb the load, which helps ensure that the averageCPU utilization remains at an optimal rate until the maximum number of instancesthat you've configured for the MIG is reached. When the demand decreases, theautoscaling policy removes the instances that are no longer needed.

Planresource slot reservations in BigQuery or adjust the limits for autoscaling configurations in Spanner by using themanaged autoscaler.

Use predictive scaling

If your system components include Compute Engine, you must evaluate whetherpredictive autoscaling is suitable for your workload. Predictive autoscaling forecasts the future loadbased on your metrics' historical trends—for example, CPU utilization.Forecasts are recomputed every few minutes, so the autoscaler rapidly adapts itsforecast to very recent changes in load. Without predictive autoscaling, anautoscaler can only scale a group reactively, based on observed real-time changesin load. Predictive autoscaling works with both real-time data andhistorical data to respond to both the current and the forecasted load.

Implement serverless architectures

Consider implementing a serverless architecture with serverless services thatare inherently elastic, such as the following:

Unlike autoscaling in other services that require fine-tuning rules (forexample, Compute Engine), serverless autoscaling is instant and canscale down to zero resources.

Use Autopilot mode for Kubernetes

For complex applications that require greater control over Kubernetes, considerAutopilot mode in Google Kubernetes Engine (GKE).Autopilot mode provides automation and scalability by default.GKE automatically scales nodes and resources based ontraffic. GKE manages nodes, creates new nodes for your applications, andconfigures automatic upgrades and repairs.

Promote modular design

This principle in the performance optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you promote a modular design.Modular components and clear interfaces can enable flexible scaling,independent updates, and future component separation.

Principle overview

Understand the dependencies between the application components and the systemcomponents to design a scalable system.

Modular design enables flexibility and resilience, regardless of whether amonolithic or microservices architecture was initially deployed. By decomposingthe system into well-defined, independent modules with clear interfaces, you canscale individual components to meet specific demands.

Targeted scaling can help optimize resource utilization and reduce costs in thefollowing ways:

  • Provisions only the necessary resources to each component, andallocates fewer resources to less-demanding components.
  • Adds more resources during high-traffic periods to maintain the userexperience.
  • Removes under-utilized resources without compromising performance.

Modularity also enhances maintainability. Smaller, self-contained units areeasier to understand, debug, and update, which can lead to faster developmentcycles and reduced risk.

While modularity offers significant advantages, you must evaluate the potentialperformance trade-offs. The increased communication between modules canintroduce latency and overhead. Strive for a balance between modularity andperformance. A highly modular design might not be universally suitable. Whenperformance is critical, a more tightly coupled approach might be appropriate.System design is aniterative process,in which you continuously review and refine your modular design.

Recommendations

To promote modular designs, consider the recommendations in the followingsections.

Design for loose coupling

Design aloosely coupled architecture.Independent components with minimal dependencies can help you buildscalable and resilient applications.As you plan the boundaries for your services, you must consider the availabilityand scalability requirements. For example, if one component has requirements thatare different from your other components, you can design the component as a standaloneservice. Implement a plan for graceful failures for less-important subprocessesor services that don't impact the response time of the primary services.

Design for concurrency and parallelism

Design your application to support multiple tasks concurrently, like processingmultiple user requests or running background jobs while users interact with yoursystem. Break large tasks into smaller chunks that can be processed at the sametime by multiple service instances. Task concurrency lets you use features likeautoscaling to increase the resource allocation in products like the following:

Balance modularity for flexible resource allocation

Where possible, ensure that each component uses only the necessary resources(like memory, storage, and processing power) for specific operations. Resourceover-allocation can result in unnecessary costs, while resource under-allocation cancompromise performance.

Use well-defined interfaces

Ensure modular components communicate effectively through clear, standardizedinterfaces (like APIs and message queues) to reduce overhead from translationlayers or from extraneous traffic.

Use stateless models

A stateless model can help ensure that you can handle each request or interaction withthe service independently from previous requests. This model facilitatesscalability and recoverability, because you can grow, shrink, or restart theservice without losing the data necessary for in-progress requests orprocesses.

Choose complementary technologies

Choose technologies that complement the modular design. Evaluate programminglanguages, frameworks, and databases for their modularity support.

For more information, see the following resources:

Continuously monitor and improve performance

This principle in the performance optimization pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you continuously monitor and improve performance.

After you deploy applications, continuously monitor their performance by usinglogs, tracing, metrics, and alerts. As your applications grow and evolve, youcan use the trends in these data points to re-assess your performancerequirements. You might eventually need to redesign parts of your applicationsto maintain or improve their performance.

Principle overview

The process of continuous performance improvement requires robust monitoringtools and strategies. Cloud observability tools can help you to collect keyperformance indicators (KPIs) such as latency, throughput, error rates, andresource utilization. Cloud environments offer a variety of methods to conductgranular performance assessments across the application, the network, and theend-user experience.

Improving performance is an ongoing effort that requires a multi-facetedapproach. The following key mechanisms and processes can help you to boostperformance:

  • To provide clear direction and help track progress, define performanceobjectives that align with your business goals. Set SMART goals: specific,measurable, achievable, relevant, and time-bound.
  • To measure performance and identify areas forimprovement, gather KPI metrics.
  • To continuously monitor your systems for issues, use visualizedworkflows in monitoring tools. Use architecture process mapping techniquesto identify redundancies and inefficiencies.
  • To create a culture of ongoing improvement, provide training and programsthat support your employees' growth.
  • To encourage proactive and continuous improvement, incentivize your employeesand customers to provide ongoing feedback about your application's performance.

Recommendations

To promote modular designs, consider the recommendations in the followingsections.

Define clear performance goals and metrics

Define clear performance objectives that align with your business goals. Thisrequires a deep understanding of your application's architecture and theperformance requirements of each application component.

As a priority, optimize the most critical components that directly influence yourcore business functions and user experience. To help ensure that these componentscontinue to run efficiently and meet your business needs, set specific andmeasurable performance targets. These targets can include response times, errorrates, and resource utilization thresholds.

This proactive approach can help you to identify and address potential bottlenecks,optimize resource allocation, and ultimately deliver a seamless andhigh-performing experience for your users.

Monitor performance

Continuously monitor your cloud systems for performance issues and set upalerts for any potential problems. Monitoring and alerts can help you to catchand fix issues before they affect users. Application profiling can help toidentify bottlenecks and can help to optimize resource use.

You can use tools that facilitate effective troubleshooting and networkoptimization. UseGoogle Cloud Observability to identify areas that havehigh CPU consumption, memory consumption, or network consumption. Thesecapabilities can help developers improve efficiency,reduce costs, and enhance the user experience.Network Intelligence Center showsvisualizations of the topology of your network infrastructure, and can help youto identify high-latency paths.

Incentivize continuous improvement

Create a culture of ongoing improvement that can benefit both the application andthe user experience.

Provide your employees with training and development opportunities that enhancetheir skills and knowledge in performance techniques across cloud services.Establish a community of practice (CoP) and offer mentorship and coachingprograms to support employee growth.

To preventreactive performance management and encourage proactive performancemanagement, encourage ongoing feedback from your employees, your customers, andyour stakeholders. You can consider gamifying the process by tracking KPIs onperformance and presenting those metrics to teams on a frequent basis in theform of a league table.

To understand your performance and user happiness over time, we recommend thatyou measure user feedback quantitatively and qualitatively. TheHEART framework can help you capture user feedback across five categories:

  • Happiness
  • Engagement
  • Adoption
  • Retention
  • Task success

By using such a framework, you can incentivize engineers with data-drivenfeedback, user-centered metrics, actionable insights, and a clear understandingof goals.

Well-Architected Framework: Sustainability pillar

The sustainability pillar in theGoogle Cloud Well-Architected Framework provides recommendations to design, build, and manage workloads in Google Cloudthat are energy-efficient and carbon-aware.

The target audience for this document includes decision-makers, architects,administrators, developers, and operators who design, build, deploy, andmaintain workloads in Google Cloud.

Architectural and operational decisions have a significant impact on the energyusage, water impact, and carbon footprint that's driven by your workloads in thecloud. Every workload, whether it's a small website or a large-scale ML model,consumes energy and contributes to carbon emissions and water resourceintensity. When you integrate sustainability into your cloud architecture anddesign process, you build systems that are efficient, cost-effective,andenvironmentally sustainable. A sustainable architecture is resilient andoptimized, which creates a positive feedback loop of higher efficiency, lowercost, and lower environmental impact.

Sustainable by design: Holistic business outcomes

Sustainability isn't a trade-off against other core business objectives;sustainability practices help to accelerate your other business objectives.Architecture choices that prioritize low-carbon resources and operations helpyou build systems that are also faster, cheaper, and more secure. Such systemsare considered to besustainable by design, where optimizing forsustainability leads to overall positive outcomes for performance, cost,security, resilience, and user experience.

Performance optimization

Systems that areoptimized for performance inherently use fewer resources. An efficient application that completes a taskfaster requires compute resources for a shorter duration. Therefore, theunderlying hardware consumes less kilowatt-hours (kWh) of energy. Optimizedperformance also leads to lower latency and better user experience. Time andenergy aren't wasted by resources waiting on inefficient processes. When youuse specialized hardware (for example, GPUs and TPUs), adopt efficientalgorithms, and maximize parallel processing, you improve performanceandreduce the carbon footprint of your cloud workload.

Cost optimization

Cloud operational expenditure depends on resource usage. Due to this directcorrelation, when you continuouslyoptimize cost,you also reduce energy consumption and carbon emissions. When you right-sizeVMs, implement aggressive autoscaling, archive old data, and eliminate idleresources, you reduce resource usage and cloud costs. You also reduce the carbonfootprint of your systems, because the data centers consume less energy to runyour workloads.

Security and resilience

Security andreliability are prerequisites for a sustainable cloud environment. A compromised system—forexample, a system that's affected by a denial of service (DoS) attack or anunauthorized data breach—can dramatically increase resource consumption. Theseincidents can trigger massive spikes in traffic, create runaway compute cyclesfor mitigation, and necessitate lengthy, high-energy operations for forensicanalysis, cleanup, and data restoration. Strong security measures can help toprevent unnecessary spikes in resource usage, so that your operations remainstable, predictable, and energy-efficient.

User experience

Systems that prioritize efficiency, performance, accessibility, and minimal useof data can help to reduce energy usage by end users. An application that loadsa smaller model or processes less data to deliver results faster helps to reducethe energy that's consumed by network devices and end-user devices. Thisreduction in energy usage particularly benefits users who have limited bandwidthor who use older devices. Further, sustainable architecture helps to minimizeplanetary harm and demonstrates your commitment to socially responsibletechnology.

Sustainability value of migrating to the cloud

Migrating on-premises workloads to the cloud can help to reduce yourorganization's environmental footprint. The transition to cloud infrastructurecan reduce energy usage and associated emissions by1.4 to 2 times when compared to typical on-premises deployments. Cloud data centers are modern,custom-designed facilities that are built for highpower usage effectiveness (PUE).Older on-premises data centers often lack the scale that's necessary to justifyinvestments in advanced cooling and power distribution systems.

Shared responsibility and shared fate

Shared responsibilities and shared fate on Google Cloud describes how security for cloud workloads is a shared responsibility betweenGoogle and you, the customer. This shared responsibility model also applies tosustainability.

Google is responsible for the sustainabilityof Google Cloud, whichmeans the energy efficiency and water stewardship of our data centers,infrastructure, and core services. We invest continuously in renewable energy,climate-conscious cooling, and hardware optimization. For more information aboutGoogle's sustainability strategy and progress, seeGoogle Sustainability 2025 Environmental Report.

You, the customer, are responsible for sustainabilityin the cloud, which means optimizing your workloads to be energy efficient.For example, you can right-size resources, use serverless services that scale tozero, and manage data lifecycles effectively.

We also advocate a shared fate model: sustainability isn't just a division oftasks but a collaborative partnership between you and Google to reduce theenvironmental footprint for the entire ecosystem.

Use AI for business impact

The sustainability pillar of the Well-Architected Framework (this document) includesguidance to help you design sustainable AI systems. However, a comprehensivesustainability strategy extends beyond the environmental impact of AI workloads.The strategy should include ways to use AI to optimize operations and create newbusiness value.

AI serves as a catalyst for sustainability by transforming vast datasets intoactionable insights. It enables organizations to transition from reactivecompliance to proactive optimization, such as in the following areas:

  • Operational efficiency: Streamline operations through improved inventorymanagement, supply chain optimization, and intelligent energy management.
  • Transparency and risk: Use data for granular supply chain transparency,regulatory compliance, and climate risk modeling.
  • Value and growth: Develop new revenue streams in sustainable finance andrecommerce.

Google offers the following products and features to help you derive insightsfrom data and build capabilities for a sustainable future:

  • Google Earth AI:Uses planetary-scale geospatial data to analyze environmental changes andmonitor supply chain impacts.
  • WeatherNext:Provides advanced weather forecasting and climate risk analytics to help youbuild resilience against climate volatility.
  • Geospatial insights with Google Earth:Uses geospatial data to add rich contextual data to locations, which enablessmarter site selection, resource planning, and operations.
  • Google Maps routes optimization:Optimizes logistics and delivery routes to increase efficiency and reduce fuelconsumption and transportation emissions.

Collaborations with partners and customers

Google Cloud andTELUS have partnered to advance cloud sustainability by migrating workloads toGoogle's carbon-neutral infrastructure and leveraging data analytics to optimizeoperations. This collaboration provides social and environmental benefitsthrough initiatives like smart-city technology, which uses real-time data toreduce traffic congestion and carbon emissions across municipalities in Canada.For more information about this collaboration, seeGoogle Cloud and TELUS collaborate for sustainability.

Core principles

The recommendations in the sustainability pillar of the Well-Architected Framework are mapped to the following core principles:

Contributors

Author:Brett Tackaberry | Principal Architect

Other contributors:

Use regions that consume low-carbon energy

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you select low-carbon regions for yourworkloads in Google Cloud.

Principle overview

When you plan to deploy a workload in Google Cloud, an importantarchitectural decision is the choice of Google Cloud region for theworkload. This decision affects the carbon footprint of your workload. Tominimize the carbon footprint, your region-selection strategy must include thefollowing elements:

  • Data-driven selection: To identify and prioritize regions, consider theleaf iconLow CO2 indicator and thecarbon-free energy (CFE) metric.
  • Policy-based governance: Restrict resource creation to environmentallyoptimal locations by using theresource locations constraint inOrganization Policy Service.
  • Operational flexibility: Use techniques like time-shifting andcarbon-aware scheduling to run batch workloads during hours when the carbonintensity of theelectrical grid is the lowest.

The electricity that's used to power your application and workloads in the cloudis an important factor that affects your choice of Google Cloud regions.In addition, consider the following factors:

  • Data residency and sovereignty: The location where you need to store yourdata is a foundational factor that dictates your choice of Google Cloudregion. This choice affects compliance with localdata residency requirements.
  • Latency for end users: The geographical distance between your end usersand the regions where you deploy applications affects user experience andapplication performance.
  • Cost: The pricing for Google Cloud resources can be different acrossregions.

TheGoogle Cloud Region Picker tool helps you select optimal Google Cloud regions based on yourrequirements for carbon footprint, cost, and latency. You can also useCloud Location Finder to find cloud locations in Google Cloud and other providers basedon your requirements for proximity, carbon-free energy (CFE) usage, and otherparameters.

Recommendations

To deploy your cloud workloads in low-carbon regions, consider therecommendations in the following sections. These recommendations are based onthe guidance inCarbon-free energy for Google Cloud regions.

Understand the carbon intensity of cloud regions

Google Cloud data centers in a region use energy from the electrical gridwhere the region is located. Google measures the carbon impact of a region byusing the CFE metric, which is calculated every hour. CFE indicates thepercentage of carbon-free energy out of the total energy that's consumed duringan hour. The CFE metric depends on two factors:

  • The type of power-generation plants that supply the grid during a given period.
  • Google-attributed clean energy that's supplied to the grid during that time.

For information about the aggregated average hourly CFE% for eachGoogle Cloud region, seeCarbon-free energy for Google Cloud regions.You can also get this data in a machine-readable format from theCarbon free energy for Google Cloud regions repository in GitHub and aBigQuery public dataset.

Incorporate CFE in your location-selection strategy

Consider the following recommendations:

  • Select the cleanest region for your applications. If you plan to run anapplication for a long period, run it in the region that has the highest CFE%.For batch workloads, you have greater flexibility in choosing a region becauseyou can predict when the workload must run.
  • Select low-carbon regions. Certain pages in the Google Cloud websiteand location selectors in the Google Cloud console show theleaf iconLow CO2 indicator forregions that have the lowest carbon impact.
  • Restrict the creation of resources to specific low-carbon Google Cloudregions by using theresource locations Organization Policy constraint. For example, to allow the creation ofresources in only US-based low-carbon regions, create a constraint thatspecifies thein:us-low-carbon-locations value group.

When you select locations for your Google Cloud resources, also considerbest practices for region selection,including factors like data residency requirements, latency to end users,redundancy of the application, availability of services, and pricing.

Use time-of-day scheduling

The carbon intensity of an electrical grid can vary significantly throughout theday. The variation depends on the mix of energy sources that supply the grid. Youcan schedule workloads, particularly those that are flexible or non-urgent, torun when the grid is supplied by a higher proportion of CFE.

For example, many grids have higher CFE percentages during off-peak hours orwhen renewable sources like solar and wind supply more power to the grid. Byscheduling compute-intensive tasks such as model training and large-scale batchinference during higher-CFE hours, you can significantly reduce the associatedcarbon emissions without affecting performance or cost. This approach is knownastime-shifting, where you use the dynamic nature of a grid's carbonintensity to optimize your workloads for sustainability.

Optimize AI and ML workloads for energy efficiency

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations for optimizing AI and ML workloads to reduce theirenergy usage and carbon footprint.

Principle overview

To optimize AI and ML workloads for sustainability, you need to adopt a holisticapproach to designing, deploying, and operating the workloads. Selectappropriate models and specialized hardware like Tensor Processing Units (TPUs),run the workloads in low-carbon regions, optimize to reduce resource usage, andapply operational best practices.

Architectural and operational practices that optimize the cost and performanceof AI and ML workloads inherently lead to reduced energy consumption and lowercarbon footprint. TheAI and ML perspective in the Well-Architected Framework describes principles and recommendations to design,build, and manage AI and ML workloads that meet your operational, security,reliability, cost, and performance goals. In addition, theCloud Architecture Center provides detailed reference architectures and design guides for AI and MLworkloads in Google Cloud.

Recommendations

To optimize AI and ML workloads for energy efficiency, consider therecommendations in the following sections.

Architect for energy efficiency by using TPUs

AI and ML workloads can be compute-intensive. The energy consumption by AI andML workloads is a key consideration for sustainability.TPUs let you significantly improve the energy efficiency and sustainability of yourAI and ML workloads.

TPUs are custom-designed accelerators that are purpose-built for AI and MLworkloads. The specialized architecture of TPUs make them highly effective forlarge-scale matrix multiplication, which is the foundation of deep learning.TPUs can perform complex tasks at scale with greater efficiency thangeneral-purpose processors like CPUs or GPUs.

TPUs provide the following direct benefits for sustainability:

  • Lower energy consumption: TPUs are engineered for optimal energyefficiency. They deliver higher computations per watt of energy consumed.Their specialized architecture significantly reduces the power demands oflarge-scale training and inference tasks, which leads to reducedoperational costsand lower energy consumption.
  • Faster training and inference: The exceptional performance of TPUslets you train complex AI models in hours rather than days. Thissignificant reduction in the total compute time contributes directly to asmaller environmental footprint.
  • Reduced cooling needs: TPUs incorporate advanced liquid cooling,which provides efficient thermal management and significantly reduces theenergy that's used for cooling the data center.
  • Optimization of the AI lifecycle: By integrating hardware andsoftware, TPUs provide an optimized solution across the entire AI lifecycle,from data processing to model serving.

Follow the 4Ms best practices for resource selection

Google recommends a set of best practices to reduce energy usage and carbonemissions significantly for AI and ML workloads. We call these best practices4Ms:

  • Model: Select efficient ML model architectures. For example,sparse models improve ML quality and reduce computation by 3-10 times when compared todense models.
  • Machine: Choose processors and systems that are optimized for MLtraining. These processors improve performance and energy efficiency by 2-5times when compared to general-purpose processors.
  • Mechanization: Deploy your compute-intensive workloads in the cloud.Your workloads use less energy and cause lower emissions by 1.4 to 2times when compared to on-premises deployments. Cloud data centers usenewer, custom-designed warehouses that are built for energy efficiency andhave a highpower usage effectiveness (PUE) ratio. On-premises data centers are often older and smaller, thereforeinvestments in energy-efficient cooling and power distribution systems mightnot be economical.
  • Map: Select Google Cloud locations that use the cleanest energy.This approach helps to reduce the gross carbon footprint of your workloadsby 5-10 times. For more information, seeCarbon-free energy for Google Cloud regions.

For more information about the 4Ms best practices and efficiency metrics, see thefollowing research papers:

Optimize AI models and algorithms for training and inference

The architecture of an AI model and the algorithms that are used for trainingand inference have a significant impact on energy consumption. Consider thefollowing recommendations.

Select efficient AI models

Choose smaller, more efficient AI models that meet your performancerequirements. Don't select the largest available model as a default choice. Forexample, a smaller, distilled model version likeDistilBERT can deliver similar performance with significantly less computational overheadand faster inference than a larger model like BERT.

Use domain-specific, hyper-efficient solutions

Choose specialized ML solutions that provide better performance and requiresignificantly less compute power than a large foundation model. Thesespecialized solutions are often pre-trained and hyper-optimized. They canprovide significant reductions in energy consumption and research effort forboth training and inference workloads. The following are examples ofdomain-specific specialized solutions:

  • Earth AI is an energy-efficient solution that synthesizes large amounts of globalgeospatial data to provide timely, accurate, and actionable insights.
  • WeatherNext produces faster, more efficient, and highly accurate global weatherforecasts when compared to conventional physics-based methods.

Apply appropriate model compression techniques

The following are examples of techniques that you can use for model compression:

  • Pruning: Remove unnecessary parameters from a neural network. These areparameters that don't contribute significantly to a model's performance.This technique reduces the size of the model and the computational resourcesthat are required for inference.
  • Quantization: Reduce the precision of model parameters. For example,reduce the precision from 32-bit floating-point to 8-bit integers. Thistechnique can help to significantly decrease the memory footprint and powerconsumption without a noticeable reduction in accuracy.
  • Knowledge distillation: Train a smallerstudent model to mimic thebehavior of a larger, more complexteacher model. The student model canachieve a high level of performance with fewer parameters and by using lessenergy.

Use specialized hardware

As mentioned inFollow the 4Ms best practices for resource selection, choose processorsand systems that are optimized for ML training. These processors improveperformance and energy efficiency by 2-5 times when compared to general-purposeprocessors.

Use parameter-efficient fine-tuning

Instead of adjusting all of a model's billions of parameters (fullfine-tuning), use parameter-efficient fine-tuning (PEFT) methods like low-rankadaptation (LoRA). With this technique, you freeze the original model's weightsand train only a small number of new, lightweight layers. This approach helps toreduce cost and energy consumption.

Follow best practices for AI and ML operations

Operational practices significantly affect the sustainability of your AI and MLworkloads. Consider the following recommendations.

Optimize model training processes

Use the following techniques to optimize your model training processes:

  • Early stopping: Monitor the training process and stop it when you don'tobserve further improvement in model performance against the validation set.This technique helps you prevent unnecessary computations and energy use.
  • Efficient data loading: Use efficient data pipelines to ensure thatthe GPUs and TPUs are always utilized and don't wait for data. Thistechnique helps to maximize resource utilization and reduce wasted energy.
  • Optimized hyperparameter tuning: To find optimal hyperparametersmore efficiently, use techniques like Bayesian optimization orreinforcement learning. Avoid exhaustive grid searches, which can beresource-intensive operations.

Improve inference efficiency

To improve the efficiency of AI inference tasks, use the following techniques:

  • Batching: Group multiple inference requests in batches and takeadvantage of parallel processing on GPUs and TPUs. This technique helps toreduce the energy cost per prediction.
  • Advanced caching: Implement a multi-layered caching strategy, whichincludes key-value (KV) caching for autoregressive generation andsemantic-prompt caching for application responses. This technique helps tobypass redundant model computations and can yield significant reductions inenergy usage and carbon emissions.

Measure and monitor

Monitor and measure the following parameters:

  • Usage and cost: Use appropriate tools to track the token usage,energy consumption, and carbon footprint of your AI workloads. This datahelps you identify opportunities for optimization and report progresstoward sustainability goals.
  • Performance: Continuously monitor model performance in production.Identify issues like data drift, which can indicate that the model needs tobe fine-tuned again. If you need to re-train the model, you can use theoriginal fine-tuned model as a starting point and save significant time,money, and energy on updates.

For more information about operationalizing continuous improvement, seeContinuously measure and improve sustainability.

Implement carbon-aware scheduling

Architect your ML pipeline jobs to run in regions with the cleanest energy mix.Use the Carbon Footprint report to identify the leastcarbon-intensive regions. Schedule resource-intensive tasks as batch jobs duringperiods when the local electrical grid has a higher percentage of carbon-freeenergy (CFE).

Optimize data pipelines

ML operations and fine-tuning require a clean, high-quality dataset. Beforeyou start ML jobs, use managed data processing services to prepare the dataefficiently. For example, useDataflow for streaming and batch processing and useDataproc for managed Spark and Hadoop pipelines. An optimized data pipeline helps toensure that your fine-tuning workload doesn't wait for data, so you canmaximize resource utilization and help reduce wasted energy.

Embrace MLOps

To automate and manage the entire ML lifecycle, implementML Operations (MLOps) practices. These practices helpto ensure that models are continuously monitored, validated, and redeployedefficiently, which helps to prevent unnecessary training or resource allocation.

Use managed services

Instead of managing your own infrastructure, use managed cloud services likeVertex AI.The cloud platform handles the underlying resource management, which lets youfocus on the fine-tuning process. Use services that include built-in tools forhyperparameter tuning, model monitoring, and resource management.

What's next

Optimize resource usage for sustainability

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you optimize resource usage by your workloadsin Google Cloud.

Principle overview

Optimizing resource usage is crucial for enhancing the sustainability of yourcloud environment. Every resource that's provisioned—from compute cycles to datastorage—directly affects energy usage, water intensity, and carbon emissions. Toreduce the environmental footprint of your workloads, you need to make informedchoices when you provision, manage, and use cloud resources.

Recommendations

To optimize resource usage, consider the recommendations in the followingsections.

Implement automated and dynamic scaling

Automated and dynamic scaling ensures that resource usage is optimal, whichhelps to prevent energy waste from idle or over-provisioned infrastructure. Thereduction in wasted energy translates to lower costs and lower carbon emissions.

Use the following techniques to implement automated and dynamic scalability.

Use horizontal scaling

Horizontal scaling is the preferred scaling technique for most cloud-firstapplications. Instead of increasing the size of each instance, known asvertical scaling, you add instances to distribute the load. For example, youcan usemanaged instance groups (MIGs) to automatically scale out a group of Compute Engine VMs. Horizontally scaledinfrastructure is more resilient because the failure of an instance doesn'taffect the availability of the application. Horizontal scaling is also aresource-efficient technique for applications that have variable load levels.

Configure appropriate scaling policies

Configure autoscaling settings based on the requirements of your workloads.Define custom metrics and thresholds that are specific to application behavior.Instead of relying solely on CPU utilization, consider metrics like queue depthfor asynchronous tasks, request latency, and custom application metrics. Toprevent frequent, unnecessary scaling orflapping, define clear scalingpolicies. For example, for workloads that you deploy inGoogle Kubernetes Engine (GKE), configure an appropriatecluster autoscaling policy.

Combine reactive and proactive scaling

With reactive scaling, the system scales in response to real-time load changes.This technique is suitable for applications that have unpredictable spikes inload.

Proactive scaling is suitable for workloads with predictable patterns, such asfixed daily business hours and weekly reports generation. For such workloads,use scheduled autoscaling to pre-provision resources so that they can handle ananticipated load level. This technique prevents a scramble for resources andensures smoother user experience with higher efficiency. This technique alsohelps you plan proactively for known spikes in load such as major sales eventsand focused marketing efforts.

Google Cloud managed services and features like GKEAutopilot, Cloud Run, and MIGs automatically manage proactivescaling by learning from your workload patterns. By default, when aCloud Run service doesn't receive any traffic, it scales to zeroinstances.

Design stateless applications

For an application to scale horizontally, its components should be stateless.This means that a specific user's session or data isn't tied to a single computeinstance. When you store session state outside the compute instance, such as inMemorystore for Redis, any compute instance can handle requests from anyuser. This design approach enables horizontal scaling that's seamless andefficient.

Use scheduling and batches

Batch processing is ideal for large-scale, non-urgent workloads. Batch jobs canhelp to optimize your workloads for energy efficiency and cost.

Use the following techniques to implement scheduling and batch jobs.

Schedule for low carbon intensity

Schedule your batch jobs to run in low-carbon regions and during periods whenthe local electrical grid has a high percentage of clean energy. To identifythe least carbon-intensive times of day for a region, use theCarbon Footprint report.

Use Spot VMs for noncritical workloads

Spot VMs let you take advantage of unused Compute Engine capacity at a steep discount.Spot VMs can be preempted, but they provide a cost-effective wayto process large datasets without the need for dedicated, always-on resources.Spot VMs are ideal for non-critical, fault-tolerant batch jobs.

Consolidate and parallelize jobs

To reduce the overhead for starting up and shutting down individual jobs, groupsimilar jobs into a single large batch. Run these high-volume workloads onservices likeBatch.The service automatically provisions and manages the necessary infrastructure,which helps to ensures optimal resource utilization.

Use managed services

Managed services like Batch andDataflow automatically handle resource provisioning, scheduling, and monitoring. Thecloud platform handles resource optimization. You can focus on the applicationlogic. For example,Dataflow automatically scales the number of workers based on the data volume in the pipeline, so you don't pay for idle resources.

Match VM machine families to workload requirements

The machine types that you can use for your Compute Engine VMs aregrouped intomachine families,which are optimized for different workloads. Choose appropriate machine familiesbased on the requirements of your workloads.

Machine familyRecommended for workload typesSustainability guidance
General-purpose instances (E2, N2, N4, Tau T2A/T2D): These instances provide a balanced ratio of CPU to memory.Web servers, microservices, small to medium databases, and development environments.The E2 series is highly cost-efficient and energy-efficient due to its dynamic allocation of resources. The Tau T2A series uses Arm-based processors, which are often more energy-efficient per unit of performance for large-scale workloads.
Compute-optimized instances (C2, C3): These instances provide a high vCPU-to-memory ratio and high performance per core.High performance computing (HPC), batch processing, gaming servers, and CPU-based data analytics.A C-series instance lets you complete CPU-intensive tasks faster, which reduces the total compute time and energy consumption of the job.
Memory-optimized instances (M3, M2): These instances are designed for workloads that require a large amount of memory.Large in-memory databases and data warehouses, such as SAP HANA or in-memory analytics.Memory-optimized instances enable the consolidation of memory-heavy workloads on fewer physical nodes. This consolidation reduces the total energy that's required when compared to using multiple smaller instances. High-performance memory reduces data-access latency, which can reduce the total time that the CPU spends in an active state.
Storage-optimized instances (Z3):These instances provide high-throughput, low-latency local SSD storage.Data warehousing, log analytics, and SQL, NoSQL, and vector databases.Storage-optimized instances process massive datasets locally, which helps to eliminate the energy that's used for cross-location network data egress. When you use local storage for high-IOPS tasks, you avoid over-provisioning multiple standard instances.
Accelerator-optimized instances (A3, A2, G2): These instances are built for GPU and TPU-accelerated workloads, such as AI, ML, and HPC.ML model training and inference, and scientific simulations.

TPUs are engineered for optimal energy efficiency. They deliver higher computations per watt.

A GPU-accelerated instance like the A3 series with NVIDIA H100 GPUs can be significantly more energy-efficient for training large models than a CPU-only alternative. Although a GPU-accelerated instance has higher nominal power usage, the task is completed much faster.

Upgrade to the latest machine types

Use of the latest machine types might help to improve sustainability. Whenmachine types are updated, they're often designed to be more energy-efficientand to provide higher performance per watt. VMs that use the latest machinetypes might complete the same amount of work with lower power consumption.

CPUs, GPUs, and TPUs often benefit from technical advancements in chiparchitecture, such as the following:

  • Specialized cores: Advancements in processors often include specializedcores or instructions for common workloads. For example, CPUs might havededicated cores for vector operations or integrated AI accelerators. Whenthese tasks are offloaded from the main CPU, the tasks are completed moreefficiently and they consume less energy.
  • Improved power management: Advancements in chip architectures ofteninclude more sophisticated power management features, such as dynamicadjustment of voltage and frequency based on the workload. Thesepower-management features enable the chips to run at peak efficiency andenter low-power states when they are idle, which minimizes energyconsumption.

The technical improvements in chip architecture provide the following directbenefits for sustainability and cost:

  • Higher performance per watt: This is a key metric for sustainability.For example, theC4 VMs demonstrate 40% higher price-performance when compared to C3 VMs for the same energy consumption. TheC4A processor provides 60% higher energy-efficiency over comparable x86 processors. These performance capabilities let youcomplete tasks faster or use fewer instances for the same load.
  • Lower total energy consumption: With improved processors, computeresources are used for a shorter duration for a given task, which reducesthe overall energy usage and carbon footprint. The carbon impact isparticularly high for short-lived, compute-intensive workloads like batchjobs and ML model training.
  • Optimal resource utilization: The latest machine types are oftenbetter suited for modern software and are more compatible with advancedfeatures of cloud platforms. These machine types typically enable betterresource utilization, which reduces the need for over-provisioning andhelps to ensure that every watt of power is used productively.

Deploy containerized applications

You can use container-based, fully-managed services such as GKEand Cloud Run as a part of your strategy for sustainable cloudcomputing. These services help to optimize resource utilization and automateresource management.

Leverage the scale-to-zero capability of Cloud Run

Cloud Run provides a managed serverless environment thatautomatically scales instances to zero when there is no incoming traffic for aservice or when a job is completed. Autoscaling helps to eliminate energyconsumption by idle infrastructure. Resources are powered only when theyactively process requests. This strategy is highly effective for intermittentor event-driven workloads. For AI workloads, you can useGPUs with Cloud Run,which lets you consume and pay for GPUs only when they are used.

Automate resource optimization using GKE

GKE is a container orchestration platform, which ensures thatapplications use only the resources that they need. To help youautomate resource optimization, GKE provides the followingtechniques:

  • Bin packing:GKE Autopilot intelligently packs multiplecontainers on the available nodes. Bin packing maximizes the utilization ofeach node and reduces the number of idle or underutilized nodes, which helpsto reduce energy consumption.
  • Horizontal Pod autoscaling (HPA):With HPA, the number of container replicas (Pods) is adjusted automaticallybased on predefined metrics like CPU usage or custom application-specificmetrics. For example, if your application experiences a spike in traffic,GKE adds Pods to meet the demand. When the traffic subsides,GKE reduces the number of Pods. This dynamic scaling preventsover-provisioning of resources, so you don't pay for or power up unnecessarycompute capacity.
  • Vertical Pod autoscaling (VPA):You can configure GKE to automatically adjust the CPU andmemory allocations and limits for individual containers. This configurationensures that a container isn't allocated more resources than it needs,which helps to prevent resource over-provisioning.
  • GKE multidimensional Pod autoscaling:For complex workloads, you can configure HPA and VPA simultaneously tooptimize both the number of Pods and the size of each Pod. This techniquehelps to ensure the smallest possible energy footprint for the requiredperformance.
  • Topology-Aware Scheduling (TAS):TAS enhances the network efficiency for AI and ML workloads inGKE by placing Pods based on the physical structure of thedata center infrastructure. TAS strategically colocates workloads tominimize network hops. This colocation helps to reduce communication latencyand energy consumption. By optimizing the physical alignment of nodes andspecialized hardware, TAS accelerates task completion and maximizes theenergy efficiency of large-scale AI and ML workloads.

Configure carbon-aware scheduling

At Google, we continually shift our workloads tolocations andtimes that provide the cleanest electricity. We also repurpose, orharvest,older equipment for alternative use cases. You can use this carbon-awarescheduling strategy to ensure that your containerized workloads use cleanenergy.

To implement carbon-aware scheduling, you need information about the energy mixthat powers data centers in a region in real time. You can get this informationin a machine-readable format from theCarbon free energy for Google Cloud regions repository in GitHub or from aBigQuery public dataset.The hourly grid mix and carbon intensity data that's used to calculate theGoogle annual carbon dataset is sourced fromElectricity Maps.

To implement carbon-aware scheduling, we recommend the following techniques:

  • Geographical shifting: Schedule your workloads to run in regions thatuse a higher proportion of renewable energy sources. This approach lets youuse cleaner electrical grids.
  • Temporal shifting: For non-critical, flexible workloads like batchprocessing, configure the workloads to run during off-peak hours or whenrenewable energy is most abundant. This approach is known as temporalshifting and helps reduce the overall carbon footprint by taking advantageof cleaner energy sources when they are available.

Architect energy-efficient disaster recovery

Preparing for disaster recovery (DR) often involves pre-provisioning redundantresources in a secondary region. However, idle or under-utilized resources cancause significant energy waste. Choose DR strategies that maximize resourceutilization and minimize the carbon impact without compromising your recoverytime objectives (RTO).

Optimize for cold start efficiency

Use the following approaches to minimize or eliminate active resources in yoursecondary (DR) region:

  • Prioritizecold DR: Keep resources in the DR region turned offor in a scaled-to-zero state. This approach helps to eliminate the carbonfootprint of idle compute resources.
  • Take advantage of serverless failover: Use managed serverless serviceslike Cloud Run for DR endpoints. Cloud Runscales to zero when it isn't in use, so you can maintain a DR topology thatconsumes no energy until traffic is diverted to the DR region.
  • Automate recovery with infrastructure-as-code (IaC): Instead of keepingresources in the DR site running (warm), use an IaC tool like Terraform torapidly provision environments only when needed.

Balance redundancy and utilization

Resource redundancy is a primary driver of energy waste. To reduce redundancy,use the following approaches:

  • Prefer active-active over active-passive: In an active-passivesetup, the resources in the passive site are idle, which results in wastedenergy. An active-active architecture that's optimally sized ensures thatall of the provisioned resources across both regions actively servetraffic. This approach helps you maximize the energy efficiency of yourinfrastructure.
  • Right-size redundancy: Replicate data and services acrossregions only when the replication is necessary to meet high-availability orDR requirements. Every additional replica increases the energy cost ofpersistent storage and network egress.

Develop energy-efficient software

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to write software that minimizes energy consumption andserver load.

Principle overview

When you follow best practices to build your cloud applications, you optimizethe energy that's utilized by the cloud infrastructure resources: AI, compute,storage, and network. You also help to reduce the water requirements of the datacenters and the energy that end-user devices consume when they access yourapplications.

To build energy-efficient software, you need to integrate sustainabilityconsiderations throughout the software lifecycle, from design and development todeployment, maintenance, and archival. For detailed guidance about using AI tobuild software that minimizes the environmental impact of cloud workloads, seethe Google Cloud ebook,Build Software Sustainably.

Recommendations

The recommendations in this section are grouped into the following focus areas:

Minimize computational work

To write energy-efficient software, you need to minimize the total amount ofcomputational work that your application performs. Every unnecessaryinstruction, redundant loop, and extra feature consumes energy, time, andresources. Use the following recommendations to build software that performsminimal computations.

Write lean, focused code

To write minimal code that's essential to achieve the required outcomes, use thefollowing approaches:

  • Eliminate redundant logic and feature bloat: Write code thatperforms only the essential functions. Avoid features that increase thecomputational overhead and complexity but don't provide measurable value toyour users.
  • Refactor: To improve energy efficiency over time, regularly audit yourapplications to identify unused features. Take action to remove or refactorsuch features as appropriate.
  • Avoid unnecessary operations: Don't compute a value or run an actionuntil the result is needed. Use techniques likelazy evaluation,which delay computations until a dependent component in the applicationneeds the output.
  • Prioritize code readability and reusability: Write code that'sreadable and reusable. This approach minimizes duplication and followsthedon't repeat yourself (DRY) principle,which can help to reduce carbon emissions from software development andmaintenance.

Use backend caching

Backend caching ensures that an application does not perform the same workrepeatedly. A high cache-hit ratio leads to an almost linear reduction in energyconsumption per request. To implement backend caching, use the followingtechniques:

  • Cache frequent data: Store frequently accessed data in a temporary,high-performance storage location. For example, use an in-memory cachingservice likeMemorystore.When an application retrieves data from a cache, the volume of databasequeries and disk I/O operations is reduced. Consequently, the load on thedatabases and servers in the backend decreases.
  • Cache API responses: To avoid redundant and costly network calls,cache the results of frequent API requests.
  • Prioritize in-memory caching: To eliminate slow disk I/O operationsand complex database queries, store data in high-speed memory (RAM).
  • Select appropriatecache-write strategies:
    • The write-through strategy ensures that data is written synchronously tothe cacheand the persistent store. This strategy increases thelikelihood of cache hits, so the persistent store gets fewerenergy-intensive read requests.
    • The write-back (write-behind) strategy enhances the performanceof write-heavy applications. Data is written to the cache first, andthe database is updated asynchronously later. This strategy reduces theimmediate write load on slower databases.
  • Use smart eviction policies: Keep the cache lean and efficient. Toremove stale or low-utility data and to maximize the space that's availablefor frequently requested data, use policies like time to live (TTL), leastrecently used (LRU), and least frequently used (LFU).

Use efficient algorithms and data structures

The algorithms and data structures that you choose determine the rawcomputational complexity of your software. When you select appropriatealgorithms and data structures, you minimize the number of CPU cycles and memoryoperations that are required to complete a task. Fewer CPU cycles and memoryoperations lead to lower energy consumption.

Choose algorithms for optimal time complexity

Prioritize algorithms that achieve the required result in the least amount oftime. This approach helps to reduce the duration of resource usage. To selectalgorithms that optimize resource usage, use the following approaches:

  • Focus on reducing complexity: To evaluate complexity, look beyondruntime metrics and consider the theoretical complexity of the algorithm.For example, when compared tobubble sorting,merge sorting significantly reduces the computational load and energy consumption forlarge datasets.
  • Avoid redundant work: Use built-in, optimized functions in yourchosen programming language or framework. These functions are oftenimplemented in a lower-level and more energy-efficient language like C orC++, so they are better optimized for the underlying hardware compared tocustom-coded functions.

Select data structures for efficiency

The data structures that you choose determine the speed at which data can beretrieved, inserted, or processed. This speed affects CPU and memory usage. Toselect efficient data structures, use the following approaches:

  • Optimize for search and retrieval: For common operations likechecking whether an item exists or retrieving a specific value, prefer datastructures that are optimized for speed. For example, hash maps or hashsets enable near-constant time lookups, which is a more energy-efficientapproach than linearly searching through an array.
  • Minimize memory footprint: Efficient data structures help to reducethe overall memory footprint of an application. Reduced memory access andmanagement leads to lower power consumption. In addition, a leaner memoryprofile enables processes to run more efficiently, which lets you postponeresource upgrades.
  • Use specialized structures: Use data structures that arepurpose-built for a given problem. For example, use atrie data structure for rapid string-prefix searching, and use a priority queuewhen you need to access only the highest or lowest value efficiently.

Optimize compute and data operations

When you develop software, focus on efficient and proportional resource usageacross the entire technology stack. Treat CPU, memory, disk, and network aslimited and shared resources. Recognize that efficient usage of resources leadsto tangible reductions in costs and energy consumption.

Optimize CPU utilization and idle time

To minimize the time that the CPU spends in an active, energy-consuming statewithout performing meaningful work, use the following approaches:

  • Prefer event-driven logic over polling: Replace resource-intensivebusy loops or constant checking (polling) with event-driven logic.An event-driven architecture ensures that the components of an applicationoperate only when they're triggered by relevant events. This approachenables on-demand processing, which eliminates the need forresource-intensive polling.
  • Prevent constant high frequency: Write code that doesn't force theCPU to constantly operate at its highest frequency. To minimize energyconsumption, systems that are idle should be able to enter low-power statesor sleep modes.
  • Use asynchronous processing: To prevent threads from being locked duringidle wait times, use asynchronous processing. This approach frees resourcesand leads to higher overall resource utilization.

Manage memory and disk I/O efficiently

Inefficient memory and disk usage leads to unnecessary processing and increasedpower consumption. To manage memory and I/O efficiently, use the followingtechniques:

  • Strict memory management: Take action to proactively release unusedmemory resources. Avoid holding large objects in memory for longer periodsthan necessary. This approach prevents performance bottlenecks and reducesthe power that's consumed for memory access.
  • Optimize disk I/O: Reduce the frequency of yourapplication's read and write interactions with persistent storage resources.For example, use an intermediary memory buffer to store data. Write the datato persistent storage at fixed intervals or when the buffer reaches acertain size.
  • Batch operations: Consolidate frequent, small disk operations intofewer, larger batch operations. A batch operation consumes less energy thanmany individual, small transactions.
  • Use compression: Reduce the amount of data that's written to or readfrom disks by applying suitable data-compression techniques. For example,to compress data that you store in Cloud Storage, you can usedecompressive transcoding.

Minimize network traffic

Network resources consume significant energy during data transfer operations.To optimize network communication, use the following techniques:

  • Minimize payload size: Design your APIs and applications totransfer only the data that's needed for a request. Avoid fetching orreturning large JSON or XML structures in cases where only a few fields arerequired. Ensure that the data structures that are returned are concise.
  • Reduce round-trips: To reduce the number of network round-trips thatare required to complete a user action, use smarter protocols. For example,prefer HTTP/3 over HTTP/1.1, choose GraphQL over REST, use binaryprotocols, and consolidate API calls. When you reduce the volume of networkcalls, you reduce the energy consumption for both your servers and forend-user devices.

Implement frontend optimization

Frontend optimization minimizes the data that your end users must download andprocess, which helps to reduce the load on the resources of end-user devices.

Minimize code and assets

When end users need to download and process smaller and more efficientlystructured resources, their devices consume less power. To minimize the downloadvolume and processing load on end-user devices, use the following techniques:

  • Minimization and compression: For JavaScript, CSS, and HTML files,remove unnecessary characters like whitespaces and comments by usingappropriate minimization tools. Ensure that files like images arecompressed and optimized. You can automate the minimization and compressionof web assets by using a CI/CD pipeline.
  • Lazy loading: Load images, videos, and non-critical assets only whenthey are actually needed, such as when these elements scroll into theviewport of a web page. This approach reduces the volume of initial datatransfer and the processing load on end-user devices.
  • Smaller JavaScript bundles: Eliminate unused code from your JavaScriptbundles by using modern module bundlers and techniques liketree shaking.This approach results in smaller files that load faster and use fewer serverresources.
  • Browser caching: Use HTTP caching headers to instruct the user'sbrowser to store static assets locally. Browser caching helps to preventrepeated downloads and unnecessary network traffic on subsequent visits.

Prioritize lightweight user experience (UX)

The design of your user interface can have a significant impact on thecomputational complexity for rendering frontend content. To build frontendinterfaces that provide a lightweight UX, use the following techniques:

  • Efficient rendering: Avoid resource-intensive, frequentDocument Object Model (DOM) manipulation. Write code that minimizes the rendering complexity andeliminates unnecessary re-rendering.
  • Lightweight design patterns: Where appropriate, prefer static sitesorprogressive web apps (PWAs).Such sites and apps load faster and require fewer server resources.
  • Accessibility and performance: Responsive, fast-loading sites areoften more sustainable and accessible. An optimized, clutter-free designreduces the resources that are consumed when content is rendered. Websitesthat are optimized for performance and speed can help to drive higherrevenue. According to a research study by Deloitte and Google,Milliseconds Make Millions,a 0.1-second (100ms) improvement in site speed leads to an 8.4%increase in conversions for retail sites and a 9.2% increase in the averageorder value.

Optimize data and storage for sustainability

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you optimize the energy efficiency and carbonfootprint for your storage resources in Google Cloud.

Principle overview

Stored data isn't a passive resource. Energy is consumed and carbon emissionsoccur throughout the lifecycle of data. Every gigabyte of stored data requiresphysical infrastructure that's continuously powered, cooled, and managed. Toachieve sustainable cloud architecture, treat data as a valuable butenvironmentally costly asset and prioritize proactive data governance.

Your decisions about data retention, quality, and location can help you achievesubstantial reductions in cloud costs and energy consumption. Minimize the datathat you store, optimize where and how data you store data, and implementautomated deletion and archival strategies. When you reduce data clutter, youimprove system performance and fundamentally reduce the long-term environmentalfootprint of your data.

Recommendations

To optimize your data lifecycle and storage resources for sustainability,consider the recommendations in the following sections.

Prioritize high-value data

Stored data that's unused, duplicated, or obsolete continues to consume energyto power the underlying infrastructure. To reduce the storage-related carbonfootprint, use the following techniques.

Identify and eliminate duplication

Establish policies to prevent the needless replication of datasets acrossmultiple Google Cloud projects or services. Use central data repositorieslikeBigQuery datasets orCloud Storage buckets as single sources of truth and grant appropriate access to these repositories.

Remove shadow data and dark data

Dark data is data for which the utility or owner is unknown.Shadow datameans unauthorized copies of data. Scan your storage systems and find dark dataand shadow data by using a data discovery and cataloging solution likeDataplex Universal Catalog.Regularly audit these findings and implement a process for archival or deletionof dark and shadow data as appropriate.

Minimize the data volume for AI workloads

Store only the features and processed data that are required for model trainingand serving. Where possible, use techniques like data sampling, aggregation, andsynthetic data generation to achieve model performance without relying onmassive raw datasets.

Integrate data quality checks

Implement automatic data validation and data cleaning pipelines by usingservices likeDataproc,Dataflow,or Dataplex Universal Catalog at the point of data ingestion.Low-quality data causes wasted storage space. It also leads to unnecessaryenergy consumption when the data is used later for analytics or AI training.

Review the value density of data

Periodically review high-volume datasets like logs and IoT streams. Determinewhether any data can be summarized, aggregated, or down-sampled to maintain therequired information density and reduce the physical storage volume.

Critically evaluate the need for backups

Assess the need for backups of data that you can regenerate with minimal effort.Examples of such data include intermediate ETL results, ephemeral caches, andtraining data that's derived from a stable, permanent source. Retain backups foronly the data that is unique or expensive to recreate.

Optimize storage lifecycle management

Automate the storage lifecycle so that when the utility of data declines, thedata is moved to an energy-efficient storage class or retired, as appropriate.Use the following techniques.

Select an appropriate Cloud Storage class

Automate the transition of data in Cloud Storage to lower-carbon storageclasses based on access frequency by usingObject Lifecycle Management.

  • Use Standard storage for only actively used datasets, such ascurrent production models.
  • Transition data such as older AI training datasets or less-frequentlyaccessed backups to Nearline or Coldline storage.
  • For long-term retention, use Archive storage, which is optimizedfor energy efficiency at scale.

Implement aggressive data lifecycle policies

Define clear, automated time to live (TTL) policies for non-essential data,such as log files, temporary model artifacts, and outdated intermediate results.Use lifecycle rules to automatically delete such data after a defined period.

Mandate resource tagging

Mandate the use of consistent resource tags and labels for all of yourCloud Storage buckets, BigQuery datasets, and persistentdisks. Create tags that indicate the data owner, purpose of the data, and theretention period. Use Organization Policy Service constraints to ensure that required tags,such as retention period, are applied to resources. Tags let you automatelifecycle management, create granular FinOps reports, and produce carbonemissions reports.

Right-size and deprovision compute storage

Regularly audit persistent disks that are attached to Compute Engine instancesand ensure that the disks aren't over-provisioned. Use snapshots only when theyare necessary for backup. Delete old, unused snapshots. For databases, use dataretention policies to reduce the size of the underlying persistent disks.

Optimize the storage format

For storage that serves analytics workloads, prefer compressed, columnar formatslike Parquet or optimized Avro over row-based formats like JSON or CSV. Columnarstorage significantly reduces physical disk-space requirements and improves theread efficiency. This optimization helps to reduce energy consumption for theassociated compute and I/O operations.

Optimize regionality and data movement

The physical location and movement of your data affect the consumption ofnetwork resources and the energy required for storage. Optimize data regionalityby using the following techniques.

Select low-carbon storage regions

Depending on your compliance requirements, store data in Google Cloud regionsthat use a higher percentage of carbon-free energy (CFE) or that have lower gridcarbon intensity. Restrict the creation of storage buckets in high-carbonregions by using theresource locations Organization Policy constraint. For information about CFE and carbon-intensitydata for Google Cloud regions, seeCarbon-free energy for Google Cloud regions.

Minimize replication

Replicate data across regions only to meet mandatory disaster recovery (DR) orhigh-availability (HA) requirements. Cross-region and multi-region replicationoperations significantly increase the energy cost and carbon footprint of yourdata.

Optimize data processing locations

To reduce energy consumption for network data transfer, deploycompute-intensive workloads like AI training and BigQueryprocessing in the same region as the data source.

Optimize data movement for your partners and customers

To move large volumes of data across cloud services, locations, and providers,encourage your partners and customers to useStorage Transfer Service or data-sharing APIs. Avoid mass data dumps. For public datasets, useRequester Pays buckets to shift the data transfer and processing costs and the environmentalimpact to end users.

Continuously measure and improve sustainability

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you measure and continuously improve thesustainability of your workloads in Google Cloud.

Principle overview

To ensure that your cloud workloads remain sustainable, you need accurate andtransparent metrics. Verifiable metrics let you translate sustainability goalsto actions. Every resource that you create in the cloud has an associated carbonfootprint. To build and maintain sustainable cloud architectures, you mustintegrate the measurement of carbon data into your operational feedback loop.

The recommendations in this section provide a framework for usingCarbon Footprint to quantify carbon emissions, identify carbon hotspots, implement targetedworkload optimizations, and verify the outcomes of the optimization efforts.This framework lets you efficiently align your cost optimization goals withverifiable carbon reduction targets.

Carbon Footprint reporting methodology

Carbon Footprint provides a transparent, auditable, andglobally-aligned report of your cloud-related emissions. The report adheres tointernational standards, primarily theGreenhouse Gas (GHG) Protocol for carbon reporting and accounting. The Carbon Footprint reportuses location-based and market-based accounting methods. Location-basedaccounting is based on the local grid's emissions factor. Market-basedaccounting considers Google's purchases of carbon-free energy (CFE). This dualapproach helps you understand both the physical-grid impact and thecarbon benefit of your workloads in Google Cloud.

For more information about how the Carbon Footprint report isprepared, including the data sources used, Scope-3 inclusions, and thecustomer allocation model, seeCarbon Footprint reporting methodology.

Recommendations

To use carbon measurement for continuous improvement, consider therecommendations in the following sections. The recommendations are structured asphases of maturity for implementing sustainable-by-design cloud operations:

Phase 1: Establish a baseline

In this phase, you set up the necessary tools and ensure that data isaccessible and correctly integrated.

  1. Grant permissions: Grant permissions to teams like FinOps, SecOps andplatform engineering so that they can access theCarbon Footprint dashboard in the Google Cloud console. Grant theCarbon Footprint Viewer role (roles/billing.carbonViewer) in Identity and Access Management (IAM) for theappropriate billing account.
  2. Automate data export: Configure automatedexport of Carbon Footprint data to BigQuery. The exported data lets you perform deep analysis,correlate carbon data with cost and usage data, and produce custom reports.
  3. Define carbon-related key performance indicators (KPIs): Establishmetrics that connect carbon emissions to business value. For example,carbon intensity is a metric for the number of kilograms ofCO2 equivalent per customer, transaction, or revenue unit.

Phase 2: Identify carbon hotspots

Identify the areas that have the largest environmental impact by analyzing thegranular data in the Carbon Footprint report. Use the followingtechniques for this analysis:

  • Prioritize by scope: To quickly identify the largest gross carbonemitters, analyze the data in the dashboard by project, region, and service.
  • Use dual-accounting: When you evaluate the carbon impact in aregion, consider both location-based emissions (the environmental impact ofthe local electrical grid) and market-based emissions (the benefit ofGoogle's CFE investments).
  • Correlate with cost: Join the carbon data in BigQuerywith your billing data and assess the impact of optimization actions onsustainabilityand cost. High cost can often be correlated with highcarbon emissions.
  • Annotate data to measure return on effort (ROE): Annotate the carbondata in BigQuery with specific events, like right-sizing aresource or decommissioning a large service. The annotations let youattribute reductions in carbon emission and cost to specific optimization initiatives,so that you can measure and demonstrate the outcome of each initiative.

Phase 3: Implement targeted optimization

This is the execution phase for implementing sustainable-by-design cloudoperations. Use the following strategies to optimize specific resources that youidentify as significant drivers of cost and carbon emissions:

  • Decommission unattended projects: Regularly check theunattended project recommender that's integrated with the Carbon Footprint data. To achieveimmediate, verified reductions in carbon emissions and cost, automate thereview and eventual removal of unused projects.
  • Right-size resources: Match the provisioned resource capacity to actualutilization by using Active Assist right-sizingrecommenders likemachine type recommendations for Compute Engine VMs.For compute-intensive tasks and AI workloads, use the most efficient machinetypes and AI models.
  • Adopt carbon-aware scheduling: For batch workloads that aren'ttime-critical, integrate regional CFE data into the scheduling logic. Wherefeasible, limit the creation of new resources to low-carbon regions byusing theresource locations constraint in Organization Policy Service.
  • Reduce data sprawl: Implement data governance policies to ensurethat infrequently accessed data is transitioned to an appropriate coldstorage class (Nearline, Coldline, or Archive)or is permanently deleted. This strategy helps to reduce the energy cost ofyour storage resources.
  • Refine application code: Fix code-level inefficiencies thatcause excessive resource usage or unnecessary computation.

For more information, see the following:

Phase 4: Institutionalize your sustainability practices and reporting

In this phase, you embed carbon measurement into your governance framework. Thisapproach helps to ensure that your organization has the capabilities andcontrols that are necessary for continuous sustainability improvements andverifiable reporting.

  • ImplementGreenOps governance: Establish a formalGreenOps function or working group to integrate Carbon Footprint datawith Cloud Billing data. This function must define accountability forcarbon reduction targets across projects, align cost optimization withsustainability goals, and implement reporting to track carbon efficiencyagainst spending.
  • Use Carbon Footprint data for reporting and compliance:Use the verified, auditable Carbon Footprint data inBigQuery to create formal environmental, social, andgovernance (ESG) disclosures. This approach lets you meet stakeholderdemands for transparency and helps to ensure compliance with mandatory andvoluntary regulations.
  • Invest in training and awareness: Implement mandatory sustainabilitytraining for relevant technical and non-technical teams. Your teamsneed to know how to access and interpret the Carbon Footprintdata and how to apply optimization recommendations in their daily workflowsand design choices. For more information, seeProvide role-based sustainability training.
  • Define carbon requirements: Incorporate carbon emission metrics asnon-functional requirements (NFR) in your application's acceptance criteriafor new deployments. This practice helps to ensure that architects anddevelopers prioritize low-carbon design options from the start of theapplication development lifecycle.
  • Automate GreenOps: Automate the implementation ofActive Assist recommendations by using scripts, templates,and infrastructure-as-code (IaC) pipelines. This practice ensures that teamsapply recommendations consistently and rapidly across the organization.

Promote a culture of sustainability

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides recommendations to help you build a culture where teams across yourorganization are aware of and proficient in sustainability practices.

Principle overview

To apply sustainability practices, you need more than tools and techniques. Youneed a cultural shift that's driven by education and accountability. Your teamsneed to be aware of sustainability concerns and they must have practicalproficiency in sustainability practices.

  • Awareness of sustainability is the contextual knowledge that everyarchitectural and operational decision has tangible effects onsustainability. Teams must recognize that the cloud isn't an abstractcollection of virtual resources, but it's driven by physical resources thatconsume energy and produce carbon emissions.
  • Proficiency in sustainability practices includes knowledge to interpretcarbon emissions data, experience with implementing cloud sustainabilitygovernance, and technical skills to refactor code for energy efficiency.

To align sustainability practices with organizational goals, your teams need tounderstand how energy usage by cloud infrastructure and software contributes tothe organization's carbon footprint. Well-planned training helps to ensure thatall of your stakeholders—from developers and architects to finance professionalsand operations engineers—understand the sustainability context of their dailywork. This shared understanding empowers teams to move beyond passive complianceto active optimization, which makes your cloud workloads sustainable-by-design.Sustainability becomes a core non-functional requirement (NFR) like otherrequirements for security, cost, performance, and reliability.

Recommendations

To build awareness of sustainability concerns and proficiency in sustainabilitypractices, consider the recommendations in the following sections.

Provide business context and alignment with organizational goals

Sustainability isn't just a technical exercise; it requires a cultural shiftthat aligns individual actions with the environmental mission of yourorganization. When teams understand thewhy behind sustainability initiatives,they are more likely to adopt the initiatives as core principles rather than asoptional tasks.

Connect to the big picture

Help your teams understand how individual architectural choices—such asselecting a low-carbon region or optimizing a data pipeline—contribute to theorganization's overall sustainability commitments. Explicitly communicate howthese choices affect the local community and the industry. Transform abstractcarbon metrics into tangible indicators of progress toward corporate socialresponsibility (CSR) goals.

For example, a message like the following informs teams about the positiveoutcome and executive recognition of a decision to migrate a workload to alow-carbon region and to use a power-efficient machine type. The messagereferences theCO2 equivalent,which helps your team contextualize the impact of carbon reduction measures.

"By migrating our data analytics engine to the us-central1leaf iconLow CO2 region andupgrading our clusters to C4A Axion-based instances, we fundamentallychanged our carbon profile. This shift resulted in a 75% reduction in the carbonintensity of our data analytics engine, which translates to a reduction of 12metric tons of CO2 equivalent this quarter. This migration had asignificant impact on our business goals and was included in the Q4 newsletterto our board."

Communicate financial and sustainability goals

Transparency is critical for aligning sustainability practices with goals. Tothe extent feasible, widely share sustainability goals and progress across theorganization. Highlight sustainability progress in the annual financialstatements. Such communication ensures that technical teams view their work as avital part of the organization's public-facing commitments and financial health.

Embrace a shared fate mindset

Educate teams about the collaborative nature of cloud sustainability. Google isresponsible for the sustainabilityof the cloud, which includes the efficiencyof the infrastructure and data centers. You (the customer) are responsible forsustainability of your resources and workloadsin the cloud. When you framethis collaboration as a partnership of shared fate, you reinforce theunderstanding that your organization and Google work together to achieve optimalenvironmental outcomes.

Provide role-based sustainability training

To ensure that sustainability is a practical skill rather than a theoreticalconcept, tailor the sustainability training to specific job roles. Thesustainability tools and techniques that a data scientist can use are verydifferent from those available to a FinOps analyst, as described in thefollowing table:

RoleTraining focus
Data scientists and ML engineersCarbon-intensity of compute: Demonstrate the differences between running AI training jobs onlegacy systems versus purpose-built AI accelerators. Highlight how a model with fewer parameters can produce the required accuracy with significantly lower energy consumption.
DevelopersCode efficiency and resource consumption: Illustrate how high-latency code or inefficient loops translate directly to extended CPU runtime and increased energy consumption. Emphasize the importance of lightweight containers and the need to optimize application performance to reduce the environmental footprint of software.
ArchitectsSustainable by design: Focus on region selection and workload placement. Show how choosing aleaf iconLow CO2 region with a high percentage of renewable energy (likenorthamerica-northeast1) fundamentally changes the carbon profile of your entire application stack before you write a single line of code.
Platform engineers and operations engineersMaximizing utilization: Emphasize the environmental cost of idle resources and over-provisioning. Present scenarios for automated scaling and right-sizing to ensure that cloud resources are used efficiently. Explain how to create and track sustainability-related metrics like utilization and how to translate metrics like compute time into equivalent metrics of carbon emissions.
FinOpsUnit economics of carbon: Focus on the relationship between financial spend and environmental impact. Demonstrate how GreenOps practices let an organization track carbon per transaction, which helps to make sustainability a key performance indicator (KPI) that's as critical as conventional KPIs like cost and utilization.
Product managersSustainability as a feature: Demonstrate how to integrate carbon-reduction goals into product roadmaps. Show how simplified user journeys can help to reduce the energy consumption by both cloud resources and end-user devices.
Business leadersStrategic alignment and reporting: Focus on how cloud sustainability affects environmental, social, and governance (ESG) scores and public reputation. Illustrate how sustainability choices help to reduce regulatory risk and fulfill commitments to the community and industry.

Advocate for sustainability and recognize success

To sustain long-term progress, you need to move beyond internal technical fixesand begin influencing your partners and the industry.

Empower managers to advocate for sustainability

Provide managers the data and permissions that they need to prioritizeenvironmental impact similar to other business metrics like speed-to-market andcost. When managers have this data, they begin to view sustainability as aquality and efficiency standard rather than as a nice-to-have capability thatslows production. They actively advocate for new cloud provider features—such asmore granular carbon data and newer, greener processors in specific regions.

Align with industry standards and frameworks

To ensure that your sustainability efforts are credible and measurable, aligninternal practices with recognized global and regional standards. For moreinformation, seeAlign sustainability practices with industry guidelines.

Incentivize sustainability efforts

To ensure that sustainability becomes an enduring part of the engineeringculture, teams must realize the value of prioritizing sustainability. Transitionfrom high-level goals to specific, measurable KPIs that reward improvement andefficiency.

Define carbon KPIs and NFRs

Treat sustainability as a core technical requirement. When you define carbonKPIs, such as grams of CO2 equivalent per million requests orcarbon-intensity per AI training run, you make the impact on sustainabilityvisible and actionable. For example, integrate sustainability into the NFRs forevery new project. In other words, just as a system must meet a specific latencyor availability target, the system must also stay within a defined carbonemissions budget.

Measure return on effort

Help your teams identify high-impact, low-effort sustainability wins—such asshifting a batch job to a different region—versus a complex code refactoringexercise that might provide minimal gains. Provide visibility into the return oneffort (ROE). When a team chooses a more efficient processor family, they mustknow exactly how much carbon emission they avoided relative to the time andeffort that's required to migrate to the new processor.

Recognize and celebrate carbon reduction

Sustainability impact is often hidden in the background of infrastructure. Tobuild the momentum for sustainability progress, make successes visible to theentire organization. For example, use annotations in monitoring dashboards tomark when a team deployed a specific sustainability optimization. Thisvisibility lets teams point to data in the dashboard and claim recognition fortheir successes.

Align sustainability practices with industry guidelines

This principle in the sustainability pillar of theGoogle Cloud Well-Architected Framework provides an overview of industry guidelines and frameworks with which you shouldalign your sustainability efforts.

Principle overview

To ensure that your sustainability initiatives are built upon a foundation ofglobally recognized methods for measurement, reporting, and verification, werecommend that you align your initiatives with the following industryguidelines:

When you align your sustainability initiatives with these shared externalguidelines, your initiatives get the credibility and auditability thatinvestors, regulatory bodies, and other external stakeholders demand. You alsofoster accountability across engineering teams, embed sustainability withinemployee training, and successfully integrate cloud operations intoenterprise-wide commitments for environmental, social, and governance(ESG) reporting.

W3C Web Sustainability Guidelines

W3C Web Sustainability Guidelines (WSG) is an emerging framework of best practices developed by a W3C working group toaddress the environmental impact of digital products and services. Theguidelines cover the entire lifecycle of a digital solution including businessand product strategy, user experience (UX) design, web development, hosting,infrastructure, and systems. The core goal of WSG is to enable developers andarchitects to build websites and web applications that are more energy-efficientand that reduce network traffic, client-side processing, and server-sideresource consumption. These guidelines serve a critical reference point foraligning application-level sustainability with cloud-level architecturaldecisions.

Green Software Foundation

TheGreen Software Foundation (GSF) focuses on building an industry ecosystem around sustainable software. Itsmission is to drive the creation of software that's designed, built, andoperated to minimize the carbon footprint. The GSF developed the Software CarbonIntensity (SCI) specification, which provides a common standard for measuringthe rate of carbon emissions of any piece of software. Alignment with the GSFhelps developers connect an application's efficiency directly to the carbonimpact of the cloud environment.

Greenhouse Gas Protocol

TheGreenhouse Gas (GHG) Protocol is a widely used set of standards for measuring, managing, and publiclyreporting greenhouse gas emissions. The protocol was developed through apartnership between the World Resources Institute (WRI) and the World BusinessCouncil for Sustainable Development (WBCSD). The GHG protocol provides theessential framework for corporate climate accounting. TheCarbon Footprint report provides data foremission scopes that are relevant to cloud usage. For more information, seeCarbon Footprint reporting methodology.

Adherence to the GHG Protocol helps to ensure that your sustainabilityinitiatives have credibility and that external parties can audit your carbonemissions data. You also help prevent the perception ofgreenwashing and satisfy the due-diligence requirements of your investors, regulators, andexternal stakeholders. Verified and audited data helps your organization proveaccountability and build trust in public-facing sustainability commitments.

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-01-28 UTC.