Well-Architected Framework: Cost optimization pillar
The cost optimization pillar in the Google Cloud Well-Architected Framework describes principles and recommendations to optimize the cost of your workloads in Google Cloud.
The intended audience includes the following:
- CTOs, CIOs, CFOs, and other executives who are responsible for strategic cost management.
- Architects, developers, administrators, and operators who make decisions that affect cost at all the stages of an organization's cloud journey.
The cost models for on-premises and cloud workloads differ significantly. On-premises IT costs include capital expenditure (CapEx) and operational expenditure (OpEx). On-premises hardware and software assets are acquired, and the acquisition costs are depreciated over the operating life of the assets. In the cloud, the costs for most cloud resources are treated as OpEx, where costs are incurred when the cloud resources are consumed. This fundamental difference underscores the importance of the following core principles of cost optimization.
Note: You might be able to classify the cost of some Google Cloud services (like Compute Engine sole-tenant nodes) as capital expenditure. For more information, see Sole-tenancy accounting FAQ.
For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Well-Architected Framework.
Core principles
The recommendations in the cost optimization pillar of the Well-Architected Framework are mapped to the following core principles:
- Align cloud spending with business value: Ensure that your cloud resources deliver measurable business value by aligning IT spending with business objectives.
- Foster a culture of cost awareness: Ensure that people across your organization consider the cost impact of their decisions and activities, and ensure that they have access to the cost information required to make informed decisions.
- Optimize resource usage: Provision only the resources that you need, and pay only for the resources that you consume.
- Optimize continuously: Continuously monitor your cloud resource usage and costs, and proactively make adjustments as needed to optimize your spending. This approach involves identifying and addressing potential cost inefficiencies before they become significant problems.
These principles are closely aligned with the core tenets of cloud FinOps. FinOps is relevant to any organization, regardless of its size or maturity in the cloud. By adopting these principles and following the related recommendations, you can control and optimize costs throughout your journey in the cloud.
Contributors
Author: Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
Other contributors:
- Anuradha Bajpai | Solutions Architect
- Daniel Lees | Cloud Security Architect
- Eric Lam | Head of Google Cloud FinOps
- Fernando Rubbo | Cloud Solutions Architect
- Filipe Gracio, PhD | Customer Engineer, AI/ML Specialist
- Gary Harmson | Principal Architect
- Jose Andrade | Customer Engineer, SRE Specialist
- Kent Hua | Solutions Manager
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Radhika Kanakam | Program Lead, Google Cloud Well-Architected Framework
- Samantha He | Technical Writer
- Steve McGhee | Reliability Advocate
- Sergei Lilichenko | Solutions Architect
- Wade Holmes | Global Solutions Director
- Zach Seils | Networking Specialist
Align cloud spending with business value
This principle in the cost optimization pillar of the Google Cloud Well-Architected Framework provides recommendations to align your use of Google Cloud resources with your organization's business goals.
Principle overview
To effectively manage cloud costs, you need to maximize the business value that the cloud resources provide and minimize the total cost of ownership (TCO). When you evaluate the resource options for your cloud workloads, consider not only the cost of provisioning and using the resources, but also the cost of managing them. For example, virtual machines (VMs) on Compute Engine might be a cost-effective option for hosting applications. However, when you consider the overhead to maintain, patch, and scale the VMs, the TCO can increase. On the other hand, serverless services like Cloud Run can offer greater business value. The lower operational overhead lets your team focus on core activities and helps to increase agility.
To ensure that your cloud resources deliver optimal value, evaluate the following factors:
- Provisioning and usage costs: The expenses incurred when you purchase, provision, or consume resources.
- Management costs: The recurring expenses for operating and maintaining resources, including tasks like patching, monitoring, and scaling.
- Indirect costs: The costs that you might incur to manage issues like downtime, data loss, or security breaches.
- Business impact: The potential benefits from the resources, like increased revenue, improved customer satisfaction, and faster time to market.
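As a rough illustration of how the first three factors combine into a TCO estimate, the following Python sketch compares two hosting options. The names `CostProfile` and `monthly_tco` and every dollar figure are hypothetical assumptions made for this example, not Google Cloud prices. Business impact is not modeled here; it is the benefit that you weigh against the resulting TCO.

```python
from dataclasses import dataclass


@dataclass
class CostProfile:
    """Hypothetical monthly cost estimates (USD) for one deployment option."""
    name: str
    provisioning_and_usage: float  # purchasing, provisioning, consuming resources
    management: float              # patching, monitoring, and scaling effort
    indirect: float                # expected cost of downtime, data loss, breaches


def monthly_tco(profile: CostProfile) -> float:
    """Add the three cost factors into one monthly TCO estimate."""
    return profile.provisioning_and_usage + profile.management + profile.indirect


# Illustrative numbers only; replace them with your own estimates.
vm_option = CostProfile("Self-managed VMs", 400.0, 900.0, 150.0)
serverless_option = CostProfile("Serverless", 650.0, 200.0, 50.0)

for option in (vm_option, serverless_option):
    print(f"{option.name}: ${monthly_tco(option):,.2f} per month")
```

In this made-up scenario, the option with the lower usage cost has the higher TCO once management and indirect costs are included, which is the effect that the VM example above describes.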
By aligning cloud spending with business value, you get the following benefits:
- Value-driven decisions: Your teams are encouraged to prioritize solutions that deliver the greatest business value and to consider both short-term and long-term cost implications.
- Informed resource choice: Your teams have the information and knowledge that they need to assess the business value and TCO of various deployment options, so they choose resources that are cost-effective.
- Cross-team alignment: Cross-functional collaboration between business, finance, and technical teams ensures that cloud decisions are aligned with the overall objectives of the organization.
Recommendations
To align cloud spending with business objectives, consider the following recommendations.
Prioritize managed services and serverless products
Whenever possible, choose managed services and serverless products to reduce operational overhead and maintenance costs. This choice lets your teams concentrate on their core business activities. They can accelerate the delivery of new features and functionalities, and help drive innovation and value.
The following are examples of how you can implement this recommendation:
- To run PostgreSQL, MySQL, or Microsoft SQL Server databases, use Cloud SQL instead of deploying those databases on VMs.
- To run and manage Kubernetes clusters, use Google Kubernetes Engine (GKE) Autopilot instead of deploying containers on VMs.
- For your Apache Hadoop or Apache Spark processing needs, use Dataproc and Dataproc Serverless. Per-second billing can help to achieve significantly lower TCO when compared to on-premises data lakes.
Balance cost efficiency with business agility
Controlling costs and optimizing resource utilization are important goals. However, you must balance these goals with the need for flexible infrastructure that lets you innovate rapidly, respond quickly to changes, and deliver value faster. The following are examples of how you can achieve this balance:
- Adopt DORA metrics for software delivery performance. Metrics like change failure rate (CFR), time to detect (TTD), and time to restore (TTR) can help to identify and fix bottlenecks in your development and deployment processes. By reducing downtime and accelerating delivery, you can achieve both operational efficiency and business agility.
- Follow Site Reliability Engineering (SRE) practices to improve operational reliability. SRE's focus on automation, observability, and incident response can lead to reduced downtime, lower recovery time, and higher customer satisfaction. By minimizing downtime and improving operational reliability, you can prevent revenue loss and avoid the need to overprovision resources as a safety net to handle outages.
Enable self-service optimization
Encourage a culture of experimentation and exploration by providing your teams with self-service cost optimization tools, observability tools, and resource management platforms. Enable them to provision, manage, and optimize their cloud resources autonomously. This approach helps to foster a sense of ownership, accelerate innovation, and ensure that teams can respond quickly to changing needs while being mindful of cost efficiency.
Adopt and implement FinOps
Adopt FinOps to establish a collaborative environment where everyone is empowered to make informed decisions that balance cost and value. FinOps fosters financial accountability and promotes effective cost optimization in the cloud.
Promote a value-driven and TCO-aware mindset
Encourage your team members to adopt a holistic attitude toward cloud spending, with an emphasis on TCO and not just upfront costs. Use techniques like value stream mapping to visualize and analyze the flow of value through your software delivery process and to identify areas for improvement. Implement unit costing for your applications and services to gain a granular understanding of cost drivers and discover opportunities for cost optimization. For more information, see Maximize business value with cloud FinOps.
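Unit costing can be as simple as dividing what a service costs by the business output that it produces. The following Python sketch shows one way to do that; the service names, the cost and transaction figures, and the `unit_cost` helper are all hypothetical and serve only to illustrate the idea.

```python
def unit_cost(total_cost: float, units: int) -> float:
    """Return the cost per unit of business output (for example, per transaction)."""
    if units <= 0:
        raise ValueError("units must be a positive number")
    return total_cost / units


# Hypothetical monthly figures for two services.
monthly_cost_usd = {"checkout-service": 1200.0, "search-service": 3000.0}
monthly_transactions = {"checkout-service": 400_000, "search-service": 2_500_000}

for service, cost in monthly_cost_usd.items():
    per_thousand = unit_cost(cost, monthly_transactions[service]) * 1000
    print(f"{service}: ${per_thousand:.2f} per 1,000 transactions")
```

A service with a larger absolute bill can still have the lower unit cost, so tracking both numbers gives a fuller picture of cost drivers than total spend alone.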
Foster a culture of cost awareness
This principle in the cost optimization pillar of the Google Cloud Well-Architected Framework provides recommendations to promote cost awareness across your organization and ensure that team members have the cost information that they need to make informed decisions.
Conventionally, the responsibility for cost management might be centralized to a few select stakeholders and primarily focused on initial project architecture decisions. However, team members across all cloud user roles (analyst, architect, developer, or administrator) can help to reduce the cost of your resources in Google Cloud. By sharing cost data appropriately, you can empower team members to make cost-effective decisions throughout their development and deployment processes.
Principle overview
Stakeholders across various roles (product owners, developers, deployment engineers, administrators, and financial analysts) need visibility into relevant cost data and its relationship to business value. When provisioning and managing cloud resources, they need the following data:
- Projected resource costs: Cost estimates at the time of design and deployment.
- Real-time resource usage costs: Up-to-date cost data that can be used for ongoing monitoring and budget validation.
- Costs mapped to business metrics: Insights into how cloud spending affects key performance indicators (KPIs), to enable teams to identify cost-effective strategies.
Not every individual needs access to raw cost data. However, promoting cost awareness across all roles is crucial because individual decisions can affect costs.
By promoting cost visibility and ensuring clear ownership of cost management practices, you ensure that everyone is aware of the financial implications of their choices and everyone actively contributes to the organization's cost optimization goals. Whether through a centralized FinOps team or a distributed model, establishing accountability is crucial for effective cost optimization efforts.
Recommendations
To promote cost awareness and ensure that your team members have the cost information that they need to make informed decisions, consider the following recommendations.
Provide organization-wide cost visibility
To achieve organization-wide cost visibility, the teams that are responsible for cost management can take the following actions:
- Standardize cost calculation and budgeting: Use a consistent method to determine the full costs of cloud resources, after factoring in discounts and shared costs. Establish clear and standardized budgeting processes that align with your organization's goals and enable proactive cost management.
- Use standardized cost management and visibility tools: Use appropriate tools that provide real-time insights into cloud spending and generate regular (for example, weekly) cost progression snapshots. These tools enable proactive budgeting, forecasting, and identification of optimization opportunities. The tools could be cloud provider tools (like the Google Cloud Billing dashboard), third-party solutions, or open-source solutions like the Cost Attribution solution.
- Implement a cost allocation system: Allocate a portion of the overall cloud budget to each team or project. Such an allocation gives the teams a sense of ownership over cloud spending and encourages them to make cost-effective decisions within their allocated budget.
- Promote transparency: Encourage teams to discuss cost implications during the design and decision-making processes. Create a safe and supportive environment for sharing ideas and concerns related to cost optimization. Some organizations use positive reinforcement mechanisms like leaderboards or recognition programs. If your organization has restrictions on sharing raw cost data due to business concerns, explore alternative approaches for sharing cost information and insights. For example, consider sharing aggregated metrics (like the total cost for an environment or feature) or relative metrics (like the average cost per transaction or user).
Understand how cloud resources are billed
Pricing for Google Cloud resources might vary across regions. Some resources are billed monthly at a fixed price, and others might be billed based on usage. To understand how Google Cloud resources are billed, use the Google Cloud pricing calculator and product-specific pricing information (for example, Google Kubernetes Engine (GKE) pricing).
Understand resource-based cost optimization options
For each type of cloud resource that you plan to use, explore strategies to optimize utilization and efficiency. The strategies include rightsizing, autoscaling, and adopting serverless technologies where appropriate. The following are examples of cost optimization options for a few Google Cloud products:
- Cloud Run lets you configure always-allocated CPUs to handle predictable traffic loads at a fraction of the price of the default allocation method (that is, CPUs allocated only during request processing).
- You can purchase BigQuery slot commitments to save money on data analysis.
- GKE provides detailed metrics to help you understand cost optimization options.
- Understand how network pricing can affect the cost of data transfers and how you can optimize costs for specific networking services. For example, you can reduce the data transfer costs for external Application Load Balancers by using Cloud CDN or Google Cloud Armor. For more information, see Ways to lower external Application Load Balancer costs.
Understand discount-based cost optimization options
Familiarize yourself with the discount programs that Google Cloud offers, such as the following examples:
- Committed use discounts (CUDs): CUDs are suitable for resources that have predictable and steady usage. CUDs let you get significant reductions in price in exchange for committing to specific resource usage over a period (typically one to three years). You can also use CUD auto-renewal to avoid having to manually repurchase commitments when they expire.
- Sustained use discounts: For certain Google Cloud products like Compute Engine and GKE, you can get automatic discount credits after continuous resource usage beyond specific duration thresholds.
- Spot VMs: For fault-tolerant and flexible workloads, Spot VMs can help to reduce your Compute Engine costs. The cost of Spot VMs is significantly lower than regular VMs. However, Compute Engine might preemptively stop or delete Spot VMs to reclaim capacity. Spot VMs are suitable for batch jobs that can tolerate preemption and don't have high availability requirements.
- Discounts for specific product options: Some managed services like BigQuery offer discounts when you purchase dedicated or autoscaling query processing capacity.
Evaluate and choose the discount options that align with your workload characteristics and usage patterns.
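To see why commitments suit steady usage, it can help to estimate the utilization at which a commitment starts to pay off. The following Python sketch uses a deliberately simplified model: it assumes that a commitment is billed for every hour of the term at a discounted rate, whether or not the resource runs, and that on-demand usage is billed only for the hours used. The discount rate and prices are hypothetical placeholders, not published Google Cloud rates.

```python
def break_even_utilization(discount_rate: float) -> float:
    """Fraction of the term a resource must run before a commitment is cheaper."""
    if not 0 <= discount_rate < 1:
        raise ValueError("discount_rate must be in the range [0, 1)")
    return 1.0 - discount_rate


def cheaper_option(on_demand_hourly: float, discount_rate: float,
                   expected_utilization: float) -> str:
    """Compare the average hourly cost of a commitment with on-demand pricing."""
    committed_cost = on_demand_hourly * (1.0 - discount_rate)   # paid for every hour
    on_demand_cost = on_demand_hourly * expected_utilization    # paid only when running
    return "commitment" if committed_cost < on_demand_cost else "on-demand"


# Hypothetical 25% discount: the resource must run more than 75% of the time.
print(break_even_utilization(0.25))
print(cheaper_option(1.0, 0.25, expected_utilization=0.90))
print(cheaper_option(1.0, 0.25, expected_utilization=0.50))
```

Under these assumptions, a workload that runs only half the time is better served by on-demand pricing (or by Spot VMs if it tolerates preemption), while a workload that runs almost continuously benefits from the commitment.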
Incorporate cost estimates into architecture blueprints
Encourage teams to develop architecture blueprints that include cost estimates for different deployment options and configurations. This practice empowers teams to compare costs proactively and make informed decisions that align with both technical and financial objectives.
Use a consistent and standard set of labels for all your resources
You can use labels to track costs and to identify and classify resources. Specifically, you can use labels to allocate costs to different projects, departments, or cost centers. Defining a formal labeling policy that aligns with the needs of the main stakeholders in your organization helps to make costs visible more widely. You can also use labels to filter resource cost and usage data based on target audience.
Use automation tools like Terraform to enforce labeling on every resource that is created. To enhance cost visibility and attribution further, you can use the tools provided by the open-source cost attribution solution.
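A labeling policy is easier to enforce when it can be checked automatically. The following Python sketch shows the kind of check that an automation pipeline might run against a resource inventory. The required label keys, the resource names, and the `missing_labels` helper are hypothetical; the sketch does not call any Google Cloud or Terraform API.

```python
# Hypothetical labeling policy: every resource must carry these label keys.
REQUIRED_LABELS = {"team", "cost-center", "environment"}


def missing_labels(resource_labels: dict[str, str]) -> set[str]:
    """Return the required label keys that are absent or empty on a resource."""
    present = {key for key, value in resource_labels.items() if value}
    return REQUIRED_LABELS - present


# Hypothetical inventory, for example parsed from an asset export.
resources = {
    "vm-frontend-1": {"team": "web", "cost-center": "cc-101", "environment": "prod"},
    "bucket-logs": {"team": "platform"},
}

for name, labels in resources.items():
    gaps = missing_labels(labels)
    if gaps:
        print(f"{name} is missing labels: {sorted(gaps)}")
```

A check like this could run before deployment to block unlabeled resources, or periodically to report resources whose costs cannot be attributed.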
Share cost reports with team members
By sharing cost reports with your team members, you empower them to take ownership of their cloud spending. This practice enables cost-effective decision making, continuous cost optimization, and systematic improvements to your cost allocation model.
Cost reports can be of several types, including the following:
- Periodic cost reports: Regular reports inform teams about their current cloud spending. Conventionally, these reports might be spreadsheet exports. More effective methods include automated emails and specialized dashboards. To ensure that cost reports provide relevant and actionable information without overwhelming recipients with unnecessary detail, the reports must be tailored to the target audiences. Setting up tailored reports is a foundational step toward more real-time and interactive cost visibility and management.
- Automated notifications: You can configure cost reports to proactively notify relevant stakeholders (for example, through email or chat) about cost anomalies, budget thresholds, or opportunities for cost optimization. By providing timely information directly to those who can act on it, automated alerts encourage prompt action and foster a proactive approach to cost optimization.
- Google Cloud dashboards: You can use the built-in billing dashboards in Google Cloud to get insights into cost breakdowns and to identify opportunities for cost optimization. Google Cloud also provides FinOps hub to help you monitor savings and get recommendations for cost optimization. An AI engine powers the FinOps hub to recommend cost optimization opportunities for all the resources that are currently deployed. To control access to these recommendations, you can implement role-based access control (RBAC).
- Custom dashboards: You can create custom dashboards by exporting cost data to an analytics database, like BigQuery. Use a visualization tool like Looker Studio to connect to the analytics database to build interactive reports and enable fine-grained access control through role-based permissions.
- Multicloud cost reports: For multicloud deployments, you need a unified view of costs across all the cloud providers to ensure comprehensive analysis, budgeting, and optimization. Use tools like BigQuery to centralize and analyze cost data from multiple cloud providers, and use Looker Studio to build team-specific interactive reports.
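The logic behind an automated budget notification can be quite small. The following Python sketch decides which budget thresholds a team's month-to-date spend has crossed and formats a message for each one. The threshold percentages, team name, and amounts are hypothetical, and delivery through email or chat is left out; the sketch only illustrates the decision step.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple[float, ...] = (0.5, 0.9, 1.0)) -> list[float]:
    """Return the budget thresholds (as fractions) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


def build_alerts(team: str, spend: float, budget: float) -> list[str]:
    """Format one notification message per crossed threshold."""
    return [
        f"[{team}] Spend ${spend:,.0f} has reached {t:.0%} of the ${budget:,.0f} budget"
        for t in crossed_thresholds(spend, budget)
    ]


# Hypothetical month-to-date spend for one team.
for message in build_alerts("data-platform", spend=920.0, budget=1000.0):
    print(message)
```

In practice, you would feed such a function from exported billing data and suppress alerts that were already sent, so that recipients get one message per threshold rather than one per run.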
Optimize resource usage
This principle in the cost optimization pillar of the Google Cloud Well-Architected Framework provides recommendations to help you plan and provision resources to match the requirements and consumption patterns of your cloud workloads.
Principle overview
To optimize the cost of your cloud resources, you need to thoroughly understand your workloads' resource requirements and load patterns. This understanding is the basis for a well-defined cost model that lets you forecast the total cost of ownership (TCO) and identify cost drivers throughout your cloud adoption journey. By proactively analyzing and forecasting cloud spending, you can make informed choices about resource provisioning, utilization, and cost optimization. This approach lets you control cloud spending, avoid overprovisioning, and ensure that cloud resources are aligned with the dynamic needs of your workloads and environments.
Recommendations
To effectively optimize cloud resource usage, consider the following recommendations.
Choose environment-specific resources
Each deployment environment has different requirements for availability, reliability, and scalability. For example, developers might prefer an environment that lets them rapidly deploy and run applications for short durations, but might not need high availability. On the other hand, a production environment typically needs high availability. To maximize the utilization of your resources, define environment-specific requirements based on your business needs. The following table lists examples of environment-specific requirements.
Note: The requirements that are listed in this table are not exhaustive or prescriptive. They're meant to serve as examples to help you understand how requirements can vary based on the environment type.
| Environment | Requirements |
| Production | |
| Development and testing | |
| Other environments (like staging and QA) | |
Choose workload-specific resources
Each of your cloud workloads might have different requirements for availability, scalability, security, and performance. To optimize costs, you need to align resource choices with the specific requirements of each workload. For example, a stateless application might not require the same level of availability or reliability as a stateful backend. The following table lists more examples of workload-specific requirements.
Note: The requirements that are listed in this table are not exhaustive or prescriptive. They're meant to serve as examples to help you understand how requirements can vary based on the workload type.
| Workload type | Workload requirements | Resource options |
| Mission-critical | Continuous availability, robust security, and high performance | Premium resources and managed services like Spanner for high availability and global consistency of data. |
| Non-critical | Cost-efficient and autoscaling infrastructure | Resources with basic features and ephemeral resources like Spot VMs. |
| Event-driven | Dynamic scaling based on the current demand for capacity and performance | Serverless services like Cloud Run and Cloud Run functions. |
| Experimental workloads | Low cost and flexible environment for rapid development, iteration, testing, and innovation | Resources with basic features, ephemeral resources like Spot VMs, and sandbox environments with defined spending limits. |
A benefit of the cloud is the opportunity to take advantage of the most appropriate computing power for a given workload. Some workloads are developed to take advantage of processor instruction sets, and others might not be designed in this way. Benchmark and profile your workloads accordingly. Categorize your workloads and make workload-specific resource choices (for example, choose appropriate machine families for Compute Engine VMs). This practice helps to optimize costs, enable innovation, and maintain the level of availability and performance that your workloads need.
The following are examples of how you can implement this recommendation:
- For mission-critical workloads that serve globally distributed users, consider using Spanner. Spanner removes the need for complex database deployments by ensuring reliability and consistency of data in all regions.
- For workloads with fluctuating load levels, use autoscaling to ensure that you don't incur costs when the load is low and yet maintain sufficient capacity to meet the current load. You can configure autoscaling for many Google Cloud services, including Compute Engine VMs, Google Kubernetes Engine (GKE) clusters, and Cloud Run. When you set up autoscaling, you can configure maximum scaling limits to ensure that costs remain within specified budgets.
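The interaction between autoscaling and a maximum scaling limit can be shown with a generic target-utilization calculation. The following Python sketch is an illustration of that general technique, not a description of how any specific Google Cloud autoscaler is implemented. The function name, the utilization values, and the limits are assumptions chosen for the example.

```python
import math


def desired_instances(current_instances: int, current_util_pct: float,
                      target_util_pct: float, min_instances: int,
                      max_instances: int) -> int:
    """Scale the instance count toward a target utilization, within fixed limits."""
    raw = math.ceil(current_instances * current_util_pct / target_util_pct)
    # The maximum limit caps cost; the minimum limit preserves baseline capacity.
    return max(min_instances, min(max_instances, raw))


# Load rises: 4 instances at 90% CPU with a 60% target scale out.
print(desired_instances(4, 90, 60, min_instances=1, max_instances=10))
# Load spikes: the uncapped result would exceed the limit, so the cap applies.
print(desired_instances(8, 95, 50, min_instances=1, max_instances=10))
# Load drops: scale in, but never below the configured minimum.
print(desired_instances(4, 10, 60, min_instances=2, max_instances=10))
```

The maximum limit bounds the worst-case spend, at the price of possible saturation during extreme spikes, so choose it with both the budget and the workload's availability needs in mind.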
Select regions based on cost requirements
For your cloud workloads, carefully evaluate the available Google Cloud regions and choose regions that align with your cost objectives. The region with the lowest cost might not offer optimal latency or it might not meet your sustainability requirements. Make informed decisions about where to deploy your workloads to achieve the desired balance. You can use the Google Cloud Region Picker to understand the trade-offs between cost, sustainability, latency, and other factors.
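One simple way to reason about such trade-offs is a weighted score. The following Python sketch ranks candidate regions by a weighted sum of normalized cost, latency, and carbon scores, where lower is better for each. The region names, scores, and weights are hypothetical placeholders, and this is not how the Google Cloud Region Picker works internally; it only illustrates how changing priorities can change the outcome.

```python
def pick_region(candidates: dict[str, dict[str, float]],
                weights: dict[str, float]) -> str:
    """Return the region with the lowest weighted score (lower is better)."""
    def score(metrics: dict[str, float]) -> float:
        return sum(weights[key] * metrics[key] for key in weights)

    return min(candidates, key=lambda name: score(candidates[name]))


# Hypothetical scores, normalized to the range 0..1 for each factor.
candidates = {
    "region-a": {"cost": 0.2, "latency": 0.9, "carbon": 0.5},
    "region-b": {"cost": 0.5, "latency": 0.3, "carbon": 0.4},
    "region-c": {"cost": 0.9, "latency": 0.2, "carbon": 0.1},
}

balanced = {"cost": 0.5, "latency": 0.3, "carbon": 0.2}
cost_only = {"cost": 1.0, "latency": 0.0, "carbon": 0.0}

print(pick_region(candidates, balanced))
print(pick_region(candidates, cost_only))
```

With these made-up numbers, the cheapest region wins only when cost is the sole criterion; a balanced weighting selects a different region.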
Use built-in cost optimization options
Google Cloud products provide built-in features to help you optimize resource usage and control costs. The following table lists examples of cost optimization features that you can use in some Google Cloud products:
| Product | Cost optimization feature |
| Compute Engine | |
| GKE | |
| Cloud Storage | |
| BigQuery | |
| Google Cloud VMware Engine | |
Optimize resource sharing
To maximize the utilization of cloud resources, you can deploy multiple applications or services on the same infrastructure, while still meeting the security and other requirements of the applications. For example, in development and testing environments, you can use the same cloud infrastructure to test all the components of an application. For the production environment, you can deploy each component on a separate set of resources to limit the extent of impact in case of incidents.
The following are examples of how you can implement this recommendation:
- Use a single Cloud SQL instance for multiple non-production environments.
- Enable multiple development teams to share a GKE cluster by using the fleet team management feature in GKE with appropriate access controls.
- Use GKE Autopilot to take advantage of cost-optimization techniques like bin packing and autoscaling that GKE implements by default.
- For AI and ML workloads, save GPU costs by using GPU-sharing strategies like multi-instance GPUs, time-sharing GPUs, and NVIDIA MPS.
Develop and maintain reference architectures
Create and maintain a repository of reference architectures that are tailored to meet the requirements of different deployment environments and workload types. To streamline the design and implementation process for individual projects, the blueprints can be centrally managed by a team like a Cloud Center of Excellence (CCoE). Project teams can choose suitable blueprints based on clearly defined criteria, to ensure architectural consistency and adoption of best practices. For requirements that are unique to a project, the project team and the central architecture team should collaborate to design new reference architectures. You can share the reference architectures across the organization to foster knowledge sharing and expand the repository of available solutions. This approach ensures consistency, accelerates development, simplifies decision-making, and promotes efficient resource utilization.
Review the reference architectures provided by Google for various use cases and technologies. These reference architectures incorporate best practices for resource selection, sizing, configuration, and deployment. By using these reference architectures, you can accelerate your development process and achieve cost savings from the start.
Enforce cost discipline by using organization policies
Consider using organization policies to limit the available Google Cloud locations and products that team members can use. These policies help to ensure that teams adhere to cost-effective solutions and provision resources in locations that are aligned with your cost optimization goals.
Estimate realistic budgets and set financial boundaries
Develop detailed budgets for each project, workload, and deployment environment. Make sure that the budgets cover all aspects of cloud operations, including infrastructure costs, software licenses, personnel, and anticipated growth. To prevent overspending and ensure alignment with your financial goals, establish clear spending limits or thresholds for projects, services, or specific resources. Monitor cloud spending regularly against these limits. You can use proactive quota alerts to identify potential cost overruns early and take timely corrective action.
In addition to setting budgets, you can use quotas and limits to help enforce cost discipline and prevent unexpected spikes in spending. You can exercise granular control over resource consumption by setting quotas at various levels, including projects, services, and even specific resource types.
The following are examples of how you can implement this recommendation:
- Project-level quotas: Set spending limits or resource quotas at the project level to establish overall financial boundaries and control resource consumption across all the services within the project.
- Service-specific quotas: Configure quotas for specific Google Cloud services like Compute Engine or BigQuery to limit the number of instances, CPUs, or storage capacity that can be provisioned.
- Resource type-specific quotas: Apply quotas to individual resource types like Compute Engine VMs, Cloud Storage buckets, Cloud Run instances, or GKE nodes to restrict their usage and prevent unexpected cost overruns.
- Quota alerts: Get notifications when your quota usage (at the project level) reaches a percentage of the maximum value.
By using quotas and limits in conjunction with budgeting and monitoring, you can create a proactive and multi-layered approach to cost control. This approach helps to ensure that your cloud spending remains within defined boundaries and aligns with your business objectives. Remember, these cost controls are not permanent or rigid. To ensure that the cost controls remain aligned with current industry standards and reflect your evolving business needs, you must review the controls regularly and adjust them to include new technologies and best practices.
Optimize continuously
This principle in the cost optimization pillar of the Google Cloud Well-Architected Framework provides recommendations to help you optimize the cost of your cloud deployments based on constantly changing and evolving business goals.
As your business grows and evolves, your cloud workloads need to adapt to changes in resource requirements and usage patterns. To derive maximum value from your cloud spending, you must maintain cost-efficiency while continuing to support business objectives. This requires a proactive and adaptive approach that focuses on continuous improvement and optimization.
Principle overview
To optimize cost continuously, you must proactively monitor and analyze your cloud environment and make suitable adjustments to meet current requirements. Focus your monitoring efforts on key performance indicators (KPIs) that directly affect your end users' experience, align with your business goals, and provide insights for continuous improvement. This approach lets you identify and address inefficiencies, adapt to changing needs, and continuously align cloud spending with strategic business goals. To balance comprehensive observability with cost effectiveness, understand the costs and benefits of monitoring resource usage and use appropriate process-improvement and optimization strategies.
Recommendations
To effectively monitor your Google Cloud environment and optimize cost continuously, consider the following recommendations.
Focus on business-relevant metrics
Effective monitoring starts with identifying the metrics that are most important for your business and customers. These metrics include the following:
- User experience metrics: Latency, error rates, throughput, and customer satisfaction metrics are useful for understanding your end users' experience when using your applications.
- Business outcome metrics: Revenue, customer growth, and engagement can be correlated with resource usage to identify opportunities for cost optimization.
- DevOps Research & Assessment (DORA) metrics: Metrics like deployment frequency, lead time for changes, change failure rate, and time to restore provide insights into the efficiency and reliability of your software delivery process. By improving these metrics, you can increase productivity, reduce downtime, and optimize cost.
- Site Reliability Engineering (SRE) metrics: Error budgets help teams to quantify and manage the acceptable level of service disruption. By establishing clear expectations for reliability, error budgets empower teams to innovate and deploy changes more confidently, knowing their safety margin. This proactive approach promotes a balance between innovation and stability, helping prevent excessive operational costs associated with major outages or prolonged downtime.
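To make the error budget idea concrete, the following is a minimal sketch of the arithmetic for a request-based availability objective. The class name `ErrorBudget`, the 99.9% target, and the request counts are illustrative assumptions, not values from this framework.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one measurement window (hypothetical helper)."""

    slo_target: float      # assumed objective, for example 0.999 for 99.9% success
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the objective

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests that may fail without breaching the SLO.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or less means it is exhausted.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0
        return 1.0 - self.failed_requests / allowed


# Illustrative numbers only: one million requests, 250 failures.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Allowed failures: {budget.allowed_failures:.0f}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

With these assumed numbers, about 1,000 failures are permitted and roughly three quarters of the budget remains. A team might use that margin to decide whether a risky deployment can proceed.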
Use observability for resource optimization
The following are recommendations to use observability to identify resource bottlenecks and underutilized resources in your cloud deployments:
- Monitor resource utilization: Use resource utilization metrics to identify Google Cloud resources that are underutilized. For example, use metrics like CPU and memory utilization to identify idle VM resources. For Google Kubernetes Engine (GKE), you can view a detailed breakdown of costs and cost-related optimization metrics. For Google Cloud VMware Engine, review resource utilization to optimize CUDs, storage consumption, and ESXi right-sizing.
- Use cloud recommendations: Active Assist is a portfolio of intelligent tools that help you optimize your cloud operations. These tools provide actionable recommendations to reduce costs, increase performance, improve security, and even make sustainability-focused decisions. For example, VM rightsizing insights can help to optimize resource allocation and avoid unnecessary spending.
- Correlate resource utilization with performance: Analyze the relationship between resource utilization and application performance to determine whether you can downgrade to less expensive resources without affecting the user experience.
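As a rough illustration of the utilization check described above, the following sketch flags VMs whose average CPU and memory utilization both fall below a threshold. The function name, the thresholds, and the sample data are hypothetical. In practice, the samples would come from your monitoring system, and Active Assist recommendations cover this case without custom code.

```python
from statistics import mean


def find_underutilized(samples, cpu_threshold=0.10, mem_threshold=0.20):
    """Return the names of VMs whose average utilization is below both thresholds.

    samples maps a VM name to a list of (cpu, memory) utilization fractions
    in the range 0.0 to 1.0. The thresholds are assumptions for this sketch.
    """
    flagged = []
    for vm_name, points in samples.items():
        if not points:
            continue  # no data, so no conclusion about this VM
        avg_cpu = mean(cpu for cpu, _ in points)
        avg_mem = mean(mem for _, mem in points)
        if avg_cpu < cpu_threshold and avg_mem < mem_threshold:
            flagged.append(vm_name)
    return sorted(flagged)


# Hypothetical utilization samples for two VMs.
utilization = {
    "vm-a": [(0.03, 0.10), (0.05, 0.12)],
    "vm-b": [(0.60, 0.70), (0.55, 0.65)],
}
print(find_underutilized(utilization))
```

A flagged VM is only a candidate for downsizing. Check its utilization against application latency and error rates before you change the machine type.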
Balance troubleshooting needs with cost
Detailed observability data can help with diagnosing and troubleshooting issues. However, storing excessive amounts of observability data or exporting unnecessary data to external monitoring tools can lead to unnecessary costs. For efficient troubleshooting, consider the following recommendations:
- Collect sufficient data for troubleshooting: Ensure that your monitoring solution captures enough data to efficiently diagnose and resolve issues when they arise. This data might include logs, traces, and metrics at various levels of granularity.
- Use sampling and aggregation: Balance the need for detailed data with cost considerations by using sampling and aggregation techniques. This approach lets you collect representative data without incurring excessive storage costs.
- Understand the pricing models of your monitoring tools and services: Evaluate different monitoring solutions and choose options that align with your project's specific needs, budget, and usage patterns. Consider factors like data volume, retention requirements, and the required features when making your selection.
- Regularly review your monitoring configuration: Avoid collecting excessive data by removing unnecessary metrics or logs.
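The sampling and aggregation techniques mentioned above can be sketched as follows. This is one possible approach, not a prescribed one: it keeps every error-level log entry, keeps a deterministic one-in-N sample of the rest, and reduces raw metric points to per-minute averages. The field names (`severity`, `trace_id`) and the default rate of 1 in 10 are assumptions.

```python
import zlib
from collections import defaultdict


def keep_log(entry, sample_rate=10):
    """Decide whether to store a log entry.

    Error-level entries are always kept. Other entries are kept for roughly
    1 in sample_rate traces, chosen by hashing the trace ID so that all
    entries from a sampled trace stay together.
    """
    if entry["severity"] in ("ERROR", "CRITICAL"):
        return True
    return zlib.crc32(entry["trace_id"].encode("utf-8")) % sample_rate == 0


def aggregate_per_minute(points):
    """Average (unix_seconds, value) points into one value per minute."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        minute_start = timestamp // 60 * 60
        buckets[minute_start].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}


# Hypothetical latency points in milliseconds.
print(aggregate_per_minute([(0, 100), (30, 200), (60, 300)]))
```

Hashing the trace ID instead of sampling at random makes the decision repeatable, so a sampled request keeps its full set of log entries for troubleshooting.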
Tailor data collection to roles and set role-specific retention policies
Consider the specific data needs of different roles. For example, developers might primarily need access to traces and application-level logs, whereas IT administrators might focus on system logs and infrastructure metrics. By tailoring data collection, you can reduce unnecessary storage costs and avoid overwhelming users with irrelevant information.
Additionally, you can define retention policies based on the needs of each role and any regulatory requirements. For example, developers might need access to detailed logs for a shorter period, while financial analysts might require longer-term data.
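One way to reason about role-specific retention is to take the longest period that any consuming role needs and then apply any regulatory minimum as a floor. The following sketch assumes hypothetical role names, log types, and day counts. Actual values depend on your organization and the regulations that apply to it.

```python
# All names and day counts below are illustrative assumptions.
ROLE_RETENTION_DAYS = {
    "developer": 30,
    "it_admin": 90,
    "financial_analyst": 365,
}

# Minimum retention imposed by regulation, per log type (hypothetical).
REGULATORY_MIN_DAYS = {
    "audit_log": 400,
}


def effective_retention(log_type, roles):
    """Return the retention period, in days, for a log type.

    The period is the longest need among the roles that use the data,
    raised to the regulatory minimum for that log type if one exists.
    """
    role_need = max((ROLE_RETENTION_DAYS[role] for role in roles), default=0)
    return max(role_need, REGULATORY_MIN_DAYS.get(log_type, 0))


print(effective_retention("app_log", ["developer"]))
print(effective_retention("audit_log", ["developer"]))
```

Data that only developers use can then expire early, while the regulatory floor keeps audit data from being deleted too soon.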
Consider regulatory and compliance requirements
In certain industries, regulatory requirements mandate data retention. To avoid legal and financial risks, you need to ensure that your monitoring and data retention practices help you adhere to relevant regulations. At the same time, you need to maintain cost efficiency. Consider the following recommendations:
- Determine the specific data retention requirements for your industry or region, and ensure that your monitoring strategy meets those requirements.
- Implement appropriate data archival and retrieval mechanisms to meet audit and compliance needs while minimizing storage costs.
Implement smart alerting
Alerting helps to detect and resolve issues in a timely manner. However, a balance is necessary between an approach that keeps you informed, and one that overwhelms you with notifications. By designing intelligent alerting systems, you can prioritize critical issues that have higher business impact. Consider the following recommendations:
- Prioritize issues that affect customers: Design alerts that trigger rapidly for issues that directly affect the customer experience, like website outages, slow response times, or transaction failures.
- Tune for temporary problems: Use appropriate thresholds and delay mechanisms to avoid unnecessary alerts for temporary problems or self-healing system issues that don't affect customers.
- Customize alert severity: Ensure that the most urgent issues receive immediate attention by differentiating between critical and noncritical alerts.
- Use notification channels wisely: Choose appropriate channels for alert notifications (email, SMS, or paging) based on the severity and urgency of the alerts.
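The threshold, delay, and channel recommendations above can be illustrated with a small sketch. It fires an alert only when a metric stays above a threshold for several consecutive samples, which filters out brief spikes, and it routes the notification by severity. The function names, the severity labels, and the channel mapping are hypothetical. A managed alerting service typically exposes equivalent settings as configuration.

```python
# Hypothetical mapping from alert severity to notification channel.
CHANNELS = {
    "critical": "paging",
    "warning": "email",
}


def should_alert(values, threshold, min_consecutive):
    """Return True if the latest min_consecutive samples all exceed threshold.

    Requiring several consecutive breaches acts as a delay mechanism, so
    a single transient spike doesn't trigger a notification.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(values) < min_consecutive:
        return False
    return all(value > threshold for value in values[-min_consecutive:])


def route(severity):
    """Pick a notification channel; unknown severities fall back to email."""
    return CHANNELS.get(severity, "email")


# Hypothetical error-rate samples: one spike, then a sustained breach.
print(should_alert([0.1, 0.9, 0.1], threshold=0.5, min_consecutive=2))
print(should_alert([0.1, 0.9, 0.95], threshold=0.5, min_consecutive=2))
print(route("critical"))
```

A larger `min_consecutive` value reduces noise but delays detection, so customer-facing alerts usually warrant a smaller value than alerts for systems that recover on their own.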
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-02-14 UTC.