Optimize continuously

This principle in the cost optimization pillar of the Google Cloud Well-Architected Framework provides recommendations to help you optimize the cost of your cloud deployments based on constantly changing and evolving business goals.

As your business grows and evolves, your cloud workloads need to adapt to changes in resource requirements and usage patterns. To derive maximum value from your cloud spending, you must maintain cost-efficiency while continuing to support business objectives. This requires a proactive and adaptive approach that focuses on continuous improvement and optimization.

Principle overview

To optimize cost continuously, you must proactively monitor and analyze your cloud environment and make suitable adjustments to meet current requirements. Focus your monitoring efforts on key performance indicators (KPIs) that directly affect your end users' experience, align with your business goals, and provide insights for continuous improvement. This approach lets you identify and address inefficiencies, adapt to changing needs, and continuously align cloud spending with strategic business goals. To balance comprehensive observability with cost-effectiveness, understand the costs and benefits of monitoring resource usage and use appropriate process-improvement and optimization strategies.

Recommendations

To effectively monitor your Google Cloud environment and optimize cost continuously, consider the following recommendations.

Focus on business-relevant metrics

Effective monitoring starts with identifying the metrics that are most important for your business and customers. These metrics include the following:

User experience metrics: Latency, error rates, throughput, and customer satisfaction metrics are useful for understanding your end users' experience when using your applications.
Business outcome metrics: Revenue, customer growth, and engagement can be correlated with resource usage to identify opportunities for cost optimization.
DevOps Research & Assessment (DORA) metrics: Metrics like deployment frequency, lead time for changes, change failure rate, and time to restore provide insights into the efficiency and reliability of your software delivery process. By improving these metrics, you can increase productivity, reduce downtime, and optimize cost.
Site Reliability Engineering (SRE) metrics: Error budgets help teams to quantify and manage the acceptable level of service disruption. By establishing clear expectations for reliability, error budgets empower teams to innovate and deploy changes more confidently, knowing their safety margin. This proactive approach promotes a balance between innovation and stability, helping prevent excessive operational costs associated with major outages or prolonged downtime.
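
As an illustration of the error-budget idea, the following Python sketch converts an availability service level objective (SLO) into the amount of downtime that the SLO allows over a measurement window, and reports how much of that budget remains. It assumes a simple time-based availability SLO; the function names `error_budget` and `budget_remaining` are hypothetical and are not part of any Google Cloud API.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed by an availability SLO over a window.

    slo_target is a fraction, for example 0.999 for 99.9% availability.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The error budget is the fraction of the window that may be unavailable.
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the error budget.

    A negative result means that the budget is exhausted and the SLO was missed.
    """
    budget = error_budget(slo_target, window)
    return (budget - downtime) / budget
```

For a 99.9% SLO over 30 days, the budget is 43 minutes and 12 seconds of downtime; after 21 minutes and 36 seconds of outages, half of the budget remains. A team with budget left can ship changes with more confidence, whereas an exhausted budget is a signal to prioritize reliability work.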

Use observability for resource optimization

The following recommendations describe how to use observability to identify resource bottlenecks and underutilized resources in your cloud deployments:

Monitor resource utilization: Use resource utilization metrics to identify Google Cloud resources that are underutilized. For example, use metrics like CPU and memory utilization to identify idle VM resources. For Google Kubernetes Engine (GKE), you can view a detailed breakdown of costs and cost-related optimization metrics. For Google Cloud VMware Engine, review resource utilization to optimize committed use discounts (CUDs), storage consumption, and ESXi right-sizing.
Use cloud recommendations: Active Assist is a portfolio of intelligent tools that help you optimize your cloud operations. These tools provide actionable recommendations to reduce costs, increase performance, improve security, and support sustainability-focused decisions. For example, VM rightsizing insights can help you optimize resource allocation and avoid unnecessary spending.
Correlate resource utilization with performance: Analyze the relationship between resource utilization and application performance to determine whether you can downgrade to less expensive resources without affecting the user experience.
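
The following Python sketch shows the kind of logic behind identifying idle or underutilized VMs: it flags instances whose 95th-percentile CPU and memory utilization both stay below a threshold. It assumes that you have already exported utilization samples (for example, from Cloud Monitoring) as fractions between 0 and 1. The `VmUsage` class, the function names, and the 20% CPU and 30% memory thresholds are hypothetical; Active Assist recommenders apply their own criteria.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class VmUsage:
    """Utilization samples for one VM, expressed as fractions from 0.0 to 1.0."""

    name: str
    cpu_samples: list[float]
    memory_samples: list[float]


def p95(samples: list[float]) -> float:
    """Return the 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    if len(samples) == 1:
        return samples[0]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def find_underutilized(
    vms: list[VmUsage],
    cpu_threshold: float = 0.2,
    memory_threshold: float = 0.3,
) -> list[str]:
    """Return the names of VMs whose p95 CPU and memory both stay below the thresholds."""
    return [
        vm.name
        for vm in vms
        if p95(vm.cpu_samples) < cpu_threshold
        and p95(vm.memory_samples) < memory_threshold
    ]
```

Using a high percentile instead of the average avoids flagging VMs that are mostly idle but have short, important usage peaks. Before you downsize a flagged VM, check its latency and error rates, in line with the recommendation to correlate resource utilization with performance.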

Balance troubleshooting needs with cost

Detailed observability data can help with diagnosing and troubleshooting issues. However, storing excessive amounts of observability data or exporting data that you don't need to external monitoring tools can lead to avoidable costs. For efficient troubleshooting, consider the following recommendations:

Collect sufficient data for troubleshooting: Ensure that your monitoring solution captures enough data to efficiently diagnose and resolve issues when they arise. This data might include logs, traces, and metrics at various levels of granularity.
Use sampling and aggregation: Balance the need for detailed data with cost considerations by using sampling and aggregation techniques. This approach lets you collect representative data without incurring excessive storage costs.
Understand the pricing models of your monitoring tools and services: Evaluate different monitoring solutions and choose options that align with your project's specific needs, budget, and usage patterns. Consider factors like data volume, retention requirements, and the required features when making your selection.
Regularly review your monitoring configuration: Avoid collecting excessive data by removing unnecessary metrics or logs.
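
To illustrate sampling and aggregation, the following Python sketch keeps every log entry at ERROR severity or higher, samples lower-severity entries deterministically by trace ID, and collapses individual HTTP status codes into per-class counts. The severity names follow the Cloud Logging `LogSeverity` levels; the function names and the choice of trace-ID hashing are assumptions for illustration, not a built-in Google Cloud feature.

```python
import hashlib
from collections import Counter

# Cloud Logging severities that are always kept, regardless of the sample rate.
ALWAYS_KEEP = {"ERROR", "CRITICAL", "ALERT", "EMERGENCY"}


def keep_log_entry(trace_id: str, severity: str, sample_rate: float) -> bool:
    """Decide whether to store a log entry.

    High-severity entries are always kept. Other entries are sampled by hashing
    the trace ID, so all entries from the same trace are kept or dropped together.
    """
    if severity in ALWAYS_KEEP:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 48 bits of the hash to a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < sample_rate


def aggregate_status_codes(status_codes: list[int]) -> dict[str, int]:
    """Collapse individual HTTP status codes into per-class counts such as '2xx'."""
    return dict(Counter(f"{code // 100}xx" for code in status_codes))
```

With `sample_rate=0.1`, roughly 10% of traces keep their lower-severity logs, while every error is still stored for troubleshooting. Hashing the trace ID instead of drawing a random number keeps each sampled trace complete, which matters when you need to follow one request end to end.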

Tailor data collection to roles and set role-specific retention policies

Consider the specific data needs of different roles. For example, developers might primarily need access to traces and application-level logs, whereas IT administrators might focus on system logs and infrastructure metrics. By tailoring data collection, you can reduce unnecessary storage costs and avoid overwhelming users with irrelevant information.

Additionally, you can define retention policies based on the needs of each role and any regulatory requirements. For example, developers might need access to detailed logs for a shorter period, while financial analysts might require longer-term data.
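
A role-based retention policy can be expressed as a lookup table keyed by role and data type. The following Python sketch checks whether a record has outlived its retention period. The roles, data types, and retention periods are hypothetical placeholders; in Google Cloud you would typically enforce retention on the storage itself, for example by setting the retention period of Cloud Logging log buckets, rather than in application code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical retention periods, in days, keyed by (role, data type).
# Replace these values with periods that match your teams and regulations.
RETENTION_DAYS: dict[tuple[str, str], int] = {
    ("developer", "application_log"): 14,
    ("developer", "trace"): 7,
    ("it_admin", "system_log"): 30,
    ("it_admin", "infrastructure_metric"): 90,
    ("financial_analyst", "billing_record"): 7 * 365,
}


def is_expired(
    role: str,
    data_type: str,
    created_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a record is older than its role-specific retention period.

    Raises KeyError when no policy exists for the (role, data_type) pair, so
    that data without an agreed retention period is surfaced for review
    instead of being kept or deleted silently.
    """
    now = now or datetime.now(timezone.utc)
    retention = timedelta(days=RETENTION_DAYS[(role, data_type)])
    return now - created_at > retention
```

Raising an error for unknown pairs is a deliberate choice: it forces each new data type to get an explicit owner and retention period before it accumulates storage costs.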

Consider regulatory and compliance requirements

In certain industries, regulatory requirements mandate data retention. To avoid legal and financial risks, you need to ensure that your monitoring and data retention practices help you adhere to relevant regulations. At the same time, you need to maintain cost efficiency. Consider the following recommendations:

Determine the specific data retention requirements for your industry or region, and ensure that your monitoring strategy meets those requirements.
Implement appropriate data archival and retrieval mechanisms to meet audit and compliance needs while minimizing storage costs.
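
One common archival pattern moves data to colder, cheaper storage classes as it ages and deletes it only after the mandated retention period ends. The following Python sketch encodes that decision. The `Action` enum, the function name, and the age thresholds are hypothetical; in Cloud Storage you would normally express the same rules declaratively with Object Lifecycle Management (`SetStorageClass` and `Delete` actions) and protect data from early deletion with a bucket retention policy. Also account for the minimum storage durations of colder classes (90 days for Coldline and 365 days for Archive) so that early-deletion charges don't offset the savings.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical storage actions for an aging object."""

    KEEP_STANDARD = "keep in STANDARD"
    MOVE_TO_COLDLINE = "move to COLDLINE"
    MOVE_TO_ARCHIVE = "move to ARCHIVE"
    DELETE = "delete"


def archival_action(
    age_days: int,
    regulatory_min_days: int,
    hot_days: int = 30,
    cold_days: int = 365,
) -> Action:
    """Pick a storage action for an object based on its age in days.

    Objects are never deleted before regulatory_min_days has passed.
    The hot_days and cold_days thresholds are illustrative defaults.
    """
    if age_days >= regulatory_min_days:
        # The mandated retention period is over, so stop paying for storage.
        return Action.DELETE
    if age_days < hot_days:
        # Recent data is still likely to be read during investigations.
        return Action.KEEP_STANDARD
    if age_days < cold_days:
        return Action.MOVE_TO_COLDLINE
    # Rarely accessed data that must still be retained for audits.
    return Action.MOVE_TO_ARCHIVE
```

For a record that must be kept for 2,555 days (about seven years), the sketch keeps it in Standard for the first month, moves it to Coldline and then Archive, and deletes it only once the full period has elapsed.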

Implement smart alerting

Alerting helps to detect and resolve issues in a timely manner. However, you need to strike a balance between alerting that keeps you informed and alerting that overwhelms you with notifications. By designing intelligent alerting systems, you can prioritize critical issues that have higher business impact. Consider the following recommendations:

Prioritize issues that affect customers: Design alerts that trigger rapidly for issues that directly affect the customer experience, like website outages, slow response times, or transaction failures.
Tune for temporary problems: Use appropriate thresholds and delay mechanisms to avoid unnecessary alerts for temporary problems or self-healing system issues that don't affect customers.
Customize alert severity: Ensure that the most urgent issues receive immediate attention by differentiating between critical and noncritical alerts.
Use notification channels wisely: Choose appropriate channels for alert notifications (email, SMS, or paging) based on the severity and urgency of the alerts.
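
The following Python sketch combines two of these recommendations: an alert fires only after a metric breaches its threshold for several consecutive samples, which filters out short-lived spikes, and the alert's severity determines the notification channel. It also notifies once per incident instead of on every sample. The class, the channel names, and the example threshold are hypothetical; Cloud Monitoring alerting policies provide comparable controls through condition durations and notification channels.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing from alert severity to notification channel.
CHANNEL_BY_SEVERITY = {
    "critical": "paging",
    "warning": "email",
}


@dataclass
class DurationAlert:
    """Alert that fires only after N consecutive samples breach a threshold."""

    threshold: float
    required_samples: int
    severity: str

    def __post_init__(self) -> None:
        # Sliding window of breach flags for the most recent samples.
        self._recent = deque(maxlen=self.required_samples)
        self._firing = False

    def observe(self, value: float) -> Optional[str]:
        """Record one sample and return a channel name when a new incident starts."""
        self._recent.append(value > self.threshold)
        breached = len(self._recent) == self.required_samples and all(self._recent)
        if breached and not self._firing:
            # Notify once when the incident starts, not on every sample.
            self._firing = True
            return CHANNEL_BY_SEVERITY[self.severity]
        if not breached:
            # The condition cleared, so a later breach counts as a new incident.
            self._firing = False
        return None
```

For example, with `threshold=0.05` (a 5% error rate) and `required_samples=3`, a single spike doesn't page anyone, but three consecutive breaching samples send one notification to the paging channel.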

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-09-25 UTC.
