FR: Immediate Workspace Timeout When GPU Resources Are Unavailable #17105

bjornrobertsson started this conversation in Feature Requests

Problem

When GPU resources (specifically AWS g5.48xlarge instances) are unavailable at the cloud provider, Coder workspaces keep trying to provision them for up to 60 minutes before timing out. This creates a poor user experience: users wait unnecessarily for resources that cannot be allocated. Users need an immediate notification when resources are unavailable rather than a long timeout period.

Proposed Solution

Add configuration options to customize timeout behavior when resources are unavailable:

  1. Configurable resource availability timeout - Allow administrators to set custom timeout periods specifically for resource availability issues (separate from other timeout categories)

  2. Improved error messaging in UI - Display specific resource unavailability errors to users, including:

    • Reason for the wait (insufficient GPU, RAM, CPU, Disk)
    • Estimated time until resources might be available (if known)
    • Option to cancel and try different resource specifications
  3. Resource type-specific timeouts - Allow different timeout settings for different resource types:

    timeouts:
      resource_unavailable:
        gpu: 5m
        memory: 15m
        cpu: 10m
        disk: 10m
  4. AWS-specific detection mechanism - Integrate with the AWS API to distinguish true resource unavailability from temporary scheduling issues (sketched below)
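To make item 4 concrete, a minimal Go sketch of how the provisioner might classify a failed apply as a genuine capacity shortage is shown below. The EC2 error codes are real, but the helper itself, its placement, and the decision to treat these codes as terminal are assumptions for illustration, not existing Coder behaviour.

    // Sketch only: hypothetical helper, not part of the Coder codebase.
    package provisionersketch

    import "strings"

    // capacityErrorCodes are EC2 error codes indicating the provider cannot
    // allocate the requested capacity right now (as opposed to a transient
    // scheduling hiccup). Treating them as terminal is an assumption.
    var capacityErrorCodes = []string{
        "InsufficientInstanceCapacity",
        "InsufficientHostCapacity",
        "InstanceLimitExceeded",
    }

    // isResourceUnavailable reports whether a Terraform apply error looks like
    // a genuine capacity shortage, so the build could fail fast instead of
    // retrying for the full 60-minute window.
    func isResourceUnavailable(applyErr error) bool {
        if applyErr == nil {
            return false
        }
        msg := applyErr.Error()
        for _, code := range capacityErrorCodes {
            if strings.Contains(msg, code) {
                return true
            }
        }
        return false
    }

String matching is a pragmatic choice here because the provisioner typically sees provider errors only as text in the Terraform output; a structured AWS API check would be more reliable but adds a provider-specific dependency.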

Implementation Details

Based on examination of the codebase, the following components would need modification:

  1. In provisioner/terraform/terraform.go, modify the resource creation logic to detect specific AWS resource unavailability errors and fail faster

  2. Update the workspace state model in coderd/workspaces.go to include more granular status information about resource allocation failures

  3. Add new configuration fields to coderd/parameter.go to support customizable timeout settings for different resource types (see the sketch after this list)

  4. Enhance the UI components to display more detailed error information when resources are unavailable
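For item 3, the new configuration fields could map onto something like the following Go sketch. The struct, its field names, and the fallback value are hypothetical and do not correspond to existing fields in coderd; decoding duration strings such as "5m" from YAML would also need a custom unmarshaler, omitted here for brevity.

    // Sketch only: hypothetical configuration type, not part of coderd.
    package coderdsketch

    import "time"

    // ResourceUnavailableTimeouts mirrors the proposed per-resource-type
    // timeout configuration from the Proposed Solution section.
    type ResourceUnavailableTimeouts struct {
        GPU    time.Duration
        Memory time.Duration
        CPU    time.Duration
        Disk   time.Duration
    }

    // TimeoutFor returns the configured timeout for a resource type, falling
    // back to a conservative default when nothing is configured.
    func (t ResourceUnavailableTimeouts) TimeoutFor(resource string) time.Duration {
        const fallback = 15 * time.Minute
        configured := map[string]time.Duration{
            "gpu":    t.GPU,
            "memory": t.Memory,
            "cpu":    t.CPU,
            "disk":   t.Disk,
        }[resource]
        if configured > 0 {
            return configured
        }
        return fallback
    }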

Expected Outcome

  1. When GPU resources are unavailable, users will be notified within 5 minutes (or administrator-configured timeout) rather than waiting for 60 minutes
  2. Error messages will clearly indicate that resources are unavailable at the cloud provider level
  3. Users can make informed decisions about either waiting or selecting different resource specifications
  4. System administrators can configure appropriate timeout values based on their organization's needs and cloud provider characteristics

Related Documentation

These changes would require updates to:

Potential Implementation Challenges

  1. The Terraform AWS provider treats resource allocation failures as retryable events, making it difficult to distinguish temporary from permanent unavailability
  2. Kubernetes doesn't always receive detailed error information from the cloud provider's autoscaling groups
  3. Need to balance immediate failure notifications against allowing reasonable time for resources to become available through autoscaling

Replies: 2 comments


I don't think it's a good idea to handle AWS specifically. We should instead pursue a more general, Terraform-centric solution.

Here is an example showing timeouts configuration for an AWS resource.


bjornrobertsson (Collaborator, Author) replied on Apr 8, 2025:

The particular pain point here is not the Terraform create/destroy operation, which does have working timeouts.
Those timeouts do not help when restarting a workspace (moving its state from Stopped to Running): that path is handled by Kubernetes actions, so this is likely resolvable as a Kubernetes issue rather than an AWS-specific one.
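To illustrate the Kubernetes side of this, below is a rough client-go sketch that fails a workspace start as soon as the scheduler marks the pod Unschedulable (for example when it reports insufficient nvidia.com/gpu), rather than waiting out the full start timeout. The client-go calls are standard, but the function, the pod naming, and the polling interval are assumptions; nothing like this exists in Coder today.

    // Sketch only: generic client-go helper, not part of Coder.
    package k8ssketch

    import (
        "context"
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // waitSchedulable polls the workspace pod and returns an error as soon as
    // the scheduler reports it Unschedulable, instead of waiting out the full
    // start timeout.
    func waitSchedulable(ctx context.Context, cs kubernetes.Interface, ns, name string, limit time.Duration) error {
        deadline := time.Now().Add(limit)
        for time.Now().Before(deadline) {
            pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return err
            }
            for _, cond := range pod.Status.Conditions {
                if cond.Type != corev1.PodScheduled {
                    continue
                }
                if cond.Status == corev1.ConditionTrue {
                    return nil // scheduled; normal startup can continue
                }
                if cond.Reason == corev1.PodReasonUnschedulable {
                    return fmt.Errorf("pod %s/%s unschedulable: %s", ns, name, cond.Message)
                }
            }
            time.Sleep(10 * time.Second)
        }
        return fmt.Errorf("pod %s/%s not scheduled within %s", ns, name, limit)
    }

Note the tension with autoscaling raised under Potential Implementation Challenges: in a cluster that scales GPU nodes on demand, a pod legitimately sits Unschedulable while a node is being provisioned, so a grace period (or a check for a pending scale-up) would be needed before failing the build.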
