coder/coder (Public)


FR: Immediate Workspace Timeout When GPU Resources Are Unavailable #17105

bjornrobertsson started this conversation in Feature Requests

bjornrobertsson

Mar 26, 2025

· 2 comments

bjornrobertsson
Mar 26, 2025
Collaborator

Problem

When GPU resources (specifically AWS g5.48xlarge instances) are unavailable at the cloud provider, Coder workspaces keep trying to provision them for up to 60 minutes before timing out. Users end up waiting for resources that cannot be allocated. They need an immediate notification when resources are unavailable rather than a long timeout period.

Proposed Solution

Add configuration options to customize timeout behavior when resources are unavailable:

1. Configurable resource availability timeout - Allow administrators to set custom timeout periods specifically for resource availability issues (separate from other timeout categories)
2. Improved error messaging in UI - Display specific resource unavailability errors to users, including:
   - Reason for the wait (insufficient GPU, RAM, CPU, Disk)
   - Estimated time until resources might be available (if known)
   - Option to cancel and try different resource specifications
3. Resource type-specific timeouts - Allow different timeout settings for different resource types:
   ```
   timeouts:
     resource_unavailable:
       gpu: 5m
       memory: 15m
       cpu: 10m
       disk: 10m
   ```
4. AWS-specific detection mechanism - Implement integration with AWS API to detect true resource unavailability vs. temporary scheduling issues
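
To make the per-resource timeout idea concrete, here is a minimal sketch of how such a config could be resolved into seconds. The `timeouts.resource_unavailable` keys come from the snippet above; `parse_duration`, `timeout_for`, and the 60-minute fallback are hypothetical names and assumptions for illustration, not existing Coder settings or APIs.

```python
import re

# Hypothetical config mirroring the proposed `timeouts` snippet; not an existing Coder option.
CONFIG = {
    "timeouts": {
        "resource_unavailable": {"gpu": "5m", "memory": "15m", "cpu": "10m", "disk": "10m"}
    }
}

_DURATION = re.compile(r"^(\d+)([smh])$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '5m' into seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def timeout_for(config: dict, resource: str, default: str = "60m") -> int:
    """Return the timeout in seconds for a resource type.

    Falls back to `default` (assumed here to be the 60 minutes described
    in the Problem section) when no specific value is configured.
    """
    per_resource = config.get("timeouts", {}).get("resource_unavailable", {})
    return parse_duration(per_resource.get(resource, default))
```

With this shape, a GPU request would give up after 300 seconds while an unlisted resource type keeps the assumed default.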

Implementation Details

Based on examination of the codebase, the following components would need modification:

- In provisioner/terraform/terraform.go, modify the resource creation logic to detect specific AWS resource unavailability errors and fail faster
- Update the workspace state model in coderd/workspaces.go to include more granular status information about resource allocation failures
- Add new configuration fields to coderd/parameter.go to support customizable timeout settings for different resource types
- Enhance the UI components to display more detailed error information when resources are unavailable
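
As a rough illustration of the "detect and fail faster" step, the sketch below scans provisioner output for EC2 error codes that indicate capacity or limit problems. `InsufficientInstanceCapacity` and `InstanceLimitExceeded` are real EC2 API error codes; the function name, the user-facing messages, and the assumption that the code appears verbatim in the Terraform output are illustrative only, and the sample line is made up for the example.

```python
from typing import Optional

# EC2 error codes worth failing fast on, mapped to illustrative user-facing messages.
FAIL_FAST_CODES = {
    "InsufficientInstanceCapacity": "The cloud provider has no capacity for the requested instance type.",
    "InstanceLimitExceeded": "The account's instance limit has been reached.",
}


def classify_provision_error(output: str) -> Optional[str]:
    """Return the first fail-fast error code found in the output, or None.

    Assumes the error code appears verbatim in the provisioner output.
    """
    for code in FAIL_FAST_CODES:
        if code in output:
            return code
    return None


# Illustrative output line, not captured from a real run.
sample = (
    "Error: creating EC2 Instance: InsufficientInstanceCapacity: "
    "We currently do not have sufficient g5.48xlarge capacity "
    "in the Availability Zone you requested."
)
code = classify_provision_error(sample)
if code is not None:
    print(f"{code}: {FAIL_FAST_CODES[code]}")
```

Plain substring matching is the simplest option; anything not recognized returns `None` and would keep the existing retry behavior.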

Expected Outcome

- When GPU resources are unavailable, users will be notified within 5 minutes (or an administrator-configured timeout) rather than waiting for 60 minutes
- Error messages will clearly indicate that resources are unavailable at the cloud provider level
- Users can make informed decisions about either waiting or selecting different resource specifications
- System administrators can configure appropriate timeout values based on their organization's needs and cloud provider characteristics

Potential Implementation Challenges

- The AWS provider in Terraform treats resource allocation failures as retryable events, making it difficult to distinguish between temporary and permanent unavailability
- Kubernetes doesn't always receive detailed error information from the cloud provider's autoscaling groups
- Immediate failure notifications need to be balanced against allowing reasonable time for resources to become available through autoscaling
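
One way to picture the balance in the last point is a bounded wait: keep polling so autoscaling has a chance to deliver capacity, but give up once a configured deadline passes. This is only a sketch. `wait_for_resource`, `ResourceUnavailable`, and the `probe` callback are hypothetical names, and the clock and sleep functions are injectable purely so the logic can be exercised without real delays.

```python
import time
from typing import Callable


class ResourceUnavailable(Exception):
    """Raised when a resource did not become available before the deadline."""


def wait_for_resource(
    probe: Callable[[], bool],
    timeout_seconds: float,
    poll_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True or the timeout elapses.

    Returns the number of seconds waited on success and raises
    ResourceUnavailable once `timeout_seconds` has passed.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_seconds:
            raise ResourceUnavailable(f"gave up after {timeout_seconds:.0f}s")
        sleep(poll_seconds)
```

A short timeout favors fast feedback; a longer one gives the autoscaler more room. The per-resource values from the proposal would be passed in as `timeout_seconds`.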


Replies: 2 comments


matifali
Mar 26, 2025
Maintainer

I don't think it's a good idea to handle AWS specifically. We should resort to a more general, Terraform-centric solution.

Here is an example showing timeouts configuration for an AWS resource.


bjornrobertsson
Apr 8, 2025
Collaborator Author

The particular pain point here is not the Terraform create/destroy operations, which do have working timeouts.
Those timeouts are not viable when restarting a Workspace (moving the state from Stopped to Running), which is left to the k8s actions and is likely resolvable as a k8s issue (and not AWS specifically).
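
For the restart case, a Kubernetes-level check might look roughly like the sketch below. It relies on the standard Pod status fields (`phase`, and the `PodScheduled` condition with reason `Unschedulable`) and works on a pod object already decoded from the API's JSON. The function names and the 5-minute threshold are assumptions taken from the request's example values, not existing Coder behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def unschedulable_seconds(pod: dict, now: datetime) -> Optional[float]:
    """Return how long a Pending pod has been Unschedulable, or None."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2025-04-08T10:00:00Z.
            since = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            return (now - since).total_seconds()
    return None


def should_fail_start(pod: dict, now: datetime, timeout_seconds: float = 300) -> bool:
    """Decide whether to stop waiting; the 300s default is an assumed threshold."""
    waited = unschedulable_seconds(pod, now)
    return waited is not None and waited >= timeout_seconds


# Illustrative pod status for a workspace pod that cannot get a GPU node.
pending_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu.",
                "lastTransitionTime": "2025-04-08T10:00:00Z",
            }
        ],
    }
}
```

The condition's `message` field could also feed the UI error text. A pod can be briefly Unschedulable while a cluster autoscaler adds a node, which is why a threshold is used instead of failing on the first observation.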
