Problem
When GPU resources (specifically AWS g5.48xlarge instances) are unavailable from the cloud provider, Coder workspaces keep trying to provision them for up to 60 minutes before timing out. Users end up waiting for resources that cannot be allocated. They need an immediate notification when resources are unavailable instead of a long timeout.

Proposed Solution
Add configuration options to customize timeout behavior when resources are unavailable:

Configurable resource availability timeout - Allow administrators to set custom timeout periods specifically for resource availability issues (separate from other timeout categories)

Improved error messaging in UI - Display specific resource unavailability errors to users, including:
- Reason for the wait (insufficient GPU, RAM, CPU, Disk)
- Estimated time until resources might be available (if known)
- Option to cancel and try different resource specifications

Resource type-specific timeouts - Allow different timeout settings for different resource types:

    timeouts:
      resource_unavailable:
        gpu: 5m
        memory: 15m
        cpu: 10m
        disk: 10m

AWS-specific detection mechanism - Integrate with the AWS API to distinguish true resource unavailability from temporary scheduling issues
Implementation Details
Based on examination of the codebase, the following components would need modification:

- In provisioner/terraform/terraform.go, modify the resource creation logic to detect specific AWS resource unavailability errors and fail faster
- Update the workspace state model in coderd/workspaces.go to include more granular status information about resource allocation failures
- Add new configuration fields to coderd/parameter.go to support customizable timeout settings for different resource types
- Enhance the UI components to display more detailed error information when resources are unavailable
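
One way the fail-fast detection might look is a small classifier over the error text surfaced during provisioning. The sketch below is illustrative only: `FailureKind` and `ClassifyProvisionError` are hypothetical names, and it assumes the EC2 error code appears verbatim in the message that reaches the provisioner. `InsufficientInstanceCapacity`, `InstanceLimitExceeded`, and `VcpuLimitExceeded` are EC2 API error codes, but whether plain string matching is reliable enough here would need to be verified against real Terraform output.

```go
package main

import "strings"

// FailureKind is a hypothetical classification of provisioning failures.
type FailureKind int

const (
	FailureUnknown  FailureKind = iota // not recognized; keep existing retry behavior
	FailureCapacity                    // the provider has no capacity for the instance type
	FailureQuota                       // the account limit is reached; retrying will not help
)

// EC2 error codes grouped by the kind of failure they indicate.
var (
	capacityCodes = []string{"InsufficientInstanceCapacity"}
	quotaCodes    = []string{"InstanceLimitExceeded", "VcpuLimitExceeded"}
)

// ClassifyProvisionError inspects an error message and reports whether it
// looks like a capacity or quota problem. It assumes the EC2 error code is
// present verbatim in the message.
func ClassifyProvisionError(msg string) FailureKind {
	for _, code := range capacityCodes {
		if strings.Contains(msg, code) {
			return FailureCapacity
		}
	}
	for _, code := range quotaCodes {
		if strings.Contains(msg, code) {
			return FailureQuota
		}
	}
	return FailureUnknown
}
```

A caller could apply the short resource-unavailable timeout to `FailureCapacity`, fail immediately on `FailureQuota`, and leave `FailureUnknown` on the existing timeout path.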
Expected Outcome
- When GPU resources are unavailable, users will be notified within 5 minutes (or the administrator-configured timeout) rather than waiting for 60 minutes
- Error messages will clearly indicate that resources are unavailable at the cloud provider level
- Users can make informed decisions about either waiting or selecting different resource specifications
- System administrators can configure appropriate timeout values based on their organization's needs and cloud provider characteristics
Related Documentation
These changes would require updates to:

Potential Implementation Challenges
- The AWS provider in Terraform treats resource allocation failures as retryable events, making it difficult to distinguish between temporary and permanent unavailability
- Kubernetes doesn't always receive detailed error information from the cloud provider's autoscaling groups
- Need to balance immediate failure notifications against allowing reasonable time for resources to become available through autoscaling