What happened?
/sig scheduling
I was verifying a scenario where two pods with topology spread constraints and PVCs using WaitForFirstConsumer storage got their PVCs assigned to the same node, leaving one of the pods unschedulable due to conflicting rules (topology spread forbids reusing the node, volume binding requires it).
The suspected root cause was a pod binding failure or scheduler restart, something like:
- Pod#1's PVC is bound to Node#1
- Pod#1 fails to bind
- Pod#2's PVC is bound to Node#1
- Pod#2 is bound to Node#1
- Pod#1 is unschedulable: it has to use Node#1 due to the PVC binding, but can't use it due to pod topology spread (a minimal pod shape illustrating this conflict is sketched below)
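To make the conflict concrete, here is a minimal sketch of the kind of pod involved (label, image, and claim names are hypothetical, not taken from the actual test): once its PVC is pinned to Node#1 via the selected-node annotation, volume binding only allows Node#1, while the spread constraint forbids Node#1 once the other pod is already running there.

```go
// Hypothetical pod shape (names made up for illustration): the PVC pins the pod
// to whichever node ends up in the PVC's selected-node annotation, while the
// topology spread constraint forbids landing on a node that already runs a
// pod with the same label.
package repro

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func spreadPodWithPVC(name, pvcName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"app": "spread-demo"},
		},
		Spec: v1.PodSpec{
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       "kubernetes.io/hostname",
				WhenUnsatisfiable: v1.DoNotSchedule,
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "spread-demo"},
				},
			}},
			Containers: []v1.Container{{
				Name:         "app",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []v1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []v1.Volume{{
				Name: "data",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}
```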
I wrote an integration test to reproduce it, and it is still reproducible even with the NominatedNodeNameForExpectation feature gate enabled.
Upon further investigation, it seems that the NominatedNodeName is cleared on binding errors, so the scheduler isn't aware of Pod#1's initial assignment when scheduling Pod#2:
kubernetes/pkg/scheduler/schedule_one.go, line 387 at c180d67:
`sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)`
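For context, clearNominatedNode in schedule_one.go is, as far as I can tell from the code around that commit (please check the source for the exact form), a NominatingInfo that overrides and empties the pod's nominated node name, so passing it to FailureHandler wipes status.nominatedNodeName on the failed pod:

```go
// Rough quote of the relevant variable in pkg/scheduler/schedule_one.go
// (from memory; verify against the actual commit).
var clearNominatedNode = &framework.NominatingInfo{
	NominatingMode:    framework.ModeOverride,
	NominatedNodeName: "",
}
```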
What did you expect to happen?
I was expecting Pod#2 to be scheduled on another node because the scheduler should be aware of the previous, partially failed, retriable binding of Pod#1.
How can we reproduce it (as minimally and precisely as possible)?
An integration test can follow these steps:
- Create 2 nodes of the same size
- Create storage with WaitForFirstConsumer binding mode
- Create a PVC using dynamic binding (pvc#1) and a pod using that PVC (pod#1). The pod has preferred affinity for node#1 and takes up all of its resources.
- Set up a Bind plugin that will fail to bind pod#1 in the next cycle (see the test-plugin sketch after this list)
- Run scheduler
- Expect: pvc#1 has selected-node annotation equal to node#1 (OK), pod#1 fails to bind (OK), pod#1 has nominated node name equal to node#1 (FAIL, nominated node name seems to be cleared)
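A rough sketch of the failing Bind test plugin, assuming the in-tree framework package; the plugin name, helper field, and pod name are made up for illustration:

```go
// failOnceBinder is a hypothetical test Bind plugin: it rejects the first Bind
// attempt for pod-1 so the scheduler's failure path runs, then binds normally
// through the API server on later attempts and for other pods.
package repro

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type failOnceBinder struct {
	handle framework.Handle
	failed bool
}

func (b *failOnceBinder) Name() string { return "FailOnceBinder" }

func (b *failOnceBinder) Bind(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	if pod.Name == "pod-1" && !b.failed {
		b.failed = true
		return framework.NewStatus(framework.Error, "injected bind failure for pod-1")
	}
	// Otherwise bind the pod to the chosen node via the API server.
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
		Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	if err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
		return framework.AsStatus(err)
	}
	return nil
}
```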
To fully validate the scenario in the bug, add the following steps afterwards:
- Tear down the scheduler
- Add another PVC using dynamic binding (pvc#2) and a pod using that PVC (pod#2). The pod also has preferred affinity for node#1.
- Set up a QueueSort plugin that forces pod#2 to be scheduled before pod#1 (a sketch follows this list)
- Run scheduler
- Expect: pod#2 is scheduled on node#2 (FAIL, actual node#1), pod#1 is scheduled on node#1 (FAIL, unschedulable)
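A rough sketch of such a QueueSort plugin, again with made-up names; in the test it would be registered in place of the default PrioritySort plugin:

```go
// forcedOrder is a hypothetical QueueSort test plugin: it always ranks pod-2
// ahead of pod-1, regardless of priority or timestamps, so the restarted
// scheduler handles pod-2 first.
package repro

import "k8s.io/kubernetes/pkg/scheduler/framework"

type forcedOrder struct{}

func (forcedOrder) Name() string { return "ForcedOrder" }

func (forcedOrder) Less(a, b *framework.QueuedPodInfo) bool {
	rank := func(p *framework.QueuedPodInfo) int {
		if p.Pod.Name == "pod-2" {
			return 0 // pod-2 goes first
		}
		return 1
	}
	return rank(a) < rank(b)
}
```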
Anything else we need to know?
No response
Kubernetes version
At least up to 1.35 inclusive
Cloud provider
Independent
OS version
No response
Install tools
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response