What happened?
/sig scheduling
I was verifying a scenario where two pods with topology spread constraints and PVCs using WaitForFirstConsumer storage got their PVCs assigned to the same node, leaving one of the pods unschedulable due to conflicting rules (topology spread forbids reusing the node, volume binding requires it).
The suspected root cause was a pod binding failure or scheduler restart, something like:
- Pod#1's PVC is bound to Node#1
- Pod#1 fails to bind
- Pod#2's PVC is bound to Node#1
- Pod#2 is bound to Node#1
- Pod#1 is unschedulable: it has to use Node#1 due to the PVC binding, but can't use it due to pod topology spread (a minimal pod shape illustrating this conflict is sketched below)
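To make the conflict concrete, here is a minimal sketch of the kind of pod involved (label, image, and claim names are hypothetical, not taken from the actual test): once its PVC is pinned to Node#1 via the selected-node annotation, volume binding only allows Node#1, while the spread constraint forbids Node#1 once the other pod is already running there.

```go
// Hypothetical pod shape (names made up for illustration): the PVC pins the pod
// to whichever node ends up in the PVC's selected-node annotation, while the
// topology spread constraint forbids landing on a node that already runs a
// pod with the same label.
package repro

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func spreadPodWithPVC(name, pvcName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"app": "spread-demo"},
		},
		Spec: v1.PodSpec{
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       "kubernetes.io/hostname",
				WhenUnsatisfiable: v1.DoNotSchedule,
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "spread-demo"},
				},
			}},
			Containers: []v1.Container{{
				Name:         "app",
				Image:        "registry.k8s.io/pause:3.9",
				VolumeMounts: []v1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []v1.Volume{{
				Name: "data",
				VolumeSource: v1.VolumeSource{
					PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}
```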
I wrote an integration test to reproduce it, and it is still reproducible even with the NominatedNodeNameForExpectation feature gate enabled.
Upon further investigation, it seems that the NominatedNodeName is cleared on binding errors, so the scheduler isn't aware of Pod#1's initial assignment when scheduling Pod#2:
kubernetes/pkg/scheduler/schedule_one.go, line 387 at c180d67:
`sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)`
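For context, clearNominatedNode in schedule_one.go is, as far as I can tell from the code around that commit (please check the source for the exact form), a NominatingInfo that overrides and empties the pod's nominated node name, so passing it to FailureHandler wipes status.nominatedNodeName on the failed pod:

```go
// Rough quote of the relevant variable in pkg/scheduler/schedule_one.go
// (from memory; verify against the actual commit).
var clearNominatedNode = &framework.NominatingInfo{
	NominatingMode:    framework.ModeOverride,
	NominatedNodeName: "",
}
```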
What did you expect to happen?
I was expecting Pod#2 to be scheduled on another node because the scheduler should be aware of the previous, partially failed, retriable binding of Pod#1.
How can we reproduce it (as minimally and precisely as possible)?
An integration test can follow these steps:
- Create 2 nodes of the same size
- Create storage with WaitForFirstConsumer binding mode
- Create a PVC using dynamic binding (pvc#1) and a pod using that PVC (pod#1). The pod has preferred affinity for node#1 and takes up all of its resources.
- Set up a Bind plugin that will fail to bind pod#1 in the next cycle (see the test-plugin sketch after this list)
- Run scheduler
- Expect: pvc#1 has selected-node annotation equal to node#1 (OK), pod#1 fails to bind (OK), pod#1 has nominated node name equal to node#1 (FAIL, nominated node name seems to be cleared)
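A rough sketch of the failing Bind test plugin, assuming the in-tree framework package; the plugin name, helper field, and pod name are made up for illustration:

```go
// failOnceBinder is a hypothetical test Bind plugin: it rejects the first Bind
// attempt for pod-1 so the scheduler's failure path runs, then binds normally
// through the API server on later attempts and for other pods.
package repro

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type failOnceBinder struct {
	handle framework.Handle
	failed bool
}

func (b *failOnceBinder) Name() string { return "FailOnceBinder" }

func (b *failOnceBinder) Bind(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	if pod.Name == "pod-1" && !b.failed {
		b.failed = true
		return framework.NewStatus(framework.Error, "injected bind failure for pod-1")
	}
	// Otherwise bind the pod to the chosen node via the API server.
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
		Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	if err := b.handle.ClientSet().CoreV1().Pods(binding.Namespace).Bind(ctx, binding, metav1.CreateOptions{}); err != nil {
		return framework.AsStatus(err)
	}
	return nil
}
```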
To fully validate the scenario in the bug, add the following steps afterwards:
- Tear down the scheduler
- Add another PVC using dynamic binding (pvc#2) and a pod using that PVC (pod#2). The pod also has preferred affinity for node#1.
- Set up a QueueSort plugin that forces pod#2 to be scheduled before pod#1 (a sketch follows this list)
- Run scheduler
- Expect: pod#2 is scheduled on node#2 (FAIL, actual node#1), pod#1 is scheduled on node#1 (FAIL, unschedulable)
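A rough sketch of such a QueueSort plugin, again with made-up names; in the test it would be registered in place of the default PrioritySort plugin:

```go
// forcedOrder is a hypothetical QueueSort test plugin: it always ranks pod-2
// ahead of pod-1, regardless of priority or timestamps, so the restarted
// scheduler handles pod-2 first.
package repro

import "k8s.io/kubernetes/pkg/scheduler/framework"

type forcedOrder struct{}

func (forcedOrder) Name() string { return "ForcedOrder" }

func (forcedOrder) Less(a, b *framework.QueuedPodInfo) bool {
	rank := func(p *framework.QueuedPodInfo) int {
		if p.Pod.Name == "pod-2" {
			return 0 // pod-2 goes first
		}
		return 1
	}
	return rank(a) < rank(b)
}
```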
Anything else we need to know?
No response
Kubernetes version
At least up to 1.35 inclusive
Cloud provider
Independent
OS version
No response
Install tools
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response