NominatedNodeName is cleared on Bind failure leading to unschedulable pods #135771

Open
Labels
kind/bug — Categorizes issue or PR as related to a bug.
needs-triage — Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/scheduling — Categorizes an issue or PR as relevant to SIG Scheduling.
@brejman

Description

What happened?

/sig scheduling

I was verifying a scenario where two pods with topology spread constraints and PVCs using WaitForFirstConsumer storage got their PVCs assigned to the same node. One of the pods then became unschedulable due to conflicting rules: topology spread prevents using the same node, while volume binding enforces using that node.

The suspected root cause was a pod binding failure or scheduler restart, something like:

  1. Pod#1's PVC is bound to Node#1
  2. Pod#1 fails to bind
  3. Pod#2's PVC is bound to Node#1
  4. Pod#2 is bound to Node#1
  5. Pod#1 is unschedulable because it has to use Node#1 due to the PVC binding but can't use it due to pod topology spread

I wrote an integration test to reproduce it, and it's still reproducible even with the NominatedNodeNameForExpectation feature gate enabled.

Upon further investigation, it seems that the NominatedNodeName is cleared on binding error, so the scheduler isn't aware of Pod#1's initial assignment when scheduling Pod#2.

sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)

What did you expect to happen?

I was expecting Pod#2 to be scheduled on another node because the scheduler should be aware of the previous, partially failed, retriable binding of Pod#1.

How can we reproduce it (as minimally and precisely as possible)?

An integration test can follow these steps:

  1. Create 2 nodes of the same size
  2. Create a StorageClass with WaitForFirstConsumer binding mode
  3. Create a PVC using dynamic binding (pvc#1) and a pod using that PVC (pod#1). The pod has preferred affinity for node#1 and takes up all of its resources.
  4. Set up a Bind plugin that will fail to bind pod#1 in the next cycle
  5. Run the scheduler
  6. Expect: pvc#1 has the selected-node annotation equal to node#1 (OK), pod#1 fails to bind (OK), pod#1 has a nominated node name equal to node#1 (FAIL, the nominated node name seems to be cleared)

To fully validate the scenario in the bug, add the following steps afterwards:

  1. Tear down the scheduler
  2. Add another PVC using dynamic binding (pvc#2) and a pod using that PVC (pod#2). The pod also has preferred affinity for node#1
  3. Set up a QueueSort plugin that will force pod#2 to be scheduled before pod#1
  4. Run the scheduler
  5. Expect: pod#2 is scheduled on node#2 (FAIL, actually scheduled on node#1), pod#1 is scheduled on node#1 (FAIL, unschedulable)

Anything else we need to know?

No response

Kubernetes version

At least up to 1.35 inclusive

Cloud provider

Independent

OS version

No response

Install tools

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response
