@@ -14,7 +14,6 @@
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Workload API Changes](#workload-api-changes)
- [Basic and Gang Policy Extension](#basic-and-gang-policy-extension)
- [Scheduling Framework Extensions](#scheduling-framework-extensions)
- [1. Data Structures](#1-data-structures)
- [2. New Plugin Interfaces](#2-new-plugin-interfaces)
@@ -281,78 +280,6 @@ will be defined in a separate KEP:
Note: For the initial alpha scope, only a single TopologyConstraint will be
supported.

#### Basic and Gang Policy Extension

In the first alpha version of the Workload API, the `Basic` policy was a no-op.
We propose extending the `Basic` and `Gang` policies to accept a `desiredCount`
field. This field serves as a scheduler hint to improve placement decisions
without imposing hard scheduling constraints.

This feature will be gated behind a separate feature gate
(`PodGroupDesiredCount`) to decouple it from the core Gang Scheduling
and Topology Aware Scheduling features.
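As a rough illustration of the gating (not part of the proposal), a scheduler-side code path could treat the field as unset while the gate is disabled. The `basicSchedulingPolicy` sketch type, the `desiredCountHint` helper, and the `featureEnabled` callback below are hypothetical stand-ins, not real kube-scheduler code:

```go
package main

import "fmt"

// basicSchedulingPolicy mirrors the proposed API shape (sketch only).
type basicSchedulingPolicy struct {
	DesiredCount *int32
}

// desiredCountHint returns the desiredCount hint, or nil when the
// PodGroupDesiredCount feature gate is disabled, so that gated code
// paths can treat the field as if it were never set. The featureEnabled
// callback stands in for the real feature-gate check.
func desiredCountHint(p basicSchedulingPolicy, featureEnabled func(name string) bool) *int32 {
	if !featureEnabled("PodGroupDesiredCount") {
		return nil
	}
	return p.DesiredCount
}

func main() {
	n := int32(8)
	p := basicSchedulingPolicy{DesiredCount: &n}
	gateOff := func(string) bool { return false }
	gateOn := func(string) bool { return true }
	fmt.Println(desiredCountHint(p, gateOff) == nil) // gate off: hint ignored
	fmt.Println(*desiredCountHint(p, gateOn))        // gate on: hint visible
}
```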

**1. Basic Policy Update**

We introduce `desiredCount` to the `Basic` policy to allow users to signal the
expected group size for optimization purposes.

```go
// BasicSchedulingPolicy indicates that standard Kubernetes
// scheduling behavior should be used.
type BasicSchedulingPolicy struct {
  // DesiredCount is the expected number of pods that will belong to this
  // PodGroup. This field is a hint to the scheduler to help it make better
  // placement decisions for the group as a whole.
  //
  // Unlike a gang's minCount, this field does not block scheduling. If the
  // number of available pods is less than desiredCount, the scheduler can
  // still attempt to schedule the available pods, but will optimistically
  // try to select a placement that can accommodate the future pods.
  //
  // +optional
  DesiredCount *int32
}
```

**@mm4tt** (Contributor) commented on Mar 9, 2026:

> We know that TAS for non-gangs is tricky due to edge cases (e.g., the scheduler making placement decisions based on a partial view of the pods). I treat this as an (unsuccessful) attempt to fix the problem. While removing it now makes sense until we have a more thorough design, I think we shouldn't just drop the context entirely.
>
> Instead of simply removing it, should we add a short paragraph to the KEP outlining the current TAS limitations for basic policies, noting that this will be addressed in future releases? We could also mention scheduling gates as a currently available (albeit imperfect) mitigation.
>
> The "Phase 2" algorithm section might be a good place for this. We could expand on point "3. If all pods fit, the Placement is marked Feasible" to explain why this works perfectly for gangs (waiting for MinCount in pre-enqueue), but might behave inconsistently for basic policies (where the scheduler's decision depends on the number of observed pods).

A Member replied:

> > Instead of simply removing it, should we add a short paragraph to the KEP outlining the current TAS limitations for basic policies, noting that this will be addressed in future releases? We could also mention scheduling gates as a currently available (albeit imperfect) mitigation.
>
> +1, but I think only the problem of "not-ready" pods is worth addressing as a part of TAS, as it makes the feature work sub-optimally.
>
> Addressing the desired future number of pods is rather an independent extension to TAS and IMHO deserves a separate KEP.

**@44past4** (Contributor, Author) replied on Mar 12, 2026:

> I have updated the Phase 2 algorithm description with a point explaining the limitations of using TAS with the Basic Scheduling Policy. Please take a look.

**2. Gang Policy Update**

We similarly extend the `Gang` policy. While `minCount` provides a hard constraint
for admission, `desiredCount` provides a soft target for placement optimization.

```go
// GangSchedulingPolicy defines the parameters for gang scheduling.
type GangSchedulingPolicy struct {
  // MinCount is the minimum number of pods that must be schedulable or
  // scheduled at the same time for the scheduler to admit the entire group.
  // It must be a positive integer.
  //
  // +required
  MinCount int32

  // DesiredCount is the expected number of pods that will belong to this
  // PodGroup. This field is a hint to the scheduler to help it make better
  // placement decisions for the group as a whole.
  //
  // Unlike minCount, this field does not block scheduling. If the number of
  // available pods is less than desiredCount but at least minCount, the
  // scheduler can still attempt to schedule the available pods, but will
  // optimistically try to select a placement that can accommodate the
  // future pods.
  //
  // When provided, desiredCount must be greater than or equal to minCount.
  //
  // +optional
  DesiredCount *int32
}
```
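For illustration, the `desiredCount >= minCount` invariant stated in the field documentation could be enforced by validation roughly as follows. This is a sketch; `validateGangPolicy` is a hypothetical helper, not the actual apiserver validation code:

```go
package main

import "fmt"

// gangSchedulingPolicy mirrors the proposed API fields (sketch only).
type gangSchedulingPolicy struct {
	MinCount     int32
	DesiredCount *int32
}

// validateGangPolicy checks the invariants stated in the field docs:
// minCount must be positive, and desiredCount, when set, must be
// greater than or equal to minCount.
func validateGangPolicy(p gangSchedulingPolicy) error {
	if p.MinCount <= 0 {
		return fmt.Errorf("minCount must be a positive integer, got %d", p.MinCount)
	}
	if p.DesiredCount != nil && *p.DesiredCount < p.MinCount {
		return fmt.Errorf("desiredCount (%d) must be greater than or equal to minCount (%d)",
			*p.DesiredCount, p.MinCount)
	}
	return nil
}

func main() {
	ok := int32(8)
	fmt.Println(validateGangPolicy(gangSchedulingPolicy{MinCount: 4, DesiredCount: &ok}))
	bad := int32(2)
	fmt.Println(validateGangPolicy(gangSchedulingPolicy{MinCount: 4, DesiredCount: &bad}))
}
```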

These fields allow users to express their "true" workload size more easily and
enable the scheduler to optimize the placement of such pod groups by taking the
desired state into account. Ideally, the scheduler should prefer placements that
can accommodate the full `desiredCount`, even if not all pods have been created
yet. When `desiredCount` is specified, the scheduler can delay scheduling the
first Pod it sees for a short amount of time in order to wait for more Pods to
be observed.
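The sizing intent above can be sketched as follows; `effectiveTargetCount` is a hypothetical helper, not proposed API, showing how a scheduler could combine the observed pod count with the hint:

```go
package main

import "fmt"

// effectiveTargetCount returns the group size the scheduler should try to
// find capacity for: the desiredCount hint when it exceeds the number of
// pods observed so far, otherwise just the observed count.
func effectiveTargetCount(observedPods int32, desiredCount *int32) int32 {
	if desiredCount != nil && *desiredCount > observedPods {
		return *desiredCount
	}
	return observedPods
}

func main() {
	hint := int32(8)
	fmt.Println(effectiveTargetCount(3, &hint))  // plan for 8 pods although only 3 exist yet
	fmt.Println(effectiveTargetCount(10, &hint)) // more pods than the hint: use the observed 10
	fmt.Println(effectiveTargetCount(3, nil))    // no hint: use the observed count
}
```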

### Scheduling Framework Extensions

The scheduler framework requires new plugin interfaces to handle "Placements". A
@@ -518,6 +445,14 @@ The algorithm proceeds in three main phases for a given PodGroup.
- **Potential Optimization:** Pre-filtering can check aggregate resources
requested by PodGroup Pods before running the full simulation.

- **Basic Scheduling Policy Handling:** The current algorithm may exhibit
  inconsistent behavior when used with the PodGroup Basic Scheduling Policy.
  Because the scheduler may only observe a subset of pods when scheduling
  a PodGroup, placement feasibility is only validated for those specific
  pods rather than for the entire group. This limitation may be addressed in
  future releases; currently, scheduling gates may be used as a partial
  mitigation.

  A Member commented:

  > The same applies to Gang Policy if more pods than minCount are created.

  **@44past4** (Contributor, Author) replied on Mar 13, 2026:

  > Yes, it is true that we have a similar problem with Gang Policy, but in
  > the case of Gang Policy we have explicit information on minCount coming
  > from the user, and it is their conscious decision if they set minCount to
  > a number lower than the total number of pods. So in the context of Gang
  > Policy, TAS adheres to explicit user intent and makes sure that it is
  > fulfilled. For the Basic scheduling policy, the user may expect some
  > best-effort behavior from TAS which in many cases will simply not be
  > true, so this disclaimer makes sense to me for Basic Policy but is not
  > necessary for Gang Policy.

- **Heterogeneous PodGroup Handling**: Sequential processing will be used
  initially: pods are processed one by one, and if any pod fails to fit, the
  placement is rejected.
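The sequential feasibility check above can be sketched as follows. The helper names are hypothetical; the real implementation would run the scheduler's Filter plugins against a simulated cluster state rather than a toy predicate:

```go
package main

import "fmt"

// simulatePlacement runs the sequential feasibility check: each pod in the
// group is simulated in turn against the candidate placement, and the first
// pod that does not fit rejects the placement as a whole.
func simulatePlacement(pods []string, fits func(pod string) bool) bool {
	for _, pod := range pods {
		if !fits(pod) {
			return false // one infeasible pod rejects the whole placement
		}
	}
	return true
}

func main() {
	pods := []string{"worker-0", "worker-1", "worker-2"}
	capacity := 2
	used := 0
	fits := func(string) bool { // toy feasibility check: fixed slot capacity
		if used < capacity {
			used++
			return true
		}
		return false
	}
	fmt.Println(simulatePlacement(pods, fits)) // false: the third pod does not fit
}
```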
@@ -702,10 +637,6 @@ kube-scheduler instance being a leader).
- Components depending on the feature gate:
- kube-apiserver
- kube-scheduler
- Feature gate name: PodGroupDesiredCount
- Components depending on the feature gate:
- kube-apiserver
- kube-scheduler
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
@@ -38,10 +38,6 @@ milestone:
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: TopologyAwareWorkloadScheduling
components:
- kube-apiserver
- kube-scheduler
- name: PodGroupDesiredCount
components:
- kube-apiserver
- kube-scheduler