# KEP-5732: Remove desiredCount from the TAS KEP (#5949)
```diff
@@ -14,7 +14,6 @@
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Workload API Changes](#workload-api-changes)
-  - [Basic and Gang Policy Extension](#basic-and-gang-policy-extension)
   - [Scheduling Framework Extensions](#scheduling-framework-extensions)
     - [1. Data Structures](#1-data-structures)
     - [2. New Plugin Interfaces](#2-new-plugin-interfaces)
```
````diff
@@ -281,78 +280,6 @@ will be defined in a separate KEP:
 Note: For the initial alpha scope, only a single TopologyConstraint will be
 supported.
 
-#### Basic and Gang Policy Extension
-
-In the first alpha version of the Workload API, the `Basic` policy was a no-op.
-We propose extending the `Basic` and `Gang` policies to accept a `desiredCount`
-field. This field serves as a scheduler hint to improve placement decisions
-without imposing hard scheduling constraints.
-
-This feature will be gated behind a separate feature gate
-(`PodGroupDesiredCount`) to decouple it from the core Gang Scheduling
-and Topology Aware Scheduling features.
-
-**1. Basic Policy Update**
-
-We introduce `desiredCount` to the `Basic` policy to allow users to signal the
-expected group size for optimization purposes.
-
-```go
-// BasicSchedulingPolicy indicates that standard Kubernetes
-// scheduling behavior should be used.
-type BasicSchedulingPolicy struct {
-	// DesiredCount is the expected number of pods that will belong to this
-	// PodGroup. This field is a hint to the scheduler to help it make better
-	// placement decisions for the group as a whole.
-	//
-	// Unlike gang's minCount, this field does not block scheduling. If the number
-	// of available pods is less than desiredCount, the scheduler can still attempt
-	// to schedule the available pods, but will optimistically try to select a
-	// placement that can accommodate the future pods.
-	//
-	// +optional
-	DesiredCount *int32
-}
-```
-
-**2. Gang Policy Update**
-
-We similarly extend the `Gang` policy. While `minCount` provides a hard constraint
-for admission, `desiredCount` provides a soft target for placement optimization.
-
-```go
-// GangSchedulingPolicy defines the parameters for gang scheduling.
-type GangSchedulingPolicy struct {
-	// MinCount is the minimum number of pods that must be schedulable or scheduled
-	// at the same time for the scheduler to admit the entire group.
-	// It must be a positive integer.
-	//
-	// +required
-	MinCount int32
-
-	// DesiredCount is the expected number of pods that will belong to this
-	// PodGroup. This field is a hint to the scheduler to help it make better
-	// placement decisions for the group as a whole.
-	//
-	// Unlike gang's minCount, this field does not block scheduling. If the number
-	// of available pods is less than desiredCount but at least minCount, the scheduler
-	// can still attempt to schedule the available pods, but will optimistically try
-	// to select a placement that can accommodate the future pods.
-	//
-	// When provided desiredCount must be greater or equal to minCount.
-	//
-	// +optional
-	DesiredCount *int32
-}
-```
-
-Those fields allow users to express their "true" workloads more easily and enables
-the scheduler to optimize the placement of such pod groups by taking the desired state
-into account. Ideally, the scheduler should prefer placements that can accommodate
-the full `desiredCount`, even if not all pods are created yet. When `desiredCount`
-is specified, the scheduler can delay scheduling the first Pod it sees for a short
-amount of time in order to wait for more Pods to be observed.
-
 ### Scheduling Framework Extensions
 
 The scheduler framework requires new plugin interfaces to handle "Placements". A
````
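For reference, the Gang policy text removed above documented an invariant: when set, `desiredCount` must be greater than or equal to `minCount`, and `minCount` must be positive. That check can be sketched with plain Go; the `GangSchedulingPolicy` type below is a simplified stand-in mirroring the removed API shape, and `validateGangPolicy` is a hypothetical helper, not part of the KEP or of Kubernetes:

```go
package main

import "fmt"

// GangSchedulingPolicy is a simplified stand-in mirroring the API shape
// removed by this PR (not the real Kubernetes type).
type GangSchedulingPolicy struct {
	MinCount     int32  // hard admission constraint; must be positive
	DesiredCount *int32 // optional soft target; the field this PR removes
}

// validateGangPolicy is a hypothetical helper enforcing the invariants the
// removed field's documentation stated: minCount must be a positive integer,
// and desiredCount, when provided, must be >= minCount.
func validateGangPolicy(p GangSchedulingPolicy) error {
	if p.MinCount <= 0 {
		return fmt.Errorf("minCount must be a positive integer, got %d", p.MinCount)
	}
	if p.DesiredCount != nil && *p.DesiredCount < p.MinCount {
		return fmt.Errorf("desiredCount (%d) must be >= minCount (%d)",
			*p.DesiredCount, p.MinCount)
	}
	return nil
}

func main() {
	tooLow := int32(2)
	// Rejected: desiredCount below minCount.
	fmt.Println(validateGangPolicy(GangSchedulingPolicy{MinCount: 4, DesiredCount: &tooLow}))
	// Accepted: desiredCount is optional, so omitting it is valid.
	fmt.Println(validateGangPolicy(GangSchedulingPolicy{MinCount: 4}))
}
```

Because the hint was soft, validation was the only hard coupling between the two fields; the scheduler was free to place fewer than `desiredCount` pods.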
```diff
@@ -518,6 +445,14 @@ The algorithm proceeds in three main phases for a given PodGroup.
 - **Potential Optimization:** Pre-filtering can check aggregate resources
   requested by PodGroup Pods before running the full simulation.
 
+- **Basic Scheduling Policy Handling:** The current algorithm may exhibit
+  inconsistent behavior when used with the PodGroup Basic Scheduling Policy.
+  Because the scheduler may only observe a subset of pods when scheduling
+  a PodGroup, placement feasibility is only validated for those specific
+  pods rather than the entire group. This limitation may be addressed in
+  future releases; currently, scheduling gates may be used as a partial
+  mitigation.
+
 - **Heterogeneous PodGroup Handling**: Sequential processing will be used
   initially. Pods are processed sequentially; if any fail, the placement is
   rejected.
```

> **Reviewer comment (Member):** The same applies to Gang Policy if more pods than minCount are created.
>
> **Author reply (Contributor):** Yes, we have a similar problem with the Gang Policy, but there the user gives us explicit information via `minCount`, and setting `minCount` lower than the total number of pods is their conscious decision. In the Gang Policy context, TAS therefore adheres to explicit user intent and makes sure it is fulfilled. With the Basic scheduling policy, users may expect some best-effort behavior from TAS that in many cases will simply not hold, so this disclaimer makes sense to me for the Basic Policy but is not necessary for the Gang Policy.
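The scheduling-gate mitigation mentioned in the new bullet can be sketched without pulling in Kubernetes packages. The types below are simplified stand-ins for a Pod spec's `schedulingGates` field, and the gate name `example.com/wait-for-full-group` is a made-up example:

```go
package main

import "fmt"

// PodSchedulingGate mirrors the core/v1 concept: a pod with any gate set is
// held back and not considered by the scheduler until all gates are removed.
type PodSchedulingGate struct{ Name string }

// PodSpec is a minimal stand-in for the real pod spec.
type PodSpec struct {
	SchedulingGates []PodSchedulingGate
}

// gateGroup adds the same gate to every pod in the group, so none of them is
// scheduled while the controller is still creating the rest of the group.
func gateGroup(pods []*PodSpec, gate string) {
	for _, p := range pods {
		p.SchedulingGates = append(p.SchedulingGates, PodSchedulingGate{Name: gate})
	}
}

// ungateGroup removes the gate once the full group exists, letting the
// scheduler observe and evaluate all pods of the group at once.
func ungateGroup(pods []*PodSpec, gate string) {
	for _, p := range pods {
		kept := p.SchedulingGates[:0]
		for _, g := range p.SchedulingGates {
			if g.Name != gate {
				kept = append(kept, g)
			}
		}
		p.SchedulingGates = kept
	}
}

func main() {
	pods := []*PodSpec{{}, {}, {}}
	gateGroup(pods, "example.com/wait-for-full-group")
	fmt.Println(len(pods[0].SchedulingGates)) // gated: one gate per pod
	ungateGroup(pods, "example.com/wait-for-full-group")
	fmt.Println(len(pods[0].SchedulingGates)) // ungated: gates removed
}
```

In the real API, a workload controller would create every pod of the group with the gate set and patch the gate away only after the full group exists, so the scheduler's view is no longer partial; this is why the KEP text calls it a partial mitigation rather than a fix.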
```diff
@@ -702,10 +637,6 @@ kube-scheduler instance being a leader).
   - Components depending on the feature gate:
     - kube-apiserver
     - kube-scheduler
-  - Feature gate name: PodGroupDesiredCount
-  - Components depending on the feature gate:
-    - kube-apiserver
-    - kube-scheduler
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
```
> **Reviewer comment:** We know that TAS for non-gangs is tricky due to edge cases (e.g., the scheduler making placement decisions based on a partial view of the pods). I treat this as an (unsuccessful) attempt to fix the problem. While removing it now makes sense until we have a more thorough design, I think we shouldn't just drop the context entirely.
>
> Instead of simply removing it, should we add a short paragraph to the KEP outlining the current TAS limitations for basic policies, noting that this will be addressed in future releases? We could also mention scheduling gates as a currently available (albeit imperfect) mitigation.
>
> The 'Phase 2' algorithm section might be a good place for this. We could expand on point '3. If all pods fit, the Placement is marked Feasible' to explain why this works perfectly for gangs (waiting for MinCount in pre-enqueue) but might behave inconsistently for basic policies (where the scheduler's decision depends on the number of observed pods).
> **Reviewer comment:** +1, but I think only the problem of "not-ready" pods is worth addressing as a part of TAS, as it makes the feature work sub-optimally. Addressing the desired future number of pods is rather an independent extension to TAS and IMHO deserves a separate KEP.
> **Author reply:** I have updated the Phase 2 algorithm description with a point explaining the limitations of using TAS with the Basic Scheduling Policy. Please take a look.