Detect DRA resource mismatch against autoscaler node templates #9352
mtrqq wants to merge 6 commits into kubernetes:master
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a metric, together with a reporting process for it, that detects drift between the DRA resources the cluster autoscaler holds in memory and those exposed by real nodes. It does so by comparing the attribute signatures of the resource pools found in the resource slices attached to ready nodes against those found in the template node infos. The exact algorithm for finding mismatches is described in the resourcePoolComparator.CompareResourcePools(...) documentation.
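As a rough illustration of what comparing attribute signatures could look like, here is a minimal sketch. It is not the algorithm documented on resourcePoolComparator.CompareResourcePools(...); the `Device`, `Pool`, `Delta` types, the `signature` and `ComparePools` functions, and the `missing_on_node` / `extra_on_node` delta names are all hypothetical simplifications invented for this example.

```go
package main

import (
	"sort"
	"strings"
)

// Device and Pool are hypothetical, heavily simplified stand-ins for DRA
// devices and resource pools; the real API objects carry far more data.
type Device struct {
	Attributes map[string]string
}

type Pool struct {
	Driver  string
	Devices []Device
}

// Delta describes one mismatch between a node and its template (assumed shape).
type Delta struct {
	Driver    string
	Type      string // "missing_on_node" or "extra_on_node" (illustrative names)
	Signature string
	Count     int
}

// signature builds a canonical string from a device's attributes by sorting
// the keys, so that map iteration order does not affect the result.
func signature(d Device) string {
	keys := make([]string, 0, len(d.Attributes))
	for k := range d.Attributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(d.Attributes[k])
		b.WriteByte(';')
	}
	return b.String()
}

type sigKey struct{ driver, sig string }

// ComparePools counts device signatures per driver in the template, subtracts
// the ones seen on the real node, and reports every non-zero balance.
func ComparePools(node, template []Pool) []Delta {
	balance := make(map[sigKey]int)
	for _, p := range template {
		for _, d := range p.Devices {
			balance[sigKey{p.Driver, signature(d)}]++
		}
	}
	for _, p := range node {
		for _, d := range p.Devices {
			balance[sigKey{p.Driver, signature(d)}]--
		}
	}
	var deltas []Delta
	for k, n := range balance {
		switch {
		case n > 0:
			deltas = append(deltas, Delta{Driver: k.driver, Type: "missing_on_node", Signature: k.sig, Count: n})
		case n < 0:
			deltas = append(deltas, Delta{Driver: k.driver, Type: "extra_on_node", Signature: k.sig, Count: -n})
		}
	}
	// Sort for deterministic output; (driver, signature) pairs are unique.
	sort.Slice(deltas, func(i, j int) bool {
		if deltas[i].Driver != deltas[j].Driver {
			return deltas[i].Driver < deltas[j].Driver
		}
		return deltas[i].Signature < deltas[j].Signature
	})
	return deltas
}
```

Calling `ComparePools(nodePools, templatePools)` from a `main` function or a test returns an empty slice when the template matches the node, and one `Delta` per differing signature otherwise. Note that this sketch allocates freely for readability, unlike the implementation described in this PR.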
The intended workflow is to maintain a gauge indicating the number of resource deltas found, broken down by driver and delta type. For closer inspection, the autoscaler logs can be checked for a summary of the deltas found, limited to a fixed number of nodes.
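To make the gauge-plus-logs workflow concrete, the sketch below aggregates deltas into per-(driver, delta type) counts and builds a log summary capped at a fixed number of nodes. The `NodeDelta` and `GaugeKey` types and the `GaugeValues` and `LogSummary` functions are assumptions made for illustration; the actual metric name, labels, and log format used by this PR are not shown here, and a real implementation would feed the counts into the autoscaler's metrics registry rather than return a map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// NodeDelta is a hypothetical record of one mismatch found on one node.
type NodeDelta struct {
	Node   string
	Driver string
	Type   string
}

// GaugeKey mirrors the assumed gauge labels: driver and delta type.
type GaugeKey struct {
	Driver string
	Type   string
}

// GaugeValues counts deltas per (driver, delta type) pair. Each entry is the
// value a gauge with those labels would be set to for the current loop.
func GaugeValues(deltas []NodeDelta) map[GaugeKey]int {
	values := make(map[GaugeKey]int)
	for _, d := range deltas {
		values[GaugeKey{Driver: d.Driver, Type: d.Type}]++
	}
	return values
}

// LogSummary renders one line per node, in name order, for at most maxNodes
// nodes, followed by a line stating how many nodes were left out.
func LogSummary(deltas []NodeDelta, maxNodes int) string {
	perNode := make(map[string]int)
	for _, d := range deltas {
		perNode[d.Node]++
	}
	nodes := make([]string, 0, len(perNode))
	for n := range perNode {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)

	var b strings.Builder
	for i, n := range nodes {
		if i == maxNodes {
			fmt.Fprintf(&b, "... and %d more node(s)\n", len(nodes)-maxNodes)
			break
		}
		fmt.Fprintf(&b, "%s: %d delta(s)\n", n, perNode[n])
	}
	return b.String()
}
```

Capping the summary keeps log volume bounded on large clusters, while the gauge still reflects every delta found.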
The performance impact of this change should be negligible: the comparison performs close to zero allocations, takes at worst 5 microseconds for a node with a lot of deltas, and runs only once per autoscaler loop. Because it allocates almost nothing, it should not make CA memory pressure any worse. Non-DRA nodes are barely affected by the change, as little to no work is performed for them during comparison.
Which issue(s) this PR fixes:
Special notes for your reviewer:
In this change I've tried to apply a couple of practices to bring allocations down and drastically improve performance, namely keeping buffers around for reuse. While the change differs in style from most of the cluster autoscaler code, primarily because of the buffer reuse, it achieves a close-to-zero allocation rate per function call.
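For reviewers less familiar with the buffer-reuse style, here is a small sketch of the general technique, assuming a comparator that keeps scratch slices between calls. The `Comparator` type and its `CountMismatches` method are hypothetical and much simpler than the code in this PR; they only show how truncating a slice with `buf[:0]` lets later calls reuse the backing array instead of allocating a new one.

```go
package main

import "sort"

// Comparator is a hypothetical type that owns scratch buffers so repeated
// calls can reuse their backing arrays once they have grown large enough.
type Comparator struct {
	nodeBuf []string
	tmplBuf []string
}

// CountMismatches returns how many signatures appear on only one side,
// treating both inputs as multisets. The inputs are copied into the scratch
// buffers so the caller's slices are not reordered.
func (c *Comparator) CountMismatches(node, template []string) int {
	// buf[:0] keeps the capacity and resets the length, so append only
	// allocates when the input is larger than anything seen before.
	c.nodeBuf = append(c.nodeBuf[:0], node...)
	c.tmplBuf = append(c.tmplBuf[:0], template...)
	sort.Strings(c.nodeBuf)
	sort.Strings(c.tmplBuf)

	// Merge-style walk over the two sorted buffers.
	i, j, mismatches := 0, 0, 0
	for i < len(c.nodeBuf) && j < len(c.tmplBuf) {
		switch {
		case c.nodeBuf[i] == c.tmplBuf[j]:
			i++
			j++
		case c.nodeBuf[i] < c.tmplBuf[j]:
			mismatches++ // present only on the node
			i++
		default:
			mismatches++ // present only in the template
			j++
		}
	}
	// Whatever is left on either side has no counterpart.
	return mismatches + (len(c.nodeBuf) - i) + (len(c.tmplBuf) - j)
}
```

The trade-off is that such a comparator holds state and is not safe for concurrent use without extra synchronization, which is part of why this style reads differently from code that allocates fresh slices on every call.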
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: