Detect DRA resource mismatch against autoscaler node templates#9352

Open
mtrqq wants to merge 6 commits into kubernetes:master from mtrqq:diff-devices

@mtrqq mtrqq commented Mar 12, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a metric, together with a process that reports it, to detect drift between the cluster autoscaler's in-memory DRA resources and the real nodes. It compares the attribute signatures of the resource pools found in the resource slices attached to ready nodes against those of the template node infos. The exact algorithm for finding mismatches is described in the resourcePoolComparator.CompareResourcePools(...) documentation.
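To illustrate the general idea of comparing attribute signatures, here is a minimal sketch. It is not the PR's implementation: `Device`, `Pool`, `Delta`, `signature` and `comparePools` are hypothetical names, the types are heavily simplified stand-ins for the real DRA API objects, and only an exact match is modelled. Unlike the PR's code, this version allocates freely for the sake of readability.

```go
package main

import (
	"sort"
	"strings"
)

// Device is a hypothetical, simplified stand-in for a DRA device:
// only its attribute key/value pairs matter in this sketch.
type Device struct {
	Attributes map[string]string
}

// Pool is a hypothetical, simplified resource pool published by one driver.
type Pool struct {
	Driver  string
	Devices []Device
}

// Delta counts devices whose signature appears on only one side.
type Delta struct {
	Driver        string
	MissingOnNode int // present in the template, absent from the real node
	ExtraOnNode   int // present on the real node, absent from the template
}

// signature builds a deterministic string from a device's attributes.
// Keys are sorted so that map iteration order does not affect the result.
func signature(d Device) string {
	keys := make([]string, 0, len(d.Attributes))
	for k := range d.Attributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	for _, k := range keys {
		sb.WriteString(k)
		sb.WriteByte('=')
		sb.WriteString(d.Attributes[k])
		sb.WriteByte(';')
	}
	return sb.String()
}

// comparePools treats each pool as a multiset of device signatures and
// reports how many signatures are unmatched on either side.
func comparePools(template, actual Pool) Delta {
	counts := make(map[string]int)
	for _, d := range template.Devices {
		counts[signature(d)]++
	}
	for _, d := range actual.Devices {
		counts[signature(d)]--
	}
	delta := Delta{Driver: template.Driver}
	for _, c := range counts {
		if c > 0 {
			delta.MissingOnNode += c
		} else if c < 0 {
			delta.ExtraOnNode -= c
		}
	}
	return delta
}
```

With a template of two identical devices and a real node where one of them reports a different attribute value, `comparePools` returns one missing and one extra device for that driver.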

The intended workflow is that we maintain a gauge indicating the number of resource deltas found, broken down by driver and delta type. For closer inspection, the autoscaler logs can be checked for a summary of the deltas found for a fixed number of nodes.
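As a rough sketch of that workflow, the snippet below aggregates deltas into per-label counts that a gauge could be set to, and caps the number of nodes summarized in the logs. All names here (`foundDelta`, `deltaKey`, `gaugeValues`, `nodesToLog`) and the delta type strings used with them are assumptions made for illustration; the actual metric name, label names and log format are defined by the PR, not by this example.

```go
package main

// foundDelta is a hypothetical record of a single mismatch on one node.
type foundDelta struct {
	Node      string
	Driver    string
	DeltaType string
}

// deltaKey is a hypothetical gauge label set: one series per
// driver and delta type.
type deltaKey struct {
	Driver    string
	DeltaType string
}

// gaugeValues aggregates the deltas found in one loop iteration into
// the value each (driver, delta type) series would be set to.
func gaugeValues(deltas []foundDelta) map[deltaKey]int {
	out := make(map[deltaKey]int)
	for _, d := range deltas {
		out[deltaKey{Driver: d.Driver, DeltaType: d.DeltaType}]++
	}
	return out
}

// nodesToLog returns at most limit distinct node names in first-seen
// order, so that log output stays bounded on large clusters.
func nodesToLog(deltas []foundDelta, limit int) []string {
	seen := make(map[string]bool)
	var nodes []string
	for _, d := range deltas {
		if len(nodes) >= limit {
			break
		}
		if !seen[d.Node] {
			seen[d.Node] = true
			nodes = append(nodes, d.Node)
		}
	}
	return nodes
}
```

Keeping the node name out of the gauge labels and only in the logs avoids one metric series per node, which is presumably why the description separates the two.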

The performance impact of this change should be negligible: the comparison runs once per loop, performs close to zero allocations, and in the benchmarks below costs roughly 5 to 8 microseconds for a node with many deltas. It therefore shouldn't make CA memory pressure any worse. Non-DRA nodes are barely affected, as little to no work is performed for them during comparison.

goos: linux
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/simulator/dynamicresources/comparator
cpu: AMD EPYC 7B12
BenchmarkReportResourceDiscrepancies_1Node_0Discrepancies-24                               10000               635.3 ns/op             0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_NoDRA-24                                        10000                13.45 ns/op            0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_10Discrepancies-24                              10000              7007 ns/op            3488 B/op          4 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_10DiscrepanciesEach-24                        10000             47358 ns/op           19932 B/op         11 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_NoDRA-24                                      10000                21.44 ns/op            0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_0Discrepancies-24                             10000              5906 ns/op               0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_10Drivers_10Discrepancies-24                    10000              7891 ns/op            3488 B/op          4 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_10Drivers_10DiscrepanciesEach-24              10000             49441 ns/op           19926 B/op         11 allocs/op
BenchmarkCompareDraResourcesExact-24                                                       10000              4440 ns/op               0 B/op          0 allocs/op
BenchmarkCompareDraResourcesFuzzy-24                                                       10000              4738 ns/op               0 B/op          0 allocs/op
BenchmarkCompareDraResourcesRankingFuzzy-24                                                10000              8412 ns/op               0 B/op          0 allocs/op

Which issue(s) this PR fixes:

Special notes for your reviewer:

In this change I've applied a couple of practices to bring allocations down and improve performance, most notably keeping buffers around for reuse. While this makes the code differ in style from most of the cluster autoscaler, it achieves a close to zero allocation rate per function call.
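For reviewers less familiar with the pattern, here is a minimal sketch of buffer reuse in Go. The `diffBuffer` type and its `missing` method are hypothetical and far simpler than the comparator in this PR; they only show how a scratch slice kept on a struct lets repeated calls avoid new allocations once the buffer has grown.

```go
package main

// diffBuffer is a hypothetical helper that owns a scratch slice which
// is reused across calls instead of being allocated each time.
type diffBuffer struct {
	scratch []string
}

// missing returns the elements of want that are absent from have.
// The returned slice aliases the internal buffer and is only valid
// until the next call to missing.
func (b *diffBuffer) missing(want, have []string) []string {
	// Reset the length but keep the capacity, so append reuses the
	// existing backing array whenever it is large enough.
	b.scratch = b.scratch[:0]
	for _, w := range want {
		found := false
		for _, h := range have {
			if h == w {
				found = true
				break
			}
		}
		if !found {
			b.scratch = append(b.scratch, w)
		}
	}
	return b.scratch
}
```

The trade-off is the one noted above: callers must not hold on to a result across calls, which is less conventional than returning a fresh slice, but calls that fit within the existing capacity do not allocate.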

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.


@k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), kind/feature (categorizes issue or PR as related to a new feature), do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels), and do-not-merge/needs-area labels on Mar 12, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mtrqq
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/cluster-autoscaler, size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files), and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels and removed the do-not-merge/needs-area label on Mar 12, 2026
@mtrqq force-pushed the diff-devices branch 5 times, most recently from 442fb16 to 7f8128f, on March 13, 2026 12:19
@mtrqq marked this pull request as ready for review on March 13, 2026 12:45
@k8s-ci-robot removed the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) label on Mar 13, 2026
