Detect DRA resource mismatch against autoscaler node templates by mtrqq · Pull Request #9352 · kubernetes/autoscaler

mtrqq · 2026-03-12T17:56:02Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a metric, and a reporting process for it, that detects drift between the cluster autoscaler's in-memory DRA resources and real nodes. It compares the attribute signatures of the resource pools found in the resource slices attached to ready nodes against those of the template node infos. The exact algorithm for finding mismatches is described in the resourcePoolComparator.CompareResourcePools(...) documentation.
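
To illustrate the idea of comparing attribute signatures, here is a minimal sketch. It is not the PR's resourcePoolComparator.CompareResourcePools implementation, whose details are not shown in this description. The `Device`, `Pool`, `Delta`, `signature` and `ComparePools` names are hypothetical, and devices are simplified to string attributes only.

```go
package main

import (
	"sort"
	"strings"
)

// Device is a hypothetical, simplified stand-in for a DRA device.
type Device struct {
	Name       string
	Attributes map[string]string
}

// Pool is a hypothetical stand-in for a resource pool published by one driver.
type Pool struct {
	Driver  string
	Devices []Device
}

// Delta records how many devices share a signature on each side.
type Delta struct {
	Signature string
	Template  int
	Actual    int
}

// signature builds an order-independent string from a device's attributes.
func signature(d Device) string {
	keys := make([]string, 0, len(d.Attributes))
	for k := range d.Attributes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	for _, k := range keys {
		sb.WriteString(k)
		sb.WriteByte('=')
		sb.WriteString(d.Attributes[k])
		sb.WriteByte(';')
	}
	return sb.String()
}

// ComparePools returns the signatures whose device counts differ between
// the template pool and the pool observed on a real node, sorted by signature.
func ComparePools(template, actual Pool) []Delta {
	counts := make(map[string]*Delta)
	bump := func(d Device, isTemplate bool) {
		sig := signature(d)
		e, ok := counts[sig]
		if !ok {
			e = &Delta{Signature: sig}
			counts[sig] = e
		}
		if isTemplate {
			e.Template++
		} else {
			e.Actual++
		}
	}
	for _, d := range template.Devices {
		bump(d, true)
	}
	for _, d := range actual.Devices {
		bump(d, false)
	}

	var deltas []Delta
	for _, e := range counts {
		if e.Template != e.Actual {
			deltas = append(deltas, *e)
		}
	}
	sort.Slice(deltas, func(i, j int) bool {
		return deltas[i].Signature < deltas[j].Signature
	})
	return deltas
}
```

Matching on signatures rather than device names means two pools are considered equal when they expose the same kinds of devices in the same quantities, even if individual device names differ. This sketch allocates freely for readability, unlike the comparator described in this PR.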

The workflow for this metric is to maintain a gauge indicating the number of resource deltas found, along with their drivers and delta types. For closer inspection, the autoscaler logs can be checked for a summary of the deltas found for a fixed number of nodes.
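
The aggregation behind such a gauge can be sketched as follows. This is an assumption-laden illustration, not the PR's code: the `ResourceDelta`, `DeltaType` and `gaugeKey` types, the "missing"/"extra" type names, and the `Aggregate` function are all hypothetical, and no real metrics library is used. The returned map would be the values to set on the gauge, and the returned node list is the capped set of nodes to summarize in logs.

```go
package main

import "sort"

// DeltaType is a hypothetical classification of a mismatch.
type DeltaType string

const (
	DeltaMissing DeltaType = "missing" // hypothetical name
	DeltaExtra   DeltaType = "extra"   // hypothetical name
)

// ResourceDelta is a hypothetical record of one mismatch found on a node.
type ResourceDelta struct {
	Node   string
	Driver string
	Type   DeltaType
}

// gaugeKey groups deltas the way gauge labels would.
type gaugeKey struct {
	Driver string
	Type   DeltaType
}

// Aggregate counts deltas per (driver, type) and returns the names of at
// most maxNodes affected nodes, in sorted order, for log summaries.
func Aggregate(deltas []ResourceDelta, maxNodes int) (map[gaugeKey]int, []string) {
	counts := make(map[gaugeKey]int)
	seen := make(map[string]bool)
	var nodes []string
	for _, d := range deltas {
		counts[gaugeKey{Driver: d.Driver, Type: d.Type}]++
		if !seen[d.Node] {
			seen[d.Node] = true
			nodes = append(nodes, d.Node)
		}
	}
	sort.Strings(nodes)
	if maxNodes < 0 {
		maxNodes = 0
	}
	if len(nodes) > maxNodes {
		nodes = nodes[:maxNodes]
	}
	return counts, nodes
}
```

Capping the logged nodes keeps log volume bounded on large clusters while the gauge still reflects every delta.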

The performance impact of this change should be negligible: the comparison performs almost no allocations, takes at worst about 5 microseconds for a node with many deltas, and is called only once per loop. Because the allocation count is close to zero, it should not add to CA memory pressure. Non-DRA nodes are barely affected, as little to no work is done for them during comparison.

goos: linux
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/simulator/dynamicresources/comparator
cpu: AMD EPYC 7B12
BenchmarkReportResourceDiscrepancies_1Node_0Discrepancies-24                               10000               635.3 ns/op             0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_NoDRA-24                                        10000                13.45 ns/op            0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_10Discrepancies-24                              10000              7007 ns/op            3488 B/op          4 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_10DiscrepanciesEach-24                        10000             47358 ns/op           19932 B/op         11 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_NoDRA-24                                      10000                21.44 ns/op            0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_0Discrepancies-24                             10000              5906 ns/op               0 B/op          0 allocs/op
BenchmarkReportResourceDiscrepancies_1Node_10Drivers_10Discrepancies-24                    10000              7891 ns/op            3488 B/op          4 allocs/op
BenchmarkReportResourceDiscrepancies_10Nodes_10Drivers_10DiscrepanciesEach-24              10000             49441 ns/op           19926 B/op         11 allocs/op
BenchmarkCompareDraResourcesExact-24                                                       10000              4440 ns/op               0 B/op          0 allocs/op
BenchmarkCompareDraResourcesFuzzy-24                                                       10000              4738 ns/op               0 B/op          0 allocs/op
BenchmarkCompareDraResourcesRankingFuzzy-24                                                10000              8412 ns/op               0 B/op          0 allocs/op

Which issue(s) this PR fixes:

Special notes for your reviewer:

In this change I've tried to apply a couple of practices to bring allocations down and drastically improve performance, namely storing buffers for reuse. While the change differs in style from most of the cluster autoscaler code, primarily because of the buffer reuse, it achieves a close-to-zero allocation rate per function call.
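
For readers unfamiliar with the buffer-reuse pattern, a minimal sketch follows. It is not the comparator from this PR: `reusableComparator`, `MissingNames` and `AllocsPerCall` are hypothetical names, and the comparison is reduced to a plain string-set difference. The only real API used is `testing.AllocsPerRun` from the standard library.

```go
package main

import "testing"

// reusableComparator is a hypothetical comparator that keeps a scratch
// slice between calls instead of allocating a new result each time.
type reusableComparator struct {
	scratch []string
}

// MissingNames returns the entries of want that are absent from got.
// The result aliases the internal buffer and is only valid until the
// next call on the same comparator.
func (c *reusableComparator) MissingNames(want, got []string) []string {
	c.scratch = c.scratch[:0] // keep capacity, drop old contents
	for _, w := range want {
		found := false
		for _, g := range got {
			if w == g {
				found = true
				break
			}
		}
		if !found {
			c.scratch = append(c.scratch, w)
		}
	}
	return c.scratch
}

// AllocsPerCall reports the average number of heap allocations per call
// once the scratch buffer has grown to its steady-state capacity.
func AllocsPerCall(c *reusableComparator, want, got []string) float64 {
	c.MissingNames(want, got) // warm up the buffer
	return testing.AllocsPerRun(100, func() {
		c.MissingNames(want, got)
	})
}
```

The trade-off is the one noted above: callers must not retain the returned slice across calls, which is unusual for the surrounding codebase, but steady-state calls no longer need to allocate.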

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2026-03-12T17:56:07Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2026-03-12T17:56:07Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-03-12T17:56:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mtrqq
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from aleksandra-malinowska and vadasambar March 12, 2026 17:56

k8s-ci-robot added the area/cluster-autoscaler, size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files), and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels and removed the do-not-merge/needs-area label Mar 12, 2026

mtrqq force-pushed the diff-devices branch 5 times, most recently from 442fb16 to 7f8128f Compare March 13, 2026 12:19

mtrqq marked this pull request as ready for review March 13, 2026 12:45

k8s-ci-robot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Mar 13, 2026

k8s-ci-robot requested review from BigDarkClown and x13n March 13, 2026 12:45

mtrqq added 6 commits March 13, 2026 13:08

Implement DRA utilities to diff two collections of resource slices

c9471fc

Implement NodeResourcesComparator to compare nodes against templates and produce aggregate summaries

0cf5f47

Register resource mismatch metric along with mismatch types.

bea5ad4

Drastically optimize comparator memory allocations and performance.

98ace1c

Integrate NodeResourcesComparator with DRA custom resource processor

6695291

Add NewCustomTestSnapshotAndHandle into comparator testing to comply with snapshot API changes

a951392

mtrqq force-pushed the diff-devices branch from 7f8128f to a951392 Compare March 13, 2026 14:21

mtrqq wants to merge 6 commits into kubernetes:master from mtrqq:diff-devices