feat(BA-4050): Add Prometheus-based kernel live stat query action #10998

Open
seedspirit wants to merge 5 commits into main from feat/BA-4050

@seedspirit
Contributor

Summary

  • Add query_kernel_live_stat_batch to UtilizationMetricService that issues gauge / diff / rate PromQL queries in parallel via asyncio.gather, reducing latency from 3×RTT to 1×RTT
  • Add KernelNode._entry_to_live_stat_mapping to transform Prometheus pipeline output into the legacy Valkey-compatible GQL shape
  • Add KernelLiveStatAction / KernelLiveStatActionResult action pair and wire through UtilizationMetricProcessors
  • Add unit tests for the batch query pipeline and component-level Valkey equivalence tests

Test plan

  • Unit tests: tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py
  • Component tests: tests/component/metric/test_kernel_live_stat_valkey_equivalence.py
  • CI checks

Resolves BA-4050

🤖 Generated with Claude Code

@github-actions github-actions bot added the size:XL (500~ LoC), comp:manager (Related to Manager component), and comp:common (Related to Common component) labels Apr 13, 2026
@seedspirit seedspirit changed the title from "feat(BA-4050): add Prometheus-based kernel live stat query pipeline" to "feat(BA-4050): Add Prometheus-based kernel live stat query pipeline" Apr 13, 2026
Comment on lines +68 to +72
gauge, diff, rate = await asyncio.gather(
    self._query_gauge_kernel_live_stat(action.kernel_ids),
    self._query_diff_kernel_live_stat(action.kernel_ids),
    self._query_rate_kernel_live_stat(action.kernel_ids),
)
Contributor Author

@seedspirit seedspirit Apr 14, 2026

It was difficult to write a single PromQL query that handles all three value types at once, so I issue the three queries concurrently with asyncio.gather. This should be safe for the Prometheus client, since it uses a client pool, but I'm not entirely sure this is the right approach.

Member

@fregataa fregataa Apr 16, 2026

Could you elaborate on what makes a single batch query difficult?
Never mind, I got it. It is accurate to let Prometheus compute the diff/rate metrics.
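
The concurrency pattern discussed in this thread can be sketched in isolation. This is a minimal illustration, not the PR's implementation: `fake_query`, `RTT_SECONDS`, and `KINDS` are hypothetical stand-ins, and the real methods call the Prometheus client instead of sleeping.

```python
import asyncio

RTT_SECONDS = 0.05  # hypothetical round-trip time of one query
KINDS = ("gauge", "diff", "rate")


async def fake_query(kind: str, kernel_ids: list[str]) -> dict[str, str]:
    """Stand-in for one PromQL instant query (not the PR's client call)."""
    await asyncio.sleep(RTT_SECONDS)  # simulate one network round trip
    return {kernel_id: kind for kernel_id in kernel_ids}


async def query_sequential(kernel_ids: list[str]) -> list[dict[str, str]]:
    # Each await finishes before the next starts: about three round trips.
    return [await fake_query(kind, kernel_ids) for kind in KINDS]


async def query_parallel(kernel_ids: list[str]) -> list[dict[str, str]]:
    # gather schedules all three coroutines at once and returns results in
    # argument order, so total wall time is about one round trip.
    return await asyncio.gather(*(fake_query(kind, kernel_ids) for kind in KINDS))
```

Both functions return the same results in the same order; only the wall time differs. One caveat worth keeping in mind: by default `asyncio.gather` propagates the first exception raised by any of the queries, and `return_exceptions=True` is an option if partial results are acceptable.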

seedspirit added a commit that referenced this pull request Apr 15, 2026
seedspirit added a commit that referenced this pull request Apr 15, 2026
@seedspirit seedspirit changed the title from "feat(BA-4050): Add Prometheus-based kernel live stat query pipeline" to "feat(BA-4050): Add Prometheus-based kernel live stat query action" Apr 15, 2026
seedspirit added a commit that referenced this pull request Apr 16, 2026
@seedspirit seedspirit requested review from a team and HyeockJinKim April 16, 2026 07:01
@seedspirit seedspirit marked this pull request as ready for review April 16, 2026 07:01
Copilot AI review requested due to automatic review settings April 16, 2026 07:01
Contributor

Copilot AI left a comment

Pull request overview

Adds a Prometheus-backed “kernel live stat” batch query path to the manager metric service, exposing it as a new action and integrating it into the utilization metric processor package.

Changes:

  • Implement UtilizationMetricService.query_kernel_live_stat_batch() that executes GAUGE/DIFF/RATE PromQL instant queries in parallel and merges results per kernel.
  • Introduce new live-stat action/result types and supporting result/value DTOs for per-kernel batch outputs.
  • Add unit tests for the batch query pipeline (including grouping, empty-kernel behavior, and PromQL rendering expectations).
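
As a rough illustration of the "merges results per kernel" step, the sketch below folds three result lists into one nested mapping keyed by kernel. The row shape `(kernel_id, metric_name, value)` and the function name `merge_by_kernel` are assumptions made for this example; the PR's actual DTOs and result types are not shown on this page.

```python
Row = tuple[str, str, float]  # assumed shape: (kernel_id, metric_name, value)


def merge_by_kernel(
    gauge: list[Row], diff: list[Row], rate: list[Row]
) -> dict[str, dict[str, dict[str, float]]]:
    """Group rows as merged[kernel_id][metric_name][value_type] = value."""
    merged: dict[str, dict[str, dict[str, float]]] = {}
    for value_type, rows in (("gauge", gauge), ("diff", diff), ("rate", rate)):
        for kernel_id, metric_name, value in rows:
            per_kernel = merged.setdefault(kernel_id, {})
            per_kernel.setdefault(metric_name, {})[value_type] = value
    return merged
```

A kernel that appears in only one of the three result lists still gets an entry, and an empty input yields an empty mapping.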

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py | Adds unit tests for the new batch live-stat query pipeline. |
| src/ai/backend/manager/services/metric/types.py | Adds kernel live-stat result/value dataclasses and metric classification constants. |
| src/ai/backend/manager/services/metric/root_service.py | Implements the batch live-stat Prometheus querying and query preset construction. |
| src/ai/backend/manager/services/metric/processors/utilization_metric.py | Wires the new live-stat action into the utilization metric processor package. |
| src/ai/backend/manager/services/metric/container_metric.py | Refactors metric type detection to use shared DIFF/RATE metric sets; uses UnreachableError. |
| src/ai/backend/manager/services/metric/actions/live_stat.py | Introduces KernelLiveStatAction and KernelLiveStatActionResult. |
| src/ai/backend/common/data/permission/types.py | Adds EntityType.CONTAINER_LIVE_STAT for permission/action typing. |
| changes/10998.feature.md | Adds a changelog entry for the new batch live-stat pipeline. |

Comment thread src/ai/backend/manager/services/metric/types.py Outdated
Comment thread src/ai/backend/manager/services/metric/types.py Outdated
Comment thread tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py Outdated
Comment thread tests/unit/manager/services/utilization_metric/test_kernel_live_stat_batch.py Outdated
Comment thread src/ai/backend/manager/services/metric/actions/live_stat.py Outdated
Comment thread src/ai/backend/manager/services/metric/types.py Outdated
Add service-layer infrastructure for querying kernel live stats from
Prometheus instead of Valkey. Introduces KernelLiveStatAction, batch
query methods (gauge/diff/rate), and unit tests for the pipeline.
@seedspirit seedspirit requested review from a team and jopemachine April 16, 2026 08:54
Member

@fregataa fregataa left a comment

How about implementing the query_kernel_..._metrics() methods in the Prometheus client class rather than in the service?

Comment on lines +146 to +151
template = (
    "sum by ({group_by})(rate("
    + CONTAINER_UTILIZATION_METRIC_NAME
    + "{{{labels}}}[{window}]))"
    " / " + str(UTILIZATION_METRIC_INTERVAL)
)
Member

How about

template = (
    "sum by ({group_by})("
    "rate({metric}{{{labels}}}[{window}])"
    ")"
    " / {interval}"
).format(
    group_by=group_by,
    metric=CONTAINER_UTILIZATION_METRIC_NAME,
    labels=labels,
    window=window,
    interval=UTILIZATION_METRIC_INTERVAL,
)

or

template = (
    f"sum by ({group_by})("
    f"rate({CONTAINER_UTILIZATION_METRIC_NAME}{{{labels}}}[{window}])"
    f")"
    f" / {UTILIZATION_METRIC_INTERVAL}"
)
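
The brace escaping in these two suggestions is easy to get wrong, so here is a small self-contained check that both styles render the same query. In both `str.format` and f-strings a literal brace is written doubled, so a placeholder wrapped in literal PromQL braces takes three braces on each side. `METRIC_NAME` and `INTERVAL` are hypothetical placeholder values, not the PR's actual constants.

```python
METRIC_NAME = "example_container_metric"  # hypothetical placeholder value
INTERVAL = 5  # hypothetical placeholder value


def render_with_format(group_by: str, labels: str, window: str) -> str:
    # "{{" and "}}" become literal braces; "{labels}" is substituted.
    return (
        "sum by ({group_by})("
        "rate({metric}{{{labels}}}[{window}])"
        ")"
        " / {interval}"
    ).format(
        group_by=group_by,
        metric=METRIC_NAME,
        labels=labels,
        window=window,
        interval=INTERVAL,
    )


def render_with_fstring(group_by: str, labels: str, window: str) -> str:
    # Same escaping rule applies inside f-strings.
    return (
        f"sum by ({group_by})("
        f"rate({METRIC_NAME}{{{labels}}}[{window}])"
        f")"
        f" / {INTERVAL}"
    )
```

Note that both variants render the query immediately. If the template is meant to stay a reusable format string that is filled in later, as the concatenated original appears to be, the placeholders must survive construction and neither variant is a drop-in replacement.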

Comment on lines +101 to +106
if (
    info.kernel_id is None
    or info.container_metric_name is None
    or info.value_type is None
    or not metric.values
):
Member

How about adding a method to the Metric type that performs this check?
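
One possible shape for such a method is sketched below. The `MetricInfo` and `Metric` classes here are simplified, hypothetical stand-ins that carry only the fields used in the quoted condition, and the method name `is_usable` is an assumption; the real types in the codebase may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricInfo:  # hypothetical, reduced to the fields in the condition
    kernel_id: Optional[str] = None
    container_metric_name: Optional[str] = None
    value_type: Optional[str] = None


@dataclass(frozen=True)
class Metric:  # hypothetical stand-in for the Prometheus result type
    info: MetricInfo
    values: tuple[tuple[float, str], ...] = ()

    def is_usable(self) -> bool:
        """True when every label needed for a live stat is present and
        at least one sample exists."""
        return (
            self.info.kernel_id is not None
            and self.info.container_metric_name is not None
            and self.info.value_type is not None
            and bool(self.values)
        )
```

With this in place, the call site could shrink to `if not metric.is_usable(): continue`, and the rule for what counts as a complete sample lives next to the data it describes.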


4 participants