Skip to content

feat(nutanix): add support for alerts tracking#23538

Draft
NouemanKHAL wants to merge 12 commits intomasterfrom
noueman/nutanix-alert-resolution-tracking
Draft

feat(nutanix): add support for alerts tracking#23538
NouemanKHAL wants to merge 12 commits intomasterfrom
noueman/nutanix-alert-resolution-tracking

Conversation

@NouemanKHAL
Copy link
Copy Markdown
Member

What does this PR do?

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

NouemanKHAL and others added 12 commits April 15, 2026 18:20
Replace generic _collect-based alert flow with custom two-mode logic:
- First run: fetch only unresolved alerts (isResolved eq false), cache
  their extIds, and set the lastUpdatedTime cursor
- Subsequent runs: fetch alerts by lastUpdatedTime, detect new opens,
  resolutions, and alerts created-and-resolved between runs

Track open alert IDs in persistent cache so resolution detection
survives agent restarts.
Emit nutanix.alert.open=1 each cycle for every unresolved alert, tagged
with ext_id, severity, alertType, status, cluster, and source entity.
Submit a one-shot 0 when an alert is detected as resolved or deleted,
giving each alert its own time series for individual lifecycle monitoring.

Reconcile open-alert state against the unresolved-alerts API on every
cycle instead of using a persistent cursor cache. Removes orphaned metric
emissions when resolutions are missed (agent downtime spanning a
resolution, alert deletion, or cache survival across Prism Centrals).

Distinguish acknowledged-but-unresolved alerts with
ntnx_alert_status:acknowledged and event alert_type "warning".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single nutanix.alert.open gauge with three per-state metrics
that each represent one stage of the alert lifecycle:

  - nutanix.alert.open: 1 while alert is unresolved + unacknowledged
  - nutanix.alert.acknowledged: 1 while alert is unresolved + acknowledged
  - nutanix.alert.resolved: 1 once when alert enters the resolved state

State transitions emit an explicit 0 to the previous state's gauge
(open->ack zeroes nutanix.alert.open, ack->resolved zeroes
nutanix.alert.acknowledged), so per-alert monitor cases recover cleanly
when the alert leaves a state. The metric name itself encodes the state,
so monitor queries don't need a tag filter to disambiguate.

Add an exported monitor template at assets/monitors/alerts.json combining
nutanix.alert.open and nutanix.alert.resolved into a per-ext_id formula.
Cleanup pass to follow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Monitor template (assets/monitors/alerts.json):
- replace placeholder title and description with real text
- drop the exported monitor id, restriction_policy, silenced, priority
- normalize created_at/last_updated_at to YYYY-MM-DD per convention
- replace the personal @ mention in the message with @your-team
- add integration:nutanix tag

Register the monitor in manifest.json under assets.monitors.

Code:
- extract _submit_state_metric(alert, state, value) helper to remove the
  triplicated `gauge(f"alert.{state}", v, tags=_build_alert_tags(...))`
  pattern from emit_open_alert_metrics, the ack-transition branch in
  _reconcile_alerts, and _emit_resolution_event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three Tier-1 tags to alert events and metrics:

  - ntnx_originating_cluster_name: distinguishes the cluster the alert
    originated on from the cluster it's reported against. Critical when
    Prism Central federates managed clusters; today both cases collapse
    into ntnx_cluster_name.
  - ntnx_alert_user_defined: lets operators filter custom alert policies
    apart from Nutanix's built-in alerts.
  - ntnx_alert_service: identifies the Nutanix subsystem (e.g. IAMv2)
    when the alert carries a serviceName (sparse but high-signal).

Parameterize _add_cluster_name_tag with a tag_name argument so the same
helper handles both clusterUUID and originatingClusterUUID lookups.

Drop the redundant ntnx_alert_status tag from per-state metrics: the
metric name (nutanix.alert.open / .acknowledged / .resolved) already
encodes the state, so the tag adds noise without information. Events
keep the tag for filtering in the Events Explorer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fold emit_open_alert_metrics into _reconcile_alerts (one final loop
  emits :1 to the current state for each tracked alert), removing a
  public method and its check.py call site
- drop _submit_state_metric's extra_tags parameter (only one caller used
  it for ntnx_alert_auto_resolved, which is now appended once during
  _emit_resolution_event's tag construction)
- in _emit_resolution_event, build the metric tag set once and reuse it
  for both the prev-state :0 and the .resolved :1, instead of calling
  _build_alert_tags three times per resolution
- strip narrative docstrings on internal helpers (_reconcile_alerts,
  _submit_state_metric, _emit_resolution_event) and the inline narrative
  comments inside _reconcile_alerts; rely on the code being self-evident
- replace _fixture_alert's hand-rolled file open with conftest's
  load_fixture (gains the existing fixture cache for free, drops the
  os/json imports)
- remove test_alerts_collection (subsumed by
  test_alerts_first_run_collects_only_unresolved, which already asserts
  every event's status tag, no resolution events on first run, and
  cache size); fold the only unique assertions (event_type and
  source_type_name) into the surviving test
- delete narrative inline comments throughout test_alerts.py

Net: ~120 lines removed, no behavior change, all 119 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bulk strip removed too much. Restore comments that orient
the reader to test phases (first run / second run), explain non-obvious
mock setup (acknowledgment override, datetime past fixture window), and
state assertion intent (why two distinct cluster tags matter, why we
pick an unack'd alert as the transition target).

Kept removed: pure narration like "Verify message rendering" and
"Verify tags" labels — those just label sections of assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- hoist severity->alert_type map to a module-level constant
  (_SEVERITY_TO_ALERT_TYPE) so it isn't rebuilt on every _process_alert
  call; collapses the 5-line dict literal + lookup into a single line
  at the call site
- inline the now-unused `severity` and `status` locals in _process_alert
  (severity goes inline in the alert_type expression, status goes inline
  in the _build_alert_tags call); shaves ~4 lines without losing clarity
- shrink _build_alert_tags's 5-line docstring to a one-liner
- parametrize the two _emit_resolution_event tests (auto vs manual
  resolution) into a single pytest.mark.parametrize case; drops ~30
  lines of fixture-dict duplication while keeping coverage

All 119 unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add five tests for scenarios the prior suite didn't deterministically
exercise:

- ack -> resolved: closes nutanix.alert.acknowledged (not .open) and
  emits nutanix.alert.resolved=1. Previously only incidentally covered
  depending on fixture iteration order.
- acknowledged -> open (un-ack): symmetric to the existing open->ack
  transition test; verifies :0 to .acknowledged and :1 to .open.
- Filter excludes a previously-tracked alert: documents current
  behavior — reconciliation treats filter exclusion as resolution and
  emits a spurious resolution event + nutanix.alert.resolved=1. Operators
  changing filter rules should be aware.
- Empty unresolved list: cold-start no-op (no events, no metrics).
- Alert deleted in Nutanix (_get_alert returns None): falls back to the
  cached alert dict and emits the bare "Resolved" msg_text branch.

124 unit tests pass (up from 119).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant