feat(nutanix): add support for alerts tracking#23538
Draft
NouemanKHAL wants to merge 12 commits intomasterfrom
Draft
feat(nutanix): add support for alerts tracking#23538NouemanKHAL wants to merge 12 commits intomasterfrom
NouemanKHAL wants to merge 12 commits intomasterfrom
Conversation
Replace generic _collect-based alert flow with custom two-mode logic: - First run: fetch only unresolved alerts (isResolved eq false), cache their extIds, and set the lastUpdatedTime cursor - Subsequent runs: fetch alerts by lastUpdatedTime, detect new opens, resolutions, and alerts created-and-resolved between runs Track open alert IDs in persistent cache so resolution detection survives agent restarts.
…-op behavioral tests
Emit nutanix.alert.open=1 each cycle for every unresolved alert, tagged with ext_id, severity, alertType, status, cluster, and source entity. Submit a one-shot 0 when an alert is detected as resolved or deleted, giving each alert its own time series for individual lifecycle monitoring. Reconcile open-alert state against the unresolved-alerts API on every cycle instead of using a persistent cursor cache. Removes orphaned metric emissions when resolutions are missed (agent downtime spanning a resolution, alert deletion, or cache survival across Prism Centrals). Distinguish acknowledged-but-unresolved alerts with ntnx_alert_status:acknowledged and event alert_type "warning". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single nutanix.alert.open gauge with three per-state metrics that each represent one stage of the alert lifecycle: - nutanix.alert.open: 1 while alert is unresolved + unacknowledged - nutanix.alert.acknowledged: 1 while alert is unresolved + acknowledged - nutanix.alert.resolved: 1 once when alert enters the resolved state State transitions emit an explicit 0 to the previous state's gauge (open->ack zeroes nutanix.alert.open, ack->resolved zeroes nutanix.alert.acknowledged), so per-alert monitor cases recover cleanly when the alert leaves a state. The metric name itself encodes the state, so monitor queries don't need a tag filter to disambiguate. Add an exported monitor template at assets/monitors/alerts.json combining nutanix.alert.open and nutanix.alert.resolved into a per-ext_id formula. Cleanup pass to follow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Monitor template (assets/monitors/alerts.json): - replace placeholder title and description with real text - drop the exported monitor id, restriction_policy, silenced, priority - normalize created_at/last_updated_at to YYYY-MM-DD per convention - replace the personal @ mention in the message with @your-team - add integration:nutanix tag Register the monitor in manifest.json under assets.monitors. Code: - extract _submit_state_metric(alert, state, value) helper to remove the triplicated `gauge(f"alert.{state}", v, tags=_build_alert_tags(...))` pattern from emit_open_alert_metrics, the ack-transition branch in _reconcile_alerts, and _emit_resolution_event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add three Tier-1 tags to alert events and metrics:
- ntnx_originating_cluster_name: distinguishes the cluster the alert
originated on from the cluster it's reported against. Critical when
Prism Central federates managed clusters; today both cases collapse
into ntnx_cluster_name.
- ntnx_alert_user_defined: lets operators filter custom alert policies
apart from Nutanix's built-in alerts.
- ntnx_alert_service: identifies the Nutanix subsystem (e.g. IAMv2)
when the alert carries a serviceName (sparse but high-signal).
Parameterize _add_cluster_name_tag with a tag_name argument so the same
helper handles both clusterUUID and originatingClusterUUID lookups.
Drop the redundant ntnx_alert_status tag from per-state metrics: the
metric name (nutanix.alert.open / .acknowledged / .resolved) already
encodes the state, so the tag adds noise without information. Events
keep the tag for filtering in the Events Explorer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fold emit_open_alert_metrics into _reconcile_alerts (one final loop emits :1 to the current state for each tracked alert), removing a public method and its check.py call site - drop _submit_state_metric's extra_tags parameter (only one caller used it for ntnx_alert_auto_resolved, which is now appended once during _emit_resolution_event's tag construction) - in _emit_resolution_event, build the metric tag set once and reuse it for both the prev-state :0 and the .resolved :1, instead of calling _build_alert_tags three times per resolution - strip narrative docstrings on internal helpers (_reconcile_alerts, _submit_state_metric, _emit_resolution_event) and the inline narrative comments inside _reconcile_alerts; rely on the code being self-evident - replace _fixture_alert's hand-rolled file open with conftest's load_fixture (gains the existing fixture cache for free, drops the os/json imports) - remove test_alerts_collection (subsumed by test_alerts_first_run_collects_only_unresolved, which already asserts every event's status tag, no resolution events on first run, and cache size); fold the only unique assertions (event_type and source_type_name) into the surviving test - delete narrative inline comments throughout test_alerts.py Net: ~120 lines removed, no behavior change, all 119 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bulk strip removed too much. Restore comments that orient the reader to test phases (first run / second run), explain non-obvious mock setup (acknowledgment override, datetime past fixture window), and state assertion intent (why two distinct cluster tags matter, why we pick an unack'd alert as the transition target). Kept removed: pure narration like "Verify message rendering" and "Verify tags" labels — those just label sections of assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- hoist severity->alert_type map to a module-level constant (_SEVERITY_TO_ALERT_TYPE) so it isn't rebuilt on every _process_alert call; collapses the 5-line dict literal + lookup into a single line at the call site - inline the now-unused `severity` and `status` locals in _process_alert (severity goes inline in the alert_type expression, status goes inline in the _build_alert_tags call); shaves ~4 lines without losing clarity - shrink _build_alert_tags's 5-line docstring to a one-liner - parametrize the two _emit_resolution_event tests (auto vs manual resolution) into a single pytest.mark.parametrize case; drops ~30 lines of fixture-dict duplication while keeping coverage All 119 unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add five tests for scenarios the prior suite didn't deterministically exercise: - ack -> resolved: closes nutanix.alert.acknowledged (not .open) and emits nutanix.alert.resolved=1. Previously only incidentally covered depending on fixture iteration order. - acknowledged -> open (un-ack): symmetric to the existing open->ack transition test; verifies :0 to .acknowledged and :1 to .open. - Filter excludes a previously-tracked alert: documents current behavior — reconciliation treats filter exclusion as resolution and emits a spurious resolution event + nutanix.alert.resolved=1. Operators changing filter rules should be aware. - Empty unresolved list: cold-start no-op (no events, no metrics). - Alert deleted in Nutanix (_get_alert returns None): falls back to the cached alert dict and emits the bare "Resolved" msg_text branch. 124 unit tests pass (up from 119). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Motivation
Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged