Refine PerfEvaluation to compare network perf data with sku cap #4430
Conversation
🤖 AI Test Selection: No test cases were selected for this PR.
Pull request overview
This PR updates the PerfEvaluation notifier to evaluate network throughput metrics against Azure SKU bandwidth capabilities (via cached location/SKU data) using a peak-based validation approach, and enriches failure summaries with more detailed context.
Changes:
- Added a configurable `post_peak_drop_ratio` and implemented peak-based throughput validation (post-peak drop rule + SKU 90% peak rule; see the sketch after this list).
- Added an Azure SKU capability cache lookup (`azure_locations_{location}.json`) to drive throughput evaluation when available.
- Expanded performance-failure summaries to include per-metric failure reasons.
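For concreteness, here is a minimal sketch of how the two rules could combine, assuming the drop rule applies to every sample after the peak. The function name, signature, and return shape are illustrative and not taken from the PR; only the 90% threshold and the 0.02 default come from the description above.

```python
from typing import List, Optional, Tuple


def check_peak_based_throughput(
    samples_gbps: List[float],
    sku_max_bandwidth_gbps: float,
    post_peak_drop_ratio: float = 0.02,
) -> Tuple[bool, Optional[str]]:
    """Return (passed, failure_reason) for a throughput sample series."""
    if not samples_gbps:
        return False, "no throughput samples collected"
    peak = max(samples_gbps)
    # SKU 90% peak rule: the observed peak should reach at least 90% of
    # the SKU's advertised bandwidth cap.
    if peak < 0.9 * sku_max_bandwidth_gbps:
        return False, (
            f"peak {peak:.2f} Gbps is below 90% of SKU cap "
            f"{sku_max_bandwidth_gbps:.2f} Gbps"
        )
    # Post-peak drop rule: samples after the peak may drop at most
    # post_peak_drop_ratio below it (0.02 means a 2% drop is allowed).
    peak_index = samples_gbps.index(peak)
    floor = peak * (1 - post_peak_drop_ratio)
    for value in samples_gbps[peak_index + 1 :]:
        if value < floor:
            return False, (
                f"post-peak sample {value:.2f} Gbps dropped more than "
                f"{post_peak_drop_ratio:.0%} below peak {peak:.2f} Gbps"
            )
    return True, None
```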
Comments suppressed due to low confidence (1)
lisa/notifiers/perfevaluation/perfevaluation.py:1207
`_modify_message` appends `detail_msg` into `TestResultMessage.message`, which can easily exceed the 2048-char cap applied when the message is first created (`TestResult._send_result_message`). This can cause downstream consumers to receive unexpectedly large messages. Consider keeping `message.message` concise (and/or truncating `detail_msg` to a safe length) while storing the full details only in `perf_evaluation_summary`; a truncation sketch follows the quoted code.
summary_msg = (
    f"Performance evaluation failed: {len(failed_metrics)} "
    f"metric(s) did not meet criteria. Details: {detail_msg}"
)
message.perf_evaluation_summary = {
    "failed_metrics_count": len(failed_metrics),
    "failed_metrics": failed_metrics,
    "summary": summary_msg,
}
runbook = cast(PerfEvaluationSchema, self.runbook)
fail_on_perf = getattr(
    runbook, "fail_test_on_performance_failure", False
)
if fail_on_perf and message.status != TestStatus.FAILED:
    original_message = message.message or ""
    perf_summary = (
        f"Performance criteria failed: {len(failed_metrics)} "
        f"metrics. {detail_msg}"
    )
    message.status = TestStatus.FAILED
    if original_message:
        message.message = f"{original_message}\n{perf_summary}"
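One possible guard for the 2048-char cap mentioned above: truncate the appended detail so `message.message` stays small, while the untruncated details remain in `perf_evaluation_summary`. The limit and helper name here are illustrative, not part of the PR.

```python
# Hypothetical cap for detail text appended to message.message; the full
# details would still live in perf_evaluation_summary.
_MAX_DETAIL_LEN = 512


def _truncate_detail(detail_msg: str, limit: int = _MAX_DETAIL_LEN) -> str:
    if len(detail_msg) <= limit:
        return detail_msg
    return detail_msg[: limit - 3] + "..."
```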
# Allowed post-peak throughput drop ratio, e.g. 0.02 means 2% drop allowed.
post_peak_drop_ratio: float = 0.02
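As a dataclass field on the notifier schema, this would presumably be set from the runbook. A hypothetical snippet, assuming the notifier type is registered as `perfevaluation` (only the field name comes from the diff above; everything else is illustrative):

```yaml
notifier:
  - type: perfevaluation
    # allow post-peak throughput samples to drop at most 3% below the peak
    post_peak_drop_ratio: 0.03
```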
The PR description template is still empty (Description/Related Issue/Type of Change/Checklist). Please fill it in so reviewers understand the intent and validation plan for the new PerfEvaluation behavior.
# Step 1: if this is a throughput metric and SKU bandwidth is available,
# use the peak-based check exclusively — skip perf_criteria entirely.
throughput_prefixes = (
    "throughput_in_gbps",
    "tx_throughput_in_gbps",
    "rx_throughput_in_gbps",
)
if (
    any(metric_name.startswith(p) for p in throughput_prefixes)
    and perf_message.metric_relativity
    == messages.MetricRelativity.HigherIsBetter
):
    bw_spec = self._lookup_az_max_bandwidth(
        perf_message.location, perf_message.vmsize or ""
    )
    if bw_spec is not None:
        self._update_bw_peak(perf_message, metric_name, actual_value, bw_spec)
        return
The SKU-based throughput path returns early after `_update_bw_peak(...)` and does not record a passing evaluation result. If all perf metrics in a run are throughput metrics and they pass, `_evaluation_results` can remain empty and `finalize()` will log "No performance evaluations performed." Consider appending a pass result per throughput metric (and/or for the SKU 90% check) so reports/JSON output remain consistent with the non-SKU criteria path.
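A minimal sketch of that suggestion, assuming `_evaluation_results` is a list of per-metric records like the non-SKU path produces; the dict keys here are illustrative and should mirror whatever the `perf_criteria` path already records.

```python
# Record a passing evaluation for the SKU-based throughput path so that
# finalize() sees a non-empty result set even when every metric passes.
self._evaluation_results.append(
    {
        "metric_name": metric_name,
        "actual_value": actual_value,
        "criteria": "sku_peak_bandwidth",
        "passed": True,
    }
)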
try:
    with open(cache_path, "r", encoding="utf-8") as f:
        self._az_cap_cache[location] = json.load(f)
except Exception:
`_lookup_az_max_bandwidth` catches `Exception` and silently falls back to `{}` on any failure reading or parsing the cache file. This can mask real issues (e.g., JSON corruption) and make troubleshooting difficult. Prefer catching specific exceptions (e.g., `OSError`, `json.JSONDecodeError`, `ValueError`) and logging the exception (at least at debug level) with the cache path before falling back.
Suggested change:
- except Exception:
+ except (OSError, json.JSONDecodeError, ValueError) as e:
+     self._log.debug(
+         f"Failed to read Azure capability cache from {cache_path}: "
+         f"{e}. Verify the cache file exists, is readable, and "
+         "contains valid JSON. Falling back to an empty cache."
+     )
Description
Related Issue
Type of Change
Checklist
Test Validation
Key Test Cases:
Impacted LISA Features:
Tested Azure Marketplace Images:
Test Results