[AI Generated] Azure: surface per-resource errors on truncated DeploymentFailed #4438

Open
LiliDeng wants to merge 2 commits into main from feat/log-deployment-operation-errors

@LiliDeng
Collaborator

Summary

When an Azure ARM deployment fails, LISA currently surfaces only the top-level HttpResponseError. In several common cases that top-level message is not actionable:

  • DeploymentFailed with the body "aggregated deployment error is too large" (Azure truncates the inner details when there are many failed sub-operations).
  • ResourceDeploymentFailure: Encountered internal server error. Diagnostic information: timestamp '...', subscription id '...', tracking id '...' — only a tracking id is returned, no per-resource error.
  • DeploymentFailed with no nested details.

In these cases users have to re-run with keep_environment:always and dig through the Azure Portal or the output of az deployment operation group list to find which resource actually failed and why.

Change

In lisa/sut_orchestrator/azure/platform_.py:

  • Add _collect_deployment_operation_errors(), which calls self._rm_client.deployment_operations.list(resource_group_name, AZURE_DEPLOYMENT_NAME) after a failed deploy and extracts the targetResource and the nested statusMessage.error for every operation whose provisioningState == "Failed".
  • In _deploy()'s HttpResponseError handler, call this helper as a fallback when the top-level message looks aggregated/truncated, the code is ResourceDeploymentFailure, or DeploymentFailed has no nested details. Append the collected sub-resource errors to both the log and the raised LisaException.
  • Behavior is unchanged for normal deployment failures that already carry per-resource details. The fallback is wrapped in try/except and logs at debug on failure, so this never makes diagnostics worse.
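
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: it works on plain dictionaries shaped like the ARM REST response (properties.provisioningState, properties.targetResource, properties.statusMessage.error) rather than on SDK objects, and the function name collect_failed_operation_errors and the placeholder strings are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def collect_failed_operation_errors(
    operations: Iterable[Dict[str, Any]],
) -> List[str]:
    """Return one 'type/name: code: message' line per failed operation.

    Hypothetical helper: `operations` is assumed to be an iterable of dicts
    that mirror the REST shape of ARM deployment operations.
    """
    lines: List[str] = []
    for operation in operations:
        properties = operation.get("properties") or {}
        if properties.get("provisioningState") != "Failed":
            continue  # only failed sub-operations are interesting

        target = properties.get("targetResource") or {}
        parts = [target.get("resourceType"), target.get("resourceName")]
        resource = "/".join(p for p in parts if p) or "<unknown resource>"

        # statusMessage may be missing or may not be a dict; treat both
        # cases as "no structured error available".
        status_message = properties.get("statusMessage")
        error: Dict[str, Any] = {}
        if isinstance(status_message, dict):
            error = status_message.get("error") or {}

        code = error.get("code") or "UnknownError"
        message = error.get("message") or "no error message returned"
        lines.append(f"{resource}: {code}: {message}")
    return lines


if __name__ == "__main__":
    sample = [
        {"properties": {"provisioningState": "Succeeded"}},
        {
            "properties": {
                "provisioningState": "Failed",
                "targetResource": {
                    "resourceType": "Microsoft.Compute/virtualMachines",
                    "resourceName": "vm0",
                },
                "statusMessage": {
                    "error": {"code": "AllocationFailed", "message": "no capacity"}
                },
            }
        },
    ]
    for line in collect_failed_operation_errors(sample):
        print(line)
```

Tolerating missing fields at every level keeps the sketch in line with the stated goal that the fallback should never make diagnostics worse.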

After this change a previously opaque error becomes:

ResourceDeploymentFailure: Encountered internal server error. ...tracking id '...'.
Microsoft.Compute/virtualMachines/<name>: AllocationFailed: ...
Microsoft.Network/networkInterfaces/<name>: NetworkInterfaceCountExceeded: ...
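
One way to assemble a message of this shape is sketched below. The function name append_sub_errors is hypothetical, and the sketch assumes the sub-resource errors have already been rendered as one string per resource; the PR's actual formatting may differ.

```python
from typing import List


def append_sub_errors(top_level_message: str, sub_errors: List[str]) -> str:
    """Append per-resource error lines under the top-level message.

    Hypothetical helper: returns the top-level message unchanged when no
    sub-resource errors were collected, so the normal path is unaffected.
    """
    if not sub_errors:
        return top_level_message
    return "\n".join([top_level_message, *sub_errors])


if __name__ == "__main__":
    print(
        append_sub_errors(
            "ResourceDeploymentFailure: Encountered internal server error.",
            ["Microsoft.Compute/virtualMachines/vm0: AllocationFailed: no capacity"],
        )
    )
```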

Validation

  1. RHEL 8.1 marketplace smoke_test (success path) — Provisioning.smoke_test: PASSED (134.981 s case, 539 s total). Fallback does not affect the success path.
  2. RHEL 8.1 marketplace smoke_test without osdisk_size_in_gb:64 (deliberate failure) — produces a normal DeploymentFailed with details intact, fallback correctly skipped, error message identical to current behavior.

The fallback branch itself triggers on real-world ResourceDeploymentFailure / aggregated-error responses, which are non-deterministic from Azure's side and can't be reliably synthesized; the helper is defensive and is exercised through the success / normal-failure paths above.


🤖 Generated with the assistance of GitHub Copilot.

Copilot AI review requested due to automatic review settings April 27, 2026 02:48
Contributor

Copilot AI left a comment

Pull request overview

This PR improves Azure ARM deployment failure diagnostics in LISA by collecting and surfacing per-resource deployment operation errors when Azure returns truncated or otherwise unactionable top-level HttpResponseError details.

Changes:

  • Add a helper to list ARM deployment operations and extract failed sub-resource errors.
  • Extend _deploy() error handling to optionally append these sub-resource errors when top-level deployment failures are aggregated/truncated or missing details.
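
The trigger condition summarized above might look something like the following sketch. The function name needs_operation_lookup, its parameters, and the exact substring match are assumptions made for illustration; only the three conditions themselves come from the PR description.

```python
from typing import Optional, Sequence

# Assumed marker, taken from the truncated error body quoted in the PR summary.
TRUNCATION_MARKER = "aggregated deployment error is too large"


def needs_operation_lookup(
    code: Optional[str],
    message: str,
    details: Optional[Sequence[object]],
) -> bool:
    """Decide whether to fall back to listing deployment operations.

    Hypothetical predicate covering the three cases from the PR summary:
    a truncated aggregate message, ResourceDeploymentFailure, or
    DeploymentFailed without nested details.
    """
    if TRUNCATION_MARKER in message.lower():
        return True
    if code == "ResourceDeploymentFailure":
        return True
    if code == "DeploymentFailed" and not details:
        return True
    return False
```

A DeploymentFailed error that still carries nested details returns False here, which matches the stated intent that normal failures keep their current behavior.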

4 comment threads on lisa/sut_orchestrator/azure/platform_.py (all marked Outdated)
@github-actions

✅ AI Test Selection — PASSED

1 test case(s) selected (view run)

Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-arm64 latest

| | Count |
| --- | --- |
| ✅ Passed | 1 |
| ❌ Failed | 0 |
| ⏭️ Skipped | 0 |
| Total | 1 |

Test case details

| Test Case | Status | Time (s) | Message |
| --- | --- | --- | --- |
| smoke_test (lisa_0_0) | ✅ PASSED | 32.996 | |

LiliDeng added a commit that referenced this pull request Apr 27, 2026
Catch Azure-specific exceptions (HttpResponseError, ResourceNotFoundError)
explicitly and log them at debug. Keep a broad fallback for unexpected
errors but log with exc_info=True so the traceback is preserved instead
of being silently swallowed. The original error path is unchanged: this
helper still never raises.

Addresses review comment on PR #4438.
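
The exception-handling pattern this commit message describes could be sketched as below. To keep the example self-contained, HttpResponseError and ResourceNotFoundError are local stand-ins for the classes of the same name in azure.core.exceptions, and collect_safely and its fetch parameter are hypothetical names, not the PR's code.

```python
import logging
from typing import Callable, List


class HttpResponseError(Exception):
    """Local stand-in for azure.core.exceptions.HttpResponseError."""


class ResourceNotFoundError(HttpResponseError):
    """Local stand-in for azure.core.exceptions.ResourceNotFoundError."""


def collect_safely(fetch: Callable[[], List[str]], log: logging.Logger) -> List[str]:
    """Run `fetch` and never raise; return [] if anything goes wrong.

    Hypothetical wrapper illustrating the two-tier handling: known Azure
    errors are logged briefly, unexpected ones keep their traceback.
    """
    try:
        return fetch()
    except (HttpResponseError, ResourceNotFoundError) as identifier:
        # Expected Azure failures: a short debug line is enough.
        log.debug(f"failed to list deployment operations: {identifier}")
        return []
    except Exception:
        # Unexpected failures: exc_info=True preserves the traceback.
        log.debug("unexpected error listing deployment operations", exc_info=True)
        return []
```

Returning an empty list on every failure path means the caller can append the result unconditionally, so the original error is reported exactly as before when the lookup itself fails.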
Copilot AI review requested due to automatic review settings April 27, 2026 02:57
@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from f446593 to a95fe38 Compare April 27, 2026 02:57
@github-actions

⏭️ AI Test Selection — SKIPPED

1 test case(s) selected (view run)

Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-arm64 latest

Selected test cases
smoke_test

@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from a95fe38 to 20caf6d Compare April 27, 2026 02:57
@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from 20caf6d to 08ad4be Compare April 27, 2026 02:58
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread lisa/sut_orchestrator/azure/platform_.py
@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from 08ad4be to 5ad13c5 Compare April 27, 2026 03:01
Copilot AI review requested due to automatic review settings April 27, 2026 03:01
@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from 5ad13c5 to 1c1b0f0 Compare April 27, 2026 03:01
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

3 comment threads on lisa/sut_orchestrator/azure/platform_.py (2 marked Outdated)
@github-actions

✅ AI Test Selection — PASSED

1 test case(s) selected (view run)

Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-arm64 latest

| | Count |
| --- | --- |
| ✅ Passed | 1 |
| ❌ Failed | 0 |
| ⏭️ Skipped | 0 |
| Total | 1 |

Test case details

| Test Case | Status | Time (s) | Message |
| --- | --- | --- | --- |
| smoke_test (lisa_0_0) | ✅ PASSED | 29.323 | |

Catch Azure-specific exceptions (HttpResponseError, ResourceNotFoundError)
explicitly and log them at debug. Keep a broad fallback for unexpected
errors but log with exc_info=True so the traceback is preserved instead
of being silently swallowed. The original error path is unchanged: this
helper still never raises.

Addresses review comment on PR #4438.
@LiliDeng LiliDeng force-pushed the feat/log-deployment-operation-errors branch from 1c1b0f0 to 48a3d30 Compare April 27, 2026 03:08
@github-actions

✅ AI Test Selection — PASSED

1 test case(s) selected (view run)

Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-arm64 latest

| | Count |
| --- | --- |
| ✅ Passed | 1 |
| ❌ Failed | 0 |
| ⏭️ Skipped | 0 |
| Total | 1 |

Test case details

| Test Case | Status | Time (s) | Message |
| --- | --- | --- | --- |
| smoke_test (lisa_0_0) | ✅ PASSED | 33.954 | |
