Harden OpenVMM Azure guest orchestration#4420
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the OpenVMM guest orchestrator flow when used as an Azure “guest-enabled” environment, adding cloud-init provisioning support, TAP networking/SSH forwarding, richer diagnostics, and smoke coverage to validate guest lifecycle operations.
Changes:
- Add cloud-init ISO generation and provisioning plumbing for OpenVMM guests, with improved failure-context capture.
- Add TAP networking support (dnsmasq DHCP, guest address resolution, host-side SSH port forwarding) and teardown/cleanup paths.
- Update runner/Azure plumbing to better handle guest-enabled environments, plus OpenVMM smoke suite + Azure runbook, and extend selftests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| selftests/test_openvmm_node.py | Updates selftests for stop-timeout behavior and logging expectations. |
| lisa/tools/openvmm.py | Improves detached launch robustness (TTY wrapper) and PID failure diagnostics. |
| lisa/sut_orchestrator/openvmm/schema.py | Extends OpenVMM guest/network schema for cloud-init + TAP/forwarding settings. |
| lisa/sut_orchestrator/openvmm/node.py | Implements TAP networking, cloud-init ISO creation, forwarding, diagnostics, artifact cleanup. |
| lisa/sut_orchestrator/azure/platform_.py | Avoids feature access for guest nodes unless the feature is supported. |
| lisa/runners/lisa_runner.py | Adds guest-environment initialization and environment-status matching for guest-enabled runs. |
| lisa/node.py | Ensures host nodes clean up/close guest nodes; relaxes connection test when shell is missing. |
| lisa/microsoft/testsuites/openvmm/openvmm.py | Adds OpenVMM Azure smoke coverage for boot/restart/stop-start scenarios. |
| lisa/microsoft/runbook/openvmm/openvmm-azure-smoke.yml | New Azure smoke runbook for OpenVMM guest validation with TAP+forwarding. |
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
6dc110d to
08cd8b1
Compare
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
08cd8b1 to
9d5c64b
Compare
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
6f94f54 to
c2a279c
Compare
⏭️ AI Test Selection — SKIPPED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest Selected test cases |
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
c2a279c to
1f2096c
Compare
✅ AI Test Selection — PASSED5 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
1f2096c to
12687a4
Compare
✅ AI Test Selection — PASSED3 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
anirudhrb
left a comment
There was a problem hiding this comment.
I haven't been able to go through the whole code yet. Here are some comments from the first pass. I'll do another pass tomorrow.
| class OpenVmmPlatform(TestSuite): | ||
| def before_case(self, log: Logger, **kwargs: Any) -> None: | ||
| node = kwargs["node"] | ||
| if not isinstance(node, OpenVmmGuestNode): |
There was a problem hiding this comment.
I believe the right way to ensure this is to add supported_platform_type in the TestSuiteMetadata. That will make sure these tests will only run on openvmm platform.
There was a problem hiding this comment.
I don’t think supported_platform_type is the right check here, because it filters the LISA platform/orchestrator rather than the guest node type. These cases run on Azure-backed host environments but target OpenVmmGuestNode guests, so supported_platform_type=[OPENVMM] ended up skipping them.
The before_case() node-type check is the right gate for this suite. i did try but it doesnt work and test get skipped.
There was a problem hiding this comment.
But how do OpenVmmGuestNodes get created? Isn't openvmm platform the only way to create them? So, if we filter via LISA platform openvmm it should be guaranteed that the nodes created are OpenVmmGuestNode right?
There was a problem hiding this comment.
It is a guest node type loaded from platform.guests, and guest nodes are initialized by the generic platform layer when guest_enabled: true
There was a problem hiding this comment.
Ahh I got it now. We're not using the OpenVMM platform directly. Instead, we're using Azure platform and defining this guests property. I was confused as to what was going on because I didn't know that this guests property existed.
12687a4 to
4ad8f7c
Compare
✅ AI Test Selection — PASSED3 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
| user: dict[str, Any] = { | ||
| "name": runbook.username, | ||
| "shell": "/bin/bash", | ||
| "sudo": ["ALL=(ALL) NOPASSWD:ALL"], | ||
| "groups": ["sudo"], |
There was a problem hiding this comment.
This module uses PEP 585 built-in generics in annotations (e.g. dict[str, Any] here, plus list[str]/tuple[str, str] later). The repo targets Python 3.8 (pyproject.toml), so these annotations will raise at import time unless from __future__ import annotations is enabled. Please add the future import at the top of this module or switch these annotations back to Dict/List/Tuple from typing to keep Python 3.8 compatibility.
4ad8f7c to
fd3f118
Compare
✅ AI Test Selection — PASSED3 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
fd3f118 to
51d33c6
Compare
✅ AI Test Selection — PASSED3 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
| "attempt=0; " | ||
| "while [ $attempt -lt 100 ]; do " | ||
| f"if [ -s {pid_path} ]; then cat {pid_path}; exit 0; fi; " | ||
| "if ! kill -0 $wrapper_pid >/dev/null 2>&1; then break; fi; " | ||
| "attempt=$((attempt + 1)); " | ||
| "sleep 0.1; " | ||
| "done; " |
There was a problem hiding this comment.
The retry loop uses magic numbers (100 attempts and sleep 0.1), which effectively encode a ~10s wait. Please replace them with named constants (or derive from a single timeout) and add a short inline note explaining the chosen duration so future changes don’t accidentally make launches slower/flakier.
| user: dict[str, Any] = { | ||
| "name": runbook.username, | ||
| "shell": "/bin/bash", | ||
| "sudo": ["ALL=(ALL) NOPASSWD:ALL"], | ||
| "groups": ["sudo"], | ||
| } | ||
| if runbook.private_key_file: | ||
| user["ssh_authorized_keys"] = [ | ||
| get_public_key_data(runbook.private_key_file) | ||
| ] | ||
|
|
||
| user_data: dict[str, Any] = { | ||
| "users": ["default", user], | ||
| } |
There was a problem hiding this comment.
This file uses PEP 585 built-in generics (e.g., dict[str, Any]) which are not supported at runtime on Python 3.8 unless from __future__ import annotations is enabled (repo targets py38). Please either add the future import at the top of the module or replace these annotations with typing.Dict[...]/typing.List[...] etc. to keep py38 compatibility.
| def _create_iso(self, file_path: str, files: List[tuple[str, str]]) -> None: | ||
| import pycdlib |
There was a problem hiding this comment.
List[tuple[str, str]] uses PEP 585 tuple[...], which is not valid on Python 3.8 at runtime without from __future__ import annotations. Use typing.Tuple[str, str] (and/or the future import) to keep the orchestrator compatible with the repo's py38 target.
| tmp_dir.cleanup() | ||
|
|
||
| def _create_iso(self, file_path: str, files: List[tuple[str, str]]) -> None: | ||
| import pycdlib |
There was a problem hiding this comment.
Importing pycdlib here will raise a raw ImportError/ModuleNotFoundError if the optional dependency isn't installed, which is hard to diagnose for users enabling cloud_init. Consider catching ImportError and raising a LisaException with clear guidance (e.g., install the appropriate extra / package) so failures are actionable.
| import pycdlib | |
| try: | |
| import pycdlib | |
| except ImportError as identifier: | |
| raise LisaException( | |
| "Failed to create the OpenVMM cloud-init ISO because the optional " | |
| "'pycdlib' package is not installed. Install 'pycdlib' in the LISA " | |
| "controller environment, then retry the run with cloud_init enabled." | |
| ) from identifier |
51d33c6 to
362a68d
Compare
✅ AI Test Selection — PASSED2 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
Add an Azure smoke runbook and OpenVMM smoke suite that exercise guest boot, restart, and stop/start on a prepared host. Describe the LisaRunner guest-enabled flow explicitly: - schedule environment initialization when the outer environment reaches Deployed - initialize guest nodes after the outer environment connects - allow guest-targeted tests that require Deployed to run once the outer host is Connected - keep retry on the runner redeploy path instead of eagerly deleting the environment Also wire the Azure/OpenVMM guest handling, schema updates, diagnostics, and selftest coverage needed for this flow. (cherry picked from commit ac4c15c)
Make guest teardown best-effort so parent node cleanup and close continue even when a guest cleanup path fails. Tighten OpenVMM host-side handling by using host-native paths for copied artifacts, validating whoami before TAP creation, guarding working directory deletion, scoping dnsmasq state to the OpenVMM interface, and restoring the original ip_forward state when SSH forwarding is torn down. Also update the launcher wrapper to return the real OpenVMM PID so later liveness checks and forced cleanup target the VM process rather than the wrapper shell. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
362a68d to
90912f1
Compare
✅ AI Test Selection — PASSED2 test case(s) selected (view run) Marketplace image: canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 latest
Test case details
|
| can_run_results = [x for x in can_run_results if x.can_run] | ||
| if environment.status == EnvironmentStatus.Deployed and can_run_results: | ||
| if self._guest_enabled: | ||
| return self._generate_task( |
There was a problem hiding this comment.
In this function we're just about to run the tests (_dispatch_test_result()). Guest initialization should've have been done much before this. Feels like this is not the right place to do it. But also I'm not too familiar with this lisa_runner code. @LiliDeng could you please suggest?
There was a problem hiding this comment.
I think this is the right place if we distinguish guest creation from guest initialization.
Guest nodes are created earlier during platform deploy, but they are not initialized there. In LisaRunner, this is still part of the environment state transition, not the actual test execution path. For guest_enabled, a parent environment at Deployed is not yet ready to run tests on the guest, so the next scheduled task needs to be _initialize_environment_task, which runs the normal connected-phase init and then initializes guest_environment.nodes.
So I think the placement is okay, though a small clarifying comment here would help.
| class OpenVmmPlatform(TestSuite): | ||
| def before_case(self, log: Logger, **kwargs: Any) -> None: | ||
| node = kwargs["node"] | ||
| if not isinstance(node, OpenVmmGuestNode): |
There was a problem hiding this comment.
Ahh I got it now. We're not using the OpenVMM platform directly. Instead, we're using Azure platform and defining this guests property. I was confused as to what was going on because I didn't know that this guests property existed.
Description
pycdlibat module import timeFor Lisa-runner:
schedule guest initialization after host deploy
initialize guest nodes after host connect
allow guest runs to match Deployed against outer Connected
avoid eager delete before retry in guest flow
Related Issue
Type of Change
Checklist
Test Validation
Key Test Cases:
Impacted LISA Features:
Tested Azure Marketplace Images:
Test Results
2026-04-15 23:32:50.931[140036611700544][INFO] lisa.RootRunner ________________________________________
2026-04-15 23:32:50.932[140036611700544][INFO] lisa.RootRunner OpenVmmPlatform.verify_openvmm_guest_boot: PASSED
2026-04-15 23:32:50.932[140036611700544][INFO] lisa.RootRunner OpenVmmPlatform.verify_openvmm_restart_via_platform: PASSED