mshv: mount nvme resource disk for VM image copies #4434
🤖 AI Test Selection: No test cases were selected for this PR.
Pull request overview
This PR updates the MSHV host stress test so that it no longer copies many large guest disk images onto the OS disk on NVMe-based Azure SKUs. It detects whether /mnt/resource or /mnt is actually backed by a mounted disk and, when neither is mounted, attempts to format and mount an unused nvme*n1 disk at /mnt/resource for the disk image copies.
Changes:
- Replace the prior "/mnt directory exists" heuristic with lsblk-based mountpoint detection for /mnt/resource and /mnt.
- Add logic to select an unused NVMe namespace disk and mount it as ext4 at /mnt/resource to host VM disk image copies, with fallback to the node working path.
- Add helper methods for mountpoint detection and NVMe disk selection.
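The lsblk-based detection described in the first change can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it assumes the output of `lsblk --json -o NAME,MOUNTPOINT`, and `SAMPLE_LSBLK`, `_walk`, and `find_device_mounted_at` are hypothetical names introduced only for this example.

```python
import json
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical sample; on a real host this text would come from
# `lsblk --json -o NAME,MOUNTPOINT`.
SAMPLE_LSBLK = """
{"blockdevices": [
  {"name": "sda", "mountpoint": null, "children": [
    {"name": "sda1", "mountpoint": "/"}
  ]},
  {"name": "nvme0n1", "mountpoint": null}
]}
"""


def _walk(devices: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every device and, recursively, its child partitions."""
    for dev in devices:
        yield dev
        yield from _walk(dev.get("children", []))


def find_device_mounted_at(lsblk_json: str, mount_point: str) -> Optional[str]:
    """Return the name of the device mounted at mount_point, or None."""
    devices = json.loads(lsblk_json).get("blockdevices", [])
    for dev in _walk(devices):
        if dev.get("mountpoint") == mount_point:
            return dev["name"]
    return None


print(find_device_mounted_at(SAMPLE_LSBLK, "/"))
print(find_device_mounted_at(SAMPLE_LSBLK, "/mnt/resource"))
```

The point of checking the mount table rather than the directory is that /mnt can exist as an empty directory on the OS disk, which is exactly the case the old heuristic misread.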
def _find_unused_nvme_disk(self, disks: List[DiskInfo]) -> Optional[str]:
    nvme_pattern = re.compile(r"^nvme\d+n1$")
    for disk in disks:
        if disk.is_os_disk:
            continue
        if not nvme_pattern.match(disk.name):
            continue
        if disk.partitions:
            continue
        if disk.is_mounted:
            continue
        return f"/dev/{disk.name}"
_find_unused_nvme_disk() will select any non-OS NVMe disk with no partitions and not mounted, and get_disk_img_copy_path() formats it (format=True). On hosts with additional NVMe data disks that are intentionally unmounted, this can wipe data. Consider adding stronger checks before formatting (e.g., require disk.fstype/uuid empty and/or validate it’s the expected temporary/resource disk via blkid label or platform disk feature) and refuse to format when the disk looks previously used.
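One way the stricter check suggested above might look is sketched below. It is only an illustration under stated assumptions: `FakeDisk` is a hypothetical stand-in for the framework's DiskInfo, and the `fs_type` and `uuid` fields are assumed to be available on it, which this PR does not confirm.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeDisk:
    """Hypothetical stand-in for DiskInfo; fs_type and uuid are assumed fields."""

    name: str
    is_os_disk: bool = False
    is_mounted: bool = False
    partitions: List[str] = field(default_factory=list)
    fs_type: str = ""
    uuid: str = ""


NVME_NAMESPACE = re.compile(r"^nvme\d+n1$")


def find_blank_nvme_disk(disks: List[FakeDisk]) -> Optional[str]:
    """Return the first NVMe namespace disk that looks never used, else None."""
    for disk in disks:
        if disk.is_os_disk or disk.is_mounted or disk.partitions:
            continue
        if not NVME_NAMESPACE.match(disk.name):
            continue
        # Refuse to pick a disk that already carries a filesystem signature:
        # formatting it could destroy data on an intentionally unmounted disk.
        if disk.fs_type or disk.uuid:
            continue
        return f"/dev/{disk.name}"
    return None


disks = [FakeDisk("nvme0n1", fs_type="xfs"), FakeDisk("nvme1n1")]
print(find_blank_nvme_disk(disks))
```

With this filter, a disk that was formatted earlier is skipped and the caller falls through to the next candidate or to the working-path fallback.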
except Exception as e:
    log.warning(
        f"Failed to mount {candidate} at {mount_point}: {e}; "
        "falling back to working path."
    )
    return node.working_path
Catching a broad Exception here will also swallow unexpected programming errors (including assertion failures from mount command checks), making failures harder to diagnose. Please catch the specific exception types expected from mount/format operations (e.g., LisaException and AssertionError) and let unexpected exceptions propagate.
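A narrower handler along the lines suggested above could look like this sketch. `MountError` is a hypothetical stand-in for whatever exception type the framework raises on mount failures (the comment names LisaException), and `mount_or_fallback` is an illustrative helper, not a function from this PR.

```python
import logging
from pathlib import PurePath
from typing import Callable

log = logging.getLogger(__name__)


class MountError(Exception):
    """Hypothetical stand-in for the framework's expected failure type."""


def mount_or_fallback(
    do_mount: Callable[[], None], mount_point: str, fallback: PurePath
) -> PurePath:
    """Try to mount; fall back only on the failure types we expect."""
    try:
        do_mount()
    except (MountError, AssertionError) as e:
        log.warning(
            "Failed to mount at %s: %s; falling back to working path.",
            mount_point,
            e,
        )
        return fallback
    # Anything else (TypeError, AttributeError, ...) propagates to the caller.
    return PurePath(mount_point)
```

Listing the expected types keeps the fallback behavior for real mount failures while letting programming errors surface with their original traceback.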
try:
    node.execute(f"mkdir -p {mount_point}", shell=True, sudo=True)
    node.tools[Mount].mount(
        name=candidate,
        point=mount_point,
        fs_type=FileSystem.ext4,
        format_=True,
    )
except Exception as e:
    log.warning(
        f"Failed to mount {candidate} at {mount_point}: {e}; "
        "falling back to working path."
    )
    return node.working_path

log.info(f"Mounted {candidate} at {mount_point} for VM disk copies")
return PurePath(mount_point)
This path may format+mount a disk for the duration of the test run, but there’s no corresponding unmount/cleanup. That leaves host state changed for any subsequent test cases in the same environment. Consider tracking whether the disk was mounted by this test and unmounting it in a finally/after_case hook (or at least documenting the intentional persistent mount).
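The cleanup idea in this comment can be illustrated with a small context manager. This is a sketch only: `temporary_mount` and its two callables are hypothetical, and in the real suite the unmount would more likely live in a finally block or an after_case hook, as the comment suggests.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def temporary_mount(
    mount: Callable[[], None], umount: Callable[[], None]
) -> Iterator[None]:
    """Mount on entry and always unmount on exit, even if the body raises."""
    mount()  # if this raises, nothing was mounted, so there is nothing to undo
    try:
        yield
    finally:
        umount()


# Usage with placeholder callables that only record what happened.
events: List[str] = []
with temporary_mount(
    lambda: events.append("mount"), lambda: events.append("umount")
):
    events.append("copy images")
print(events)
```

Because the unmount runs only after a successful mount, a disk that was already mounted by someone else is never touched by this cleanup.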
candidate = self._find_unused_nvme_disk(disks)
if candidate is None:
    log.warning(
Oh I see. What do you suggest instead? log.info?
Changed to log.info
The previous logic in MshvHostStressTestSuite._get_disk_img_copy_path treated existence of /mnt as proof that a large resource disk was mounted there. On NVMe-based Azure SKUs the temporary disks show up as /dev/nvme*n1 and are not mounted anywhere, so the test ended up copying many large guest images onto the small OS disk and ran out of space.

Use lsblk to detect what is actually mounted at /mnt/resource and /mnt and reuse those when present. Otherwise pick an unused nvme*n1 disk (not the OS disk, no partitions, nothing mounted), format it as ext4, and mount it at /mnt/resource. Fall back to the working path if no suitable disk is found or the mount fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@microsoft.com>
89c06b4 to a973479