Skip to content

mshv: mount nvme resource disk for VM image copies#4434

Open
anirudhrb wants to merge 1 commit intomainfrom
anrayabh/mshv_stress_vm_create_resource_disk
Open

mshv: mount nvme resource disk for VM image copies#4434
anirudhrb wants to merge 1 commit intomainfrom
anrayabh/mshv_stress_vm_create_resource_disk

Conversation

@anirudhrb
Copy link
Copy Markdown
Collaborator

The previous logic in MshvHostStressTestSuite._get_disk_img_copy_path treated existence of /mnt as proof that a large resource disk was mounted there. On NVMe-based Azure SKUs the temporary disks show up as /dev/nvme*n1 and are not mounted anywhere, so the test ended up copying many large guest images onto the small OS disk and ran out of space.

Use lsblk to detect what is actually mounted at /mnt/resource and /mnt and reuse those when present. Otherwise pick an unused nvme*n1 disk (not the OS disk, no partitions, nothing mounted), format it as ext4, and mount it at /mnt/resource. Fall back to the working path if no suitable disk is found or the mount fails.

Description

Related Issue

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation update

Checklist

  • Description is filled in above
  • No credentials, secrets, or internal details are included
  • Peer review requested (if not, add required peer reviewers after raising PR)
  • Tests executed and results posted below

Test Validation

Key Test Cases:

Impacted LISA Features:

Tested Azure Marketplace Images:

Test Results

Image VM Size Result
PASSED / FAILED / SKIPPED

Copilot AI review requested due to automatic review settings April 24, 2026 10:46
@github-actions
Copy link
Copy Markdown

🤖 AI Test Selection

No test cases were selected for this PR.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the MSHV host stress test to avoid copying many large guest disk images onto the OS disk on NVMe-based Azure SKUs by detecting whether /mnt/resource or /mnt are actually backed by a mounted disk, and (when neither is mounted) attempting to format+mount an unused nvme*n1 disk at /mnt/resource for disk image copies.

Changes:

  • Replace the prior “/mnt directory exists” heuristic with lsblk-based mountpoint detection for /mnt/resource and /mnt.
  • Add logic to select an unused NVMe namespace disk and mount it as ext4 at /mnt/resource to host VM disk image copies, with fallback to the node working path.
  • Add helper methods for mountpoint detection and NVMe disk selection.

Comment on lines +287 to +298
def _find_unused_nvme_disk(self, disks: List[DiskInfo]) -> Optional[str]:
nvme_pattern = re.compile(r"^nvme\d+n1$")
for disk in disks:
if disk.is_os_disk:
continue
if not nvme_pattern.match(disk.name):
continue
if disk.partitions:
continue
if disk.is_mounted:
continue
return f"/dev/{disk.name}"
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_find_unused_nvme_disk() will select any non-OS NVMe disk with no partitions and not mounted, and get_disk_img_copy_path() formats it (format=True). On hosts with additional NVMe data disks that are intentionally unmounted, this can wipe data. Consider adding stronger checks before formatting (e.g., require disk.fstype/uuid empty and/or validate it’s the expected temporary/resource disk via blkid label or platform disk feature) and refuse to format when the disk looks previously used.

Copilot uses AI. Check for mistakes.
Comment on lines +267 to +272
except Exception as e:
log.warning(
f"Failed to mount {candidate} at {mount_point}: {e}; "
"falling back to working path."
)
return node.working_path
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching a broad Exception here will also swallow unexpected programming errors (including assertion failures from mount command checks), making failures harder to diagnose. Please catch the specific exception types expected from mount/format operations (e.g., LisaException and AssertionError) and let unexpected exceptions propagate.

Copilot uses AI. Check for mistakes.
Comment on lines +259 to +275
try:
node.execute(f"mkdir -p {mount_point}", shell=True, sudo=True)
node.tools[Mount].mount(
name=candidate,
point=mount_point,
fs_type=FileSystem.ext4,
format_=True,
)
except Exception as e:
log.warning(
f"Failed to mount {candidate} at {mount_point}: {e}; "
"falling back to working path."
)
return node.working_path

log.info(f"Mounted {candidate} at {mount_point} for VM disk copies")
return PurePath(mount_point)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path may format+mount a disk for the duration of the test run, but there’s no corresponding unmount/cleanup. That leaves host state changed for any subsequent test cases in the same environment. Consider tracking whether the disk was mounted by this test and unmounting it in a finally/after_case hook (or at least documenting the intentional persistent mount).

Copilot uses AI. Check for mistakes.

candidate = self._find_unused_nvme_disk(disks)
if candidate is None:
log.warning(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use log.warning

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. What do you suggest instead? log.info?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to log.info

The previous logic in MshvHostStressTestSuite._get_disk_img_copy_path
treated existence of /mnt as proof that a large resource disk was
mounted there. On NVMe-based Azure SKUs the temporary disks show up as
/dev/nvme*n1 and are not mounted anywhere, so the test ended up copying
many large guest images onto the small OS disk and ran out of space.

Use lsblk to detect what is actually mounted at /mnt/resource and /mnt
and reuse those when present. Otherwise pick an unused nvme*n1 disk
(not the OS disk, no partitions, nothing mounted), format it as ext4,
and mount it at /mnt/resource. Fall back to the working path if no
suitable disk is found or the mount fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@microsoft.com>
@anirudhrb anirudhrb force-pushed the anrayabh/mshv_stress_vm_create_resource_disk branch from 89c06b4 to a973479 Compare April 28, 2026 13:30
@github-actions
Copy link
Copy Markdown

🤖 AI Test Selection

No test cases were selected for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants