mantle: rework handling of MachineOptions#4510
Draft
dustymabe wants to merge 11 commits intocoreos:mainfrom
Draft
mantle: rework handling of MachineOptions#4510dustymabe wants to merge 11 commits intocoreos:mainfrom
dustymabe wants to merge 11 commits intocoreos:mainfrom
Conversation
|
Skipping CI for Draft Pull Request. |
Contributor
There was a problem hiding this comment.
Code Review
This pull request refactors machine option handling by consolidating platform-specific configurations into a unified MachineOptions struct and introducing a BaseMachine type to share common logic across providers. It also implements a machine state tracking system. Feedback indicates a potential regression in the test harness, where the new background polling logic for machine boot states makes timeouts ineffective and introduces redundant, inefficient checks for states already managed by synchronous creation calls.
1f72b95 to
3cb9b5f
Compare
Store the test execution timeout context on RuntimeConfig.TestExecTimeout so that BaseCluster.SSH can enforce the timeout without requiring callers to pass a context.Context through every function signature. The harness sets TestExecTimeout to h.TimeoutContext() when building the RuntimeConfig in runTest(). BaseCluster.SSH uses Start()+Wait() with a select on this context, closing the SSH session when the context is cancelled. This enables us to essentially have timeout checking on every SSH() call we do. Written-by: <anthropic/claude-opus-4-6>
There were two deferred functions and we might as well combine them into one.
In the previous commit we plumbed through timeout/cancelling into every SSH command so now we don't really need to SkipStartMachine and then call mach.Start() inside a RunWithExecTimeoutCheck() here in RunTest any longer.
Some tests don't start their machines via the Harness, but rather directly in the tests via `NewMachine()`. In those cases we were releasing the memory reservation prematurely. Let's asynchronously wait for the machines to be up and then release the memory reservation.
The early return guard that checks t.ReservedMemoryCountMiB == 0 was performed outside the mutex. Since releaseMemoryCount can be called concurrently from both the async memory-release goroutine and the deferred cleanup function, this unsynchronized read could see a stale value under the Go memory model, potentially causing a double-subtract. Move the check inside the mutex to ensure the read of t.ReservedMemoryCountMiB has a proper happens-before relationship with the write that zeroes it. Written-by: <anthropic/claude-opus-4-6>
Maintaining separate QemuMachineOptions and MachineOptions structs led to complications: awkward two-level nesting at call sites, a shadowed Firmware field, and type assertions needed just to pass QEMU-specific options. Move the QEMU-only fields (HostForwardPorts, DisablePDeathSig, OverrideBackingFile, Nvme, Cex) into MachineOptions and delete the QemuMachineOptions type. Add EnsureNoQEMUOnlyOptions() so non-QEMU platforms (aws, azure, gcp, do, esx, openstack) reject these options early with a clear error. Keep NewMachineWithQemuOptions as a thin alias on qemu/qemuiso clusters for backward compatibility. Written-by: <anthropic/claude-opus-4.6>
MultiPathDisk, PrimaryDisk, MinMemory, NumaNodes, AdditionalNics, AppendKernelArgs, AppendFirstbootKernelArgs, and Firmware are only implemented by QEMU-based platforms. Move them below the QEMU-only comment boundary and check them in EnsureNoQEMUOnlyOptions(), removing the now-redundant per-platform rejection checks in the non-QEMU cluster implementations. The remaining cross-platform fields are AdditionalDisks (qemu, qemuiso, azure, gcloud), MinDiskSize (qemu, aws), and InstanceType (azure). Written-by: <anthropic/claude-opus-4.6>
Now that QemuMachineOptions has been folded into MachineOptions, NewMachineWithQemuOptions is just a trivial wrapper around NewMachineWithOptions. Delete it and update all callers to use NewMachineWithOptions through the Cluster interface directly. This removes the need for type assertions to *qemu.Cluster in tests that only used them to access NewMachineWithQemuOptions. Tests that still need the type assertion (luks tang setup, ostree sync, rhcos upgrade) keep it for platform-specific branching logic, but call NewMachineWithOptions through the interface rather than through the concrete type. Written-by: <anthropic/claude-opus-4.6>
Previously, kola spawn only entered the DisablePDeathSig / spawnMachineOptions code path when the platform was "qemu". This guard existed because DisablePDeathSig and the machine options JSON file were QEMU-specific concepts that required a type assertion to *qemu.Cluster. Now that MachineOptions is the single unified type used by all platforms, and non-QEMU platforms call EnsureNoQEMUOnlyOptions() in their NewMachineWithOptions(), the guard is no longer necessary: - On QEMU: behavior is unchanged. - On non-QEMU with --remove=false: previously this silently ignored the user's intent (DisablePDeathSig has no meaning outside QEMU) and created the machine via NewMachine() with no options. Now it calls NewMachineWithOptions() which will reject DisablePDeathSig with a clear error, surfacing the fact that the flag combination is unsupported rather than silently doing the wrong thing. - On non-QEMU with --machine-options: same improvement — the user gets a clear error if the JSON contains QEMU-only fields. Written-by: <anthropic/claude-opus-4.6>
…ptions The Test struct duplicated many fields from platform.MachineOptions (MultiPathDisk, AdditionalDisks, PrimaryDisk, MinMemory, MinDiskSize, NumaNodes, AdditionalNics, AppendKernelArgs, AppendFirstbootKernelArgs, InstanceType). The harness then manually copied each one into a MachineOptions before creating machines. Replace all of these with a single MachineOptions field on the Test struct. The harness now uses t.MachineOptions directly. Fields that are not MachineOptions (InjectContainer, ReservedMemoryCountMiB) stay on the Test struct. Update the two test files (multipath.go, network.go) that set these fields in their registrations, and update harness.go external test registration to populate the MachineOptions sub-struct. Written-by: <anthropic/claude-opus-4.6>
The basic.uefi, basic.uefi-secure, and basic.nvme tests previously set ClusterSize to 0 and manually created machines inside wrapper functions (uefiWithBasicTests, uefiSecureWithBasicTests, nvmeBasicTests) just to pass Firmware/Nvme in MachineOptions. They then manually called ScpKolet and LocalTests. Now that the Test struct carries MachineOptions directly, these tests can set ClusterSize to 1 with the desired MachineOptions and use LocalTests as their Run function. The harness handles machine creation, kolet upload, and native function dispatch automatically. This removes the three wrapper functions, runBasicTests, and the kola/cluster imports that were only needed for them. Written-by: <anthropic/claude-opus-4.6>
3cb9b5f to
c0cdea9
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
To simplify things a bit I'm proposing that we fold QemuMachineOptions into MachineOptions and then add a full platform.MachineOptions entry into the definition of a test (the Test struct). The Test struct itself already has several fields in it that are essentially MachineOptions so we just expand that here to cover all MachineOptions.
The last commit in this series shows the value of this by reducing the boilerplate that we need to run the same tests on different machines but with a slightly tweaked option.
I think this work will be valuable to reduce a lot of boilerplate that is being introduced when we fold testiso tests into
kola run(see #4377).