Skip to content

mantle: rework handling of MachineOptions#4510

Draft
dustymabe wants to merge 11 commits intocoreos:mainfrom
dustymabe:dusty-machine-options
Draft

mantle: rework handling of MachineOptions#4510
dustymabe wants to merge 11 commits intocoreos:mainfrom
dustymabe:dusty-machine-options

Conversation

@dustymabe
Copy link
Copy Markdown
Member

To simplify things a bit I'm proposing that we fold QemuMachineOptions into MachineOptions and then add a full platform.MachineOptions entry into the definition of a test (the Test struct). The Test struct itself already has several fields in it that are essentially MachineOptions so we just expand that here to cover all MachineOptions.

The last commit in this series shows the value of this by reducing the boilerplate that we need to run the same tests on different machines but with a slightly tweaked option.

I think this work will be valuable to reduce a lot of boilerplate that is being introduced when we fold testiso tests into kola run (see #4377).

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors machine option handling by consolidating platform-specific configurations into a unified MachineOptions struct and introducing a BaseMachine type to share common logic across providers. It also implements a machine state tracking system. Feedback indicates a potential regression in the test harness, where the new background polling logic for machine boot states makes timeouts ineffective and introduces redundant, inefficient checks for states already managed by synchronous creation calls.

@dustymabe dustymabe force-pushed the dusty-machine-options branch from 1f72b95 to 3cb9b5f Compare March 27, 2026 20:02
Store the test execution timeout context on RuntimeConfig.TestExecTimeout
so that BaseCluster.SSH can enforce the timeout without requiring
callers to pass a context.Context through every function signature.

The harness sets TestExecTimeout to h.TimeoutContext() when building the
RuntimeConfig in runTest(). BaseCluster.SSH uses Start()+Wait() with
a select on this context, closing the SSH session when the context is
cancelled.

This enables us to essentially have timeout checking on every SSH()
call we do.

Written-by: <anthropic/claude-opus-4-6>
There were two deferred functions and we might as well combine them
into one.
In the previous commit we plumbed through timeout/cancelling into
every SSH command so now we don't really need to SkipStartMachine
and then call mach.Start() inside a RunWithExecTimeoutCheck() here
in RunTest any longer.
Some tests don't start their machines via the Harness, but rather
directly in the tests via `NewMachine()`. In those cases we were
releasing the memory reservation prematurely. Let's asynchronously
wait for the machines to be up and then release the memory reservation.
The early return guard that checks t.ReservedMemoryCountMiB == 0 was
performed outside the mutex. Since releaseMemoryCount can be called
concurrently from both the async memory-release goroutine and the
deferred cleanup function, this unsynchronized read could see a stale
value under the Go memory model, potentially causing a double-subtract.

Move the check inside the mutex to ensure the read of
t.ReservedMemoryCountMiB has a proper happens-before relationship
with the write that zeroes it.

Written-by: <anthropic/claude-opus-4-6>
Maintaining separate QemuMachineOptions and MachineOptions structs led
to complications: awkward two-level nesting at call sites, a shadowed
Firmware field, and type assertions needed just to pass QEMU-specific
options.

Move the QEMU-only fields (HostForwardPorts, DisablePDeathSig,
OverrideBackingFile, Nvme, Cex) into MachineOptions and delete the
QemuMachineOptions type. Add EnsureNoQEMUOnlyOptions() so non-QEMU
platforms (aws, azure, gcp, do, esx, openstack) reject these options
early with a clear error. Keep NewMachineWithQemuOptions as a thin
alias on qemu/qemuiso clusters for backward compatibility.

Written-by: <anthropic/claude-opus-4.6>
MultiPathDisk, PrimaryDisk, MinMemory, NumaNodes, AdditionalNics,
AppendKernelArgs, AppendFirstbootKernelArgs, and Firmware are only
implemented by QEMU-based platforms. Move them below the QEMU-only
comment boundary and check them in EnsureNoQEMUOnlyOptions(), removing
the now-redundant per-platform rejection checks in the non-QEMU
cluster implementations.

The remaining cross-platform fields are AdditionalDisks (qemu, qemuiso,
azure, gcloud), MinDiskSize (qemu, aws), and InstanceType (azure).

Written-by: <anthropic/claude-opus-4.6>
Now that QemuMachineOptions has been folded into MachineOptions,
NewMachineWithQemuOptions is just a trivial wrapper around
NewMachineWithOptions. Delete it and update all callers to use
NewMachineWithOptions through the Cluster interface directly.

This removes the need for type assertions to *qemu.Cluster in tests
that only used them to access NewMachineWithQemuOptions. Tests that
still need the type assertion (luks tang setup, ostree sync,
rhcos upgrade) keep it for platform-specific branching logic, but
call NewMachineWithOptions through the interface rather than through
the concrete type.

Written-by: <anthropic/claude-opus-4.6>
Previously, kola spawn only entered the DisablePDeathSig /
spawnMachineOptions code path when the platform was "qemu". This
guard existed because DisablePDeathSig and the machine options JSON
file were QEMU-specific concepts that required a type assertion to
*qemu.Cluster.

Now that MachineOptions is the single unified type used by all
platforms, and non-QEMU platforms call EnsureNoQEMUOnlyOptions() in
their NewMachineWithOptions(), the guard is no longer necessary:

- On QEMU: behavior is unchanged.
- On non-QEMU with --remove=false: previously this silently ignored
  the user's intent (DisablePDeathSig has no meaning outside QEMU)
  and created the machine via NewMachine() with no options. Now it
  calls NewMachineWithOptions() which will reject DisablePDeathSig
  with a clear error, surfacing the fact that the flag combination
  is unsupported rather than silently doing the wrong thing.
- On non-QEMU with --machine-options: same improvement — the user
  gets a clear error if the JSON contains QEMU-only fields.

Written-by: <anthropic/claude-opus-4.6>
…ptions

The Test struct duplicated many fields from platform.MachineOptions
(MultiPathDisk, AdditionalDisks, PrimaryDisk, MinMemory, MinDiskSize,
NumaNodes, AdditionalNics, AppendKernelArgs, AppendFirstbootKernelArgs,
InstanceType). The harness then manually copied each one into a
MachineOptions before creating machines.

Replace all of these with a single MachineOptions field on the Test
struct. The harness now uses t.MachineOptions directly. Fields that
are not MachineOptions (InjectContainer, ReservedMemoryCountMiB) stay
on the Test struct.

Update the two test files (multipath.go, network.go) that set these
fields in their registrations, and update harness.go external test
registration to populate the MachineOptions sub-struct.

Written-by: <anthropic/claude-opus-4.6>
The basic.uefi, basic.uefi-secure, and basic.nvme tests previously
set ClusterSize to 0 and manually created machines inside wrapper
functions (uefiWithBasicTests, uefiSecureWithBasicTests,
nvmeBasicTests) just to pass Firmware/Nvme in MachineOptions. They
then manually called ScpKolet and LocalTests.

Now that the Test struct carries MachineOptions directly, these tests
can set ClusterSize to 1 with the desired MachineOptions and use
LocalTests as their Run function. The harness handles machine
creation, kolet upload, and native function dispatch automatically.

This removes the three wrapper functions, runBasicTests, and the
kola/cluster imports that were only needed for them.

Written-by: <anthropic/claude-opus-4.6>
@dustymabe dustymabe force-pushed the dusty-machine-options branch from 3cb9b5f to c0cdea9 Compare March 28, 2026 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant