mantle/platform: reduce boilerplate; fix timeout/memory tracking bugs #4509

Open
dustymabe wants to merge 5 commits into coreos:main from dustymabe:dusty-refactor-mantle-machine

Conversation

@dustymabe
Member

See individual commit messages.

The previous code had gaps in its timeout and memory tracking logic, both stemming from a misunderstanding of how "machines" are started.

Machines can be started by the harness OR directly by the tests (i.e. by setting ClusterSize=0 and then calling NewMachines() directly). Neither the timeout logic nor the memory tracking logic handled this case properly.

This change removes the SkipStartMachine logic and adds a goroutine that polls machine state and updates the timer and memory tracking accordingly.

@openshift-ci

openshift-ci bot commented Mar 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@dustymabe
Member Author

Posting this up here for testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a BaseMachine struct to consolidate common machine interface implementations across different platforms and adds a MachineState tracking mechanism. It also refactors the kola harness to monitor machine booting asynchronously in a background goroutine. Review feedback identifies a missing plog import in the new platform/machine.go file and a potential race condition in the harness goroutine that could lead to premature exit before all machines are fully accounted for.

@dustymabe dustymabe force-pushed the dusty-refactor-mantle-machine branch from 2cad42e to 0d0634f Compare March 27, 2026 19:43
@dustymabe
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors machine lifecycle management by introducing a BaseMachine type and a MachineState enumeration to track machine status (Started, Booted, Destroyed). The kola harness is updated to monitor machine states in a background goroutine, allowing tests to create machines dynamically while still enforcing execution timeouts. Feedback was provided regarding the use of non-cancellable time.Sleep calls within this new goroutine, which could lead to resource leaks if a test is cancelled before machines are fully initialized.

@dustymabe dustymabe force-pushed the dusty-refactor-mantle-machine branch from 0d0634f to ff8f436 Compare March 27, 2026 19:59
Store the test execution timeout context on RuntimeConfig.TestExecTimeout
so that BaseCluster.SSH can enforce the timeout without requiring
callers to pass a context.Context through every function signature.

The harness sets TestExecTimeout to h.TimeoutContext() when building the
RuntimeConfig in runTest(). BaseCluster.SSH uses Start()+Wait() with
a select on this context, closing the SSH session when the context is
cancelled.

This enables us to essentially have timeout checking on every SSH()
call we do.

Written-by: <anthropic/claude-opus-4-6>

There were two deferred functions and we might as well combine them
into one.

In the previous commit we plumbed through timeout/cancelling into
every SSH command so now we don't really need to SkipStartMachine
and then call mach.Start() inside a RunWithExecTimeoutCheck() here
in RunTest any longer.

Some tests don't start their machines via the Harness, but rather
directly in the tests via `NewMachine()`. In those cases we were
releasing the memory reservation prematurely. Let's asynchronously
wait for the machines to be up and then release the memory reservation.

The early return guard that checks t.ReservedMemoryCountMiB == 0 was
performed outside the mutex. Since releaseMemoryCount can be called
concurrently from both the async memory-release goroutine and the
deferred cleanup function, this unsynchronized read could see a stale
value under the Go memory model, potentially causing a double-subtract.

Move the check inside the mutex to ensure the read of
t.ReservedMemoryCountMiB has a proper happens-before relationship
with the write that zeroes it.

Written-by: <anthropic/claude-opus-4-6>
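The fix described above, moving the zero check inside the mutex, can be sketched as follows. The `tracker` and `test` types are hypothetical stand-ins. Only the names `releaseMemoryCount` and `ReservedMemoryCountMiB` come from the commit message, and the real signatures may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical global count of memory reserved by running tests.
type tracker struct {
	mu       sync.Mutex
	totalMiB int
}

// test holds one test's reservation; the field name mirrors the commit message.
type test struct {
	ReservedMemoryCountMiB int
}

// releaseMemoryCount gives back the test's reservation at most once.
// The zero check runs under the lock, so two concurrent callers cannot
// both pass it and subtract twice.
func (tr *tracker) releaseMemoryCount(t *test) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	if t.ReservedMemoryCountMiB == 0 {
		return
	}
	tr.totalMiB -= t.ReservedMemoryCountMiB
	t.ReservedMemoryCountMiB = 0
}

func main() {
	tr := &tracker{totalMiB: 4096}
	tst := &test{ReservedMemoryCountMiB: 1024}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // stands in for the async goroutine and the deferred cleanup
		wg.Add(1)
		go func() {
			defer wg.Done()
			tr.releaseMemoryCount(tst)
		}()
	}
	wg.Wait()

	fmt.Println(tr.totalMiB) // 3072: the reservation was subtracted once
}
```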
@dustymabe dustymabe force-pushed the dusty-refactor-mantle-machine branch from ff8f436 to a267a14 Compare March 28, 2026 22:09
@dustymabe
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the test execution timeout logic by integrating it into the base SSH implementation via a new TestExecTimeout context and removes the SkipStartMachine option across all platform providers. It also addresses a potential race condition in memory reservation and adds a delay for journal flushing. Feedback identifies a potential goroutine leak in the asynchronous machine polling loop and suggests replacing a fragile time.Sleep with a more robust synchronization method.

@dustymabe dustymabe marked this pull request as ready for review March 29, 2026 14:35