mantle/platform: reduce boilerplate; fix timeout/memory tracking bugs#4509
mantle/platform: reduce boilerplate; fix timeout/memory tracking bugs#4509dustymabe wants to merge 5 commits intocoreos:mainfrom
Conversation
|
Skipping CI for Draft Pull Request. |
|
Posting this up here for testing. |
There was a problem hiding this comment.
Code Review
This pull request introduces a BaseMachine struct to consolidate common machine interface implementations across different platforms and adds a MachineState tracking mechanism. It also refactors the kola harness to monitor machine booting asynchronously in a background goroutine. Review feedback identifies a missing plog import in the new platform/machine.go file and a potential race condition in the harness goroutine that could lead to premature exit before all machines are fully accounted for.
2cad42e to
0d0634f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors machine lifecycle management by introducing a BaseMachine type and a MachineState enumeration to track machine status (Started, Booted, Destroyed). The kola harness is updated to monitor machine states in a background goroutine, allowing tests to create machines dynamically while still enforcing execution timeouts. Feedback was provided regarding the use of non-cancellable time.Sleep calls within this new goroutine, which could lead to resource leaks if a test is cancelled before machines are fully initialized.
0d0634f to
ff8f436
Compare
Store the test execution timeout context on RuntimeConfig.TestExecTimeout so that BaseCluster.SSH can enforce the timeout without requiring callers to pass a context.Context through every function signature. The harness sets TestExecTimeout to h.TimeoutContext() when building the RuntimeConfig in runTest(). BaseCluster.SSH uses Start()+Wait() with a select on this context, closing the SSH session when the context is cancelled. This enables us to essentially have timeout checking on every SSH() call we do. Written-by: <anthropic/claude-opus-4-6>
There were two deferred functions and we might as well combine them into one.
In the previous commit we plumbed through timeout/cancelling into every SSH command so now we don't really need to SkipStartMachine and then call mach.Start() inside a RunWithExecTimeoutCheck() here in RunTest any longer.
Some tests don't start their machines via the Harness, but rather directly in the tests via `NewMachine()`. In those cases we were releasing the memory reservation prematurely. Let's asynchronously wait for the machines to be up and then release the memory reservation.
The early return guard that checks t.ReservedMemoryCountMiB == 0 was performed outside the mutex. Since releaseMemoryCount can be called concurrently from both the async memory-release goroutine and the deferred cleanup function, this unsynchronized read could see a stale value under the Go memory model, potentially causing a double-subtract. Move the check inside the mutex to ensure the read of t.ReservedMemoryCountMiB has a proper happens-before relationship with the write that zeroes it. Written-by: <anthropic/claude-opus-4-6>
ff8f436 to
a267a14
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the test execution timeout logic by integrating it into the base SSH implementation via a new TestExecTimeout context and removes the SkipStartMachine option across all platform providers. It also addresses a potential race condition in memory reservation and adds a delay for journal flushing. Feedback identifies a potential goroutine leak in the asynchronous machine polling loop and suggests replacing a fragile time.Sleep with a more robust synchronization method.
See individual commit messages.
Basically the previous code had some logic gaps when it comes to the timeout logic and the memory tracking logic that came from a misunderstanding about how "machines" are started.
Machines can be started by the harness OR directly by the tests (i.e. by setting ClusterSize=0 and then calling NewMachines() directly). The timeout logic, nor the memory tracking logic properly handled this case.
We change the code now to remove the skipstartmachines logic and add a go func that will poll machine state and behave accordingly with respect to the timer and memory tracking.