[CI Flakiness] Fix MSB4216 task host failures on macOS and dotnet-watch test hangs #53423

Open
mmitche wants to merge 2 commits into dotnet:main from mmitche:fix/ci-flakiness/main

mmitche (Member) commented Mar 12, 2026

Fix two major CI flakiness root causes:

  1. MSB4216 task host failures on macOS - add DOTNET_HOST_PATH to Helix scripts
  2. dotnet-watch test hangs - close stdin before kill, add exit timeout, reduce DCP timeouts

Baseline: 48/100 builds failed (48% failure rate)

Copilot AI review requested due to automatic review settings March 12, 2026 19:45
Copilot AI (Contributor) left a comment

Pull request overview

Reduces CI flakiness in Helix by improving MSBuild task host resolution (MSB4216 on macOS) and preventing dotnet-watch tests from hanging indefinitely when processes fail to terminate.

Changes:

  • Export DOTNET_HOST_PATH in Helix test runner scripts so MSBuild task hosts can reliably locate the dotnet host.
  • Bound DCP/Aspire watch operation timeouts in WatchableApp to avoid multi-hour hangs.
  • Improve process teardown in test utilities by closing stdin before killing and adding a bounded wait for process exit.
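
As a rough illustration of the first change, the sketch below shows one way a POSIX shell runner could export DOTNET_HOST_PATH before tests start. The function name `set_dotnet_host_path` and the idea of passing the SDK directory as an argument are assumptions made for this example; the actual contents of build/RunTestsOnHelix.sh are not shown in this conversation.

```shell
# Hypothetical helper, not taken from RunTestsOnHelix.sh.
# Exports DOTNET_HOST_PATH when an executable file named "dotnet"
# exists directly under the directory passed as $1.
set_dotnet_host_path() {
  sdk_root="$1"
  candidate="$sdk_root/dotnet"
  if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
    DOTNET_HOST_PATH="$candidate"
    export DOTNET_HOST_PATH
    return 0
  fi
  # Leave any existing value untouched and report the problem.
  echo "dotnet host not found under: $sdk_root" >&2
  return 1
}
```

A runner that already knows its SDK directory could call something like `set_dotnet_host_path "$DOTNET_ROOT" || exit 1` early on, so a missing host fails fast instead of surfacing later as a task host error.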

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/Microsoft.DotNet.HotReload.Test.Utilities/WatchableApp.cs | Replaces “effectively infinite” DCP/Aspire timeouts with bounded (300s) values to avoid Helix-duration hangs. |
| test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs | Improves disposal reliability by closing stdin before kill and adding a 30s exit wait timeout. |
| build/RunTestsOnHelix.sh | Sets DOTNET_HOST_PATH for Helix runs (notably fixes MSB4216 on macOS task hosts). |
| build/RunTestsOnHelix.cmd | Same as above for Windows Helix runs. |

Comments suppressed due to low confidence (1)

test/Microsoft.DotNet.HotReload.Test.Utilities/AwaitableProcess.cs:263

  • If the 30s timeout elapses, DisposeAsync continues without ensuring _processExitAwaiter has completed. In that case WaitForProcessExitAsync() may later try to call _outputCompletionSource.Cancel() after _outputCompletionSource.Dispose() runs here, which can throw ObjectDisposedException and introduce new flakiness. Consider either (a) not disposing _outputCompletionSource until _processExitAwaiter has completed, or (b) making WaitForProcessExitAsync() tolerate _outputCompletionSource being disposed (e.g., catch ObjectDisposedException around Cancel()).
            Process.Dispose();

            _outputCompletionSource.Dispose();
        }

mmitche force-pushed the fix/ci-flakiness/main branch from 810cbfc to 666a7d6 on March 12, 2026 23:02
mmitche requested review from a team and tmat as code owners on March 12, 2026 23:02
mmitche force-pushed the fix/ci-flakiness/main branch from 666a7d6 to b79f8b6 on March 13, 2026 01:06
mmitche requested a review from a team as a code owner on March 13, 2026 01:06
mmitche force-pushed the fix/ci-flakiness/main branch 10 times, most recently from 004d47a to 329b2a7, on March 13, 2026 19:52
Root causes fixed:
1. MSB4216 Task Host Failures (macOS): Set DOTNET_HOST_PATH in Helix scripts
2. AspireServiceFactory Semaphore Deadlock: Release semaphore when count==0 at disposal
3. AwaitableProcess Timeout: Cap per-operation timeout at 5min, close stdin before kill
4. Razor Tool Test Thread Starvation: Use dedicated thread for blocking dispatcher

Validation: 6 consecutive green builds (18/18 jobs), then 1 infrastructure flake
(Roslyn NuGet restore timeout - unrelated to changes)
Baseline: 48/100 builds failed (48% failure rate)
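
Item 3 can be pictured with a small POSIX shell sketch: close the child's stdin so it sees end-of-file, then wait for exit with a bounded grace period and force-kill only if that period runs out. This is an illustrative variant, not the AwaitableProcess code. The function name `stop_child`, the use of file descriptor 3 as the write end of the child's stdin, and the watchdog approach are all assumptions of this example.

```shell
# Hypothetical teardown helper (assumptions: the caller opened fd 3 as the
# write end of the child's stdin, and $1 is a child of this shell).
# Usage: stop_child PID GRACE_SECONDS
stop_child() {
  pid="$1"
  grace="$2"

  # Close our write end so a child reading stdin sees EOF.
  exec 3>&-

  # Watchdog: force-kill the child if it is still running after the grace
  # period. Its sleep may linger until the period elapses, so its output is
  # detached from the caller's stdout and stderr.
  ( sleep "$grace"; kill -KILL "$pid" 2>/dev/null ) >/dev/null 2>&1 &
  watchdog=$!

  # Wait for the child; this returns when it exits on its own or is killed.
  status=0
  wait "$pid" || status=$?

  # The child is gone, so the watchdog is no longer needed.
  kill "$watchdog" 2>/dev/null || true
  wait "$watchdog" 2>/dev/null || true

  return "$status"
}
```

Closing stdin first gives a process that reads its input a chance to exit cleanly, and the grace period bounds how long teardown can take when it does not.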
mmitche force-pushed the fix/ci-flakiness/main branch from 329b2a7 to 5a8c007 on March 13, 2026 19:52

2 participants