Skip to content

fix(orch): uffd handle no such process#2352

Draft
jakubno wants to merge 3 commits intomainfrom
fix/uffd-no-such-process
Draft

fix(orch): uffd handle no such process#2352
jakubno wants to merge 3 commits intomainfrom
fix/uffd-no-such-process

Conversation

@jakubno
Copy link
Copy Markdown
Member

@jakubno jakubno commented Apr 10, 2026

No description provided.

@cursor
Copy link
Copy Markdown

cursor bot commented Apr 10, 2026

PR Summary

Medium Risk
Touches the userfaultfd page-fault handling path; while the change is small, it alters error handling during memory population and could mask unexpected process-exit conditions if misclassified.

Overview
Updates UFFD page-fault servicing to treat ESRCH (no such process) from the UFFDIO_COPY operation as an expected teardown race: it records a tracing attribute, logs at debug level, and returns success instead of propagating an error.

Reviewed by Cursor Bugbot for commit 14431b3. Bugbot is set up for automated code reviews on this repo. Configure here.

if errors.Is(copyErr, unix.ESRCH) {
// The process that triggered the fault no longer exists — FC was killed
// or crashed while the page fetch was in flight. This is expected during
// sandbox teardown; treat it as benign.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping onFailure (i.e. SignalExit) here changes the loop exit path for full-FC-death scenarios.

With the old code, any copyErr called SignalExit -> the exitFd pipe became readable -> the poll loop exited cleanly at the hasEvent(exitFd.Revents, unix.POLLIN) check.

With this change, when ESRCH means the entire FC process is gone, the loop now relies on the UFFD fd itself signalling completion. Looking at the serve loop (lines 154-175), POLLHUP on the UFFD fd is only tracked as a metric and continued -- the loop does not exit on POLLHUP alone. If the kernel sets only POLLHUP (not POLLIN) when FC's mm is released, the loop will busy-spin indefinitely until fdExit.Close() is called by external teardown code.

For the case this fix targets -- a single vCPU thread dying mid-fault while the rest of FC is still alive -- skipping SignalExit is correct. But the comment says FC was killed or crashed, implying full-process death. For that case, the clean exit now depends on either (a) external teardown always calling fdExit.Close() before this path is hit, or (b) the UFFD fd becoming readable (POLLIN + read error) after the mm is released. Given the existing TODO at line 167 about incomplete POLLHUP handling, it is worth verifying (a) is always guaranteed in the crash path.

@DevAsign-Review-Agent
Copy link
Copy Markdown

DevAsign-Review-Agent commented Apr 11, 2026

Merge Score: 95/100

🟢 ███████████████████░ 95%

The PR gracefully handles the unix.ESRCH error during userfaultfd page copying. This error occurs when the faulting process exits or is killed while a page fetch is in flight, which is a normal occurrence during sandbox teardown. By treating it as benign, it prevents unnecessary error logging and potential teardown issues. The change is well-commented and correctly placed. A minor improvement would be to record this event in the OpenTelemetry span, similar to how EEXIST is handled.

Code Suggestions (1)

Low Priority (1)

  1. packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go (Line 369)
    Add an OpenTelemetry span attribute when unix.ESRCH is encountered.

Reasoning: Just like the unix.EEXIST case above it, recording this benign exit state in the trace span will help with debugging and observability, ensuring that traces accurately reflect why the page fault handling was aborted.

Suggested Code:

		span.SetAttributes(attribute.Bool("uffd.process_exited", true))
		u.logger.Debug(ctx, "UFFD serve copy error: process no longer exists", zap.Error(copyErr))
📊 Review Metadata
  • Processing Time: 18s
  • Analysis Date: 4/11/2026, 4:24:53 PM

🤖 This review was generated by AI. While we strive for accuracy, please use your judgment when applying suggestions.

💬 Questions about this review? Open an issue or contact support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants