100 changes: 75 additions & 25 deletions packages/api/internal/handlers/sandbox_pause.go
@@ -5,6 +5,7 @@
"errors"
"fmt"
"net/http"
"time"

"github.com/gin-gonic/gin"
"github.com/google/uuid"
@@ -22,6 +23,23 @@
"github.com/e2b-dev/infra/packages/shared/pkg/telemetry"
)

const (
pauseRequestTimeout = 15 * time.Minute
pauseRequestWaitTimeout = 60 * time.Second
)

type pauseRequestResult struct {
statusCode int
clientMsg string
}

func sendPauseInProgressResponse(c *gin.Context) {
c.JSON(http.StatusAccepted, gin.H{

[P2] Update pause API contract for 202 in-progress responses

Returning 202 Accepted here introduces a new success path that is not represented in the generated API contract/client (PostSandboxesSandboxIDPauseResponse and ParsePostSandboxesSandboxIDPauseResponse only model 401/404/409/500, with callers commonly expecting 204). When pause runs longer than 60s or the request context is canceled, clients generated from the current spec will treat this as an unexpected status and may surface false failures even though the pause continues in the background.
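
A minimal sketch of how a caller could tolerate the new status until the contract is updated. The `PauseOutcome` type and `classifyPauseStatus` helper are hypothetical names for illustration and are not part of the generated client:

```go
package main

import (
	"fmt"
	"net/http"
)

// PauseOutcome is a hypothetical client-side classification of the pause
// endpoint's status codes. It does not exist in the generated client.
type PauseOutcome int

const (
	PauseFailed     PauseOutcome = iota // any status other than 204 or 202
	PauseCompleted                      // 204: pause finished within the wait window
	PauseInProgress                     // 202: pause continues in the background
)

// classifyPauseStatus maps an HTTP status code to an outcome, treating 202
// as a non-failure so slow pauses are not reported as errors.
func classifyPauseStatus(statusCode int) PauseOutcome {
	switch statusCode {
	case http.StatusNoContent:
		return PauseCompleted
	case http.StatusAccepted:
		return PauseInProgress
	default:
		return PauseFailed
	}
}

func main() {
	for _, code := range []int{http.StatusNoContent, http.StatusAccepted, http.StatusConflict} {
		fmt.Printf("status %d -> outcome %d\n", code, classifyPauseStatus(code))
	}
}
```

This only works around the gap on the caller side; the spec still needs a 202 entry for generated clients to model the response.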

"code": int32(http.StatusAccepted),
"message": "Pause is still in progress. Check the sandbox info endpoint for the latest status.",
})

Check failure on line 40 in packages/api/internal/handlers/sandbox_pause.go

Claude / Claude Code Review

202 Accepted response not in OpenAPI spec - breaks SDK clients

The new 202 Accepted response from `sendPauseInProgressResponse` is not defined in the OpenAPI spec for `POST /sandboxes/{sandboxID}/pause`, which only declares 204, 409, 404, 401, and 500. SDK clients generated from the spec (Go, Python, JS) have no typed handling for 202 and may treat it as an unexpected response or assertion failure. The spec needs a 202 response entry added, and integration tests that assert `StatusCode == 204` will fail whenever pause takes longer than 60 seconds.
Comment on lines +35 to +40

What the bug is and how it manifests

The PR introduces sendPauseInProgressResponse which returns HTTP 202 Accepted with a JSON body {"code": 202, "message": "..."} when pause takes longer than 60 seconds or the client disconnects. However, the OpenAPI spec at spec/openapi.yml only defines the following responses for POST /sandboxes/{sandboxID}/pause: 204 (success), 409 (conflict), 404 (not found), 401 (unauthorized), and 500 (internal error). A 202 response is completely absent from the spec.

The specific code path that triggers it

In PostSandboxesSandboxIDPause, a goroutine runs pauseSandboxRequest with a 15-minute timeout (pauseRequestTimeout) on a context detached from the request. The handler then selects on the goroutine's result, a 60-second timer (pauseRequestWaitTimeout), and ctx.Done(). If the pause takes longer than 60 seconds or the HTTP client disconnects early, sendPauseInProgressResponse (sandbox_pause.go:35-40) is called and writes HTTP 202. This is a new code path introduced by this PR.
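
The code path described above can be reproduced in isolation with shortened timeouts. This is a sketch using only the standard library; `awaitPause`, `sleepThenNoContent`, `runFast`, and `runSlow` are hypothetical stand-ins, not functions from the PR:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// awaitPause mirrors the shape of the handler's select: work runs in a
// goroutine on a context detached from the request, and the caller gets
// either work's status code or 202 if the wait window or request ends first.
func awaitPause(ctx context.Context, waitTimeout, workTimeout time.Duration, work func(context.Context) int) int {
	resultCh := make(chan int, 1) // buffered so the goroutine never blocks on send

	go func() {
		workCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), workTimeout)
		defer cancel()
		resultCh <- work(workCtx)
	}()

	timer := time.NewTimer(waitTimeout)
	defer timer.Stop()

	select {
	case code := <-resultCh:
		return code
	case <-timer.C:
		return http.StatusAccepted
	case <-ctx.Done():
		return http.StatusAccepted
	}
}

// sleepThenNoContent is a stand-in for the real pause call: it succeeds with
// 204 after d unless its context ends first.
func sleepThenNoContent(d time.Duration) func(context.Context) int {
	return func(ctx context.Context) int {
		select {
		case <-time.After(d):
			return http.StatusNoContent
		case <-ctx.Done():
			return http.StatusInternalServerError
		}
	}
}

// runFast simulates a pause that finishes well inside the wait window.
func runFast() int {
	return awaitPause(context.Background(), 200*time.Millisecond, time.Second, sleepThenNoContent(10*time.Millisecond))
}

// runSlow simulates a pause that outlasts the wait window.
func runSlow() int {
	return awaitPause(context.Background(), 20*time.Millisecond, time.Second, sleepThenNoContent(500*time.Millisecond))
}

func main() {
	fmt.Println("fast pause:", runFast())
	fmt.Println("slow pause:", runSlow())
}
```

The same request therefore yields 204 or 202 depending only on how long the pause takes, which is why callers that hard-code 204 become timing-dependent.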

Why existing code doesn't prevent it

The generated code in api.gen.go is derived from the OpenAPI spec. Because the spec was not updated to include 202, the generated client type PostSandboxesSandboxIDPauseResponse has no JSON202 field, only JSON401, JSON404, JSON409, and JSON500. The switch statement in the Go client's ParsePostSandboxesSandboxIDPauseResponse has no case that matches 202, so only the raw body bytes are stored. Other SDKs generated from the same spec (Python, JS) may throw exceptions or return error states when they encounter the undocumented 202.

What the impact would be

Any caller that checks response.StatusCode == 204 to determine success will incorrectly treat a valid in-progress pause (202) as a failure. The integration test sandbox_pause_test.go:29 does exactly this with require.Equal(t, http.StatusNoContent, resp.StatusCode()) — it will fail whenever a pause crosses the 60-second threshold. External SDK users who handle only documented response codes will get unexpected behavior on 202.

How to fix it

Add a 202 response entry to the POST /sandboxes/{sandboxID}/pause operation in spec/openapi.yml, defining the response body shape (with code and message fields). Then regenerate api.gen.go so the Go client gains a JSON202 field and the parser handles it correctly. Integration tests should also be updated to accept either 202 or 204 as valid success responses.
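
As a rough illustration of the body shape a 202 entry would need to describe, the sketch below serves the same JSON from a stub handler and decodes it. The `pauseInProgressBody` struct, `stubPauseHandler`, `decodeInProgress`, and the `/sandboxes/example-id/pause` path are assumptions for the example; the real type would come from regenerating api.gen.go:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// pauseInProgressBody mirrors the JSON written by sendPauseInProgressResponse.
// It is hand-written here; a regenerated client would supply its own type.
type pauseInProgressBody struct {
	Code    int32  `json:"code"`
	Message string `json:"message"`
}

// stubPauseHandler is a stand-in for the real handler's 202 branch.
func stubPauseHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusAccepted)
	_ = json.NewEncoder(w).Encode(pauseInProgressBody{
		Code:    http.StatusAccepted,
		Message: "Pause is still in progress. Check the sandbox info endpoint for the latest status.",
	})
}

// decodeInProgress decodes the body only when the status is 202.
func decodeInProgress(statusCode int, body io.Reader) (*pauseInProgressBody, error) {
	if statusCode != http.StatusAccepted {
		return nil, fmt.Errorf("unexpected status %d", statusCode)
	}

	var out pauseInProgressBody
	if err := json.NewDecoder(body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode 202 body: %w", err)
	}

	return &out, nil
}

// demo runs the stub handler against an in-memory recorder (no network)
// and returns the status code and decoded message.
func demo() (int, string, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/sandboxes/example-id/pause", nil)
	stubPauseHandler(rec, req)

	res := rec.Result()
	defer res.Body.Close()

	body, err := decodeInProgress(res.StatusCode, res.Body)
	if err != nil {
		return res.StatusCode, "", err
	}

	return res.StatusCode, body.Message, nil
}

func main() {
	code, msg, err := demo()
	fmt.Println(code, msg, err)
}
```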

Step-by-step proof

  1. Client calls POST /sandboxes/{id}/pause on a sandbox whose snapshot takes 90 seconds.
  2. The handler goroutine starts and runs pauseSandboxRequest in the background with a 15-minute context.
  3. After about 60 seconds (pauseRequestWaitTimeout), the time.After case fires in the select.
  4. sendPauseInProgressResponse writes HTTP 202 with body {"code":202,"message":"Pause is still in progress..."}.
  5. A Python SDK client generated from the spec receives 202, which is not among its known responses, and may raise an ApiException or similar, treating the pause as failed.
  6. The integration test at sandbox_pause_test.go:29 calls require.Equal(t, http.StatusNoContent, resp.StatusCode()) which fails because resp.StatusCode() == 202, not 204.
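
Since the 202 message tells callers to check the sandbox info endpoint, a client that hits step 4 could poll for completion instead of failing. The sketch below assumes a hypothetical `stateFetcher` callback and a hypothetical `"paused"` state string; neither is taken from the API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stateFetcher is a hypothetical callback that returns the sandbox's current
// state, for example by calling the sandbox info endpoint.
type stateFetcher func(ctx context.Context) (string, error)

// waitUntilPaused polls fetch every interval until it reports "paused"
// (an assumed state name), fetch fails, or ctx ends.
func waitUntilPaused(ctx context.Context, fetch stateFetcher, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		state, err := fetch(ctx)
		if err != nil {
			return err
		}
		if state == "paused" {
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll uses a fake fetcher that reports "running" twice and then
// "paused", and returns how many times it was called.
func demoPoll() (int, error) {
	calls := 0
	fetch := func(context.Context) (string, error) {
		calls++
		if calls < 3 {
			return "running", nil
		}
		return "paused", nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := waitUntilPaused(ctx, fetch, 5*time.Millisecond)

	return calls, err
}

func main() {
	calls, err := demoPoll()
	fmt.Println("fetch calls:", calls, "err:", err)
}
```

A real client would also need an overall deadline that accounts for the 15-minute server-side pause timeout.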

}

func (a *APIStore) PostSandboxesSandboxIDPause(c *gin.Context, sandboxID api.SandboxID) {
ctx := c.Request.Context()
// Get team from context, use TeamContextKey
@@ -44,40 +62,36 @@

pause.LogInitiated(ctx, sandboxID, teamID.String(), pause.ReasonRequest)

err = a.orchestrator.RemoveSandbox(ctx, teamID, sandboxID, sandbox.RemoveOpts{Action: sandbox.StateActionPause})
var transErr *sandbox.InvalidStateTransitionError
resultCh := make(chan pauseRequestResult, 1)

switch {
case err == nil:
pause.LogSuccess(ctx, sandboxID, teamID.String(), pause.ReasonRequest)
case errors.Is(err, orchestrator.ErrSandboxNotFound):
apiErr := pauseHandleNotRunningSandbox(ctx, a.snapshotCache, sandboxID, teamID)
switch apiErr.Code {
case http.StatusConflict:
pause.LogSkipped(ctx, sandboxID, teamID.String(), pause.ReasonRequest, pause.SkipReasonAlreadyPaused)
case http.StatusNotFound:
pause.LogSkipped(ctx, sandboxID, teamID.String(), pause.ReasonRequest, pause.SkipReasonNotFound)
default:
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)
go func() {
pauseCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), pauseRequestTimeout)
defer cancel()

resultCh <- a.pauseSandboxRequest(pauseCtx, sandboxID, teamID)
close(resultCh)
}()

select {

Check failure on line 75 in packages/api/internal/handlers/sandbox_pause.go

Claude / Claude Code Review

Background pause goroutine untracked - killed on shutdown

The goroutine spawned at line 68 is untracked, so process shutdown can kill it mid-pause and leave the DB stuck in `BuildStatusSnapshotting`. When SIGTERM arrives after the handler has already returned 202, the background goroutine keeps running (its context is detached via `context.WithoutCancel`), but the process can exit before `UpdateEnvBuildStatus` is called. Tracking these long-running goroutines (up to 15 minutes) with a `sync.WaitGroup` or a goroutine pool would let shutdown wait for them to finish.
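
One possible shape for the tracking suggested above, sketched with a `sync.WaitGroup`. The `pauseTracker` type and its methods are hypothetical and do not exist in the codebase; the sleep stands in for the snapshot work:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// pauseTracker is a hypothetical helper that counts in-flight background
// pauses so shutdown can wait for them.
type pauseTracker struct {
	wg sync.WaitGroup
}

// Go runs fn in a tracked goroutine.
func (t *pauseTracker) Go(fn func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		fn()
	}()
}

// Wait blocks until every tracked goroutine has finished or ctx ends.
// It reports whether all goroutines finished.
func (t *pauseTracker) Wait(ctx context.Context) bool {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-ctx.Done():
		return false
	}
}

// demoDrain starts one tracked "pause" and then drains the tracker, as a
// shutdown hook might do after the HTTP server stops accepting requests.
func demoDrain() (drained bool, statusUpdated bool) {
	var (
		tracker pauseTracker
		mu      sync.Mutex
	)

	tracker.Go(func() {
		time.Sleep(20 * time.Millisecond) // stand-in for the snapshot work
		mu.Lock()
		statusUpdated = true // stand-in for the final build status update
		mu.Unlock()
	})

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	drained = tracker.Wait(ctx)

	mu.Lock()
	defer mu.Unlock()

	return drained, statusUpdated
}

func main() {
	drained, updated := demoDrain()
	fmt.Println("drained:", drained, "status updated:", updated)
}
```

In this sketch, new pauses must stop being registered before `Wait` is called, otherwise `Add` can race with the drain. The shutdown deadline would also need to cover the 15-minute pause timeout, which may exceed what the deployment environment allows.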
case result := <-resultCh:
if result.clientMsg == "" {
c.Status(result.statusCode)

return
}
a.sendAPIStoreError(c, apiErr.Code, apiErr.ClientMsg)

return
case errors.As(err, &transErr):
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)
a.sendAPIStoreError(c, http.StatusConflict, fmt.Sprintf("Sandbox '%s' cannot be paused while in '%s' state", sandboxID, transErr.CurrentState))
a.sendAPIStoreError(c, result.statusCode, result.clientMsg)

return
default:
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)
telemetry.ReportError(ctx, "error pausing sandbox", err)
case <-time.After(pauseRequestWaitTimeout):
sendPauseInProgressResponse(c)

a.sendAPIStoreError(c, http.StatusInternalServerError, "Error pausing sandbox")
return
case <-ctx.Done():
sendPauseInProgressResponse(c)

return
}

c.Status(http.StatusNoContent)
}

func pauseHandleNotRunningSandbox(ctx context.Context, cache *snapshotcache.SnapshotCache, sandboxID string, teamID uuid.UUID) api.APIError {
@@ -117,3 +131,39 @@
ClientMsg: "Error pausing sandbox",
}
}

func (a *APIStore) pauseSandboxRequest(ctx context.Context, sandboxID string, teamID uuid.UUID) pauseRequestResult {
err := a.orchestrator.RemoveSandbox(ctx, teamID, sandboxID, sandbox.RemoveOpts{Action: sandbox.StateActionPause})
var transErr *sandbox.InvalidStateTransitionError

switch {
case err == nil:
pause.LogSuccess(ctx, sandboxID, teamID.String(), pause.ReasonRequest)

return pauseRequestResult{statusCode: http.StatusNoContent}
case errors.Is(err, orchestrator.ErrSandboxNotFound):
apiErr := pauseHandleNotRunningSandbox(ctx, a.snapshotCache, sandboxID, teamID)
switch apiErr.Code {
case http.StatusConflict:
pause.LogSkipped(ctx, sandboxID, teamID.String(), pause.ReasonRequest, pause.SkipReasonAlreadyPaused)
case http.StatusNotFound:
pause.LogSkipped(ctx, sandboxID, teamID.String(), pause.ReasonRequest, pause.SkipReasonNotFound)
default:
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)
}

return pauseRequestResult{statusCode: apiErr.Code, clientMsg: apiErr.ClientMsg}
case errors.As(err, &transErr):
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)

return pauseRequestResult{
statusCode: http.StatusConflict,
clientMsg: fmt.Sprintf("Sandbox '%s' cannot be paused while in '%s' state", sandboxID, transErr.CurrentState),
}
default:
pause.LogFailure(ctx, sandboxID, teamID.String(), pause.ReasonRequest, err)
telemetry.ReportError(ctx, "error pausing sandbox", err)

return pauseRequestResult{statusCode: http.StatusInternalServerError, clientMsg: "Error pausing sandbox"}
}
}
11 changes: 8 additions & 3 deletions packages/api/internal/orchestrator/pause_instance.go
@@ -23,6 +23,8 @@ import (

type PauseQueueExhaustedError struct{}

const pauseCompletionTimeout = 15 * time.Minute

func (PauseQueueExhaustedError) Error() string {
return "The pause queue is exhausted"
}
@@ -38,6 +40,9 @@ func (o *Orchestrator) pauseSandbox(ctx context.Context, node *nodemanager.Node,
return err
}

ctx, cancel := context.WithTimeout(ctx, pauseCompletionTimeout)
defer cancel()

err = snapshotInstance(ctx, node, sbx, result.TemplateID, result.BuildID.String())
if errors.Is(err, PauseQueueExhaustedError{}) {
telemetry.ReportCriticalError(ctx, "pause queue exhausted", err)
@@ -70,12 +75,12 @@ func (o *Orchestrator) pauseSandbox(ctx context.Context, node *nodemanager.Node,
}

func snapshotInstance(ctx context.Context, node *nodemanager.Node, sbx sandbox.Sandbox, templateID, buildID string) error {
childCtx, childSpan := tracer.Start(ctx, "snapshot-instance")
ctx, childSpan := tracer.Start(ctx, "snapshot-instance")
defer childSpan.End()

client, childCtx := node.GetSandboxDeleteCtx(childCtx, sbx.SandboxID, sbx.ExecutionID)
client, ctx := node.GetSandboxDeleteCtx(ctx, sbx.SandboxID, sbx.ExecutionID)
_, err := client.Sandbox.Pause(
childCtx, &orchestrator.SandboxPauseRequest{
ctx, &orchestrator.SandboxPauseRequest{
SandboxId: sbx.SandboxID,
TemplateId: templateID,
BuildId: buildID,