11 changes: 11 additions & 0 deletions packages/orchestrator/pkg/sandbox/block/cache.go
Expand Up @@ -410,6 +410,17 @@ func (c *Cache) Path() string {
return c.filePath
}

func (c *Cache) Data() []byte {
c.mu.RLock()
defer c.mu.RUnlock()

if c.mmap == nil {
return nil
}

return []byte(*c.mmap)
Use-after-free risk: Data() acquires a read lock, checks c.mmap != nil, then releases the lock before returning the slice. The returned []byte is backed by mmap memory whose lifetime is controlled by Cache.Close() -> munmap. Any goroutine holding this slice (e.g. an upload goroutine calling StoreData for a large memfile) can be left with a dangling pointer if Cache.Close() is called concurrently. The lock only protects the nil check, not the validity of the backing memory during use. In template_build.go this slice is passed into uploader.Upload, which fans it out to concurrent HTTP goroutines with retries; if the cache is closed while a retry is in flight, the result is a SIGBUS or silently stale data.

Comment on lines 411 to +421

🔴 Cache.Data() returns a mmap-backed []byte after releasing its RLock, so Cache.Close() can call mmap.Unmap() while upload goroutines are still reading the slice, causing a SIGSEGV. Fix by following the existing addressBytes() pattern: return both the slice and a releaseCacheCloseLock closure that keeps the RLock held until the upload is complete.

Extended reasoning...

The bug in detail

Cache.Data() (cache.go:411-421) acquires c.mu.RLock(), constructs []byte(*c.mmap) — a slice pointing directly into the memory-mapped region — and then immediately releases the lock via defer. Once Data() returns, the caller holds a raw pointer into mmap memory with no reference-counting, no lock, and no lifetime guarantee. Concurrently, Cache.Close() acquires c.mu.Lock() (write lock) and calls c.mmap.Unmap(), which instructs the kernel to tear down the mapping. Any subsequent read of the returned slice accesses unmapped memory and produces a SIGSEGV.

The concrete crash path

In template_build.go the Upload() function spawns two errgroup goroutines: one calls memfileDiff.Data() to get the mmap slice, then passes it to uploadMemfile → object.StoreData → gcpmultipart.Upload, where it is sliced as data[start:end] per part and written to HTTP PUT requests. For GB-scale files this upload can take minutes. The RLock is released the moment Data() returns — before uploadMemfile is even called. If any concurrent code path calls diff.Close() during this window, the mmap is unmapped while the upload is still streaming parts.

Why Close() can be called concurrently

Multiple triggerable paths exist: (1) TTL-based eviction in templateCache — the OnEviction callback calls template.Close() → diff.Close() → cache.Close(); (2) disk-space-based eviction in DiffStore, which has a scheduleDelete path with a 60-second delay added explicitly to prevent race conditions with exposed slices, but 60 seconds is insufficient for multi-GB uploads; (3) context cancellation during sandbox cleanup or build failure triggering deferred Close() calls. Any of these racing with the upload goroutine produces the use-after-free.

The smoking gun: addressBytes() already solves this

The same file (cache.go) contains addressBytes(), which explicitly keeps the RLock held and returns a releaseCacheCloseLock func() to the caller (used correctly in FullFetchChunker.fetchToCache and StreamingChunker.runFetch). Data() was introduced as part of this PR without following that pattern, breaking the safety invariant the rest of the codebase upholds.

Step-by-step proof

  1. The errgroup in template_build.go calls memfileDiff.Data() — RLock acquired, []byte(*c.mmap) returned, RLock released (lock count = 0).
  2. Upload goroutine enters gcpmultipart.Upload and issues, say, 20 concurrent part PUTs, each reading data[i*ChunkSize:(i+1)*ChunkSize].
  3. After 60 seconds, disk-space eviction fires: DiffStore.deleteOldestFromCache → scheduleDelete timer fires → cache.Delete → OnEviction → diff.Close() → cache.Close().
  4. cache.Close() calls c.mu.Lock() — succeeds immediately because lock count is 0 — then c.mmap.Unmap().
  5. Part PUT goroutines are still executing bytes.NewReader(data[start:end]) and writing to the HTTP body; the underlying memory is now unmapped → SIGSEGV.

How to fix

Data() should follow the addressBytes() pattern: acquire RLock, return both the slice and a release closure, and require the caller to hold that closure for the lifetime of the upload. The template_build.go goroutines would call release() in a defer after uploadMemfile/uploadRootfs returns. Alternatively, Data() could copy the mmap into a heap buffer (safe but defeats the zero-copy goal of this PR).

}

Cache.Data() missing isClosed check before returning mmap

Low Severity

Cache.Data() does not check c.isClosed() before returning the mmap contents. After Close() calls c.mmap.Unmap(), the mmap pointer is not set to nil, so Data() passes the nil check and returns a byte slice referencing unmapped memory. Every other method on Cache (Slice, addressBytes, Size) checks isClosed() as a safety guard. If Data() is ever called after Close(), it would return a dangling reference.


Reviewed by Cursor Bugbot for commit 62a5e03.


func NewCacheFromProcessMemory(
ctx context.Context,
blockSize int64,
Expand Down
5 changes: 5 additions & 0 deletions packages/orchestrator/pkg/sandbox/block/chunk.go
Expand Up @@ -85,6 +85,7 @@ type Chunker interface {
ReadAt(ctx context.Context, b []byte, off int64) (int, error)
WriteTo(ctx context.Context, w io.Writer) (int64, error)
Close() error
Data() []byte
FileSize() (int64, error)
}

Expand Down Expand Up @@ -296,6 +297,10 @@ func (c *FullFetchChunker) Close() error {
return c.cache.Close()
}

func (c *FullFetchChunker) Data() []byte {
return c.cache.Data()
}

func (c *FullFetchChunker) FileSize() (int64, error) {
return c.cache.FileSize()
}
4 changes: 4 additions & 0 deletions packages/orchestrator/pkg/sandbox/block/streaming_chunk.go
Expand Up @@ -460,6 +460,10 @@ func (c *StreamingChunker) Close() error {
return c.cache.Close()
}

func (c *StreamingChunker) Data() []byte {
return c.cache.Data()
}

func (c *StreamingChunker) FileSize() (int64, error) {
return c.cache.FileSize()
}
5 changes: 5 additions & 0 deletions packages/orchestrator/pkg/sandbox/build/diff.go
Expand Up @@ -30,6 +30,7 @@ type Diff interface {
block.Slicer
CacheKey() DiffStoreKey
CachePath() (string, error)
Data() []byte
FileSize() (int64, error)
Init(ctx context.Context) error
}
Expand All @@ -42,6 +43,10 @@ func (n *NoDiff) CachePath() (string, error) {
return "", NoDiffError{}
}

func (n *NoDiff) Data() []byte {
return nil
}

func (n *NoDiff) Slice(_ context.Context, _, _ int64) ([]byte, error) {
return nil, NoDiffError{}
}
Expand Down
4 changes: 4 additions & 0 deletions packages/orchestrator/pkg/sandbox/build/local_diff.go
Expand Up @@ -110,6 +110,10 @@ func (b *localDiff) CachePath() (string, error) {
return b.cache.Path(), nil
}

func (b *localDiff) Data() []byte {
return b.cache.Data()
}

func (b *localDiff) Close() error {
return b.cache.Close()
}
Expand Down
46 changes: 46 additions & 0 deletions packages/orchestrator/pkg/sandbox/build/mocks/mockdiff.go

Some generated files are not rendered by default.

9 changes: 9 additions & 0 deletions packages/orchestrator/pkg/sandbox/build/storage_diff.go
Expand Up @@ -147,9 +147,18 @@

// The local file might not be synced.
func (b *StorageDiff) CachePath() (string, error) {
return b.cachePath, nil
}

func (b *StorageDiff) Data() []byte {
c, err := b.chunker.Wait()
if err != nil {
return nil
}

return c.Data()

StorageDiff.Data() can silently return incomplete data. When backed by a StreamingChunker, the mmap cache only contains bytes for ranges fetched via ReadAt; all other regions are zero-filled. The return value is a non-nil full-length slice, but callers in template_build.go only check if data == nil to skip the upload. If a StorageDiff-backed diff ever appears in the upload path (e.g. a snapshot whose base was only partially streamed), the upload proceeds with zero-filled holes and silently corrupts the stored object. Consider returning nil from Data() when completeness cannot be guaranteed.

}

Check failure on line 160 in packages/orchestrator/pkg/sandbox/build/storage_diff.go

Claude / Claude Code Review

StorageDiff.Data() swallows chunker error, silently skips upload

StorageDiff.Data() returns nil when b.chunker.Wait() fails instead of propagating the error, causing the caller in template_build.go to silently skip the upload (via 'if data == nil { return nil }') as though there were no diff. Every other method on StorageDiff (Close, ReadAt, Slice, WriteTo, FileSize) properly returns the error; Data() is the only method that swallows it. Fix by changing the Diff.Data() interface to return ([]byte, error) so errors can propagate.

StorageDiff.Data() silently swallows chunker errors

Medium Severity

StorageDiff.Data() returns nil when b.chunker.Wait() fails, which causes the caller in template_build.go to silently skip the upload (if data == nil { return nil }). Every other method on StorageDiff (e.g., Close, ReadAt, Slice, FileSize) properly propagates the error from chunker.Wait(). Silently skipping an upload on error could lead to an incomplete template build stored without any error being reported.


Reviewed by Cursor Bugbot for commit 33cbbd4.

Comment on lines 150 to +160

🔴 StorageDiff.Data() returns nil when b.chunker.Wait() fails instead of propagating the error, causing the caller in template_build.go to silently skip the upload (via 'if data == nil { return nil }') as though there were no diff. Every other method on StorageDiff (Close, ReadAt, Slice, WriteTo, FileSize) properly returns the error; Data() is the only method that swallows it. Fix by changing the Diff.Data() interface to return ([]byte, error) so errors can propagate.

Extended reasoning...

What the bug is

StorageDiff.Data() at storage_diff.go:152-160 calls b.chunker.Wait() and, if it returns an error (e.g. network failure during chunker initialization, context cancellation, or any error set via SetError), the method returns nil with no error signal:

func (b *StorageDiff) Data() []byte {
	c, err := b.chunker.Wait()
	if err != nil {
		return nil // error silently dropped
	}

	return c.Data()
}

The specific code path that triggers it

In template_build.go (Upload function), both the memfile and rootfs upload goroutines call Diff.Data() and then check the result:

data := memfileDiff.Data()
if data == nil {
	return nil // treated as NoDiff, upload silently skipped
}

return t.uploadMemfile(ctx, data)

Since NoDiff.Data() also returns nil (by design), the caller cannot distinguish between no diff to upload and a chunker failure. Both result in silently skipping the upload with nil returned from the goroutine.

Why existing code does not prevent it

Every other method on StorageDiff propagates errors from chunker.Wait(): Close, ReadAt, Slice, WriteTo, and FileSize all return the error to callers. The Data() method introduced by this PR is the sole exception. The root cause is structural: the Diff.Data() interface is declared as returning only []byte with no error return, making proper error propagation impossible without a signature change.

What the impact would be

If b.chunker.Wait() returns an error (e.g., upstream storage cannot be fetched during Init, or context is cancelled between Init and Upload), the memfile or rootfs upload is silently skipped. TemplateBuild.Upload() returns nil (success), Snapshot.Upload() returns nil, and the caller has no indication that the template build is incomplete. This creates a stored template entry missing its memfile or rootfs data: silent data loss during template builds.

How to fix it

Change the Diff interface Data() signature to return ([]byte, error). StorageDiff.Data() can then return (nil, err) on chunker failure and callers in template_build.go can properly check and propagate the error. NoDiff.Data() would return (nil, nil). Alternatively, store the initialization error in StorageDiff and expose it through a separate method.

Step-by-step proof

  1. StorageDiff is initialized with chunker = utils.NewSetOnce[block.Chunker](). During Init(), if OpenSeekable or Size fails, chunker.SetError(errMsg) is called.
  2. Later, TemplateBuild.Upload() runs. The memfile goroutine calls memfileDiff.Data().
  3. Inside Data(), b.chunker.Wait() returns the stored error immediately.
  4. Data() returns nil to the caller with no error.
  5. The caller checks: if data == nil { return nil }. Condition is true, returns nil just like NoDiff.
  6. The errgroup collects no errors. Upload() returns nil to the caller.
  7. The template is recorded as successfully built but the memfile was never written to GCS.


func (b *StorageDiff) FileSize() (int64, error) {
c, err := b.chunker.Wait()
if err != nil {
Expand Down
28 changes: 2 additions & 26 deletions packages/orchestrator/pkg/sandbox/snapshot.go
Expand Up @@ -26,30 +26,6 @@ func (s *Snapshot) Upload(
persistence storage.StorageProvider,
paths storage.Paths,
) error {
var memfilePath *string
switch r := s.MemfileDiff.(type) {
case *build.NoDiff:
default:
memfileLocalPath, err := r.CachePath()
if err != nil {
return fmt.Errorf("error getting memfile diff path: %w", err)
}

memfilePath = &memfileLocalPath
}

var rootfsPath *string
switch r := s.RootfsDiff.(type) {
case *build.NoDiff:
default:
rootfsLocalPath, err := r.CachePath()
if err != nil {
return fmt.Errorf("error getting rootfs diff path: %w", err)
}

rootfsPath = &rootfsLocalPath
}

templateBuild := NewTemplateBuild(
s.MemfileDiffHeader,
s.RootfsDiffHeader,
Expand All @@ -61,8 +37,8 @@ func (s *Snapshot) Upload(
ctx,
s.Metafile.Path(),
s.Snapfile.Path(),
memfilePath,
rootfsPath,
s.MemfileDiff,
s.RootfsDiff,
); err != nil {
return fmt.Errorf("error uploading template files: %w", err)
}
Expand Down
Expand Up @@ -130,6 +130,15 @@ func (s *peerSeekable) StoreFile(ctx context.Context, path string) error {
return fallback.StoreFile(ctx, path)
}

func (s *peerSeekable) StoreData(ctx context.Context, data []byte) error {
fallback, err := s.getOrOpenBase(ctx)
if err != nil {
return err
}

return fallback.StoreData(ctx, data)
}

// openPeerSeekableStream opens a ReadAtBuildSeekable stream, checks peer availability,
// and returns a recv function that yields data chunks starting with the first message's data.
// The passed context HAS to be canceled by the caller when done with the stream to avoid leaks.
Expand Down
21 changes: 12 additions & 9 deletions packages/orchestrator/pkg/sandbox/template_build.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import (

"golang.org/x/sync/errgroup"

"github.com/e2b-dev/infra/packages/orchestrator/pkg/sandbox/build"
"github.com/e2b-dev/infra/packages/shared/pkg/storage"
headers "github.com/e2b-dev/infra/packages/shared/pkg/storage/header"
)
Expand Down Expand Up @@ -58,13 +59,13 @@ func (t *TemplateBuild) uploadMemfileHeader(ctx context.Context, h *headers.Head
return nil
}

func (t *TemplateBuild) uploadMemfile(ctx context.Context, memfilePath string) error {
func (t *TemplateBuild) uploadMemfile(ctx context.Context, data []byte) error {
object, err := t.persistence.OpenSeekable(ctx, t.paths.Memfile(), storage.MemfileObjectType)
if err != nil {
return err
}

err = object.StoreFile(ctx, memfilePath)
err = object.StoreData(ctx, data)
if err != nil {
return fmt.Errorf("error when uploading memfile: %w", err)
}
Expand All @@ -91,13 +92,13 @@ func (t *TemplateBuild) uploadRootfsHeader(ctx context.Context, h *headers.Heade
return nil
}

func (t *TemplateBuild) uploadRootfs(ctx context.Context, rootfsPath string) error {
func (t *TemplateBuild) uploadRootfs(ctx context.Context, data []byte) error {
object, err := t.persistence.OpenSeekable(ctx, t.paths.Rootfs(), storage.RootFSObjectType)
if err != nil {
return err
}

err = object.StoreFile(ctx, rootfsPath)
err = object.StoreData(ctx, data)
if err != nil {
return fmt.Errorf("error when uploading rootfs: %w", err)
}
Expand Down Expand Up @@ -153,7 +154,7 @@ func uploadFileAsBlob(ctx context.Context, b storage.Blob, path string) error {
return nil
}

func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapfilePath string, memfilePath *string, rootfsPath *string) error {
func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapfilePath string, memfileDiff build.Diff, rootfsDiff build.Diff) error {
eg, ctx := errgroup.WithContext(ctx)

eg.Go(func() error {
Expand All @@ -173,19 +174,21 @@ func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapf
})

eg.Go(func() error {
if rootfsPath == nil {
data := rootfsDiff.Data()
if data == nil {
return nil
Comment on lines +177 to 179

P1: Avoid uploading directly from evictable mmap slices

TemplateBuild.Upload now pulls diff bytes via Diff.Data() and passes that slice to StoreData, but for local/storage diffs Data() returns the backing mmap (block.Cache.Data). Those mmaps can be closed by DiffStore eviction (scheduleDelete -> cache.Delete -> Diff.Close) while the async upload is still running, and unlike the previous StoreFile(path) flow there is no open file descriptor to keep the data valid after deletion. Under disk pressure with long uploads (especially ones outlasting the 60s eviction delay), this can turn into invalid memory access or corrupted/failed uploads.


}

return t.uploadRootfs(ctx, *rootfsPath)
return t.uploadRootfs(ctx, data)
})

eg.Go(func() error {
if memfilePath == nil {
data := memfileDiff.Data()
if data == nil {
return nil
}

return t.uploadMemfile(ctx, *memfilePath)
return t.uploadMemfile(ctx, data)
Errors from Data() silently skip uploads causing data loss

High Severity

StorageDiff.Data() swallows the error from chunker.Wait() and returns nil, which Upload() treats identically to NoDiff (no data to upload). If the chunker fails for any reason, the memfile or rootfs upload is silently skipped and the overall Upload returns success. The old code propagated errors — CachePath() errors caused the snapshot upload to fail, and even without that, StoreFile would fail explicitly on a missing/incomplete file. The new path hides failures, risking silent data loss during template builds.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit b3e8ad5.

})

eg.Go(func() error {
Expand Down