perf(storage): rewrite multipart upload — zero-copy, HTTP/2, no MD5 #2381
cache.go

```diff
@@ -410,6 +410,17 @@ func (c *Cache) Path() string {
 	return c.filePath
 }
 
+func (c *Cache) Data() []byte {
+	c.mu.RLock()
+	defer c.mu.RUnlock()
+
+	if c.mmap == nil {
+		return nil
+	}
+
+	return []byte(*c.mmap)
+}
```

**Review comment:** Use-after-free risk: `Data()` acquires a read lock, checks `c.mmap != nil`, then releases the lock before returning the slice. The returned `[]byte` is backed by mmap memory whose lifetime is controlled by `Cache.Close()` -> `munmap`. Any goroutine holding this slice (e.g. an upload goroutine calling `storeData` for a large memfile) can be left with a dangling pointer if `Cache.Close()` is called concurrently. The lock only protects the nil check, not the validity of the backing memory during use. In `template_build.go` this slice is passed into `uploader.Upload`, which fans it out to concurrent HTTP goroutines with retries; if the cache is closed while a retry is in flight, the result is a SIGBUS or silent stale data.

**Review comment (lines 411 to +421):** 🔴 `Cache.Data()` returns a mmap-backed `[]byte` after releasing its RLock, so `Cache.Close()` can call `mmap.Unmap()` while upload goroutines are still reading the slice, causing a SIGSEGV. Fix by following the existing `addressBytes()` pattern: return both the slice and a `releaseCacheCloseLock` closure that keeps the RLock held until the upload is complete.

*The bug in detail.* `Cache.Data()` (cache.go:411-421) acquires `c.mu.RLock()`, constructs `[]byte(*c.mmap)` (a slice pointing directly into the memory-mapped region), and then immediately releases the lock via `defer`. Once `Data()` returns, the caller holds a raw pointer into mmap memory with no reference counting, no lock, and no lifetime guarantee. Concurrently, `Cache.Close()` acquires `c.mu.Lock()` (the write lock) and calls `c.mmap.Unmap()`, which instructs the kernel to tear down the mapping. Any subsequent read of the returned slice accesses unmapped memory and produces a SIGSEGV.

*The concrete crash path.* In `template_build.go` the `Upload()` function spawns two errgroup goroutines: one calls `memfileDiff.Data()` to get the mmap slice, then passes it to `uploadMemfile` -> `object.StoreData` -> `gcpmultipart.Upload`, where it is sliced as `data[start:end]` per part and written to HTTP PUT requests. For GB-scale files this upload can take minutes. The RLock is released the moment `Data()` returns, before `uploadMemfile` is even called. If any concurrent code path calls `diff.Close()` during this window, the mmap is unmapped while the upload is still streaming parts.

*Why `Close()` can be called concurrently.* Multiple triggerable paths exist: (1) TTL-based eviction in `templateCache`, whose `OnEviction` callback calls `template.Close()` -> `diff.Close()` -> `cache.Close()`; (2) disk-space-based eviction in `DiffStore`, which has a `scheduleDelete` path with a 60-second delay added explicitly to prevent race conditions with exposed slices, but 60 seconds is insufficient for multi-GB uploads; (3) context cancellation during sandbox cleanup or build failure triggering deferred `Close()` calls. Any of these racing with the upload goroutine produces the use-after-free.

*The smoking gun: `addressBytes()` already solves this.* The same file (cache.go) contains `addressBytes()`, which explicitly keeps the RLock held and returns a `releaseCacheCloseLock func()` to the caller (used correctly in `FullFetchChunker.fetchToCache` and `StreamingChunker.runFetch`). `Data()` was introduced as part of this PR without following that pattern, breaking the safety invariant the rest of the codebase upholds.

*How to fix.* `Data()` should follow the `addressBytes()` pattern: acquire the RLock, return both the slice and a release closure, and require the caller to hold that closure for the lifetime of the upload. The `template_build.go` goroutines would call `release()` in a `defer` after `uploadMemfile`/`uploadRootfs` returns. Alternatively, `Data()` could copy the mmap into a heap buffer (safe, but it defeats the zero-copy goal of this PR).
cursor[bot] marked this conversation as resolved.

**Review comment (Cursor Bugbot):** `Cache.Data()` is missing an `isClosed` check before returning the mmap. Low severity. Reviewed by Cursor Bugbot for commit 62a5e03.
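The `addressBytes()`-style fix described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual e2b code: the `Cache` type here models the mmap region with a plain byte slice, and all names are simplified assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a simplified stand-in for the mmap-backed cache discussed above;
// a plain byte slice models the memory-mapped region.
type Cache struct {
	mu   sync.RWMutex
	data []byte // stands in for *c.mmap
}

// Data returns the backing slice together with a release closure.
// The RLock stays held until release() runs, so Close() (which needs the
// write lock) cannot tear down the backing memory while the slice is in use.
func (c *Cache) Data() ([]byte, func()) {
	c.mu.RLock()
	if c.data == nil {
		c.mu.RUnlock()
		return nil, func() {}
	}
	var once sync.Once // makes release safe to call more than once
	return c.data, func() { once.Do(c.mu.RUnlock) }
}

// Close takes the write lock, so it blocks until every outstanding
// release() has run, then drops the backing memory (the real code
// would call munmap here).
func (c *Cache) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = nil
}

func main() {
	c := &Cache{data: []byte("payload")}

	data, release := c.Data()

	done := make(chan struct{})
	go func() {
		c.Close() // blocks until release() is called
		close(done)
	}()

	// Simulated upload: reading data here is safe because the RLock is held.
	fmt.Printf("uploading %d bytes\n", len(data))

	release() // upload finished; now Close() may proceed
	<-done
	fmt.Println("closed only after the upload completed")
}
```

The key property is that `Close()` cannot interleave with a reader: it parks on the write lock until every outstanding release closure has run.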
```diff
 func NewCacheFromProcessMemory(
 	ctx context.Context,
 	blockSize int64,
```
packages/orchestrator/pkg/sandbox/build/storage_diff.go

```diff
@@ -147,9 +147,18 @@
 // The local file might not be synced.
 func (b *StorageDiff) CachePath() (string, error) {
 	return b.cachePath, nil
 }
 
+func (b *StorageDiff) Data() []byte {
+	c, err := b.chunker.Wait()
+	if err != nil {
+		return nil
+	}
+
+	return c.Data()
+}
```

**Review comment:** `StorageDiff.Data()` can silently return incomplete data. When backed by a `StreamingChunker`, the mmap cache only contains bytes for ranges fetched via `ReadAt`; all other regions are zero-filled. The return value is a non-nil full-length slice, but callers in `template_build.go` only check `if data == nil` to skip the upload. If a StorageDiff-backed diff ever appears in the upload path (e.g. a snapshot whose base was only partially streamed), the upload proceeds with zero-filled holes and silently corrupts the stored object. Consider returning nil from `Data()` when completeness cannot be guaranteed.
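The "return nil when completeness cannot be guaranteed" suggestion can be sketched like this; the `streamCache` type and its `fetched` counter are hypothetical stand-ins for the streaming chunker's internal bookkeeping, not real fields:

```go
package main

import "fmt"

// streamCache models a streaming chunker's backing buffer: ranges are
// filled lazily by ReadAt, so the buffer may contain zero-filled holes.
type streamCache struct {
	buf     []byte
	fetched int64 // bytes actually fetched so far (hypothetical tracking)
	size    int64
}

// Data returns the buffer only when every byte has been fetched; a
// partially streamed cache yields nil so callers skip the upload
// instead of storing zero-filled holes.
func (s *streamCache) Data() []byte {
	if s.fetched < s.size {
		return nil
	}
	return s.buf
}

func main() {
	partial := &streamCache{buf: make([]byte, 8), fetched: 4, size: 8}
	full := &streamCache{buf: []byte("complete"), fetched: 8, size: 8}

	fmt.Println(partial.Data() == nil) // partial cache refuses to hand out data
	fmt.Println(len(full.Data()))
}
```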
Check failure on line 160 in `packages/orchestrator/pkg/sandbox/build/storage_diff.go`
**Review comment (Cursor Bugbot, lines 150 to +160):** `StorageDiff.Data()` silently swallows chunker errors. Medium severity. Reviewed by Cursor Bugbot for commit 33cbbd4.

🔴 `StorageDiff.Data()` returns nil when `b.chunker.Wait()` fails instead of propagating the error, causing the caller in `template_build.go` to silently skip the upload (via `if data == nil { return nil }`) as though there were no diff. Every other method on `StorageDiff` (`Close`, `ReadAt`, `Slice`, `WriteTo`, `FileSize`) properly returns the error; `Data()` is the only method that swallows it. Fix by changing the `Diff.Data()` interface to return `([]byte, error)` so errors can propagate.

*What the bug is.* `StorageDiff.Data()` at storage_diff.go:152-160 calls `b.chunker.Wait()` and, if it returns an error (e.g. a network failure during chunker initialization, context cancellation, or any error set via `SetError`), the method returns nil with no error signal.

*The specific code path that triggers it.* In `template_build.go` (the `Upload` function), both the memfile and rootfs upload goroutines call `Diff.Data()` and then check only whether the result is nil. Since `NoDiff.Data()` also returns nil (by design), the caller cannot distinguish "no diff to upload" from "chunker failure". Both result in silently skipping the upload with nil returned from the goroutine.

*Why existing code does not prevent it.* Every other method on `StorageDiff` propagates errors from `chunker.Wait()`: `Close`, `ReadAt`, `Slice`, `WriteTo`, and `FileSize` all return the error to callers. The `Data()` method introduced by this PR is the sole exception. The root cause is structural: the `Diff.Data()` interface is declared as returning only `[]byte` with no error return, making proper error propagation impossible without a signature change.

*What the impact would be.* If `b.chunker.Wait()` returns an error (e.g. upstream storage cannot be fetched during `Init`, or the context is cancelled between `Init` and `Upload`), the memfile or rootfs upload is silently skipped. `TemplateBuild.Upload()` returns nil (success), `Snapshot.Upload()` returns nil, and the caller has no indication that the template build is incomplete. This creates a stored template entry missing its memfile or rootfs data: silent data loss during template builds.

*How to fix it.* Change the `Diff` interface's `Data()` signature to return `([]byte, error)`. `StorageDiff.Data()` can then return `(nil, err)` on chunker failure, and callers in `template_build.go` can properly check and propagate the error. `NoDiff.Data()` would return `(nil, nil)`. Alternatively, store the initialization error in `StorageDiff` and expose it through a separate method.
```diff
 func (b *StorageDiff) FileSize() (int64, error) {
 	c, err := b.chunker.Wait()
 	if err != nil {
```
template_build.go

```diff
@@ -8,6 +8,7 @@ import (
 	"golang.org/x/sync/errgroup"
 
+	"github.com/e2b-dev/infra/packages/orchestrator/pkg/sandbox/build"
 	"github.com/e2b-dev/infra/packages/shared/pkg/storage"
 	headers "github.com/e2b-dev/infra/packages/shared/pkg/storage/header"
 )
```
```diff
@@ -58,13 +59,13 @@ func (t *TemplateBuild) uploadMemfileHeader(ctx context.Context, h *headers.Head
 	return nil
 }
 
-func (t *TemplateBuild) uploadMemfile(ctx context.Context, memfilePath string) error {
+func (t *TemplateBuild) uploadMemfile(ctx context.Context, data []byte) error {
 	object, err := t.persistence.OpenSeekable(ctx, t.paths.Memfile(), storage.MemfileObjectType)
 	if err != nil {
 		return err
 	}
 
-	err = object.StoreFile(ctx, memfilePath)
+	err = object.StoreData(ctx, data)
 	if err != nil {
 		return fmt.Errorf("error when uploading memfile: %w", err)
 	}
```

```diff
@@ -91,13 +92,13 @@ func (t *TemplateBuild) uploadRootfsHeader(ctx context.Context, h *headers.Heade
 	return nil
 }
 
-func (t *TemplateBuild) uploadRootfs(ctx context.Context, rootfsPath string) error {
+func (t *TemplateBuild) uploadRootfs(ctx context.Context, data []byte) error {
 	object, err := t.persistence.OpenSeekable(ctx, t.paths.Rootfs(), storage.RootFSObjectType)
 	if err != nil {
 		return err
 	}
 
-	err = object.StoreFile(ctx, rootfsPath)
+	err = object.StoreData(ctx, data)
 	if err != nil {
 		return fmt.Errorf("error when uploading rootfs: %w", err)
 	}
```
```diff
@@ -153,7 +154,7 @@ func uploadFileAsBlob(ctx context.Context, b storage.Blob, path string) error {
 	return nil
 }
 
-func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapfilePath string, memfilePath *string, rootfsPath *string) error {
+func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapfilePath string, memfileDiff build.Diff, rootfsDiff build.Diff) error {
 	eg, ctx := errgroup.WithContext(ctx)
 
 	eg.Go(func() error {
```

```diff
@@ -173,19 +174,21 @@ func (t *TemplateBuild) Upload(ctx context.Context, metadataPath string, fcSnapf
 	})
 
 	eg.Go(func() error {
-		if rootfsPath == nil {
+		data := rootfsDiff.Data()
+		if data == nil {
 			return nil
 		}
 
-		return t.uploadRootfs(ctx, *rootfsPath)
+		return t.uploadRootfs(ctx, data)
 	})
 
 	eg.Go(func() error {
-		if memfilePath == nil {
+		data := memfileDiff.Data()
+		if data == nil {
 			return nil
 		}
 
-		return t.uploadMemfile(ctx, *memfilePath)
+		return t.uploadMemfile(ctx, data)
 	})
 
 	eg.Go(func() error {
```


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test body one