feat(source/cloud-storage): add Cloud Storage source with list_objects and read_object tools#3081
huangjiahua wants to merge 13 commits into googleapis:main
Conversation
Code Review
This pull request adds Google Cloud Storage integration, introducing a new source and tools for listing and reading objects. The implementation includes configuration, error handling, and tests. Feedback recommends capping listing page sizes at 1000 for consistency, implementing memory safety limits when reading objects, and updating documentation titles to include the 'Tool' suffix.
…s and read_object tools Adds a new project-scoped `cloud-storage` source using ADC, plus two read-only tools: `cloud-storage-list-objects` (with prefix/delimiter/pagination) and `cloud-storage-read-object` (with HTTP-style byte range and base64 payload). Introduces a GCS-aware error classifier in `cloudstoragecommon` that splits failures into Agent errors (missing bucket/object, bad request, unsatisfiable range) and Server errors (auth, IAM denial, quota, 5xx, cancellation) per DEVELOPER.md, replacing the coarse-grained `util.ProcessGcpError`. Ships YAML-parse unit tests, an error-classifier unit test, a range-parser unit test, a live-GCS integration test (12 sub-tests, UUID-suffixed bucket with self-cleanup), docs under `docs/en/integrations/cloud-storage/`, and a `cloud-storage` CI shard. The remaining 12 tools from the approved design doc land in follow-up PRs.
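The Agent-vs-Server split this commit describes can be sketched as a small status-code classifier. This is an illustrative reduction, not the actual `cloudstoragecommon` code: the real helper also matches GCS sentinel errors (missing bucket/object) and context cancellation before falling back to the HTTP status, and `agentError`/`serverError` here are stand-in names.

```go
package main

import "fmt"

// errorClass mirrors the Agent-vs-Server split described above.
type errorClass int

const (
	agentError  errorClass = iota // the LLM can fix the request and retry
	serverError                   // needs operator attention
)

// classifyStatus is a simplified sketch of the per-status mapping: bad
// request, not-found, and unsatisfiable range are agent-fixable; auth
// (401), IAM denial (403), quota (429), and 5xx are server-side.
func classifyStatus(code int) errorClass {
	switch code {
	case 400, 404, 416:
		return agentError
	default:
		// Includes 401/403/429 and all 5xx; unknown codes are treated
		// conservatively as server errors.
		return serverError
	}
}

func main() {
	fmt.Println(classifyStatus(404) == agentError)  // true
	fmt.Println(classifyStatus(403) == serverError) // true
}
```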
…dObject at 1 MiB

- ListObjects: pageSize() now clamps to the GCS API max of 1000 so callers that pass a larger max_results don't pre-allocate oversized buffers.
- ReadObject: reject objects/ranges over 1 MiB with the new sentinel cloudstoragecommon.ErrReadSizeLimitExceeded, which the classifier maps to an Agent error so the LLM can retry with a narrower 'range'.
- Docs + integration tests updated (two new sub-tests: oversize rejection and oversize-narrowed-by-range success).
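The ListObjects clamp is small enough to sketch in full. The `defaultPageSize` value below is an assumption for illustration; only the 1000 cap comes from the commit.

```go
package main

import "fmt"

const (
	defaultPageSize = 100  // assumed default when the caller passes 0 or less
	maxPageSize     = 1000 // GCS API per-page maximum
)

// pageSize clamps a caller-supplied max_results into the range the GCS
// API actually honors, so oversized values don't pre-allocate big buffers.
func pageSize(maxResults int) int {
	if maxResults <= 0 {
		return defaultPageSize
	}
	if maxResults > maxPageSize {
		return maxPageSize
	}
	return maxResults
}

func main() {
	fmt.Println(pageSize(-5), pageSize(0), pageSize(250), pageSize(5000)) // 100 100 250 1000
}
```

These four calls cover the same input classes (negative, zero, in-range, over-cap) as the unit test added later in the thread.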
… MiB 8 MiB gives agents more headroom for typical text/JSON/log payloads while still guarding against OOM. Doc and the oversize integration seed updated to match.
…ckage DefaultMaxReadBytes doesn't belong in errors.go — the limit is a source-side invariant, not an error-classification concern. The sentinel ErrReadSizeLimitExceeded stays in cloudstoragecommon because the classifier still needs to recognize it.
…geSize bounds Cleanup loop in the integration test was treating any iterator error as iterator.Done; now distinguishes the two and logs non-Done errors so flaky teardowns are debuggable. Also adds an internal unit test for pageSize covering 0, negative, in-range, and over-cap inputs.
MCP tool results only carry text today, so the previous base64-encoded content was unusable by the LLM. Validate object bytes with utf8.Valid and return plain-text content; non-UTF-8 objects surface as an agent-fixable ErrBinaryContent error. TODO notes mark the spots to revisit once MCP supports embedded resources.
Force-pushed 91a222a to 4919821.
Yuan325 left a comment
Hi @huangjiahua, thank you for the contribution! Please let me know if you need any clarifications.
```go
type Source struct {
	Config
	Client *storage.Client
```
```diff
- Client *storage.Client
+ client *storage.Client
```

Probably don't need to export this~
can we move this test function into cloudstorage_test.go instead?
```go
// results at 1000 per page; we enforce the same cap here so callers don't
// pre-allocate larger buffers and so the contract matches the tool's
// 'max_results' documentation.
func pageSize(maxResults int) int {
```
Should we trigger an AgentError in the tool during parameter extraction if the value exceeds 1,000? This makes the limit explicit to the agent/user, preventing confusion when the returned page count is lower than requested.

Yes, I did that in internal/tools/cloudstorage/cloudstoragelistobjects/cloudstoragelistobjects.go.
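Such a parameter-extraction guard could look roughly like the following. `agentError` and `validateMaxResults` are hypothetical stand-ins for the toolbox's actual types, sketched only to show the shape of the check:

```go
package main

import "fmt"

const maxPageSize = 1000 // GCS per-page cap

// agentError is a stand-in for the toolbox's agent-facing error type;
// the real tool would produce this during parameter extraction in Invoke.
type agentError struct{ msg string }

func (e *agentError) Error() string { return e.msg }

// validateMaxResults rejects over-cap values up front so the agent sees
// an explicit error instead of a silently clamped page.
func validateMaxResults(maxResults int) error {
	if maxResults > maxPageSize {
		return &agentError{fmt.Sprintf(
			"max_results %d exceeds the GCS per-page maximum of %d; retry with a smaller value",
			maxResults, maxPageSize)}
	}
	return nil
}

func main() {
	fmt.Println(validateMaxResults(500))  // <nil>
	fmt.Println(validateMaxResults(5000)) // error explaining the cap
}
```

Keeping the pageSize clamp alongside this check gives defense in depth, as the later commit notes.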
```go
func (s *Source) ListObjects(ctx context.Context, bucket, prefix, delimiter string, maxResults int, pageToken string) (map[string]any, error) {
	it := s.Client.Bucket(bucket).Objects(ctx, &storage.Query{
		Prefix:    prefix,
		Delimiter: delimiter,
```
Just confirming: will this be okay if these 2 values are ""?

Can confirm. I've also added an integration test with these 2 values being "".
```go
		Prefix:    prefix,
		Delimiter: delimiter,
	})
	pager := iterator.NewPager(it, pageSize(maxResults), pageToken)
```
Will this be okay if pageToken is ""?

Yes. Also added an integration test case.
Let's just move this test to cloudstoragereadobject_test.go
```go
}

func initStorageClient(ctx context.Context) (*storage.Client, error) {
	return storage.NewClient(ctx, option.WithUserAgent("genai-toolbox-integration-test"))
```
```diff
- return storage.NewClient(ctx, option.WithUserAgent("genai-toolbox-integration-test"))
+ return storage.NewClient(ctx)
```

Can we just init without the user-agent option for the integration test?
```go
	t.Fatalf("toolbox didn't start successfully: %s", err)
}

runCloudStorageToolGetTest(t)
```
Probably wouldn't need this function. We can just utilize the existing function (line 90 in 2375ffc).
Ref: mcp-toolbox/tests/looker/looker_integration_test.go, lines 358 to 366 in 2375ffc.
Do you mean we can remove this test?
```go
	}
}

func runCloudStorageListObjectsTest(t *testing.T, bucket string) {
```
For this function and runCloudStorageReadObjectTest(), is it possible to use Go's table-driven tests? There's probably a lot of duplication here in each t.Run().
Reference: line 231 in 2375ffc.
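The table-driven shape suggested here looks roughly like the following. `parseRange` is a simplified reimplementation of an HTTP-style byte-range parser (not the PR's actual code), and the loop in `main` plays the role each `t.Run` case would in a real test file:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRange is a sketch of an HTTP-style byte-range parser
// (bytes=0-999, bytes=-500, bytes=500-). offset/length follow the
// storage.Reader convention: length -1 means "to the end", and a
// negative offset means "last -offset bytes".
func parseRange(r string) (offset, length int64, err error) {
	spec, ok := strings.CutPrefix(r, "bytes=")
	if !ok {
		return 0, 0, fmt.Errorf("range %q: missing bytes= prefix", r)
	}
	start, end, ok := strings.Cut(spec, "-")
	if !ok {
		return 0, 0, fmt.Errorf("range %q: missing '-'", r)
	}
	switch {
	case start == "" && end != "": // bytes=-500: last 500 bytes
		n, err := strconv.ParseInt(end, 10, 64)
		if err != nil {
			return 0, 0, err
		}
		return -n, -1, nil
	case start != "" && end == "": // bytes=500-: from 500 to EOF
		o, err := strconv.ParseInt(start, 10, 64)
		if err != nil {
			return 0, 0, err
		}
		return o, -1, nil
	case start != "" && end != "": // bytes=0-999: inclusive range
		o, err := strconv.ParseInt(start, 10, 64)
		if err != nil {
			return 0, 0, err
		}
		e, err := strconv.ParseInt(end, 10, 64)
		if err != nil || e < o {
			return 0, 0, fmt.Errorf("range %q: bad bounds", r)
		}
		return o, e - o + 1, nil
	default:
		return 0, 0, fmt.Errorf("range %q: empty bounds", r)
	}
}

// In a real test file each case would get its own t.Run(c.name, ...);
// here the same table is checked in a plain loop.
func main() {
	cases := []struct {
		name, in       string
		offset, length int64
	}{
		{"full range", "bytes=0-999", 0, 1000},
		{"suffix", "bytes=-500", -500, -1},
		{"open-ended", "bytes=500-", 500, -1},
	}
	for _, c := range cases {
		o, l, err := parseRange(c.in)
		fmt.Println(c.name, o, l, err, o == c.offset && l == c.length)
	}
}
```

The payoff of the table is that adding a case is one struct literal rather than another near-identical `t.Run` body.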
The storage.Client is an implementation detail; external callers that need it use the StorageClient() accessor, so the field itself doesn't need to be exported.
…e tests into single test file per package Merge TestPageSize into cloudstorage_test.go and TestParseRange into cloudstoragereadobject_test.go. Both test files now use the internal package so they can exercise the unexported pageSize and parseRange helpers directly, removing the need for separate *_internal_test.go files.
…with AgentError Previously, values above the GCS per-page cap of 1000 were silently clamped by the pageSize helper, which could confuse agents when the returned page was smaller than requested. Validate max_results during Invoke and return an AgentError so the limit is explicit. Docs and the parameter description are updated to match; the pageSize clamp remains as defense in depth. A unit test covers the rejection path and an integration test exercises it over HTTP.
…age_token inputs Add two integration sub-tests confirming that empty-string inputs are accepted by the GCS client as expected: ListObjects with empty prefix and delimiter returns an unfiltered listing, and an empty page_token returns the first page rather than erroring. These cases address review questions about whether the values passed through to storage.Query and iterator.NewPager are safe when unset.
… simplify storage client init in integration test Drop the initStorageClient wrapper and the option.WithUserAgent call; the integration test now uses storage.NewClient(ctx) directly, matching the suggestion in review and removing a needless indirection.
… table-driven integration tests and reuse RunToolGetTestByName Replace the bespoke runCloudStorageToolGetTest with two tests.RunToolGetTestByName calls that assert the full manifest for each tool. Convert the list_objects and read_object sub-tests to table-driven form: each case declares a request body plus substring, content, or contentType expectations, driven by a single assertion loop. The inherently two-step pagination test stays as its own t.Run. Behaviour is unchanged; the file is ~220 lines shorter in boilerplate.
… drop tool-get manifest test Remove runCloudStorageToolGetTest entirely. The manifest-shape check it performed was redundant: unit tests already cover ParseFromYaml for each config, and the invoke sub-tests exercise the tool handlers over HTTP. Keeping a full-manifest deep-equal here just duplicates that coverage and has to be updated whenever parameter docs change.
Description
Adds Google Cloud Storage as a first-class source in MCP Toolbox, enabling LLM agents to work with objects across buckets in a GCP project. The source is project-scoped and authenticates via Application Default Credentials, mirroring Firestore/Bigtable.
This first PR ships the source plus two read-only tools from the approved design (14 total):

- `cloud-storage-list-objects` — prefix filter, delimiter-based grouping (returns `prefixes`), and pagination via `max_results`/`page_token`. Passes through whatever metadata the GCS client returns (`*storage.ObjectAttrs`) so we don't have to plumb new fields later.
- `cloud-storage-read-object` — reads an object's bytes, textual data only, with optional HTTP-style byte ranges (`bytes=0-999`, `bytes=-500`, `bytes=500-`).

GCS-aware error categorization (per DEVELOPER.md) is implemented in a new `cloudstoragecommon` helper that maps GCS sentinels and `*googleapi.Error` codes to Agent errors (missing bucket/object, bad request, unsatisfiable range) vs. Server errors (auth, IAM denial, quota, 5xx, context cancellation). This replaces the coarse `util.ProcessGcpError` for the two new tools.

Remaining 12 tools from the design doc (`list_buckets`, `create_bucket`, `copy`/`move`/`delete_object`, etc.) will land in follow-up PRs.

CI note: the `cloud-storage` shard in `.ci/integration.cloudbuild.yaml` expects `CLOUD_STORAGE_PROJECT=$PROJECT_ID` and requires the test service account to have a Cloud Storage admin role in the test project. The integration test self-manages its own UUID-suffixed bucket with defer-based cleanup.

PR Checklist
`!` if this involves a breaking change

What's included

- `internal/sources/cloudstorage/` (+ YAML-parse unit tests)
- `internal/tools/cloudstorage/cloudstoragelistobjects/`, `.../cloudstoragereadobject/` (+ YAML-parse + range-parser unit tests)
- `cloudstoragecommon` error classifier (+ 17-case unit test covering sentinels, HTTP statuses, `context.Canceled`/`DeadlineExceeded`, and fallback)
- `tests/cloudstorage/cloud_storage_integration_test.go` — 12 sub-tests against a real bucket (self-created, self-cleaned)
- `docs/en/integrations/cloud-storage/` (source + both tool pages; passes `.ci/lint-docs-{source,tool}-page.sh`)
- `cloud-storage` in `.ci/integration.cloudbuild.yaml`
- `cloud.google.com/go/storage v1.62.1`

Opening as draft for initial review — happy to split the error-classifier refactor into a separate commit if reviewers prefer.