feat: Add Amazon S3 Vectors document store integration #3149

Open
dotKokott wants to merge 7 commits into deepset-ai:main from dotKokott:feature/amazon-s3-vectors-integration

Conversation

@dotKokott
Contributor

@dotKokott dotKokott commented Apr 13, 2026

Related Issues

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

  • S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
  • S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

  • Content stored as non-filterable metadata (AWS-recommended pattern for large text)
  • Cosine distance converted to similarity score (1 - distance) for Haystack convention
  • Blob data uses base64 encoding for round-trip fidelity
  • filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
  • Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)
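
The distance-to-similarity conversion noted above can be sketched as follows. The function names are illustrative, not the PR's actual helpers, and the euclidean mapping shown is an assumed monotone transform, not taken from the source:

```python
def cosine_distance_to_score(distance: float) -> float:
    """Convert an S3 Vectors cosine distance (0 = identical) into a
    Haystack-style similarity score where higher means more similar."""
    return 1.0 - distance


def euclidean_distance_to_score(distance: float) -> float:
    """One possible mapping of a non-negative euclidean distance into (0, 1]."""
    return 1.0 / (1.0 + distance)
```

The cosine case is exactly the `1 - distance` convention mentioned in the design decisions; the euclidean variant just illustrates that any strictly decreasing map works.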

Known limitations (documented in README):

  • top_k capped at 100 (service limit)
  • query_vectors does not return embedding data
  • 40KB total metadata per vector, 2KB filterable
  • Only float32, cosine/euclidean, eventual consistency
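
A minimal sketch of how the 40KB/2KB metadata limits above could be enforced before a write. The helper names and the JSON-size approximation are assumptions for illustration, not the store's actual validation code:

```python
import json

MAX_TOTAL_METADATA_BYTES = 40 * 1024       # 40KB total per vector (service limit)
MAX_FILTERABLE_METADATA_BYTES = 2 * 1024   # 2KB filterable portion


def metadata_size_bytes(metadata: dict) -> int:
    # Approximate stored size via a compact JSON encoding of the metadata.
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def check_metadata_limits(filterable: dict, non_filterable: dict) -> None:
    filterable_size = metadata_size_bytes(filterable)
    total_size = filterable_size + metadata_size_bytes(non_filterable)
    if filterable_size > MAX_FILTERABLE_METADATA_BYTES:
        raise ValueError(
            f"Filterable metadata is {filterable_size} bytes; "
            f"limit is {MAX_FILTERABLE_METADATA_BYTES}"
        )
    if total_size > MAX_TOTAL_METADATA_BYTES:
        raise ValueError(
            f"Total metadata is {total_size} bytes; limit is {MAX_TOTAL_METADATA_BYTES}"
        )
```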

How did you test it?

  • 26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
  • 12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
  • hatch run test:all, hatch run fmt, hatch run test:types
  • Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

@github-actions github-actions Bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 13, 2026
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
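
The filter conversion mentioned above could be sketched roughly as follows, assuming S3 Vectors accepts MongoDB-style operators (`$eq`, `$and`, ...) and that Haystack filters arrive in the `{"field": ..., "operator": ..., "value": ...}` / `{"operator": "AND", "conditions": [...]}` schema. Operator coverage here is illustrative, not the PR's actual converter:

```python
# Assumed mapping from Haystack comparison operators to S3 Vectors operators.
_OPS = {
    "==": "$eq", "!=": "$ne",
    ">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte",
    "in": "$in", "not in": "$nin",
}


def convert_filters(haystack_filter: dict) -> dict:
    op = haystack_filter["operator"]
    if op in ("AND", "OR"):
        key = "$and" if op == "AND" else "$or"
        return {key: [convert_filters(c) for c in haystack_filter["conditions"]]}
    # Comparison condition: strip the "meta." prefix Haystack uses for metadata fields.
    field = haystack_filter["field"].removeprefix("meta.")
    return {field: {_OPS[op]: haystack_filter["value"]}}
```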
…rkflow

- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling
the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests
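
The duplicate-policy batch check (100 IDs per lookup) could be sketched like this. The `get_vectors` call shape is an assumption based on the batching described in the PR, not a verified API signature:

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 100  # matches the "batches of 100" noted in the PR description


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield successive chunks of at most `size` IDs."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk


def find_existing_ids(client, index_arn: str, ids: list) -> set:
    """Hypothetical helper: look up which document IDs already exist,
    100 keys per call, so SKIP/NONE policies can act on the result."""
    existing = set()
    for chunk in batched(ids):
        response = client.get_vectors(indexArn=index_arn, keys=chunk)
        existing.update(v["key"] for v in response.get("vectors", []))
    return existing
```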
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare
  with haystack.utils.filters.document_matches_filter (same utility used by
  InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and
  date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by
  reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being
  provided (no warning when listing all documents)
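
The base64 round trip for blob data mentioned in the design decisions amounts to the following; the helper names are illustrative:

```python
import base64


def blob_to_metadata(blob: bytes) -> str:
    # Encode binary blob bytes as base64 text so they survive the
    # string-valued metadata round trip losslessly.
    return base64.b64encode(blob).decode("ascii")


def metadata_to_blob(value: str) -> bytes:
    return base64.b64decode(value.encode("ascii"))
```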
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:39
@dotKokott dotKokott requested a review from a team as a code owner April 13, 2026 13:39
@dotKokott dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as draft April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:40
@dotKokott
Contributor Author

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a `pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...)` guard so they silently skip (0 collected), but this means:

  • Integration tests never actually run in CI — only locally by developers with AWS credentials
  • The "combined" coverage badge will just reflect unit test coverage

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

```yaml
# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry
```

Prerequisites (maintainer action required)

  1. Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
  2. Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
  3. Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)

@anakin87
Member

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

@dotKokott
Contributor Author

dotKokott commented Apr 15, 2026

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

I have tried all integration tests and examples on my AWS account.

However, I did not try it with any large datasets. That might be the next thing to validate: whether this works as expected under real load.

Member

@anakin87 anakin87 left a comment


I left some initial comments.
Will take a better look soon

```yaml
with:
  name: coverage-comment-amazon_s3_vectors
  path: python-coverage-comment-action-amazon_s3_vectors.txt
```

Member


we also need a step for AWS authentication here, like the Bedrock one

@@ -0,0 +1,207 @@
# amazon-s3-vectors-haystack
Member


We want to have a very minimal README (see https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/README.md)

This info is useful but we'll put it in docs


**Service limits:**

- Maximum ``top_k``: 100 results per query
Member


Suggested change:

```diff
- - Maximum ``top_k``: 100 results per query
+ - Maximum `top_k`: 100 results per query
```

For consistency, we always use single backticks. Please update this pattern across all docstrings.


Labels

topic:CI type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Amazon S3 Vectors (DocStore)

2 participants