feat: Add Amazon S3 Vectors document store integration #3149

Open
dotKokott wants to merge 7 commits into deepset-ai:main from dotKokott:feature/amazon-s3-vectors-integration

Conversation

@dotKokott
Contributor

@dotKokott dotKokott commented Apr 13, 2026

Related Issues

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

  • S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
  • S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

  • Content stored as non-filterable metadata (AWS-recommended pattern for large text)
  • Cosine distance converted to similarity score (1 - distance) for Haystack convention
  • Blob data uses base64 encoding for round-trip fidelity
  • filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
  • Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)
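
The distance-to-similarity conversion noted above can be sketched as follows. The function names are illustrative, not the PR's actual helpers, and the euclidean mapping shown is an assumed monotone transform, not taken from the source:

```python
def cosine_distance_to_score(distance: float) -> float:
    """Convert an S3 Vectors cosine distance (0 = identical) into a
    Haystack-style similarity score where higher means more similar."""
    return 1.0 - distance


def euclidean_distance_to_score(distance: float) -> float:
    """One possible mapping of a non-negative euclidean distance into (0, 1]."""
    return 1.0 / (1.0 + distance)
```

The cosine case is exactly the `1 - distance` convention mentioned in the design decisions; the euclidean variant just illustrates that any strictly decreasing map works.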

Known limitations (documented in README):

  • top_k capped at 100 (service limit)
  • query_vectors does not return embedding data
  • 40KB total metadata per vector, 2KB filterable
  • Only float32, cosine/euclidean, eventual consistency
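
A minimal sketch of how the 40KB/2KB metadata limits above could be enforced before a write. The helper names and the JSON-size approximation are assumptions for illustration, not the store's actual validation code:

```python
import json

MAX_TOTAL_METADATA_BYTES = 40 * 1024       # 40KB total per vector (service limit)
MAX_FILTERABLE_METADATA_BYTES = 2 * 1024   # 2KB filterable portion


def metadata_size_bytes(metadata: dict) -> int:
    # Approximate stored size via a compact JSON encoding of the metadata.
    return len(json.dumps(metadata, separators=(",", ":")).encode("utf-8"))


def check_metadata_limits(filterable: dict, non_filterable: dict) -> None:
    filterable_size = metadata_size_bytes(filterable)
    total_size = filterable_size + metadata_size_bytes(non_filterable)
    if filterable_size > MAX_FILTERABLE_METADATA_BYTES:
        raise ValueError(
            f"Filterable metadata is {filterable_size} bytes; "
            f"limit is {MAX_FILTERABLE_METADATA_BYTES}"
        )
    if total_size > MAX_TOTAL_METADATA_BYTES:
        raise ValueError(
            f"Total metadata is {total_size} bytes; limit is {MAX_TOTAL_METADATA_BYTES}"
        )
```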

How did you test it?

  • 26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
  • 12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
  • hatch run test:all, hatch run fmt, hatch run test:types
  • Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

@github-actions github-actions Bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 13, 2026
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
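
The filter conversion mentioned above could be sketched roughly as follows, assuming S3 Vectors accepts MongoDB-style operators (`$eq`, `$and`, ...) and that Haystack filters arrive in the `{"field": ..., "operator": ..., "value": ...}` / `{"operator": "AND", "conditions": [...]}` schema. Operator coverage here is illustrative, not the PR's actual converter:

```python
# Assumed mapping from Haystack comparison operators to S3 Vectors operators.
_OPS = {
    "==": "$eq", "!=": "$ne",
    ">": "$gt", ">=": "$gte", "<": "$lt", "<=": "$lte",
    "in": "$in", "not in": "$nin",
}


def convert_filters(haystack_filter: dict) -> dict:
    op = haystack_filter["operator"]
    if op in ("AND", "OR"):
        key = "$and" if op == "AND" else "$or"
        return {key: [convert_filters(c) for c in haystack_filter["conditions"]]}
    # Comparison condition: strip the "meta." prefix Haystack uses for metadata fields.
    field = haystack_filter["field"].removeprefix("meta.")
    return {field: {_OPS[op]: haystack_filter["value"]}}
```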
…rkflow

- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling
the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests
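
The duplicate-policy batch check (100 IDs per lookup) could be sketched like this. The `get_vectors` call shape is an assumption based on the batching described in the PR, not a verified API signature:

```python
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 100  # matches the "batches of 100" noted in the PR description


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield successive chunks of at most `size` IDs."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk


def find_existing_ids(client, index_arn: str, ids: list) -> set:
    """Hypothetical helper: look up which document IDs already exist,
    100 keys per call, so SKIP/NONE policies can act on the result."""
    existing = set()
    for chunk in batched(ids):
        response = client.get_vectors(indexArn=index_arn, keys=chunk)
        existing.update(v["key"] for v in response.get("vectors", []))
    return existing
```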
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare
  with haystack.utils.filters.document_matches_filter (same utility used by
  InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and
  date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by
  reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being
  provided (no warning when listing all documents)
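
The base64 round trip for blob data mentioned in the design decisions amounts to the following; the helper names are illustrative:

```python
import base64


def blob_to_metadata(blob: bytes) -> str:
    # Encode binary blob bytes as base64 text so they survive the
    # string-valued metadata round trip losslessly.
    return base64.b64encode(blob).decode("ascii")


def metadata_to_blob(value: str) -> bytes:
    return base64.b64decode(value.encode("ascii"))
```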
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:39
@dotKokott dotKokott requested a review from a team as a code owner April 13, 2026 13:39
@dotKokott dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as draft April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:40
@dotKokott
Contributor Author

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a `pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...)` guard so they silently skip (0 collected), but this means:

  • Integration tests never actually run in CI — only locally by developers with AWS credentials
  • The "combined" coverage badge will just reflect unit test coverage

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

```yaml
# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry
```

Prerequisites (maintainer action required)

  1. Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
  2. Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
  3. Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)

@anakin87
Member

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

@dotKokott
Contributor Author

dotKokott commented Apr 15, 2026

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

I have tried all integration tests and examples on my AWS account.

However, I did not try it with any large datasets. That might be the next thing to validate: whether this works as expected under real load.

Member

@anakin87 anakin87 left a comment


I left some initial comments.
Will take a better look soon

```yaml
with:
  name: coverage-comment-amazon_s3_vectors
  path: python-coverage-comment-action-amazon_s3_vectors.txt
```

Member


we also need a step for AWS authentication here, like the Bedrock one

@@ -0,0 +1,207 @@
# amazon-s3-vectors-haystack
Member


We want to have a very minimal README (see https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/README.md)

This info is useful but we'll put it in docs


**Service limits:**

- Maximum ``top_k``: 100 results per query
Member


Suggested change:

```diff
- - Maximum ``top_k``: 100 results per query
+ - Maximum `top_k`: 100 results per query
```

For consistency, we always use single backticks. Please update this pattern across all docstrings.


Labels

topic:CI type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Amazon S3 Vectors (DocStore)

2 participants