feat: replace Vespa visitor-based deletes with query-then-delete-by-ID #1598

Open
felixschmetz wants to merge 1 commit into main from feat/vespa-delete-by-doc-id

@felixschmetz (Member) commented Mar 12, 2026

Selection-based deletes use Vespa's visitor pattern, which scans all buckets across all 5 schemas sequentially: O(buckets * schemas).

New approach: query the indexed original_entity_id field to resolve Vespa doc IDs, then issue parallel direct DELETE-by-ID calls at O(1) per doc. Falls back to selection-based delete if the query fails.
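
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fast_delete`, `resolve_ids`, `delete_one` and `selection_delete` are hypothetical names, and the callables stand in for the real Vespa query and HTTP calls. Only the `DELETE_CONCURRENCY` value comes from the summary below.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Sequence, Tuple

DELETE_CONCURRENCY = 20  # value quoted in the PR summary

# Hypothetical callable types standing in for the real Vespa calls.
DeleteOne = Callable[[str, str], Awaitable[bool]]
ResolveIds = Callable[[Sequence[str]], Awaitable[Iterable[Tuple[str, str]]]]
SelectionDelete = Callable[[Sequence[str]], Awaitable[int]]


async def delete_by_doc_ids(
    doc_ids: Iterable[Tuple[str, str]],
    delete_one: DeleteOne,
    concurrency: int = DELETE_CONCURRENCY,
) -> int:
    """Delete (schema, doc_id) pairs in parallel and return how many succeeded."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(schema: str, doc_id: str) -> bool:
        async with semaphore:  # cap the number of in-flight DELETE calls
            return await delete_one(schema, doc_id)

    results = await asyncio.gather(*(guarded(s, d) for s, d in doc_ids))
    return sum(1 for ok in results if ok)


async def fast_delete(
    parent_ids: Sequence[str],
    resolve_ids: ResolveIds,
    delete_one: DeleteOne,
    selection_delete: SelectionDelete,
) -> int:
    """Query-then-delete, falling back to the selection path if the query fails."""
    try:
        doc_ids = await resolve_ids(parent_ids)
    except Exception:
        # Query failed: use the slower visitor-based delete instead.
        return await selection_delete(parent_ids)
    return await delete_by_doc_ids(doc_ids, delete_one)
```

The semaphore keeps at most `concurrency` requests open at once, so a large batch does not flood the Vespa container with simultaneous connections.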


Summary by cubic

Replace Vespa visitor-based deletes with a fast query-then-delete-by-ID path for parent-ID scoped deletions. This resolves doc IDs via indexed fields and issues parallel direct DELETEs, with a fallback to selection deletes if the query fails.

  • New Features
    • delete_by_parent_ids now:
      • Queries airweave_system_metadata_original_entity_id + airweave_system_metadata_collection_id to resolve doc IDs, then deletes them by ID in parallel.
      • Aggregates a single DeleteResult across schemas (schema=None), and falls back to selection-based delete on query errors.
    • Added _query_doc_ids_by_parent_ids (YQL with DELETE_QUERY_HITS_LIMIT) and _delete_by_doc_ids (parallel HTTP deletes with DELETE_CONCURRENCY).
    • Increased DELETE_BATCH_SIZE to 200 for fewer queries; added DELETE_QUERY_HITS_LIMIT=10000 and DELETE_CONCURRENCY=20.
    • Made DeleteResult.schema_name optional to support cross-schema aggregation.
    • Added unit tests for fast delete flow and updated existing tests.
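
As a rough illustration of the batching and cross-schema aggregation listed above: the sketch below uses a hypothetical stand-in for `DeleteResult` (the real class's fields are not shown in this conversation, so `deleted_count` is an assumption), and `batched` and `aggregate` are illustrative helper names.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Sequence

DELETE_BATCH_SIZE = 200  # value quoted in the PR summary


@dataclass
class DeleteResult:
    """Hypothetical stand-in; only the optional schema_name mirrors the PR."""

    deleted_count: int
    schema_name: Optional[str] = None  # None marks a cross-schema aggregate


def batched(ids: Sequence[str], size: int = DELETE_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs, one chunk per query."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def aggregate(results: Sequence[DeleteResult]) -> DeleteResult:
    """Fold per-batch results into one result that is not tied to a schema."""
    return DeleteResult(deleted_count=sum(r.deleted_count for r in results))
```

With a batch size of 200, 450 parent IDs resolve in three queries rather than one per ID.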

Written for commit 99056c9. Summary will update on new commits.

@cubic-dev-ai bot (Contributor) left a comment

2 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/airweave/platform/destinations/vespa/client.py">

<violation number="1" location="backend/airweave/platform/destinations/vespa/client.py:323">
P2: Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named `escaped_ids`. Replace embedded single quotes to avoid malformed YQL and misleading naming.</violation>

<violation number="2" location="backend/airweave/platform/destinations/vespa/client.py:332">
P1: Silent data truncation when matching documents exceed `DELETE_QUERY_HITS_LIMIT`. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check `response.json['root']['fields']['totalCount']` against `len(hits)` and either paginate or log a warning and fall back to the selection-based delete for completeness.</violation>
</file>


        )
        query_params = {
            "yql": yql,
            "hits": DELETE_QUERY_HITS_LIMIT,
@cubic-dev-ai bot (Contributor) commented Mar 12, 2026

P1: Silent data truncation when matching documents exceed `DELETE_QUERY_HITS_LIMIT`. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check `response.json['root']['fields']['totalCount']` against `len(hits)` and either paginate or log a warning and fall back to the selection-based delete for completeness.
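
A minimal sketch of the check suggested here, assuming the response follows Vespa's default JSON result layout (`root.fields.totalCount` for the match count, `root.children` for the returned hits). The function name `is_truncated` is hypothetical and not part of the PR.

```python
from typing import Any, Dict


def is_truncated(response_json: Dict[str, Any]) -> bool:
    """Return True when Vespa matched more documents than it returned."""
    root = response_json.get("root", {})
    total = root.get("fields", {}).get("totalCount", 0)
    hits = root.get("children", [])  # key is absent when there are no hits
    return total > len(hits)
```

A caller could log a warning and take the selection-based fallback whenever this returns True, so documents beyond the hits limit are not left behind.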

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/vespa/client.py, line 332:

<comment>Silent data truncation when matching documents exceed `DELETE_QUERY_HITS_LIMIT`. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check `response.json['root']['fields']['totalCount']` against `len(hits)` and either paginate or log a warning and fall back to the selection-based delete for completeness.</comment>

<file context>
@@ -264,39 +267,193 @@ async def delete_by_parent_ids(
+        )
+        query_params = {
+            "yql": yql,
+            "hits": DELETE_QUERY_HITS_LIMIT,
+            "timeout": "30s",
+            "summary": "none",
</file context>

        Returns:
            List of (schema_name, doc_id) tuples for direct deletion
        """
        escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids)
@cubic-dev-ai bot (Contributor) commented Mar 12, 2026

P2: Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named `escaped_ids`. Replace embedded single quotes to avoid malformed YQL and misleading naming.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/vespa/client.py, line 323:

<comment>Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named `escaped_ids`. Replace embedded single quotes to avoid malformed YQL and misleading naming.</comment>

<file context>
@@ -264,39 +267,193 @@ async def delete_by_parent_ids(
+        Returns:
+            List of (schema_name, doc_id) tuples for direct deletion
+        """
+        escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids)
+        source_list = ", ".join(ALL_VESPA_SCHEMAS)
+        yql = (
</file context>
Suggested change
- escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids)
+ escaped_ids = ", ".join(f"'{pid.replace(chr(39), chr(39)*2)}'" for pid in parent_ids)
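
The suggested change doubles each single quote, as SQL does. If YQL string literals use backslash escapes instead (an assumption worth checking against Vespa's YQL reference before relying on it), a small helper might look like the sketch below. `yql_quote` and `build_id_list` are hypothetical names, not functions from this PR.

```python
from typing import Iterable


def yql_quote(value: str) -> str:
    """Wrap a value in single quotes, backslash-escaping backslashes and quotes.

    Assumes backslash-style escaping in YQL string literals.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_id_list(parent_ids: Iterable[str]) -> str:
    """Build the comma-separated literal list used inside a YQL `in (...)` clause."""
    return ", ".join(yql_quote(pid) for pid in parent_ids)
```

Backslashes are escaped first so that the backslash added in front of a quote is not itself escaped a second time.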

@marc-rutzou (Collaborator) commented

I would stick with the name original_entity_id instead of parent ID, so it doesn't conflict with the notion of a parent in breadcrumbs.

@marc-rutzou (Collaborator) commented

otherwise lgtm!
