feat: replace Vespa visitor-based deletes with query-then-delete-by-ID#1598
feat: replace Vespa visitor-based deletes with query-then-delete-by-ID#1598felixschmetz wants to merge 1 commit intomainfrom
Conversation
Selection-based deletes use Vespa's visitor pattern which scans all buckets across all 5 schemas sequentially — O(buckets * schemas). New approach: query indexed original_entity_id field to resolve Vespa doc IDs, then issue parallel direct DELETE-by-ID calls — O(1) per doc. Falls back to selection-based delete if the query fails.
There was a problem hiding this comment.
2 issues found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/airweave/platform/destinations/vespa/client.py">
<violation number="1" location="backend/airweave/platform/destinations/vespa/client.py:323">
P2: Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named `escaped_ids`. Replace embedded single quotes to avoid malformed YQL and misleading naming.</violation>
<violation number="2" location="backend/airweave/platform/destinations/vespa/client.py:332">
P1: Silent data truncation when matching documents exceed `DELETE_QUERY_HITS_LIMIT`. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check `response.json['root']['fields']['totalCount']` against `len(hits)` and either paginate or log a warning and fall back to the selection-based delete for completeness.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| ) | ||
| query_params = { | ||
| "yql": yql, | ||
| "hits": DELETE_QUERY_HITS_LIMIT, |
There was a problem hiding this comment.
P1: Silent data truncation when matching documents exceed DELETE_QUERY_HITS_LIMIT. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check response.json['root']['fields']['totalCount'] against len(hits) and either paginate or log a warning and fall back to the selection-based delete for completeness.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/vespa/client.py, line 332:
<comment>Silent data truncation when matching documents exceed `DELETE_QUERY_HITS_LIMIT`. If the query returns exactly the limit, remaining documents are orphaned with no warning. Check `response.json['root']['fields']['totalCount']` against `len(hits)` and either paginate or log a warning and fall back to the selection-based delete for completeness.</comment>
<file context>
@@ -264,39 +267,193 @@ async def delete_by_parent_ids(
+ )
+ query_params = {
+ "yql": yql,
+ "hits": DELETE_QUERY_HITS_LIMIT,
+ "timeout": "30s",
+ "summary": "none",
</file context>
| Returns: | ||
| List of (schema_name, doc_id) tuples for direct deletion | ||
| """ | ||
| escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids) |
There was a problem hiding this comment.
P2: Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named escaped_ids. Replace embedded single quotes to avoid malformed YQL and misleading naming.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/vespa/client.py, line 323:
<comment>Parent IDs are interpolated into YQL without escaping single quotes, despite the variable being named `escaped_ids`. Replace embedded single quotes to avoid malformed YQL and misleading naming.</comment>
<file context>
@@ -264,39 +267,193 @@ async def delete_by_parent_ids(
+ Returns:
+ List of (schema_name, doc_id) tuples for direct deletion
+ """
+ escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids)
+ source_list = ", ".join(ALL_VESPA_SCHEMAS)
+ yql = (
</file context>
| escaped_ids = ", ".join(f"'{pid}'" for pid in parent_ids) | |
| escaped_ids = ", ".join(f"'{pid.replace(chr(39), chr(39)*2)}'" for pid in parent_ids) |
|
I would stick to the name of original_entity_id instead of parent id to not conflict with a parent in terms of breadcrumbs |
|
otherwise lgtm! |
Selection-based deletes use Vespa's visitor pattern which scans all buckets across all 5 schemas sequentially — O(buckets * schemas).
New approach: query indexed original_entity_id field to resolve Vespa doc IDs, then issue parallel direct DELETE-by-ID calls — O(1) per doc. Falls back to selection-based delete if the query fails.
Summary by cubic
Replace Vespa visitor-based deletes with a fast query-then-delete-by-ID path for parent-ID scoped deletions. This resolves doc IDs via indexed fields and issues parallel direct DELETEs, with a fallback to selection deletes if the query fails.
delete_by_parent_idsnow:airweave_system_metadata_original_entity_id+airweave_system_metadata_collection_idto resolve doc IDs, then deletes them by ID in parallel.DeleteResultacross schemas (schema=None), and falls back to selection-based delete on query errors._query_doc_ids_by_parent_ids(YQL withDELETE_QUERY_HITS_LIMIT) and_delete_by_doc_ids(parallel HTTP deletes withDELETE_CONCURRENCY).DELETE_BATCH_SIZEto 200 for fewer queries; addedDELETE_QUERY_HITS_LIMIT=10000andDELETE_CONCURRENCY=20.DeleteResult.schema_nameoptional to support cross-schema aggregation.Written for commit 99056c9. Summary will update on new commits.