feat(embedders): Gemini Embedding 2 multimodal — native PDF, image, audio, video #1627
24601 wants to merge 16 commits into airweave-ai:main
Conversation
6 issues found across 48 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="docs/gemini-embedding-2/c4-architecture.dsl">
<violation number="1" location="docs/gemini-embedding-2/c4-architecture.dsl:42">
P1: Custom agent: **Check for Cursor Rules Drift**
Update the Cursor architecture rules for the new multimodal embedding path. This change adds protocol-based routing, native `embed_file()` handling, ffmpeg media chunking, and a Gemini dense provider, but the current Cursor rules still describe text-only vectorization and OpenAI-only embeddings.</violation>
</file>
<file name="backend/airweave/domains/embedders/dense/tests/test_gemini_multimodal.py">
<violation number="1" location="backend/airweave/domains/embedders/dense/tests/test_gemini_multimodal.py:13">
P1: Custom agent: **Check for Cursor Rules Drift**
Update the relevant Cursor rules for the new Gemini multimodal embedding path. They still describe embeddings as OpenAI-only and text-only, but this file adds tests for `GeminiDenseEmbedder.embed_file()` and `MultimodalDenseEmbedderProtocol` across PDF/image/audio/video.</violation>
</file>
<file name="backend/airweave/domains/embedders/registry_data.py">
<violation number="1" location="backend/airweave/domains/embedders/registry_data.py:92">
P1: Custom agent: **Explicit Protocol Implementation**
`GeminiDenseEmbedder` is being registered without explicitly inheriting the embedder protocols. Since the sync pipeline dispatches on `MultimodalDenseEmbedderProtocol`, declare that contract on the class instead of relying on structural typing alone.
(Based on your team's feedback about only requiring explicit protocol inheritance when the protocol has real polymorphic value.) [FEEDBACK_USED]</violation>
</file>
<file name="backend/tests/unit/platform/sync/processors/test_multimodal_e2e.py">
<violation number="1" location="backend/tests/unit/platform/sync/processors/test_multimodal_e2e.py:305">
P2: Video chunking test asserts multiple segments for an 85s file, but default settings allow up to 120s, so the test will fail unless settings are overridden.</violation>
</file>
<file name="backend/airweave/platform/chunkers/media.py">
<violation number="1" location="backend/airweave/platform/chunkers/media.py:240">
P1: Video chunking can enter a non-terminating loop when overlap is >= max segment length because `step` is not clamped to a positive value.</violation>
</file>
<file name="docs/gemini-embedding-2/README.md">
<violation number="1" location="docs/gemini-embedding-2/README.md:460">
P2: README documents DENSE_EMBEDDER as `gemini-embedding-2-preview`, but the registry uses the short_name `gemini_embedding_2`. Using the documented value will not match any registered embedder and the Gemini provider won’t activate.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
4 issues found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/airweave/platform/chunkers/media.py">
<violation number="1" location="backend/airweave/platform/chunkers/media.py:46">
P1: Custom agent: **Check for Cursor Rules Drift**
Update the relevant Cursor rules for the new Gemini multimodal/ffmpeg embedding path. They still describe embeddings as OpenAI/text-only, so Cursor will guide future edits toward the wrong provider and sync flow.</violation>
<violation number="2" location="backend/airweave/platform/chunkers/media.py:258">
P1: Video chunking can stall or loop non-terminating when size-limited segment duration becomes <= overlap, because `step` is not clamped positive.</violation>
</file>
<file name="backend/airweave/platform/sync/processors/chunk_embed.py">
<violation number="1" location="backend/airweave/platform/sync/processors/chunk_embed.py:429">
P1: Oversized PDF chunking overwrites existing extracted/OCR text with raw `page.get_text()` output, causing scanned/image PDFs to lose sparse/answer text and fall back to placeholders.</violation>
<violation number="2" location="backend/airweave/platform/sync/processors/chunk_embed.py:460">
P1: Media chunks drop extracted transcript/OCR text from `textual_representation`, so sparse/BM25 embeddings index only segment labels (e.g., “[file — Segment 0...]”) instead of media content, breaking keyword search and answer generation for audio/video.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
@@ -0,0 +1,349 @@
"""Media chunker for audio and video files.
P1: Custom agent: Check for Cursor Rules Drift
Update the relevant Cursor rules for the new Gemini multimodal/ffmpeg embedding path. They still describe embeddings as OpenAI/text-only, so Cursor will guide future edits toward the wrong provider and sync flow.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/chunkers/media.py, line 46:
<comment>Update the relevant Cursor rules for the new Gemini multimodal/ffmpeg embedding path. They still describe embeddings as OpenAI/text-only, so Cursor will guide future edits toward the wrong provider and sync flow.</comment>
<file context>
@@ -43,6 +43,21 @@ def _load_media_config() -> tuple[int, int, int, int]:
)
+def _get_max_single_file_bytes() -> int:
+ """Centralized file size limit from settings, fallback to 19MB.
+
</file context>
Responses to cubic Round 2 Review

1. Cursor Rules Drift (media.py:46) — Not this PR's scope. Cursor rules (
2. Video step not clamped (media.py:258) — Valid, fixed in 694bf5a. Video chunk
3. PDF page.get_text() overwrites OCR (chunk_embed.py:429) — Partially valid, fixed in 694bf5a. PDF chunks now prefer
4. Media segments drop transcript (chunk_embed.py:460) — Valid, fixed in 694bf5a. Media segment 0 now carries the full parent transcript for BM25/keyword searchability. Subsequent segments get segment labels only (dense embedding on the raw audio/video carries the actual content). This balances the cross-val finding (don't duplicate full text across every segment) with the need for searchable sparse text.

All 121 unit tests + 9 live integration tests passing.
Cross-Validation Round 2 — Final Results

Models

Convergence: 2/3 model pass (Codex + Gemini). Amp timed out on both R1 and R2 at 60min — infrastructure failure, not a code finding. Explicit user waiver granted per the asymmetric-pass-logic exception.

Round 2 Findings (all fixed in 15c9774)

Codex findings (5):
Gemini finding (1, converged with Codex):
Score Trajectory (Codex, right of first refusal)
Test Results
Add google-genai SDK integration with GeminiDenseEmbedder supporting batching, L2 normalization for Matryoshka dimensions, purpose-aware task types (RETRIEVAL_DOCUMENT vs RETRIEVAL_QUERY), and comprehensive error translation including httpx transport exceptions.

Phase 1A: Text-only Gemini dense embedder (drop-in provider)
Phase 1B: EmbeddingPurpose enum threaded through DenseEmbedderProtocol

- New GeminiDenseEmbedder with 25 unit tests
- EmbeddingPurpose enum with DOCUMENT/QUERY variants
- Registry, factory, validation, and defaults.yml wiring
- All existing embedders updated for purpose param (accepted, ignored)
- All callers pass explicit purpose (sync=DOCUMENT, search=QUERY)
- Matryoshka supported_dimensions in defaults.yml
- 133/133 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
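The "L2 normalization for Matryoshka dimensions" in this commit can be sketched as follows. This is a minimal illustration only — the helper name `truncate_and_normalize` is hypothetical, not the PR's actual function. The key point: truncating a unit-norm embedding to a smaller Matryoshka dimension breaks the unit norm, so normalization must be re-applied after truncation.

```python
import math


def truncate_and_normalize(vector: list[float], dim: int) -> list[float]:
    """Truncate a full-size embedding to a Matryoshka dimension, then
    re-apply L2 normalization (truncation breaks the original unit norm).

    Hypothetical helper for illustration; the PR's implementation may differ.
    """
    truncated = vector[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # degenerate all-zero vector; nothing to scale
    return [x / norm for x in truncated]
```

After this step, cosine similarity between truncated vectors is again a plain dot product, which is what vector stores typically assume.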
…bedding 2

Extends Phase 1 (text-only) with native file embedding for PDFs, images, audio, and video through the Gemini Embedding 2 API. The pipeline detects multimodal capability via @runtime_checkable MultimodalDenseEmbedderProtocol, routes eligible FileEntities to native embedding, and falls back to the text pipeline gracefully.

Oversized PDFs are split into configurable page chunks. Audio/video files are segmented via pydub/ffmpeg with configurable overlap. All chunking parameters, media gating, and aggregation are configurable via environment variables with sensible defaults.

Includes 112 tests (unit, E2E with synthetic media, and live integration against the real Gemini API) verifying dimensions, normalization, cross-modal similarity, and pipeline routing. No homebrew vector math — all aggregation uses API-native facilities.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
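The oversized-PDF page chunking described in this commit (elsewhere in the PR: 6-page windows with 1-page overlap) can be sketched as below. The helper name `pdf_page_windows` is hypothetical; the PR's actual splitting code may differ.

```python
def pdf_page_windows(
    total_pages: int, max_pages: int = 6, overlap: int = 1
) -> list[tuple[int, int]]:
    """Return (start, end) page ranges (end exclusive) covering a PDF,
    with `overlap` shared pages between consecutive windows.

    Hypothetical sketch of the oversized-PDF split; defaults mirror
    MULTIMODAL_PDF_MAX_PAGES=6 / MULTIMODAL_PDF_OVERLAP_PAGES=1.
    """
    if total_pages <= max_pages:
        return [(0, total_pages)]  # fits in one native embed_file() call
    step = max(1, max_pages - overlap)  # clamp so the loop always advances
    windows: list[tuple[int, int]] = []
    start = 0
    while start < total_pages:
        end = min(start + max_pages, total_pages)
        windows.append((start, end))
        if end == total_pages:
            break
        start += step
    return windows
```

Each window would then be embedded independently, yielding one dense vector per chunk rather than one per document.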
…llback seam

- Replace pydub RAM decode with ffprobe+ffmpeg stream copy for audio chunking (prevents OOM on large files — Codex R2 blocker #1/#2)
- Assign segment-specific textual_representation to media chunks so each gets its own sparse embedding context (Codex R2 blocker #3)
- Catch EmbedderProviderError alongside EmbedderInputError in the multimodal fallback seam so provider 4xx triggers graceful fallback (Codex R2 blocker #4)
- Remove stale mean-pooling comment (Codex R2 docs finding)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… docs

- chunk_audio: return a single segment only if BOTH duration <= limit AND file size <= 19MB. Short-but-large WAVs (e.g., 40s uncompressed at 23MB) now force-split via ffmpeg instead of being sent as-is to Gemini (R3 #2)
- _embed_oversized_pdf: catch EmbedderProviderError alongside EmbedderInputError so provider 4xx on individual PDF chunks triggers a skip, not a hard failure (R3 #4)
- Update media.py docstrings to reflect the ffmpeg-based architecture (R3 docs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
chunk_audio now calculates segment duration from bitrate when file size exceeds 19MB, ensuring every emitted segment fits within Gemini's 20MB inline_data limit even for short high-bitrate audio (e.g., a 40s uncompressed WAV at 29MB). Previously only duration was checked, causing oversized segments to be rejected by the API. (Codex R4 blocker)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on test

Codex R5 found that ffmpeg stream-copy segments can exceed the target size by container/header overhead (~102 bytes on a 19MB target). Added a 5% safety margin to the bitrate calculation. Also added a regression test for short-but-oversized audio (a 25MB/40s WAV must produce >=2 segments).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
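The size-aware segment sizing from the two commits above can be sketched as follows. This is an illustrative simplification with a hypothetical helper name — the actual code derives bitrate via ffprobe — but the arithmetic is the same: derive average bytes/second from file size and duration, then pick a segment duration that fits the size budget with a 5% margin for container/header overhead.

```python
def size_limited_segment_seconds(
    file_bytes: int,
    duration_seconds: float,
    max_segment_seconds: float,
    max_bytes: int = 19 * 1024 * 1024,
    safety_margin: float = 0.05,
) -> float:
    """Choose a segment duration so each ffmpeg stream-copy segment fits
    the size budget (hypothetical sketch, not the PR's exact code).

    Average bitrate is approximated as file_bytes / duration; the margin
    absorbs container/header overhead that stream copy can add.
    """
    bytes_per_second = file_bytes / duration_seconds
    fit_seconds = (max_bytes * (1.0 - safety_margin)) / bytes_per_second
    # Never exceed the duration limit, and never go below 1s per segment.
    return max(1.0, min(max_segment_seconds, fit_seconds))
```

For the 25MB/40s WAV in the regression test, this yields roughly 29s per segment — under the 40s total — so at least two segments are emitted.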
Replace placeholder text with real OCR by extracting keyframes at scene changes (ffmpeg select=scene filter) and OCRing each via Docling/Mistral (existing provider) with Gemini vision fallback. Deduplicates consecutive similar frames. Combined with audio transcription for full video text.

Configurable: MULTIMODAL_VIDEO_SCENE_THRESHOLD (default 0.3), MULTIMODAL_VIDEO_MAX_KEYFRAMES (default 30). Also bumps the video segment limit from 75s to 120s (the Gemini hard limit is 128s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…entation

Update all architecture docs, ADRs, C4 diagrams, and README to reflect the final state: scene-based keyframe OCR for video, ffmpeg stream-copy audio chunking, size-aware segment sizing, pipeline-level ENABLE_MEDIA_SYNC enforcement, oversized-PDF auto-splitting, and a full configuration reference. New: ADR-004 (scene-based keyframe OCR over alternatives).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…faults

Tests assumed VIDEO_AUDIO_MAX_SECONDS=75 (module default), but settings.py defines 120s. Also fix video converter tests for the scene-based OCR changes: convert_batch returns a placeholder on failure and wraps the transcript in a section header.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs

Cross-validation convergence fixes (Codex/Gemini/Amp):

Architecture:
- Decouple native dense embedding from text extraction success — embed_file() runs first, text is best-effort for sparse/BM25
- Align chunk-level dense+sparse: media segments get segment-specific text, PDF chunks get page-aligned text via PyMuPDF extraction
- Normalize ENABLE_MEDIA_SYNC as a single pipeline-level gate (remove redundant source-level gate from Google Drive)

Performance:
- Cache genai.Client per converter instance (no TCP leak per frame)
- Add asyncio.wait_for() timeout on all Gemini generate_content calls
- Centralize file size limit via _get_max_single_file_bytes()

Completeness:
- Add size-aware video splitting (mirrors audio logic) — short high-bitrate video no longer exceeds the 20MB inline_data limit
- Include the first frame in keyframe extraction (static videos get OCR)
- Require ffmpeg+ffprobe together (no partial fallback)

Transcription:
- Pluggable backend: gemini, whisper, mlx_whisper, parakeet
- Model: gemini-3-flash-preview (was obsolete gemini-2.0-flash)
- New settings: MULTIMODAL_TRANSCRIPTION_BACKEND, WHISPER_MODEL, PARAKEET_MODEL, TRANSCRIPTION_DEVICE (auto/cpu/cuda/mps)

Cleanup:
- Remove dead MULTIMODAL_AGGREGATION setting + docs references
- Fix README: config value name, token limit (10K not 8K)
- Fix ADR/docs drift vs implementation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
121 unit tests (up from 84), 9 live integration tests passing:

Audio converter (4 backends):
- Gemini: success, empty, failure, timeout, client caching, model config, large-file chunked routing
- Whisper: success, empty, model caching, not-installed error, no chunking for local backends
- MLX Whisper: success, not-installed, model name mapping
- Parakeet: success, empty, not-installed
- Backend routing: invalid backend error, device resolution

Video converter:
- OCR + audio combined, empty both → placeholder, batch size
- Client caching, no-key returns None
- Deduplication: empty, no-dupes, consecutive, similarity threshold, blanks
- Model configuration from settings
- Keyframe extraction: first-frame filter, no-frames returns None
- Gemini OCR timeout

Media chunker:
- Size-aware video splitting (25MB file at 30s → split)
- Small video not split
- ffmpeg+ffprobe both required (audio and video)

Pipeline (chunk_embed):
- Native embedding with empty text still produces a dense vector
- Empty text → placeholder for sparse scoring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…otocol

Fixes identified by cubic automated review:

- Clamp video chunk step to max(1, ...) after size-limited reduction, preventing an infinite loop when overlap >= max_seconds
- Media segment 0 carries the parent transcript text for BM25 searchability; subsequent segments get segment labels only (dense carries content)
- PDF chunks prefer page-aligned get_text(); fall back to parent OCR text on chunk 0 for scanned/image PDFs where get_text() is empty
- GeminiDenseEmbedder explicitly inherits MultimodalDenseEmbedderProtocol instead of relying on structural typing alone

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
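The step-clamping fix above addresses a classic sliding-window termination bug. A minimal sketch (hypothetical helper name; the real code operates on ffmpeg segment commands, not plain floats): when `overlap >= max_seconds`, the unclamped step `max_seconds - overlap` is zero or negative and the loop never advances.

```python
def segment_bounds(
    duration: float, max_seconds: float, overlap: float
) -> list[tuple[float, float]]:
    """Compute overlapping (start, end) segment bounds for a media file.

    Hypothetical sketch of the chunking loop. The step is clamped to at
    least 1 second so the loop terminates even when overlap >= max_seconds
    (the non-terminating-loop bug flagged by the review).
    """
    step = max(1.0, max_seconds - overlap)  # the fix: never <= 0
    bounds: list[tuple[float, float]] = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + max_seconds, duration)))
        start += step
    return bounds
```

Without the `max(1.0, ...)` clamp, size-limited reduction of `max_seconds` below `overlap` would loop forever on the same start offset.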
Codex R2 findings (5 items):
- Add asyncio.wait_for(timeout=120) to both embed_content() calls in gemini.py (text and multimodal API paths)
- MultimodalDenseEmbedderProtocol now inherits DenseEmbedderProtocol
- GeminiDenseEmbedder explicitly inherits MultimodalDenseEmbedderProtocol
- Cache AudioConverter in VideoConverter (no churn per video)
- Fix stale docs: registry description, README source-level gating ref

Gemini R2 finding (converged with Codex):
- Gate media entities at the top of process() via _filter_disabled_media() when ENABLE_MEDIA_SYNC=False — prevents expensive ffmpeg/Gemini transcription in the text pipeline for disabled media types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs

All 4 Codex R3 sub-9 findings fixed:

1. Completeness 8→9: add a hang-based timeout test for embed_content() using patched asyncio.wait_for with a fast timeout
2. Test Quality 8→9: add a _get_audio_converter() caching test + API key propagation test in VideoConverter
3. Integration Safety 8→9: add process()-level tests for _filter_disabled_media — media-only input returns [], and the text pipeline never receives audio entities when ENABLE_MEDIA_SYNC=False
4. Documentation 8→9: update README and ADR-003 to reflect _filter_disabled_media() as the authoritative gate (not _partition_by_embedding_mode)

181 tests passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes the last two Codex R4 sub-9 gaps (Completeness + Test Quality):

- Add an embed_file() hang test that proves asyncio.wait_for() fires on the multimodal _call_multimodal_api path, not just the text path
- Patch asyncio.wait_for to 0.01s, hang embed_content with sleep(999), and assert EmbedderTimeoutError or TimeoutError

All 8 criteria should now be 9/10 for Codex.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
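The timeout pattern these tests exercise can be sketched as below. The wrapper and the stand-in `EmbedderTimeoutError` class here are illustrative simplifications (the PR's actual error type and call sites live in gemini.py); the essential shape — `asyncio.wait_for` around the provider call, translated into a typed embedder error — is what the hang tests verify.

```python
import asyncio


class EmbedderTimeoutError(Exception):
    """Stand-in for the PR's timeout error type (name taken from the review thread)."""


async def call_with_timeout(coro, timeout: float = 120.0):
    """Wrap a provider coroutine in asyncio.wait_for so a hung
    embed_content request surfaces as a typed embedder error instead of
    stalling the whole sync. Hypothetical sketch, not the PR's exact code."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise EmbedderTimeoutError("Gemini call timed out") from exc
```

A hang test then patches the timeout down (e.g., 0.01s), passes a coroutine that sleeps far longer, and asserts the typed error is raised.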
…ion, docs

All 4 Codex R5 findings fixed:

1. Correctness/Integration: re-add entities dropped by build_for_batch() so embed_file() always runs on them with placeholder sparse text
2. Consistency: Gemini Matryoshka validation now uses dimension_range [128, 3072] — any value in range is accepted, not just enumerated values
3. Completeness/Docs: media.py docstring no longer claims a pydub fallback; default constants aligned with settings.py (120/120, not 75/115); removed unused 'field' import
4. Test Quality: added a test verifying dropped entities still get dense embeddings; fixed the sparse embedder mock to return dynamic-length results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- Native multimodal dense embedding via the Gemini `embed_content` API — no text extraction needed for the dense vector. Text queries retrieve documents, images, recordings, and video clips in a single unified vector space.
- `MediaChunker` splits media into embeddable segments using ffmpeg stream-copy (no RAM decode). Size-aware segment sizing handles oversized-but-short files (e.g., an uncompressed WAV exceeding 20MB).
- Scene-based keyframe extraction (ffmpeg `select=scene` filter) OCRs each frame via the existing Docling/Mistral provider with Gemini vision fallback, deduplicating consecutive similar frames. This populates `textual_representation` for BM25 sparse scoring and answer generation.
- Audio transcription via `generate_content` populates text for BM25 and auto-chunks large files before transcription.
- `ENABLE_MEDIA_SYNC=false` (default) gates all audio/video processing at the pipeline level. PDF/image multimodal is always active when using the Gemini embedder.

Architecture
The pipeline detects multimodal capability at runtime via `@runtime_checkable MultimodalDenseEmbedderProtocol`. OpenAI, Mistral, and Local embedders are completely unaffected — they don't implement the protocol, and all their entities route through the existing text pipeline.

Text extraction still runs for all native-embedded files (BM25, answer generation, reranking). For video, this includes scene-based keyframe OCR plus audio transcription.
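The runtime capability check described above can be sketched as follows. Class names and method signatures here are simplified illustrations (only the protocol name comes from the PR); the point is that `@runtime_checkable` lets `isinstance()` do structural dispatch, so embedders that never heard of the protocol still work unchanged.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MultimodalDenseEmbedderProtocol(Protocol):
    """Structural contract checked at runtime (signature simplified here)."""

    async def embed_file(self, path: str) -> list[float]: ...


class TextOnlyEmbedder:
    """Stand-in for OpenAI/Mistral/Local embedders: no embed_file()."""

    async def embed(self, text: str) -> list[float]:
        return [0.0]


class GeminiLikeEmbedder:
    """Stand-in for the Gemini provider: implements embed_file()."""

    async def embed(self, text: str) -> list[float]:
        return [0.0]

    async def embed_file(self, path: str) -> list[float]:
        return [0.0]


def supports_native_files(embedder: object) -> bool:
    # isinstance() works because the protocol is @runtime_checkable;
    # embedders lacking embed_file() simply fall through to the text pipeline.
    return isinstance(embedder, MultimodalDenseEmbedderProtocol)
```

The review thread later argued for declaring the protocol explicitly on `GeminiDenseEmbedder` as well, so the contract is visible at the class definition rather than implied by structure alone.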
Key design decisions (full ADRs included in `docs/gemini-embedding-2/`):

- `model_copy(deep=True)` memory amplification

Verified end-to-end against real Gemini API
8 live integration tests confirm:
Test results (113+ tests, 0 failures)
Configuration (all optional, sensible defaults)
| Setting | Default |
| --- | --- |
| `ENABLE_MEDIA_SYNC` | `false` |
| `MULTIMODAL_PDF_MAX_PAGES` | 6 |
| `MULTIMODAL_PDF_OVERLAP_PAGES` | 1 |
| `MULTIMODAL_MAX_FILE_SIZE_MB` | 20 |
| `MULTIMODAL_AUDIO_MAX_SECONDS` | 75 |
| `MULTIMODAL_VIDEO_AUDIO_MAX_SECONDS` | 120 |
| `MULTIMODAL_VIDEO_NOAUDIO_MAX_SECONDS` | 120 |
| `MULTIMODAL_MEDIA_OVERLAP_SECONDS` | 5 |
| `MULTIMODAL_VIDEO_SCENE_THRESHOLD` | 0.3 |
| `MULTIMODAL_VIDEO_MAX_KEYFRAMES` | 30 |

Early feedback requested
This is a draft PR — we'd love feedback on:
- Chunking strategy for oversized PDFs: we split into 6-page chunks with 1-page overlap, each embedded independently via `embed_file()`. We explored Gemini's native multi-part aggregation, but the API limits document parts to 1 per content entry. Are separate vectors per chunk consistent with how Airweave wants to model document sections?
- Video text extraction approach: we use ffmpeg scene detection for change-based keyframe extraction, then OCR via the existing image converter (Docling/Mistral). Would the team prefer a different approach (e.g., Gemini vision for richer descriptions, fixed-interval sampling)?
- ENABLE_MEDIA_SYNC gating: currently enforced at the pipeline level in `_partition_by_embedding_mode()`. The Google Drive source also has a defense-in-depth check. Should other sources (Notion, Dropbox) that can emit audio/video also get source-level checks?
- Dockerfile changes: we add `ffmpeg` and `pydub` to all three Dockerfiles. Is there a preferred approach for adding system-level dependencies?

Documentation
Full architecture documentation is included in `docs/gemini-embedding-2/`.

Test plan

- Unit and E2E suites run in CI (`ffmpeg` and `pydub` in CI image)
- Live integration tests require `ENABLE_MEDIA_SYNC=true` and real media sources

Generated with Claude Code
Summary by cubic
Adds Gemini Embedding 2 as a dense provider with native multimodal embeddings (PDF, image, audio, video) in one vector space and purpose-aware embeddings for better retrieval. Improves correctness and robustness with range-based Matryoshka dims, guaranteed native embedding for media entities, tighter media chunking, and consistent fallbacks.
New Features
- Purpose-aware embeddings (`DOCUMENT`/`QUERY`); all callers pass an explicit purpose.
- Native multimodal via `embed_content`: PDFs (≤6 pages; oversized split with overlap), images, and audio/video segmented via `ffprobe`/`ffmpeg` stream-copy; size-aware splitting; first frame included.
- Text extraction populates `textual_representation`; default model `gemini-3-flash-preview`, with optional local backends (`whisper`, `mlx_whisper`, `parakeet`).
- Runtime dispatch on `MultimodalDenseEmbedderProtocol`; process-level `ENABLE_MEDIA_SYNC` gate; native embedding proceeds even if text extraction fails; graceful fallback on input/provider errors. Cached `genai.Client`, cached `AudioConverter`, 120s timeouts, and hang-based timeout tests.

Bug Fixes
- Re-add entities dropped by `build_for_batch()` so `embed_file()` always runs; placeholder sparse text is used when needed.

Written for commit 53e0641. Summary will update on new commits.