fix(rag): 20-item review pass — security, perf, arch, simplification #14
Conversation
…ctory

Splits the 1429-line mcp_bridge.rs into a mcp_bridge/ directory module:

- mod.rs: shared types (McpSource, BridgeConfig, SyncReport, BridgeDocument, WriteStatus), the McporterExecutor trait, the McpBridge<E> struct plus sync() dispatch, call_data(), private helper functions, and all tests
- executor.rs: ProcessMcporterExecutor implementation (59 lines)
- io.rs: write_bridge_document(), relative_output_path(), build_extraction(), and file I/O helpers (129 lines)
- linkding.rs: sync_linkding() + normalize_linkding_record() (120 lines)
- memos.rs: sync_memos() + normalize_memo_record() (86 lines)
- bytestash.rs: sync_bytestash() + normalize_bytestash_record() (105 lines)
- paperless.rs: sync_paperless() + fetch_paperless_lookup() + normalize_paperless_record() (127 lines)

Zero behavior change. Public API identical. All platform files < 200 lines. A sketch of the resulting module wiring is shown after this message.
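To make the layout concrete, here is a minimal sketch of what the new mod.rs could look like. Everything beyond the module names and the types listed above (field layouts, visibility, re-exports) is an assumption, not the actual file:

```rust
// mcp_bridge/mod.rs — hypothetical wiring for the split described above.
mod bytestash;
mod executor;
mod io;
mod linkding;
mod memos;
mod paperless;

pub use executor::ProcessMcporterExecutor;

// Shared types stay at the module root so the public API matches the
// old single-file mcp_bridge.rs.
pub enum McpSource {
    Linkding,
    Memos,
    Bytestash,
    Paperless,
}

pub trait McporterExecutor {
    // The real trait invokes mcporter and returns parsed JSON; its exact
    // signature is not shown in this PR excerpt.
}

pub struct McpBridge<E: McporterExecutor> {
    executor: E,
}
```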
…fice_zip_file
Replace bare entry.read_to_string() in the ODT/PPTX branch with the same
two-pass pattern already used by the DOCX branch: advisory entry.size() checks
remain as a fast-path, but actual reads are now capped by .take(per_entry_cap+1)
and measured against MAX_ODT_PPTX_PER_ENTRY_BYTES (10 MiB) and
MAX_ODT_PPTX_TOTAL_BYTES (50 MiB cumulative).
A crafted zip that declares tiny central-directory sizes but expands to
gigabytes is now rejected before allocating unbounded memory.
Tests: odt_decompression_bomb_rejected_by_measured_guard,
odt_cumulative_decompression_bomb_rejected,
odt_small_legitimate_file_parses_ok,
docx_entry_count_bomb_rejected
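The same guard, reduced to its core: a minimal sketch assuming the constant names from the commit message and a generic reader in place of the real zip entry type; the actual function and error type in parse_office_zip_file differ.

```rust
use std::io::Read;

// Caps from the commit message; exact constant names are assumptions.
const MAX_ODT_PPTX_PER_ENTRY_BYTES: u64 = 10 * 1024 * 1024; // 10 MiB
const MAX_ODT_PPTX_TOTAL_BYTES: u64 = 50 * 1024 * 1024; // 50 MiB cumulative

/// Read one decompressed entry under a measured cap. Taking cap + 1 bytes
/// lets an over-budget entry be detected without decompressing all of it,
/// regardless of what the (attacker-controlled) central directory claims.
fn read_entry_capped<R: Read>(entry: &mut R, total_so_far: u64) -> Result<String, String> {
    let remaining = MAX_ODT_PPTX_TOTAL_BYTES.saturating_sub(total_so_far);
    let per_entry_cap = MAX_ODT_PPTX_PER_ENTRY_BYTES.min(remaining);
    let mut xml = String::new();
    let read_bytes = entry
        .take(per_entry_cap + 1)
        .read_to_string(&mut xml)
        .map_err(|e| e.to_string())? as u64;
    if read_bytes > per_entry_cap {
        return Err("decompressed entry exceeds ODT/PPTX budget".into());
    }
    Ok(xml)
}
```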
…ion in chunk()
The token_estimate diagnostic field was previously computed by calling
tokenizer.encode() on every emitted chunk, doubling tokenizer invocations
on the spawn_blocking hot path (the splitter already runs the tokenizer
internally via ChunkConfig::with_sizer for accurate boundary placement).
ChunkCharIndex in text-splitter 0.25 does not expose a token count field,
so the fix replaces token_estimate(&text, tokenizer) with word_count(&text).
The splitter's boundary accuracy is unchanged; only the diagnostic metadata
field now uses a word-count approximation.
Also removes the now-unused token_estimate() helper to keep clippy clean.
Tests: chunk_token_estimate_uses_word_count_not_tokenizer,
chunk_empty_content_returns_empty
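For reference, the replacement diagnostic is just a whitespace split. A minimal sketch — the real helper lives in chunker.rs and its exact signature is an assumption:

```rust
/// Cheap token-count proxy for the diagnostic metadata field only.
/// Chunk boundaries still come from the tokenizer-backed splitter.
pub(crate) fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```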
…move direct field access
Pipeline::new still accepts RagConfig by value and wraps it with Arc::new internally — call sites are unchanged. WorkerContext.config: Arc<RagConfig> clones the Arc pointer (O(1)) instead of deep-cloning the config value per worker.
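A minimal sketch of the ownership change, with struct fields reduced to the relevant one (names other than Pipeline::new, WorkerContext, and RagConfig are assumptions):

```rust
use std::sync::Arc;

pub struct RagConfig {/* fields elided */}

pub struct Pipeline {
    config: Arc<RagConfig>,
}

pub struct WorkerContext {
    pub config: Arc<RagConfig>,
}

impl Pipeline {
    // Call sites still pass RagConfig by value; the Arc wrap is internal.
    pub fn new(config: RagConfig) -> Self {
        Self { config: Arc::new(config) }
    }

    // Each worker clones the Arc pointer (O(1)), not the config value.
    pub fn worker_context(&self) -> WorkerContext {
        WorkerContext { config: Arc::clone(&self.config) }
    }
}
```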
…se-in-depth design
…roots to spawn_workers

watch_roots is now computed in runtime::run(), passed directly to spawn_workers() as Arc<Vec<PathBuf>>, and forwarded to WorkerContext::from_pipeline() as a parameter. Eliminates the runtime panic from OnceLock::get().expect() and enforces the invariant at compile time instead.
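A sketch of the threading, assuming runtime::run() canonicalizes the roots once; names other than spawn_workers and WorkerContext are invented for illustration:

```rust
use std::path::PathBuf;
use std::sync::Arc;

fn run(configured_roots: Vec<PathBuf>) {
    // Computed once here instead of stashed in a OnceLock.
    let watch_roots: Arc<Vec<PathBuf>> = Arc::new(configured_roots);
    spawn_workers(Arc::clone(&watch_roots));
}

fn spawn_workers(watch_roots: Arc<Vec<PathBuf>>) {
    // Every worker receives the roots as a constructor argument, so a
    // missing-initialization panic is impossible by construction.
    let _ctx = WorkerContext { watch_roots };
}

struct WorkerContext {
    watch_roots: Arc<Vec<PathBuf>>,
}
```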
…-file dispatch overhead
…ocking_parse helper
t0 previously captured canonicalize + open + read_to_end time, then parse_ms was recorded before parse_file() was called — measuring I/O, not parse CPU. Split into io_t0 (file I/O through read_to_end) and parse_t0 (wrapping parse_file only). Add io_ms field to JobStats, total_io_ms to SessionCounters/CounterSnapshot, and emit both in the per-job tracing::info and the heartbeat log so operators can distinguish I/O-bound from parse-bound workloads.
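The split in miniature; local names follow the commit message (io_t0/parse_t0), while the surrounding function and parser are stand-ins:

```rust
use std::time::Instant;

fn process_one(path: &std::path::Path) -> std::io::Result<(u64, u64)> {
    // I/O phase: canonicalize + open + read_to_end.
    let io_t0 = Instant::now();
    let canonical = std::fs::canonicalize(path)?;
    let bytes = std::fs::read(&canonical)?;
    let io_ms = io_t0.elapsed().as_millis() as u64;

    // Parse phase: CPU only, so parse_ms no longer includes disk time.
    let parse_t0 = Instant::now();
    let _parsed = parse_file(&bytes);
    let parse_ms = parse_t0.elapsed().as_millis() as u64;

    Ok((io_ms, parse_ms))
}

fn parse_file(bytes: &[u8]) -> usize {
    bytes.len() // stand-in for the real parser
}
```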
…surface confinement violations

Replace RagError::Generic at all five validate_url_scheme call sites with RagError::UrlValidation so callers can pattern-match on URL-specific failures without parsing error strings. Replace the chunker spawn_blocking join-handle panic with RagError::WorkerPanic for the same reason.

Surface path confinement rejections to monitoring by calling ctx.counters.record_failure() before the silent Ok(default) return at the TOCTOU guard — confinement rejections are now visible in files_failed heartbeat metrics even though no Err is propagated.
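A sketch of the added variants, assuming the crate uses thiserror; the existing variants and exact payload types are not shown in this PR excerpt:

```rust
#[derive(Debug, thiserror::Error)]
pub enum RagError {
    #[error("URL validation failed: {0}")]
    UrlValidation(String),

    #[error("worker task panicked: {0}")]
    WorkerPanic(String),

    #[error("{0}")]
    Generic(String),
    // ...remaining variants elided
}

// Callers can now match on the failure class instead of parsing strings:
fn is_url_failure(err: &RagError) -> bool {
    matches!(err, RagError::UrlValidation(_))
}
```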
…eliminate 14 mut locals in build_point_payload
…in process_job

Add find_cached_branch() helper that walks ancestor directories of the already-canonical file path against the DashMap cache (O(depth) in-memory lookups, no I/O). If any ancestor matches a cached git root, return the stored branch immediately without spawning a blocking thread.

On cache miss, canonicalize the detected git root before storing it so future pre-checks via canonical path ancestors always match, regardless of symlinks or non-normalized input paths. Also cache the miss (None) at the immediate parent directory to skip re-stating .git for files outside any git repository.

This eliminates spawn_blocking for every file after the first in each git repository, which is the common case for large watched directories.
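A sketch of the ancestor-walk fast path, assuming a DashMap<PathBuf, Option<String>> cache where the inner None records a confirmed "not in any git repo" miss; the real helper's signature may differ:

```rust
use dashmap::DashMap;
use std::path::{Path, PathBuf};

/// Walk ancestors of an already-canonical file path against the cache.
/// O(depth) in-memory lookups; no filesystem I/O and no spawn_blocking.
fn find_cached_branch(
    cache: &DashMap<PathBuf, Option<String>>,
    canonical_file: &Path,
) -> Option<Option<String>> {
    for ancestor in canonical_file.ancestors().skip(1) {
        if let Some(entry) = cache.get(ancestor) {
            // Outer Some = cache hit; inner Option = branch or cached miss.
            return Some(entry.value().clone());
        }
    }
    None // no ancestor cached yet: caller falls back to blocking detection
}
```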
…iminate N-per-chunk clones
…iminate double-collect
📝 Walkthrough

The PR refactors noxa-rag by simplifying token estimation to a word-count approximation, modularizing MCP bridge implementations into platform-specific modules, adding XML entity-expansion security checks, improving embedding batching, making the parse module public with refactored payload construction, splitting pipeline timing metrics into I/O and parsing, converting shared config to Arc-based ownership, and introducing new error variants.
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor User
participant Pipeline
participant Executor
participant MCPServer
participant FileIO
participant Store
User->>Pipeline: Configure MCP bridge (linkding/memos/bytestash/paperless)
Pipeline->>Executor: spawn_workers with watch_roots Arc
loop Sync Loop (per platform)
Executor->>MCPServer: Execute action (list/lookup)
MCPServer-->>Executor: JSON response
Executor->>Executor: Normalize record to BridgeDocument
Executor->>FileIO: write_bridge_document
FileIO->>FileIO: Compute relative_output_path
FileIO->>FileIO: Compare serialized bytes<br/>(avoid rewrite)
FileIO->>FileIO: Atomic write (temp + rename)
FileIO-->>Executor: WriteStatus
Executor->>Executor: Update SyncReport counters
end
Executor-->>Pipeline: SyncReport (fetched/written/skipped)
```

```mermaid
sequenceDiagram
participant Worker
participant Parser as ParseModule
participant Security
participant Chunker
participant Store
Worker->>Parser: parse_file (bytes, url, provenance)
alt Is XML/RSS/HTML
Parser->>Security: contains_xml_entity_expansion_risk(bytes)
Security-->>Parser: bool
alt Risk Detected
Parser-->>Worker: RagError::Parse
else Safe
Parser->>Parser: parse_html_file or parse_xml_file
end
else Binary (PDF/Office)
Parser->>Parser: Decompress with budget caps
alt Exceeds Budget
Parser-->>Worker: RagError::Parse
else Within Budget
Parser->>Parser: Extract text
end
end
Parser->>Chunker: Split text into chunks
Chunker->>Chunker: word_count approximation<br/>(replaces tokenizer)
loop For each chunk
Parser->>Parser: build_point_payload(chunk, FileMetadata)
Parser->>Store: Insert PointPayload
end
Parser-->>Worker: Emit JobStats with io_ms + parse_ms
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
This PR is a broad hardening + performance + refactor pass across noxa-rag’s ingestion pipeline (parsing → chunking → embedding → Qdrant upsert), plus a new MCP bridge module split intended to replace the previous monolithic mcp_bridge.rs.
Changes:
- Adds multiple ingestion safety guards (zip bomb caps, XML entity expansion pre-scan, startup-scan JSON size cap, safer logging) and improves error taxonomy.
- Improves ingestion throughput/latency (startup scan batching, avoids double tokenization, avoids redundant clones, git-branch cache fast path, improved timing metrics).
- Refactors parsing/provenance and pipeline internals (per-file metadata extraction, provenance application, watch roots handling, new MCP bridge module layout).
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| crates/noxa-rag/src/types.rs | Removes IngestionContext and keeps provenance fields in PointPayload. |
| crates/noxa-rag/src/store/qdrant/vector_store.rs | Switches Qdrant upsert to wait=true to avoid post-upsert delete visibility races. |
| crates/noxa-rag/src/pipeline/worker.rs | Passes canonicalized watch_roots into worker context explicitly. |
| crates/noxa-rag/src/pipeline/startup_scan.rs | Batches startup_scan_key computations per spawn_blocking task to reduce overhead. |
| crates/noxa-rag/src/pipeline/scan.rs | Adds symlink-safety docs and a JSON size guard + tests for startup_scan_key. |
| crates/noxa-rag/src/pipeline/runtime.rs | Removes OnceLock watch-roots setup; threads watch_roots into worker spawn. |
| crates/noxa-rag/src/pipeline/process.rs | Separates IO vs parse timing, adds git-branch cache precheck, improves failure accounting, and uses new error variants. |
| crates/noxa-rag/src/pipeline/parse/text.rs | Uses shared word_count, makes HTML parse sync (spawned via helper), and adds XML entity-expansion pre-scan. |
| crates/noxa-rag/src/pipeline/parse/tests.rs | Adds tests ensuring XML/feed entity-expansion payloads are rejected. |
| crates/noxa-rag/src/pipeline/parse/rich.rs | Applies the XML entity-expansion pre-scan to feed parsing; uses shared word_count. |
| crates/noxa-rag/src/pipeline/parse/mod.rs | Adds FormatProvenance::apply, introduces FileMetadata, normalizes spawn-blocking usage, and simplifies payload building. |
| crates/noxa-rag/src/pipeline/parse/binary.rs | Adds measured ODT/PPTX decompression caps and tests; uses shared word_count. |
| crates/noxa-rag/src/pipeline/heartbeat.rs | Emits io_ms in heartbeat snapshots. |
| crates/noxa-rag/src/pipeline.rs | Adds IO timing counters, refactors counters API, wraps config in Arc, removes OnceLock watch-roots. |
| crates/noxa-rag/src/mcp_bridge/paperless.rs | Adds Paperless MCP normalization + sync logic. |
| crates/noxa-rag/src/mcp_bridge/mod.rs | New MCP bridge module root with shared helpers, executor trait, sync orchestration, and tests. |
| crates/noxa-rag/src/mcp_bridge/memos.rs | Adds Memos MCP sync + record normalization. |
| crates/noxa-rag/src/mcp_bridge/linkding.rs | Adds Linkding MCP sync + record normalization (including platform URL building). |
| crates/noxa-rag/src/mcp_bridge/io.rs | Adds bridge document storage format + atomic-ish write helper and path sanitization. |
| crates/noxa-rag/src/mcp_bridge/executor.rs | Adds ProcessMcporterExecutor to invoke mcporter as a subprocess and parse JSON output. |
| crates/noxa-rag/src/mcp_bridge/bytestash.rs | Adds Bytestash MCP sync + record normalization. |
| crates/noxa-rag/src/error.rs | Adds UrlValidation and WorkerPanic error variants. |
| crates/noxa-rag/src/embed/tei.rs | Avoids cloning batch texts and omits retry body logs when an auth token is configured. |
| crates/noxa-rag/src/chunker.rs | Makes word_count reusable, switches token estimate to word-count approximation, and adds tests to prevent re-tokenization regressions. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd4243a979
- Remove mcp_bridge.rs (was deleted on disk but not staged in commit 59a195c; caused E0761 duplicate module root alongside mcp_bridge/)
- mcp_bridge/io.rs: remove destination before rename so write_bridge_document is idempotent on Windows (rename fails if dst exists)
- parse/text.rs: scan raw bytes in contains_xml_entity_expansion_risk instead of from_utf8().unwrap_or("") — non-UTF-8 bytes no longer silently suppress the <!DOCTYPE/<!ENTITY guard (see the sketch after this list)
- pipeline/scan.rs: replace hardcoded MAX_SCAN_JSON_BYTES with a max_json_bytes parameter threaded from pipeline.config.pipeline.max_file_size_bytes
- pipeline/scan.rs: fix startup_scan_key_json_at_size_limit_is_processed to actually exercise the boundary condition (len == limit, not a tiny file)

Resolves review threads PRRT_kwDOR_mP6c59t3qL, PRRT_kwDOR_mP6c59t25j, PRRT_kwDOR_mP6c59t25d, PRRT_kwDOR_mP6c59t25O, PRRT_kwDOR_mP6c59t3qN, PRRT_kwDOR_mP6c59t25n
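The raw-byte entity scan mentioned above, in miniature. This is a sketch only: the real function may distinguish XML flavors, and the helper name windows_contains is invented:

```rust
/// Case-insensitive scan of raw bytes for DOCTYPE/ENTITY declarations.
/// Operating on bytes (not a lossy UTF-8 conversion) means invalid UTF-8
/// can no longer slip past the guard.
fn contains_xml_entity_expansion_risk(bytes: &[u8]) -> bool {
    let upper: Vec<u8> = bytes.iter().map(u8::to_ascii_uppercase).collect();
    windows_contains(&upper, b"<!DOCTYPE") || windows_contains(&upper, b"<!ENTITY")
}

fn windows_contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|window| window == needle)
}
```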
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
crates/noxa-rag/src/embed/tei.rs (2)
266-275: ⚠️ Potential issue | 🟡 Minor
Body still leaks via the non-retryable error path when `auth_token` is set.

The retry-log omission added above is undone here: on any non-success status that isn't 429/503/413 (or after `should_retry` returns false), the response body is read and a 512-char preview is embedded into `RagError::Embed { message }`. That error message is virtually always logged by upstream callers, so the same TEI body bytes the retry path now suppresses will still surface in logs whenever auth is configured. Since the PR's stated goal is "Omit TEI response body from retry logs when auth_token is configured", this is a consistency/security gap worth closing in the same place.

🛡️ Suggested fix

```diff
-let body = resp.text().await.unwrap_or_default();
-let preview: String = body.chars().take(512).collect();
-return Err(RagError::Embed {
-    message: format!("TEI /embed returned HTTP {status_u16}: {preview}"),
-    status: Some(status_u16),
-});
+let message = if self.auth_token.is_some() {
+    // Body omitted to avoid leaking auth-gated response content into error logs.
+    format!("TEI /embed returned HTTP {status_u16} (body omitted: auth token configured)")
+} else {
+    let body = resp.text().await.unwrap_or_default();
+    let preview: String = body.chars().take(512).collect();
+    format!("TEI /embed returned HTTP {status_u16}: {preview}")
+};
+return Err(RagError::Embed {
+    message,
+    status: Some(status_u16),
+});
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/embed/tei.rs` around lines 266-275, the non-retryable error path still reads and includes a 512-char TEI response body preview in RagError::Embed, leaking sensitive TEI content when an auth token is configured; change the branch that constructs the error so it does NOT read/attach the response body if the request used an auth token (e.g., check the same auth_token flag/variable used earlier), and instead produce a safe generic message (or omit the preview) in RagError::Embed; retain the body preview only when no auth token is present so behavior matches the earlier retry-path omission.
332-352: 🧹 Nitpick | 🔵 Trivial
Refactor is correct; the intermediate `Vec` is unnecessary.

Slicing `&texts[start..end]` inside `async move` cleanly avoids the previous per-batch `Vec<String>` clone, and the borrow is sound because `self` and `texts` outlive the `buffered(..).collect()` await. One optional nit: `futures::stream::iter` accepts any `IntoIterator`, so the `batch_ranges: Vec<(usize, usize, usize)>` materialization adds an extra allocation for no benefit — you can feed the range iterator directly.

♻️ Optional simplification

```diff
-let batch_ranges: Vec<(usize, usize, usize)> = (0..total_batches)
-    .map(|i| {
-        let start = i * BATCH_SIZE;
-        let end = (start + BATCH_SIZE).min(texts.len());
-        (i, start, end)
-    })
-    .collect();
-
-let results: Vec<Vec<Vec<f32>>> = futures::stream::iter(batch_ranges)
+let results: Vec<Vec<Vec<f32>>> = futures::stream::iter((0..total_batches).map(|i| {
+    let start = i * BATCH_SIZE;
+    let end = (start + BATCH_SIZE).min(texts.len());
+    (i, start, end)
+}))
     .map(|(batch_idx, start, end)| async move {
         self.embed_batch_adaptive(&texts[start..end], batch_idx, total_batches)
             .await
     })
     .buffered(EMBED_PIPELINE_DEPTH)
     .collect::<Vec<_>>()
     .await
     .into_iter()
     .collect::<Result<_, _>>()?;
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/embed/tei.rs` around lines 332-352, the intermediate Vec batch_ranges is an unnecessary allocation; remove the batch_ranges binding and pass the iterator directly to futures::stream::iter by using (0..total_batches).map(|i| { let start = i * BATCH_SIZE; let end = (start + BATCH_SIZE).min(texts.len()); (i, start, end) }) so the stream maps over those tuples and calls self.embed_batch_adaptive(&texts[start..end], batch_idx, total_batches).await, keeping the existing buffered(EMBED_PIPELINE_DEPTH).collect().await.into_iter().collect::<Result<_, _>>()? plumbing intact.

crates/noxa-rag/src/types.rs (1)
213-215: ⚠️ Potential issue | 🟡 Minor
Fix `cargo fmt` failure: extra blank line.

CI reported a `cargo fmt --check --all` diff. The double blank line between the closing `}` of `SearchMetadataFilter` and the `#[cfg(test)]` module looks like the culprit.

📐 Proposed fix

```diff
 }
-
 #[cfg(test)]
 mod tests {
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/types.rs` around lines 213-215, remove the extra blank line between the end of the SearchMetadataFilter type declaration (the closing "}" for SearchMetadataFilter) and the subsequent #[cfg(test)] module; ensure there is only a single newline separating them so the file matches cargo fmt formatting (i.e., place #[cfg(test)] immediately after one blank line following the SearchMetadataFilter closing brace).

crates/noxa-rag/src/pipeline/parse/text.rs (1)
171-199: ⚠️ Potential issue | 🟡 Minor
Fix `cargo fmt` failure on `parse_xml_file` signature/body wrapping.

CI reports a fmt diff in this region (likely the `Result<ParsedFile, RagError>` signature wrapping or the `Err(RagError::Parse(...))` block). Run `cargo fmt --all` and commit.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/pipeline/parse/text.rs` around lines 171-199, the function parse_xml_file has formatting that fails cargo fmt (signature/body wrapping around Result<ParsedFile, RagError> and the Err(RagError::Parse(...)) block); run `cargo fmt --all` to reformat the file, ensure the parse_xml_file signature and its Err return are wrapped/indented per rustfmt rules, then add and commit the formatted changes so CI passes (locate parse_xml_file and the Err(RagError::Parse(...)) return to verify formatting).

crates/noxa-rag/src/pipeline/scan.rs (1)
spawn_startup_scancall site wiringmax_json_bytesto config source.The doc note at lines 107-108 correctly explains the design, but the relationship is only documented in
startup_scan_key's function comment. At the call site inruntime.rs:99-106, the parameter is already correctly wired frompipeline.config.pipeline.max_file_size_bytes, but a defensive comment would make this coupling explicit and prevent accidental drift. Consider adding a comment above the call:// max_json_bytes wired from pipeline.max_file_size_bytes to stay in sync with process_job guardAlternatively, a debug_assert in
spawn_startup_scancould validate at runtime (e.g., assert the parameter is not zero if config is expected to be non-zero).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/pipeline/scan.rs` around lines 123 - 184, At the call site of spawn_startup_scan in runtime.rs (where you pass pipeline.config.pipeline.max_file_size_bytes into startup_scan_key), add an explicit comment stating that max_json_bytes is intentionally wired to pipeline.config.pipeline.max_file_size_bytes to keep it in sync with the process_job guard and startup_scan_key behavior; optionally, add a defensive debug_assert inside spawn_startup_scan (or at its caller) that the passed max_json_bytes is non-zero (or matches expected invariants) to catch accidental config drift at runtime.crates/noxa-rag/src/mcp_bridge/mod.rs (1)
106-117: 🧹 Nitpick | 🔵 Trivial
Remove the redundant `pub(self)` visibility modifier.

`pub(self)` provides no additional visibility compared to omitting the modifier entirely and is flagged by the `clippy::needless_pub_self` lint (enabled by default in Clippy 1.72.0+). Since `call_data` is used exclusively within submodule implementations via `self.call_data(...)`, the visibility marker is unnecessary.

♻️ Proposed fix

```diff
-    pub(self) async fn call_data(
+    async fn call_data(
         &self,
         source: McpSource,
         action: &str,
         params: Value,
     ) -> Result<Value, RagError> {
```

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In `@crates/noxa-rag/src/mcp_bridge/mod.rs` around lines 106-117, the method declaration uses the redundant visibility modifier `pub(self)` on async fn call_data; remove `pub(self)` so the signature becomes async fn call_data(&self, source: McpSource, action: &str, params: Value) -> Result<Value, RagError> and keep the body unchanged; this addresses the clippy::needless_pub_self lint while preserving usage from submodule callers and retaining the call to self.executor.call(...) and extract_mcporter_data(raw).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/noxa-rag/src/chunker.rs`:
- Line 132: Remove the unused import "use std::str::FromStr as _;" from
chunker.rs to silence the CI warning; the code that calls
str::parse::<Tokenizer>() does not require the trait to be imported, so delete
the FromStr import line (referenced symbol: FromStr as _ in chunker.rs) and run
cargo check to confirm the warning is gone.
In `@crates/noxa-rag/src/mcp_bridge/bytestash.rs`:
- Around line 5-10: CI indicates rustfmt import grouping/line-wrapping issues in
this file; run cargo fmt --all to apply rustfmt rules or manually reflow the use
statement that imports BridgeDocument, McpBridge, McpSource, McporterExecutor,
SyncReport, WriteStatus, array_field, join_base_url, join_non_empty,
optional_string, required_base_url, required_string, string_array and the nested
io::{build_extraction, write_bridge_document} so they follow rustfmt's grouping
and line length rules (split into multiple lines or group paths as rustfmt
expects) and re-run cargo fmt to verify the formatting passes.
- Around line 16-38: The sync_bytestash implementation currently makes a single
snippets.list call and loads all results; change it to paginate using
self.config.page_size by repeatedly calling call_data(McpSource::Bytestash,
"snippets.list", params) with pagination params (e.g. page_size and a page token
or offset depending on the Bytestash API) and accumulate records rather than
calling it once; inside the loop use array_field(&data, "snippets") or
data.as_array() to collect page results, update report counts as each record is
processed (normalize_bytestash_record, write_bridge_document, WriteStatus
handling remain the same), and stop when the API indicates there are no more
pages (no next_page_token / empty page), mirroring the pagination pattern used
in linkding.rs/memos.rs/paperless.rs.
In `@crates/noxa-rag/src/mcp_bridge/executor.rs`:
- Around line 48-53: The current error builds RagError::Generic by interpolating
String::from_utf8_lossy(&output.stderr).trim() which may leak secrets; instead,
debug!-log the full stderr (e.g., debug!("mcporter stderr: {:?}",
String::from_utf8_lossy(&output.stderr))) and return a sanitized
RagError::Generic that does not include the raw stderr—use a short safe summary
such as "mcporter call {selector} {action} failed (stderr redacted)" or include
only a non-sensitive prefix/hex-hash of output.stderr; update the code around
the existing Err(...) construction to perform the debug! call and replace the
interpolated stderr with the sanitized message while still referencing selector
and action.
- Around line 36-45: The mcporter invocation using Command::output().await can
hang indefinitely; add a configurable timeout field (e.g., timeout: Duration) to
ProcessMcporterExecutor, wrap the child spawn/await in tokio::time::timeout
using that field, and if the timeout elapses ensure the child process is killed
(send kill/terminate to the Child returned by Command::spawn) and return a
RagError indicating timeout; update the call site that currently uses
Command::new(&self.executable)...output().await to spawn the child, await its
output under timeout, and handle both timeout and child-kill error paths
accordingly.
In `@crates/noxa-rag/src/mcp_bridge/io.rs`:
- Around line 10-16: The filenames are unstable because stable_component_hash
(used by relative_output_path) relies on DefaultHasher; replace that hashing
with a deterministic 64-bit xxh3 implementation (xxhash_rust::xxh3::xxh3_64) so
the same external_id always yields the same 16‑hex filename. Update the
stable_component_hash function to compute xxh3_64(external_id.as_bytes()) and
keep the existing "{:016x}" formatting, and replace any other uses of
DefaultHasher in this file (e.g., other calls referenced around the 104-109
region) to use xxh3_64 to ensure on-disk paths remain stable across Rust
toolchain upgrades.
- Around line 37-43: The existing remove-then-rename sequence (using
temp_output_path, tokio::fs::write, tokio::fs::remove_file, tokio::fs::rename)
is intentional for reliable cross-platform behavior; update the inline comment
above that block to explicitly state why we keep remove_file: note that although
std::fs::rename may use MoveFileExW / FileRenameInfoEx on Windows, atomic
replacement semantics depend on Rust version and Windows 10+ API differences and
will fail if the destination is a directory, so without a guaranteed MSRV/target
Windows version we intentionally remove the destination first to ensure success;
leave the code and return WriteStatus::Written as-is.
In `@crates/noxa-rag/src/mcp_bridge/linkding.rs`:
- Around line 55-60: The check that decides when to stop paging currently does
two lookups of data.get("next") and can be simplified and made more efficient:
replace the condition `data.get("next").is_none() ||
data.get("next").is_some_and(Value::is_null)` with a single lookup like
`data.get("next").map_or(true, Value::is_null)` (or the equivalent
`!data.get("next").is_some_and(|v| !v.is_null())`) so the "next" field is read
once; keep the subsequent `if new_records == 0 { break; }` logic unchanged.
- Around line 8-13: The import block inside the super::{...} in linkding.rs is
not formatted to rustfmt/cargo fmt standards causing CI failures; run cargo fmt
--all (or rustfmt on that file) to canonicalize the grouping and line-wrapping
of the super::{ BridgeDocument, McpBridge, McpSource, McporterExecutor,
SyncReport, WriteStatus, array_field, as_u64_value, join_non_empty,
optional_string, required_string, required_value, string_array,
io::{build_extraction, write_bridge_document} } import so it matches the
project's formatting rules and removes the CI diff.
In `@crates/noxa-rag/src/mcp_bridge/memos.rs`:
- Around line 5-10: The import list in crates/noxa-rag/src/mcp_bridge/memos.rs
is not formatted to rustfmt's grouping/wrapping rules (symbols like
BridgeDocument, McpBridge, McpSource, McporterExecutor, SyncReport, WriteStatus,
array_field, first_line_title, join_base_url, optional_string,
required_base_url, required_string, string_array, and the io imports
build_extraction and write_bridge_document are flagged); run rustfmt by
executing cargo fmt --all (or rustfmt on that file) to reorder and wrap the use
statements to match the project's formatting, then re-commit and push the
updated file so CI passes.
In `@crates/noxa-rag/src/pipeline/parse/binary.rs`:
- Around line 167-183: The current read-and-check block (using remaining,
per_entry_cap, read_cap, xml_buf, .take(read_cap).read_to_string(...),
read_bytes and odt_pptx_measured_total) is correct functionally but needs two
fixes: 1) run cargo fmt --all to satisfy rustfmt so the (&mut
entry).take(read_cap).read_to_string(...) chain is formatted on one line; 2)
improve the RagError::Parse message emitted in the if read_bytes > per_entry_cap
branch to distinguish a per-entry limit hit from a cumulative-cap exhaustion
(i.e., when remaining==0 / per_entry_cap==0) — update the error text inside the
RagError::Parse formatting to mention either "exceeds per-entry decompression
limit" or "cumulative decompression budget exhausted" depending on per_entry_cap
== 0 (or include both pieces of context), then run tests and commit.
In `@crates/noxa-rag/src/pipeline/parse/tests.rs`:
- Around line 142-147: The CI failure is due to rustfmt reflowing short
function-call expressions (e.g., FileMetadata::from_result_and_provenance and
build_point_payload) and the sample_file_metadata helper; fix it by running
`cargo fmt --all` locally and committing the reformatted changes, ensuring the
call sites (FileMetadata::from_result_and_provenance(...) and
build_point_payload(...)) and the sample_file_metadata helper are formatted
according to rustfmt so the function-call wrapping matches CI expectations.
- Around line 416-430: The test parse_feed_rejects_entity_expansion currently
only checks result.is_err(); change it to mirror the XML test by unwrapping the
error (let err = result.unwrap_err().to_string()) and assert the error message
mentions DOCTYPE/ENTITY/entity expansion (e.g., assert!(err.contains("DOCTYPE")
|| err.contains("ENTITY") || err.contains("entity expansion"))), so that the
failure is specifically due to the entity-expansion guard in parse_file rather
than any other parse error.
In `@crates/noxa-rag/src/pipeline/scan.rs`:
- Around line 549-568: The at-limit test is currently asserting is_some() but
only passes because serde_json fails on trailing NULs and the code falls back to
mtime/size; change the test (the test covering the at-limit case that calls
startup_scan_key) to assert the returned Option is Some and that the string
begins with "mtime:" (e.g. unwrap() and starts_with("mtime:")) to explicitly
verify the mtime/size branch; locate the test that calls startup_scan_key at the
size limit and replace the loose is_some() check with this stronger assertion.
- Around line 580-584: Rustfmt is breaking the long chained call to
OpenOptions::new().write(true).open(&file); wrap the chained OpenOptions
invocation (the OpenOptions::new().write(true).open(&file) expression used to
create f) in parentheses or assign it to a temporary variable so the formatter
can break lines cleanly, then call set_len(LIMIT) on the resulting File; ensure
you still use std::fs::OpenOptions::new, .write(true), .open(&file) and
.set_len(LIMIT) (and keep the existing expect("open")/expect("set_len")
behavior).
---
Outside diff comments:
In `@crates/noxa-rag/src/embed/tei.rs`:
- Around line 266-275: The non-retryable error path still reads and includes a
512‑char TEI response body preview in RagError::Embed, leaking sensitive TEI
content when an auth token is configured; change the branch that constructs the
error so it does NOT read/attach the response body if the request used an auth
token (e.g., check the same auth_token flag/variable used earlier), and instead
produce a safe generic message (or omit the preview) in RagError::Embed; retain
the body preview only when no auth token is present so behavior matches the
earlier retry-path omission.
- Around line 332-352: The intermediate Vec batch_ranges is an unnecessary
allocation—remove the batch_ranges binding and pass the iterator directly to
futures::stream::iter by using (0..total_batches).map(|i| { let start = i *
BATCH_SIZE; let end = (start + BATCH_SIZE).min(texts.len()); (i, start, end) })
so the stream maps over those tuples and calls
self.embed_batch_adaptive(&texts[start..end], batch_idx, total_batches).await,
keeping the existing
buffered(EMBED_PIPELINE_DEPTH).collect().await.into_iter().collect::<Result<_,
_>>()? plumbing intact.
In `@crates/noxa-rag/src/mcp_bridge/mod.rs`:
- Around line 106-117: The method declaration uses the redundant visibility
modifier `pub(self)` on async fn call_data; remove `pub(self)` so the signature
becomes async fn call_data(&self, source: McpSource, action: &str, params:
Value) -> Result<Value, RagError> and keep the body unchanged; this addresses
the clippy::needless_pub_self lint while preserving usage from submodule callers
and retaining the call to self.executor.call(...) and
extract_mcporter_data(raw).
In `@crates/noxa-rag/src/pipeline/parse/text.rs`:
- Around line 171-199: The function parse_xml_file has formatting that fails
cargo fmt (signature/body wrapping around Result<ParsedFile, RagError> and the
Err(RagError::Parse(...)) block); run `cargo fmt --all` to reformat the file,
ensure the parse_xml_file signature and its Err return are wrapped/indented per
rustfmt rules, then add and commit the formatted changes so CI passes (locate
parse_xml_file and the Err(RagError::Parse(...)) return to verify formatting).
In `@crates/noxa-rag/src/pipeline/scan.rs`:
- Around line 123-184: At the call site of spawn_startup_scan in runtime.rs
(where you pass pipeline.config.pipeline.max_file_size_bytes into
startup_scan_key), add an explicit comment stating that max_json_bytes is
intentionally wired to pipeline.config.pipeline.max_file_size_bytes to keep it
in sync with the process_job guard and startup_scan_key behavior; optionally,
add a defensive debug_assert inside spawn_startup_scan (or at its caller) that
the passed max_json_bytes is non-zero (or matches expected invariants) to catch
accidental config drift at runtime.
In `@crates/noxa-rag/src/types.rs`:
- Around line 213-215: Remove the extra blank line between the end of the
SearchMetadataFilter type declaration (the closing "}" for SearchMetadataFilter)
and the subsequent #[cfg(test)] module; ensure there is only a single newline
separating them so the file matches cargo fmt formatting (i.e., place
#[cfg(test)] immediately after one blank line following the SearchMetadataFilter
closing brace).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3fd83159-dbcf-4405-adbb-2b03c2b2a28a
📒 Files selected for processing (24)
- crates/noxa-rag/src/chunker.rs
- crates/noxa-rag/src/embed/tei.rs
- crates/noxa-rag/src/error.rs
- crates/noxa-rag/src/mcp_bridge/bytestash.rs
- crates/noxa-rag/src/mcp_bridge/executor.rs
- crates/noxa-rag/src/mcp_bridge/io.rs
- crates/noxa-rag/src/mcp_bridge/linkding.rs
- crates/noxa-rag/src/mcp_bridge/memos.rs
- crates/noxa-rag/src/mcp_bridge/mod.rs
- crates/noxa-rag/src/mcp_bridge/paperless.rs
- crates/noxa-rag/src/pipeline.rs
- crates/noxa-rag/src/pipeline/heartbeat.rs
- crates/noxa-rag/src/pipeline/parse/binary.rs
- crates/noxa-rag/src/pipeline/parse/mod.rs
- crates/noxa-rag/src/pipeline/parse/rich.rs
- crates/noxa-rag/src/pipeline/parse/tests.rs
- crates/noxa-rag/src/pipeline/parse/text.rs
- crates/noxa-rag/src/pipeline/process.rs
- crates/noxa-rag/src/pipeline/runtime.rs
- crates/noxa-rag/src/pipeline/scan.rs
- crates/noxa-rag/src/pipeline/startup_scan.rs
- crates/noxa-rag/src/pipeline/worker.rs
- crates/noxa-rag/src/store/qdrant/vector_store.rs
- crates/noxa-rag/src/types.rs
- executor.rs: add 30s timeout to mcporter subprocess; truncate stderr to 512 chars to prevent large/sensitive output leaking into error msgs
- io.rs: replace DefaultHasher (not stable across Rust versions) with xxh3_64 for deterministic filenames in stable_component_hash (see the sketch after this list)
- linkding.rs: simplify next-page termination check to map_or(true, is_null)
- binary.rs: separate per-entry vs cumulative error messages in ODT/PPTX decompression guard so the rejection reason is explicit
- parse/tests.rs: strengthen parse_feed_rejects_entity_expansion to assert the error message mentions DOCTYPE/ENTITY (mirrors XML test)
- chunker.rs: remove unused `use std::str::FromStr as _` import (CI warning)
- cargo fmt --all: fix import grouping in bytestash.rs, linkding.rs, memos.rs, parse/tests.rs; fix OpenOptions chain wrapping in scan.rs

No action taken on:
- io.rs:43 rename/remove_file: kept as-is; cross-platform safe regardless
- bytestash.rs:38 pagination: no official pagination API confirmed by reviewer
- scan.rs:568 boundary test: reviewer acknowledged implementation is correct

Resolves review threads PRRT_kwDOR_mP6c59zaLc, PRRT_kwDOR_mP6c59zaLq, PRRT_kwDOR_mP6c59zaLs, PRRT_kwDOR_mP6c59zaL3, PRRT_kwDOR_mP6c59zaL5, PRRT_kwDOR_mP6c59zaL8, PRRT_kwDOR_mP6c59zaMA, PRRT_kwDOR_mP6c59zaMJ, PRRT_kwDOR_mP6c59zaMM, PRRT_kwDOR_mP6c59zaMR, PRRT_kwDOR_mP6c59zaMT, PRRT_kwDOR_mP6c59zaMm, PRRT_kwDOR_mP6c59zaMq, PRRT_kwDOR_mP6c59zaMs, PRRT_kwDOR_mP6c59zaMz
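The deterministic-hash swap in isolation; a sketch assuming the xxhash-rust crate with the xxh3 feature enabled (the surrounding path logic in io.rs is omitted):

```rust
use xxhash_rust::xxh3::xxh3_64;

/// 16-hex-digit filename component for an external record ID.
/// xxh3_64 is stable across Rust releases, unlike std's DefaultHasher,
/// so the same external_id always maps to the same on-disk file.
fn stable_component_hash(external_id: &str) -> String {
    format!("{:016x}", xxh3_64(external_id.as_bytes()))
}
```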
Summary
Comprehensive review pass on the noxa-rag ingestion pipeline addressing security vulnerabilities, performance bottlenecks, architectural issues, and code quality findings from a 4-agent review (security-sentinel, performance-oracle, architecture-strategist, code-simplicity-reviewer). 21 commits across 24 files. 92 tests passing (up from 81).
Security
- fix(noxa-5gf): ODT/PPTX decompression bomb — replace bare `read_to_string` with `.take(cap)` measured guard in `parse_office_zip_file`; matches the existing DOCX two-pass pattern
- fix(noxa-3b8): Add 50 MiB size guard to `startup_scan_key` before JSON deserialization (previously unbounded on startup)
- fix(noxa-qkg): Guard XML/OPML/RSS/Atom parsers against `<!DOCTYPE`/`<!ENTITY` expansion attacks (billion laughs)
- fix(noxa-byr): Omit TEI response body from retry logs when `auth_token` is configured

Performance
- fix(noxa-gs8): Eliminate double BPE tokenization in `chunk()` — `token_estimate` now uses the `word_count` approximation; the splitter already runs the tokenizer for boundary placement
- fix(noxa-bkq): Change Qdrant upsert to `wait=true` to guarantee segment visibility before the `delete_stale_by_url` filter runs
- perf(noxa-c28): Eliminate redundant chunk text clone in `TeiProvider::embed` — stream index-range tuples instead of materializing `Vec<String>` batches
- perf(noxa-rso): Check `git_branch_cache` before dispatching `spawn_blocking` — ancestor walk avoids a blocking task on cache hits
- perf(noxa-5tl): Batch `startup_scan` `spawn_blocking` calls (256 paths/batch vs. one per file)
- fix(noxa-u90): Split the `parse_ms` timer into `io_ms` + `parse_ms` — heartbeat now correctly measures parse CPU time separately from file I/O

Architecture
- chore(noxa-bd4): Delete dead `IngestionContext` struct (zero call sites, duplicated fields from `PointPayload`/`SearchResult`)
- refactor(noxa-4jt): Split 1429-line `mcp_bridge.rs` into a per-platform module directory (7 files, each <200 lines)
- refactor(noxa-26r): Wrap `RagConfig` in `Arc` for `WorkerContext` — consistent with all other shared fields
- refactor(noxa-qgq): Remove `OnceLock<watch_roots>` from `Pipeline` — pass canonicalized roots directly to `spawn_workers`, eliminating the runtime panic invariant
- refactor(noxa-mqm): Add `record_parse_failure()` to `SessionCounters`, remove direct atomic field access from `process.rs`
- refactor(noxa-3g7): Add `RagError::UrlValidation` and `WorkerPanic` variants; increment `files_failed` on path confinement violations (previously silent)
- refactor(noxa-ngd): Normalize `parse_html_file` to a sync fn dispatched via the `spawn_blocking_parse` helper (consistent with all other CPU-bound parsers)
- refactor(noxa-udb): Move the `FormatProvenance` match into an `apply()` method, eliminating 14 `mut` locals in `build_point_payload`
- perf(noxa-346): Extract a `FileMetadata` struct built once per document — eliminates N-per-chunk clones of title, author, git_branch, content_hash, etc.

Docs
- docs(noxa-dkl): Document the `is_indexable` symlink TOCTOU window and defense-in-depth design

Simplification (follow-up pass)
- refactor: Promote `word_count` to `pub(crate)`, replacing 8 inline `split_whitespace().count()` calls across parse files; replace a boilerplate test helper with `make_text_result`; remove redundant WHAT comments; eliminate a double-collect in `startup_scan`

Test plan
- `cargo test -p noxa-rag` — 92 tests pass
- `cargo clippy --workspace` — no new warnings
- Heartbeat and per-job logs emit the new `io_ms` and `parse_ms` fields

Summary by cubic
Hardens and speeds up the noxa-rag ingestion pipeline: blocks zip/XML bombs, reduces CPU/I/O overhead, and simplifies the worker/parse architecture for stability and clearer metrics. Adds deterministic mcp filenames and safer mcporter execution (30s timeout, truncated stderr).

Bug Fixes

- Startup scan JSON is size-capped (`pipeline.max_file_size_bytes`) before the `startup_scan_key` JSON parse.
- Qdrant upserts use `wait=true` to avoid races; TEI retries omit bodies when `auth_token` is set; Windows-safe, idempotent mcp writes; deterministic mcp filenames via stable hashing; the mcporter subprocess now has a 30s timeout and truncates stderr; remove the stray `mcp_bridge.rs` to fix a duplicate module.
- New `RagError::UrlValidation` and `WorkerPanic` variants; confinement rejections counted in `files_failed`; heartbeat now reports `io_ms` and `parse_ms`.

Refactors

- Remove dead `IngestionContext`; split `mcp_bridge` into focused modules; wrap config in `Arc`; remove the `OnceLock` for watch roots (pass them to workers); move provenance writes into `FormatProvenance::apply()`.
- Token estimate via word count (`word_count`); batch the startup scan; check the git-branch cache before spawning; stream ranges in TEI embed; extract `FileMetadata` to cut per-chunk clones.

Written for commit b8bab43. Summary will update on new commits. Review in cubic
Summary by CodeRabbit
Release Notes
New Features
Security
Performance