feat: add per-request logs for proxy and MITM traffic #98
Conversation
Capture proxy and transparent-MITM requests into a per-vault request
log, surfaced through a new Logs tab in the dashboard and a
`GET /v1/vaults/{name}/logs` API. The MITM and explicit-proxy paths
share a single `requestlog.Sink`, so both ingress paths feed the same
pipeline with consistent fields (actor, matched service, credential
keys injected, status, latency).
- Store: new `request_logs` table + migration 039, CRUD in sqlite.go,
methods added to the Store interface.
- Pipeline: `internal/requestlog` package with Sink + SQLite backend
and a background retention worker that trims by age and per-vault
row cap.
- Server: `AttachLogSink` wires the sink at startup; the proxy and
MITM handlers fan events into it on every request.
- API: `GET /v1/vaults/{name}/logs` returns paginated logs for a vault
(vault-member auth); skill docs (CLI + HTTP) updated to match.
- UI: Logs tab under each vault with filtering and latency/status
rendering; AllAgentsTab updated to route through the new layout.
- Ops: new env vars + CLI flags for retention knobs documented in
.env.example, docs/reference/cli.mdx, and
docs/self-hosting/environment-variables.mdx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every entry in the `where` slice is a static predicate string; all user input flows through `args` as `?` placeholders. Matches the existing `#nosec G202` pattern at sqlite.go:2684 for the same safe-concat shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```go
		Path:           r.Path,
		MatchedService: r.MatchedService,
		CredentialKeys: r.CredentialKeys,
		Status:         r.Status,
		LatencyMs:      r.LatencyMs,
		ErrorCode:      r.ErrorCode,
		ActorType:      r.ActorType,
		ActorID:        r.ActorID,
		CreatedAt:      r.CreatedAt.Format(time.RFC3339),
	}
}
```

```go
	// When the caller is tailing (passed after=) and no new rows exist, we
	// echo back their cursor so the next poll stays on the same high-water
	// mark without a separate round-trip to learn "still nothing new".
	if latestID == 0 {
		latestID = opts.After
	}
```

```go
	// next_cursor is the id to pass as `before` on the next page going back
	// in time. Nil when we don't have a full page (end of history) or when
	// the caller is tailing forward (`after` set).
	var nextCursor any
	if opts.After == 0 && len(rows) == limit {
		nextCursor = rows[len(rows)-1].ID
	}

	jsonOK(w, map[string]any{
		"logs":        items,
		"next_cursor": nextCursor,
```
🔴 The live-tail polling query always uses ORDER BY id DESC LIMIT even when an after= cursor is set, so during any burst inserting more than the page limit (default 50) rows between polls, only the 50 newest are returned and latestID is set from rows[0].ID (the highest). The frontend advances latestIdRef.current to that value, permanently skipping the older burst rows — they remain in the DB but never appear in live tail. Fix by using ORDER BY id ASC LIMIT for after= queries and reading latestID from the last row.
Extended reasoning...
Root cause: In internal/store/sqlite.go, ListRequestLogs always appends ORDER BY id DESC LIMIT ? to the query regardless of whether opts.After is set. For the tailing case the intent is to fetch rows newer than the cursor, but descending order means the LIMIT is applied starting from the newest matching row.
Concrete code path: During a tailing poll the frontend calls GET /v1/vaults/{name}/logs?after=100&limit=50. SQLite executes WHERE id > 100 ORDER BY id DESC LIMIT 50. If 200 new rows arrived (IDs 101-300), the query returns IDs 300, 299, ..., 251. In handle_logs.go lines 92-93, latestID is set from rows[0].ID which is 300. The response carries latest_id: 300.
Why existing code does not prevent this: The frontend guard in LogsView.tsx line 123 (if data.latest_id > latestIdRef.current) prevents rollback but unconditionally accepts a forward jump. After receiving latest_id: 300 it updates latestIdRef.current to 300 and the next poll sends after=300. Records 101-250 are now permanently unreachable from live tail - no catch-up or gap-detection mechanism exists.
Step-by-step proof:
- State: latestIdRef.current = 100; 200 new rows inserted (IDs 101-300) during the 3s poll interval.
- Poll fires: GET ...?after=100&limit=50 -> SQL: WHERE id > 100 ORDER BY id DESC LIMIT 50 -> returns rows 300, 299, ..., 251.
- latestID = rows[0].ID = 300 (handle_logs.go line 92).
- Response: {logs: [300..251], latest_id: 300}.
- Frontend prepends rows 300-251; guard advances cursor to 300.
- Next poll: after=300 - rows 101-250 are permanently skipped from live tail.
Impact: Under any traffic rate exceeding limit/poll_interval (default ~16 req/s), some fraction of live-tailed logs will silently vanish from the UI on every burst. Records remain in the DB and are accessible via load-more (historical paging with before=), but the real-time tailing view is lossy.
Correct fix: When opts.After > 0, use ORDER BY id ASC LIMIT ? so the cursor advances incrementally through the burst. In handle_logs.go, read latestID from rows[len(rows)-1].ID (last/highest in ASC order) instead of rows[0].ID. A burst of 200 rows is then delivered in at most ceil(200/50)=4 consecutive polls with no gaps.
```go
func (s *Server) handleProxy(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	event := brokercore.ProxyEvent{
		Ingress: brokercore.IngressExplicit, // was: Ingress: "explicit",
		Method:  r.Method,
		Path:    r.URL.Path,
	}
	// Captured as the handler resolves the request; Emit hands off the
	// terminal snapshot to both the slog line and the log sink.
	var (
		logVaultID   string
		logActorType string
		logActorID   string
	)
	emit := func(status int, errCode string) {
		event.Emit(s.logger, start, status, errCode)
		if s.logSink != nil {
			s.logSink.Record(r.Context(), requestlog.FromEvent(event, logVaultID, logActorType, logActorID))
		}
	}

	// 1. Parse target host and path from /proxy/{target_host}/{path...}
```
🔴 In handleProxy, logVaultID is declared as an empty string and only assigned after vault resolution, but emit() is called in five early-exit paths before that assignment, producing Record{VaultID: ""} entries. When SQLiteSink flushes its 250ms batch, InsertRequestLogs wraps all rows in a single transaction; the FK constraint vault_id REFERENCES vaults(id) with PRAGMA foreign_keys=ON rejects the empty-string row, the deferred tx.Rollback() fires, and every valid record that landed in the same batch window is silently dropped. Fix: guard the logSink.Record() call with logVaultID != "".
Extended reasoning...
Root cause — declaration vs. assignment ordering
In handleProxy (internal/server/handle_proxy.go), logVaultID is declared as var logVaultID string (zero value = "") at lines 118-122 of the new code, and is only assigned at line 170 after resolveVaultForSession returns successfully. The emit closure captures logVaultID by reference and calls s.logSink.Record(..., requestlog.FromEvent(event, logVaultID, ...)). Any emit() invocation that fires before line 170 therefore passes logVaultID = "".
Five early-exit paths trigger this
Looking at the added code in handleProxy:
- Line 133 - empty trimmed path - emit(400, bad_request)
- Line 140 - empty targetHost - emit(400, bad_request)
- Line 152 - invalid host - emit(400, bad_request)
- Line 162 - sess == nil - emit(403, forbidden)
- Line 167 - resolveVaultForSession error - emit(0, vault_error)
Each produces a requestlog.Record with VaultID = "" that is enqueued into the SQLiteSink channel.
Why the FK constraint fires
Migration 039 defines: vault_id TEXT NOT NULL REFERENCES vaults(id) ON DELETE CASCADE. The SQLite DSN is opened with _pragma=foreign_keys(on) (see internal/store/sqlite.go). An empty string is not NULL so it passes the NOT NULL check, but it violates the FK constraint because no vault with id = "" exists.
Why the entire batch is lost
InsertRequestLogs (internal/store/sqlite.go) begins a transaction, prepares a single INSERT statement, and loops over all rows. If any stmt.ExecContext returns a FK error, the function returns immediately, triggering the defer tx.Rollback(). The entire batch, including all valid records that happened to be flushed in the same 250ms window alongside the error-path record, is silently discarded with no retry.
Concrete proof-of-concept
Suppose at T=0ms a bad-URL request arrives (emit(400, bad_request) fires with logVaultID = ""). At T=10ms ten legitimate proxy requests from vault abc-123 succeed and their records land in the channel. At T=250ms the ticker fires, commit is called with a batch of 11 records. Row 1 has vault_id = "". SQLite rejects it. tx.Rollback() fires. All 11 rows are gone. The commit error is logged at ERROR level, but the valid records are permanently lost with no indication of which vaults were affected.
Fix
The simplest one-line fix is to guard the Record call inside emit:
```go
if s.logSink != nil && logVaultID != "" {
	s.logSink.Record(r.Context(), requestlog.FromEvent(event, logVaultID, logActorType, logActorID))
}
```
Error-path requests before vault resolution are inherently un-attributable to a vault, so skipping them is semantically correct. Alternatively, InsertRequestLogs could skip rows where vault_id == "" rather than failing the whole batch, which would add defence-in-depth but does not fix the root cause.
```tsx
    latestIdRef.current = 0;
    setRows([]);
    setNextCursor(null);
    setError("");
    setLoading(true);
    loadInitial();
    // eslint-disable-next-line react-hooks/exhaustive-deps
  }, [endpoint, filterQS]);

  // Poll for new rows (tailing).
  useEffect(() => {
    if (pollMs <= 0 || paused) return;
    const id = setInterval(() => {
      pollNew();
    }, pollMs);
    return () => clearInterval(id);
```
🔴 The pollNew() guard if (latestIdRef.current === 0) return permanently disables live tailing for any vault that has no logs when the Logs tab is opened. Because the server echoes latest_id: 0 for an empty vault and the client sets latestIdRef.current = data.latest_id || 0, the ref never advances past 0, so every subsequent interval tick returns immediately — the green "Live" badge pulses but no polls are ever sent. The same stuck state is reproduced whenever a filter change produces zero results, because the filter useEffect resets the ref to 0 before calling loadInitial(). Fix: replace the numeric guard with a separate boolean initializedRef that is set to true only after loadInitial() has completed.
Extended reasoning...
What the bug is and how it manifests
pollNew() in web/src/components/LogsView.tsx contains the guard if (latestIdRef.current === 0) return. The intent is to block polls before loadInitial() has run. However the ref is also the live-tail cursor: it is set from data.latest_id after each initial load. When a vault is empty, the server returns latest_id: 0, which causes the ref to remain 0 indefinitely. Every 3-second interval tick fires pollNew(), hits the guard, and exits without issuing any network request. The UI shows a pulsing green "Live" badge, misleadingly suggesting that tailing is active.
The specific code path that triggers it
- Component mounts; `latestIdRef.current = 0` (initial value, `useRef<number>(0)`).
- `loadInitial()` fetches `/v1/vaults/{name}/logs?limit=50` with no `after=` param, so `opts.After = 0` on the server.
- Server (`handle_logs.go` ~line 115): vault has no rows so `latestID` stays 0; the fallback `latestID = opts.After` also gives 0; response: `{"logs":[], "latest_id":0}`.
- Client sets `latestIdRef.current = data.latest_id || 0`, i.e. 0.
- `setInterval` fires every `pollMs` (default 3000 ms) and calls `pollNew()`.
- `pollNew()` ~line 117: `if (latestIdRef.current === 0) return` exits immediately every time. No HTTP request is issued.
No recovery path exists: unless the page is reloaded after at least one log row exists, the ref stays 0 permanently.
Why existing code does not prevent it
The guard was added to avoid sending poll requests before loadInitial() completes (which would fire after=0 queries that return the full page on every tick and prepend duplicate rows). The problem is that the guard conflates two distinct states: "initial load not yet done" (latestIdRef transiently 0) vs. "vault is empty" (latestIdRef permanently 0). Using the cursor value as a readiness sentinel is incorrect for any vault whose legitimate cursor stays 0.
The filter-reset reproducer
The filter useEffect explicitly resets latestIdRef.current = 0 before calling loadInitial(). If the new filter returns zero matching rows (e.g. switching to "Errors" on a vault with no errors), the cursor is reset to 0 and the same stuck state is reproduced — live tailing stops until the page is reloaded.
Step-by-step proof
- Open a brand-new vault that has received no proxied requests.
- Navigate to the Logs tab → `loadInitial()` returns `{logs:[], latest_id:0}` → `latestIdRef.current = 0`.
- Wait 3+ seconds and observe the browser network tab — no `/logs?after=0` request is ever issued.
- Send a proxied request through the vault (e.g. via `curl` through the proxy) → a row is inserted in the DB.
- The UI table remains empty because `pollNew()` is still short-circuiting.
- Reload the page → `loadInitial()` now returns `{logs:[{id:1,...}], latest_id:1}` → `latestIdRef.current = 1` → subsequent polls work correctly.
Impact
The live-tailing feature — a headline capability of this PR — is silently non-functional for the most common scenario: opening the Logs tab on any new or idle vault. Users see the green pulsing "Live" indicator and may incorrectly conclude there is simply no traffic.
How to fix
Introduce a separate `initializedRef = useRef(false)` boolean. Set it to `true` inside `loadInitial()` once the response is processed. Change the `pollNew()` guard to `if (!initializedRef.current) return`. Reset `initializedRef.current = false` in the filter `useEffect` alongside the existing resets. This correctly separates "not yet initialized" from "vault is empty with a valid cursor of 0".
Three bugs flagged on #98, all fixed:

1. Live-tail burst loss (sqlite.go, handle_logs.go): `ListRequestLogs` always used `ORDER BY id DESC LIMIT`, so a burst of more than page_size rows between polls would return the newest N and silently skip the older ones. Tail queries (`After > 0`) now use `ORDER BY id ASC`; the handler reverses before responding so the API contract stays newest-first. Regression test in sqlite_test.go pins the ordering.
2. FK-violation batch rollback (handle_proxy.go): Early-exit paths (missing host, invalid target, no session, vault_error) called `emit()` before `logVaultID` was set, producing `Record{VaultID: ""}` entries. `vault_id` has a FK constraint — the empty-string row rejects the whole 250ms SQLiteSink batch, discarding every valid record that rode with it. Guard the `Record()` call with `logVaultID != ""` so un-attributable errors stay in slog only.
3. Live-tail stuck on empty vault (LogsView.tsx): `pollNew`'s `if (latestIdRef.current === 0) return` conflated "not yet initialized" with "vault legitimately has cursor 0", so opening the Logs tab on an empty vault permanently disabled polling. The same stuck state reproduced after any filter change that returns zero rows. Replaced with a separate `initializedRef` boolean, cleared in the filter reset and set in `loadInitial` on success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three review findings addressed in f242fd1.

1. Live-tail burst loss (handle_logs.go:129 + sqlite.go)
2. FK-violation batch rollback (handle_proxy.go:129)
3. Live-tail stuck on empty vault (LogsView.tsx:85)

`go test ./...` and `golangci-lint run ./...` both green.
Resolved conflict in internal/server/server.go: #97 (merged) dropped the ipAuth wrapper from /v1/service-catalog, /v1/skills/{cli,http}, and /v1/mitm/ca.pem. Kept that change and preserved this branch's new /v1/vaults/{name}/logs route alongside it. Also applied De Morgan's to two QF1001 staticcheck findings in the new TestListRequestLogsTailOrdering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Capture proxy and transparent-MITM requests into a per-vault request log, surfaced through a new Logs tab and `GET /v1/vaults/{name}/logs`.
- The MITM and explicit-proxy paths share a single `requestlog.Sink`, so both ingress paths feed the same pipeline with consistent fields.

Changes
Store
- New `request_logs` table + two indexes.
- Methods added to the `Store` interface.

Pipeline
- `Sink` interface, SQLite-backed implementation, retention worker.
- `AttachLogSink` on the Server wires the sink at startup; the explicit-proxy and MITM handlers fan events into it.

API + docs
- `GET /v1/vaults/{name}/logs` (vault-member auth) returns paginated logs.

UI
- `AllAgentsTab` and `router.tsx` updated to integrate the new layout.

Test plan
- `go test ./...` green
- `/proxy/*` and MITM `HTTPS_PROXY` flows: confirm rows land in the Logs tab with correct actor / status / latency / credential-keys fields
- `GET /v1/vaults/{name}/logs`

Notes
- Based on `main`. Does not depend on #97 (fix: exempt loopback from TierAuth so local vault run doesn't get 429'd — rate-limit loopback fix) — they touch disjoint hunks.
- `make build` will re-embed `web/dist` into the binary.

🤖 Generated with Claude Code