feat: add per-request logs for proxy and MITM traffic#98

Merged
dangtony98 merged 4 commits into main from feat/request-logs on Apr 21, 2026
Conversation

@dangtony98
Contributor

Summary

  • New per-vault request log captures every proxy and transparent-MITM request (actor, matched service, credential keys injected, status, latency), surfaced via a Logs tab in the dashboard and GET /v1/vaults/{name}/logs.
  • MITM and explicit-proxy paths share a single requestlog.Sink, so both ingress paths feed the same pipeline with consistent fields.
  • Background retention worker trims by age and per-vault row cap; knobs exposed as env vars and CLI flags.

Changes

Store

  • New `request_logs` table + migration 039; CRUD in sqlite.go; methods added to the Store interface.

Pipeline

  • New internal/requestlog/ package: Sink interface, SQLite-backed implementation, retention worker.
  • AttachLogSink on the Server wires the sink at startup; the explicit-proxy and MITM handlers fan events into it.

API + docs

  • GET /v1/vaults/{name}/logs returns paginated logs for a vault (vault-member auth); skill docs (CLI + HTTP) updated to match.

UI

  • Logs tab under each vault with filtering and latency/status rendering; AllAgentsTab updated to route through the new layout.

Test plan

  • go test ./... green
  • Manual: start the broker, exercise both explicit /proxy/* and MITM HTTPS_PROXY flows, confirm rows land in the Logs tab with correct actor / status / latency / credential-keys fields
  • Manual: verify retention — set a low per-vault cap via env, generate > cap rows, confirm trim on next worker tick
  • Manual: verify auth — non-vault-member gets 403 on GET /v1/vaults/{name}/logs

Notes

🤖 Generated with Claude Code

Capture proxy and transparent-MITM requests into a per-vault request
log, surfaced through a new Logs tab in the dashboard and a
`GET /v1/vaults/{name}/logs` API. The MITM and explicit-proxy paths
share a single `requestlog.Sink`, so both ingress paths feed the same
pipeline with consistent fields (actor, matched service, credential
keys injected, status, latency).

- Store: new `request_logs` table + migration 039, CRUD in sqlite.go,
  methods added to the Store interface.
- Pipeline: `internal/requestlog` package with Sink + SQLite backend
  and a background retention worker that trims by age and per-vault
  row cap.
- Server: `AttachLogSink` wires the sink at startup; the proxy and
  MITM handlers fan events into it on every request.
- API: `GET /v1/vaults/{name}/logs` returns paginated logs for a vault
  (vault-member auth); skill docs (CLI + HTTP) updated to match.
- UI: Logs tab under each vault with filtering and latency/status
  rendering; AllAgentsTab updated to route through the new layout.
- Ops: new env vars + CLI flags for retention knobs documented in
  .env.example, docs/reference/cli.mdx, and
  docs/self-hosting/environment-variables.mdx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every entry in the `where` slice is a static predicate string; all
user input flows through `args` as ? placeholders. Matches the
existing `#nosec G202` pattern at sqlite.go:2684 for the same
safe-concat shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
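The safe-concat shape that commit refers to can be sketched as follows; `buildWhere` and its filter parameters are hypothetical stand-ins for the real query construction in sqlite.go:

```go
package main

import (
	"fmt"
	"strings"
)

// buildWhere illustrates the pattern: every element of the where slice
// is a static predicate string literal, and every user-supplied value
// travels through args as a ? placeholder, so joining the predicates
// never splices user input into the SQL text.
func buildWhere(vaultID string, status int, after int64) (string, []any) {
	where := []string{"vault_id = ?"}
	args := []any{vaultID}
	if status > 0 {
		where = append(where, "status = ?")
		args = append(args, status)
	}
	if after > 0 {
		where = append(where, "id > ?")
		args = append(args, after)
	}
	return "WHERE " + strings.Join(where, " AND "), args
}

func main() {
	q, args := buildWhere("vault-1", 403, 0)
	fmt.Println(q)    // WHERE vault_id = ? AND status = ?
	fmt.Println(args) // [vault-1 403]
}
```

Because the concatenated text is built only from compile-time constants, a `#nosec G202` annotation on the concat is defensible.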
Comment on lines +100 to +129
		Path:           r.Path,
		MatchedService: r.MatchedService,
		CredentialKeys: r.CredentialKeys,
		Status:         r.Status,
		LatencyMs:      r.LatencyMs,
		ErrorCode:      r.ErrorCode,
		ActorType:      r.ActorType,
		ActorID:        r.ActorID,
		CreatedAt:      r.CreatedAt.Format(time.RFC3339),
	}
}

// When the caller is tailing (passed after=) and no new rows exist, we
// echo back their cursor so the next poll stays on the same high-water
// mark without a separate round-trip to learn "still nothing new".
if latestID == 0 {
	latestID = opts.After
}

// next_cursor is the id to pass as `before` on the next page going back
// in time. Nil when we don't have a full page (end of history) or when
// the caller is tailing forward (`after` set).
var nextCursor any
if opts.After == 0 && len(rows) == limit {
	nextCursor = rows[len(rows)-1].ID
}

jsonOK(w, map[string]any{
	"logs":        items,
	"next_cursor": nextCursor,


🔴 The live-tail polling query always uses ORDER BY id DESC LIMIT even when an after= cursor is set, so during any burst inserting more than the page limit (default 50) rows between polls, only the 50 newest are returned and latestID is set from rows[0].ID (the highest). The frontend advances latestIdRef.current to that value, permanently skipping the older burst rows — they remain in the DB but never appear in live tail. Fix by using ORDER BY id ASC LIMIT for after= queries and reading latestID from the last row.

Extended reasoning...

Root cause: In internal/store/sqlite.go, ListRequestLogs always appends ORDER BY id DESC LIMIT ? to the query regardless of whether opts.After is set. For the tailing case the intent is to fetch rows newer than the cursor, but descending order means the LIMIT is applied starting from the newest matching row.

Concrete code path: During a tailing poll the frontend calls GET /v1/vaults/{name}/logs?after=100&limit=50. SQLite executes WHERE id > 100 ORDER BY id DESC LIMIT 50. If 200 new rows arrived (IDs 101-300), the query returns IDs 300, 299, ..., 251. In handle_logs.go lines 92-93, latestID is set from rows[0].ID which is 300. The response carries latest_id: 300.

Why existing code does not prevent this: The frontend guard in LogsView.tsx line 123 (if data.latest_id > latestIdRef.current) prevents rollback but unconditionally accepts a forward jump. After receiving latest_id: 300 it updates latestIdRef.current to 300 and the next poll sends after=300. Records 101-250 are now permanently unreachable from live tail - no catch-up or gap-detection mechanism exists.

Step-by-step proof:

  1. State: latestIdRef.current = 100; 200 new rows inserted (IDs 101-300) during the 3s poll interval.
  2. Poll fires: GET ...?after=100&limit=50 -> SQL: WHERE id > 100 ORDER BY id DESC LIMIT 50 -> returns rows 300, 299, ..., 251.
  3. latestID = rows[0].ID = 300 (handle_logs.go line 92).
  4. Response: {logs: [300..251], latest_id: 300}.
  5. Frontend prepends rows 300-251; guard advances cursor to 300.
  6. Next poll: after=300 - rows 101-250 are permanently skipped from live tail.

Impact: Under any traffic rate exceeding limit/poll_interval (default ~16 req/s), some fraction of live-tailed logs will silently vanish from the UI on every burst. Records remain in the DB and are accessible via load-more (historical paging with before=), but the real-time tailing view is lossy.

Correct fix: When opts.After > 0, use ORDER BY id ASC LIMIT ? so the cursor advances incrementally through the burst. In handle_logs.go, read latestID from rows[len(rows)-1].ID (last/highest in ASC order) instead of rows[0].ID. A burst of 200 rows is then delivered in at most ceil(200/50)=4 consecutive polls with no gaps.
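The proposed fix could look like the sketch below; `buildListQuery`, its column list, and `reverseNewestFirst` are illustrative names, not the project's actual code:

```go
package main

import "fmt"

// buildListQuery sketches the corrected ordering: historical paging
// (before=) and the first page stay newest-first, while tail queries
// (after=) switch to ascending order so a burst larger than the page
// limit is consumed oldest-first over consecutive polls, with no gaps.
func buildListQuery(vaultID string, after, before int64, limit int) (string, []any) {
	q := "SELECT id, method, path, status, latency_ms FROM request_logs WHERE vault_id = ?"
	args := []any{vaultID}
	switch {
	case after > 0:
		q += " AND id > ? ORDER BY id ASC LIMIT ?"
		args = append(args, after, limit)
	case before > 0:
		q += " AND id < ? ORDER BY id DESC LIMIT ?"
		args = append(args, before, limit)
	default:
		q += " ORDER BY id DESC LIMIT ?"
		args = append(args, limit)
	}
	return q, args
}

// reverseNewestFirst restores the newest-first API contract after an
// ascending tail query; the caller then reads latestID from what is
// now the first element (the highest id).
func reverseNewestFirst(ids []int64) {
	for i, j := 0, len(ids)-1; i < j; i, j = i+1, j-1 {
		ids[i], ids[j] = ids[j], ids[i]
	}
}

func main() {
	q, _ := buildListQuery("vault-1", 100, 0, 50)
	fmt.Println(q)
	ids := []int64{101, 102, 103} // ASC rows from one tail poll
	reverseNewestFirst(ids)
	fmt.Println(ids) // [103 102 101]
}
```

With ascending tail order, a 200-row burst drains in ceil(200/50) = 4 polls while the cursor advances through every id.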

Comment on lines 108 to 129
func (s *Server) handleProxy(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	event := brokercore.ProxyEvent{
		Ingress: brokercore.IngressExplicit,
		Method:  r.Method,
		Path:    r.URL.Path,
	}
	// Captured as the handler resolves the request; Emit hands off the
	// terminal snapshot to both the slog line and the log sink.
	var (
		logVaultID   string
		logActorType string
		logActorID   string
	)
	emit := func(status int, errCode string) {
		event.Emit(s.logger, start, status, errCode)
		if s.logSink != nil {
			s.logSink.Record(r.Context(), requestlog.FromEvent(event, logVaultID, logActorType, logActorID))
		}
	}

	// 1. Parse target host and path from /proxy/{target_host}/{path...}


🔴 In handleProxy, logVaultID is declared as an empty string and only assigned after vault resolution, but emit() is called in five early-exit paths before that assignment, producing Record{VaultID: ""} entries. When SQLiteSink flushes its 250ms batch, InsertRequestLogs wraps all rows in a single transaction; the FK constraint vault_id REFERENCES vaults(id) with PRAGMA foreign_keys=ON rejects the empty-string row, the deferred tx.Rollback() fires, and every valid record that landed in the same batch window is silently dropped. Fix: guard the logSink.Record() call with logVaultID != "".

Extended reasoning...

Root cause — declaration vs. assignment ordering

In handleProxy (internal/server/handle_proxy.go), logVaultID is declared as var logVaultID string (zero value = "") at lines 118-122 of the new code, and is only assigned at line 170 after resolveVaultForSession returns successfully. The emit closure captures logVaultID by reference and calls s.logSink.Record(..., requestlog.FromEvent(event, logVaultID, ...)). Any emit() invocation that fires before line 170 therefore passes logVaultID = "".

Five early-exit paths trigger this

Looking at the added code in handleProxy:

  1. Line 133 - empty trimmed path - emit(400, bad_request)
  2. Line 140 - empty targetHost - emit(400, bad_request)
  3. Line 152 - invalid host - emit(400, bad_request)
  4. Line 162 - sess == nil - emit(403, forbidden)
  5. Line 167 - resolveVaultForSession error - emit(0, vault_error)

Each produces a requestlog.Record with VaultID = "" that is enqueued into the SQLiteSink channel.

Why the FK constraint fires

Migration 039 defines: vault_id TEXT NOT NULL REFERENCES vaults(id) ON DELETE CASCADE. The SQLite DSN is opened with _pragma=foreign_keys(on) (see internal/store/sqlite.go). An empty string is not NULL so it passes the NOT NULL check, but it violates the FK constraint because no vault with id = "" exists.

Why the entire batch is lost

InsertRequestLogs (internal/store/sqlite.go) begins a transaction, prepares a single INSERT statement, and loops over all rows. If any stmt.ExecContext returns a FK error, the function returns immediately, triggering the defer tx.Rollback(). The entire batch, including all valid records that happened to be flushed in the same 250ms window alongside the error-path record, is silently discarded with no retry.

Concrete proof-of-concept

Suppose at T=0ms a bad-URL request arrives (emit(400, bad_request) fires with logVaultID = ""). At T=10ms ten legitimate proxy requests from vault abc-123 succeed and their records land in the channel. At T=250ms the ticker fires, commit is called with a batch of 11 records. Row 1 has vault_id = "". SQLite rejects it. tx.Rollback() fires. All 11 rows are gone. The commit error is logged at ERROR level, but the valid records are permanently lost with no indication of which vaults were affected.

Fix

The simplest one-line fix is to guard the Record call inside emit:

if s.logSink != nil && logVaultID != "" {
	s.logSink.Record(r.Context(), requestlog.FromEvent(event, logVaultID, logActorType, logActorID))
}

Error-path requests before vault resolution are inherently un-attributable to a vault, so skipping them is semantically correct. Alternatively, InsertRequestLogs could skip rows where vault_id == "" rather than failing the whole batch, which would add defence-in-depth but does not fix the root cause.
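The all-or-nothing failure mode is easy to see in miniature. The sketch below shows the defence-in-depth variant mentioned above, dropping unattributable rows before the batch reaches the single INSERT transaction; `record` and `filterAttributable` are hypothetical names:

```go
package main

import "fmt"

type record struct {
	VaultID string
	Method  string
	Status  int
}

// filterAttributable drops rows with an empty vault_id before the
// batch reaches the all-or-nothing INSERT transaction. Because every
// row in the flush shares one transaction, a single FK violation
// (vault_id = "" matches no vaults row) would roll back every record
// that landed in the same 250ms batch window.
func filterAttributable(batch []record) []record {
	out := make([]record, 0, len(batch))
	for _, r := range batch {
		if r.VaultID != "" {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	batch := []record{
		{VaultID: "", Method: "GET", Status: 400}, // pre-resolution early exit
		{VaultID: "abc-123", Method: "POST", Status: 200},
	}
	fmt.Println(len(filterAttributable(batch))) // 1
}
```

Guarding at the emit site remains the root-cause fix; this filter only protects the batch if an empty-vault row ever slips through again.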

Comment on lines +70 to +85
	latestIdRef.current = 0;
	setRows([]);
	setNextCursor(null);
	setError("");
	setLoading(true);
	loadInitial();
	// eslint-disable-next-line react-hooks/exhaustive-deps
}, [endpoint, filterQS]);

// Poll for new rows (tailing).
useEffect(() => {
	if (pollMs <= 0 || paused) return;
	const id = setInterval(() => {
		pollNew();
	}, pollMs);
	return () => clearInterval(id);


🔴 The pollNew() guard if (latestIdRef.current === 0) return permanently disables live tailing for any vault that has no logs when the Logs tab is opened. Because the server echoes latest_id: 0 for an empty vault and the client sets latestIdRef.current = data.latest_id || 0, the ref never advances past 0, so every subsequent interval tick returns immediately — the green "Live" badge pulses but no polls are ever sent. The same stuck state is reproduced whenever a filter change produces zero results, because the filter useEffect resets the ref to 0 before calling loadInitial(). Fix: replace the numeric guard with a separate boolean initializedRef that is set to true only after loadInitial() has completed.

Extended reasoning...

What the bug is and how it manifests

pollNew() in web/src/components/LogsView.tsx contains the guard if (latestIdRef.current === 0) return. The intent is to block polls before loadInitial() has run. However the ref is also the live-tail cursor: it is set from data.latest_id after each initial load. When a vault is empty, the server returns latest_id: 0, which causes the ref to remain 0 indefinitely. Every 3-second interval tick fires pollNew(), hits the guard, and exits without issuing any network request. The UI shows a pulsing green "Live" badge, misleadingly suggesting that tailing is active.

The specific code path that triggers it

  1. Component mounts; latestIdRef.current = 0 (initial value, useRef<number>(0)).
  2. loadInitial() fetches /v1/vaults/{name}/logs?limit=50 with no after= param, so opts.After = 0 on the server.
  3. Server (handle_logs.go ~line 115): vault has no rows so latestID stays 0; fallback latestID = opts.After also gives 0; response: {"logs":[], "latest_id":0}.
  4. Client sets latestIdRef.current = data.latest_id || 0 = 0.
  5. setInterval fires every pollMs (default 3000 ms) and calls pollNew().
  6. pollNew() ~line 117: if (latestIdRef.current === 0) return exits immediately every time. No HTTP request is issued.

No recovery path exists: unless the page is reloaded after at least one log row exists, the ref stays 0 permanently.

Why existing code does not prevent it

The guard was added to avoid sending poll requests before loadInitial() completes (which would fire after=0 queries that return the full page on every tick and prepend duplicate rows). The problem is that the guard conflates two distinct states: "initial load not yet done" (latestIdRef transiently 0) vs. "vault is empty" (latestIdRef permanently 0). Using the cursor value as a readiness sentinel is incorrect for any vault whose legitimate cursor stays 0.

The filter-reset reproducer

The filter useEffect explicitly resets latestIdRef.current = 0 before calling loadInitial(). If the new filter returns zero matching rows (e.g. switching to "Errors" on a vault with no errors), the cursor is reset to 0 and the same stuck state is reproduced — live tailing stops until the page is reloaded.

Step-by-step proof

  1. Open a brand-new vault that has received no proxied requests.
  2. Navigate to the Logs tab → loadInitial() returns {logs:[], latest_id:0} → latestIdRef.current = 0.
  3. Wait 3+ seconds and observe the browser network tab — no /logs?after=0 request is ever issued.
  4. Send a proxied request through the vault (e.g. via curl through the proxy) → a row is inserted in the DB.
  5. The UI table remains empty because pollNew() is still short-circuiting.
  6. Reload the page → loadInitial() now returns {logs:[{id:1,...}], latest_id:1} → latestIdRef.current = 1 → subsequent polls work correctly.

Impact

The live-tailing feature — a headline capability of this PR — is silently non-functional for the most common scenario: opening the Logs tab on any new or idle vault. Users see the green pulsing "Live" indicator and may incorrectly conclude there is simply no traffic.

How to fix

Introduce a separate initializedRef = useRef(false) boolean. Set it to true inside loadInitial() once the response is processed. Change the pollNew() guard to if (!initializedRef.current) return. Reset initializedRef.current = false in the filter useEffect alongside the existing resets. This correctly separates "not yet initialized" from "vault is empty with a valid cursor of 0".

Three bugs flagged on #98, all fixed:

1. Live-tail burst loss (sqlite.go, handle_logs.go):
   ListRequestLogs always used ORDER BY id DESC LIMIT, so a burst of
   >page_size rows between polls would return the newest N and silently
   skip the older ones. Tail queries (After > 0) now use ORDER BY id
   ASC; handler reverses before responding so the API contract stays
   newest-first. Regression test in sqlite_test.go pins the ordering.

2. FK-violation batch rollback (handle_proxy.go):
   Early-exit paths (missing host, invalid target, no session,
   vault_error) called emit() before logVaultID was set, producing
   Record{VaultID: ""} entries. vault_id has a FK constraint — the
   empty-string row rejects the whole 250ms SQLiteSink batch,
   discarding every valid record that rode with it. Guard the
   Record() call with logVaultID != "" so un-attributable errors stay
   in slog only.

3. Live-tail stuck on empty vault (LogsView.tsx):
   pollNew's `if (latestIdRef.current === 0) return` conflated "not
   yet initialized" with "vault legitimately has cursor 0", so opening
   the Logs tab on an empty vault permanently disabled polling.
   Same stuck state reproduced after any filter change that returns
   zero rows. Replaced with a separate initializedRef boolean,
   cleared in the filter reset and set in loadInitial on success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dangtony98
Contributor Author

All three review findings addressed in f242fd1.

1. Live-tail burst loss (handle_logs.go:129 + sqlite.go)
Tail queries (`After > 0`) now use `ORDER BY id ASC LIMIT`, so a burst >page-size is consumed oldest-first and the cursor advances through all of it over consecutive polls with no gaps. Handler reverses the slice before responding so the newest-first API contract is preserved, and `latestID` is taken from the new highest id. Regression test in sqlite_test.go pins the invariant.

2. FK-violation batch rollback (handle_proxy.go:129)
Guarded `s.logSink.Record()` with `logVaultID != ""` so the five early-exit paths (empty path, missing host, invalid host, no session, vault_error) no longer enqueue unattributable rows that the FK constraint would reject and take down the whole 250ms batch. These events still go to slog — they just aren't persisted, which is semantically correct since they occur before vault resolution. The MITM forward path (forward.go:48) was already safe because `scope.VaultID` is resolved at CONNECT time.

3. Live-tail stuck on empty vault (LogsView.tsx:85)
Replaced the numeric `latestIdRef.current === 0` guard with a separate `initializedRef` boolean. It's reset alongside `latestIdRef` in the filter `useEffect` and set to `true` once `loadInitial` completes. Opening the Logs tab on an empty vault now polls correctly and picks up the first request without a reload; filter changes that produce zero results no longer permanently disable live tail.

`go test ./...` and `golangci-lint run ./...` both green.

Resolved conflict in internal/server/server.go: #97 (merged) dropped
the ipAuth wrapper from /v1/service-catalog, /v1/skills/{cli,http},
and /v1/mitm/ca.pem. Kept that change and preserved this branch's
new /v1/vaults/{name}/logs route alongside it.

Also applied De Morgan's to two QF1001 staticcheck findings in the
new TestListRequestLogsTailOrdering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dangtony98 dangtony98 merged commit df9a76d into main Apr 21, 2026
4 checks passed
@dangtony98 dangtony98 deleted the feat/request-logs branch April 21, 2026 04:56