Wire authserver DCR resolver and add structured logs#5044
Wire authserver DCR resolver and add structured logs#5044tgrunnagle wants to merge 7 commits intomainfrom
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
4e90171 to
4e41c9b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5044 +/- ##
==========================================
+ Coverage 67.50% 67.56% +0.05%
==========================================
Files 601 601
Lines 61055 61262 +207
==========================================
+ Hits 41218 41391 +173
- Misses 16721 16745 +24
- Partials 3116 3126 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9b66570 to
fc135b4
Compare
4e41c9b to
4ee2863
Compare
fc135b4 to
3478e8c
Compare
4ee2863 to
283489f
Compare
3478e8c to
cc43b14
Compare
283489f to
7ebe96e
Compare
cc43b14 to
5ad7473
Compare
7ebe96e to
cbfb884
Compare
cbfb884 to
063e820
Compare
5ad7473 to
80d5cbe
Compare
Closes the 14 review threads from review #4199898531 that are scoped to this PR; #3 and #5 are intentionally deferred to PR #5044 which is the PR that consumes the ClientSecret plumbing and RedirectURI propagation. Behaviour fixes: - #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now preserve the issuer's path component, so multi-tenant upstreams that ship DCR without advertising it (or this auth server when it itself has a tenant prefix in its issuer) keep their tenant prefix in the synthesised URL and the defaulted redirect URI. - #7 resolveDCRCredentials wraps registration in a package-level singleflight.Group keyed by DCRKey so concurrent callers for the same upstream coalesce into a single RFC 7591 registration. The leader's closure rechecks the cache before any network I/O so followers that arrive just after a Put returns observe the fresh entry. - #9 applyResolution writes ClientID only when the run-config copy leaves it empty, treating it symmetrically with the endpoint writes. This is defence-in-depth for callers that bypass the resolver's validateResolveInputs precondition check. - #10 the cfg.RegistrationEndpoint short-circuit branch in resolveDCREndpoints validates URL well-formedness and HTTPS-or- loopback locally, so a non-HTTPS endpoint fails before performRegistration constructs a bearer-token transport for it. - #13 endpointsFromMetadata validates discovered authorization_endpoint and token_endpoint before populating dcrEndpoints, so a self- consistent metadata document with http:// URLs cannot flow through to the auth-code or token-exchange path. Doc / scope: - #6 trust-assumption note added on DCRUpstreamConfig with a link to the follow-up SSRF-hardening issue (#5135) per the "Document Architectural Constraints" rule. - #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what the in-memory store does and does NOT solve (cross-replica sharing, durability, cross-process write coordination). Test hygiene: - #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that added no meaningful coverage. - #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in TestFetchAuthorizationServerMetadataFromURL, matching the sibling pattern in dcr_test.go. - #12 new failingDCRStore test double + two cases asserting the cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache put") that operators see when a future Redis backend errors. - #14 renamed the TestNeedsDCR case from "client_id with dcr (invalid but returns false)" to "client_id wins over dcr_config (defensive AND semantic)" so the case label describes the intentional defensive behaviour rather than reading like a bug report. - #15 removed a //nolint:lll directive that was suppressing a non- existent violation (the line is 108 chars; the limit is 130). - #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out 16 goroutines doing alternating Put/Get against overlapping and disjoint keys, with a fail-fast deadline, so go test -race actually exercises the sync.RWMutex guard. New tests cover each behaviour change: - TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers pins exactly-one-registration under N concurrent callers. - TestSynthesiseRegistrationEndpoint_PreservesIssuerPath / TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path- preservation fix. - TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers the symmetric ClientID write. - TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers the local registration-endpoint guard. - TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers the metadata authz/token endpoint guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 14 review threads from review #4199898531 that are scoped to this PR; #3 and #5 are intentionally deferred to PR #5044 which is the PR that consumes the ClientSecret plumbing and RedirectURI propagation. Behaviour fixes: - #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now preserve the issuer's path component, so multi-tenant upstreams that ship DCR without advertising it (or this auth server when it itself has a tenant prefix in its issuer) keep their tenant prefix in the synthesised URL and the defaulted redirect URI. - #7 resolveDCRCredentials wraps registration in a package-level singleflight.Group keyed by DCRKey so concurrent callers for the same upstream coalesce into a single RFC 7591 registration. The leader's closure rechecks the cache before any network I/O so followers that arrive just after a Put returns observe the fresh entry. - #9 applyResolution writes ClientID only when the run-config copy leaves it empty, treating it symmetrically with the endpoint writes. This is defence-in-depth for callers that bypass the resolver's validateResolveInputs precondition check. - #10 the cfg.RegistrationEndpoint short-circuit branch in resolveDCREndpoints validates URL well-formedness and HTTPS-or- loopback locally, so a non-HTTPS endpoint fails before performRegistration constructs a bearer-token transport for it. - #13 endpointsFromMetadata validates discovered authorization_endpoint and token_endpoint before populating dcrEndpoints, so a self- consistent metadata document with http:// URLs cannot flow through to the auth-code or token-exchange path. Doc / scope: - #6 trust-assumption note added on DCRUpstreamConfig with a link to the follow-up SSRF-hardening issue (#5135) per the "Document Architectural Constraints" rule. - #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what the in-memory store does and does NOT solve (cross-replica sharing, durability, cross-process write coordination). Test hygiene: - #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that added no meaningful coverage. - #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in TestFetchAuthorizationServerMetadataFromURL, matching the sibling pattern in dcr_test.go. - #12 new failingDCRStore test double + two cases asserting the cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache put") that operators see when a future Redis backend errors. - #14 renamed the TestNeedsDCR case from "client_id with dcr (invalid but returns false)" to "client_id wins over dcr_config (defensive AND semantic)" so the case label describes the intentional defensive behaviour rather than reading like a bug report. - #15 removed a //nolint:lll directive that was suppressing a non- existent violation (the line is 108 chars; the limit is 130). - #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out 16 goroutines doing alternating Put/Get against overlapping and disjoint keys, with a fail-fast deadline, so go test -race actually exercises the sync.RWMutex guard. New tests cover each behaviour change: - TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers pins exactly-one-registration under N concurrent callers. - TestSynthesiseRegistrationEndpoint_PreservesIssuerPath / TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path- preservation fix. - TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers the symmetric ClientID write. - TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers the local registration-endpoint guard. - TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers the metadata authz/token endpoint guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tgrunnagle
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: security, concurrency, error-handling, tests, architecture, go-conventions, general-quality, review-friendliness. Codex cross-review skipped (codex CLI not installed).
Reviewer note: Posting as COMMENT because the authenticated reviewer is the PR author and the PR is a draft. The recommendation underneath is REQUEST CHANGES — there are three correctness/security issues (F3, F1, F2 below) that should land before this PR is review-ready.
Consensus Summary (29 retained findings)
| # | Finding | Score | Severity | Action |
|---|---|---|---|---|
| F3 | remoteAuthLogContext misses CIMD precedence; doc claim is wrong |
10/10 | HIGH | Fix |
| F1 | sanitizeErrorForLog does not strip URL userinfo |
9/10 | HIGH | Fix |
| F2 | Upstream /register error body is io.ReadAll-ed unbounded then logged |
8/10 | HIGH | Fix or follow-up |
| F6 | Handler.issuer() will panic on nil config — validation belongs in NewHandler |
9/10 | LOW | Fix |
| F7 | DCR success log promoted Debug→Info contradicts silent-success-at-INFO+ rule |
9/10 | LOW | Fix or amend rule |
| F4 | Singleflight panic recovery double-logs (in-defer Error + boundary Error) | 8/10 | MEDIUM | Fix |
| F8 | Permanent-4xx Warn has no test asserting fire/contents | 8/10 | MEDIUM | Add test |
| F9 | Cache-hit dcr_age_days Debug + stale-threshold Warn unexercised |
8/10 | MEDIUM | Add test |
| F10 | TestBuildUpstreamConfigs_DCR lacks failure-path subtest |
8/10 | MEDIUM | Add test |
| F11 | applyResolution is non-idempotent state transition; rename it |
8/10 | LOW | Rename |
| F12 | applyResolution + applyResolutionToOAuth2Config paired-call invariant unenforced |
8/10 | MEDIUM | Collapse or document |
| F13 | DCRStepError, LogDCRStepError, newDCRStepError exported but no external consumer |
8/10 | MEDIUM | Lowercase |
| F23 | trimURLTrailingPunctuation uses ContainsRune on byte input |
8/10 | LOW | IndexByte instead |
| F19 | Sanitizer doesn't strip URL fragments | 7/10 | MEDIUM | One-line fix |
| F5 | Permanent-4xx Warn risk of log-spam from Token() entry |
7/10 | MEDIUM | Hoist out of classifier |
| F14 | dcr.go is 1,186 lines; sanitizer is independent of DCR |
7/10 | MEDIUM | Split file |
| F15 | New private helpers placed above the // Private helpers divider |
7/10 | MEDIUM | Reorder |
| F16 | PR exceeds 400-LOC / 10-file limit; lacks Large PR Justification |
7/10 | MEDIUM | Add section |
| F18 | Sanitizer regex case-sensitive (HTTPS:// escapes) |
7/10 | LOW | (?i) flag |
| F20 | LogDCRStepError(nil) emits misleading record |
7/10 | LOW | Nil guard |
| F21 | dcrStore lifetime / interface lacks Close |
7/10 | LOW | Document |
| F22 | Process-wide dcrFlight vs per-instance dcrStore undocumented |
7/10 | LOW | Document |
| F24 | Production comments cite .claude/rules/... paths |
7/10 | LOW | Inline principle |
| F25 | Production comment names a specific test | 7/10 | LOW | Drop ref |
| F26 | TestRemoteAuthLogContext doesn't cover CIMD path (ties to F3) |
8/10 | MEDIUM | Add case w/ F3 |
| F27 | TestHandler_issuer doesn't assert nil-config invariant |
7/10 | LOW | Add assertion |
| F28 | Mock-AS request count uses Greater(0) not exact |
7/10 | LOW | Tighten |
| F29 | Panic-recovery test doesn't errors.As(*DCRStepError) |
7/10 | LOW | Augment |
| F30 | Step-wrap tests use substring instead of errors.As |
7/10 | LOW | Augment |
Overall
The PR's structural shape is sound. The wiring of resolveDCRCredentials into EmbeddedAuthServer.NewEmbeddedAuthServer correctly observes the copy-before-mutate rule (deep-copying both the outer UpstreamRunConfig and the inner *OAuth2UpstreamRunConfig before passing to the resolver), and TestBuildUpstreamConfigs_DCR + TestNewEmbeddedAuthServer_DCRBoot pin those guarantees explicitly via assert.Same on the caller's pointer. The DCRStepError step taxonomy + boundary LogDCRStepError is a genuinely good observability pattern for a multi-step resolver. The new sanitizer's trailing-punctuation handling is well-tested with regression cases.
However, three correctness/security items should land before merge:
- F3 (HIGH, score 10/10) —
remoteAuthLogContextclaims to mirror the precedence ofpkg/auth/remote/handler.go::resolveClientCredentials, but the actual precedence in that function is CIMD > DCR > static, while the new helper handles only DCR > static. Workloads using cached CIMD client_ids will see a stale or emptyclient_idin the new DCR remediation Warn — defeating the operator-correlation the field exists for. Either fix the precedence or rewrite the comment. - F1 (HIGH, score 9/10) —
sanitizeErrorForLogonly clearsu.RawQuery. URLs with embedded credentials (https://user:pass@host) have their userinfo preserved; URLs with fragments (e.g., implicit-flow callbacks#access_token=...) keep them too (F19). The function is documented as defense-in-depth, so the gap is precisely what the function exists to prevent. Two-line fix in the same block:u.User = nil; u.Fragment = "". - F2 (HIGH, score 8/10) —
pkg/oauthproto/dcr.go:307reads the upstream's error body viaio.ReadAll(resp.Body)with noLimitReader(the success path correctly caps at 1 MiB), then embeds the body verbatim infmt.Errorf("...status %d: %s", code, body). That error flows intoLogDCRStepError, which only sanitizes URL query strings. A misbehaving or compromised upstream can echo arbitrary content into operator logs unbounded. The bug is technically inoauthproto/dcr.go(outside this PR's diff), but this PR is what surfaces the risk operationally — worth either fixing in this PR or filing a follow-up that blocks merge.
Beyond the HIGH items, the test-coverage cluster (F8, F9, F10, F26, F27–F30) is the next-most-important bucket: the PR's central observability contributions (the 4xx Warn, the staleness Warn, the boundary Error log) are not asserted by any test. A regression that drops or changes those records would not fail CI. Adding a slog-handler-tee fixture and exercising the three branches would close the gap.
The architecture cluster (F11–F13) is polish but worth considering before merge: the API surface widening (exported DCRStepError, etc.) and the paired-helpers contract (applyResolution + applyResolutionToOAuth2Config) are both maintenance hazards if left as comment-enforced invariants.
F14–F16 are review-experience findings; the github-actions bot has already left a CHANGES_REQUESTED for the >1000-line size, and adding a Large PR Justification section addresses that mechanically.
Notes for items that don't have an inline comment
- F2 (unbounded upstream error body) — the affected file
pkg/oauthproto/dcr.gois not in this PR's diff. Fix atpkg/oauthproto/dcr.go:307: replaceio.ReadAll(resp.Body)withio.ReadAll(io.LimitReader(resp.Body, 8*1024))and consider parsing as RFC 7591 JSON instead of%s-interpolating the raw body. - F8/F9/F10/F26–F30 (test gaps) — see the per-finding lines in the consensus table; rather than 7 tiny inline comments, group these into one PR-level test plan: (1) capture slog records via a tee handler in monitored_token_source_test.go, (2) drive the cache-hit + stale-threshold path in dcr_test.go with a pre-populated old
CreatedAt, (3) add afailure-pathsubtest toTestBuildUpstreamConfigs_DCRwith a 500 mock, (4) add a CIMD case toTestRemoteAuthLogContext, (5) augment the existing wrapped-error tests witherrors.As(*DCRStepError)assertions. - F14 (dcr.go file size) — split
sanitizeErrorForLog,trimURLTrailingPunctuation,queryStrippingPattern, andTestSanitizeErrorForLoginto a newpkg/authserver/runner/logsanitize.goandlogsanitize_test.go. Optionally also splitDCRStepError+LogDCRStepError+ thedcrStep*constants intopkg/authserver/runner/dcr_errors.go. - F15 (file org) — move
newDCRStepError,sanitizeErrorForLog,trimURLTrailingPunctuation, andqueryStrippingPatternbelow the// ----- Private helpers -----divider indcr.go(line ~573). - F16 (PR meta) — add a
## Large PR Justificationsection to the PR description: ~620 of the 1133 additions are tests, the remaining ~510 across 7 production files implement one logical change, and the four sub-concerns (resolver wiring, step taxonomy, monitored-token-source remediation log, software_id validation) cannot be split without orphaned-code intermediate states. Alternatively, run/split-prto evaluate carving out the URL sanitizer + tests as a standalone PR. - F22 (
dcrFlightlifetime) — the var declaration is package-level pre-existing code (not in any hunk). Add a doc comment explaining the intentional process-wide lifetime, contrasting with the per-instancedcrStore. - F26 is structurally identical to F3; fixing F3 closes both.
Generated with Claude Code
A misbehaving or malicious authorization server could echo arbitrary content into the upstream's /register error response. handleHTTPResponse read that body via io.ReadAll with no LimitReader, then embedded it verbatim in the returned error — which downstream callers log. Cap the read at 8 KiB (far larger than any conformant RFC 7591 error response) so operator log volume cannot be inflated by a non-conformant upstream. Addresses #5044 review finding F2 (HIGH). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc comment claimed remoteAuthLogContext mirrored the precedence of remote.Handler.resolveClientCredentials, but the implementation skipped the CachedCIMDClientID check entirely. For CIMD-authenticated workloads the new DCR remediation Warn would have reported a stale or empty client_id rather than the CIMD URL actually being sent on token refresh, defeating the operator-correlation the field exists for. Restore the documented precedence (CachedCIMDClientID > CachedClientID > ClientID) and add a TestRemoteAuthLogContext case covering the CIMD-wins path. Addresses #5044 review findings F3 (HIGH) and F26 (MEDIUM, closed by the new test case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.
Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").
Addresses #5044 review finding F5 (MEDIUM).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related polish items in the authserver handlers package: NewHandler did not validate that AuthorizationServerConfig (or its embedded *fosite.Config) was non-nil. issuer() reached into the config at request time, so a misconfigured caller would panic deep inside an HTTP handler instead of failing at startup. Add the nil check to the constructor and simplify issuer() to rely on the constructor invariant. Pin the new invariant with TestNewHandler_ErrorsOnNilConfig. The /oauth/register success log was promoted to Info even though the operation is neither long-running nor exceptional. Demote back to Debug; an audit-log path is the right home for the audit signal if that becomes a requirement. Addresses #5044 review findings F6 and F7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles ten review-feedback items in pkg/authserver/runner; each addresses internal-API hygiene or doc-comment drift in the DCR resolver, no behaviour change for callers. Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now strips userinfo and fragment in addition to the query, since either can carry credentials or tokens (https://user:pass@host, implicit-flow #access_token=...). queryStrippingPattern matches http/https case-insensitively per RFC 3986 §3.1 so HTTPS://... cannot escape sanitisation. trimURLTrailingPunctuation switches to strings.IndexByte to match the ASCII-only terminator set without the rune-decoding overhead. Test cases added for each. Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery branch no longer emits a duplicate slog.Error; the captured stack travels with the wrapped *dcrStepError to the boundary log so a single panic produces a single record. DCRStepError / LogDCRStepError lowercased to dcrStepError / logDCRStepError since no caller lives outside the package. logDCRStepError now no-ops on nil err so the unknown-step branch cannot fire on a missing failure. Resolution helpers (F11, F12, F25): applyResolution renamed to consumeResolution to communicate that it is a one-shot state transition (clearing DCRConfig is unconditional), not an idempotent defaulting step. The applyResolutionToOAuth2Config doc now states the paired-call invariant explicitly without referencing a specific test. Lifecycle docs (F21, F22): the per-instance dcrStore vs. process-wide dcrFlight asymmetry is now stated on both sides, and EmbeddedAuthServer.Close documents the future-Close hook for a backend with handles. Inline rules-file rationale (F24): production comments no longer cite .claude/rules/... by path; the principle is inlined. Addresses #5044 review findings F1, F4, F11, F12, F13, F18, F19, F20, F21, F22, F23, F24, F25. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Body-only review items follow-up: F2 (HIGH) — Addressed in 4d504fe. F8 / F9 / F10 / F26-F30 (test-gap cluster) — Deferred. F26 was closed mechanically by the new CIMD case in Generated with Claude Code |
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
jhrozek
left a comment
There was a problem hiding this comment.
A few notes — none blocking.
Three follow-up items from review: isPermanentTokenEndpointError used to treat every non-5xx RetrieveError as permanent, which meant the DCR remediation Warn fired on transient back-pressure (408 Request Timeout, 429 Too Many Requests). Narrow the classifier to an explicit 400 / 401 / 403 allowlist so the "delete the cached credentials and restart" hint fires only on responses where stale cached client credentials are a plausible cause. Retry behaviour is unaffected — the retry decision is gated entirely by isTransientNetworkError, which already returns false for the whole RetrieveError branch. handleHTTPResponse in pkg/oauthproto/dcr.go now drains the response body after the LimitReader-bounded ReadAll. Without this, an upstream returning more than 8 KiB on /register error left the connection mid- stream when the deferred Close fired, preventing TCP connection reuse. Mirrors what the non-JSON content-type branch a few lines down already does. remoteAuthLogContext moved out of pkg/runner into pkg/auth/remote as *Config.LogContext(), collocating it with resolveClientCredentials so the cached-CIMD > cached-DCR > static precedence has one home. Adding a fourth cached field in future updates both call sites in one place. Tests moved to pkg/auth/remote/config_test.go. Addresses #5044 review #4212656411 comments 3174540081, 3174540082, and 3174540086. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
resolveDCRCredentialswas wired up and unit-tested in the prior sub-issue, butEmbeddedAuthServernever called it, so any upstream configured withDCRConfigstill booted without a client. This PR makes the resolver load-bearing and adds the operator-visible structured logs the Phase 2 spec calls for (cache-hit age, stale-registration warning, per-step errors, and a remediation hint when an upstream rejects a cached client as 4xx).EmbeddedAuthServernow owns an in-memoryDCRCredentialStoreand callsresolveDCRCredentials+consumeResolutionfor every OAuth2 upstream whose config requests DCR.buildPureOAuth2Configdeliberately keeps its pure, network-free signature; the DCR-resolvedClientSecretis layered on afterwards via a newapplyResolutionToOAuth2Confighelper so the two application sites stay side-by-side.RunConfigelement is shallow-copied and its OAuth2 sub-config is deep-copied before mutation, preserving the caller'sRunConfig.Upstreamsslice.resolveDCRCredentialsreturns a newdcrStepErrorwrapping the originating step (metadata discovery, DCR call, store read, store write) instead of logging-and-returning at each branch. The boundary caller (buildUpstreamConfigs) emits oneslog.ErrorvialogDCRStepError, so there is a single error line per failure. A panic inside the singleflight closure is converted to the samedcrStepErrorcarrying the captureddebug.Stack(), so the boundary log surfaces it without an in-defer duplicate.slog.Debugwithdcr_age_daysand additionallyslog.Warnwith remediation text when the cached registration exceeds the 90-daydcrStaleAgeThreshold./oauth/registerhandler success log emits atDebugwithissuer,client_id,software_id,token_endpoint_auth_method, andscopes— silent at INFO+ per the project's logging convention.SoftwareIDis threaded throughValidateDCRRequest, which now also caps it to 256 printable-ASCII characters.MonitoredTokenSourceacceptsupstreamandclient_idat construction (replacing the earlierSetUpstreamContextsetter, which had an unsynchronized writer). On the first transition toUnauthenticateddriven by a permanent token-endpoint 4xx, it emits a singleslog.Warnwith a DCR/CIMD remediation hint (gated bystopOnceso a tightToken()loop cannot spam it, and suppressed when noclient_idcontext is available). The classifier itself (isTransientNetworkError) is side-effect free.NewHandlernow fails loudly whenAuthorizationServerConfigor its embedded*fosite.Configis nil;issuer()no longer carries a runtime nil-guard.http(s)://case-insensitively per RFC 3986 §3.1, and trailing sentence punctuation around a URL is preserved so prose is not mangled.pkg/oauthproto/dcr.gohandleHTTPResponsecaps the upstream/registererror body at 8 KiB viaio.LimitReaderso a non-conformant or malicious upstream cannot inflate operator log volume.Closes #5039
Type of change
Test plan
task test)task lint-fix)New coverage added:
pkg/authserver/runner/embeddedauthserver_test.go— full DCR boot against a mock AS (metadata +/register), cache-hit short-circuit (zero additional HTTP requests on the secondbuildUpstreamConfigs), copy-before-mutate assertion on the caller'sRunConfig.Upstreamsslice, andTestNewEmbeddedAuthServer_DCRBootdriving the real constructor to assertdcrStoreis populated.pkg/authserver/runner/dcr_test.go— cache-hit Debug log withdcr_age_days, stale-age Warn above threshold, onedcrStepError-returning case per failure step, and the panic-recovery test now asserts the wrapped error is a*dcrStepError(dcrStepRegister, …, Stack: <captured>).pkg/authserver/server/handlers/handler_chain_test.go—TestHandler_issuercovers both populated and zero-valueAccessTokenIssuer.TestNewHandler_ErrorsOnNilConfig(inauthorize_test.go) pins the new constructor invariant.pkg/authserver/server/registration/dcr_test.go—software_idlength cap and non-printable-ASCII rejection.pkg/auth/monitored_token_source_test.go— permanent-4xx branch emits the remediation Warn withworkload/upstream/client_idpopulated via the new constructor parameters.pkg/runner/runner_test.go—TestRemoteAuthLogContextcovers the fullcached CIMD > cached DCR > staticprecedence.http.Client-style quoted URL, embeddeduser:pass@hostuserinfo (with and without query), implicit-flow#access_token=…fragments, and uppercaseHTTPS://schemes.Changes
pkg/authserver/runner/embeddedauthserver.godcrStorefield and wire resolver intobuildUpstreamConfigs; introduceapplyResolutionToOAuth2Config; document the per-instance store vs. process-wide singleflight asymmetry.pkg/authserver/runner/dcr.godcrStepError/logDCRStepError; emit cache-hit Debug, stale-age Warn, and per-step error context; carry panic stack on the wrapped error so a single boundary record surfaces it; URL-secret-stripping sanitizer (userinfo + query + fragment, case-insensitive scheme match).pkg/authserver/runner/dcr_store.gopkg/authserver/server/handlers/dcr.goDebug, drop redundantupstreamattribute, threadsoftware_idthrough.pkg/authserver/server/handlers/handler.goNewHandlervalidatesconfigandconfig.Config; remove the runtime nil-guard fromissuer().pkg/authserver/server/registration/dcr.gosoftware_idto 256 printable-ASCII characters; exportMaxSoftwareIDLength.pkg/auth/monitored_token_source.goupstream/client_idas constructor parameters; emit DCR/CIMD remediation Warn frommarkAsUnauthenticated(once per state transition, gated bystopOnce, suppressed whenclientID == ""). The classifier is side-effect free.pkg/runner/runner.goMonitoredTokenSourcefields fromRemoteAuthConfigwithcached CIMD > cached DCR > staticprecedence (mirrorsremote.Handler.resolveClientCredentials).pkg/oauthproto/dcr.go/registererror body at 8 KiB viaio.LimitReaderbefore embedding it in the returned error string.pkg/authserver/runner/embeddedauthserver_test.gopkg/authserver/runner/dcr_test.gopkg/authserver/server/handlers/handler_chain_test.go,authorize_test.goTestHandler_issuerandTestNewHandler_ErrorsOnNilConfig.pkg/authserver/server/handlers/dcr_test.go,pkg/authserver/server/registration/dcr_test.gosoftware_idvalidation.pkg/auth/monitored_token_source_test.go,pkg/runner/runner_test.goLarge PR Justification
The PR exceeds the 400-LOC / 10-file guideline (1342+/104- across 18 files), but the change cannot be split without leaving the codebase in an intermediate non-functional state:
*_test.go) — table-driven coverage for the resolver wiring, the new error taxonomy, the staleness + remediation logs, and the sanitizer.dcrStepErrortaxonomy + boundary logging, monitored-token-source remediation log, andsoftware_idvalidation. Splitting the resolver wiring without the matching error taxonomy would land a regression in observability; splitting the error taxonomy without the wiring would orphan the new types.Does this introduce a user-facing change?
Operators gain two new diagnostic signals on DCR-enabled upstreams:
Warnwhen a cached DCR registration is older than 90 days, hinting that the operator may want to rotate.Warnon the first transition toUnauthenticateddriven by a permanent token-endpoint 4xx, hinting that the workload's cached DCR or CIMD credentials may be stale and that deleting them and restarting will trigger re-registration. The Warn fires at most once per state transition and is suppressed entirely for workloads with noclient_idcontext.The
MonitoredTokenSourceconstructor signature has changed to requireupstreamandclient_id, but it is an internal API with no public surface, and the priorSetUpstreamContextsetter is removed.Successful DCR registrations log at
Debug, notInfo, in keeping with the project's silent-success-at-INFO+ convention.Special notes for reviewers
buildPureOAuth2Configis load-bearing: it takes nocontext.Contextand performs no network I/O. DCR resolution is deliberately sequenced around it, not inside it, so the pure config builder stays trivially testable. Please flag any future change that threadsctxinto it.dcrStepErrormoves the "one error line per failure" invariant from the resolver to the boundary caller. If a future caller forgets to invokelogDCRStepError, the error will still propagate normally but the structured log will be missing. Worth keeping in mind if we add more resolver call sites.consumeResolution(formerlyapplyResolution) is a one-shot state transition, not an idempotent default-fill: a second call after the first is a no-op only because the first clearedDCRConfig. The companionapplyResolutionToOAuth2Configwrites the inline-onlyClientSecretonto the built*upstream.OAuth2Config; any future output path fromOAuth2UpstreamRunConfigtoupstream.OAuth2Configmust call both helpers to get a fully-resolved DCR client.dcrFlightsingleflight is intentionally process-wide whileEmbeddedAuthServer.dcrStoreis per-instance — see the doc comments on each. The asymmetry is load-bearing for concurrentEmbeddedAuthServerinstances in the same process.Generated with Claude Code