Skip to content

spec: add §9 session-level aggregation extension (Closes #3)#4

Open
nanookclaw wants to merge 1 commit intowillau95:mainfrom
nanookclaw:feat/session-level-aggregation
Open

spec: add §9 session-level aggregation extension (Closes #3)#4
nanookclaw wants to merge 1 commit intowillau95:mainfrom
nanookclaw:feat/session-level-aggregation

Conversation

@nanookclaw
Copy link
Copy Markdown

Summary

Implements the §9 Session-Level Aggregation spec section proposed in #3.

What's Added

§9 Session-Level Aggregation (ECP-SPEC.md)

A new optional session_summary record type that aggregates per-action behavioral flags into a session-level view:

{
  "ecp": "1.0",
  "id": "sess_<hex32>",
  "type": "session_summary",
  "agent": "did:ecp:...",
  "session_start": 1741766400000,
  "session_end": 1741769000000,
  "record_count": 47,
  "flag_totals": { "retried": 3, "incomplete": 1, "hedged": 12, "error": 0, ... },
  "delivery_score": 0.894,
  "calibration_flag_rate": 0.255,
  "prev_session": "sess_..."
}

Two derived metrics:

  • delivery_score = 1 - (retried + incomplete + error) / record_count — fraction of actions completed without failure
  • calibration_flag_rate = hedged / record_count — hedging frequency (explicitly marked as neutral/context-dependent)

Cross-session drift via prev_session chaining:
Walk the chain to compute slope of delivery_score across sessions. O(n) where n = session count, versus O(total records) for raw rescanning. Makes gradual behavioral regression visible even when no single session crosses a flag threshold.

Design Decisions

  • Strictly additive: zero changes to existing record format. Readers MUST NOT reject batches containing session_summary records.
  • All fields derivable from §3 flags: no new data sources, no LLM-as-Judge, no self-reporting — consistent with ECP's core design principle.
  • Minimal formula: delivery_score uses only the three clearly-negative flags (retried, incomplete, error). hedging is kept separate as a neutral signal to avoid encoding assumptions about what hedging rate is "good".
  • Server semantics: §9.4 specifies that servers MUST NOT use summary records to override individual record flags — summaries are derived, not authoritative.

What This Enables

An operator can answer "is this agent's delivery rate declining over 30 days?" by walking a 30-node chain of session summaries rather than rescanning potentially thousands of raw records.

Closes #3

Add optional session_summary record type enabling:
- Per-session delivery_score: 1 - (retried + incomplete + error) / record_count
- calibration_flag_rate: hedged / record_count (neutral context-dependent signal)
- prev_session chaining for O(n) cross-session drift detection

All derived metrics are computable from existing §3 behavioral flags with no
new data sources required. session_summary is strictly additive — existing
record format and validation rules are unchanged.

Motivation: per-action flags can't surface gradual behavioral regression across
hundreds of sessions. A linked chain of session summaries makes slope-based
drift detection possible without rescanning raw records.

Implementation notes in §9.4 cover server storage semantics and writer/reader
obligations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spec: add session-level behavioral aggregation — deriving trend scores from ECP flag chains

1 participant