StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. #43812
TAOXUY wants to merge 54 commits into envoyproxy:main
ggreenway
left a comment
I don't think your fix is quite right.
I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.
if (scope->evictable_) {
  MetricBag metrics(scope->scope_id_);
  CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
  auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
    return [&unused_metrics](std::pair<StatName, T> kv) {
      const auto& [name, metric] = kv;
      if (metric->used()) {
        metric->markUnused();
        return false;
      } else {
        unused_metrics.try_emplace(name, metric);
        return true;
      }
    };
  };
The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .
I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
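To make the concern about std::min concrete, here is a minimal sketch of how a clamped subtraction can avoid going negative and still report the wrong count. SketchGauge, subClamped, and replayEvictionWithClamp are hypothetical names for illustration, not Envoy's Stats::Gauge API, and eviction is modeled simply as replacing the gauge with a fresh zero-valued one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a gauge; this is not Envoy's Stats::Gauge.
struct SketchGauge {
  uint64_t value = 0;
  void add(uint64_t n) { value += n; }
  // Clamped subtraction, mirroring a std::min guard around sub().
  void subClamped(uint64_t n) { value -= std::min(n, value); }
};

// Replays: two requests start, the gauge is evicted and re-created at zero,
// a third request starts, then the first two requests finish.
// Returns the gauge reading while the third request is still in flight.
uint64_t replayEvictionWithClamp() {
  SketchGauge gauge;
  gauge.add(1);          // request A starts -> 1
  gauge.add(1);          // request B starts -> 2
  gauge = SketchGauge{}; // assumed eviction: a fresh zero-valued gauge
  gauge.add(1);          // request C starts -> 1
  gauge.subClamped(1);   // request A finishes -> 0
  gauge.subClamped(1);   // request B finishes -> clamped, stays 0
  return gauge.value;    // request C is still active, yet the gauge reads 0
}
```

Under these assumptions the gauge never underflows, but it reads 0 while one request is still active, which is the kind of silent miscount described above.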
/wait
When evicting unused stats from the central cache, we need to ensure that gauges actively referenced by components like AccessLogState are not evicted. The use_count() > 1 check is meant to guarantee this, but a previous bug in evictUnused, where the lambda parameter std::pair<StatName, T> kv was taken by value, artificially inflated the use_count because each call copied the shared pointer. This broke eviction entirely across the codebase. This commit fixes evictUnused by taking const auto& kv by reference, which avoids the copy and correctly applies the use_count() > 1 safeguard. Furthermore, AccessLogState now holds a GaugeSharedPtr in its State struct so that its active references prevent premature eviction by evictUnused. The erroneous std::min safeguard during gauge subtractions is also removed, as AccessLogState gauges will no longer be unfairly cleared. Signed-off-by: Xuyang Tao <taoxuy@google.com>
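The use_count inflation described in this commit message can be reproduced in isolation. The sketch below uses plain std::shared_ptr with hypothetical names (FakeGauge, Entry, useCountSeenByValue, useCountSeenByRef). It is not the evictUnused code itself, only a demonstration of the by-value versus by-reference difference.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; not Envoy types.
struct FakeGauge {
  long value = 0;
};
using FakeGaugeSharedPtr = std::shared_ptr<FakeGauge>;
using Entry = std::pair<std::string, FakeGaugeSharedPtr>;

// Taking the pair by value copies the shared_ptr, so the callee observes
// one more owner than actually exists outside the call.
long useCountSeenByValue(Entry kv) { return kv.second.use_count(); }

// Taking the pair by const reference copies nothing, so the callee
// observes the true number of owners.
long useCountSeenByRef(const Entry& kv) { return kv.second.use_count(); }
```

With a single owner, the by-value version reports 2 and the by-reference version reports 1, so a use_count() > 1 check inside a by-value callback would treat every entry as externally referenced.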
Updated with an interface to opt out of eviction per metric. We need to keep the gauge from being evicted from the scope so that it can be looked up and then incremented/decremented on the same gauge. @kyessenov
/retest
|
Here's an idea for another approach: add a new method to a scope to add a stat to the scope by its GaugeSharedPtr. Then, in the destructor of the FilterState, you can directly re-add the existing gauge into the scope without needing its name/tag components.
CMIIW, if gauge is evictable, it cannot be Imagine when a gauge is incremented and then evicted before decremented, there is another |
The central store is the That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and its current value from being removed from the
In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted.
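A minimal single-threaded sketch of that idea follows: a cache that evicts only entries nobody else references, so a holder that keeps a shared pointer continues to inc/dec the same object. SketchCache, getOrCreate, and evictUnreferenced are hypothetical names, and this is not Envoy's store or scope implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; not Envoy types.
struct SketchGauge {
  uint64_t value = 0;
};
using SketchGaugeSharedPtr = std::shared_ptr<SketchGauge>;

class SketchCache {
public:
  // Returns the cached gauge for `name`, creating a zero-valued one if absent.
  SketchGaugeSharedPtr getOrCreate(const std::string& name) {
    SketchGaugeSharedPtr& slot = gauges_[name];
    if (slot == nullptr) {
      slot = std::make_shared<SketchGauge>();
    }
    return slot;
  }

  // Evicts only entries whose sole owner is the cache itself.
  void evictUnreferenced() {
    for (auto it = gauges_.begin(); it != gauges_.end();) {
      if (it->second.use_count() == 1) {
        it = gauges_.erase(it);
      } else {
        ++it;
      }
    }
  }

  std::size_t size() const { return gauges_.size(); }

private:
  std::unordered_map<std::string, SketchGaugeSharedPtr> gauges_;
};
```

While a caller keeps the pointer returned by getOrCreate, evictUnreferenced leaves that entry in place, so a later lookup by name returns the same gauge with its current value. Once the caller drops the pointer, the next eviction pass removes the entry.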
/retest
ggreenway
left a comment
I like this solution; it's much cleaner and clearer.
@kyessenov can you also review this, especially the change to eviction logic?
Co-authored-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Xuyang Tao <taoxuy@google.com>
ggreenway
left a comment
format check (spelling) is failing, but other than that this LGTM.
@kyessenov do you want to take another look?
/retest
Generally LGTM, mostly style nits.
jmarantz
left a comment
flushing comments. Still looking.
jmarantz
left a comment
mostly superficial stuff remaining, except for hash validation.
// When the key actually needs to be safely persisted into the map, `makeOwned()`
// is explicitly called to allocate and copy the tags into `owned_tags_`.
struct GaugeKey {
  Stats::StatName stat_name_;
There was a problem hiding this comment.
Should makeOwned ensure that stat_name_ also has a local backing store? stat_name_ is like a string_view.
There was a problem hiding this comment.
I scanned this a bit more and I am not understanding why the backing-store management for stat_name differs from the backing-store management for tags.
I think the stat-name comes from some pre-existing gauge but I am having a hard time keeping this all in my head. Wherever you get the name from should have the tags; there should be no need to copy the tags if you are copying the name.
There was a problem hiding this comment.
Because stat_name_ always comes from the persistent static configuration of the StatsAccessLog (which is kept alive by std::shared_ptr in the AccessLogState filter state), whereas the tags are dynamic and computed per request.
Since the static name outlives the request, it is safe without local storage. The tags, however, are dynamic and need to be persisted if they are pushed to the background/delayed logging map.
There was a problem hiding this comment.
I can see you copying the vector of tag name/value pairs, but I do not see how the backing store of the tag names & values is managed. Copying the vector won't copy the backing store of the tag names & values.
I think that ultimately if you are creating a new gauge, the gauge should own the tag names/values that are held in the GaugeKey.
Actually I think maybe the best way to manage the backing store is to have an OptRef<Gauge> in the GaugeKey.
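The backing-store point can be illustrated with std::string_view standing in for StatName, which is an assumption made only for this sketch. Copying a vector of views copies pointers, not bytes, so an owning copy has to duplicate the bytes and re-point the views. OwnedTags and makeOwnedTags are hypothetical names and do not reflect how the PR implements makeOwned().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A tag as a pair of non-owning views (name, value).
using TagView = std::pair<std::string_view, std::string_view>;

// Owns the bytes and exposes views into its own storage.
struct OwnedTags {
  std::vector<std::string> storage; // backing bytes
  std::vector<TagView> views;       // views that point into `storage`

  OwnedTags() = default;
  // Copying would leave `views` pointing at the source's storage, so forbid it.
  OwnedTags(const OwnedTags&) = delete;
  OwnedTags& operator=(const OwnedTags&) = delete;
  OwnedTags(OwnedTags&&) = default;
  OwnedTags& operator=(OwnedTags&&) = default;
};

OwnedTags makeOwnedTags(const std::vector<TagView>& borrowed) {
  OwnedTags out;
  // Reserve first so the strings never relocate while views are being built.
  out.storage.reserve(borrowed.size() * 2);
  for (const TagView& tag : borrowed) {
    out.storage.emplace_back(tag.first);
    out.storage.emplace_back(tag.second);
  }
  for (std::size_t i = 0; i < borrowed.size(); ++i) {
    out.views.emplace_back(out.storage[2 * i], out.storage[2 * i + 1]);
  }
  return out;
}
```

The owned views stay valid after the original strings change or go away, whereas a plain copy of the borrowed vector would still point at the original bytes.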
There was a problem hiding this comment.
One example: two such stats access loggers define the same gauge. The time sequence is:
- one access log adds 1 to the gauge
- the other access log sets the gauge to 0
- eviction happens
- the first access log now subtracts 1
There was a problem hiding this comment.
Sorry, let me ask this a different way.
If you have a loop like
for (int i = 0; i < a billion; ++i) {
  allocate gauge and store in map
  evict gauge
}
what happens to the map?
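The question can be explored with a toy model. The sketch below is not the PR's map and makes no claim about which behavior the PR has. It only contrasts the two possible outcomes, using a hypothetical SketchMap keyed by name and a hypothetical churn helper in which eviction is modeled as dropping the only strong reference.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; not Envoy types.
struct SketchGauge {};
using SketchMap = std::unordered_map<std::string, std::weak_ptr<SketchGauge>>;

// Allocates `n` distinct gauges, records each in `map`, then "evicts" it by
// dropping the only strong reference. When `erase_on_evict` is false the map
// keeps one stale entry per evicted gauge. Returns the final map size.
std::size_t churn(SketchMap& map, std::size_t n, bool erase_on_evict) {
  for (std::size_t i = 0; i < n; ++i) {
    const std::string key = "gauge." + std::to_string(i);
    std::shared_ptr<SketchGauge> gauge = std::make_shared<SketchGauge>();
    map[key] = gauge; // store in map
    gauge.reset();    // evict: the gauge object is destroyed here
    if (erase_on_evict) {
      map.erase(key);
    }
  }
  return map.size();
}
```

In this model, skipping the erase leaves the map with one expired entry per iteration, while erasing on eviction keeps it empty.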
There was a problem hiding this comment.
Shared the benchmarking offline
There was a problem hiding this comment.
The benchmark test added for AccessLogState
// When the key actually needs to be safely persisted into the map, `makeOwned()`
// is explicitly called to allocate and copy the tags into `owned_tags_`.
struct GaugeKey {
  Stats::StatName stat_name_;
There was a problem hiding this comment.
Good point, and the simplicity of the solution has appeal. But when evicting, should we remove the map entry? Would that make sense?
If we hang onto stale entries doesn't that defeat the purpose of minimizing memory we are holding?
Speaking of which, can you use the memory-test framework to test memory savings with this?
Also that is an answer to the question: "what do we do with the memory in 'owned' mode", and there's still the question of using StatNameJoiner for constructing the key for a hash lookup.
How did you do that benchmark? It is weird that's an image on aws.
Do you have code for the experiment?
…On Mon, Mar 23, 2026 at 2:05 PM Xuyang Tao ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In source/extensions/access_loggers/stats/stats.h
<#43812 (comment)>:
> @@ -14,8 +15,57 @@ namespace Extensions {
namespace AccessLoggers {
namespace StatsAccessLog {
-class StatsAccessLog : public AccessLoggers::Common::ImplBase {
+// GaugeKey serves as a lock-free map key composed of exactly the configuration
+// properties that define a fully resolved gauge metric.
+//
+// It preserves the raw components (base name + tags) allowing us to safely
+// re-create the gauge from the scope if it gets evicted while the request is in-flight.
+//
+// To avoid heap-allocating a new std::vector on every map lookup (which happens
+// on every single gauge increment/decrement), this key acts as a lightweight
+// zero-allocation "view" using `borrowed_tags_` during map lookups.
+// When the key actually needs to be safely persisted into the map, `makeOwned()`
+// is explicitly called to allocate and copy the tags into `owned_tags_`.
Did some benchmarking, and it looks like the existing version using GaugeKey is
more performant.
image.png (view on web)
<https://github.com/user-attachments/assets/f5b536b0-172a-4053-a111-a0ddca5f3e3f>
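The borrowed-versus-owned key pattern described in the GaugeKey comment quoted above can be sketched roughly as follows. SketchKey, SketchKeyHash, SketchKeyEq, and findOrInsert are hypothetical names, std::string_view stands in for Stats::StatName, and an int stands in for the stored gauge state. The name is assumed to outlive the map, as the thread says of the statically configured stat name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

using Tag = std::pair<std::string, std::string>;
using Tags = std::vector<Tag>;

// A key that either borrows the caller's tags (lookup) or owns a copy (storage).
struct SketchKey {
  std::string_view name;          // assumed to outlive the map
  const Tags* borrowed = nullptr; // set only for short-lived lookup keys
  Tags owned;                     // filled only by makeOwned()

  const Tags& tags() const { return borrowed != nullptr ? *borrowed : owned; }
  SketchKey makeOwned() const { return SketchKey{name, nullptr, tags()}; }
};

struct SketchKeyHash {
  std::size_t operator()(const SketchKey& k) const {
    std::size_t h = std::hash<std::string_view>{}(k.name);
    for (const Tag& t : k.tags()) {
      h = h * 31 + std::hash<std::string>{}(t.first);
      h = h * 31 + std::hash<std::string>{}(t.second);
    }
    return h;
  }
};

struct SketchKeyEq {
  bool operator()(const SketchKey& a, const SketchKey& b) const {
    return a.name == b.name && a.tags() == b.tags();
  }
};

using GaugeMap = std::unordered_map<SketchKey, int, SketchKeyHash, SketchKeyEq>;

// Looks up with a borrowed key (no tag copy); copies tags only on first insert.
int& findOrInsert(GaugeMap& map, std::string_view name, const Tags& request_tags) {
  SketchKey view{name, &request_tags, {}};
  auto it = map.find(view);
  if (it == map.end()) {
    it = map.emplace(view.makeOwned(), 0).first;
  }
  return it->second;
}
```

Because hashing and equality go through tags(), a borrowed key and an owned key with the same contents compare equal, so the common lookup path avoids copying the tag vector and only the first insertion pays for the copy.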
Signed-off-by: Xuyang Tao <taoxuy@google.com>
…update stats speed benchmark Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.
The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a gauge underflow.
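The failure mode in this description can be modeled with a toy store keyed by (base name, tags). SketchStore, resolve, and wouldUnderflow are hypothetical names, and the mangled name used in the usage note is invented for illustration. The real output of tagExtractedStatName() is not shown in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Tags = std::vector<std::pair<std::string, std::string>>;
using Identity = std::pair<std::string, Tags>; // base name + tags

// Hypothetical store; resolving an unknown identity creates a zero-valued gauge.
struct SketchStore {
  std::map<Identity, uint64_t> gauges;

  uint64_t& resolve(const std::string& name, const Tags& tags) {
    return gauges[Identity{name, tags}];
  }
};

// True when subtracting `amount` from an unsigned gauge value would wrap.
bool wouldUnderflow(uint64_t current, uint64_t amount) { return amount > current; }
```

If a request increments the gauge resolved as ("active", {route=a}) and later re-resolves it under a different identity, for example a made-up mangled name with empty tags, the store hands back a second gauge at 0. Subtracting 1 from that gauge is the underflow, while the original gauge keeps its stale value of 1.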
Fix: keep the gauge in the scope cache while its value is non-zero.
Risk Level: Low
Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still in flight. Verified that the gauge successfully decrements to 0 in the central store, identically to a normal request finish.
Docs: NA
Release: NA
Platform Specific Features: no
Benchmark result on single AddSubtract