
StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812

Open
TAOXUY wants to merge 54 commits into envoyproxy:main from TAOXUY:fixStatDestructor

Conversation

Contributor

@TAOXUY TAOXUY commented Mar 6, 2026

Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.

The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.
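The failure mode can be sketched with a toy scope. This is a minimal illustration, not Envoy code: `FakeScope`, the string-based key, and the example names are all assumptions. It shows only that a get-or-create lookup with a key that differs from the original yields a fresh 0-valued entry and leaves the incremented one untouched.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical key: (base name, serialized tags). Envoy's real keys are StatNames.
using Key = std::pair<std::string, std::string>;

// Toy scope with get-or-create semantics: an unknown key yields a new 0-valued gauge.
struct FakeScope {
  std::map<Key, std::uint64_t> gauges;
  std::uint64_t& gauge(const Key& key) { return gauges[key]; }
};

// Returns true when re-resolving with a reconstructed key lands on a different, empty gauge.
bool lookupCreatesFreshGauge() {
  FakeScope scope;
  const Key original{"active_requests", "header=foo"};
  scope.gauge(original) += 1; // request start

  // Assumed shape of the bad reconstruction: altered base name, no tags.
  const Key reconstructed{"active_requests.foo", ""};
  const std::uint64_t resolved = scope.gauge(reconstructed); // request end resolves this key

  assert(scope.gauges.size() == 2); // a second entry was created
  return resolved == 0 && scope.gauge(original) == 1;
}
```

Subtracting 1 from the freshly created entry is the step that would underflow; the entry that was actually incremented never gets decremented.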

Fix: keep the gauge in the scope cache for as long as its value is non-zero.

Risk Level: Low

Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.

Docs: NA

Release: NA

Platform Specific Features: no

Benchmark result on a single AddSubtract:

[image: benchmark result]

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
@TAOXUY TAOXUY changed the title StatsAccessLogger: StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions. Mar 6, 2026
@ggreenway ggreenway self-assigned this Mar 6, 2026
Member

@ggreenway ggreenway left a comment

I don't think your fix is quite right.

I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.

      if (scope->evictable_) {
        MetricBag metrics(scope->scope_id_);
        CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
        auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
          return [&unused_metrics](std::pair<StatName, T> kv) {
            const auto& [name, metric] = kv;
            if (metric->used()) {
              metric->markUnused();
              return false;
            } else {
              unused_metrics.try_emplace(name, metric);
              return true;
            }
          };
        };

The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .

I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
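To illustrate how the eviction pass and the `sub()` precondition can collide, here is a toy model. `FlaggedGauge` and `evictionPass` are hypothetical names, and the assumption that `add()` sets the used flag is mine; only the `used() || amount == 0` condition and the mark-then-evict shape come from the code quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical gauge carrying a "used" flag, loosely modeled on the quoted snippet.
struct FlaggedGauge {
  std::uint64_t value = 0;
  bool used = false;

  void add(std::uint64_t n) {
    value += n;
    used = true; // assumption: any update marks the gauge as used
  }
  void markUnused() { used = false; }

  // Mirrors the quoted precondition: a non-zero subtraction expects used == true.
  bool subPreconditionHolds(std::uint64_t amount) const { return used || amount == 0; }
};

// One eviction pass over a single entry. Returns true if the entry would be evicted.
bool evictionPass(FlaggedGauge& gauge) {
  if (gauge.used) {
    gauge.markUnused(); // gets a second chance, but the flag is now cleared
    return false;
  }
  return true;
}

// Returns true when an in-flight request would trip the precondition on its decrement.
bool preconditionBrokenAfterPass() {
  FlaggedGauge gauge;
  gauge.add(1); // request starts
  const bool evicted = evictionPass(gauge);
  assert(!evicted);    // survives the first pass
  assert(!gauge.used); // but is no longer marked used
  return !gauge.subPreconditionHolds(1); // request ends and wants to subtract 1
}
```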

/wait

TAOXUY added 2 commits March 8, 2026 18:09
When evicting unused stats from the central cache, we need to ensure that
gauges actively referenced by components like AccessLogState are not evicted.
The use_count() > 1 check prevents this, but a previous bug in evictUnused,
where the lambda parameter std::pair<StatName, T> kv was taken by value,
artificially inflated the use_count because every call copied the pair and
with it the reference-counted pointer. This broke eviction entirely across
the codebase.

This commit fixes evictUnused by taking the parameter as const auto& kv, avoiding
the copy and correctly applying the use_count() > 1 safeguard.

Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State
struct so its active references prevent premature eviction by evictUnused. The
erroneous std::min safeguard during gauge subtractions is also removed as
AccessLogState gauges will no longer be unfairly cleared.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Contributor Author

TAOXUY commented Mar 8, 2026

> I don't think your fix is quite right.
>
> I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.
>
>       if (scope->evictable_) {
>         MetricBag metrics(scope->scope_id_);
>         CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
>         auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
>           return [&unused_metrics](std::pair<StatName, T> kv) {
>             const auto& [name, metric] = kv;
>             if (metric->used()) {
>               metric->markUnused();
>               return false;
>             } else {
>               unused_metrics.try_emplace(name, metric);
>               return true;
>             }
>           };
>         };
>
> The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .
>
> I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
>
> /wait

Updated with an interface that lets an individual metric opt out of eviction. The gauge has to stay in the scope so that later lookups find the same gauge for inc/dec. @kyessenov

TAOXUY added 4 commits March 8, 2026 21:12
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Contributor Author

TAOXUY commented Mar 9, 2026

/retest

@ggreenway
Member

Here's an idea for another approach: add a new method to a scope to add a stat to the scope by its GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing its name/tag components.

Contributor Author

TAOXUY commented Mar 11, 2026

> Here's an idea for another approach: add a new method to a scope to add a stat to the scope by its GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing its name/tag components.

CMIIW, if a gauge is evictable, it cannot safely be incremented or decremented. We need the central_cache in the scope to hold the gauge for concurrent access.

Imagine a gauge is incremented and then evicted before it is decremented, while another access log reaches the same gauge by the same name and does its own inc/dec: the value would be corrupted.

@kyessenov

@ggreenway
Member

> CMIIW, if gauge is evictable, it cannot be dec/inc. We need the central_cache in scope to hold the gauge for concurrent access.

The central store is the Store, and all scopes reference the same store. Anytime you get a metric from the scope, if the scope does not already have it, it looks in the store, so it is not possible for two scopes to have different metrics with the same name/tags.

That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and its current value from being removed from the Store.

> Imagine when a gauge is incremented and then evicted before decremented, there is another accesslog accessing the same gauge using the same name and doing inc/dec, the value would be corrupted.

In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted.
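A toy model of this argument follows. It is a sketch under loud assumptions: `FakeStore` tracks gauges through `std::weak_ptr`, which is my simplification and not how Envoy's Store is implemented, and all names are hypothetical. The point it illustrates is only that a lookup by name returns the same object for as long as any holder keeps a strong reference.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct FakeGauge {
  std::uint64_t value = 0;
};
using FakeGaugeSharedPtr = std::shared_ptr<FakeGauge>;

// Hypothetical central store: one gauge per name while anyone still references it.
class FakeStore {
public:
  FakeGaugeSharedPtr gauge(const std::string& name) {
    if (FakeGaugeSharedPtr existing = gauges_[name].lock()) {
      return existing; // still alive somewhere: hand out the same object
    }
    FakeGaugeSharedPtr created = std::make_shared<FakeGauge>();
    gauges_[name] = created;
    return created;
  }

private:
  std::map<std::string, std::weak_ptr<FakeGauge>> gauges_;
};

// Returns true when a second user resolves the same gauge after the scope dropped it.
bool sameGaugeAfterEviction() {
  FakeStore store;
  FakeGaugeSharedPtr scope_ref = store.gauge("active"); // scope cache entry
  FakeGaugeSharedPtr filter_state_ref = scope_ref;      // held by the in-flight request
  filter_state_ref->value += 1;

  scope_ref.reset(); // eviction drops the scope's reference only

  FakeGaugeSharedPtr second = store.gauge("active"); // another access log, same name
  assert(second == filter_state_ref);
  second->value += 1;
  filter_state_ref->value -= 1; // the first request finishes
  return second->value == 1;
}
```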

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Contributor Author

TAOXUY commented Mar 11, 2026

/retest

Member

@ggreenway ggreenway left a comment

I like this solution; it's much cleaner and clearer.

@kyessenov can you also review this, especially the change to eviction logic?

TAOXUY added 3 commits March 11, 2026 22:22
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
TAOXUY added 2 commits March 19, 2026 23:52
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
ggreenway previously approved these changes Mar 20, 2026
Member

@ggreenway ggreenway left a comment

format check (spelling) is failing, but other than that this LGTM.

@kyessenov do you want to take another look?

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Contributor Author

TAOXUY commented Mar 20, 2026

/retest

@kyessenov
Contributor

Generally LGTM, mostly style nits.

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Contributor

@jmarantz jmarantz left a comment

flushing comments. Still looking.

Contributor

@jmarantz jmarantz left a comment

mostly superficial stuff remaining, except for hash validation.

TAOXUY added 3 commits March 20, 2026 21:01
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
      // When the key actually needs to be safely persisted into the map, `makeOwned()`
      // is explicitly called to allocate and copy the tags into `owned_tags_`.
      struct GaugeKey {
        Stats::StatName stat_name_;
Contributor

should makeOwned ensure that stat_name has local backup also? stat_name is like string_view.

Contributor

I scanned this a bit more and I am not understanding why the backing-store management for stat_name differs from the backing-store management for tags.

I think the stat-name comes from some pre-existing gauge but I am having a hard time keeping this all in my head. Wherever you get the name from should have the tags; there should be no need to copy the tags if you are copying the name.

Contributor Author

Because stat_name_ always comes from the persistent static configuration of the StatsAccessLog (which is kept alive by std::shared_ptr in the AccessLogState filter state), whereas the tags are dynamic and computed per request.

Since the static name outlives the request, it is safe without local storage. The tags, however, are dynamic and need to be persisted if they are pushed to the background/delayed logging map.

Contributor

I can see you copying the vector of tag name/value pairs, but I do not see how the backing store of the tag names & values is managed. Copying the vector won't copy the backing store of the tag names & values.

I think that ultimately if you are creating a new gauge, the gauge should own the tag names/values that are held in the GaugeKey.

Actually I think maybe the best way to manage the backing store is to have an OptRef<Gauge> in the GaugeKey.
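The backing-store concern can be made concrete with `std::string_view`. The sketch below is not the PR's `GaugeKey`: `KeySketch`, its fields, and its `makeOwned()` are hypothetical, and string views stand in for stat names. It shows one way an owning step has to copy the underlying bytes and then re-point the views, since copying only the vector of views would leave them referring to the caller's memory.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using TagView = std::pair<std::string_view, std::string_view>;

// Hypothetical key that starts out borrowing its tags and can later own them.
struct KeySketch {
  std::vector<TagView> tags;                                // may point at caller-owned bytes
  std::vector<std::pair<std::string, std::string>> owned;   // backing store once owned

  void makeOwned() {
    owned.clear();
    owned.reserve(tags.size());
    for (const TagView& tag : tags) {
      owned.emplace_back(std::string(tag.first), std::string(tag.second));
    }
    // Re-point every view at the copy. `owned` is not resized after this point.
    for (std::size_t i = 0; i < tags.size(); ++i) {
      tags[i] = TagView(owned[i].first, owned[i].second);
    }
  }
};

// Returns true when the key stays readable after the original strings are gone.
bool ownedKeySurvivesSource() {
  KeySketch key;
  {
    const std::string name = "method";
    const std::string value = "GET";
    key.tags.emplace_back(name, value); // borrowed views
    key.makeOwned();                    // bytes copied, views re-pointed
  } // name and value are destroyed here
  assert(key.owned.size() == 1);
  return key.tags[0].first == "method" && key.tags[0].second == "GET";
}
```

Note that this sketch is not safe to copy or move after `makeOwned()`, because the views would still point into the source object's strings; a real implementation would need to handle that.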

Contributor Author

The backing store is in the map value (InflightGauge.tags_storage_).

[image]

> the best way to manage the backing store is to have an OptRef<Gauge>

I don't follow. Can you explain a little?

Contributor Author

One example: suppose two such stats access loggers define the same gauge. The sequence of events is:

  • one access log adds 1 to the gauge
  • the other access log sets the gauge to 0
  • eviction happens
  • the first access log now subtracts 1
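The sequence above can be replayed on a toy gauge. `FakeGauge` and `trySub` are hypothetical, and the eviction step is deliberately not modeled; the sketch only shows that after an add of 1 and a set to 0, the later subtract of 1 has nothing left to subtract from.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unsigned gauge shared by two access loggers.
struct FakeGauge {
  std::uint64_t value = 0;

  void add(std::uint64_t n) { value += n; }
  void set(std::uint64_t v) { value = v; }

  // Refuses the subtraction instead of wrapping around when it would underflow.
  bool trySub(std::uint64_t n) {
    if (n > value) {
      return false;
    }
    value -= n;
    return true;
  }
};

// Replays the sequence and reports whether the final decrement could be applied.
bool firstLoggerDecrementSucceeds() {
  FakeGauge shared;
  shared.add(1); // first access log: request starts
  shared.set(0); // second access log: overwrites the value
  assert(shared.value == 0);
  // (eviction would happen here; it is not modeled in this sketch)
  return shared.trySub(1); // first access log: request ends
}
```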

Contributor

Sorry, let me ask this a different way.

If you have a loop like

for (int i = 0; i < 1000000000; ++i) {
  // allocate gauge and store in map
  // evict gauge
}

what happens to the map?

Contributor Author

Shared the benchmarking offline

Contributor Author

The benchmark test added for AccessLogState

TAOXUY added 2 commits March 21, 2026 04:01
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
TAOXUY added 2 commits March 21, 2026 15:19
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
      // When the key actually needs to be safely persisted into the map, `makeOwned()`
      // is explicitly called to allocate and copy the tags into `owned_tags_`.
      struct GaugeKey {
        Stats::StatName stat_name_;
Contributor

Good point, and the simplicity of the solution has appeal. But when evicting, should we remove the map entry? Would that make sense?

If we hang onto stale entries, doesn't that defeat the purpose of minimizing the memory we are holding?

Speaking of which, can you use the memory-test framework to test memory savings with this?

Also, that is an answer to the question "what do we do with the memory in 'owned' mode", and there's still the question of using StatNameJoiner for constructing the key for a hash lookup.
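One way to picture the "remove the map entry" suggestion is the sketch below. It is not the PR's implementation: `InflightMapSketch`, its string keys, and the erase-at-zero policy are assumptions chosen to show that the map stays bounded under churn when entries are dropped as soon as their count returns to zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical in-flight map that erases an entry once nothing is in flight for it.
class InflightMapSketch {
public:
  void increment(const std::string& key) { ++counts_[key]; }

  void decrement(const std::string& key) {
    auto it = counts_.find(key);
    if (it == counts_.end()) {
      return; // nothing tracked for this key
    }
    if (--it->second == 0) {
      counts_.erase(it); // drop the stale entry instead of keeping a zero
    }
  }

  std::size_t size() const { return counts_.size(); }

private:
  std::unordered_map<std::string, std::uint64_t> counts_;
};

// Creates and releases a distinct key per iteration and reports what is left behind.
std::size_t sizeAfterChurn(int iterations) {
  InflightMapSketch map;
  for (int i = 0; i < iterations; ++i) {
    const std::string key = "gauge_" + std::to_string(i);
    map.increment(key);
    map.decrement(key);
  }
  return map.size();
}
```

Without the erase, the same loop would leave one zero-valued entry per iteration.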

Contributor

jmarantz commented Mar 23, 2026 via email

TAOXUY added 4 commits March 23, 2026 22:13
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
…update stats speed benchmark

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
4 participants