Skip to content

Remove useless grpc connections#233

Open
xzhangxian1008 wants to merge 14 commits intotikv:masterfrom
xzhangxian1008:fix-tf-9806
Open

Remove useless grpc connections#233
xzhangxian1008 wants to merge 14 commits intotikv:masterfrom
xzhangxian1008:fix-tf-9806

Conversation

@xzhangxian1008
Copy link
Copy Markdown
Contributor

@xzhangxian1008 xzhangxian1008 commented Apr 20, 2026

Root cause: illegal connections are not cleaned up when grpc call is failed
Fix: We remove useless connections when grpc call fails with unavailable or cancelled codes
Related issue: pingcap/tiflash#9806

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling and propagation of gRPC errors, ensuring failed RPCs mark unhealthy endpoints for removal.
    • More reliable cleanup of unavailable connections during background scans.
  • New Features

    • Background health monitoring and automatic removal of bad connections.
    • RPC client now runs as a background task and is stopped earlier during cluster shutdown for more predictable lifecycle behavior.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign niedhui for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the contribution This PR is from a community contributor. label Apr 20, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

RegionClient now propagates gRPC Status into failure handling and calls a helper to mark connections invalid; RpcClient gains lifecycle and background scanning to detect and remove unhealthy connections, plus APIs to mark/remove invalid connections and wake the scan loop.

Changes

Cohort / File(s) Summary
RegionClient Status Propagation
include/pingcap/kv/RegionClient.h, src/kv/RegionClient.cc
onSendFail signature extended to accept const ::grpc::Status& status; streaming StreamReader stores client and addr; callers pass gRPC Status on RPC failures.
RpcClient Health Management & Control Loop
include/pingcap/kv/Rpc.h, src/kv/Rpc.cc
Added run(), stop(), scanConns(), markConnInvalid(), removeInvalidConns(), removeConn() overloads; new members for scan interval, timeout, stop flag, CV and invalid_conns container; helpers for channel/conn validation and dropConnIfNeeded(...) to mark UNAVAILABLE endpoints.
Cluster Background Integration & Shutdown Order
include/pingcap/kv/Cluster.h, src/kv/Cluster.cc
Thread-pool sizes increased (defaults adjusted); rpc_client->run() enqueued in background tasks; destructor now stops rpc_client earlier in shutdown sequence.

Sequence Diagram(s)

sequenceDiagram
    participant Client as RegionClient
    participant Rpc as RpcClient
    participant Channel as GrpcChannel
    participant Scan as ScanLoop

    Client->>Channel: send RPC (unary or stream)
    Channel-->>Client: returns non-OK ::grpc::Status
    Client->>Rpc: dropConnIfNeeded(client, addr, status)
    alt status.code == UNAVAILABLE
        Rpc->>Rpc: markConnInvalid(addr)
        Rpc-->>Scan: notify scan_cv
    end
    Client->>Client: onSendFail(..., status)

    Note right of Scan: background run()/scanConns()
    Scan->>Rpc: scanConns()
    Rpc->>Channel: check channel state (READY or not)
    alt not READY or timed out
        Rpc->>Rpc: enqueue addr to invalid_conns
    end
    Scan->>Rpc: removeInvalidConns()
    Rpc->>Rpc: erase invalid addrs from pool
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • windtalker
  • overvenus

Poem

🐰 Hopping through connections, sniffing the air,

I nudge the broken, mark ones not fair.
Background scouts check each channel and thread,
Clean pools leap forward while bad links are shed.
A rabbit cheers: healthy RPC ahead!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Remove useless grpc connections' accurately reflects the main objective: cleaning up illegal/invalid gRPC connections when calls fail, which is the core change across the implementation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 20, 2026
@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 20, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/kv/Rpc.cc (1)

43-47: Implementation is safe; concern is policy, not mechanics.

The erase is correctly serialized under mutex, and in-flight callers keep their ConnArrayPtr alive via shared_ptr, so there is no use-after-free risk here. The substantive concern (removing the entire 5-channel pool on a single failed call) is raised on include/pingcap/kv/Rpc.h.

One small note: consider logging (debug/trace) the erase, including addr and whether the key was actually present. Without any observability, repeated pool teardown during an outage will be hard to diagnose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/kv/Rpc.cc` around lines 43 - 47, RpcClient::removeConn is safe but lacks
observability; modify it to capture the result of conns.erase(addr) while
holding the mutex and emit a debug/trace log that includes the addr and whether
an entry was actually removed (e.g., size_t n = conns.erase(addr); log "removed"
vs "not found" with addr and n). Keep the lock (mutex) scope unchanged and use
the existing project logging/tracing facility when emitting the message.
include/pingcap/kv/RegionClient.h (1)

137-138: Raw RpcClient * inside StreamReader — document the lifetime contract.

RpcClient is owned by Cluster via std::unique_ptr<RpcClient> (see RpcClientPtr in Rpc.h). The new raw pointer is captured from cluster->rpc_client.get() and later dereferenced in finish(). If a StreamReader ever outlives its Cluster (e.g., returned up the stack while the owning Cluster is being torn down during shutdown), this becomes a use-after-free.

Two small hardening options:

Option A — add a brief invariant comment
     private:
         friend struct RegionClient;
-        RpcClient * client = nullptr;
-        std::string addr;
+        // Non-owning. Points to Cluster::rpc_client; StreamReader must not
+        // outlive the owning Cluster.
+        RpcClient * client = nullptr;
+        std::string addr;
Option B — null out after use to make double-finish() safe
         ::grpc::Status finish()
         {
             if (no_resp)
                 return ::grpc::Status::OK;
             auto status = reader->Finish();
-            if (client != nullptr && shouldRemoveConnOnStatus(status))
-                client->removeConn(addr);
+            if (client != nullptr && shouldRemoveConnOnStatus(status))
+                client->removeConn(addr);
+            client = nullptr; // finish() is not expected to be called twice
             return status;
         }

Also note that StreamReader has defaulted move operations; after a move, the source retains a copy of client/addr but a null reader. Calling finish() on a moved-from instance would crash on reader->Finish() (pre-existing), but also now stores a stale raw pointer — worth being aware of.

Also applies to: 174-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/RegionClient.h` around lines 137 - 138, The raw RpcClient*
stored in StreamReader (members client and addr) can become a dangling pointer
if the owning Cluster's unique_ptr< RpcClient > is destroyed; add a short
invariant comment near the members explaining that client is a non-owning
observer borrowed from Cluster::rpc_client.get() and must not outlive Cluster,
and harden finish() and move semantics: in StreamReader::finish() guard on
reader and client (if nullptr return), call reader->Finish(), then immediately
reset client = nullptr and reader = nullptr (and clear addr) to avoid
double-finish/use-after-free; also update the defaulted move
constructor/operator= to explicitly transfer client/addr/reader and set the
source object's client and reader to nullptr after move to avoid leaving stale
raw pointers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/pingcap/kv/Rpc.h`:
- Around line 30-40: shouldRemoveConnOnStatus currently treats CANCELLED as a
pool-removal trigger and leads RpcClient::removeConn(addr) to erase the entire
ConnArrayPtr; update it to exclude ::grpc::StatusCode::CANCELLED (only treat
UNAVAILABLE and any other confirmed transport-level codes) and then change the
removal strategy in RpcClient: instead of immediately calling removeConn(addr)
for a single-failing RPC, either mark the specific KvConnClient channel as
degraded (per-channel flag inside ConnArrayPtr) or deduplicate removals by
capturing and comparing the ConnArrayPtr pointer (so only one goroutine/thread
recreates the pool) and/or apply a short backoff/cooldown per address; adjust
getConnArray/ConnArrayPtr usage so failed RPCs act on a single channel state
rather than removing the whole ConnArrayPtr.

---

Nitpick comments:
In `@include/pingcap/kv/RegionClient.h`:
- Around line 137-138: The raw RpcClient* stored in StreamReader (members client
and addr) can become a dangling pointer if the owning Cluster's unique_ptr<
RpcClient > is destroyed; add a short invariant comment near the members
explaining that client is a non-owning observer borrowed from
Cluster::rpc_client.get() and must not outlive Cluster, and harden finish() and
move semantics: in StreamReader::finish() guard on reader and client (if nullptr
return), call reader->Finish(), then immediately reset client = nullptr and
reader = nullptr (and clear addr) to avoid double-finish/use-after-free; also
update the defaulted move constructor/operator= to explicitly transfer
client/addr/reader and set the source object's client and reader to nullptr
after move to avoid leaving stale raw pointers.

In `@src/kv/Rpc.cc`:
- Around line 43-47: RpcClient::removeConn is safe but lacks observability;
modify it to capture the result of conns.erase(addr) while holding the mutex and
emit a debug/trace log that includes the addr and whether an entry was actually
removed (e.g., size_t n = conns.erase(addr); log "removed" vs "not found" with
addr and n). Keep the lock (mutex) scope unchanged and use the existing project
logging/tracing facility when emitting the message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4b7c8bce-34d5-49cf-94df-a93d3ae33f07

📥 Commits

Reviewing files that changed from the base of the PR and between 88ca06e and 1ca0512.

📒 Files selected for processing (4)
  • include/pingcap/kv/RegionClient.h
  • include/pingcap/kv/Rpc.h
  • src/common/MPPProber.cc
  • src/kv/Rpc.cc

Comment thread include/pingcap/kv/Rpc.h Outdated
@xzhangxian1008
Copy link
Copy Markdown
Contributor Author

/cc @gengliqi @windtalker

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 21, 2026

@xzhangxian1008: GitHub didn't allow me to request PR reviews from the following users: windtalker.

Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @gengliqi @windtalker

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot requested a review from gengliqi April 21, 2026 06:26
@ti-chi-bot ti-chi-bot Bot added dco-signoff: no Indicates the PR's author has not signed dco. and removed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 21, 2026
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
@ti-chi-bot ti-chi-bot Bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 21, 2026
@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 21, 2026
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 21, 2026
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
@ti-chi-bot ti-chi-bot Bot added dco-signoff: no Indicates the PR's author has not signed dco. and removed dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 21, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
include/pingcap/kv/Rpc.h (2)

37-47: ⚠️ Potential issue | 🟠 Major

Unify the status policy before dropping shared pools.

shouldRemoveConnOnStatus() drops on CANCELLED, but the free dropConnIfNeeded() only drops on UNAVAILABLE, so behavior differs by call path. Also verify CANCELLED is really a transport-failure signal here; client-side cancellation can evict healthy shared pools.

Official gRPC C++ StatusCode documentation: what are the client-side semantics of CANCELLED vs UNAVAILABLE, and should CANCELLED be treated as a broken channel requiring connection-pool removal?
🛠️ Proposed consistency fix if only transport failures should evict pools
 inline bool shouldRemoveConnOnStatus(const ::grpc::Status & status)
 {
     switch (status.error_code())
     {
     case ::grpc::StatusCode::UNAVAILABLE:
-    case ::grpc::StatusCode::CANCELLED:
         return true;
     default:
         return false;
     }
 }
@@
 inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
 {
-    if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
+    if (shouldRemoveConnOnStatus(status))
         client->removeConn(addr);
 }

Also applies to: 97-100, 152-155

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/Rpc.h` around lines 37 - 47, The logic for evicting
connections is inconsistent: shouldRemoveConnOnStatus(const ::grpc::Status&)
treats CANCELLED as a removal-worthy status while dropConnIfNeeded() only treats
UNAVAILABLE; unify these by making both functions use the same policy (either
remove on UNAVAILABLE only or include CANCELLED in both) and update all other
occurrences noted (lines around 97-100 and 152-155) to the same rule; locate and
change the behavior in the functions shouldRemoveConnOnStatus and
dropConnIfNeeded (and other duplicated checks) so they consult a single helper
or enum-based policy to decide eviction, and if you opt to treat CANCELLED as
non-transport-failure remove it from shouldRemoveConnOnStatus (and do not add it
to dropConnIfNeeded), otherwise add CANCELLED to dropConnIfNeeded as well.

134-155: ⚠️ Potential issue | 🟠 Major

Use expected-pointer removal for failed RPC cleanup.

RpcCall::call() knows which ConnArrayPtr issued the RPC, but dropConnIfNeeded() erases by address only. A late failure from an old pool can delete a newer pool already recreated for the same address.

🔒 Proposed fix using the existing conditional removal API
     template <typename... Args>
     auto call(Args &&... args)
     {
-        ConnArrayPtr conn_array = client->getConnArray(addr);
-        auto conn_client = conn_array->get();
+        conn_array = client->getConnArray(addr);
+        auto conn_client = conn_array->get();
         return T::call(conn_client, std::forward<Args>(args)...);
     }
@@
     void dropConnIfNeeded(const ::grpc::Status & status)
     {
         if (shouldRemoveConnOnStatus(status))
-            client->removeConn(addr);
+            client->removeConn(addr, conn_array);
     }
@@
 private:
     const RpcClientPtr & client;
     const std::string & addr;
+    ConnArrayPtr conn_array;
     Logger * log;

Also applies to: 90-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/Rpc.h` around lines 134 - 155, RpcCall::call obtains the
ConnArrayPtr via client->getConnArray(addr) but dropConnIfNeeded currently calls
client->removeConn(addr) which can wrongly evict a newer pool; change the flow
to pass or store the ConnArrayPtr returned by getConnArray into the RpcCall
instance (or as an argument to dropConnIfNeeded) and use the client's
conditional removal API (e.g., client->removeConnIfMatch(addr, conn_array) or
equivalent) instead of removeConn(addr) so the connection is only removed when
the pointer matches the failing pool; update both call() and
dropConnIfNeeded(const ::grpc::Status&) to use ConnArrayPtr for expected-pointer
removal.
include/pingcap/kv/RegionClient.h (1)

125-132: ⚠️ Potential issue | 🟠 Major

Guard stream cleanup with the pool that created the stream.

finish() can run long after this stream was opened; client->removeConn(addr) may erase a freshly recreated pool for the same address. Carry the ConnArrayPtr used to create the stream and call removeConn(addr, expected), matching RpcClient::scanConns().

🔒 Proposed shape for expected-pool cleanup
 class StreamReader
 {
 public:
@@
             auto status = reader->Finish();
             if (client != nullptr && shouldRemoveConnOnStatus(status))
-                client->removeConn(addr);
+                client->removeConn(addr, conn_array);
             return status;
@@
         RpcClient * client = nullptr;
         std::string addr;
+        ConnArrayPtr conn_array;

Then initialize stream_reader->conn_array from the RpcCall/connection pool used to create reader.

Also applies to: 173-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/RegionClient.h` around lines 125 - 132, The finish() path
currently calls client->removeConn(addr) unguarded, risking removal of a newly
created pool; capture and store the ConnArrayPtr from the RpcCall/connection
pool when the stream/reader is created (e.g. set stream_reader->conn_array) and
in finish() call client->removeConn(addr, expected_conn_array) or the overload
that accepts the expected ConnArrayPtr (matching RpcClient::scanConns()), using
that saved conn_array as the expected argument; update both finish() occurrences
(lines around 125–132 and 173–175) to use the guarded removeConn(addr, expected)
with the stored ConnArrayPtr.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/kv/Rpc.cc`:
- Around line 16-20: The current code in Rpc.cc uses an if/return true/return
false pattern around conn_client->channel->WaitForConnected(deadline); replace
that with a direct return of the boolean expression to satisfy
warnings-as-errors: return conn_client->channel->WaitForConnected(deadline); —
locate the block that computes deadline and calls WaitForConnected on
conn_client->channel and change it to return the call result directly.

---

Duplicate comments:
In `@include/pingcap/kv/RegionClient.h`:
- Around line 125-132: The finish() path currently calls
client->removeConn(addr) unguarded, risking removal of a newly created pool;
capture and store the ConnArrayPtr from the RpcCall/connection pool when the
stream/reader is created (e.g. set stream_reader->conn_array) and in finish()
call client->removeConn(addr, expected_conn_array) or the overload that accepts
the expected ConnArrayPtr (matching RpcClient::scanConns()), using that saved
conn_array as the expected argument; update both finish() occurrences (lines
around 125–132 and 173–175) to use the guarded removeConn(addr, expected) with
the stored ConnArrayPtr.

In `@include/pingcap/kv/Rpc.h`:
- Around line 37-47: The logic for evicting connections is inconsistent:
shouldRemoveConnOnStatus(const ::grpc::Status&) treats CANCELLED as a
removal-worthy status while dropConnIfNeeded() only treats UNAVAILABLE; unify
these by making both functions use the same policy (either remove on UNAVAILABLE
only or include CANCELLED in both) and update all other occurrences noted (lines
around 97-100 and 152-155) to the same rule; locate and change the behavior in
the functions shouldRemoveConnOnStatus and dropConnIfNeeded (and other
duplicated checks) so they consult a single helper or enum-based policy to
decide eviction, and if you opt to treat CANCELLED as non-transport-failure
remove it from shouldRemoveConnOnStatus (and do not add it to dropConnIfNeeded),
otherwise add CANCELLED to dropConnIfNeeded as well.
- Around line 134-155: RpcCall::call obtains the ConnArrayPtr via
client->getConnArray(addr) but dropConnIfNeeded currently calls
client->removeConn(addr) which can wrongly evict a newer pool; change the flow
to pass or store the ConnArrayPtr returned by getConnArray into the RpcCall
instance (or as an argument to dropConnIfNeeded) and use the client's
conditional removal API (e.g., client->removeConnIfMatch(addr, conn_array) or
equivalent) instead of removeConn(addr) so the connection is only removed when
the pointer matches the failing pool; update both call() and
dropConnIfNeeded(const ::grpc::Status&) to use ConnArrayPtr for expected-pointer
removal.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a9b210f-e269-435a-adf7-29679f671726

📥 Commits

Reviewing files that changed from the base of the PR and between f22dc03 and 5c28d30.

📒 Files selected for processing (7)
  • include/pingcap/kv/Cluster.h
  • include/pingcap/kv/RegionClient.h
  • include/pingcap/kv/Rpc.h
  • src/common/MPPProber.cc
  • src/kv/Cluster.cc
  • src/kv/RegionClient.cc
  • src/kv/Rpc.cc

Comment thread src/kv/Rpc.cc Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/pingcap/kv/Rpc.h`:
- Around line 102-106: The helper dropConnIfNeeded currently only treats
grpc::StatusCode::UNAVAILABLE as removable while shouldRemoveConnOnStatus also
considers CANCELLED; update dropConnIfNeeded to mark the connection invalid for
both UNAVAILABLE and CANCELLED (i.e., check for status.error_code() ==
UNAVAILABLE || status.error_code() == CANCELLED) so that callers like onSendFail
and stream setup failures behave consistently with StreamReader::finish and
shouldRemoveConnOnStatus, still calling client->markConnInvalid(addr) when
either code is seen.

In `@src/kv/Rpc.cc`:
- Around line 95-134: scanConns currently snapshots (addr, ConnArrayPtr) but
only pushes addr into invalid_conns, which can cause removeInvalidConns to erase
a newly recreated pool; change the invalid_conns container to hold pairs of
(std::string addr, ConnArrayPtr expected) and in scanConns push the captured
ConnArrayPtr, update markConnInvalid to push a pair with the current conns[addr]
(or nullptr if called externally), and modify removeInvalidConns to check that
conns[addr] exists and equals the expected ConnArrayPtr before erasing (skip if
different), leaving semantics of logging and clearing the invalid_conns intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3bcffc46-9a01-410e-829f-66cd255221f1

📥 Commits

Reviewing files that changed from the base of the PR and between 5c28d30 and 23e7942.

📒 Files selected for processing (3)
  • include/pingcap/kv/RegionClient.h
  • include/pingcap/kv/Rpc.h
  • src/kv/Rpc.cc

Comment thread include/pingcap/kv/Rpc.h
Comment on lines +102 to +106
inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
{
if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
client->markConnInvalid(addr);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Align dropConnIfNeeded() with the shared removal predicate.

shouldRemoveConnOnStatus() treats both UNAVAILABLE and CANCELLED as removable, but dropConnIfNeeded() only handles UNAVAILABLE. Since onSendFail() uses this helper, unary failures and stream setup failures with CANCELLED won’t invalidate the connection, while StreamReader::finish() will.

Proposed fix
 inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
 {
-    if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
+    if (shouldRemoveConnOnStatus(status))
         client->markConnInvalid(addr);
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
{
if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
client->markConnInvalid(addr);
}
inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
{
if (shouldRemoveConnOnStatus(status))
client->markConnInvalid(addr);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/Rpc.h` around lines 102 - 106, The helper dropConnIfNeeded
currently only treats grpc::StatusCode::UNAVAILABLE as removable while
shouldRemoveConnOnStatus also considers CANCELLED; update dropConnIfNeeded to
mark the connection invalid for both UNAVAILABLE and CANCELLED (i.e., check for
status.error_code() == UNAVAILABLE || status.error_code() == CANCELLED) so that
callers like onSendFail and stream setup failures behave consistently with
StreamReader::finish and shouldRemoveConnOnStatus, still calling
client->markConnInvalid(addr) when either code is seen.

Comment thread src/kv/Rpc.cc
Comment on lines +95 to +134
void RpcClient::scanConns()
{
std::vector<std::pair<std::string, ConnArrayPtr>> conn_snapshot;
{
std::lock_guard<std::mutex> lock(mutex);
conn_snapshot.reserve(conns.size());
for (const auto & [addr, conn_array] : conns)
conn_snapshot.emplace_back(addr, conn_array);
}

for (const auto & [addr, conn_array] : conn_snapshot)
{
if (!isConnArrayValid(conn_array, detect_rpc_timeout))
{
std::lock_guard<std::mutex> lock(mutex);
invalid_conns.push_back(addr);
}
}
}

void RpcClient::markConnInvalid(const std::string & addr)
{
std::lock_guard<std::mutex> lock(mutex);
invalid_conns.push_back(addr);
scan_cv.notify_all();
}

void RpcClient::removeInvalidConns()
{
std::lock_guard<std::mutex> lock(mutex);
if (invalid_conns.empty())
return;

for (const auto & addr : invalid_conns)
{
log->information("delete unavailable addr: " + addr);
conns.erase(addr);
}

invalid_conns.clear();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t erase newly recreated connection pools from stale scan results.

scanConns() validates a snapshot, then queues only addr. If another thread recreates conns[addr] before removeInvalidConns(), the later address-only erase can delete the fresh healthy pool. Use the captured ConnArrayPtr as the expected value when removing scan-detected failures.

One possible fix using the existing conditional remover
 void RpcClient::scanConns()
 {
     std::vector<std::pair<std::string, ConnArrayPtr>> conn_snapshot;
@@
     for (const auto & [addr, conn_array] : conn_snapshot)
     {
         if (!isConnArrayValid(conn_array, detect_rpc_timeout))
         {
-            std::lock_guard<std::mutex> lock(mutex);
-            invalid_conns.push_back(addr);
+            removeConn(addr, conn_array);
         }
     }
 }

If the scan path still needs batched logging/removal, store (addr, expected_conn_array) in invalid_conns and compare the pointer before erasing.

🧰 Tools
🪛 Clang (14.0.6)

[error] 95-95: method 'scanConns' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 122-122: method 'removeInvalidConns' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/kv/Rpc.cc` around lines 95 - 134, scanConns currently snapshots (addr,
ConnArrayPtr) but only pushes addr into invalid_conns, which can cause
removeInvalidConns to erase a newly recreated pool; change the invalid_conns
container to hold pairs of (std::string addr, ConnArrayPtr expected) and in
scanConns push the captured ConnArrayPtr, update markConnInvalid to push a pair
with the current conns[addr] (or nullptr if called externally), and modify
removeInvalidConns to check that conns[addr] exists and equals the expected
ConnArrayPtr before erasing (skip if different), leaving semantics of logging
and clearing the invalid_conns intact.

Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
include/pingcap/kv/Rpc.h (1)

53-53: Duplicate entries can accumulate in invalid_conns.

markConnInvalid() unconditionally push_backs the addr (see src/kv/Rpc.cc:117-122). Concurrent RPC failures on the same address (a very common scenario — a single broken TiKV/TiFlash node will produce many near-simultaneous UNAVAILABLE failures) will push the same addr many times before the scan loop wakes and drains it. Consequences: the log->information("delete unavailable addr: ...") line in removeInvalidConns() emits N lines per real removal, and the vector grows proportionally to the failure burst. Consider using std::unordered_set<std::string> for invalid_conns, or deduping on insert.

Proposed change
-    std::vector<std::string> invalid_conns;
+    std::unordered_set<std::string> invalid_conns;

and update markConnInvalid / removeInvalidConns / run() accordingly (the empty check and iteration remain straightforward).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/pingcap/kv/Rpc.h` at line 53, The vector invalid_conns can accumulate
duplicates; change its type to std::unordered_set<std::string> (replace
std::vector<std::string> invalid_conns) and update markConnInvalid(),
removeInvalidConns(), and run() to use set semantics: insert into invalid_conns
(instead of push_back) in markConnInvalid(), iterate and erase entries in
removeInvalidConns() (adjust log to still report each unique removal), and
update any empty checks/iterations in run() to work with the unordered_set API
so duplicate addresses are not stored or logged multiple times.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@include/pingcap/kv/Rpc.h`:
- Line 53: The vector invalid_conns can accumulate duplicates; change its type
to std::unordered_set<std::string> (replace std::vector<std::string>
invalid_conns) and update markConnInvalid(), removeInvalidConns(), and run() to
use set semantics: insert into invalid_conns (instead of push_back) in
markConnInvalid(), iterate and erase entries in removeInvalidConns() (adjust log
to still report each unique removal), and update any empty checks/iterations in
run() to work with the unordered_set API so duplicate addresses are not stored
or logged multiple times.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1ee3f21a-fc4b-4b36-a25f-3fd7857825a1

📥 Commits

Reviewing files that changed from the base of the PR and between 23e7942 and 885eb4c.

📒 Files selected for processing (2)
  • include/pingcap/kv/RegionClient.h
  • include/pingcap/kv/Rpc.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/pingcap/kv/RegionClient.h

Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 22, 2026

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contribution This PR is from a community contributor. dco-signoff: no Indicates the PR's author has not signed dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant