Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing a comment.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting.
📝 Walkthrough

RegionClient now propagates gRPC Status into failure handling and calls a helper to mark connections invalid; RpcClient gains lifecycle and background scanning to detect and remove unhealthy connections, plus APIs to mark/remove invalid connections and wake the scan loop.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as RegionClient
    participant Rpc as RpcClient
    participant Channel as GrpcChannel
    participant Scan as ScanLoop
    Client->>Channel: send RPC (unary or stream)
    Channel-->>Client: returns non-OK ::grpc::Status
    Client->>Rpc: dropConnIfNeeded(client, addr, status)
    alt status.code == UNAVAILABLE
        Rpc->>Rpc: markConnInvalid(addr)
        Rpc-->>Scan: notify scan_cv
    end
    Client->>Client: onSendFail(..., status)
    Note right of Scan: background run()/scanConns()
    Scan->>Rpc: scanConns()
    Rpc->>Channel: check channel state (READY or not)
    alt not READY or timed out
        Rpc->>Rpc: enqueue addr to invalid_conns
    end
    Scan->>Rpc: removeInvalidConns()
    Rpc->>Rpc: erase invalid addrs from pool
```
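The interaction above can be reduced to a small self-contained model of the wake-able scan loop. This is a sketch under assumed names (`Pool`, `markInvalid`, `scanOnce`), not the actual client-c code, and a dummy `int` stands in for the real connection pool:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct Pool
{
    std::mutex mu;
    std::condition_variable scan_cv;
    std::unordered_map<std::string, int> conns; // addr -> dummy pool
    std::vector<std::string> invalid_conns;

    // Called from an RPC failure path: queue the addr and wake the scanner.
    void markInvalid(const std::string & addr)
    {
        std::lock_guard<std::mutex> lk(mu);
        invalid_conns.push_back(addr);
        scan_cv.notify_all();
    }

    // One iteration of the background loop: wait until woken or until the
    // periodic-scan timeout elapses, then erase everything queued.
    size_t scanOnce(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lk(mu);
        scan_cv.wait_for(lk, timeout, [this] { return !invalid_conns.empty(); });
        size_t removed = 0;
        for (const auto & addr : invalid_conns)
            removed += conns.erase(addr);
        invalid_conns.clear();
        return removed;
    }
};
```

A background thread would call `scanOnce` in a loop; `markInvalid` from any RPC failure path wakes it early instead of waiting out the periodic timeout.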
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Force-pushed f62f97c to 0a4af5c (compare)
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/kv/Rpc.cc (1)
43-47: Implementation is safe; the concern is policy, not mechanics.

The erase is correctly serialized under `mutex`, and in-flight callers keep their `ConnArrayPtr` alive via `shared_ptr`, so there is no use-after-free risk here. The substantive concern (removing the entire 5-channel pool on a single failed call) is raised on include/pingcap/kv/Rpc.h.

One small note: consider logging (debug/trace) the erase, including the addr and whether the key was actually present. Without any observability, repeated pool teardown during an outage will be hard to diagnose.
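That logging suggestion can be sketched with a stand-in registry; `ConnRegistry` and the stream logger below are illustrative only and replace the project's actual logging facility:

```cpp
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>

struct ConnRegistry
{
    std::mutex mutex;
    std::unordered_map<std::string, int> conns; // addr -> dummy ConnArray

    // Returns true when an entry was actually erased; the erase result is
    // captured while still holding the mutex, then logged.
    bool removeConn(const std::string & addr)
    {
        std::lock_guard<std::mutex> lock(mutex);
        size_t n = conns.erase(addr);
        std::cerr << (n ? "removed conn pool for " : "no conn pool for ")
                  << addr << "\n";
        return n > 0;
    }
};
```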
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/kv/Rpc.cc` around lines 43-47, RpcClient::removeConn is safe but lacks observability; modify it to capture the result of conns.erase(addr) while holding the mutex and emit a debug/trace log that includes the addr and whether an entry was actually removed (e.g., size_t n = conns.erase(addr); log "removed" vs "not found" with addr and n). Keep the lock (mutex) scope unchanged and use the existing project logging/tracing facility when emitting the message.

include/pingcap/kv/RegionClient.h (1)
137-138: Raw `RpcClient *` inside `StreamReader`: document the lifetime contract.

`RpcClient` is owned by `Cluster` via `std::unique_ptr<RpcClient>` (see `RpcClientPtr` in Rpc.h). The new raw pointer is captured from `cluster->rpc_client.get()` and later dereferenced in `finish()`. If a `StreamReader` ever outlives its `Cluster` (e.g., returned up the stack while the owning `Cluster` is being torn down during shutdown), this becomes a use-after-free. Two small hardening options:
Option A: add a brief invariant comment

```diff
 private:
     friend struct RegionClient;
-    RpcClient * client = nullptr;
-    std::string addr;
+    // Non-owning. Points to Cluster::rpc_client; StreamReader must not
+    // outlive the owning Cluster.
+    RpcClient * client = nullptr;
+    std::string addr;
```
Option B: null out after use to make double-finish() safe

```diff
 ::grpc::Status finish()
 {
     if (no_resp)
         return ::grpc::Status::OK;
     auto status = reader->Finish();
-    if (client != nullptr && shouldRemoveConnOnStatus(status))
-        client->removeConn(addr);
+    if (client != nullptr && shouldRemoveConnOnStatus(status))
+        client->removeConn(addr);
+    client = nullptr; // finish() is not expected to be called twice
     return status;
 }
```

Also note that `StreamReader` has defaulted move operations; after a move, the source retains a copy of `client`/`addr` but a null `reader`. Calling `finish()` on a moved-from instance would crash on `reader->Finish()` (pre-existing), but it also now stores a stale raw pointer: worth being aware of.

Also applies to: 174-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/pingcap/kv/RegionClient.h` around lines 137 - 138, The raw RpcClient* stored in StreamReader (members client and addr) can become a dangling pointer if the owning Cluster's unique_ptr< RpcClient > is destroyed; add a short invariant comment near the members explaining that client is a non-owning observer borrowed from Cluster::rpc_client.get() and must not outlive Cluster, and harden finish() and move semantics: in StreamReader::finish() guard on reader and client (if nullptr return), call reader->Finish(), then immediately reset client = nullptr and reader = nullptr (and clear addr) to avoid double-finish/use-after-free; also update the defaulted move constructor/operator= to explicitly transfer client/addr/reader and set the source object's client and reader to nullptr after move to avoid leaving stale raw pointers.
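Option B's double-finish guard can be exercised with stand-in types; `DummyReader` and `Registry` below replace the gRPC reader and `RpcClient`, so this is a toy model rather than the library code:

```cpp
#include <memory>
#include <set>
#include <string>

struct Registry
{
    std::set<std::string> conns{"tiflash-1:3930"};
    void removeConn(const std::string & addr) { conns.erase(addr); }
};

struct DummyReader
{
    bool ok = false;             // pretend the stream failed
    bool finish() { return ok; }
};

struct StreamReaderSketch
{
    Registry * client = nullptr; // non-owning; must not outlive Registry
    std::string addr;
    std::unique_ptr<DummyReader> reader;

    bool finish()
    {
        if (reader == nullptr || client == nullptr)
            return true;         // moved-from or already finished: no-op
        bool ok = reader->finish();
        if (!ok)
            client->removeConn(addr); // drop pool on failed stream
        client = nullptr;        // guard against double-finish
        reader.reset();
        return ok;
    }
};
```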
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/pingcap/kv/Rpc.h`:
- Around line 30-40: shouldRemoveConnOnStatus currently treats CANCELLED as a
pool-removal trigger and leads RpcClient::removeConn(addr) to erase the entire
ConnArrayPtr; update it to exclude ::grpc::StatusCode::CANCELLED (only treat
UNAVAILABLE and any other confirmed transport-level codes) and then change the
removal strategy in RpcClient: instead of immediately calling removeConn(addr)
for a single-failing RPC, either mark the specific KvConnClient channel as
degraded (per-channel flag inside ConnArrayPtr) or deduplicate removals by
capturing and comparing the ConnArrayPtr pointer (so only one goroutine/thread
recreates the pool) and/or apply a short backoff/cooldown per address; adjust
getConnArray/ConnArrayPtr usage so failed RPCs act on a single channel state
rather than removing the whole ConnArrayPtr.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4b7c8bce-34d5-49cf-94df-a93d3ae33f07
📒 Files selected for processing (4)
- include/pingcap/kv/RegionClient.h
- include/pingcap/kv/Rpc.h
- src/common/MPPProber.cc
- src/kv/Rpc.cc
/cc @gengliqi @windtalker
@xzhangxian1008: GitHub didn't allow me to request PR reviews from the following users: windtalker. Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Force-pushed 116d59a to 13d5ec8 (compare)
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
include/pingcap/kv/Rpc.h (2)
37-47: ⚠️ Potential issue | 🟠 Major

Unify the status policy before dropping shared pools.

`shouldRemoveConnOnStatus()` drops on CANCELLED, but the free `dropConnIfNeeded()` only drops on UNAVAILABLE, so behavior differs by call path. Also verify that CANCELLED is really a transport-failure signal here; client-side cancellation can evict healthy shared pools. Check the official gRPC C++ StatusCode documentation for the client-side semantics of CANCELLED vs UNAVAILABLE, and whether CANCELLED should be treated as a broken channel requiring connection-pool removal.

🛠️ Proposed consistency fix if only transport failures should evict pools
```diff
 inline bool shouldRemoveConnOnStatus(const ::grpc::Status & status)
 {
     switch (status.error_code())
     {
     case ::grpc::StatusCode::UNAVAILABLE:
-    case ::grpc::StatusCode::CANCELLED:
         return true;
     default:
         return false;
     }
 }
@@
 inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
 {
-    if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
+    if (shouldRemoveConnOnStatus(status))
         client->removeConn(addr);
 }
```

Also applies to: 97-100, 152-155
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/pingcap/kv/Rpc.h` around lines 37 - 47, The logic for evicting connections is inconsistent: shouldRemoveConnOnStatus(const ::grpc::Status&) treats CANCELLED as a removal-worthy status while dropConnIfNeeded() only treats UNAVAILABLE; unify these by making both functions use the same policy (either remove on UNAVAILABLE only or include CANCELLED in both) and update all other occurrences noted (lines around 97-100 and 152-155) to the same rule; locate and change the behavior in the functions shouldRemoveConnOnStatus and dropConnIfNeeded (and other duplicated checks) so they consult a single helper or enum-based policy to decide eviction, and if you opt to treat CANCELLED as non-transport-failure remove it from shouldRemoveConnOnStatus (and do not add it to dropConnIfNeeded), otherwise add CANCELLED to dropConnIfNeeded as well.
134-155: ⚠️ Potential issue | 🟠 Major

Use expected-pointer removal for failed RPC cleanup.

`RpcCall::call()` knows which `ConnArrayPtr` issued the RPC, but `dropConnIfNeeded()` erases by address only. A late failure from an old pool can delete a newer pool already recreated for the same address.

🔒 Proposed fix using the existing conditional removal API
```diff
 template <typename... Args>
 auto call(Args &&... args)
 {
-    ConnArrayPtr conn_array = client->getConnArray(addr);
-    auto conn_client = conn_array->get();
+    conn_array = client->getConnArray(addr);
+    auto conn_client = conn_array->get();
     return T::call(conn_client, std::forward<Args>(args)...);
 }
@@
 void dropConnIfNeeded(const ::grpc::Status & status)
 {
     if (shouldRemoveConnOnStatus(status))
-        client->removeConn(addr);
+        client->removeConn(addr, conn_array);
 }
@@
 private:
     const RpcClientPtr & client;
     const std::string & addr;
+    ConnArrayPtr conn_array;
     Logger * log;
```

Also applies to: 90-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/pingcap/kv/Rpc.h` around lines 134 - 155, RpcCall::call obtains the ConnArrayPtr via client->getConnArray(addr) but dropConnIfNeeded currently calls client->removeConn(addr) which can wrongly evict a newer pool; change the flow to pass or store the ConnArrayPtr returned by getConnArray into the RpcCall instance (or as an argument to dropConnIfNeeded) and use the client's conditional removal API (e.g., client->removeConnIfMatch(addr, conn_array) or equivalent) instead of removeConn(addr) so the connection is only removed when the pointer matches the failing pool; update both call() and dropConnIfNeeded(const ::grpc::Status&) to use ConnArrayPtr for expected-pointer removal.

include/pingcap/kv/RegionClient.h (1)
125-132: ⚠️ Potential issue | 🟠 Major

Guard stream cleanup with the pool that created the stream.

`finish()` can run long after this stream was opened; `client->removeConn(addr)` may erase a freshly recreated pool for the same address. Carry the `ConnArrayPtr` used to create the stream and call `removeConn(addr, expected)`, matching `RpcClient::scanConns()`.

🔒 Proposed shape for expected-pool cleanup
```diff
 class StreamReader
 {
 public:
@@
         auto status = reader->Finish();
         if (client != nullptr && shouldRemoveConnOnStatus(status))
-            client->removeConn(addr);
+            client->removeConn(addr, conn_array);
         return status;
@@
     RpcClient * client = nullptr;
     std::string addr;
+    ConnArrayPtr conn_array;
```

Then initialize `stream_reader->conn_array` from the `RpcCall`/connection pool used to create `reader`.

Also applies to: 173-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/pingcap/kv/RegionClient.h` around lines 125 - 132, The finish() path currently calls client->removeConn(addr) unguarded, risking removal of a newly created pool; capture and store the ConnArrayPtr from the RpcCall/connection pool when the stream/reader is created (e.g. set stream_reader->conn_array) and in finish() call client->removeConn(addr, expected_conn_array) or the overload that accepts the expected ConnArrayPtr (matching RpcClient::scanConns()), using that saved conn_array as the expected argument; update both finish() occurrences (lines around 125–132 and 173–175) to use the guarded removeConn(addr, expected) with the stored ConnArrayPtr.
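The expected-pointer guard that both comments above ask for can be modeled in isolation. `removeConnIfMatch` is a hypothetical name (the review suggests it as one option), and `std::shared_ptr<int>` stands in for the real `ConnArrayPtr`:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

using ConnArrayPtr = std::shared_ptr<int>; // dummy pool type

struct Client
{
    std::mutex mutex;
    std::unordered_map<std::string, ConnArrayPtr> conns;

    // Erase the pool only if it is still the same object that served the
    // failed RPC; a pool recreated in the meantime is left alone.
    bool removeConnIfMatch(const std::string & addr, const ConnArrayPtr & expected)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = conns.find(addr);
        if (it == conns.end() || it->second != expected)
            return false; // already gone or already recreated: keep it
        conns.erase(it);
        return true;
    }
};
```

The `shared_ptr` identity comparison is the whole trick: a late failure from an old pool carries a pointer that no longer matches the map entry, so the fresh pool survives.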
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/kv/Rpc.cc`:
- Around line 16-20: The current code in Rpc.cc uses an if/return true/return
false pattern around conn_client->channel->WaitForConnected(deadline); replace
that with a direct return of the boolean expression to satisfy
warnings-as-errors: return conn_client->channel->WaitForConnected(deadline); —
locate the block that computes deadline and calls WaitForConnected on
conn_client->channel and change it to return the call result directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9a9b210f-e269-435a-adf7-29679f671726
📒 Files selected for processing (7)
- include/pingcap/kv/Cluster.h
- include/pingcap/kv/RegionClient.h
- include/pingcap/kv/Rpc.h
- src/common/MPPProber.cc
- src/kv/Cluster.cc
- src/kv/RegionClient.cc
- src/kv/Rpc.cc
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3bcffc46-9a01-410e-829f-66cd255221f1
📒 Files selected for processing (3)
- include/pingcap/kv/RegionClient.h
- include/pingcap/kv/Rpc.h
- src/kv/Rpc.cc
```cpp
inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
{
    if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
        client->markConnInvalid(addr);
}
```
Align dropConnIfNeeded() with the shared removal predicate.
shouldRemoveConnOnStatus() treats both UNAVAILABLE and CANCELLED as removable, but dropConnIfNeeded() only handles UNAVAILABLE. Since onSendFail() uses this helper, unary failures and stream setup failures with CANCELLED won’t invalidate the connection, while StreamReader::finish() will.
Proposed fix
```diff
 inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
 {
-    if (status.error_code() == grpc::StatusCode::UNAVAILABLE)
+    if (shouldRemoveConnOnStatus(status))
         client->markConnInvalid(addr);
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
inline void dropConnIfNeeded(const RpcClientPtr & client, const std::string & addr, const ::grpc::Status & status)
{
    if (shouldRemoveConnOnStatus(status))
        client->markConnInvalid(addr);
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/pingcap/kv/Rpc.h` around lines 102 - 106, The helper dropConnIfNeeded
currently only treats grpc::StatusCode::UNAVAILABLE as removable while
shouldRemoveConnOnStatus also considers CANCELLED; update dropConnIfNeeded to
mark the connection invalid for both UNAVAILABLE and CANCELLED (i.e., check for
status.error_code() == UNAVAILABLE || status.error_code() == CANCELLED) so that
callers like onSendFail and stream setup failures behave consistently with
StreamReader::finish and shouldRemoveConnOnStatus, still calling
client->markConnInvalid(addr) when either code is seen.
```cpp
void RpcClient::scanConns()
{
    std::vector<std::pair<std::string, ConnArrayPtr>> conn_snapshot;
    {
        std::lock_guard<std::mutex> lock(mutex);
        conn_snapshot.reserve(conns.size());
        for (const auto & [addr, conn_array] : conns)
            conn_snapshot.emplace_back(addr, conn_array);
    }

    for (const auto & [addr, conn_array] : conn_snapshot)
    {
        if (!isConnArrayValid(conn_array, detect_rpc_timeout))
        {
            std::lock_guard<std::mutex> lock(mutex);
            invalid_conns.push_back(addr);
        }
    }
}

void RpcClient::markConnInvalid(const std::string & addr)
{
    std::lock_guard<std::mutex> lock(mutex);
    invalid_conns.push_back(addr);
    scan_cv.notify_all();
}

void RpcClient::removeInvalidConns()
{
    std::lock_guard<std::mutex> lock(mutex);
    if (invalid_conns.empty())
        return;

    for (const auto & addr : invalid_conns)
    {
        log->information("delete unavailable addr: " + addr);
        conns.erase(addr);
    }

    invalid_conns.clear();
}
```
Don’t erase newly recreated connection pools from stale scan results.
scanConns() validates a snapshot, then queues only addr. If another thread recreates conns[addr] before removeInvalidConns(), the later address-only erase can delete the fresh healthy pool. Use the captured ConnArrayPtr as the expected value when removing scan-detected failures.
One possible fix using the existing conditional remover
```diff
 void RpcClient::scanConns()
 {
     std::vector<std::pair<std::string, ConnArrayPtr>> conn_snapshot;
@@
     for (const auto & [addr, conn_array] : conn_snapshot)
     {
         if (!isConnArrayValid(conn_array, detect_rpc_timeout))
         {
-            std::lock_guard<std::mutex> lock(mutex);
-            invalid_conns.push_back(addr);
+            removeConn(addr, conn_array);
         }
     }
 }
```

If the scan path still needs batched logging/removal, store (addr, expected_conn_array) in invalid_conns and compare the pointer before erasing.
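A minimal model of that batched variant: queue (addr, expected) pairs and compare the pointer before erasing, so a pool recreated after the scan snapshot is never torn down. `ScanState` and its members are illustrative stand-ins, not the real `RpcClient`:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using ConnArrayPtr = std::shared_ptr<int>; // dummy pool type

struct ScanState
{
    std::unordered_map<std::string, ConnArrayPtr> conns;
    std::vector<std::pair<std::string, ConnArrayPtr>> invalid_conns;

    // Drain the queue, erasing only entries whose pool pointer still
    // matches the one captured when the failure was detected.
    size_t removeInvalidConns()
    {
        size_t removed = 0;
        for (const auto & [addr, expected] : invalid_conns)
        {
            auto it = conns.find(addr);
            if (it != conns.end() && it->second == expected)
            {
                conns.erase(it);
                ++removed;
            }
            // else: a fresh pool replaced the failed one; skip it
        }
        invalid_conns.clear();
        return removed;
    }
};
```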
🧰 Tools
🪛 Clang (14.0.6)
[error] 95-95: method 'scanConns' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 122-122: method 'removeInvalidConns' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/kv/Rpc.cc` around lines 95 - 134, scanConns currently snapshots (addr,
ConnArrayPtr) but only pushes addr into invalid_conns, which can cause
removeInvalidConns to erase a newly recreated pool; change the invalid_conns
container to hold pairs of (std::string addr, ConnArrayPtr expected) and in
scanConns push the captured ConnArrayPtr, update markConnInvalid to push a pair
with the current conns[addr] (or nullptr if called externally), and modify
removeInvalidConns to check that conns[addr] exists and equals the expected
ConnArrayPtr before erasing (skip if different), leaving semantics of logging
and clearing the invalid_conns intact.
🧹 Nitpick comments (1)
include/pingcap/kv/Rpc.h (1)
53-53: Duplicate entries can accumulate in `invalid_conns`.

`markConnInvalid()` unconditionally `push_back`s the addr (see src/kv/Rpc.cc:117-122). Concurrent RPC failures on the same address (a very common scenario: a single broken TiKV/TiFlash node will produce many near-simultaneous UNAVAILABLE failures) will push the same addr many times before the scan loop wakes and drains it. Consequences: the `log->information("delete unavailable addr: ...")` line in `removeInvalidConns()` emits N lines per real removal, and the vector grows proportionally to the failure burst. Consider using `std::unordered_set<std::string>` for `invalid_conns`, or deduping on insert.

Proposed change
```diff
-    std::vector<std::string> invalid_conns;
+    std::unordered_set<std::string> invalid_conns;
```

and update `markConnInvalid`/`removeInvalidConns`/`run()` accordingly (the empty check and iteration remain straightforward).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/pingcap/kv/Rpc.h` at line 53, The vector invalid_conns can accumulate duplicates; change its type to std::unordered_set<std::string> (replace std::vector<std::string> invalid_conns) and update markConnInvalid(), removeInvalidConns(), and run() to use set semantics: insert into invalid_conns (instead of push_back) in markConnInvalid(), iterate and erase entries in removeInvalidConns() (adjust log to still report each unique removal), and update any empty checks/iterations in run() to work with the unordered_set API so duplicate addresses are not stored or logged multiple times.
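A tiny sketch of the set-based dedup; `InvalidSet` is an illustrative stand-in for the relevant `RpcClient` members:

```cpp
#include <string>
#include <unordered_set>

struct InvalidSet
{
    std::unordered_set<std::string> invalid_conns;

    // insert() is idempotent, so a burst of failures on one address
    // queues (and later logs) that address exactly once.
    void markConnInvalid(const std::string & addr) { invalid_conns.insert(addr); }
};
```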
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ee3f21a-fc4b-4b36-a25f-3fd7857825a1
📒 Files selected for processing (2)
- include/pingcap/kv/RegionClient.h
- include/pingcap/kv/Rpc.h
🚧 Files skipped from review as they are similar to previous changes (1)
- include/pingcap/kv/RegionClient.h
Signed-off-by: xzhangxian1008 <xzhangxian@foxmail.com>
Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits. 📝 Please follow instructions in the contributing guide to update your commits with the DCO. Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Root cause: invalid connections are not cleaned up when a gRPC call fails.
Fix: remove stale connections when a gRPC call fails with the UNAVAILABLE or CANCELLED status code.
Related issue: pingcap/tiflash#9806
Summary by CodeRabbit
Bug Fixes
New Features