fix(ui): long-poll room list sync after initial request #6361
TigerInYourDream wants to merge 3 commits into matrix-org:main
Conversation
Reproduction context
Observed symptom

After the initial sync completed and the client was idle, the homeserver kept receiving a continuous stream of requests with the same `pos` value and `timeout=0`.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #6361      +/-   ##
    ==========================================
    + Coverage   88.93%   89.78%   +0.84%
    ==========================================
      Files         357      357
      Lines       99195    98827     -368
      Branches    99195    98827     -368
    ==========================================
    + Hits        88221    88730     +509
    + Misses       6991     6614     -377
    + Partials     3983     3483     -500

☔ View full report in Codecov by Sentry.
Thank you for your contribution. The state machine is as follows. The initial state is `Init`. The timeout is set to 0 for the requests made in the `SettingUp` and `Recovering` states. The only place I see where we can save one request with no timeout is in the `Running` state.

After a proof-read, I think I understand your use case: an account with e.g. 800 rooms but with very little activity: we have to paginate over all the rooms, but the server has nothing to return, so it's a sequence of quick requests and empty responses. Is that correct?

The problem is that the `SyncIndicator` derives `Show` from how long the state machine stays in non-`Running` states, so long-polling outside `Running` would falsely trigger it. I want to remove this.
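The pre-fix timeout policy discussed above can be sketched as a small state machine. This is a hypothetical simplification, not the actual matrix-rust-sdk code; the function name and the `Option<u64>`-milliseconds encoding are invented for illustration:

```rust
// Sketch (assumption, not SDK code): before this PR, every state except a
// fully-loaded `Running` forced an immediate response with timeout=0.
#[derive(Debug, PartialEq)]
enum State {
    Init,
    SettingUp,
    Recovering,
    Running { is_fully_loaded: bool },
}

/// Poll timeout in milliseconds; `None` means "use the default long-poll
/// timeout" (30 000 ms in this PR's tests).
fn poll_timeout_before_fix(state: &State) -> Option<u64> {
    match state {
        State::Init | State::SettingUp | State::Recovering => Some(0),
        State::Running { is_fully_loaded } => {
            if *is_fully_loaded { None } else { Some(0) }
        }
    }
}

fn main() {
    // An idle client held in `Running` before the list is fully loaded
    // keeps asking for an immediate (empty) response:
    assert_eq!(
        poll_timeout_before_fix(&State::Running { is_fully_loaded: false }),
        Some(0)
    );
    assert_eq!(
        poll_timeout_before_fix(&State::Running { is_fully_loaded: true }),
        None
    );
}
```

In this toy encoding, the idle loop corresponds to any reachable state where the function returns `Some(0)` while the server has nothing new to send.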
Per review on matrix-org#6361: long-polling in `SettingUp` / `Recovering` extends the time spent outside `Running` and falsely triggers `SyncIndicator` (which derives `Show` from how long the state machine stays in non-`Running` states). Roll those states back to `PollTimeout::Some(0)` and limit the policy change to collapsing the `Running` branch so it always long-polls regardless of `is_fully_loaded()`.

The idle-loop symptom from Robrix2 + local Palpo (MSC4186) still goes away, because the loop sits in `Running` + `!is_fully_loaded`: the client repeatedly issues `pos=<n>&timeout=0` while the server has nothing to deliver for the next Growing batch.

Tests:

- Integration `test_sync_all_states`: the `SettingUp -> Running` request goes back to `timeout=0`; `Running -> Running` requests stay at `timeout=30000`.
- Unit `test_long_poll_once_running` (renamed): drive 3 sync cycles, assert the `SettingUp` request still carries `timeout=0` AND the first `Running` request carries `timeout=30000`, defending the `SyncIndicator` invariant.
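The narrowed policy described above touches only one branch. A sketch under the same invented names as before (not the SDK's real types):

```rust
// Sketch (assumption, not SDK code) of the narrowed change: only the
// `Running` branch changes, so it always long-polls (`None` = default
// long-poll timeout), while the pre-`Running` states keep `Some(0)` and
// the `SyncIndicator` timing is unaffected.
#[derive(Debug, PartialEq)]
enum State {
    Init,
    SettingUp,
    Recovering,
    Running, // no is_fully_loaded check anymore in this branch
}

fn poll_timeout_narrowed(state: &State) -> Option<u64> {
    match state {
        State::Init | State::SettingUp | State::Recovering => Some(0),
        State::Running => None,
    }
}

fn main() {
    // Mirrors the tests quoted above: SettingUp stays immediate,
    // Running long-polls from its very first request.
    assert_eq!(poll_timeout_narrowed(&State::SettingUp), Some(0));
    assert_eq!(poll_timeout_narrowed(&State::Running), None);
}
```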
Thanks for the careful walkthrough — it really helps me see where my change overreaches. Let me re-anchor the reproduction with concrete details and then propose a narrower fix that respects your concern.

Where the report came from

I hit this while testing Robrix2 against a local Palpo homeserver (an MSC4186-capable implementation). The trigger is not specifically "many rooms" — it is simply "the client has reached idle after the initial sync". After that, server logs show a non-stop stream of requests with the same `pos` and `timeout=0`.

Why it's a two-sided bug, and why the SDK side still matters

The loop is the product of two independent issues: the SDK forcing `timeout=0` in post-init states even when idle, and a separate bug on the Palpo side.
Crucially: when I applied just the SDK-side fix and left the Palpo bug in place, the tight loop already stopped in our local Palpo deployment (verified in Project-Robius-China/robrix2#101). So this PR is a complete fix for the user-visible symptom on the client side, independent of the server work.

Where my original patch overreached

You're right that the loop is not specifically a pre-`Running` problem. Pushing long-polling into `SettingUp` / `Recovering` extends the time spent outside `Running` and falsely triggers the `SyncIndicator`, so those states should keep `timeout=0`.

Narrower change I just pushed
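A back-of-the-envelope model shows why the two behaviours differ so sharply at idle. This is pure arithmetic with assumed round-trip and window values, not measured data:

```rust
// Toy model (assumption, not SDK code): each sync request costs a small
// round-trip; with timeout=0 an idle server answers immediately and the
// client re-requests at once, while a long-poll parks for the full timeout.
fn requests_in_window(window_ms: u64, round_trip_ms: u64, poll_timeout_ms: u64) -> u64 {
    // An idle server sits on the request for the whole poll timeout.
    let per_request = (round_trip_ms + poll_timeout_ms).max(1);
    window_ms.div_ceil(per_request)
}

fn main() {
    let tight = requests_in_window(30_000, 5, 0);       // timeout=0 while idle
    let parked = requests_in_window(30_000, 5, 30_000); // timeout=30000 while idle
    assert_eq!(tight, 6_000); // thousands of no-op requests in 30 s
    assert_eq!(parked, 1);    // one parked long-poll covers the window
}
```

With a 5 ms round-trip assumption, a 30-second idle window turns into thousands of empty request/response pairs under `timeout=0`, versus a single parked request under the default long-poll.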
One more piece of context I should have led with — the loop is not "a few extra requests" after sync: it is a continuous stream for as long as the client stays idle. The reason is that a `timeout=0` request gets an immediate empty response, and the client re-sends the same `pos` right away, so there is no point at which the client ever waits. So the takeaway I'd flag for any future change to this policy: any state the client can sit in while idle must long-poll, or it degenerates into a busy loop.
Problem
After the initial room-list sync, `RoomListService` continues sending `timeout=0` requests because `SettingUp`, `Recovering`, and `Running` (before fully loaded) all forced immediate responses. This creates a tight polling loop when idle — the server returns empty responses instantly, and the client re-sends the same `pos` right away.

Fix
Only force `timeout=0` for `State::Init`. All post-init states use `PollTimeout::Default`, letting the server long-poll when idle. If the server has pending changes, it can still respond immediately regardless of the timeout value.

Test
Added a regression test verifying the second sync request carries `timeout=30000` instead of `timeout=0`.
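The policy stated in the Fix section above can be sketched as follows (hypothetical `PollTimeout` and `State` definitions; the real SDK types may differ, and the review discussion later narrows this so that `SettingUp` / `Recovering` keep `timeout=0`):

```rust
// Sketch of the Fix as described: only `Init` forces an immediate
// response; every later state falls through to the default long-poll.
#[derive(Debug, PartialEq)]
enum PollTimeout {
    Some(u64),
    Default, // server-side default; 30_000 ms in this PR's regression test
}

#[derive(Debug, PartialEq)]
enum State {
    Init,
    SettingUp,
    Recovering,
    Running,
}

fn poll_timeout(state: &State) -> PollTimeout {
    match state {
        State::Init => PollTimeout::Some(0),
        State::SettingUp | State::Recovering | State::Running => PollTimeout::Default,
    }
}

fn main() {
    // Mirrors the regression test: the first request is immediate,
    // the second one long-polls.
    assert_eq!(poll_timeout(&State::Init), PollTimeout::Some(0));
    assert_eq!(poll_timeout(&State::SettingUp), PollTimeout::Default);
}
```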