fix(ui): long-poll room list sync after initial request #6361

Open

TigerInYourDream wants to merge 3 commits into matrix-org:main from TigerInYourDream:fix/room-list-long-poll-after-initial-sync

Conversation


TigerInYourDream commented Mar 27, 2026

Problem

After the initial room-list sync, RoomListService continues sending timeout=0 requests because SettingUp, Recovering, and Running (before fully loaded) all forced immediate responses. This creates a tight polling loop when idle — the server returns empty responses instantly, and the client re-sends the same pos right away.

Fix

Only force timeout=0 for State::Init. All post-init states use PollTimeout::Default, letting the server long-poll when idle. If the server has pending changes, it can still respond immediately regardless of the timeout value.
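For illustration only, the policy described above reduces to a single match. The `State` and `PollTimeout` enums below are simplified stand-ins for the SDK's types, not the actual matrix-sdk-ui definitions:

```rust
// Simplified stand-ins for the SDK's `State` and `PollTimeout` types;
// names follow this PR's description, not the exact matrix-sdk-ui API.
#[derive(Debug, PartialEq)]
pub enum State {
    Init,
    SettingUp,
    Recovering,
    Running,
    Error,
    Terminated,
}

#[derive(Debug, PartialEq)]
pub enum PollTimeout {
    /// Force the server to answer immediately (`timeout=0`).
    Some(u64),
    /// Use the sync loop's default long-poll timeout (e.g. 30 s).
    Default,
}

/// Policy as described above: only `Init` forces an immediate
/// response; every post-init state lets the server long-poll.
pub fn poll_timeout_for(state: &State) -> PollTimeout {
    match state {
        State::Init => PollTimeout::Some(0),
        _ => PollTimeout::Default,
    }
}
```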

Test

Added a regression test verifying the second sync request carries timeout=30000 instead of timeout=0.

  • Public API changes documented in changelogs (optional)

TigerInYourDream requested a review from a team as a code owner March 27, 2026 08:44
TigerInYourDream requested review from stefanceriu and removed request for a team March 27, 2026 08:44
Author

TigerInYourDream commented Mar 27, 2026

Reproduction context

  • Local homeserver implementing MSC4186
  • Robrix as the client, with matrix-sdk-ui driving the room-list sync

Observed symptom

After the initial sync completed and the client was idle, the homeserver kept receiving a continuous stream of requests like:

/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=539&timeout=0

with the same pos repeating. The loop is driven by the room-list timeout policy: all post-init states were forcing timeout=0, so the server had no choice but to respond immediately, and the client would re-send right away.


codspeed-hq Bot commented Mar 27, 2026

Merging this PR will not alter performance

✅ 50 untouched benchmarks


Comparing TigerInYourDream:fix/room-list-long-poll-after-initial-sync (a5d7247) with main (2c2e0f1)

Open in CodSpeed


codecov Bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 93.18182% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.78%. Comparing base (d64c990) to head (a5d7247).
⚠️ Report is 950 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
crates/matrix-sdk-ui/src/room_list_service/mod.rs 93.18% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6361      +/-   ##
==========================================
+ Coverage   88.93%   89.78%   +0.84%     
==========================================
  Files         357      357              
  Lines       99195    98827     -368     
==========================================
+ Hits        88221    88730     +509     
+ Misses       6991     6614     -377     
+ Partials     3983     3483     -500     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Member

Hywan commented Apr 22, 2026

Thank you for your contribution.

The state machine is as follows. The initial state is Init. Then, the transitions are (not exhaustive, just the ones we are interested in):

  • Init -> SettingUp, a first request is made, with a SlidingSyncMode::Selective
  • SettingUp -> Running, a second request is made, with a SlidingSyncMode::Growing with the initial batch size of 100
  • Running -> Running, a third request is made, with the same growing mode, and the batch increases to 200
  • etc.

The timeout is set to 0 for the requests made in the SettingUp state (and Error, Recovering, etc.). For the requests made in the Running state, the timeout is set to 0 as long as the list is not fully loaded, i.e. when not all rooms have been “paginated”. In other words, the client enters long-polling as soon as all rooms have been paginated/synced at least once.
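As a hedged sketch of the pre-PR policy described above (simplified stand-in types; `is_fully_loaded` stands in for the request generator's "all rooms paginated" flag, and the names are illustrative rather than the SDK's actual API):

```rust
// Simplified model of the pre-PR poll-timeout policy described above.
#[derive(Debug, PartialEq)]
pub enum State {
    Init,
    SettingUp,
    Recovering,
    Running,
    Error,
    Terminated,
}

#[derive(Debug, PartialEq)]
pub enum PollTimeout {
    Some(u64),
    Default,
}

/// Pre-PR behavior: every state forces `timeout=0` until we are in
/// `Running` *and* the list is fully loaded; only then does the sync
/// loop switch to the default long-poll timeout.
pub fn poll_timeout_pre_pr(state: &State, is_fully_loaded: bool) -> PollTimeout {
    match state {
        State::Running if is_fully_loaded => PollTimeout::Default,
        _ => PollTimeout::Some(0),
    }
}
```

This is the arm the reported loop sits in: `Running` with `is_fully_loaded == false` keeps producing `timeout=0` requests.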

The only place I see where we can save one request with no timeout is SettingUp -> Running. Indeed, for the server to return nothing here, we need a number of rooms smaller than the range used in Selective, with no updates. Maybe I'm misunderstanding your use case. Can you give me a concrete example with the number of rooms, number of updates, etc., please?

After a proof-read, I think I understand your use case: an account with e.g. 800 rooms but with very little activity: we have to paginate over all the rooms, but the server has nothing to return, so it's a sequence of quick requests and empty responses. Is it correct?

The problem is that the SyncIndicator depends on the state machine. If we enter long-polling in SettingUp immediately, the system will believe there is a lag and will display the SyncIndicator.

I want to remove this SyncIndicator API entirely, and replace it with a NetworkStatus-ish API. Instead of displaying a “sync loader”, I prefer to provide a “wait on network loader”, which seems more accurate.

Per review on matrix-org#6361: long-polling in
`SettingUp` / `Recovering` extends the time spent outside `Running` and
falsely triggers `SyncIndicator` (which derives `Show` from how long the
state machine stays in non-`Running` states). Roll those states back to
`PollTimeout::Some(0)` and limit the policy change to collapsing the
`Running` branch so it always long-polls regardless of
`is_fully_loaded()`.

The idle-loop symptom from Robrix2 + local Palpo (MSC4186) still goes
away because the loop sits in `Running` + `!is_fully_loaded` — the
client repeatedly issues `pos=<n>&timeout=0` while the server has
nothing to deliver for the next Growing batch.

Tests:
- Integration `test_sync_all_states`: `SettingUp -> Running` request
  goes back to `timeout=0`; `Running -> Running` requests stay at
  `timeout=30000`.
- Unit `test_long_poll_once_running` (renamed): drive 3 sync cycles,
  assert the `SettingUp` request still carries `timeout=0` AND the
  first `Running` request carries `timeout=30000`, defending the
  `SyncIndicator` invariant.
Author

TigerInYourDream commented Apr 23, 2026

Thanks for the careful walkthrough — it really helps me see where my change overreaches. Let me re-anchor the reproduction with concrete details and then propose a narrower fix that respects your SyncIndicator concern.

Where the report came from

I hit this while testing Robrix2 against a local Palpo homeserver (an MSC4186-capable implementation). The trigger is not specifically "many rooms" — it is simply "the client has reached idle after the initial sync". After that, server logs show a non-stop stream of:

/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=539&timeout=0

with the same pos repeating back-to-back. Full investigation tracked in Project-Robius-China/robrix2#30.

Why it's a two-sided bug, and why the SDK side still matters

The loop is the product of two independent issues:

  1. SDK side (this PR): requires_timeout was forcing timeout=0 for every post-init state. The client was actively opting out of long-polling.
  2. Server side (palpo-im/palpo#72): Palpo's MSC4186 route was treating static metadata (one-time-key counts, empty account_data / typing containers, unchanged list counts) as "non-empty updates", so even when the client did long-poll, the server short-circuited and answered immediately.

Crucially: when I applied just the SDK-side fix and left the Palpo bug in place, the tight loop already stopped in our local Palpo deployment (verified in Project-Robius-China/robrix2#101). So this PR is a complete fix for the user-visible symptom on the client side, independent of the server work.

Where my original patch overreached

You're right that the loop is not specifically a SettingUp -> Running hop. It sits in Running while request_generator.is_fully_loaded() stays false — Growing keeps wanting the next batch, the server keeps having nothing for that batch, and the client keeps spinning. That's the branch I really need to fix.

Pushing SettingUp / Recovering to PollTimeout::Default was overreach on my part: as you pointed out, SyncIndicator derives its Show decision from how long we stay outside Running (State::SettingUp | State::Error { .. }), so long-polling there would falsely trigger the loading hint. That's a real UX regression and not worth taking.

Narrower change I just pushed (a5d7247a6)

The policy change is now limited to the Running branch:

match observable_state.get() {
    State::Init
    | State::SettingUp
    | State::Recovering
    | State::Error { .. }
    | State::Terminated { .. } => PollTimeout::Some(0),

    // Once we're `Running`, let the server long-poll regardless of
    // whether the list is fully loaded. If the next Growing batch is
    // ready (or any update is queued) the server can still answer
    // immediately; otherwise it holds the connection instead of forcing
    // the client to spin on empty responses.
    State::Running => PollTimeout::Default,
}

The behavior change reduces to a single rule: collapse the Running branch so it always long-polls, instead of conditioning on is_fully_loaded(). By the time we are in Running, SyncIndicator is already in the Hide group, so this is invisible to that API.

Tests updated accordingly:

  • test_sync_all_states (integration): the SettingUp -> Running request goes back to asserting timeout=0; the three Running -> Running requests assert timeout=30000 (which previously only held after is_fully_loaded() flipped true).
  • Renamed the new unit test to test_long_poll_once_running, drove three sync.next().await cycles, and now assert both that the SettingUp request still carries timeout=0 and that the first Running request carries timeout=30000. That defends the SyncIndicator invariant in CI.

Happy to defer the broader SyncIndicator -> NetworkStatus rework to your planned redesign — let me know if this scope works.

Author

One more piece of context I should have led with — the loop is not "a few timeout=0 requests during the Growing transition", it's hundreds per minute. The original server log showed pos values up to pos=539 within a single idle session (every response carries a fresh, monotonically-incremented pos, so 539 means ~539 back-to-back requests). That's the concrete data point if you're weighing whether the Running + !is_fully_loaded window is really worth optimizing — at the rates we observed, it's a sustained state, not a brief transition.

The reason is_fully_loaded() does not flip to true and end the loop on its own is that on Palpo, every response carries device_one_time_keys_count, empty extensions.{typing,account_data} containers, and a lists.<id>.count field that can microscopically fluctuate. Each of those is spec-legal — Palpo isn't violating MSC4186, its tests pass, its implementation is internally consistent. But the combination repeatedly nudges the list state, so is_fully_loaded() never settles, and we stay in the Running + !is_fully_loaded arm indefinitely. (palpo-im/palpo#72 retroactively tightens what counts as "empty" on the server side, but that's Palpo voluntarily narrowing a permissive spec reading, not fixing a violation.)
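To make the server-side half concrete, here is a hypothetical sketch (assumed names; not Palpo's actual code) of the tightening described above: static metadata that every response repeats should not count as an "update" that ends a long-poll early.

```rust
// Hypothetical server-side model (illustrative names only, not Palpo's
// real types): decide whether a pending response justifies waking a
// long-polling client.
pub struct SyncUpdates {
    /// One-time-key counts are resent on every response; intentionally
    /// excluded from the "meaningful update" decision below.
    pub one_time_key_counts_present: bool,
    pub typing_events: usize,
    pub account_data_events: usize,
    pub room_updates: usize,
    pub list_count_changed: bool,
}

/// Only a real change should short-circuit the long-poll; repeated
/// static metadata does not qualify.
pub fn has_meaningful_update(u: &SyncUpdates) -> bool {
    u.typing_events > 0
        || u.account_data_events > 0
        || u.room_updates > 0
        || u.list_count_changed
}
```

Under this reading, the idle case (only one-time-key counts and empty extension containers) holds the connection open instead of answering immediately.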

So the takeaway I'd flag for any future NetworkStatus-style rework: MSC4186's "long-poll until there's a change" leaves enough room for divergent-but-compliant server implementations that the SDK can't rely on the server to gate empty responses correctly. This PR makes the client express its long-poll intent unconditionally once we're Running; the failure mode against a permissive server then degrades to "server returns immediately but client does not re-spin", instead of the current tight loop. Surfacing this in case it informs the broader API design you're already considering.
