perf(send): Parallelize per-device encryption for large group messages#900
jlucaso1 wants to merge 3 commits into tulir:main from
Conversation
Cool, take a look @tulir
How many CPUs / goroutines did you spin up for it to halve the time? 2 or more? 🤔
12 (configured based on my cpu) |
if jid == ownJID || jid == ownLID {

// Heuristic: below this size, sequential loop is cheaper than goroutine scheduling.
const parallelThreshold = 8
🤔 How was this number set? So, if I had 9 devices, you'd start a new routine for however many EncryptConsistency is (let's say I have 4 CPUs). Have you considered whether that's faster than just letting the same 1 routine process just that one more device?
}

if len(allDevices) < parallelThreshold || concurrency == 1 {
	// Fall back to original sequential implementation for small batches
Is there a way to minimize the amount of duplicated code here?
Maybe abstract this into a function that can be run in a goroutine or run sequentially if threading is not possible?
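A minimal sketch of what such an abstraction could look like, assuming a hypothetical per-device helper (`encryptOneDevice` is an invented name, and the errgroup fan-out is only one possible shape for the PR's worker pool; `parallelThreshold` and `concurrency` are taken from the diff above):

```go
import (
	"context"

	"golang.org/x/sync/errgroup"

	waBinary "go.mau.fi/whatsmeow/binary"
	"go.mau.fi/whatsmeow/types"
)

// encryptDevices runs the same per-device encryption helper either
// sequentially (small batches) or fanned out over a bounded worker pool.
func (cli *Client) encryptDevices(ctx context.Context, allDevices []types.JID, plaintext []byte) ([]waBinary.Node, error) {
	nodes := make([]waBinary.Node, len(allDevices))
	if len(allDevices) < parallelThreshold || concurrency == 1 {
		// Small batch: goroutine scheduling overhead outweighs any gain.
		for i, jid := range allDevices {
			node, err := cli.encryptOneDevice(ctx, jid, plaintext) // hypothetical helper
			if err != nil {
				return nil, err
			}
			nodes[i] = node
		}
		return nodes, nil
	}
	// Large batch: identical per-device logic, bounded by `concurrency` goroutines.
	var eg errgroup.Group
	eg.SetLimit(concurrency)
	for i, jid := range allDevices {
		i, jid := i, jid
		eg.Go(func() error {
			node, err := cli.encryptOneDevice(ctx, jid, plaintext)
			if err != nil {
				return err
			}
			nodes[i] = node // each goroutine writes a distinct index, so no extra locking is needed
			return nil
		})
	}
	if err := eg.Wait(); err != nil {
		return nil, err
	}
	return nodes, nil
}
```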
return participantNodes, includeIdentity
}

func (cli *Client) retryEncryptMissing(
Is it reasonable to abstract this logic outside of the function? 🤔
So after running 12 routines for 117 members, you only got a "100%" improvement. Doesn't sound as impressive as it could be. EDIT: wanted to add that libsignal's complex logic/math is probably what's limiting us here. Either we can get to use even more threads (maybe by temporarily setting
|
What happened to the cache method? We were using some kind of cache for group encryption, right? Goroutines seem like a very hacky solution here
|
I extensively optimized this by batching operations and parallelizing encryption across available CPU cores for ~50k+ contacts.
However, the main constraint remains the sequential encryption system: even with parallelization, we still wait for builder initialization across all devices. Beta testing revealed critical problems when concurrent users attempted to post a status to thousands of contacts.
Instead of implementing complex global concurrency (extensive refactoring required), I used the default behavior with an added participants section, which bypassed the fetching of status contacts (using @devlikepro's #800 PR) and moved status recipient control to the application level. From my experience, parallelization is effective for individual users and smaller lists, but systems with multiple concurrent clients need refactoring to prevent DB overload during parallel encryption operations. I may be at the end of my programming knowledge here, but I'm sure this insight might prove useful to you when refactoring.
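For the DB-overload concern with many concurrent clients, one possible shape (not from this PR; the semaphore, its size, and `withEncryptSlot` are assumptions for illustration) is a process-wide limit shared by all Client instances:

```go
import (
	"context"
	"runtime"

	"golang.org/x/sync/semaphore"
)

// A process-wide cap on in-flight encryption work, shared by every client in
// the process, so many concurrent clients cannot flood the session store.
var encryptSem = semaphore.NewWeighted(int64(2 * runtime.NumCPU()))

// withEncryptSlot blocks until a slot is free (or ctx is cancelled), runs fn,
// and then releases the slot.
func withEncryptSlot(ctx context.Context, fn func() error) error {
	if err := encryptSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer encryptSem.Release(1)
	return fn()
}
```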
|
Made some improvements here with a scoped cache. An already cached group now responds to a ping in about 200ms (feels instant).
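Purely as an illustration of what a "scoped" cache could look like (the type and field names are invented, not the actual cache used here), a per-group cache of resolved device JIDs guarded by an RWMutex would let a second send to the same group skip the expensive lookups:

```go
import (
	"sync"

	"go.mau.fi/whatsmeow/types"
)

// groupEncryptCache maps a group JID to its resolved device JIDs so repeat
// sends can skip the device query. Invented shape, for illustration only.
type groupEncryptCache struct {
	mu      sync.RWMutex
	devices map[types.JID][]types.JID
}

func (c *groupEncryptCache) get(group types.JID) ([]types.JID, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	d, ok := c.devices[group]
	return d, ok
}

func (c *groupEncryptCache) put(group types.JID, devs []types.JID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.devices == nil {
		c.devices = make(map[types.JID][]types.JID)
	}
	c.devices[group] = devs
}
```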
|
I just want to ask, is this PR still being continued? :)
A simple benchmark with a group of 117 members (both runs use fresh sessions and are the first message sent to the group):