
perf(send): Parallelize per-device encryption for large group messages #900

Open

jlucaso1 wants to merge 3 commits into tulir:main from jlucaso1:optimize-group-send-goroutine

Conversation

@jlucaso1

A simple benchmark with a group of 117 members (both runs use fresh sessions and measure the first message sent to the group):

* **Without Goroutines (`log-go-first.txt`):**
    * Initial message received: `20:46:02.486`
    * Encrypted group reply sent: `20:46:06.922`
    * **Total Time Elapsed: 4.436 seconds**
    * A warning in the log confirms the long processing time: `Node handling took 5.317818646s`

* **With Goroutines (`log-go-first-with-goroutine.txt`):**
    * Initial message received: `21:10:33.447`
    * Encrypted group reply sent: `21:10:36.138`
    * **Total Time Elapsed: 2.691 seconds**
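
Roughly, the change fans the per-device encryption out over a pool of goroutines. A minimal sketch of the pattern (simplified; `Device`, `encryptForDevice`, etc. are placeholders, not the actual send.go code):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Device and encryptForDevice are placeholders for the real per-device JIDs
// and the CPU-bound Signal encryption step in send.go.
type Device string

type EncryptedNode struct {
	Device  Device
	Payload []byte
}

func encryptForDevice(d Device, plaintext []byte) EncryptedNode {
	// Stand-in for the libsignal session encrypt call.
	return EncryptedNode{Device: d, Payload: append([]byte(nil), plaintext...)}
}

// encryptParallel fans the per-device work out over a fixed pool of workers
// and keeps the result slice in input order (each worker writes its own index).
func encryptParallel(devices []Device, plaintext []byte, workers int) []EncryptedNode {
	if workers <= 0 {
		workers = runtime.NumCPU() // 12 on the machine used for the benchmark above
	}
	out := make([]EncryptedNode, len(devices))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = encryptForDevice(devices[i], plaintext)
			}
		}()
	}
	for i := range devices {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	devices := []Device{"alice.0", "alice.1", "bob.0"}
	fmt.Println(len(encryptParallel(devices, []byte("hello group"), 0)), "participant nodes built")
}
```

The worker count is capped so a huge group doesn't spawn hundreds of goroutines at once; since the work is CPU-bound Signal encryption, sizing the pool to the CPU count is the natural default.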

@w3nder

w3nder commented Aug 14, 2025

Cool, take a look @tulir

@purpshell
Contributor

> A simple benchmark with a group of 117 members (both runs use fresh sessions and measure the first message sent to the group):
>
> * **Without Goroutines (`log-go-first.txt`):** Total Time Elapsed: 4.436 seconds
> * **With Goroutines (`log-go-first-with-goroutine.txt`):** Total Time Elapsed: 2.691 seconds

How many CPUs / routines did you spin up for it to halve the time? 2 or more? 🤔

@jlucaso1
Author

> How many CPUs / routines did you spin up for it to halve the time? 2 or more? 🤔

12 (configured based on my cpu)

Contributor

@purpshell left a comment

Some food for thought

Comment thread send.go
if jid == ownJID || jid == ownLID {

// Heuristic: below this size, sequential loop is cheaper than goroutine scheduling.
const parallelThreshold = 8
Contributor

🤔 How was this number set? So, if I had 9 devices, you'd start new routines up to however many `EncryptConsistency` is (let's say I have 4 CPUs)... Have you considered whether that's faster than just letting the same single routine process that one extra device?
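
One way to set it empirically rather than guessing: a throwaway benchmark like the sketch below (`fakeEncrypt` is just a CPU-bound stand-in for the real libsignal call), comparing the sequential loop against the goroutine fan-out at increasing device counts and taking the crossover point as the threshold.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
	"testing"
)

// fakeEncrypt stands in for the per-device Signal encryption step; it only
// exists to give the benchmark some CPU-bound work.
func fakeEncrypt(payload []byte) [32]byte {
	h := payload
	var sum [32]byte
	for i := 0; i < 200; i++ {
		sum = sha256.Sum256(h)
		h = sum[:]
	}
	return sum
}

func sequential(n int, payload []byte) {
	for i := 0; i < n; i++ {
		fakeEncrypt(payload)
	}
}

func parallel(n int, payload []byte) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fakeEncrypt(payload)
		}()
	}
	wg.Wait()
}

func main() {
	payload := []byte("probe")
	for _, n := range []int{2, 4, 8, 16, 32, 64} {
		seq := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sequential(n, payload)
			}
		})
		par := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				parallel(n, payload)
			}
		})
		// The smallest n where the parallel column wins is a defensible threshold.
		fmt.Printf("devices=%-3d sequential=%8dns/op parallel=%8dns/op\n", n, seq.NsPerOp(), par.NsPerOp())
	}
}
```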

Comment thread send.go
}

if len(allDevices) < parallelThreshold || concurrency == 1 {
// Fall back to original sequential implementation for small batches
Contributor

Is there a way to minimize the amount of duplicated code here?
Maybe abstract this into a function that can be run in a goroutine or sequentially if threading is not possible?
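
Something like the sketch below (hypothetical names, not the actual send.go locals): keep a single `encryptOne` body and only swap the driver, so the sequential fallback and the goroutine path can't drift apart.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the send.go locals; only the shape matters here.
type device string

type node struct{ d device }

func encryptForDevice(d device) node { return node{d: d} }

// encryptAll keeps one body for the per-device work and only chooses the
// driver, so there is no duplicated loop to keep in sync.
func encryptAll(devices []device, concurrency, parallelThreshold int) []node {
	out := make([]node, len(devices))
	encryptOne := func(i int) { out[i] = encryptForDevice(devices[i]) }

	if len(devices) < parallelThreshold || concurrency <= 1 {
		for i := range devices {
			encryptOne(i) // sequential path reuses the exact same body
		}
		return out
	}

	sem := make(chan struct{}, concurrency) // bounds in-flight goroutines
	var wg sync.WaitGroup
	for i := range devices {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			encryptOne(i)
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(len(encryptAll([]device{"a", "b", "c"}, 4, 8)), "nodes")
}
```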

Comment thread send.go
return participantNodes, includeIdentity
}

func (cli *Client) retryEncryptMissing(
Contributor

Is it reasonable to abstract this logic outside of the function? 🤔

@purpshell
Contributor

purpshell commented Aug 15, 2025

> How many CPUs / routines did you spin up for it to halve the time? 2 or more? 🤔
>
> 12 (configured based on my cpu)

So after running 12 routines for 117 members, you only got roughly a 1.6× improvement (4.436s → 2.691s). That doesn't sound as impressive as it could be.
Probably because of the way routines work (Go schedules them itself and puts multiple routines on one thread). I wonder if there's a way to make it even more efficient.

EDIT: I wanted to add that libsignal's complex logic/math is probably what's limiting us here. Either we find a way to use even more threads (maybe by temporarily setting runtime.GOMAXPROCS to the number you're using here), or there's a diminishing-returns situation.
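
For reference, the GOMAXPROCS idea would look roughly like the sketch below. Note that since Go 1.5 GOMAXPROCS already defaults to runtime.NumCPU(), so this only changes anything if something lowered it (e.g. an environment override); it isn't a free speedup for CPU-bound libsignal work.

```go
package main

import "runtime"

// withAllCPUs temporarily raises GOMAXPROCS for the duration of fn and then
// restores the previous value. Since Go 1.5 the default is already
// runtime.NumCPU(), so this only matters if the process was started with a
// lower setting.
func withAllCPUs(fn func()) {
	prev := runtime.GOMAXPROCS(runtime.NumCPU())
	defer runtime.GOMAXPROCS(prev)
	fn()
}

func main() {
	withAllCPUs(func() {
		// run the parallel encryption here
	})
}
```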

@Manjit2003

What happened to the cache method? We were using some kind of cache for group encryption, right? Goroutines seem like a very hacky solution here.

@Prodigy90

I extensively optimized this by batching operations and parallelizing encryption across the available CPU cores for ~50k+ contacts:

* After warming the cache, performance improved significantly
* Achieved ~10s to send status updates to 10k contacts
* CPU utilization spiked to 100% across all available cores during operations

However, the main constraint remains the sequential session builder setup:


builder := session.NewBuilderFromSignal(cli.Store, to.SignalAddress(), pbSerializer)

Even with parallelization, we still wait for builder initialization across all devices.

Beta testing revealed critical problems with concurrent users attempting to post statuses to thousands of contacts:

1. Cache invalidation: with 10-50k+ contacts fetched in random order, the LRU cache becomes ineffective
2. DB bottleneck: thousands of parallel DB requests severely degrade query performance
3. Resource contention: maxing out the CPU cores isn't viable for multi-user scenarios, since the entire app grinds to a halt

Instead of implementing complex global concurrency control (which would require extensive refactoring), I kept the default behavior and added a participants section that bypasses fetching the status contacts (using @devlikepro's PR #800), moving status recipient control to the application level.

In my experience, parallelization is effective for individual users and smaller lists; systems with multiple concurrent clients need refactoring to prevent DB overload during parallel encryption operations.

I may be at the limits of my programming knowledge here, but hopefully this insight proves useful for the refactoring.
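
To make the DB point concrete, here is a rough sketch of the split I think a refactor would need (hypothetical types, not whatsmeow's actual store API): one batched session fetch up front, then CPU-only encryption over a bounded worker pool, so the database never sees one query per device.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical shapes; whatsmeow's real Store and session types differ.
type session struct{ jid string }

type store struct{}

// GetManySessions stands in for a single batched query ("... WHERE jid IN (...)")
// instead of one round-trip per device.
func (s *store) GetManySessions(jids []string) map[string]*session {
	out := make(map[string]*session, len(jids))
	for _, j := range jids {
		out[j] = &session{jid: j}
	}
	return out
}

func encrypt(sess *session, payload []byte) []byte {
	// Placeholder for the CPU-bound libsignal call.
	return append([]byte(sess.jid+":"), payload...)
}

// encryptGroup does one batched DB read, then fans the CPU-bound encryption
// out over a bounded number of workers, so DB load stays constant no matter
// how many devices or concurrent clients there are.
func encryptGroup(st *store, jids []string, payload []byte, workers int) [][]byte {
	sessions := st.GetManySessions(jids) // phase 1: single batched fetch
	out := make([][]byte, len(jids))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = encrypt(sessions[jids[i]], payload) // phase 2: CPU only
			}
		}()
	}
	for i := range jids {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	fmt.Println(len(encryptGroup(&store{}, []string{"a", "b"}, []byte("status"), 4)), "ciphertexts")
}
```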

@jlucaso1
Author

Made some improvements here with a scoped cache. An already-cached group now responds to a ping in about 200 ms (feels instant).
I will put together a proper benchmark and clean up the current spaghetti code soon.

@suhwr
Contributor

suhwr commented Oct 4, 2025

I just want to ask, is this PR still being worked on? :)
