Decentralized learning stalls during training

https://github.com/epfml/disco/blob/c2c21118abc85730092e7814a943a50d777ab87f/discojs/src/client/decentralized/decentralized_client.ts#L153-L156

Decentralized learning sometimes stalls after the first few rounds.
This appears to be caused by differences in how quickly nodes establish connections with their peers in each round.

Currently, as soon as a node finishes `establishPeerConnections()`, it proceeds to `exchangeWeightUpdates()` and starts sending weight updates to its peers, without checking whether they are ready. If some peers are slower to complete `establishPeerConnections()`, they may not yet be able to receive these updates.

As a result, slower nodes can miss incoming weight updates and eventually time out while waiting for them, causing the entire training process to stall.
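
To make the suspected race concrete, here is a minimal in-memory simulation. It is only a sketch of the hypothesis above, not disco's actual networking code: `SimulatedPeer`, `setupDelayMs`, `deliver`, and `runRound` are hypothetical names, and it assumes that an update arriving before the receiver has finished its own connection setup is simply lost.

```typescript
// Hypothetical in-memory stand-in for a peer; none of these names come from disco.
class SimulatedPeer {
  private listening = false
  readonly received: string[] = []
  readonly dropped: string[] = []

  constructor(readonly id: string, private readonly setupDelayMs: number) {}

  // Stands in for establishPeerConnections(): the peer only starts
  // accepting weight updates once its setup delay has elapsed.
  async establishPeerConnections(): Promise<void> {
    await new Promise<void>((resolve) => setTimeout(resolve, this.setupDelayMs))
    this.listening = true
  }

  // Called when a remote peer sends an update to this peer.
  // Assumption: updates that arrive before setup is complete are lost.
  deliver(update: string): void {
    if (this.listening) this.received.push(update)
    else this.dropped.push(update)
  }
}

// One round without synchronization: each peer sends as soon as
// its own setup is done, regardless of the state of the others.
async function runRound(peers: SimulatedPeer[]): Promise<void> {
  await Promise.all(peers.map(async (peer) => {
    await peer.establishPeerConnections()
    for (const other of peers) {
      if (other !== peer) other.deliver(`update from ${peer.id}`)
    }
  }))
}
```

With one fast peer and one slow peer, the fast peer's update lands in the slow peer's `dropped` list, so the slow peer would wait for an update that never arrives.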

	// Create peer-to-peer connections with all peers for the round
	await this.establishPeerConnections()
	// Exchange weight updates with peers and return aggregated weights
	return await this.exchangeWeightUpdates(weights)
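
One possible mitigation, offered as a sketch rather than a tested fix, is a readiness barrier between the two calls: each node announces that its connections are established and sends updates only after every peer in the round has announced the same. The `ReadinessBarrier` class below is hypothetical and purely local. How ready signals would travel between peers, and how they would map onto disco's message types, is left open.

```typescript
// Hypothetical barrier: waiters are released once every expected peer
// has signalled that its connection setup is complete.
class ReadinessBarrier {
  private readonly ready = new Set<string>()
  private readonly waiters: Array<() => void> = []

  constructor(private readonly expectedPeers: ReadonlySet<string>) {}

  // Record that `peerId` finished its connection setup.
  // Signals from peers outside the round are ignored.
  markReady(peerId: string): void {
    if (!this.expectedPeers.has(peerId)) return
    this.ready.add(peerId)
    if (this.ready.size === this.expectedPeers.size) {
      for (const release of this.waiters.splice(0)) release()
    }
  }

  // Resolves once all expected peers are ready; rejects after `timeoutMs`
  // so that a missing peer produces an explicit error instead of a silent stall.
  waitForAll(timeoutMs: number): Promise<void> {
    if (this.ready.size === this.expectedPeers.size) return Promise.resolve()
    return new Promise<void>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error('timed out waiting for peers to become ready')),
        timeoutMs
      )
      this.waiters.push(() => {
        clearTimeout(timer)
        resolve()
      })
    })
  }
}
```

In the snippet above, the wait would sit between `establishPeerConnections()` and `exchangeWeightUpdates(weights)`. The cost is one extra round trip of ready signals per round, in exchange for no node sending to a peer that cannot yet receive.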


Decentralized learning stalls during training #1109
