Decentralized learning stalls during training #1109

@ahzero7d1

Description

// Create peer-to-peer connections with all peers for the round
await this.establishPeerConnections()
// Exchange weight updates with peers and return aggregated weights
return await this.exchangeWeightUpdates(weights)

Decentralized learning sometimes stalls after the first few rounds.
This appears to be caused by differences in how quickly nodes establish connections with their peers in each round.

Currently, once a node finishes establishing connections via establishPeerConnections(), it immediately proceeds to exchangeWeightUpdates() and starts sending weight updates to other peers. However, if some peers are slower in completing establishPeerConnections(), they may not yet be ready to receive these updates.

As a result, slower nodes can miss incoming weight updates and eventually time out while waiting for them, causing the entire training process to stall.
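
To make the suspected race concrete, here is a minimal, self-contained sketch that models a single round. It is not the project's code: the `Peer` type, the `stalledPeers` function, the `readyAt` timings, and the `waitForAll` flag are all hypothetical. It also assumes zero network latency and that an update sent before the receiver has finished establishPeerConnections() is simply lost, which may not match the real transport. The `waitForAll` option models one possible mitigation (no peer sends until every peer is ready) purely for comparison.

```typescript
// Hypothetical model of one round: each peer finishes
// establishPeerConnections() at `readyAt` (ms) and, in the behaviour
// described above, starts sending its weight update immediately.
interface Peer {
  id: string
  readyAt: number
}

// Returns the ids of peers that would time out waiting for updates.
// `waitForAll` models a hypothetical readiness barrier: nobody sends
// before the slowest peer is ready.
function stalledPeers(peers: Peer[], waitForAll = false): string[] {
  const slowest = Math.max(...peers.map((p) => p.readyAt))
  const received = new Map<string, number>()
  for (const p of peers) received.set(p.id, 0)

  for (const sender of peers) {
    const sentAt = waitForAll ? slowest : sender.readyAt
    for (const receiver of peers) {
      if (receiver.id === sender.id) continue
      // Assumption: an update sent before the receiver is ready is lost
      // (no latency, no buffering, no retry).
      if (sentAt >= receiver.readyAt) {
        received.set(receiver.id, (received.get(receiver.id) ?? 0) + 1)
      }
    }
  }

  // A peer stalls if it did not hear from every other peer.
  return peers
    .filter((p) => (received.get(p.id) ?? 0) < peers.length - 1)
    .map((p) => p.id)
}

const round: Peer[] = [
  { id: "fast", readyAt: 10 },
  { id: "medium", readyAt: 50 },
  { id: "slow", readyAt: 200 },
]

// Immediate sending: "medium" and "slow" miss updates and stall.
console.log(stalledPeers(round))
// With the hypothetical barrier: no peer stalls.
console.log(stalledPeers(round, true))
```

Under these assumptions only the fastest peer receives every update. Each slower peer misses whatever was sent before it became ready, which matches the stall described above.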
