IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling#753

Open
chhwang wants to merge 17 commits into main from chhwang/fix-ib-no-atomic

@chhwang chhwang commented Feb 24, 2026

Fix potential memory inconsistency in IB host-no-atomic mode, and reduce latency overhead by introducing GDRCopy.

  • IB no-atomic: an 8-byte RDMA write-with-imm carries the full 64-bit token to the remote GPU signal buffer. The remote host reads the token from that buffer before updating the inbound token, which enforces strict data-flag ordering.
  • GDRCopy: the recv thread reads the token via BAR1 (CUDA) or uncached GPU memory (ROCm), so cudaMemcpyAsync and a CUDA stream are no longer needed.
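
The ordering described in the two points above can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the code in this PR: `SimulatedEndpoint`, `simulateRemoteWrites`, `onRecvCompletion`, and `tryConsume` are hypothetical names, plain host memory stands in for the GPU signal buffer and its BAR1 mapping, and no RDMA or GDRCopy call is made. It only shows the intended sequence: the token is read back from the signal buffer first and published as the inbound token afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-only model; none of these names come from the PR.
struct SimulatedEndpoint {
  uint64_t payload = 0;                   // stands in for the RDMA-written data buffer
  volatile uint64_t signalBuffer = 0;     // stands in for the GPU signal buffer
  std::atomic<uint64_t> inboundToken{0};  // the value a consumer polls on
};

// Models the remote side: a data write followed by the 8-byte write
// that carries the full 64-bit token into the signal buffer.
inline void simulateRemoteWrites(SimulatedEndpoint& ep, uint64_t data, uint64_t token) {
  ep.payload = data;
  ep.signalBuffer = token;
}

// Models the recv thread handling a completion: read the token back from
// the signal buffer first, then publish it as the inbound token.
inline uint64_t onRecvCompletion(SimulatedEndpoint& ep) {
  uint64_t token = ep.signalBuffer;  // read-back step (BAR1 read in the real path)
  ep.inboundToken.store(token, std::memory_order_release);
  return token;
}

// Models a consumer: the payload is only read once the expected token is visible.
inline bool tryConsume(SimulatedEndpoint& ep, uint64_t expected, uint64_t& out) {
  if (ep.inboundToken.load(std::memory_order_acquire) < expected) return false;
  out = ep.payload;
  return true;
}
```

In this model a consumer calling `tryConsume` sees nothing until `onRecvCompletion` has run, even if the remote writes have already landed, which mirrors the "read the signal buffer, then update the inbound token" sequence described above.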

@chhwang chhwang requested a review from a team February 25, 2026 04:42

chhwang commented Feb 26, 2026

/azp run mscclpp-ut

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines +223 to +227
#if defined(DEBUG_CUFLUSH) && defined(MSCCLPP_USE_CUDA)
// cuFlush path: read from imm_data then flush NIC->GPU write pipeline for visibility.
newValueHost = static_cast<uint64_t>(qp->getRecvWcImmData(i));
MSCCLPP_CUTHROW(cuFlushGPUDirectRDMAWrites(CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TARGET_CURRENT_CTX,
CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TO_OWNER));

Do we need to keep this code here?

// Direct host-side write to GPU memory via GDRCopy BAR1 mapping
remoteUpdateDstAddrMap_->copyTo(&newValueHost, sizeof(uint64_t));
} else {
*dstPtr = newValueHost;

Is this valid for CUDA? Maybe we can throw an error if the dstAddrMap is invalid in a CUDA environment.
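
One way the suggested guard could look is sketched below. It is a minimal illustration under assumptions: `HostMapping`, `writeToken`, and the `requireMapping` flag are hypothetical stand-ins (the flag models "this is a CUDA build"), and the mapping is backed by ordinary host memory rather than a real GDRCopy mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for a GDRCopy-style host mapping of a GPU address.
struct HostMapping {
  uint64_t* hostPtr;
  void copyTo(const void* src, std::size_t bytes) { std::memcpy(hostPtr, src, bytes); }
};

// Writes the token through the mapping when one exists. Without a mapping,
// the direct store is only allowed when requireMapping is false; otherwise
// the function throws instead of dereferencing dstPtr.
inline void writeToken(HostMapping* map, uint64_t* dstPtr, uint64_t value, bool requireMapping) {
  if (map != nullptr) {
    map->copyTo(&value, sizeof(value));
    return;
  }
  if (requireMapping) {
    throw std::runtime_error("no host mapping for the destination address");
  }
  *dstPtr = value;
}
```

With this shape, the plain `*dstPtr = value` fallback stays available for environments where the destination is host-accessible, while a missing mapping in the guarded case fails loudly.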

#endif

// Read dstGpuAddr from the local stored address (set by setRemoteUpdateDstAddr)
uint64_t dstGpuAddr = remoteUpdateDstAddr_;

I'm a bit confused about this variable. If we use host2hostSemaphore, is this address a host address?

@chhwang chhwang changed the title from "Improved IB host-no-atomic mode" to "IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling" Mar 4, 2026
@chhwang chhwang changed the title from "IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling" to "IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling" Mar 4, 2026