IB `host-no-atomic`: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling by chhwang · Pull Request #753 · microsoft/mscclpp

chhwang · 2026-02-24T02:16:59Z

Fixes a potential memory inconsistency in IB host-no-atomic mode and reduces latency overhead by introducing GDRCopy.

IB no-atomic: an 8-byte RDMA write-with-imm carries the full 64-bit token to the remote GPU signal buffer. The remote host reads the token from that buffer before updating the inbound token, which enforces strict data-flag ordering.
GDRCopy: the recv thread reads the token via BAR1 (CUDA) or uncached GPU memory (ROCm), so cudaMemcpyAsync and a CUDA stream are no longer needed.
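
The receive-side ordering described above can be sketched in plain C++. This is a host-only illustration, not the code in this PR: `SignalBuffer`, `InboundToken`, `simulateRdmaWrite`, and `onRecvCompletion` are hypothetical names, and ordinary host memory stands in for the GPU signal buffer that the PR reads through GDRCopy.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the GPU signal buffer that the NIC writes into.
// In the PR this is GPU memory read via BAR1 (CUDA) or uncached GPU memory
// (ROCm); here it is ordinary host memory.
struct SignalBuffer {
  volatile uint64_t token = 0;
};

// Hypothetical stand-in for the inbound token that the waiting side polls.
struct InboundToken {
  std::atomic<uint64_t> value{0};
};

// Simulates the NIC landing the 8-byte write-with-imm payload in the buffer.
inline void simulateRdmaWrite(SignalBuffer& buf, uint64_t token) { buf.token = token; }

// Assumed to be called by the recv thread once per receive completion.
// The token is read from the signal buffer first, and only then published
// to the inbound token, so the published value is the one that landed.
inline uint64_t onRecvCompletion(const SignalBuffer& buf, InboundToken& inbound) {
  uint64_t newValue = buf.token;  // read the full 64-bit token from the buffer
  inbound.value.store(newValue, std::memory_order_release);  // then publish it
  return newValue;
}
```

A caller would invoke `onRecvCompletion` for each completion it polls. Reading from the buffer rather than from the 32-bit immediate field is what lets the sketch carry a token wider than 32 bits.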

chhwang · 2026-02-26T20:42:06Z

/azp run mscclpp-ut

azure-pipelines · 2026-02-26T20:42:21Z

Azure Pipelines successfully started running 1 pipeline(s).

Binyang2014 · 2026-02-26T22:59:03Z

src/core/connection.cc

+#if defined(DEBUG_CUFLUSH) && defined(MSCCLPP_USE_CUDA)
+      // cuFlush path: read from imm_data then flush NIC->GPU write pipeline for visibility.
+      newValueHost = static_cast<uint64_t>(qp->getRecvWcImmData(i));
+      MSCCLPP_CUTHROW(cuFlushGPUDirectRDMAWrites(CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TARGET_CURRENT_CTX,
+                                                 CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TO_OWNER));


Do we need to keep this code here?

Binyang2014 · 2026-02-26T23:56:43Z

src/core/connection.cc

+          // Direct host-side write to GPU memory via GDRCopy BAR1 mapping
+          remoteUpdateDstAddrMap_->copyTo(&newValueHost, sizeof(uint64_t));
+        } else {
+          *dstPtr = newValueHost;


Is this valid for CUDA? Maybe we should throw an error if the dstAddrMap is invalid in a CUDA environment.
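
One way the suggestion could look, as a self-contained sketch rather than the actual change: `FakeGdrMap`, `writeToken`, and the `requireMap` flag are hypothetical, and only the `copyTo` call shape is taken from the diff above. The flag stands in for "running in a CUDA environment", where a plain host-side dereference of the destination is assumed not to be valid.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for the GDRCopy mapping object in the diff.
struct FakeGdrMap {
  uint64_t* mapped;  // host-visible address of the mapped destination
  void copyTo(const void* src, std::size_t bytes) { std::memcpy(mapped, src, bytes); }
};

// Writes the token through the mapping when one exists. Without a mapping,
// either fails loudly (requireMap == true) or falls back to a direct store.
inline void writeToken(FakeGdrMap* map, uint64_t* dstPtr, uint64_t newValueHost, bool requireMap) {
  if (map != nullptr) {
    map->copyTo(&newValueHost, sizeof(uint64_t));
  } else if (requireMap) {
    throw std::runtime_error("destination address mapping is required but unavailable");
  } else {
    *dstPtr = newValueHost;  // direct store, assumed valid only when dstPtr is host-accessible
  }
}
```

Failing early in the no-mapping case turns a silent invalid store into a visible error, at the cost of one extra branch on the receive path.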

Binyang2014 · 2026-02-27T01:18:48Z

src/core/connection.cc

+#endif

      // Read dstGpuAddr from the local stored address (set by setRemoteUpdateDstAddr)
      uint64_t dstGpuAddr = remoteUpdateDstAddr_;


I'm a bit confused about this variable. If we use host2hostSemaphore, is this address a host address?

chhwang added 11 commits February 21, 2026 00:02

WIP; need amd fix

febdbf9

rocm fix wip

54e46ba

rocm fixes

98b023a

gdrcopy install in container

22e5efb

updates

25f31b4

Merge branch 'main' into chhwang/fix-ib-no-atomic

75dfdd9

updates

ac4d713

a few updates

ac022c3

License

72407af

License

8effd97

License, lint

fd7358d

chhwang requested a review from a team February 25, 2026 04:42

chhwang added 3 commits February 25, 2026 19:59

optimized recv loop

67d1706

updates

060982d

Merge branch 'main' into chhwang/fix-ib-no-atomic

6b2f819

Binyang2014 reviewed Feb 27, 2026

data direct

3b56b08

chhwang changed the title ~~Improved IB host-no-atomic mode~~ IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling Mar 4, 2026

chhwang changed the title ~~IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling~~ IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling Mar 4, 2026

chhwang added 2 commits March 5, 2026 22:59

updates

448ceb6

Updates

7ce841b

chhwang wants to merge 17 commits into main from chhwang/fix-ib-no-atomic
