[AIMIGRAPHX-885] Clean up pad/slice Copies to use external stream and cleanup memory alloc/dealloc#233

Merged
TedThemistokleous merged 7 commits into rocm7.2_internal_testing from update_copies on Apr 17, 2026

@TedThemistokleous TedThemistokleous commented Apr 10, 2026

Description

  1. Clean up copy overhead added by repeated hipAlloc in migraphx_run_program
  2. Set the stream to always be valid (default/external) before using it for copy tensors
  3. Reduce the number of hipMemcpyAsync() calls when allocating inputs
  4. Ensure CopyTensor and CopyTensorAsync use the correct stream and don't block
  5. Clean up stale pad/slice buffers on GPU (seeing remnants in rocm-smi during testing)
  6. Add a pinned memory pool used for the Python API so we have a true CopyTensorAsync

Changes should be backwards compatible: when no external stream is used, we fall back to the default HIP stream.
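The fallback rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `StreamHandle`, `kDefaultStream`, and `CopyStreamSelector` are hypothetical stand-ins so the snippet builds without the HIP SDK. In real HIP code the handle would be a `hipStream_t`, where the null stream is the default stream.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for hipStream_t so the sketch compiles without HIP.
using StreamHandle = std::uintptr_t;

// Models HIP's null stream, which is the default stream.
const StreamHandle kDefaultStream = 0;

// Hypothetical helper: always hands back a usable stream for copies.
// It prefers the caller-provided (external) stream and otherwise
// falls back to the default stream.
class CopyStreamSelector {
public:
    void SetExternalStream(StreamHandle stream) {
        external_ = stream;
        has_external_ = true;
    }

    void ClearExternalStream() { has_external_ = false; }

    // Never returns an "unset" value: external stream if set, default otherwise.
    StreamHandle Get() const { return has_external_ ? external_ : kDefaultStream; }

    bool UsingExternal() const { return has_external_; }

private:
    StreamHandle external_ = kDefaultStream;
    bool has_external_ = false;
};

// Small usage example showing the backwards-compatible fallback.
inline void ExampleUsage() {
    CopyStreamSelector selector;
    assert(selector.Get() == kDefaultStream);  // no external stream: default is used

    selector.SetExternalStream(42);            // 42 is an arbitrary illustrative handle
    assert(selector.Get() == 42);              // copies now target the external stream

    selector.ClearExternalStream();
    assert(selector.Get() == kDefaultStream);  // back to the default stream
}
```

Resolving the stream once and passing it down means the copy helpers never have to guess which stream they should be on.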

Motivation and Context

Cleans up and reduces memory overhead by ensuring copies are performed on the correct stream, easing overall pressure through either the C or Python API.

Seeing this with a few customer workloads where the copies are the larger bottleneck even though the main model runs efficiently. The copies add enough overhead that they overshadow any MIGraphX perf improvement below a certain threshold.

Used in tandem with ROCm/AMDMIGraphX#4775, this gave a large perf boost from the reduced copy overhead.

…we just pass stream instead of constantly calling ORT to get the compute stream
…emCpyAsync calls per pad

This should help with HBM bandwidth utilization.
Make async calls truly async by working through a pinned memory buffer, so the caller can return immediately instead of blocking.
This should help with the Python API. CopyTensor now accepts a HIP stream for proper sync instead of always synchronizing on the default stream.
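One way to picture the pinned staging pool described above is a small free list of reusable host buffers. The sketch below only models the reuse logic under stated assumptions: `StagingBuffer` and `StagingPool` are hypothetical names, and plain heap memory stands in for pinned memory. A real version would presumably allocate with `hipHostMalloc`, and would return a buffer to the pool only after the copy enqueued on the stream has completed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pinned host allocation. Ordinary heap memory
// is used here so the sketch builds without the HIP SDK.
struct StagingBuffer {
    std::vector<unsigned char> storage;
    std::size_t capacity() const { return storage.size(); }
};

// Hypothetical pool: reuses released buffers instead of allocating per copy.
class StagingPool {
public:
    // Returns an idle buffer that is large enough, or allocates a new one.
    std::unique_ptr<StagingBuffer> Acquire(std::size_t bytes) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if ((*it)->capacity() >= bytes) {
                std::unique_ptr<StagingBuffer> buf = std::move(*it);
                free_.erase(it);
                return buf;
            }
        }
        ++allocations_;
        std::unique_ptr<StagingBuffer> buf(new StagingBuffer());
        buf->storage.resize(bytes);
        return buf;
    }

    // Hands a buffer back for reuse. In a real async path this should only
    // happen once the copy that used the buffer has finished on its stream.
    void Release(std::unique_ptr<StagingBuffer> buf) {
        assert(buf != nullptr);
        free_.push_back(std::move(buf));
    }

    std::size_t allocations() const { return allocations_; }
    std::size_t idle() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<StagingBuffer>> free_;
    std::size_t allocations_ = 0;
};
```

With a pool like this, repeated copies of similar size reuse the same staging memory, so the allocation cost is paid once rather than on every call.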
@TedThemistokleous TedThemistokleous changed the title Clean up pad/slice Copies to use external stream and cleanup memory alloc/dealloc [AIMIGRAPHX-885] Clean up pad/slice Copies to use external stream and cleanup memory alloc/dealloc Apr 10, 2026
@TedThemistokleous TedThemistokleous merged commit 755ff27 into rocm7.2_internal_testing Apr 17, 2026
5 of 7 checks passed
@TedThemistokleous TedThemistokleous deleted the update_copies branch April 17, 2026 14:47
@TedThemistokleous TedThemistokleous restored the update_copies branch April 20, 2026 19:18