
[AIMIGRAPHX-885][AIMIGRAPGX-987] Use External Stream Contexts#4775

Open
TedThemistokleous wants to merge 10 commits into develop from use_external_contexts

Conversation

@TedThemistokleous
Collaborator

Motivation

A customer workload is seeing some stalls during inference. This change allows us to use the customer's hipStream passed to the context via run_async, so that we don't need to internally sync and manage a thread within MIGraphX; synchronization is handled externally instead.

As an added benefit, if no external stream is used we fall back to the old fork_join run on the GPU, where we internally create a stream to sync events onto.

Technical Details

Adds additional conditions to the wait_for and finish_on calls in context.cpp so that we avoid new stream creation for async runs, while also simplifying much of the code.

Test cases have been added to ensure we don't break existing functionality.
Additional code ensures external libraries like BLAS and MIOpen are set back to the default stream on clear.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@codecov

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4775   +/-   ##
========================================
  Coverage    92.49%   92.49%           
========================================
  Files          583      583           
  Lines        29562    29562           
========================================
  Hits         27343    27343           
  Misses        2219     2219           

see 20 files with indirect coverage changes


Contributor

Copilot AI left a comment


Pull request overview

Enables MIGraphX GPU async execution to run directly on a caller-provided HIP stream (external stream contexts) to reduce internal synchronization/stalls, and adds GPU tests to validate external-stream behavior and fallback behavior.

Changes:

  • Add external-stream support in gpu::context/hip_device::stream (override stream used by the context during async eval).
  • Adjust async synchronization logic (wait_for / finish_on) to avoid creating/using an extra internal stream when an external stream is provided.
  • Add a comprehensive new GPU test suite covering external stream override, async eval behavior, and fallback paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/targets/gpu/include/migraphx/gpu/context.hpp: Adds external stream override plumbing and modifies async sync behavior to use the caller's stream.
test/gpu/external_stream.cpp: Adds new GPU tests for external stream override, async eval correctness, and state cleanup expectations.


Comment on lines +390 to +396
if(not get_stream().has_external_stream())
{
    get_stream().record(finish_event.get());
    auto status = hipStreamWaitEvent(queue.get<hipStream_t>(), finish_event.get(), 0);
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed to wait on event: " + hip_error(status));
}

Copilot AI Apr 15, 2026


finish_on() skips all work when an external stream is active, but it also never restores the stream state. Since program::eval() calls wait_for()/finish_on() around async execution, this means an async eval will leave the GPU context permanently bound to the external stream (affecting later sync evals and finish()). finish_on() (or a dedicated scope guard) should clear the external stream and restore library handles back to the internal/default stream after the async run completes.

Suggested change
-if(not get_stream().has_external_stream())
-{
-    get_stream().record(finish_event.get());
-    auto status = hipStreamWaitEvent(queue.get<hipStream_t>(), finish_event.get(), 0);
-    if(status != hipSuccess)
-        MIGRAPHX_THROW("Failed to wait on event: " + hip_error(status));
-}
+if(get_stream().has_external_stream())
+{
+    get_stream().set_external_stream(nullptr);
+    return;
+}
+get_stream().record(finish_event.get());
+auto status = hipStreamWaitEvent(queue.get<hipStream_t>(), finish_event.get(), 0);
+if(status != hipSuccess)
+    MIGRAPHX_THROW("Failed to wait on event: " + hip_error(status));

Comment on lines +374 to +383
auto* ext = queue.get<hipStream_t>();
if(ext == nullptr)
{
    auto status = hipEventRecord(begin_event.get(), ext);
    if(status != hipSuccess)
        MIGRAPHX_THROW("Failed to record: " + hip_error(status));
    get_stream().wait(begin_event.get());
}
else
{

Copilot AI Apr 15, 2026


wait_for() calls queue.get<hipStream_t>() unconditionally. If the caller passes a null stream via execution_environment{nullptr, true} (which is a reasonable way to request the fallback/event path), the any_ptr was constructed with a typed nullptr and any_ptr::get() will hit assert(not ti or ptr != nullptr) in debug builds. Consider checking queue.unsafe_get() == nullptr first and treating that as the null-stream fallback, only calling get<hipStream_t>() when the pointer is non-null.

Suggested change
-auto* ext = queue.get<hipStream_t>();
-if(ext == nullptr)
-{
-    auto status = hipEventRecord(begin_event.get(), ext);
-    if(status != hipSuccess)
-        MIGRAPHX_THROW("Failed to record: " + hip_error(status));
-    get_stream().wait(begin_event.get());
-}
-else
-{
+if(queue.unsafe_get() == nullptr)
+{
+    auto status = hipEventRecord(begin_event.get(), nullptr);
+    if(status != hipSuccess)
+        MIGRAPHX_THROW("Failed to record: " + hip_error(status));
+    get_stream().wait(begin_event.get());
+}
+else
+{
+    auto* ext = queue.get<hipStream_t>();

Member

@bdevorem bdevorem left a comment


couple questions, thanks Ted

Comment thread test/gpu/external_stream.cpp Outdated
auto host_result = migraphx::gpu::from_gpu(gout);
verify_data(host_result, out_shape, 12.0f);
}

Member


The PR description says external libs are reset to the default stream on clear/finish, but the tests seem to mostly assert get_queue() or has_external_stream() and numerical results. I think none of them would fail if MIOpen or rocBLAS were left bound to the customer stream?

Member

@bdevorem bdevorem left a comment


I think a rebase/merge will solve the CI problems. lgtm otherwise

Collaborator

@pfultz2 pfultz2 left a comment


I don't think we should do this in the wait_for and finish_on functions, as it could change the semantics of those functions. Instead we should add a use_queue method to the context interface and use that directly.

@TedThemistokleous
Collaborator Author

I think a rebase/merge will solve the CI problems. lgtm otherwise

Sure, rebased this off develop.

@TedThemistokleous
Collaborator Author

TedThemistokleous commented Apr 17, 2026

I don't think we should do this in the wait_for and finish_on functions as it could change the semantics of the function. Instead we should add a use_queue method to the context interface and use that directly.

Okay, let me add this. This is similar; just do the create/set in the use_queue or use_external() thread?

I don't think we should do this in the wait_for and finish_on functions as it could change the semantics of the function. Instead we should add a use_queue method to the context interface and use that directly.

So is the idea then: run_async() binds the stream if it's not the null/default stream, and otherwise we just create an internal stream for a regular run()?

@TedThemistokleous TedThemistokleous changed the title [AIMIGRAPHX-885] Use External Stream Contexts [AIMIGRAPHX-885][AIMIGRAPGX-987] Use External Stream Contexts Apr 24, 2026
@causten
Collaborator

causten commented Apr 29, 2026

Test             Batch  New Rate (bcc94e)  Old Rate (cdbdbc)  Diff  Status
resnet50v1_fp16  1      nan                992.67             nan

Regressions detected 🔴

@causten
Collaborator

causten commented Apr 29, 2026

Test Status Result
bert-mrpc-onnx PASSED: MIGraphX meets tolerance
bert-mrpc-tf PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-gru PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-lstm PASSED: MIGraphX meets tolerance
dlrm-criteoterabyte PASSED: MIGraphX meets tolerance
agentmodel PASSED: MIGraphX meets tolerance
unet PASSED: MIGraphX meets tolerance
resnet50v1 PASSED: MIGraphX meets tolerance
bert_base_cased_fp16 PASSED: MIGraphX meets tolerance
bert_large_uncased_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
bert_large PASSED: MIGraphX meets tolerance
yolov5s PASSED: MIGraphX meets tolerance
tinyllama PASSED: MIGraphX meets tolerance
vicuna-fastchat PASSED: MIGraphX meets tolerance
whisper-tiny-encoder PASSED: MIGraphX meets tolerance
whisper-tiny-decoder PASSED: MIGraphX meets tolerance
distilgpt2_fp16 PASSED: MIGraphX meets tolerance
llama2_7b PASSED: MIGraphX meets tolerance
qwen1.5-7b PASSED: MIGraphX meets tolerance
phi3-3.8b PASSED: MIGraphX meets tolerance
llama3-8b PASSED: MIGraphX meets tolerance
whisper-large-encoder ERROR - check error output
traceback
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 360, in
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 211, in main
    model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /data/src/include/migraphx/op/convolution.hpp:103: normalize_compute_shape: CONVOLUTION: mismatched channel numbers
whisper-large-decoder PASSED: MIGraphX meets tolerance
mistral-7b PASSED: MIGraphX meets tolerance
FLUX.1-schnell PASSED: MIGraphX meets tolerance

@TedThemistokleous
Collaborator Author

Added changes based on Paul's comments so that we don't modify wait_for and finish_on, and instead just use use_queue and set_queue_context.

