accel/amdxdna: add TDR implementation #1191

Open
wendyliang25 wants to merge 2 commits into amd:main from wendyliang25:accel-tdr

@wendyliang25

This patch set ports the TDR implementation to the new drivers/accel/amdxdna driver.

Compared to the legacy TDR implementation, this implementation differs as follows:

  • detection: it compares device-wide submission and completion atomic counters. In the TDR timer callback, it checks the two counters; if they differ, a context is considered to have potentially timed out. When a job is scheduled to run, the device-wide submission counter is increased by one; when the job is reported complete, the completion counter is increased by one. The legacy driver does not use counters: it only checks whether any job has completed, and its TDR timer callback scans the device-wide run queue. The new implementation does not need to scan the device-wide contexts, which reduces TDR detection overhead (mutex locking and scan time). However, this method may need updating if the job submission mechanism changes in the future. This part is therefore a request for comments, as I can also port the legacy detection mechanism to the new driver.

  • make the recovery procedure functions common, so that future hardware generations can call the same functions for recovery.

    • detection will be hardware specific.

Tests:

$ ./Release/bins/bin/shim_test.elf -k
1       test(s) skipped: 69
69      test(s) executed
ALL 69 executed test(s) PASSED!

$ ./Release/bins/bin/shim_test.elf 26
====== 26: io test with instruction code invalid address access started =====
loaded ./Release/bins/bin/../local_shim_test_data/npu4/bad/bad_txn.xclbin
loaded ./Release/bins/bin/../local_shim_test_data/npu4/partial_elf/design.xclbin
====== 26: io test with instruction code invalid address access passed  =====

0       test(s) skipped:
1       test(s) executed
ALL 1 executed test(s) PASSED!
$

# dmesg:
[429126.251801] amdxdna 0000:c5:00.1: [drm] aie2_tdr_timer_func: Device isn't making progress...
[429126.251811] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_hwctx_dump: TDR: ctx hwctx.395389.15 (fw_ctx 15) stuck, sub 1, comp 0, pending 1
[429126.252218] amdxdna 0000:c5:00.1: [drm] aie2_tdr_work_func: Device isn't making progress... Recovering...
[429126.252235] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout: Firmware timeout state capture:
[429126.252238] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Version: 2.1
[429126.252240] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Report size: 0x230
[429126.252242] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Context ID: 15
[429126.252244] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       DPU PC: 0xffffffff
[429126.252246] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       TXN OP ID: 0xffffffff
[429126.252248] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Context PC: 0x28b15689
[429126.252250] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error type: 0xc
[429126.252252] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error exception type: 0x9
[429126.252254] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error exception PC: 0x28b15689
[429126.252256] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error app module: 0x1
[429126.252258] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error task ID: 15
[429126.252260] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Timed out sub command ID: 0
[429126.253119] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_config_cu: Lookup GEM object failed
[429126.253123] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_hwctx_restart: Config cu failed, ret -22
[429126.253126] amdxdna 0000:c5:00.1: [drm] *ERROR* amdxdna_tdr_restart_all: TDR restart failed for ctx hwctx.395389.15, ret -22

$ sudo cat /sys/module/amdxdna/parameters/tdr_dump_only
N
$ sudo cat /sys/module/amdxdna/parameters/tdr_timeout_sec
2

Copilot AI review requested due to automatic review settings March 18, 2026 17:40

xdavidz commented Mar 18, 2026

Is this aie2 only, or should it be common between aie2 and aie4? If it can be common, should we rename aie2_tdr.c to aie_tdr.c? @houlz0507

Copilot AI left a comment

Pull request overview

Ports Timeout Detection and Recovery (TDR) into the new drivers/accel/amdxdna driver by adding a device-wide timer/workqueue based detector and common recovery helpers that stop/restart all hardware contexts.

Changes:

  • Add AIE2 TDR implementation (timer + workqueue) and wire it into device init/fini.
  • Introduce common TDR helpers to dump/stop/restart all hwctx instances via new amdxdna_dev_ops callbacks.
  • Add kernel feature-probe for timer_container_of() and provide a compatibility macro fallback.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| drivers/accel/tools/configure_kernel.sh | Probes for timer_container_of() and emits compat macro fallback. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart/dump callbacks used by TDR. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Exposes new common TDR helper APIs. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements common dump/stop/restart-all helpers over clients/hwctx. |
| drivers/accel/amdxdna/aie2_tdr.c | New AIE2 TDR detector/recovery logic (timer + work item + module params). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR state to device handle and adds per-hwctx cached health report pointer. |
| drivers/accel/amdxdna/aie2_pci.c | Starts/stops TDR as part of AIE2 device init/fini; wires ops callbacks. |
| drivers/accel/amdxdna/aie2_ctx.c | Implements hwctx stop/restart/dump ops; integrates job submitted/completed accounting and timeout report plumbing. |
| drivers/accel/amdxdna/Kbuild | Builds the new aie2_tdr.o object. |

wendyliang25 commented Mar 21, 2026

Is this aie2 only, or should it be common between aie2 and aie4? If it can be common, should we rename aie2_tdr.c to aie_tdr.c? @houlz0507

Hi @xdavidz, TDR detection is aie2 specific; on aie4, TDR errors are reported by firmware. The restart procedure may be common, so the stop and restart functions are implemented in amdxdna_ctx.c. I can also move them to amdxdna_tdr.c. The TDR processing for AIE2 uses the DRM scheduler timeout callback; at the moment only AIE2 uses the DRM scheduler.

Copilot AI left a comment

Pull request overview

Ports a Timeout Detection and Recovery (TDR) path into the drivers/accel/amdxdna driver by integrating a new AIE2 timeout handler with shared “stop/restart all contexts” recovery helpers and related driver plumbing.

Changes:

  • Adds AIE2 TDR timed-out job handler plus module parameters (timeout / dump-only) and per-context cached health reporting.
  • Introduces common device-wide recovery helpers (amdxdna_tdr_stop_all() / amdxdna_tdr_restart_all()) and wires new hwctx stop/restart ops into the device ops table.
  • Updates build and kernel-compat tooling (adds aie2_tdr.o, adds a timer_container_of compatibility probe).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| drivers/accel/tools/configure_kernel.sh | Adds kernel feature probe and fallback macro for timer_container_of. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart; adds client iteration macro. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Declares new device-wide TDR stop/restart helpers. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements device-wide “stop/restart all hwctx” recovery helpers. |
| drivers/accel/amdxdna/aie2_tdr.c | Adds new AIE2 TDR implementation (timedout_job hook, detection, health collection). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR structures/state to device handle; adds hwctx stop/restart ops prototypes and TDR API. |
| drivers/accel/amdxdna/aie2_pci.c | Wires AIE2 hwctx stop/restart ops into aie2_ops. |
| drivers/accel/amdxdna/aie2_ctx.c | Integrates TDR signaling, timeout handler swap-in, timeout parameter usage, and health propagation on abort paths. |
| drivers/accel/amdxdna/Kbuild | Builds the new aie2_tdr.o. |

Comment on lines +137 to +159
	guard(mutex)(&xdna->dev_lock);
	if (!time_after(jiffies, ndev->tdr.last_jiffies + TDR_TIMEOUT_JIFF))
		return DRM_GPU_SCHED_STAT_NO_HANG;

	if (!aie2_tdr_detect(ndev)) {
		ndev->tdr.last_jiffies = jiffies;
		return DRM_GPU_SCHED_STAT_NO_HANG;
	}

	trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq);
	job->job_timeout = true;
	XDNA_WARN(xdna, "Device isn't making progress... Recovering...");

	aie2_tdr_collect_health(ndev);
	job->priv = hwctx->priv->tdr_health;
	hwctx->priv->tdr_health = NULL;

	amdxdna_tdr_stop_all(xdna, job);
	amdxdna_tdr_restart_all(xdna);

	ndev->tdr.last_jiffies = jiffies;
	return DRM_GPU_SCHED_STAT_RESET;
}

Copilot AI Mar 21, 2026

Compatibility: this function now unconditionally returns DRM_GPU_SCHED_STAT_NO_HANG / DRM_GPU_SCHED_STAT_RESET, but the repo already has kernel-version probes for the *_RESET enum name change. On kernels lacking these enum values, this will fail to build. Consider restoring the #ifdef HAVE_drm_gpu_sched_stat_reset fallback (and similarly guarding NO_HANG if needed) or adding a compat define in the generated config header.

…mentation

This commit introduces an implementation of TDR for the AMD XDNA platform.
Key changes:
- Add core logic in drivers/accel/amdxdna/aie2_tdr.c to support device timeout
  detection and recovery flows.
- Hook TDR into device and context management in aie2_ctx.c, amdxdna_ctx.c,
  and aie2_pci.c.
- Update related headers (aie2_pci.h, amdxdna_ctx.h, amdxdna_pci_drv.h)
  to provide the necessary data structures and interfaces for TDR.
- Extend Kbuild to build the new TDR source.

TDR enables enhanced device reliability by detecting hung contexts and
recovering from such events, minimizing system downtime and improving
user experience.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>
…er_of macro.

timer_container_of() macro is defined in kernel v6.16.
It is used to get the container of a timer from a timer_list.
For older kernels, we use from_timer() macro instead.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>
Copilot AI review requested due to automatic review settings March 22, 2026 06:45
Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


Comment on lines +189 to +190
/* TDR cached health report, consumed by first aborted job's response handler */
struct app_health_report *tdr_health;

Copilot AI Mar 22, 2026

tdr_health is cached in hwctx_priv and can remain set in dump-only mode or if no aborted job consumes it. aie2_hwctx_fini() currently frees hwctx->priv without freeing priv->tdr_health, so this can leak the allocation across hwctx teardown. Please ensure tdr_health is freed/cleared in the hwctx teardown path (e.g., in aie2_hwctx_fini() before freeing hwctx->priv).

Comment on lines +111 to +113
kfree(hwctx->priv->tdr_health);
hwctx->priv->tdr_health = NULL;

Copilot AI Mar 22, 2026

aie2_tdr_collect_health() frees hwctx->priv->tdr_health with a plain kfree() and then later overwrites the pointer. Since other paths concurrently access it via xchg() (e.g., response handlers and aie2_hwctx_stop_op()), this open-coded free/store can race and lead to use-after-free or leaks. Prefer freeing via kfree(xchg(&priv->tdr_health, NULL)) and use atomic exchange/assign patterns consistently when replacing the cached report.

	if (sub == comp)
		continue;

	XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld",

Copilot AI Mar 22, 2026

The log message uses %lld for sub/comp which are u64. This can trigger format warnings and prints incorrectly if the high bit is set. Use %llu (or cast to unsigned long long explicitly) for these counters.

Suggested change
- XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld",
+ XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %llu, comp %llu",

Comment on lines +621 to 630
+	u64 timeout_ms = MAX_SCHEDULE_TIMEOUT;
+
+	if (tdr_timeout_ms > 0)
+		timeout_ms = tdr_timeout_ms;
 	const struct drm_sched_init_args args = {
 		.ops = &sched_ops,
 		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
 		.credit_limit = HWCTX_MAX_CMDS,
-		.timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT),
+		.timeout = msecs_to_jiffies(timeout_ms),
 		.name = "amdxdna_js",

Copilot AI Mar 22, 2026

tdr_timeout_ms is documented as “0 or negative = disable”, but when HAVE_6_15_drm_sched_init is enabled the code sets timeout_ms = MAX_SCHEDULE_TIMEOUT and then passes it through msecs_to_jiffies(). MAX_SCHEDULE_TIMEOUT is already a jiffies-based value, so converting it as milliseconds can overflow/clamp and changes the intended “disabled” behavior. Consider computing a timeout_jiffies directly (e.g., MAX_SCHEDULE_TIMEOUT when disabled, otherwise msecs_to_jiffies(tdr_timeout_ms)), and make sure the non-HAVE_6_15_drm_sched_init path also respects disabling via tdr_timeout_ms (it currently still uses HWCTX_MAX_TIMEOUT).
