accel/amdxdna: add TDR implementation by wendyliang25 · Pull Request #1191 · amd/xdna-driver

wendyliang25 · 2026-03-18T17:40:50Z

This patch set ports the TDR implementation to the new drivers/accel/amdxdna driver.

Compared to the legacy TDR implementation, this implementation does the following:

- detection: it compares the submission and completion atomic counters kept in the device-wide TDR state. In the TDR timer callback, it checks the two counters; if they differ, a context is considered to have potentially timed out. When a job is scheduled to run, the device-wide submission counter is incremented by one; when the job is reported complete, the completion counter is incremented by one. The legacy driver does not use counters; it only checks whether any job has completed, and its TDR timer callback scans the device-wide run queue. The new implementation does not need to scan the device-wide contexts, which reduces TDR detection overhead (mutex locking and scan time). However, this method may need updating if the job submission mechanism changes in the future. This part is therefore a request for comments, as I can also port the legacy detection mechanism to the new driver.
- make the recovery procedure functions common, so that future hardware generations can call the same functions for recovery.
  - detection will be hardware specific.
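
To make the counter comparison concrete, here is a minimal userspace sketch using C11 atomics. It is not the driver code: `struct tdr_state` and the `tdr_*` helper names are hypothetical, and the real implementation uses kernel atomics and a timer callback rather than a plain function call.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-wide TDR state; field names are illustrative only. */
struct tdr_state {
	atomic_uint_fast64_t submissions;	/* bumped when a job is scheduled to run */
	atomic_uint_fast64_t completions;	/* bumped when a job is reported complete */
};

static void tdr_init(struct tdr_state *tdr)
{
	atomic_init(&tdr->submissions, 0);
	atomic_init(&tdr->completions, 0);
}

static void tdr_job_submitted(struct tdr_state *tdr)
{
	atomic_fetch_add(&tdr->submissions, 1);
}

static void tdr_job_completed(struct tdr_state *tdr)
{
	atomic_fetch_add(&tdr->completions, 1);
}

/*
 * The check a timer callback could perform: unequal counters mean work is
 * still outstanding, so the device is potentially stuck. No lock is taken
 * and no context list is walked.
 */
static bool tdr_maybe_hung(struct tdr_state *tdr)
{
	uint_fast64_t comp = atomic_load(&tdr->completions);
	uint_fast64_t sub = atomic_load(&tdr->submissions);

	return sub != comp;
}
```

Reading the completion counter before the submission counter keeps the observed submission count from falling behind the observed completion count when jobs finish between the two loads.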

Tests:

$ ./Release/bins/bin/shim_test.elf -k
1       test(s) skipped: 69
69      test(s) executed
ALL 69 executed test(s) PASSED!

$ ./Release/bins/bin/shim_test.elf 26
====== 26: io test with instruction code invalid address access started =====
loaded ./Release/bins/bin/../local_shim_test_data/npu4/bad/bad_txn.xclbin
loaded ./Release/bins/bin/../local_shim_test_data/npu4/partial_elf/design.xclbin
====== 26: io test with instruction code invalid address access passed  =====

0       test(s) skipped:
1       test(s) executed
ALL 1 executed test(s) PASSED!
$

# dmesg:
[429126.251801] amdxdna 0000:c5:00.1: [drm] aie2_tdr_timer_func: Device isn't making progress...
[429126.251811] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_hwctx_dump: TDR: ctx hwctx.395389.15 (fw_ctx 15) stuck, sub 1, comp 0, pending 1
[429126.252218] amdxdna 0000:c5:00.1: [drm] aie2_tdr_work_func: Device isn't making progress... Recovering...
[429126.252235] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout: Firmware timeout state capture:
[429126.252238] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Version: 2.1
[429126.252240] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Report size: 0x230
[429126.252242] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Context ID: 15
[429126.252244] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       DPU PC: 0xffffffff
[429126.252246] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       TXN OP ID: 0xffffffff
[429126.252248] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Context PC: 0x28b15689
[429126.252250] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error type: 0xc
[429126.252252] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error exception type: 0x9
[429126.252254] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error exception PC: 0x28b15689
[429126.252256] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error app module: 0x1
[429126.252258] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Fatal error task ID: 15
[429126.252260] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_set_cmd_timeout:       Timed out sub command ID: 0
[429126.253119] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_config_cu: Lookup GEM object failed
[429126.253123] amdxdna 0000:c5:00.1: [drm] *ERROR* aie2_hwctx_restart: Config cu failed, ret -22
[429126.253126] amdxdna 0000:c5:00.1: [drm] *ERROR* amdxdna_tdr_restart_all: TDR restart failed for ctx hwctx.395389.15, ret -22

$ sudo cat /sys/module/amdxdna/parameters/tdr_dump_only
N
$ sudo cat /sys/module/amdxdna/parameters/tdr_timeout_sec
2

xdavidz · 2026-03-18T17:49:20Z

Is this aie2 only, or should it be common between aie2 and aie4? If it can be common, should we change the name from aie2_tdr.c to aie_tdr.c? @houlz0507

Copilot

Pull request overview

Ports Timeout Detection and Recovery (TDR) into the new drivers/accel/amdxdna driver by adding a device-wide timer/workqueue based detector and common recovery helpers that stop/restart all hardware contexts.

Changes:

- Add AIE2 TDR implementation (timer + workqueue) and wire it into device init/fini.
- Introduce common TDR helpers to dump/stop/restart all hwctx instances via new amdxdna_dev_ops callbacks.
- Add kernel feature-probe for timer_container_of() and provide a compatibility macro fallback.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| drivers/accel/tools/configure_kernel.sh | Probes for `timer_container_of()` and emits compat macro fallback. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart/dump callbacks used by TDR. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Exposes new common TDR helper APIs. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements common dump/stop/restart-all helpers over clients/hwctx. |
| drivers/accel/amdxdna/aie2_tdr.c | New AIE2 TDR detector/recovery logic (timer + work item + module params). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR state to device handle and adds per-hwctx cached health report pointer. |
| drivers/accel/amdxdna/aie2_pci.c | Starts/stops TDR as part of AIE2 device init/fini; wires ops callbacks. |
| drivers/accel/amdxdna/aie2_ctx.c | Implements hwctx stop/restart/dump ops; integrates job submitted/completed accounting and timeout report plumbing. |
| drivers/accel/amdxdna/Kbuild | Builds the new `aie2_tdr.o` object. |



wendyliang25 · 2026-03-21T01:06:40Z

> Is this aie2 only, or should it be common between aie2 and aie4? If it can be common, should we change the name from aie2_tdr.c to aie_tdr.c? @houlz0507

Hi @xdavidz, TDR detection is aie2 specific; aie4 TDR errors are reported by firmware. The restart procedure may be common, and thus the stop and restart functions are implemented in amdxdna_ctx.c. I can also move them to amdxdna_tdr.c. TDR processing for AIE2 uses the DRM scheduler timeout callback; at the moment only AIE2 uses the DRM scheduler.

Copilot

Pull request overview

Ports a Timeout Detection and Recovery (TDR) path into the drivers/accel/amdxdna driver by integrating a new AIE2 timeout handler with shared “stop/restart all contexts” recovery helpers and related driver plumbing.

Changes:

- Adds AIE2 TDR timed-out job handler plus module parameters (timeout / dump-only) and per-context cached health reporting.
- Introduces common device-wide recovery helpers (amdxdna_tdr_stop_all() / amdxdna_tdr_restart_all()) and wires new hwctx stop/restart ops into the device ops table.
- Updates build and kernel-compat tooling (adds aie2_tdr.o, adds a timer_container_of compatibility probe).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.


| File | Description |
| --- | --- |
| drivers/accel/tools/configure_kernel.sh | Adds kernel feature probe and fallback macro for `timer_container_of`. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart; adds client iteration macro. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Declares new device-wide TDR stop/restart helpers. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements device-wide “stop/restart all hwctx” recovery helpers. |
| drivers/accel/amdxdna/aie2_tdr.c | Adds new AIE2 TDR implementation (timedout_job hook, detection, health collection). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR structures/state to device handle; adds hwctx stop/restart ops prototypes and TDR API. |
| drivers/accel/amdxdna/aie2_pci.c | Wires AIE2 hwctx stop/restart ops into `aie2_ops`. |
| drivers/accel/amdxdna/aie2_ctx.c | Integrates TDR signaling, timeout handler swap-in, timeout parameter usage, and health propagation on abort paths. |
| drivers/accel/amdxdna/Kbuild | Builds the new `aie2_tdr.o`. |



Copilot · 2026-03-21T01:16:55Z

drivers/accel/amdxdna/aie2_tdr.c

+	guard(mutex)(&xdna->dev_lock);
+	if (!time_after(jiffies, ndev->tdr.last_jiffies + TDR_TIMEOUT_JIFF))
+		return DRM_GPU_SCHED_STAT_NO_HANG;
+
+	if (!aie2_tdr_detect(ndev)) {
+		ndev->tdr.last_jiffies = jiffies;
+		return DRM_GPU_SCHED_STAT_NO_HANG;
+	}
+
+	trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq);
+	job->job_timeout = true;
+	XDNA_WARN(xdna, "Device isn't making progress... Recovering...");
+
+	aie2_tdr_collect_health(ndev);
+	job->priv = hwctx->priv->tdr_health;
+	hwctx->priv->tdr_health = NULL;
+
+	amdxdna_tdr_stop_all(xdna, job);
+	amdxdna_tdr_restart_all(xdna);
+
+	ndev->tdr.last_jiffies = jiffies;
+	return DRM_GPU_SCHED_STAT_RESET;
+}


Compatibility: this function now unconditionally returns DRM_GPU_SCHED_STAT_NO_HANG / DRM_GPU_SCHED_STAT_RESET, but the repo already has kernel-version probes for the *_RESET enum name change. On kernels lacking these enum values, this will fail to build. Consider restoring the #ifdef HAVE_drm_gpu_sched_stat_reset fallback (and similarly guarding NO_HANG if needed) or adding a compat define in the generated config header.
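
A rough sketch of the kind of compat define this comment suggests, written as a standalone mock so it compiles outside the kernel. The enum below only imitates an older kernel's scheduler header, and the assumption that the pre-rename value is DRM_GPU_SCHED_STAT_NOMINAL should be checked against the kernels actually targeted. DRM_GPU_SCHED_STAT_NO_HANG is left out on purpose, since an older kernel may have no value with the same semantics.

```c
/*
 * Stand-in for the scheduler status enum on an older kernel, where only
 * the pre-rename name exists (assumption). In a real build this comes
 * from <drm/gpu_scheduler.h>, not from driver code.
 */
enum drm_gpu_sched_stat {
	DRM_GPU_SCHED_STAT_NONE,
	DRM_GPU_SCHED_STAT_NOMINAL,
	DRM_GPU_SCHED_STAT_ENODEV,
};

/*
 * HAVE_drm_gpu_sched_stat_reset is the probe result named in the review
 * comment; when it is absent, map the new name onto the old one.
 */
#ifndef HAVE_drm_gpu_sched_stat_reset
#define DRM_GPU_SCHED_STAT_RESET DRM_GPU_SCHED_STAT_NOMINAL
#endif

/* Hypothetical helper: the status a handler reports after recovery. */
static enum drm_gpu_sched_stat tdr_recovered_status(void)
{
	return DRM_GPU_SCHED_STAT_RESET;
}
```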


…mentation

This commit introduces an implementation of TDR for the AMD XDNA platform.

Key changes:
- Add core logic in drivers/accel/amdxdna/aie2_tdr.c to support device timeout detection and recovery flows.
- Hook TDR into device and context management in aie2_ctx.c, amdxdna_ctx.c, and aie2_pci.c.
- Update related headers (aie2_pci.h, amdxdna_ctx.h, amdxdna_pci_drv.h) to provide the necessary data structures and interfaces for TDR.
- Extend Kbuild to build the new TDR source.

TDR enables enhanced device reliability by detecting hung contexts and recovering from such events, minimizing system downtime and improving user experience.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>

…er_of macro.

The timer_container_of() macro is defined in kernel v6.16. It is used to get the structure that contains a timer from a timer_list pointer. For older kernels, we use the from_timer() macro instead.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>
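
For readers unfamiliar with the macro, the fallback described in this commit message can be sketched in userspace as below. `struct timer_list`, `container_of()` and `from_timer()` are simplified stand-ins for the kernel definitions, `struct tdr_state` is hypothetical, and the `#ifndef` guard is an assumption (the PR generates its fallback from a configure-time probe). The `__typeof__` extension requires GCC or Clang.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel's timer_list and container_of(). */
struct timer_list {
	unsigned long expires;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Older-kernel helper, mirrored here so the fallback has a target. */
#define from_timer(var, callback_timer, timer_fieldname) \
	container_of(callback_timer, __typeof__(*(var)), timer_fieldname)

/* Compat fallback: only defined when the newer name is missing. */
#ifndef timer_container_of
#define timer_container_of(var, callback_timer, timer_fieldname) \
	from_timer(var, callback_timer, timer_fieldname)
#endif

/* Hypothetical TDR state that embeds its timer. */
struct tdr_state {
	int ticks;
	struct timer_list timer;
};

/* Timer callback: recover the enclosing state from the timer pointer. */
static void tdr_timer_func(struct timer_list *t)
{
	struct tdr_state *tdr = timer_container_of(tdr, t, timer);

	tdr->ticks++;
}
```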

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


Copilot · 2026-03-22T06:50:00Z

drivers/accel/amdxdna/aie2_pci.h

+	/* TDR cached health report, consumed by first aborted job's response handler */
+	struct app_health_report	*tdr_health;


tdr_health is cached in hwctx_priv and can remain set in dump-only mode or if no aborted job consumes it. aie2_hwctx_fini() currently frees hwctx->priv without freeing priv->tdr_health, so this can leak the allocation across hwctx teardown. Please ensure tdr_health is freed/cleared in the hwctx teardown path (e.g., in aie2_hwctx_fini() before freeing hwctx->priv).

Copilot · 2026-03-22T06:50:00Z

drivers/accel/amdxdna/aie2_tdr.c

+			kfree(hwctx->priv->tdr_health);
+			hwctx->priv->tdr_health = NULL;
+


aie2_tdr_collect_health() frees hwctx->priv->tdr_health with a plain kfree() and then later overwrites the pointer. Since other paths concurrently access it via xchg() (e.g., response handlers and aie2_hwctx_stop_op()), this open-coded free/store can race and lead to use-after-free or leaks. Prefer freeing via kfree(xchg(&priv->tdr_health, NULL)) and use atomic exchange/assign patterns consistently when replacing the cached report.
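
The exchange-then-free pattern recommended here, together with the teardown cleanup from the previous comment, can be illustrated with C11 atomics in userspace. This is only a sketch: the struct layouts and helper names are placeholders, and `atomic_exchange()` plus `free()` stand in for the kernel's `xchg()` and `kfree()`.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Placeholder layout; the real report is defined by the driver. */
struct app_health_report {
	int ctx_id;
};

struct hwctx_priv {
	_Atomic(struct app_health_report *) tdr_health;
};

/* Atomically detach whatever is cached; the caller owns the result. */
static struct app_health_report *tdr_health_take(struct hwctx_priv *priv)
{
	return atomic_exchange(&priv->tdr_health, NULL);
}

/* Publish a fresh report and free the one it displaced, if any. */
static void tdr_health_replace(struct hwctx_priv *priv,
			       struct app_health_report *fresh)
{
	free(atomic_exchange(&priv->tdr_health, fresh));
}

/* Teardown path: release a report that no aborted job consumed. */
static void hwctx_priv_fini(struct hwctx_priv *priv)
{
	free(tdr_health_take(priv));
}
```

Because every path detaches the pointer with a single exchange, at most one of them ever sees a given report, so it is freed exactly once.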

Copilot · 2026-03-22T06:50:00Z

drivers/accel/amdxdna/aie2_tdr.c

+			if (sub == comp)
+				continue;
+
+			XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld",


The log message uses %lld for sub/comp which are u64. This can trigger format warnings and prints incorrectly if the high bit is set. Use %llu (or cast to unsigned long long explicitly) for these counters.

Suggested change

XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld",

XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %llu, comp %llu",
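
A small userspace illustration of why the specifier matters, using `snprintf()` in place of the driver's logging macro. The helper names are made up for this example, and the signed result shown in the note assumes the usual two's-complement conversion done by GCC and Clang.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a counter as signed, as the original message effectively did. */
static int fmt_counter_signed(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%lld", (long long)v);
}

/* Format a counter as unsigned, as in the suggested change. */
static int fmt_counter_unsigned(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%llu", (unsigned long long)v);
}
```

With the value `UINT64_MAX`, the signed variant produces `-1` while the unsigned variant produces `18446744073709551615`.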

Copilot · 2026-03-22T06:50:00Z

drivers/accel/amdxdna/aie2_ctx.c

+	u64 timeout_ms = MAX_SCHEDULE_TIMEOUT;
+
+	if (tdr_timeout_ms > 0)
+		timeout_ms = tdr_timeout_ms;
 	const struct drm_sched_init_args args = {
 		.ops = &sched_ops,
 		.num_rqs = DRM_SCHED_PRIORITY_COUNT,
 		.credit_limit = HWCTX_MAX_CMDS,
-		.timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT),
+		.timeout = msecs_to_jiffies(timeout_ms),
 		.name = "amdxdna_js",


tdr_timeout_ms is documented as “0 or negative = disable”, but when HAVE_6_15_drm_sched_init is enabled the code sets timeout_ms = MAX_SCHEDULE_TIMEOUT and then passes it through msecs_to_jiffies(). MAX_SCHEDULE_TIMEOUT is already a jiffies-based value, so converting it as milliseconds can overflow/clamp and changes the intended “disabled” behavior. Consider computing a timeout_jiffies directly (e.g., MAX_SCHEDULE_TIMEOUT when disabled, otherwise msecs_to_jiffies(tdr_timeout_ms)), and make sure the non-HAVE_6_15_drm_sched_init path also respects disabling via tdr_timeout_ms (it currently still uses HWCTX_MAX_TIMEOUT).
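
One way to express the suggestion is to compute the timeout directly in jiffies, as in the userspace sketch below. `HZ`, `msecs_to_jiffies_sketch()` and `tdr_timeout_jiffies()` are illustrative stand-ins, the parameter type is assumed to be `int`, and the kernel's real `msecs_to_jiffies()` handles more cases than this simplified rounding.

```c
#include <limits.h>

#define HZ 250				/* assumed tick rate for this sketch */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX	/* mirrors the kernel definition */

/* Simplified stand-in for msecs_to_jiffies(): convert and round up. */
static long msecs_to_jiffies_sketch(unsigned int ms)
{
	return (long)(((unsigned long long)ms * HZ + 999) / 1000);
}

/*
 * Zero or negative disables the timeout, so the jiffies sentinel is
 * returned as-is and never passed through a milliseconds conversion.
 */
static long tdr_timeout_jiffies(int tdr_timeout_ms)
{
	if (tdr_timeout_ms <= 0)
		return MAX_SCHEDULE_TIMEOUT;

	return msecs_to_jiffies_sketch((unsigned int)tdr_timeout_ms);
}
```

The result would then be assigned to `.timeout` directly, in both the HAVE_6_15_drm_sched_init path and the older path.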

wendyliang25 requested review from houlz0507, maxzhen and xdavidz as code owners March 18, 2026 17:40

Copilot AI review requested due to automatic review settings March 18, 2026 17:40

Copilot started reviewing on behalf of wendyliang25 March 18, 2026 17:43

Copilot AI reviewed Mar 18, 2026


wendyliang25 force-pushed the accel-tdr branch from 70df908 to 13af4e8 Compare March 21, 2026 01:00

wendyliang25 requested a review from Copilot March 21, 2026 01:11

Copilot started reviewing on behalf of wendyliang25 March 21, 2026 01:13

wendyliang25 added the do-not-merge label Mar 21, 2026

Copilot AI reviewed Mar 21, 2026


wendyliang25 force-pushed the accel-tdr branch from 13af4e8 to ae3d911 Compare March 21, 2026 06:15

wendyliang25 added 2 commits March 21, 2026 23:44

wendyliang25 force-pushed the accel-tdr branch from ae3d911 to 5334939 Compare March 22, 2026 06:45

Copilot AI review requested due to automatic review settings March 22, 2026 06:45

Copilot started reviewing on behalf of wendyliang25 March 22, 2026 06:46

Copilot AI reviewed Mar 22, 2026

