accel/amdxdna: add TDR implementation#1191
Conversation
|
Is this aie2 only or this should be common between aie2 and aie4? If it can be common, should we change the name from aie2_tdr.c to aie_tdr.c? @houlz0507 |
There was a problem hiding this comment.
Pull request overview
Ports Timeout Detection and Recovery (TDR) into the new drivers/accel/amdxdna driver by adding a device-wide timer/workqueue based detector and common recovery helpers that stop/restart all hardware contexts.
Changes:
- Add AIE2 TDR implementation (timer + workqueue) and wire it into device init/fini.
- Introduce common TDR helpers to dump/stop/restart all hwctx instances via new
amdxdna_dev_opscallbacks. - Add kernel feature-probe for
timer_container_of()and provide a compatibility macro fallback.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| drivers/accel/tools/configure_kernel.sh | Probes for timer_container_of() and emits compat macro fallback. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart/dump callbacks used by TDR. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Exposes new common TDR helper APIs. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements common dump/stop/restart-all helpers over clients/hwctx. |
| drivers/accel/amdxdna/aie2_tdr.c | New AIE2 TDR detector/recovery logic (timer + work item + module params). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR state to device handle and adds per-hwctx cached health report pointer. |
| drivers/accel/amdxdna/aie2_pci.c | Starts/stops TDR as part of AIE2 device init/fini; wires ops callbacks. |
| drivers/accel/amdxdna/aie2_ctx.c | Implements hwctx stop/restart/dump ops; integrates job submitted/completed accounting and timeout report plumbing. |
| drivers/accel/amdxdna/Kbuild | Builds the new aie2_tdr.o object. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Hi @xdavidz tdr detection is is aie2 specific, aie4 TDR error is reported from firmware.the restart procedure maybe common and thus, have the stop and restart functions implemented in |
There was a problem hiding this comment.
Pull request overview
Ports a Timeout Detection and Recovery (TDR) path into the drivers/accel/amdxdna driver by integrating a new AIE2 timeout handler with shared “stop/restart all contexts” recovery helpers and related driver plumbing.
Changes:
- Adds AIE2 TDR timed-out job handler plus module parameters (timeout / dump-only) and per-context cached health reporting.
- Introduces common device-wide recovery helpers (
amdxdna_tdr_stop_all()/amdxdna_tdr_restart_all()) and wires new hwctx stop/restart ops into the device ops table. - Updates build and kernel-compat tooling (adds
aie2_tdr.o, adds atimer_container_ofcompatibility probe).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| drivers/accel/tools/configure_kernel.sh | Adds kernel feature probe and fallback macro for timer_container_of. |
| drivers/accel/amdxdna/amdxdna_pci_drv.h | Extends dev ops with hwctx stop/restart; adds client iteration macro. |
| drivers/accel/amdxdna/amdxdna_ctx.h | Declares new device-wide TDR stop/restart helpers. |
| drivers/accel/amdxdna/amdxdna_ctx.c | Implements device-wide “stop/restart all hwctx” recovery helpers. |
| drivers/accel/amdxdna/aie2_tdr.c | Adds new AIE2 TDR implementation (timedout_job hook, detection, health collection). |
| drivers/accel/amdxdna/aie2_pci.h | Adds TDR structures/state to device handle; adds hwctx stop/restart ops prototypes and TDR API. |
| drivers/accel/amdxdna/aie2_pci.c | Wires AIE2 hwctx stop/restart ops into aie2_ops. |
| drivers/accel/amdxdna/aie2_ctx.c | Integrates TDR signaling, timeout handler swap-in, timeout parameter usage, and health propagation on abort paths. |
| drivers/accel/amdxdna/Kbuild | Builds the new aie2_tdr.o. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| guard(mutex)(&xdna->dev_lock); | ||
| if (!time_after(jiffies, ndev->tdr.last_jiffies + TDR_TIMEOUT_JIFF)) | ||
| return DRM_GPU_SCHED_STAT_NO_HANG; | ||
|
|
||
| if (!aie2_tdr_detect(ndev)) { | ||
| ndev->tdr.last_jiffies = jiffies; | ||
| return DRM_GPU_SCHED_STAT_NO_HANG; | ||
| } | ||
|
|
||
| trace_xdna_job(sched_job, hwctx->name, "job timedout", job->seq); | ||
| job->job_timeout = true; | ||
| XDNA_WARN(xdna, "Device isn't making progress... Recovering..."); | ||
|
|
||
| aie2_tdr_collect_health(ndev); | ||
| job->priv = hwctx->priv->tdr_health; | ||
| hwctx->priv->tdr_health = NULL; | ||
|
|
||
| amdxdna_tdr_stop_all(xdna, job); | ||
| amdxdna_tdr_restart_all(xdna); | ||
|
|
||
| ndev->tdr.last_jiffies = jiffies; | ||
| return DRM_GPU_SCHED_STAT_RESET; | ||
| } |
There was a problem hiding this comment.
Compatibility: this function now unconditionally returns DRM_GPU_SCHED_STAT_NO_HANG / DRM_GPU_SCHED_STAT_RESET, but the repo already has kernel-version probes for the *_RESET enum name change. On kernels lacking these enum values, this will fail to build. Consider restoring the #ifdef HAVE_drm_gpu_sched_stat_reset fallback (and similarly guarding NO_HANG if needed) or adding a compat define in the generated config header.
…mentation This commit introduces a implementation of TDR for the AMD XDNA platform. Key changes: - Add core logic in drivers/accel/amdxdna/aie2_tdr.c to support device timeout detection and recovery flows. - Hook TDR into device and context management in aie2_ctx.c, amdxdna_ctx.c, and aie2_pci.c. - Update related headers (aie2_pci.h, amdxdna_ctx.h, amdxdna_pci_drv.h) to provide the necessary data structures and interfaces for TDR. - Extend Kbuild to build the new TDR source. TDR enables enhanced device reliability by detecting hung contexts and recovering from such events, minimizing system downtime and improving user experience. Signed-off-by: Wendy Liang <wendy.liang@amd.com>
…er_of macro. timer_container_of() macro is defined in kernel v6.16. It is used to get the container of a timer from a timer_list. For older kernels, we use from_timer() macro instead. Signed-off-by: Wendy Liang <wendy.liang@amd.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /* TDR cached health report, consumed by first aborted job's response handler */ | ||
| struct app_health_report *tdr_health; |
There was a problem hiding this comment.
tdr_health is cached in hwctx_priv and can remain set in dump-only mode or if no aborted job consumes it. aie2_hwctx_fini() currently frees hwctx->priv without freeing priv->tdr_health, so this can leak the allocation across hwctx teardown. Please ensure tdr_health is freed/cleared in the hwctx teardown path (e.g., in aie2_hwctx_fini() before freeing hwctx->priv).
| kfree(hwctx->priv->tdr_health); | ||
| hwctx->priv->tdr_health = NULL; | ||
|
|
There was a problem hiding this comment.
aie2_tdr_collect_health() frees hwctx->priv->tdr_health with a plain kfree() and then later overwrites the pointer. Since other paths concurrently access it via xchg() (e.g., response handlers and aie2_hwctx_stop_op()), this open-coded free/store can race and lead to use-after-free or leaks. Prefer freeing via kfree(xchg(&priv->tdr_health, NULL)) and use atomic exchange/assign patterns consistently when replacing the cached report.
| if (sub == comp) | ||
| continue; | ||
|
|
||
| XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld", |
There was a problem hiding this comment.
The log message uses %lld for sub/comp which are u64. This can trigger format warnings and prints incorrectly if the high bit is set. Use %llu (or cast to unsigned long long explicitly) for these counters.
| XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %lld, comp %lld", | |
| XDNA_ERR(xdna, "TDR: ctx %s (fw_ctx %d) stuck, sub %llu, comp %llu", |
| u64 timeout_ms = MAX_SCHEDULE_TIMEOUT; | ||
|
|
||
| if (tdr_timeout_ms > 0) | ||
| timeout_ms = tdr_timeout_ms; | ||
| const struct drm_sched_init_args args = { | ||
| .ops = &sched_ops, | ||
| .num_rqs = DRM_SCHED_PRIORITY_COUNT, | ||
| .credit_limit = HWCTX_MAX_CMDS, | ||
| .timeout = msecs_to_jiffies(HWCTX_MAX_TIMEOUT), | ||
| .timeout = msecs_to_jiffies(timeout_ms), | ||
| .name = "amdxdna_js", |
There was a problem hiding this comment.
tdr_timeout_ms is documented as “0 or negative = disable”, but when HAVE_6_15_drm_sched_init is enabled the code sets timeout_ms = MAX_SCHEDULE_TIMEOUT and then passes it through msecs_to_jiffies(). MAX_SCHEDULE_TIMEOUT is already a jiffies-based value, so converting it as milliseconds can overflow/clamp and changes the intended “disabled” behavior. Consider computing a timeout_jiffies directly (e.g., MAX_SCHEDULE_TIMEOUT when disabled, otherwise msecs_to_jiffies(tdr_timeout_ms)), and make sure the non-HAVE_6_15_drm_sched_init path also respects disabling via tdr_timeout_ms (it currently still uses HWCTX_MAX_TIMEOUT).
This patch set port TDR implementation to the new
drivers/accel/amdxdnadriver.Compared to the legacy TDR implementation, this implementation does:
detection: it compares submission and completion atomic counters in the device wide TDR, in TDR timer callback, it checks if submission and completion counters, if they are different, it is considered potentially context times out. When a job is scheduled to run, the submissions of device wide TDR will be increased by one, when the job is notified to complete, the completions counter will be decreased by one. In the legacy driver, it doesn't use the counters, it only checks if it is if there is any job completes, and in TDR timer callback, it scanned through device wide run queue. In the new implementation, it doesn't need to scan through the device wide contexts. This will reduce TDR detection overhead (mutex lock and time on scanning). However, this new method may need update if in future if submitting jobs mechanism is changed. And thus this part is request for comments, as I can also port the legacy detection mechanism in the new driver.
make recovery procedure functions common, so that in future next hardware generation can call the same function for recovery.
Tests: