remove stale xfail for batched decode paged-parity test #158
renqHIT wants to merge 1 commit into vllm-project:main
Conversation
The xfail on `test_batched_decode_matches` was added for issue vllm-project#119 (B=2 batched GEMM producing different floats than B=1). The test now passes consistently on main after recent paged kernel fixes (vllm-project#146, vllm-project#151). This follows PR vllm-project#149 which removed the same stale xfail for the greedy single-request test. Signed-off-by: Qiang <qren@integralads.com>
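For readers unfamiliar with the test being discussed, the idea behind a batched-vs-single decode parity check can be sketched with a toy stand-in model. This is not vllm-metal's actual test code; the "model" below is just a fixed embedding table plus a linear head with greedy argmax, so the single-sequence and batched paths are directly comparable (the real test exercises the Metal paged-attention kernel instead):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 32, 16
W = rng.standard_normal((HIDDEN, VOCAB)).astype(np.float32)    # toy output head
EMB = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)  # toy embeddings


def decode(prompt_ids, steps=8):
    """Greedy-decode one sequence (batch size 1)."""
    ids = list(prompt_ids)
    for _ in range(steps):
        h = EMB[ids].mean(axis=0)          # toy "hidden state" for the sequence
        ids.append(int(np.argmax(h @ W)))  # greedy next token
    return ids[len(prompt_ids):]


def decode_batched(prompts, steps=8):
    """Greedy-decode several sequences at once (same math, batched GEMM)."""
    seqs = [list(p) for p in prompts]
    for _ in range(steps):
        h = np.stack([EMB[s].mean(axis=0) for s in seqs])  # (B, HIDDEN)
        nxt = np.argmax(h @ W, axis=-1)                    # (B,) next tokens
        for s, t in zip(seqs, nxt):
            s.append(int(t))
    return [s[len(p):] for s, p in zip(seqs, prompts)]


# The parity property under test: batched decode must emit the same tokens
# as decoding each prompt alone. Issue #119 was exactly this check failing
# for B=2 on the real kernel.
prompts = [[1, 2, 3], [4, 5]]
assert decode_batched(prompts) == [decode(p) for p in prompts]
```

The real failure mode in #119 lived below this level: a B=2 batched GEMM producing slightly different floats than B=1, occasionally flipping an argmax and diverging the token stream.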
Hi @LxYuan0420 @WindChimeRan 👋 I'm new here and excited to have found this project! I'm running a Mac Studio with an M3 Ultra (256GB) and am very interested in efficient local inference across different model sizes. Still getting familiar with the codebase. This follows PR #149, which removed the same stale xfail for the greedy test; this PR removes the remaining one. Thanks!
LxYuan0420
left a comment
Thanks!
I tested it, and on an M2 Max (mlx 0.30.6) `test_batched_decode_matches` still fails deterministically with token divergence, so I don't think we can drop the xfail repo-wide yet. Could @WindChimeRan confirm whether batched parity is expected to be fixed across M1/M2 as well? If not, we should keep the xfail (or make it chip-conditional) for now.
Thanks for testing on M2 Max @LxYuan0420. I only have an M3 Ultra, so I couldn't catch the cross-chip difference. Makes sense to keep the xfail for now. Happy to help with a chip-conditional approach if needed after @WindChimeRan weighs in. My env: M3 Ultra 256GB, MLX 0.31.0, mlx_lm 0.29.1; passes 3/3 runs.
@LxYuan0420 @renqHIT Thanks for testing. Let's keep the test as is for now. We will have multiple kernel PRs soon, and they may flip tests on and off frequently. Once we have a stable varlen kernel, we can open a new issue to track numerical stability systematically.
…167) `test_metal_kernel_paged.py` re-implements vllm-metal internals (cache setup, prefill/decode orchestration, context management) to compare two paths. This scaffolding introduces additional complexity, making failures hard to attribute. Delete it and add its prompts to `test_paged_deterministic.py`, which does the same comparison end-to-end through the real vLLM stack against golden tokens. Related: #158 #149 #119 --------- Signed-off-by: ran <hzz5361@psu.edu>
Summary
- Remove the `xfail` marker on `test_batched_decode_matches` in `test_metal_kernel_paged.py`
- The `xfail` was added for issue "Metal paged-attention parity mismatch vs standard path" #119 (B=2 batched GEMM float divergence), but the test now passes consistently on `main` after recent paged kernel fixes: "[varlen Kernel] Paged varlen flash attention for Metal [1/n]" #146 and "[Continuous Batching] Packed prefill with cu_seq_lens for multiple requests" #151
- Follows #149, which removed the same stale `xfail` for `test_greedy_output_matches`

Test plan
- `test_batched_decode_matches` passes locally on M3 Ultra (was previously reported as XPASS)