
remove stale xfail for batched decode paged-parity test#158

Closed
renqHIT wants to merge 1 commit intovllm-project:mainfrom
renqHIT:chore/remove-batched-decode-xfail

Conversation


@renqHIT renqHIT commented Mar 13, 2026

Summary

The xfail on `test_batched_decode_matches` was added for issue vllm-project#119
(B=2 batched GEMM producing different floats than B=1). The test now
passes consistently on main after recent paged kernel fixes (vllm-project#146, vllm-project#151).
This follows PR vllm-project#149, which removed the same stale xfail for the
greedy single-request test.

Test plan

  • `test_batched_decode_matches` passes locally on M3 Ultra (was previously xpassed)
  • Full test suite passes: 283 passed, 0 xpassed, 4 skipped
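For context, a minimal sketch of the kind of change involved. The `decode` helper below is a stand-in, not the repo's actual code; only the marker shown in the comment corresponds to what this PR removes.

```python
def decode(batch_size):
    # Stand-in for the real decode path; returns per-request token ids.
    # In the real test, batched (B=2) and single (B=1) decode outputs
    # are compared for exact parity.
    return [[1, 2, 3] for _ in range(batch_size)]

# The test previously carried a marker like:
#   @pytest.mark.xfail(reason="issue #119: B=2 batched GEMM diverges from B=1")
# Deleting it turns a silent xpass into a normal pass, and any future
# regression into a hard failure.
def test_batched_decode_matches():
    single = decode(batch_size=1)[0]
    batched = decode(batch_size=2)[0]
    assert batched == single
```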

Signed-off-by: Qiang <qren@integralads.com>
Author

renqHIT commented Mar 13, 2026

Hi @LxYuan0420 @WindChimeRan 👋

I'm new here — excited to have found this project! I'm running a Mac Studio with M3 Ultra (256GB) and very interested in efficient local inference across different model sizes. Still getting familiar with the codebase.

This follows PR #149 — the remaining xfail for test_batched_decode_matches (issue #119) now consistently passes on main (xpassed in full test run). Likely fixed by the kernel changes in #146 or #151. Could you confirm this is safe to remove?

Thanks!

Collaborator

@LxYuan0420 LxYuan0420 left a comment


Thanks!

I tested it and on an M2 Max (mlx 0.30.6) test_batched_decode_matches still fails deterministically with token divergence, so I don’t think we can drop the xfail repo-wide yet. Could @WindChimeRan confirm whether batched parity is expected to be fixed across M1/M2 as well? If not, we should keep the xfail (or make it chip conditional) for now.

@renqHIT
Author

renqHIT commented Mar 13, 2026


Thanks for testing on M2 Max @LxYuan0420. I only have M3 Ultra so couldn't catch the cross-chip difference.

Makes sense to keep the xfail for now. Happy to help with a chip-conditional approach if needed after @WindChimeRan weighs in.

My env: M3 Ultra 256GB, MLX 0.31.0, mlx_lm 0.29.1 — passes 3/3 runs.

@WindChimeRan
Collaborator

@LxYuan0420 @renqHIT Thanks for testing. Let's keep the test as is for now. We will have multiple kernel PRs soon, and they may flip tests on and off very frequently.

When we have a stable varlen kernel, we can open a new issue to track numerical stability systematically.

LxYuan0420 pushed a commit that referenced this pull request Mar 17, 2026
…167)

`test_metal_kernel_paged.py` re-implements vllm-metal internals (cache
setup, prefill/decode orchestration, context management) to compare two
paths. This scaffolding introduces additional complexity, making
failures hard to attribute.

Delete it and add its prompts to `test_paged_deterministic.py`, which
does the same comparison end-to-end through the real vLLM stack against
golden tokens.

Related: #158, #149, #119

---------

Signed-off-by: ran <hzz5361@psu.edu>