Add explicit position_ids to GPT-Neo attention layers#44687

Closed
vxa8502 wants to merge 1 commit into huggingface:main from vxa8502:fix/gpt-neo-position-ids-32937

Conversation


@vxa8502 vxa8502 commented Mar 13, 2026

Partially fixes #32937

Adds explicit position_ids threading through GPT-Neo's attention layers to enable flash attention's packed sequence optimization.

Context

GPT-Neo uses learned absolute position embeddings (wpe) applied at the model level, unlike RoPE models (Llama, Falcon) that apply rotations inside attention. The position_ids parameter passed to attention layers serves two purposes:

  1. API consistency with other CausalLM models
  2. Packed sequence support — enables _flash_attention_forward() to detect packed sequences via _is_packed_sequence() (batch_size=1 edge case)
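The packed-sequence case mentioned above can be illustrated with a minimal, self-contained sketch. This is not the transformers implementation; the function names only mirror the idea behind `_is_packed_sequence()`: with batch_size=1, several independent sequences may be concatenated into one row, and a position counter resetting to 0 mid-row marks each boundary.

```python
# Hypothetical sketch of packed-sequence detection via position_ids.
# Names mirror, but do not reproduce, the transformers internals.

def is_packed_sequence(position_ids):
    """A batch_size=1 row is 'packed' if position ids reset to 0 mid-row,
    i.e. several independent sequences were concatenated into one row."""
    row = position_ids[0]  # single row, batch_size == 1
    return any(p == 0 for p in row[1:])

def split_packed(position_ids):
    """Recover per-sequence lengths from the reset points."""
    row = position_ids[0]
    lengths, start = [], 0
    for i in range(1, len(row)):
        if row[i] == 0:  # position counter reset -> new sequence starts here
            lengths.append(i - start)
            start = i
    lengths.append(len(row) - start)
    return lengths
```

Without position_ids reaching the attention layer, the kernel cannot distinguish one long sequence from several packed ones, which is why the parameter has to be threaded all the way down.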

Changes

Add position_ids parameter to:

  • GPTNeoSelfAttention.forward()
  • GPTNeoFlashAttention2.forward()
  • GPTNeoAttention.forward()
  • GPTNeoBlock.forward()

Pass position_ids to _flash_attention_forward() call.
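The threading described above amounts to adding an optional kwarg at each level and forwarding it unchanged to the innermost consumer. A toy sketch (not the real modeling code; class names only echo GPTNeoBlock → GPTNeoAttention → GPTNeoSelfAttention):

```python
# Minimal sketch of threading an optional position_ids kwarg through
# nested attention wrappers, as the PR does for GPT-Neo.

class SelfAttention:
    def forward(self, hidden_states, position_ids=None):
        # Innermost consumer: here the real code would hand position_ids
        # to the flash-attention kernel for packed-sequence handling.
        return hidden_states, position_ids

class Attention:
    def __init__(self):
        self.attention = SelfAttention()

    def forward(self, hidden_states, position_ids=None):
        return self.attention.forward(hidden_states, position_ids=position_ids)

class Block:
    def __init__(self):
        self.attn = Attention()

    def forward(self, hidden_states, position_ids=None):
        return self.attn.forward(hidden_states, position_ids=position_ids)
```

Because the kwarg defaults to None at every level, callers that never pass position_ids are unaffected, which keeps the change backward compatible.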

Tests

  • All 105 model tests pass
  • All 28 generation tests pass, including test_generate_with_and_without_position_ids


@saivedant169 saivedant169 left a comment


Thanks for working on this — same issue I tackled for RoFormer/Bloom/MPT in #44705-7.

A couple of observations from my experience with the other models:

  1. GPT-Neo uses learned absolute embeddings, not RoPE — so position_ids needs to actually be consumed somewhere in the position embedding layer for this to have functional impact. In the current diff, position_ids is threaded through to the attention function via ALL_ATTENTION_FUNCTIONS, but does the embedding layer (GPTNeoModel.wpe) use it? If not, the parameter is accepted but silently ignored, which is fine for API consistency (same as the Bloom/MPT approach) but worth noting in the PR description.

  2. The **kwargs additions (lines 294, 344, 493) look like a separate concern from position_ids. Was this needed to pass through some other arguments, or was it introduced to forward position_ids specifically? If it's unrelated, splitting it out would make the diff cleaner for reviewers.

  3. Test results — did you run test_for_generate_causal_lm? When I added position_ids to RoFormer, the 2D shape from GenerationMixin caused a shape mismatch that needed handling. Worth confirming GPT-Neo's generation path works.

Author

vxa8502 commented Mar 16, 2026

@saivedant169 Thanks for the thorough review. Addressing each point:

  1. Position embedding usage: Correct. GPT-Neo uses learned embeddings via wpe(position_ids) at the model level, not RoPE in attention. The attention-level position_ids only enables flash attention's packed sequence optimization (batch_size=1 edge case). For standard usage, this is API consistency. Will clarify in PR description.

  2. **kwargs additions: Removed. Will push the cleaner diff.

  3. Generation shape: All 28 generation tests pass, including test_generate_with_and_without_position_ids.

Pushing updated changes shortly.
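The distinction in point 1 can be made concrete with a toy sketch: learned absolute position embeddings are looked up once at the model level (wpe), so attention-level position_ids carries no positional signal for standard usage. The lookup table and values below are hypothetical stand-ins for the actual nn.Embedding.

```python
# Toy stand-in for GPT-Neo's learned absolute position embeddings.
# Real code uses nn.Embedding; integers here keep the sketch exact.
wpe = {0: [10], 1: [20], 2: [30]}  # hypothetical learned table

def embed(token_embeds, position_ids):
    """GPT-Neo style: hidden_states = token_embeds + wpe(position_ids),
    applied once at the model level, before any attention layer runs."""
    return [
        [t + p for t, p in zip(tok, wpe[pid])]
        for tok, pid in zip(token_embeds, position_ids)
    ]
```

After this addition, positions are already baked into hidden_states, which is why the attention layers only need position_ids as metadata for the flash-attention packed-sequence path, not for positional encoding itself.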

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt_neo
