Add explicit position_ids to GPT-Neo attention layers#44687

Closed
vxa8502 wants to merge 1 commit into huggingface:main from vxa8502:fix/gpt-neo-position-ids-32937

Conversation


@vxa8502 vxa8502 commented Mar 13, 2026

Partially fixes #32937

Adds explicit position_ids threading through GPT-Neo's attention layers to enable flash attention's packed sequence optimization.

Context

GPT-Neo uses learned absolute position embeddings (wpe) applied at the model level, unlike RoPE models (Llama, Falcon) that apply rotations inside attention. The position_ids parameter passed to attention layers serves two purposes:

  1. API consistency with other CausalLM models
  2. Packed sequence support — enables _flash_attention_forward() to detect packed sequences via _is_packed_sequence() (batch_size=1 edge case)
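The packed-sequence case mentioned above can be illustrated with a minimal, self-contained sketch. This is not the transformers implementation; the function names only mirror the idea behind `_is_packed_sequence()`: with batch_size=1, several independent sequences may be concatenated into one row, and a position counter resetting to 0 mid-row marks each boundary.

```python
# Hypothetical sketch of packed-sequence detection via position_ids.
# Names mirror, but do not reproduce, the transformers internals.

def is_packed_sequence(position_ids):
    """A batch_size=1 row is 'packed' if position ids reset to 0 mid-row,
    i.e. several independent sequences were concatenated into one row."""
    row = position_ids[0]  # single row, batch_size == 1
    return any(p == 0 for p in row[1:])

def split_packed(position_ids):
    """Recover per-sequence lengths from the reset points."""
    row = position_ids[0]
    lengths, start = [], 0
    for i in range(1, len(row)):
        if row[i] == 0:  # position counter reset -> new sequence starts here
            lengths.append(i - start)
            start = i
    lengths.append(len(row) - start)
    return lengths
```

Without position_ids reaching the attention layer, the kernel cannot distinguish one long sequence from several packed ones, which is why the parameter has to be threaded all the way down.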

Changes

Add position_ids parameter to:

  • GPTNeoSelfAttention.forward()
  • GPTNeoFlashAttention2.forward()
  • GPTNeoAttention.forward()
  • GPTNeoBlock.forward()

Pass position_ids to _flash_attention_forward() call.
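The threading described above amounts to adding an optional kwarg at each level and forwarding it unchanged to the innermost consumer. A toy sketch (not the real modeling code; class names only echo GPTNeoBlock → GPTNeoAttention → GPTNeoSelfAttention):

```python
# Minimal sketch of threading an optional position_ids kwarg through
# nested attention wrappers, as the PR does for GPT-Neo.

class SelfAttention:
    def forward(self, hidden_states, position_ids=None):
        # Innermost consumer: here the real code would hand position_ids
        # to the flash-attention kernel for packed-sequence handling.
        return hidden_states, position_ids

class Attention:
    def __init__(self):
        self.attention = SelfAttention()

    def forward(self, hidden_states, position_ids=None):
        return self.attention.forward(hidden_states, position_ids=position_ids)

class Block:
    def __init__(self):
        self.attn = Attention()

    def forward(self, hidden_states, position_ids=None):
        return self.attn.forward(hidden_states, position_ids=position_ids)
```

Because the kwarg defaults to None at every level, callers that never pass position_ids are unaffected, which keeps the change backward compatible.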

Tests

  • All 105 model tests pass
  • All 28 generation tests pass, including test_generate_with_and_without_position_ids


@saivedant169 saivedant169 left a comment


Thanks for working on this — same issue I tackled for RoFormer/Bloom/MPT in #44705-7.

A couple of observations from my experience with the other models:

  1. GPT-Neo uses learned absolute embeddings, not RoPE — so position_ids needs to actually be consumed somewhere in the position embedding layer for this to have functional impact. In the current diff, position_ids is threaded through to the attention function via ALL_ATTENTION_FUNCTIONS, but does the embedding layer (GPTNeoModel.wpe) use it? If not, the parameter is accepted but silently ignored, which is fine for API consistency (same as the Bloom/MPT approach) but worth noting in the PR description.

  2. The **kwargs additions (lines 294, 344, 493) look like a separate concern from position_ids. Was this needed to pass through some other arguments, or was it introduced to forward position_ids specifically? If it's unrelated, splitting it out would make the diff cleaner for reviewers.

  3. Test results — did you run test_for_generate_causal_lm? When I added position_ids to RoFormer, the 2D shape from GenerationMixin caused a shape mismatch that needed handling. Worth confirming GPT-Neo's generation path works.

Author

vxa8502 commented Mar 16, 2026

@saivedant169 Thanks for the thorough review. Addressing each point:

  1. Position embedding usage: Correct. GPT-Neo uses learned embeddings via wpe(position_ids) at the model level, not RoPE in attention. The attention-level position_ids only enables flash attention's packed sequence optimization (batch_size=1 edge case). For standard usage, this is API consistency. Will clarify in PR description.

  2. **kwargs additions: Removed. Will push the cleaner diff.

  3. Generation shape: All 28 generation tests pass, including test_generate_with_and_without_position_ids.

Pushing updated changes shortly.
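The distinction in point 1 can be made concrete with a toy sketch: learned absolute position embeddings are looked up once at the model level (wpe), so attention-level position_ids carries no positional signal for standard usage. The lookup table and values below are hypothetical stand-ins for the actual nn.Embedding.

```python
# Toy stand-in for GPT-Neo's learned absolute position embeddings.
# Real code uses nn.Embedding; integers here keep the sketch exact.
wpe = {0: [10], 1: [20], 2: [30]}  # hypothetical learned table

def embed(token_embeds, position_ids):
    """GPT-Neo style: hidden_states = token_embeds + wpe(position_ids),
    applied once at the model level, before any attention layer runs."""
    return [
        [t + p for t, p in zip(tok, wpe[pid])]
        for tok, pid in zip(token_embeds, position_ids)
    ]
```

After this addition, positions are already baked into hidden_states, which is why the attention layers only need position_ids as metadata for the flash-attention packed-sequence path, not for positional encoding itself.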

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt_neo
