
Extend DataCollatorForCompletionOnlyLM to support correct tool result masking#2369

Open
shanghongsim wants to merge 31 commits into main from shanghong/masking-method-enum-v2

Conversation


@shanghongsim shanghongsim commented Apr 14, 2026

Description

Context

Currently, the completions-only collator does not consistently mask tool responses (Role.TOOL) correctly. For some chat templates (like Qwen's), it happens to be correct by accident, while for Llama chat templates masking is consistently applied incorrectly. This PR makes masking correct for all chat templates, supporting all four roles (system, user, assistant, and tool).

Without instruction_template, tool results are masked correctly, but with instruction_template some tool results are left unmasked. Omitting instruction_template causes the collator to find the last response_template and mask everything before it (so only the final assistant turn trains). With instruction_template, the collator searches for instruction_template/response_template pairs and masks everything in between.

Case 1: without instruction_template

```python
RESPONSE_TEMPLATE = "<|im_start|>assistant\n"

old_no_inst = DataCollatorForCompletionOnlyLM(
    response_template=RESPONSE_TEMPLATE,
    instruction_template=None,
    tokenizer=tokenizer,
)
b = old_no_inst.torch_call([token_ids])
labels_A = b["labels"][0].tolist()
summarise("Case A", labels_A, N)
show_masking(b["input_ids"][0].tolist(), labels_A, tokenizer)
```
[image: Case A masking output]

Without an instruction template, the collator masks everything before the last response template, so the loss only sees:

```text
<|im_start|>assistant
The weather in Paris is sunny and 18°C.<|im_end|>
```

Case 2: with instruction_template

With instruction_template, it masks everything between an instruction and a response template. Since there is no instruction template before the tool result, the tool result does not get masked properly.

```python
INSTRUCTION_TEMPLATE = "<|im_start|>user\n"

old_with_inst = DataCollatorForCompletionOnlyLM(
    response_template=RESPONSE_TEMPLATE,
    instruction_template=INSTRUCTION_TEMPLATE,
    tokenizer=tokenizer,
)
b2 = old_with_inst.torch_call([token_ids])
labels_B = b2["labels"][0].tolist()
summarise("Case B", labels_B, N)
show_masking(b2["input_ids"][0].tolist(), labels_B, tokenizer)
```
[image: Case B masking output]

In case 2, the loss sees the tool result, which is incorrect.

```text
...
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call><|im_end|>
...
<|im_start|>assistant
The weather in Paris is sunny and 18°C.<|im_end|>
```

To confirm my hypothesis that the missing instruction template before the tool result is the cause, I experimented with adding a user turn before the tool call, and masking works correctly in that case.

[image: masking output with a user turn inserted before the tool call]

ToolAwareCompletionsCollator

With the new ToolAwareCompletionsCollator, tool results are masked properly.

Without instruction_template
[image: masking without instruction_template]

With instruction_template
[image: masking with instruction_template]

Change 1: Extend DataCollatorForCompletionOnlyLM with span-based masking

Extend DataCollatorForCompletionOnlyLM with span-based masking (detecting assistant-turn boundaries via response and end-of-turn tokens), in addition to the instruction-based masking (matching instruction/response string pairs) that is currently supported.

Change 2: train_target enum for explicit control of which part of response to train on

  • assistant_turn: Masks everything, then unmasks each assistant response bounded by response_template .. end_of_turn_template (inclusive of EOT).
  • final_assistant_turn: Masks all tokens before the last response_template occurrence.
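The two strategies can be sketched roughly as follows. This is an illustrative toy version operating on token-ID lists; the helper names (`find_spans`, `mask_assistant_turns`) are assumptions for exposition, not the collator's actual internals (the PR's helpers are `_find_pattern` / `_apply_span_masking`):

```python
IGNORE_INDEX = -100  # tokens with this label contribute no loss

def find_spans(seq: list[int], pattern: list[int]) -> list[int]:
    """Return the start index of every occurrence of pattern in seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i : i + n] == pattern]

def mask_assistant_turns(input_ids, response_ids, eot_ids, final_only=False):
    if final_only:
        # final_assistant_turn: mask every token before (and including)
        # the last occurrence of the response template.
        labels = list(input_ids)
        starts = find_spans(input_ids, response_ids)
        if starts:
            cutoff = starts[-1] + len(response_ids)
            labels[:cutoff] = [IGNORE_INDEX] * cutoff
        return labels
    # assistant_turn: mask everything, then unmask each span from the end
    # of a response template up to and including the next end-of-turn.
    labels = [IGNORE_INDEX] * len(input_ids)
    for start in find_spans(input_ids, response_ids):
        body = start + len(response_ids)
        ends = [e for e in find_spans(input_ids, eot_ids) if e >= body]
        end = (ends[0] + len(eot_ids)) if ends else len(input_ids)
        labels[body:end] = input_ids[body:end]
    return labels
```

Because the assistant_turn path starts fully masked and only unmasks assistant spans, a tool result never trains regardless of which role tag its chat template uses.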

When train_target is set, it has final authority over which masking combination is used, regardless of which templates are provided. When train_target is not set, it is inferred from the templates provided. Eventually, we want users to specify train_target instead of relying on combinations of templates to control behavior. However, since many production configs set templates explicitly and do not use train_target, this is the interim solution to keep old configs working. Existing configs using collator_kwargs with instruction_template + response_template continue to work via the legacy path (with a deprecation warning). New configs should use train_target explicitly. Users can also set train_target and override the templates manually via collator_kwargs.
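The fallback precedence described above might look roughly like this; the function name and return strings are illustrative stand-ins, not the actual builder code:

```python
import warnings

def resolve_train_target(train_target, instruction_template, end_of_turn_template):
    """Sketch of the precedence: explicit train_target wins; otherwise infer."""
    if train_target is not None:
        return train_target  # explicit setting always has final authority
    if instruction_template is not None:
        # Legacy instruction/response pair masking, kept for old configs.
        warnings.warn(
            "Instruction-based masking is deprecated; set train_target instead.",
            DeprecationWarning,
        )
        return "legacy_instruction_response"
    if end_of_turn_template is not None:
        return "assistant_turn"  # span-based masking is possible
    return "final_assistant_turn"  # only a response template: last turn only
```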

Before — users must provide model-specific token strings:

```yaml
collator_name: "text_completions_only_with_padding"
collator_kwargs:
  response_template: "<|im_start|>assistant\n"
  end_of_turn_template: "<|im_end|>"
```

After — users express intent, templates auto-resolve from tokenizer:

```yaml
collator_name: "text_completions_only_with_padding"
train_target: "assistant_turn"
```

Change 3: Auto detection of collator templates

Instead of fragile vocab/collator template matching (which requires us to maintain collator templates for all popular models), shift to a more robust approach: apply the chat template to a known test conversation, then find the assistant boundary strings in the rendered output.
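A rough sketch of the sentinel-rendering idea, assuming a Hugging Face-style `apply_chat_template`; the sentinel strings and the newline heuristic (standing in for the PR's token-ID common-prefix logic) are assumptions:

```python
SENTINEL_USER = "USER_SENTINEL_7f3a"
SENTINEL_ASST = "ASST_SENTINEL_9c1b"

def detect_templates(tokenizer):
    """Render a sentinel conversation and extract assistant boundary strings."""
    msgs = [
        {"role": "user", "content": SENTINEL_USER},
        {"role": "assistant", "content": SENTINEL_ASST},
        {"role": "user", "content": SENTINEL_USER},
        {"role": "assistant", "content": SENTINEL_ASST},
    ]
    rendered = tokenizer.apply_chat_template(msgs, tokenize=False)
    # Work from the second turn pair to sidestep BOS/system-prompt effects.
    a1 = rendered.index(SENTINEL_ASST)
    u2 = rendered.index(SENTINEL_USER, a1)
    a2 = rendered.index(SENTINEL_ASST, u2)
    between_asst_user = rendered[a1 + len(SENTINEL_ASST) : u2]
    between_user_asst = rendered[u2 + len(SENTINEL_USER) : a2]
    # End-of-turn: longest common character prefix of the two gaps.
    common = []
    for c1, c2 in zip(between_asst_user, between_user_asst):
        if c1 != c2:
            break
        common.append(c1)
    prefix = "".join(common)
    # Heuristic stand-in for the PR's token-ID common prefix: cut back to the
    # last newline so the next turn's header isn't swallowed into the EOT.
    cut = prefix.rfind("\n") + 1
    end_of_turn_template = prefix[:cut] if cut else prefix
    # Response template: whatever precedes the assistant content.
    response_template = between_user_asst[len(end_of_turn_template):]
    if not response_template.strip() or not end_of_turn_template.strip():
        raise ValueError("Could not auto-detect collator templates.")
    return response_template, end_of_turn_template
```

On a Qwen-style template this would yield `"<|im_start|>assistant\n"` and `"<|im_end|>\n"` without any per-family hardcoding.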

Migration and deprecation considerations:

  1. Configs that only specify collator_name -> support removed along with the default Llama template fallback. This should be fine, as no production configs specify collator_name without collator_kwargs. Fewer than 10 OSS configs fall into this edge case, and they are updated in this PR.
  2. Enterprise SFT configs -> these should still work with the old legacy behavior. Will update them once the OSS version is updated in the API.

Open issues for next time

  • Move train_target into collator_kwargs -> I considered putting train_target inside collator_kwargs instead of making it a separate field, but collator_kwargs is an opaque dict, so users wouldn't know train_target exists unless they read the docs. A top-level field makes it more visible. This will eventually be solved when we shift to CollatorParam.
  • CollatorParams: This would mean changing every YAML from:

```yaml
collator_name: "text_completions_only_with_padding"
collator_kwargs:
  train_target: "all_assistant_turns"
```

to

```yaml
collator:
  name: "text_completions_only_with_padding"
  train_target: "all_assistant_turns"
```

This is much cleaner, but it's a substantial refactor. To keep the scope of this PR manageable, it is deferred to a future PR.

  • Vision-specific concerns -> future PR to tidy up the vision concerns that are now scattered throughout the LM-specific code.
  • Per-split collator configuration -> build_collator_from_config builds a single collator from the training split's settings, which is reused for all splits (train, validation, test). Validation/test split-specific collator settings (e.g. a different train_target) are silently ignored. This is pre-existing behavior not introduced by this PR, but worth noting as a known limitation. Fixing it would require refactoring the training loop to build per-split collators.

Related issues

N/A

Before submitting

  • This PR only changes documentation.
  • Did you read the contributor guideline Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

Merge tool-aware masking directly into the existing TRL-derived collator.
Adds masking_method parameter with three strategies: assistant_turn,
assistant_turn_no_tools, and final_assistant_turn.

- Add _find_pattern, _span_contains, _apply_span_masking to DataCollatorForCompletionOnlyLM
- Rename instruction_prefix/response_prefix to instruction_template/response_template
- Add end_of_turn_template, tool_call_start_template, masking_method params
- Update builder to pass new params and support span-based masking
- Add comprehensive tests for all masking strategies
@gitar-bot

gitar-bot bot commented Apr 14, 2026


Comment thread src/oumi/builders/collators.py
The wrapper never uses list[int] directly — it passes through to
DataCollatorForCompletionOnlyLM. All callers pass strings.
Update tests to pass string templates instead of token ID lists.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 25c06fd to 9facd20 on April 14, 2026, 20:18
Raise ValueError at init time when masking_method is assistant_turn
or assistant_turn_no_tools but end_of_turn_template is None.
Previously this silently passed init and crashed with AssertionError
on the first batch during training.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 9facd20 to 0ae3142 on April 14, 2026, 20:25
…class level

Replace 4 repeated isinstance/encode blocks with a single helper method.
Move _KNOWN_MASKING_METHODS from __init__ local to class-level constant.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 0ae3142 to 3f4dba3 on April 14, 2026, 20:40
@shanghongsim shanghongsim changed the title from "Add MaskingMethod enum for explicit SFT masking control" to "Add MaskingMethod enum for simpler masking control" on Apr 14, 2026
Comment thread src/oumi/builders/collators.py Outdated
Replace reference to mask_tool_calls=True with the correct usage:
masking_method='assistant_turn_no_tools'.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from f957244 to 2b35a20 on April 14, 2026, 21:55
Comment thread src/oumi/builders/collators.py Outdated
Emit DeprecationWarning when the collator infers
_legacy_instruction_response masking from the presence of
instruction_template. Guides users toward masking_method.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 2b35a20 to 676c5b5 on April 14, 2026, 22:31
Comment thread src/oumi/builders/collators.py Outdated
When masking_method is not explicit and both end_of_turn_template and
tool_call_start_template are present, infer assistant_turn_no_tools
instead of assistant_turn.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch 2 times, most recently from c7a1308 to 9a80c42 on April 14, 2026, 22:46
Replace nested if/elif chain with a classmethod that validates
or infers masking_method from template presence. Keeps __init__
focused on tokenization and validation.
@shanghongsim shanghongsim force-pushed the shanghong/collator-span-masking branch from 8309130 to 4c081d7 on April 14, 2026, 22:50
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 9a80c42 to 778351f on April 14, 2026, 22:50
@shanghongsim shanghongsim force-pushed the shanghong/collator-span-masking branch from 86c00e7 to df0d473 on April 16, 2026, 22:04
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from e9ae1ef to 9913f12 on April 16, 2026, 22:05
shanghongsim and others added 11 commits April 16, 2026 22:06
Drop the ASSISTANT_TURN_NO_TOOLS enum member and the
tool_call_start_template from builder templates, matching the
core collator simplification in PR1.
MaskingMethod was confusing — it sounded like the assistant turns
were being masked. TrainTarget with ALL_ASSISTANT_TURNS /
FINAL_ASSISTANT_TURN makes the intent clear: select what to
train on.
Instead of checking for marker tokens in the tokenizer vocabulary and
returning hardcoded template strings, render the tokenizer's own chat
template with sentinel content and extract response_template and
end_of_turn_template from the output. This works for any model with a
chat template without requiring per-family hardcoded entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mutual exclusivity check so users can set train_target for
auto-resolution and override individual templates via collator_kwargs.
@shanghongsim shanghongsim force-pushed the shanghong/masking-method-enum-v2 branch from 9913f12 to fbc7087 on April 16, 2026, 22:08
Move train_target inference from the collator into
build_collator_from_config so the builder is the single decision
point. The collator now receives a resolved train_target and only
validates structural invariants.

- Builder handles new config (train_target set, auto-detect
  templates) and old config (infer train_target from collator_kwargs)
- Collator: remove _resolve_train_target, make train_target required,
  derive _VALID_TRAIN_TARGETS from TrainTarget enum
- Add builder tests for old-recipe inference paths
@shanghongsim shanghongsim changed the title from "Add MaskingMethod enum for simpler masking control" to "Add train_target-based masking to DataCollatorForCompletionOnlyLM" on Apr 17, 2026
@shanghongsim shanghongsim changed the base branch from shanghong/collator-span-masking to main on April 17, 2026, 00:16
@shanghongsim shanghongsim changed the title from "Add train_target-based masking to DataCollatorForCompletionOnlyLM" to "Extend DataCollatorForCompletionOnlyLM to support correct tool result masking" on Apr 17, 2026
Comment thread src/oumi/builders/collators.py
Comment thread src/oumi/builders/collators.py Outdated
Comment on lines +114 to +115:

```python
if not response_template.strip() or not end_of_turn_template.strip():
    raise ValueError(_FALLBACK_MSG)
```

This comment was marked as outdated.

Comment on lines 243 to 249:

```python
        debug=debug,
        ignore_index=(
            label_ignore_index if label_ignore_index is not None else -100
        ),
        **kwargs,
    )
raise ValueError(f"Unknown data collator name: '{collator_name}'")
```

Bug: The TextCompletionsCollatorWithPadding constructor can receive ignore_index both as an explicit argument and within **kwargs, causing a TypeError if a user customizes it.
Severity: HIGH

Suggested Fix

Before calling the TextCompletionsCollatorWithPadding constructor, pop ignore_index from the kwargs dictionary. Use the popped value to determine the final ignore_index value to be passed, preventing the argument duplication.

Prompt for AI Agent:

```text
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/oumi/builders/collators.py#L243-L249

Potential issue: When building the `text_completions_only_with_padding` collator in `build_data_collator`, the `ignore_index` parameter is passed explicitly while also being potentially present in the `**kwargs` passed to the `TextCompletionsCollatorWithPadding` constructor. If a user specifies `ignore_index` in their `collator_kwargs` configuration, this results in a `TypeError` because the `__init__` method receives multiple values for the same keyword argument, causing a runtime crash during training setup.
```
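A minimal sketch of the suggested fix, with a hypothetical helper name; the idea is just to pop `ignore_index` out of the user kwargs before the constructor call so the keyword is passed exactly once:

```python
def resolve_ignore_index(collator_kwargs: dict, label_ignore_index) -> int:
    """Pop ignore_index from user kwargs so it can't be passed twice.

    Precedence: user-specified kwarg > config label_ignore_index > -100.
    """
    user_value = collator_kwargs.pop("ignore_index", None)
    if user_value is not None:
        return user_value
    return label_ignore_index if label_ignore_index is not None else -100
```

The builder would then call `resolve_ignore_index(kwargs, label_ignore_index)` first and forward the remaining `**kwargs`, avoiding the duplicate-keyword `TypeError`.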

Comment thread excerpt:

```python
    Raises:
        ValueError: If templates cannot be extracted.
    """
    msgs = [
```
Contributor:

This is really clever, but we should maybe include a system instruction here?

Contributor Author:

Like modify it to extract the system instruction? We are currently no longer relying on the system instruction for masking.

Contributor:

No I mean ensure that when we extract the different template portions, we do so using an example conversation that also includes a system instruction

Or rather, perhaps do it twice, once on a conversation with a system instruction, and once on a conversation without.

Why? Because I'm a bit concerned about the scenario where this logic doesn't use system instructions when extracting, but the user does in their data, and somehow the resulting templates we extract using this methodology don't wind up working when the data includes a system instruction for some reason.

```python
# Locate boundaries around the second turn pair
# to avoid system-prompt effects on the first turn.
try:
    a1 = rendered.index(_SENTINEL_ASST)
```
Contributor:

nit: first_asst_start, and add _start to second turns

```python
except ValueError:
    raise ValueError(_FALLBACK_MSG)

# End-of-turn: common token-ID prefix of the two strings that
```
Contributor:

extract this to its own method

```python
assert isinstance(_eot_decoded, str)
end_of_turn_template = _eot_decoded

# Response template: strip the EOT prefix to get just the assistant header.
```
Contributor:

Same here, extract to its own method

```python
if not response_template.strip() or not end_of_turn_template.strip():
    raise ValueError(_FALLBACK_MSG)

# Qwen3 and similar reasoning models inject <think>...</think> into
```
Contributor:

This description scares me, I wonder if we could have a better workaround or a louder error

```python
from oumi.core.configs.params.data_params import TrainTarget


class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
```
Contributor:

It's not clear to me what this collator does differently than the other one.

Contributor Author:

Previously, it would find instruction_template/response_template pairs and mask everything in between (see L324). However, there is no instruction_template before the tool result, so the tool result content does not get masked properly. Some models (like Qwen) accidentally mask it correctly because their chat template stores the tool result under the user role. But for most models (like Llama), the tool result is stored under a tool role with a separate tag, so masking is incorrectly applied for the reasons above. The new approach masks everything, then unmasks only the portions between response_template and end_of_turn_template. This is more robust to the different roles and to the specifics of how tool results are formatted and stored by different chat templates.

[image: masking comparison]

Contributor @lefft left a comment:

Can we be sure to check whether template detection and collation happens properly when user or assistant turns begin with (one or more) leading \ns? We had issues with this breaking tokenization and leading to records being skipped in enterprise, would like to understand whether these changes will help, harm, or not impact that issue.

```python
)


def _resolve_collator_templates(
```
Contributor:

Has this been tested with gpt-oss models? They use a non-standard system for marking boundaries within responses, so wondering if boundary detection will work as it does for Qwen/Llama style templates.

```python
    collator_kwargs["train_target"] = "all_assistant_turns"
elif has_inst:
    warnings.warn(
        "Instruction-based masking is deprecated.\n"
```
Contributor:

Is there a plan/intention to eliminate support for instruction-based masking altogether, or will we just keep it in a "deprecated" state?

Want to make sure we retain compatibility with existing enterprise configs (or verify that moving to this style doesn't change training dynamics before adapting enterprise configs).
