Add MBPP evaluator #2439

Open
natke wants to merge 11 commits into main from feature/add-mbpp-evaluator

Conversation

natke (Contributor) commented Apr 29, 2026

Summary

Adds MBPP evaluator support for the ortgenai backend, enabling generative LM eval tasks (e.g., MBPP, HumanEval) via lm-eval. Key changes include:

  • Add generate_until support to LMEvalORTGenAIEvaluator for token-by-token generation with stop-sequence handling
  • Wire confirm_run_unsafe_code opt-in flag from CLI through to lm_eval.simple_evaluate for MBPP safety compliance
  • Improve stop-sequence trimming to cut at the earliest match across all provided stop strings (see the sketch after this list)
  • Improve generate_until with full-sequence decode for correct tokenizer whitespace/punctuation handling
  • Harden gen_kwargs parsing: max tokens, temperature, and do_sample coercion for string/int inputs
  • Add evaluator tests for unsafe-code flag propagation, multi-stop ordering, and token parsing edge cases
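
A minimal sketch of that earliest-match trimming (illustrative only, not the exact code in this PR):

def trim_at_earliest_stop(text: str, stops: list[str]) -> str:
    # Cut at the first occurrence of ANY stop string, so a later-listed
    # stop cannot mask one that appears earlier in the text.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# trim_at_earliest_stop("def f():\n    pass\nclass A:", ["class", "\ndef"])
# returns "def f():\n    pass\n" -- "class" is the earliest match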

Example Olive config

The evaluator can be used standalone against a pre-existing ONNX model — no optimization passes are required. The passes field defaults to empty, and evaluate_input_model defaults to true, so Olive evaluates the input model directly:

{
    "input_model": {
        "type": "OnnxModel",
        "model_path": "models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4"
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "accelerators": [{ "device": "cpu", "execution_providers": ["CPUExecutionProvider"] }]
        }
    },
    "evaluators": {
        "lm_evaluator": {
            "type": "LMEvaluator",
            "tasks": ["mbpp"],
            "model_class": "ortgenai",
            "batch_size": 1,
            "max_length": 2048,
            "limit": 10,
            "confirm_run_unsafe_code": true
        }
    },
    "target": "local_system",
    "evaluator": "lm_evaluator"
}

Run with:

olive run -c config.json

Or via the CLI directly (no config file needed):

olive benchmark -m models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4 \
    --tasks mbpp \
    --backend ortgenai \
    --confirm_run_unsafe_code

Note: MBPP executes generated Python code to check correctness. The confirm_run_unsafe_code flag (or --confirm_run_unsafe_code on the CLI) is required to acknowledge this.

Validation

  • pytest -q test/evaluator/test_olive_evaluator.py -k "generate_until or confirm_run_unsafe_code or lmeval"
  • pytest -q test/cli/test_cli.py -k "benchmark_command_onnxmodel_with_ortgenai_backend or benchmark_command_hfmodel"

natke and others added 2 commits April 29, 2026 16:04
- Implement generate_until in LMEvalORTGenAIEvaluator for code generation tasks
- Fix EOS token handling for models with multiple EOS token IDs (e.g. Qwen)
- Add confirm_run_unsafe_code parameter to CLI and LMEvaluator for code eval tasks
- Update base ORT class error message to direct users to ortgenai backend
- Add unit tests for generate_until and confirm_run_unsafe_code

This enables running MBPP, HumanEval, and other code generation benchmarks
through Olive's ortgenai evaluation backend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Trim generation at earliest matching stop sequence across all provided stops
- Improve generate_until decoding efficiency by decoding incrementally
- Harden max token parsing for malformed generation kwargs
- Add evaluator tests for stop ordering and token parsing edge cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
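
A sketch of the multi-EOS handling from the first of these commits (assumed shape; the ids below are illustrative Qwen-style values, not verified against this PR):

def normalize_eos_ids(eos_token_id) -> list[int]:
    # Model configs may define eos_token_id as a single int or a list (e.g. Qwen).
    if eos_token_id is None:
        raise ValueError("model config defines no eos_token_id")
    ids = list(eos_token_id) if isinstance(eos_token_id, (list, tuple)) else [eos_token_id]
    if not ids:
        raise ValueError("eos_token_id list is empty")
    return ids

eos_ids = normalize_eos_ids([151645, 151643])  # illustrative multi-EOS config
# the generation loop then breaks as soon as: next_token in eos_ids
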
Copilot AI review requested due to automatic review settings April 29, 2026 23:42

Copilot AI left a comment

Pull request overview

This PR improves Olive’s lm-eval integration for the ORT GenAI backend by adding a working generate_until implementation with more robust stopping behavior, while also exposing lm-eval’s confirm_run_unsafe_code flag through Olive’s evaluator and CLI.

Changes:

  • Implement generate_until for ortgenai with EOS and multi-stop trimming, and improve the unsupported-backend error for plain ort.
  • Plumb confirm_run_unsafe_code through LMEvaluator and olive benchmark, with unit test coverage.
  • Add evaluator tests covering multi-stop ordering, EOS handling, incremental decode behavior, and max-token parsing edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • olive/evaluator/lmeval_ort.py: Adds ORT-GenAI generate_until, supports multiple EOS IDs, and improves ort backend error messaging.
  • olive/evaluator/olive_evaluator.py: Adds confirm_run_unsafe_code to LMEvaluator and forwards it to lm_eval.simple_evaluate.
  • olive/cli/benchmark.py: Adds --confirm_run_unsafe_code CLI flag and writes it into the generated run config when enabled.
  • test/evaluator/test_olive_evaluator.py: Adds unit tests for confirm_run_unsafe_code forwarding and ORT-GenAI generate_until behaviors.

Comment thread olive/evaluator/lmeval_ort.py Outdated (×2)
Comment thread test/evaluator/test_olive_evaluator.py Fixed (×8)
natke changed the title from "Fix ORT GenAI stop handling and generation robustness" to "Add MBPP evaluator" Apr 30, 2026
natke marked this pull request as draft April 30, 2026 01:23
- Remove dead generated_ids list
- Replace string concatenation with list accumulation to avoid quadratic growth
- Rename _eos_token_ids to eos_token_ids (public attribute) to fix protected-access lint warnings in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 17:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread olive/evaluator/lmeval_ort.py (×2)
Comment thread olive/evaluator/lmeval_ort.py Outdated
- Guard against empty eos_token_id list with a clear ValueError
- Early-return empty completion when prompt >= max_length or max_gen_toks == 0
  to avoid passing invalid max_length to the ORT GenAI generator
- Fix generated_text not being set when loop exits via EOS break
- Use tail-buffer for stop-sequence checking instead of full join per token
- Update test for max_gen_toks=0 to assert early-return behaviour

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 17:56

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread olive/evaluator/lmeval_ort.py Outdated (×3)
Comment thread test/evaluator/test_olive_evaluator.py
- Replace chunk-count tail window with character-based rolling tail string,
  fixing missed stop sequences when stop strings span more tokens than characters
- Track generated_len with a running counter to avoid O(n^2) join for tail_offset
- Coerce temperature from str/None safely with float() + fallback to 0.0
- Add parametrized tests for temperature coercion edge cases (string, None, zero, float)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
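
A sketch of the character-based rolling tail described in this commit (assumed shape; next_piece is a hypothetical per-token decode callable, not a name from this PR):

def generate_with_stops(next_piece, stops, max_new_tokens):
    # Keep only the last len(longest stop) - 1 characters between steps, so a
    # stop spanning token boundaries is still caught without re-joining the
    # full generation on every token.
    max_stop = max((len(s) for s in stops), default=0)
    chunks, tail = [], ""
    for _ in range(max_new_tokens):
        piece = next_piece()
        if piece is None:  # EOS reached
            break
        chunks.append(piece)
        if max_stop:
            tail += piece
            if any(s in tail for s in stops):
                break
            tail = tail[-(max_stop - 1):] if max_stop > 1 else ""
    return "".join(chunks)  # trimming at the stop itself happens afterwards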

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

test/evaluator/test_olive_evaluator.py:1

  • The tests cover the max_gen_toks == 0 early-return path, but not the prompt_len >= self.max_length branch (which also triggers cache-hook behavior). Add a unit test where tokenizer.encode(...).tolist() returns a prompt length >= evaluator.max_length, and assert it returns "" and writes the empty partial to the cache hook.

Comment thread olive/evaluator/lmeval_ort.py
Comment thread olive/evaluator/olive_evaluator.py Outdated
natke and others added 3 commits April 30, 2026 11:59
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ith older lm-eval

- Remove batch_size from set_search_options (not a valid kwarg)
- Use try/except TypeError for confirm_run_unsafe_code compat
- Remove unused import inspect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 19:20

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment thread olive/evaluator/lmeval_ort.py Outdated (×4)
…mple coercion

- olive_evaluator.py: use inspect.signature to check if lm-eval supports
  confirm_run_unsafe_code before passing it (avoids brittle try/except TypeError)
- lmeval_ort.py: accumulate generated_token_ids and decode full sequence once
  at end so tokenizer whitespace/punctuation normalisation is applied correctly;
  keep per-token incremental decode only for stop-sequence tail detection
- lmeval_ort.py: coerce do_sample to bool, handling string 'false'/'0'/'no'
  and int 0/1 so they are not unintentionally truthy
- tests: set __signature__ on simple_evaluate mocks; add older-lm-eval compat
  test; update decode-assertion for full-seq decode; add do_sample coercion test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
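
The coercions described here might look roughly like this (hypothetical helper names, not the PR's exact code):

def coerce_do_sample(value) -> bool:
    # Strings such as "false"/"0"/"no" must not be treated as truthy.
    if isinstance(value, str):
        return value.strip().lower() not in ("false", "0", "no", "")
    return bool(value)

def coerce_temperature(value) -> float:
    # Accept str/int/float/None; fall back to greedy decoding (0.0) on bad input.
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0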

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment thread olive/cli/benchmark.py Outdated
Comment thread olive/evaluator/lmeval_ort.py Outdated
Comment thread test/evaluator/test_olive_evaluator.py
…ort_genai stub

- olive_evaluator.py: wrap inspect.signature in try/except (TypeError, ValueError)
  so an unintrospectable wrapper never crashes evaluation
- benchmark.py: write confirm_run_unsafe_code as explicit bool (not 'or None')
  so False is always written rather than silently omitted as null
- lmeval_ort.py: normalise 'until' tuple/set/generator to list via list()
  so stop sequences in tuple form are not silently dropped
- test/evaluator/conftest.py: inject onnxruntime_genai stub so generate_until
  tests run without the real package installed
- test: remove onnxruntime_genai from skipif; add tuple-until regression test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
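
The 'until' normalisation from this commit is likely as simple as (sketch, assumed names):

def normalize_until(until):
    # lm-eval may pass a str, tuple, set, or generator of stop strings;
    # materialise everything into a concrete list so nothing is dropped.
    if until is None:
        return []
    if isinstance(until, str):
        return [until]
    return list(until)
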
…qdm, caching

- lmeval_ort.py: pass do_sample=True to set_search_options when sampling is
  enabled — without it ORT GenAI ignores temperature and runs greedy regardless
- benchmark.py: use default=None for --confirm_run_unsafe_code so False is not
  written to config when flag is omitted (prevents overriding config-file truth)
- lmeval_ort.py: wrap request loop in tqdm respecting disable_tqdm parameter
- olive_evaluator.py: extract _simple_evaluate_supports_unsafe_code() with
  lru_cache so inspect.signature is not called on every evaluate() invocation
- tests: assert do_sample=True in search_options when sampling; add multi-request
  isolation test and cache_hook.add_partial assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
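
The cached capability check probably follows this pattern (sketch; only the helper name _simple_evaluate_supports_unsafe_code comes from the commit message above):

import inspect
from functools import lru_cache

import lm_eval

@lru_cache(maxsize=1)
def _simple_evaluate_supports_unsafe_code() -> bool:
    # Older lm-eval releases lack confirm_run_unsafe_code; probe the
    # signature once and cache the answer for subsequent evaluate() calls.
    try:
        sig = inspect.signature(lm_eval.simple_evaluate)
    except (TypeError, ValueError):  # unintrospectable wrapper
        return False
    return "confirm_run_unsafe_code" in sig.parameters
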
natke marked this pull request as ready for review April 30, 2026 21:14