Add MBPP evaluator #2439

Open
natke wants to merge 11 commits into main from feature/add-mbpp-evaluator

Conversation

natke (Contributor) commented Apr 29, 2026

Summary

Adds MBPP evaluator support for the ortgenai backend, enabling generative LM eval tasks (e.g., MBPP, HumanEval) via lm-eval. Key changes include:

  • Add generate_until support to LMEvalORTGenAIEvaluator for token-by-token generation with stop-sequence handling
  • Wire confirm_run_unsafe_code opt-in flag from CLI through to lm_eval.simple_evaluate for MBPP safety compliance
  • Improve stop-sequence trimming to cut at the earliest match across all provided stop strings (see the sketch after this list)
  • Improve generate_until with full-sequence decode for correct tokenizer whitespace/punctuation handling
  • Harden gen_kwargs parsing: max tokens, temperature, and do_sample coercion for string/int inputs
  • Add evaluator tests for unsafe-code flag propagation, multi-stop ordering, and token parsing edge cases
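
A minimal sketch of that earliest-match trimming (illustrative only, not the exact code in this PR):

def trim_at_earliest_stop(text: str, stops: list[str]) -> str:
    # Cut at the first occurrence of ANY stop string, so a later-listed
    # stop cannot mask one that appears earlier in the text.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# trim_at_earliest_stop("def f():\n    pass\nclass A:", ["class", "\ndef"])
# returns "def f():\n    pass\n" -- "class" is the earliest match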

Example Olive config

The evaluator can be used standalone against a pre-existing ONNX model — no optimization passes are required. The passes field defaults to empty, and evaluate_input_model defaults to true, so Olive evaluates the input model directly:

{
    "input_model": {
        "type": "OnnxModel",
        "model_path": "models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4"
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "accelerators": [{ "device": "cpu", "execution_providers": ["CPUExecutionProvider"] }]
        }
    },
    "evaluators": {
        "lm_evaluator": {
            "type": "LMEvaluator",
            "tasks": ["mbpp"],
            "model_class": "ortgenai",
            "batch_size": 1,
            "max_length": 2048,
            "limit": 10,
            "confirm_run_unsafe_code": true
        }
    },
    "target": "local_system",
    "evaluator": "lm_evaluator"
}

Run with:

olive run -c config.json

Or via the CLI directly (no config file needed):

olive benchmark -m models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4 \
    --tasks mbpp \
    --backend ortgenai \
    --confirm_run_unsafe_code

Note: MBPP executes generated Python code to check correctness. The confirm_run_unsafe_code flag (or --confirm_run_unsafe_code on the CLI) is required to acknowledge this.

Validation

  • pytest -q test/evaluator/test_olive_evaluator.py -k "generate_until or confirm_run_unsafe_code or lmeval"
  • pytest -q test/cli/test_cli.py -k "benchmark_command_onnxmodel_with_ortgenai_backend or benchmark_command_hfmodel"

natke and others added 2 commits April 29, 2026 16:04
- Implement generate_until in LMEvalORTGenAIEvaluator for code generation tasks
- Fix EOS token handling for models with multiple EOS token IDs (e.g. Qwen)
- Add confirm_run_unsafe_code parameter to CLI and LMEvaluator for code eval tasks
- Update base ORT class error message to direct users to ortgenai backend
- Add unit tests for generate_until and confirm_run_unsafe_code

This enables running MBPP, HumanEval, and other code generation benchmarks
through Olive's ortgenai evaluation backend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Trim generation at earliest matching stop sequence across all provided stops
- Improve generate_until decoding efficiency by decoding incrementally
- Harden max token parsing for malformed generation kwargs
- Add evaluator tests for stop ordering and token parsing edge cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
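
A sketch of the multi-EOS handling from the first of these commits (assumed shape; the ids below are illustrative Qwen-style values, not verified against this PR):

def normalize_eos_ids(eos_token_id) -> list[int]:
    # Model configs may define eos_token_id as a single int or a list (e.g. Qwen).
    if eos_token_id is None:
        raise ValueError("model config defines no eos_token_id")
    ids = list(eos_token_id) if isinstance(eos_token_id, (list, tuple)) else [eos_token_id]
    if not ids:
        raise ValueError("eos_token_id list is empty")
    return ids

eos_ids = normalize_eos_ids([151645, 151643])  # illustrative multi-EOS config
# the generation loop then breaks as soon as: next_token in eos_ids
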
Copilot AI review requested due to automatic review settings April 29, 2026 23:42

Copilot AI left a comment

Pull request overview

This PR improves Olive’s lm-eval integration for the ORT GenAI backend by adding a working generate_until implementation with more robust stopping behavior, while also exposing lm-eval’s confirm_run_unsafe_code flag through Olive’s evaluator and CLI.

Changes:

  • Implement generate_until for ortgenai with EOS and multi-stop trimming, and improve the unsupported-backend error for plain ort.
  • Plumb confirm_run_unsafe_code through LMEvaluator and olive benchmark, with unit test coverage.
  • Add evaluator tests covering multi-stop ordering, EOS handling, incremental decode behavior, and max-token parsing edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • olive/evaluator/lmeval_ort.py: Adds ORT-GenAI generate_until, supports multiple EOS IDs, and improves ort backend error messaging.
  • olive/evaluator/olive_evaluator.py: Adds confirm_run_unsafe_code to LMEvaluator and forwards it to lm_eval.simple_evaluate.
  • olive/cli/benchmark.py: Adds --confirm_run_unsafe_code CLI flag and writes it into the generated run config when enabled.
  • test/evaluator/test_olive_evaluator.py: Adds unit tests for confirm_run_unsafe_code forwarding and ORT-GenAI generate_until behaviors.

Comment thread olive/evaluator/lmeval_ort.py Outdated (×2)
Comment thread test/evaluator/test_olive_evaluator.py Fixed (×8)
natke changed the title from "Fix ORT GenAI stop handling and generation robustness" to "Add MBPP evaluator" Apr 30, 2026
natke marked this pull request as draft April 30, 2026 01:23
- Remove dead generated_ids list
- Replace string concatenation with list accumulation to avoid quadratic growth
- Rename _eos_token_ids to eos_token_ids (public attribute) to fix protected-access lint warnings in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 17:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread olive/evaluator/lmeval_ort.py (×2)
Comment thread olive/evaluator/lmeval_ort.py Outdated
- Guard against empty eos_token_id list with a clear ValueError
- Early-return empty completion when prompt >= max_length or max_gen_toks == 0
  to avoid passing invalid max_length to the ORT GenAI generator
- Fix generated_text not being set when loop exits via EOS break
- Use tail-buffer for stop-sequence checking instead of full join per token
- Update test for max_gen_toks=0 to assert early-return behaviour

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 17:56

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread olive/evaluator/lmeval_ort.py Outdated (×3)
Comment thread test/evaluator/test_olive_evaluator.py
- Replace chunk-count tail window with character-based rolling tail string,
  fixing missed stop sequences when stop strings span more tokens than characters
- Track generated_len with a running counter to avoid O(n^2) join for tail_offset
- Coerce temperature from str/None safely with float() + fallback to 0.0
- Add parametrized tests for temperature coercion edge cases (string, None, zero, float)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
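
A sketch of the character-based rolling tail described in this commit (assumed shape; next_piece is a hypothetical per-token decode callable, not a name from this PR):

def generate_with_stops(next_piece, stops, max_new_tokens):
    # Keep only the last len(longest stop) - 1 characters between steps, so a
    # stop spanning token boundaries is still caught without re-joining the
    # full generation on every token.
    max_stop = max((len(s) for s in stops), default=0)
    chunks, tail = [], ""
    for _ in range(max_new_tokens):
        piece = next_piece()
        if piece is None:  # EOS reached
            break
        chunks.append(piece)
        if max_stop:
            tail += piece
            if any(s in tail for s in stops):
                break
            tail = tail[-(max_stop - 1):] if max_stop > 1 else ""
    return "".join(chunks)  # trimming at the stop itself happens afterwards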

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

test/evaluator/test_olive_evaluator.py:1

  • The tests cover the max_gen_toks == 0 early-return path, but not the prompt_len >= self.max_length branch (which also triggers cache-hook behavior). Add a unit test where tokenizer.encode(...).tolist() returns a prompt length >= evaluator.max_length, and assert it returns "" and writes the empty partial to the cache hook.

Comment thread olive/evaluator/lmeval_ort.py
Comment thread olive/evaluator/olive_evaluator.py Outdated
natke and others added 3 commits April 30, 2026 11:59
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ith older lm-eval

- Remove batch_size from set_search_options (not a valid kwarg)
- Use try/except TypeError for confirm_run_unsafe_code compat
- Remove unused import inspect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
natke requested a review from Copilot April 30, 2026 19:20

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment thread olive/evaluator/lmeval_ort.py Outdated (×4)
…mple coercion

- olive_evaluator.py: use inspect.signature to check if lm-eval supports
  confirm_run_unsafe_code before passing it (avoids brittle try/except TypeError)
- lmeval_ort.py: accumulate generated_token_ids and decode full sequence once
  at end so tokenizer whitespace/punctuation normalisation is applied correctly;
  keep per-token incremental decode only for stop-sequence tail detection
- lmeval_ort.py: coerce do_sample to bool, handling string 'false'/'0'/'no'
  and int 0/1 so they are not unintentionally truthy
- tests: set __signature__ on simple_evaluate mocks; add older-lm-eval compat
  test; update decode-assertion for full-seq decode; add do_sample coercion test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
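
The coercions described here might look roughly like this (hypothetical helper names, not the PR's exact code):

def coerce_do_sample(value) -> bool:
    # Strings such as "false"/"0"/"no" must not be treated as truthy.
    if isinstance(value, str):
        return value.strip().lower() not in ("false", "0", "no", "")
    return bool(value)

def coerce_temperature(value) -> float:
    # Accept str/int/float/None; fall back to greedy decoding (0.0) on bad input.
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0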

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment thread olive/cli/benchmark.py Outdated
Comment thread olive/evaluator/lmeval_ort.py Outdated
Comment thread test/evaluator/test_olive_evaluator.py
…ort_genai stub

- olive_evaluator.py: wrap inspect.signature in try/except (TypeError, ValueError)
  so an unintrospectable wrapper never crashes evaluation
- benchmark.py: write confirm_run_unsafe_code as explicit bool (not 'or None')
  so False is always written rather than silently omitted as null
- lmeval_ort.py: normalise 'until' tuple/set/generator to list via list()
  so stop sequences in tuple form are not silently dropped
- test/evaluator/conftest.py: inject onnxruntime_genai stub so generate_until
  tests run without the real package installed
- test: remove onnxruntime_genai from skipif; add tuple-until regression test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
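
The 'until' normalisation from this commit is likely as simple as (sketch, assumed names):

def normalize_until(until):
    # lm-eval may pass a str, tuple, set, or generator of stop strings;
    # materialise everything into a concrete list so nothing is dropped.
    if until is None:
        return []
    if isinstance(until, str):
        return [until]
    return list(until)
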
…qdm, caching

- lmeval_ort.py: pass do_sample=True to set_search_options when sampling is
  enabled — without it ORT GenAI ignores temperature and runs greedy regardless
- benchmark.py: use default=None for --confirm_run_unsafe_code so False is not
  written to config when flag is omitted (prevents overriding config-file truth)
- lmeval_ort.py: wrap request loop in tqdm respecting disable_tqdm parameter
- olive_evaluator.py: extract _simple_evaluate_supports_unsafe_code() with
  lru_cache so inspect.signature is not called on every evaluate() invocation
- tests: assert do_sample=True in search_options when sampling; add multi-request
  isolation test and cache_hook.add_partial assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
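
The cached capability check probably follows this pattern (sketch; only the helper name _simple_evaluate_supports_unsafe_code comes from the commit message above):

import inspect
from functools import lru_cache

import lm_eval

@lru_cache(maxsize=1)
def _simple_evaluate_supports_unsafe_code() -> bool:
    # Older lm-eval releases lack confirm_run_unsafe_code; probe the
    # signature once and cache the answer for subsequent evaluate() calls.
    try:
        sig = inspect.signature(lm_eval.simple_evaluate)
    except (TypeError, ValueError):  # unintrospectable wrapper
        return False
    return "confirm_run_unsafe_code" in sig.parameters
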
natke marked this pull request as ready for review April 30, 2026 21:14