Conversation
- Implement generate_until in LMEvalORTGenAIEvaluator for code generation tasks
- Fix EOS token handling for models with multiple EOS token IDs (e.g. Qwen)
- Add confirm_run_unsafe_code parameter to CLI and LMEvaluator for code eval tasks
- Update base ORT class error message to direct users to the ortgenai backend
- Add unit tests for generate_until and confirm_run_unsafe_code

This enables running MBPP, HumanEval, and other code generation benchmarks through Olive's ortgenai evaluation backend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
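For reference, EOS handling of this kind usually normalises the config value to a list before checking each generated token; a minimal sketch (the helper name is illustrative, not Olive's actual code):

```python
def normalize_eos_token_ids(eos_token_id) -> list[int]:
    # Tokenizer/model configs may expose eos_token_id as a single int or, for
    # models such as Qwen, as a list of ints; always return a list to check against.
    if eos_token_id is None:
        raise ValueError("Model config does not define an eos_token_id")
    if isinstance(eos_token_id, int):
        return [eos_token_id]
    return list(eos_token_id)
```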
- Trim generation at earliest matching stop sequence across all provided stops
- Improve generate_until decoding efficiency by decoding incrementally
- Harden max token parsing for malformed generation kwargs
- Add evaluator tests for stop ordering and token parsing edge cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
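The earliest-stop trimming described here can be sketched as follows (a hedged illustration; the function name and exact logic are not necessarily Olive's implementation):

```python
def trim_at_earliest_stop(text: str, stops: list[str]) -> str:
    # Cut the completion at the first position where any stop sequence begins;
    # if none of the stops occur, the text is returned unchanged.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1 and idx < cut:
            cut = idx
    return text[:cut]
```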
Pull request overview
This PR improves Olive’s lm-eval integration for the ORT GenAI backend by adding a working generate_until implementation with more robust stopping behavior, while also exposing lm-eval’s confirm_run_unsafe_code flag through Olive’s evaluator and CLI.
Changes:
- Implement `generate_until` for `ortgenai` with EOS and multi-stop trimming, and improve the unsupported-backend error for plain `ort`.
- Plumb `confirm_run_unsafe_code` through `LMEvaluator` and `olive benchmark`, with unit test coverage.
- Add evaluator tests covering multi-stop ordering, EOS handling, incremental decode behavior, and max-token parsing edge cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `olive/evaluator/lmeval_ort.py` | Adds ORT-GenAI `generate_until`, supports multiple EOS IDs, and improves `ort` backend error messaging. |
| `olive/evaluator/olive_evaluator.py` | Adds `confirm_run_unsafe_code` to `LMEvaluator` and forwards it to `lm_eval.simple_evaluate`. |
| `olive/cli/benchmark.py` | Adds `--confirm_run_unsafe_code` CLI flag and writes it into the generated run config when enabled. |
| `test/evaluator/test_olive_evaluator.py` | Adds unit tests for `confirm_run_unsafe_code` forwarding and ORT-GenAI `generate_until` behaviors. |
- Remove dead generated_ids list
- Replace string concatenation with list accumulation to avoid quadratic growth
- Rename _eos_token_ids to eos_token_ids (public attribute) to fix protected-access lint warnings in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Guard against an empty eos_token_id list with a clear ValueError
- Early-return an empty completion when prompt >= max_length or max_gen_toks == 0 to avoid passing an invalid max_length to the ORT GenAI generator
- Fix generated_text not being set when the loop exits via an EOS break
- Use a tail buffer for stop-sequence checking instead of a full join per token
- Update the test for max_gen_toks=0 to assert the early-return behaviour

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
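A minimal sketch of the early-return guard described in this commit (names are illustrative; prompt_len, max_length, and max_gen_toks are assumed to be token counts):

```python
def generation_budget(prompt_len: int, max_length: int, max_gen_toks: int) -> int:
    # 0 means "return an empty completion without calling the ORT GenAI generator",
    # which avoids passing an invalid max_length when the prompt already fills the context.
    if max_gen_toks <= 0 or prompt_len >= max_length:
        return 0
    return min(max_gen_toks, max_length - prompt_len)
```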
- Replace chunk-count tail window with a character-based rolling tail string, fixing missed stop sequences when stop strings span more tokens than characters
- Track generated_len with a running counter to avoid an O(n^2) join for tail_offset
- Coerce temperature from str/None safely with float() + fallback to 0.0
- Add parametrized tests for temperature coercion edge cases (string, None, zero, float)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
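The character-based rolling tail can be sketched roughly like this (an illustration of the approach, not the exact code):

```python
def update_stop_tail(tail: str, piece: str, stops: list[str]) -> tuple[str, bool]:
    # Append the newly decoded piece and check for any stop string, then keep only
    # the trailing characters that could still start a stop spanning into the next
    # piece, so the buffer stays bounded regardless of how long generation runs.
    max_stop = max((len(s) for s in stops), default=0)
    tail += piece
    hit = any(stop in tail for stop in stops)
    keep = max_stop - 1
    return (tail[-keep:] if keep > 0 else ""), hit
```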
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
test/evaluator/test_olive_evaluator.py:1
- The tests cover the `max_gen_toks == 0` early-return path, but not the `prompt_len >= self.max_length` branch (which also triggers cache-hook behavior). Add a unit test where `tokenizer.encode(...).tolist()` returns a prompt length >= `evaluator.max_length`, and assert it returns `""` and writes the empty partial to the cache hook.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ith older lm-eval
- Remove batch_size from set_search_options (not a valid kwarg)
- Use try/except TypeError for confirm_run_unsafe_code compat
- Remove unused import inspect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mple coercion
- olive_evaluator.py: use inspect.signature to check if lm-eval supports confirm_run_unsafe_code before passing it (avoids brittle try/except TypeError)
- lmeval_ort.py: accumulate generated_token_ids and decode the full sequence once at the end so tokenizer whitespace/punctuation normalisation is applied correctly; keep per-token incremental decode only for stop-sequence tail detection
- lmeval_ort.py: coerce do_sample to bool, handling string 'false'/'0'/'no' and int 0/1 so they are not unintentionally truthy
- tests: set __signature__ on simple_evaluate mocks; add older-lm-eval compat test; update decode assertion for full-sequence decode; add do_sample coercion test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
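The do_sample coercion mentioned above could look roughly like this (the accepted spellings follow the commit message; the exact set in Olive's code may differ):

```python
def coerce_do_sample(value) -> bool:
    # Strings such as "false", "0", and "no" (any case) map to False instead of being
    # truthy; everything else falls back to plain bool() conversion.
    if isinstance(value, str):
        return value.strip().lower() not in {"false", "0", "no", ""}
    return bool(value)
```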
…ort_genai stub
- olive_evaluator.py: wrap inspect.signature in try/except (TypeError, ValueError) so an unintrospectable wrapper never crashes evaluation
- benchmark.py: write confirm_run_unsafe_code as an explicit bool (not 'or None') so False is always written rather than silently omitted as null
- lmeval_ort.py: normalise 'until' tuple/set/generator to a list via list() so stop sequences in tuple form are not silently dropped
- test/evaluator/conftest.py: inject an onnxruntime_genai stub so generate_until tests run without the real package installed
- tests: remove onnxruntime_genai from skipif; add tuple-until regression test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…qdm, caching
- lmeval_ort.py: pass do_sample=True to set_search_options when sampling is enabled; without it, ORT GenAI ignores temperature and runs greedy regardless
- benchmark.py: use default=None for --confirm_run_unsafe_code so False is not written to the config when the flag is omitted (prevents overriding config-file truth)
- lmeval_ort.py: wrap the request loop in tqdm, respecting the disable_tqdm parameter
- olive_evaluator.py: extract _simple_evaluate_supports_unsafe_code() with lru_cache so inspect.signature is not called on every evaluate() invocation
- tests: assert do_sample=True in search_options when sampling; add multi-request isolation test and cache_hook.add_partial assertion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
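A sketch of the cached capability check (the helper name mirrors the commit message; the body is illustrative, not the exact implementation):

```python
import inspect
from functools import lru_cache

import lm_eval


@lru_cache(maxsize=1)
def _simple_evaluate_supports_unsafe_code() -> bool:
    # Inspect lm_eval.simple_evaluate once per process; older lm-eval releases do not
    # accept confirm_run_unsafe_code, and some wrappers cannot be introspected at all.
    try:
        params = inspect.signature(lm_eval.simple_evaluate).parameters
    except (TypeError, ValueError):
        return False
    return "confirm_run_unsafe_code" in params
```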
Summary
Adds MBPP evaluator support for the `ortgenai` backend, enabling generative LM eval tasks (e.g., MBPP, HumanEval) via `lm-eval`. Key changes include:

- `generate_until` support in `LMEvalORTGenAIEvaluator` for token-by-token generation with stop-sequence handling
- `confirm_run_unsafe_code` opt-in flag plumbed from the CLI through to `lm_eval.simple_evaluate` for MBPP safety compliance
- `generate_until` decodes the full sequence at the end for correct tokenizer whitespace/punctuation handling
- Robust `gen_kwargs` parsing: max tokens, temperature, and `do_sample` coercion for string/int inputs

Example Olive config
The evaluator can be used standalone against a pre-existing ONNX model; no optimization passes are required. The `passes` field defaults to empty, and `evaluate_input_model` defaults to `true`, so Olive evaluates the input model directly:

```json
{
    "input_model": {
        "type": "OnnxModel",
        "model_path": "models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4"
    },
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "accelerators": [{ "device": "cpu", "execution_providers": ["CPUExecutionProvider"] }]
        }
    },
    "evaluators": {
        "lm_evaluator": {
            "type": "LMEvaluator",
            "tasks": ["mbpp"],
            "model_class": "ortgenai",
            "batch_size": 1,
            "max_length": 2048,
            "limit": 10,
            "confirm_run_unsafe_code": true
        }
    },
    "target": "local_system",
    "evaluator": "lm_evaluator"
}
```

Run with:
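```bash
# Assuming the standard Olive workflow entry point; adjust the config path as needed.
olive run --config lm_eval_config.json
```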
Or via the CLI directly (no config file needed):
```bash
olive benchmark -m models/Qwen2.5-Coder-0.5B-Instruct-cpu-int4 \
    --tasks mbpp \
    --backend ortgenai \
    --confirm_run_unsafe_code
```

Validation