
feat(ethexe/processor): instrument code lazily inside processing#5396

Open
playX18 wants to merge 6 commits into master from ap/process-instrument-lazy

Conversation

@playX18
Member

@playX18 playX18 commented Apr 30, 2026

Resolves #4681


Note

Medium Risk
Adds on-the-fly WASM instrumentation and DB writes inside the hot queue-processing path, which could affect determinism and performance, and introduces a new failure mode when the original code is absent.

Overview
Enables lazy code instrumentation during program execution: when processing a program queue (including overlay mode), if the current-runtime instrumented_code/code_metadata is missing in the DB, the processor now instruments the stored original_code via a runtime instance and persists the results.

This threads InstanceCreator into instrumented_code_and_metadata, exposes it on CommonRunContext, adds a new MissingOriginalCodeForProgram error, and includes a regression test ensuring processing populates missing instrumentation for valid code.
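
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `MockDb`, `LazyError`, `toy_instrument`, and the `u32`/`usize` stand-ins for code ids and metadata are all hypothetical, and the real function takes an `InstanceCreator` rather than a plain function.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real code id type.
type CodeId = u32;

// Hypothetical, simplified error type mirroring the two failure cases.
#[derive(Debug, PartialEq)]
enum LazyError {
    MissingOriginalCode(CodeId),
    MissingInstrumentedCode(CodeId),
}

// Hypothetical in-memory database; metadata is reduced to a single number.
#[derive(Default)]
struct MockDb {
    original: HashMap<CodeId, Vec<u8>>,
    instrumented: HashMap<CodeId, Vec<u8>>,
    metadata: HashMap<CodeId, usize>,
}

// Toy "instrumentation": rejects empty code, otherwise copies the bytes.
fn toy_instrument(bytes: &[u8]) -> Option<(Vec<u8>, usize)> {
    if bytes.is_empty() {
        None
    } else {
        Some((bytes.to_vec(), bytes.len()))
    }
}

// Returns the cached pair when both entries exist; otherwise instruments
// the stored original code and persists both results.
fn instrumented_code_and_metadata(
    db: &mut MockDb,
    code_id: CodeId,
    instrument: impl Fn(&[u8]) -> Option<(Vec<u8>, usize)>,
) -> Result<(Vec<u8>, usize), LazyError> {
    if let (Some(code), Some(meta)) =
        (db.instrumented.get(&code_id), db.metadata.get(&code_id))
    {
        return Ok((code.clone(), *meta));
    }

    let original = db
        .original
        .get(&code_id)
        .ok_or(LazyError::MissingOriginalCode(code_id))?;

    let (code, meta) =
        instrument(original).ok_or(LazyError::MissingInstrumentedCode(code_id))?;

    db.instrumented.insert(code_id, code.clone());
    db.metadata.insert(code_id, meta);
    Ok((code, meta))
}
```

Calling the function once with only `original` populated fills both caches; a second call for the same id then takes the early-return branch.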

Reviewed by Cursor Bugbot for commit 5207bb5. Bugbot is set up for automated code reviews on this repo. Configure here.

@semanticdiff-com

semanticdiff-com Bot commented Apr 30, 2026

Review changes with  SemanticDiff

Changed Files
File Status
  ethexe/processor/src/handling/overlaid.rs  75% smaller
  ethexe/processor/src/handling/run/chunk_execution_spawn.rs  65% smaller
  ethexe/processor/src/handling/run/mod.rs  47% smaller
  ethexe/processor/src/lib.rs  0% smaller
  ethexe/processor/src/tests.rs  0% smaller

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a lazy instrumentation mechanism for program code within the processor. By deferring the instrumentation process until the code is actually required for execution, the system improves efficiency and robustness. The changes involve updating the run context to handle instrumentation dynamically and ensuring that missing instrumented code is generated on-the-fly from the original source.

Highlights

  • Lazy Instrumentation: Implemented lazy code instrumentation during the processing phase, ensuring that code is instrumented only when needed if it is missing from the database.
  • Error Handling: Added a new 'MissingOriginalCodeForProgram' error variant to handle cases where the original code is unavailable during the instrumentation process.
  • Testing: Added a new test case to verify that valid code is correctly instrumented when the current runtime instrumentation is missing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements on-the-fly instrumentation for Gear programs within the ethexe processor. When instrumented code or metadata is missing from the database, the processor now fetches the original code, performs instrumentation using a Wasmtime instance, and persists the results back to the database. Feedback focuses on performance optimizations: specifically, noting that saving results to an overlaid database (common in RPC calls) may lead to redundant heavy instrumentation, and suggesting the reuse of Wasmtime instances across a processing pass to reduce instantiation overhead.

Comment on lines +448 to +453
db.set_instrumented_code(
    ethexe_runtime_common::VERSION,
    code_id,
    instrumented_code.clone(),
);
db.set_code_metadata(code_id, code_metadata.clone());
Contributor


medium

When db is an overlaid database (e.g., during RPC execute_for_reply), these set calls only update the in-memory overlay. This means every RPC call for the same uninstrumented program will re-perform instrumentation. Since instrumentation is a computationally heavy operation, this might be a performance bottleneck for RPC. Ideally, deterministic instrumentation results should be cached in the base database, but the current Database abstraction makes this difficult without bypassing the overlay. Given this is a fallback path, it's acceptable, but worth noting for future optimization.

References
  1. Code instrumentation is recognized as a computationally heavy 'main operation' in this repository, and its performance impact should be considered during implementation.
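
To illustrate why overlay writes do not help later RPC calls, here is a small sketch. `BaseDb`, `Overlay`, and `rpc_call` are hypothetical stand-ins, not the actual `Database` abstraction; the only assumption is that reads fall through to the base store while writes stay in the overlay.

```rust
use std::collections::HashMap;

// Hypothetical persistent store keyed by a simplified code id.
#[derive(Default)]
struct BaseDb {
    instrumented: HashMap<u32, Vec<u8>>,
}

// Hypothetical overlay: reads fall through to the base, writes stay in memory.
struct Overlay<'a> {
    base: &'a BaseDb,
    writes: HashMap<u32, Vec<u8>>,
}

impl<'a> Overlay<'a> {
    fn new(base: &'a BaseDb) -> Self {
        Self { base, writes: HashMap::new() }
    }

    fn get(&self, id: u32) -> Option<&Vec<u8>> {
        self.writes
            .get(&id)
            .or_else(|| self.base.instrumented.get(&id))
    }

    fn set(&mut self, id: u32, code: Vec<u8>) {
        self.writes.insert(id, code);
    }
}

// Simulates one RPC call; returns true if it had to instrument.
fn rpc_call(base: &BaseDb, id: u32) -> bool {
    let mut overlay = Overlay::new(base);
    if overlay.get(id).is_some() {
        return false;
    }
    // Stand-in for the instrumentation result; dropped with the overlay.
    overlay.set(id, vec![0xAA]);
    true
}
```

With an empty base, every call to `rpc_call` for the same id reports that it instrumented again, because the overlay is dropped at the end of each call.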

    .original_code(code_id)
    .ok_or(ProcessorError::MissingOriginalCodeForProgram(code_id))?;

let mut instance = instance_creator.instantiate()?;
Contributor


medium

Instantiating a new Wasmtime instance for every uninstrumented code blob can be expensive. Given that code instrumentation is a heavy operation, this adds significant overhead. In spawn_chunk_execution, if a chunk contains multiple programs with uninstrumented code, this will happen multiple times. Although InstanceCreator::instantiate is optimized via InstancePre, you might consider instantiating the runtime once per chunk processing pass and reusing it for all required instrumentations within that pass to reduce overhead.

References
  1. Code instrumentation is recognized as a computationally heavy 'main operation' in this repository, and its performance impact should be considered during implementation.
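
One way to picture the suggested reuse is the sketch below. The `InstanceCreator` and `Instance` types here are hypothetical stand-ins that only count instantiations, and the sketch assumes an instance can safely serve several `instrument` calls, which the discussion above does not confirm for the real runtime.

```rust
use std::cell::Cell;

// Hypothetical stand-in for the runtime instance creator; counts instantiations.
struct InstanceCreator {
    instantiations: Cell<usize>,
}

// Hypothetical stand-in for a runtime instance.
struct Instance;

impl InstanceCreator {
    fn instantiate(&self) -> Instance {
        self.instantiations.set(self.instantiations.get() + 1);
        Instance
    }
}

impl Instance {
    // Toy instrumentation that just copies the input bytes.
    fn instrument(&mut self, original: &[u8]) -> Vec<u8> {
        original.to_vec()
    }
}

// Instruments every code blob in a chunk with one lazily created instance.
fn instrument_chunk(creator: &InstanceCreator, codes: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let mut instance: Option<Instance> = None;
    codes
        .iter()
        .map(|code| {
            instance
                .get_or_insert_with(|| creator.instantiate())
                .instrument(code)
        })
        .collect()
}
```

Creating the instance lazily keeps the common case (nothing to instrument) free of any instantiation cost.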

@playX18
Member Author

playX18 commented Apr 30, 2026

@cursor review

@playX18
Member Author

playX18 commented Apr 30, 2026

@codex review

@playX18
Member Author

playX18 commented Apr 30, 2026

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @playX18's task in 1m 49s


Code Review

  • Reading changed files
  • Analyzing code
  • Posting review

Summary: This PR adds a lazy fallback in instrumented_code_and_metadata so that if the instrumented code / metadata are absent from the DB (e.g. after a runtime version bump), the processor instruments the stored original code on-the-fly and persists the result. The change is well-targeted, the logic is correct, and the instance_creator field visibility bump is minimal. A few points worth discussing:


medium: misleading error when instrument() returns None

File: ethexe/processor/src/handling/run/mod.rs lines 444–446

let (instrumented_code, code_metadata) = instance
    .instrument(&original_code)?
    .ok_or(ProcessorError::MissingInstrumentedCodeForProgram(code_id))?;

instrument() returns Ok(None) when the WASM can't be instrumented (invalid code). Mapping that to MissingInstrumentedCodeForProgram conflates "nothing in the DB" with "instrumentation produced no output." A dedicated variant like InstrumentationFailed(CodeId) would make debugging easier — especially since MissingInstrumentedCodeForProgram was previously only reachable if the DB lookup returned None, not if live instrumentation silently failed.

That said, this path is guarded upstream: handle_router_event rejects programs whose code_id is not marked code_valid, so in practice Ok(None) here would be a programmer error rather than a runtime condition.
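
A dedicated variant could look roughly like the sketch below. The enum is a simplified, hypothetical mirror of `ProcessorError` (the real type has more variants and a different `CodeId`), `InstrumentationFailed` is the name proposed above, and `require_instrumented` is a helper invented for illustration.

```rust
use std::fmt;

// Hypothetical stand-in for the real code id type.
type CodeId = u32;

// Simplified, hypothetical mirror of the processor error type.
#[derive(Debug, PartialEq)]
enum ProcessorError {
    // Nothing was found in the DB for this code id.
    MissingInstrumentedCodeForProgram(CodeId),
    // Live instrumentation ran but produced no output.
    InstrumentationFailed(CodeId),
}

impl fmt::Display for ProcessorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::MissingInstrumentedCodeForProgram(id) => {
                write!(f, "instrumented code missing in DB for code {id}")
            }
            Self::InstrumentationFailed(id) => {
                write!(f, "instrumentation produced no output for code {id}")
            }
        }
    }
}

// Hypothetical helper: maps an empty instrumentation result to the new variant.
fn require_instrumented<T>(output: Option<T>, code_id: CodeId) -> Result<T, ProcessorError> {
    output.ok_or(ProcessorError::InstrumentationFailed(code_id))
}
```

With two variants, a log line alone is enough to tell a cache miss from a failed instrumentation.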



low: RPC re-instrumentation on every call

File: ethexe/processor/src/handling/run/mod.rs lines 448–453 (also overlaid.rs:188)

In OverlaidRunContext, db is an overlaid (in-memory) database. set_instrumented_code / set_code_metadata writes go to the overlay only — they're discarded after the RPC call. So every execute_for_reply call for a program with missing instrumented code pays the full instrumentation cost. Since InstanceCreator::instantiate is InstancePre-backed it's not unbounded, but it's still non-trivial.

This is acceptable for an uncommon fallback path (missing instrumented code after a runtime upgrade should be temporary), but worth a comment so future readers don't wonder why writes appear to be no-ops for RPC callers.


low: partial-state non-issue but worth noting

If set_instrumented_code succeeds and set_code_metadata subsequently panics, the DB holds instrumented code but no metadata. The early-exit check on lines 433–437 requires both to be present, so the next call retries instrumentation and overwrites the stale entry; this is idempotent and safe. No action needed, just confirming this was considered.


low: test only covers the happy path

File: ethexe/processor/src/tests.rs lines 319–377

The new test verifies that after process_queues, both instrumented_code and code_metadata are populated. Consider also testing:

  • MissingOriginalCodeForProgram: drop the set_original_code call — ensures the new error variant is reachable and correct.
  • The overlaid path: confirm that execute_for_reply doesn't return an error when instrumented code is absent (even if it doesn't persist).

nit: CodesStorageRW import placement

CodesStorageRW is only used in the fallback branch of instrumented_code_and_metadata. It's a minor point, but the trait import at the top of the file is now always in scope even when the fallback never fires — fine as-is, just noting it.


Overall this is a clean fix. The logic is sound given that handle_router_event already guards against invalid code_ids being registered. Main open question is whether the instrument() → None case deserves a distinct error variant for observability.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!



@cursor cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!


Reviewed by Cursor Bugbot for commit 5207bb5. Configure here.

@claude

claude Bot commented May 4, 2026

Claude finished @playX18's task in 2m 34s


Code Review

  • Reading changed files
  • Analyzing code
  • Posting review

Summary: This PR extracts instrumented_code_and_metadata into a standalone function and adds a lazy fallback: when instrumented code / metadata is absent from the DB (e.g., after a runtime version bump), the processor instruments the original code on-the-fly and persists the result. The scope is narrow and the logic is sound for the happy path. A few issues worth addressing before merge:


medium: violates the documented "no KV writes from processor" invariant

File: ethexe/processor/src/lib.rs lines 151–153

The crate-level doc currently reads:

// - Processor is designed to write only in CAS, it must NEVER modify
//   key-value storage from Database.

Both set_instrumented_code and set_code_metadata are KV (not CAS) writes, and they're now called from within instrumented_code_and_metadata which is invoked directly from the queue-processing hot path. The safety argument (instrumentation is deterministic, writes are idempotent, all validators arrive at the same bytes) is reasonable, but the invariant should be updated to document the exception — otherwise the next reader will either assume this is a bug or cargo-cult the old constraint.



medium: instrument() → None aliased to the wrong error variant

File: ethexe/processor/src/handling/run/mod.rs lines 444–446

let (instrumented_code, code_metadata) = instance
    .instrument(&original_code)?
    .ok_or(ProcessorError::MissingInstrumentedCodeForProgram(code_id))?;

instrument() returns Ok(None) when the WASM is structurally invalid. Mapping that to MissingInstrumentedCodeForProgram conflates two distinct situations: "the cache is empty" (DB lookup returned None) versus "live instrumentation produced no output." A dedicated variant like InstrumentationFailed(CodeId) would make the distinction clear in logs and error propagation.

In practice this path is guarded upstream — handle_router_event only registers programs whose code_id is marked code_valid, so Ok(None) here would signal a serious internal inconsistency rather than a routine condition. All the more reason the error name should reflect that.



low: overlaid DB writes are ephemeral — re-instruments on every RPC call

File: ethexe/processor/src/handling/overlaid.rs line 188

OverlaidRunContext holds an overlaid (in-memory, copy-on-write) database. The set_instrumented_code / set_code_metadata writes inside instrumented_code_and_metadata go to the overlay and are discarded when the overlay is dropped. This means every execute_for_reply call for a program whose instrumentation is missing will pay the full instance_creator.instantiate() + instrument() cost. This is an edge case (missing instrumented code is expected only briefly after a version bump), but a short comment noting the limitation would help future readers.


low: parallel chunk execution may redundantly instrument the same code_id

File: ethexe/processor/src/handling/run/mod.rs (spawn_chunk_execution path)

Multiple programs in a chunk may share the same code_id. If all of them miss instrumented code in the DB (e.g. immediately after a runtime version bump), each parallel worker will independently call instantiate() + instrument() for the same bytes and race to write to the DB. The writes are idempotent (identical content → identical output) and the DB layer handles concurrent writes safely, so correctness is not at risk. But it could cause noticeable latency in a chunk where every program's code is missing — worth a comment.
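
If the redundant work ever matters, one possible mitigation is to collect the distinct code ids before spawning workers. The sketch below is only an illustration under that assumption: `ProgramId`, `CodeId`, and `codes_to_instrument` are hypothetical, and the PR itself does not do this.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for the real id types.
type ProgramId = u32;
type CodeId = u32;

// Collects the distinct code ids in a chunk that still lack instrumented code,
// so each one could be instrumented once before parallel execution starts.
fn codes_to_instrument(
    chunk: &[(ProgramId, CodeId)],
    already_instrumented: &BTreeSet<CodeId>,
) -> BTreeSet<CodeId> {
    chunk
        .iter()
        .map(|&(_, code_id)| code_id)
        .filter(|code_id| !already_instrumented.contains(code_id))
        .collect()
}
```

Programs that share a code id collapse to a single entry, so the instrumentation cost scales with the number of distinct codes rather than the number of programs.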


low: new test only covers the happy path

File: ethexe/processor/src/tests.rs lines 321–373

The test verifies that after process_queues with missing instrumented code, both instrumented_code and code_metadata are populated — good. Two additional cases would complete coverage of the new code:

  1. MissingOriginalCodeForProgram path: skip the db.set_original_code(&code) call and assert that process_queues returns Err(ProcessorError::MissingOriginalCodeForProgram(_)).
  2. Overlaid path: call execute_for_reply with instrumented code absent from the overlay and verify it succeeds (exercises the overlay write-then-discard behavior without panic).

Overall this is a clean, well-targeted fix. The main asks are: update the broken invariant comment in the crate doc, and rename the error produced when instrument() returns None.

Development

Successfully merging this pull request may close these issues.

ethexe: instrument uninstrumented codes inside processing

2 participants