
Implement shell lexer/tokenizer for /bin/sh #885

Merged
dburkart merged 1 commit into main from m877-sh-lexer on May 5, 2026

Conversation

@dburkart
Owner

@dburkart dburkart commented May 5, 2026

Closes #877

Summary

  • Add base/sh/ crate as the foundation for the vibix POSIX shell (/bin/sh)
  • Implement a lexer following POSIX.1-2024 §2.3 (Token Recognition) with support for all shell operators, quoting modes (single, double, backslash), comments, and line continuation
  • Add build_userspace_sh() to xtask for vibix-target compilation (not yet wired into cargo xtask build due to a pre-existing std compile error on the vibix target — same issue that affects std_hello)
  • 49 host-side unit tests covering all token types, quoting edge cases, error recovery, and realistic command lines

Test plan

  • cargo test --manifest-path base/sh/Cargo.toml — 49 tests pass, no warnings
  • cargo xtask build — kernel build passes, no regressions
  • cargo xtask smoke — all 42 markers present
  • cargo xtask test — pre-existing blocking_sync flake only (rwlock_concurrent_readers), no new failures

Add the `base/sh/` crate as the foundation for the vibix POSIX shell.
The lexer implements token recognition per POSIX.1-2024 §2.3, converting
raw input into a stream of typed tokens that a future parser can consume.

Supported tokens:
- Words (unquoted, single-quoted, double-quoted, backslash-escaped)
- Operators: |, ||, &, &&, ;, (, ), <, >, >>, <<, >&, <&, <>
- Newline (syntactically significant in shell grammar)
- EOF
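The token set above maps naturally onto a Rust enum. As a sketch of the types this PR describes (variant names are taken from the PR's own walkthrough; the derives and the `String` payload for `Word` are assumptions, not confirmed API):

```rust
// Sketch of the lexer's public types. Variant names match the PR summary;
// the derives and Word's String payload are assumptions, not the PR's API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    Word(String), // quoted/unquoted word, with quoting already resolved
    Pipe,         // |
    And,          // & (exact And/Or mapping to &/&&/|| is not spelled out)
    Or,           // ||
    Semi,         // ;
    Less,         // <
    Great,        // >
    DGreat,       // >>
    DLess,        // <<
    GreatAnd,     // >&
    LessAnd,      // <&
    LessGreat,    // <>
    LParen,       // (
    RParen,       // )
    Newline,      // syntactically significant in the shell grammar
    Eof,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexError {
    UnterminatedSingleQuote,
    UnterminatedDoubleQuote,
    TrailingBackslash,
}

fn main() {
    let t = Token::Word("echo".into());
    assert_eq!(t, Token::Word("echo".to_string()));
    assert_ne!(Token::DGreat, Token::Great); // >> and > are distinct tokens
    println!("ok");
}
```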

Quoting follows POSIX rules: single quotes preserve everything literally,
double quotes recognize $, `, \, and " as special, backslash escapes the
next character (or acts as line continuation before newline). Comments
starting with # are recognized at word boundaries.
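As a standalone illustration of the double-quote rule just described (not the PR's code; the function name and shape are invented for the example): inside double quotes, backslash escapes only `$`, `` ` ``, `\`, and `"`, removes a following newline, and is otherwise kept literally.

```rust
// Toy demonstration of the POSIX double-quote backslash rule, applied to
// the text between a pair of double quotes. Not the PR's lexer.
fn dquote_body(body: &str) -> String {
    let mut out = String::new();
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('\n') => {} // line continuation: backslash and newline both removed
                Some(n @ ('$' | '`' | '\\' | '"')) => out.push(n), // escape consumed
                Some(n) => {
                    out.push('\\'); // any other char: backslash stays literal
                    out.push(n);
                }
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(dquote_body(r#"a\$b"#), "a$b");   // \$ escapes the dollar
    assert_eq!(dquote_body(r#"x\qy"#), r#"x\qy"#); // \q not special: kept as-is
    assert_eq!(dquote_body("a\\\nb"), "ab");     // continuation removed
    println!("ok");
}
```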

The crate is a standard Rust binary using `std` via the in-repo fork,
following the `base/` convention. A `build_userspace_sh()` function is
added to xtask (not yet wired into `cargo xtask build` due to a
pre-existing std compile error on the vibix target).

49 host-side unit tests cover all token types, quoting edge cases,
error recovery, and realistic command lines.

Closes #877

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 5, 2026

📝 Walkthrough


A new base/sh crate introduces a POSIX shell lexer that tokenizes input into words, operators, and newlines. The Lexer struct handles single/double quoting, backslash escaping, comment skipping, and operator recognition. A main.rs demonstrates tokenization, and xtask gains a build helper to compile the binary for the vibix target.

Changes

Shell Lexer Implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Token & Error Types | base/sh/src/lexer.rs | The Token enum defines operators (Pipe, And, Or, Semi, Less, Great, DGreat, DLess, GreatAnd, LessAnd, LessGreat, LParen, RParen), literals (Word), and control tokens (Newline, Eof). The LexError enum captures three error cases: UnterminatedSingleQuote, UnterminatedDoubleQuote, TrailingBackslash. |
| Lexer Implementation | base/sh/src/lexer.rs | The Lexer<'a> struct implements next_token() and tokenize_all(), dispatching to operator vs. word lexing and handling comments, newlines, single/double quoting (with POSIX backslash rules), line continuation, and blank skipping. |
| Binary Entry Point | base/sh/src/main.rs | Enables the restricted_std feature, imports Lexer and Token, and tokenizes a hard-coded test command (`"echo hello world` |
| Crate Manifest | base/sh/Cargo.toml | Declares the standalone sh crate (v0.1.0, edition 2021) with a single binary target at src/main.rs and a [workspace] section for independent compilation. |
| Build Integration | xtask/src/main.rs | Adds an unused build_userspace_sh() helper that builds the vibix /bin/sh binary out of tree via -Z build-std, verifies the ELF artifact, and strips debug symbols. |
| Documentation & Tests | base/README.md, base/sh/src/lexer.rs | The README documents the sh crate as the POSIX shell at /bin/sh using std. The lexer includes comprehensive unit tests covering operators, quoting, escaping, comments, newlines, line continuations, and error recovery. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Possibly related PRs

  • dburkart/vibix#684: Also adds userspace shell binary and xtask build helpers to compile for vibix target.
  • dburkart/vibix#873: Modifies xtask and base workspace layout; shares the same code area as the build helper addition.
  • dburkart/vibix#339: Extracts similar userspace binary build-and-strip helper pattern in xtask.

Poem

A rabbit hops through text so fine,
Lexing tokens, line by line,
Single quotes and doubles too,
Backslash escapes, operators new,
/bin/sh hops into view! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: The title "Implement shell lexer/tokenizer for /bin/sh" clearly and concisely summarizes the main change: adding a lexer/tokenizer implementation for the vibix shell.
  • Description check: The description is highly related to the changeset, providing detailed context about the lexer implementation, POSIX.1-2024 conformance, test coverage, and known limitations.
  • Linked Issues check: The PR fully addresses issue #877 requirements: it creates the base/sh crate, implements the Token enum with all specified variants, implements a Lexer with next_token() handling all quoting modes, includes 49 unit tests, and adds xtask build integration.
  • Out of Scope Changes check: All changes are within scope: the base/sh crate implementation, lexer/tokenizer functionality per #877, unit tests, documentation update, and xtask build helper align with the stated objectives.
  • Docstring Coverage: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
base/sh/src/main.rs (1)

7-23: ⚡ Quick win

Replace the demo loop with an explicit stub before this gets wired into xtask.

Right now sh ignores stdin/argv and always prints tokens for the built-in sample. A clear not yet implemented error is safer than silently shipping demo behavior once the build/staging path is enabled.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@base/sh/src/main.rs` around lines 7 - 23, Replace the demo token-printing
loop in main with an explicit "not yet implemented" stub: remove the hard-coded
input/Lexer::new(...) demo and instead have main return an error (or print a
clear "not yet implemented" message and exit non-zero) when stdin/argv handling
is not implemented; reference main, Lexer::new, lex.next_token, and Token::Eof
to locate and remove the demo loop and ensure the stub prevents silently running
the demo when xtask wiring is enabled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a1d218f1-5e5b-454d-8389-0bc3f400f0bf

📥 Commits

Reviewing files that changed from the base of the PR and between 55ffcab and c55ed5c.

📒 Files selected for processing (5)
  • base/README.md
  • base/sh/Cargo.toml
  • base/sh/src/lexer.rs
  • base/sh/src/main.rs
  • xtask/src/main.rs

Comment thread base/sh/src/lexer.rs
Comment on lines +125 to +126
input: &'a [u8],
pos: usize,

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Byte-wise lexing corrupts valid UTF-8 words.

Every `c as char` path here widens one UTF-8 byte to a scalar, so a multi-byte input like café comes out mangled: the two-byte encoding of é is reinterpreted as the two Latin-1 characters Ã©. Since this API takes &str and returns String, token contents should preserve valid UTF-8. Please switch the lexer to char-boundary iteration, or accumulate byte slices and rebuild with String::from_utf8 at the end.

Also applies to: 265-287, 302-307, 321-338, 347-358
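A minimal illustration of the suggested fix, iterating on char boundaries with char_indices() instead of casting bytes (toy function, not the PR's Lexer):

```rust
// Toy word-lexer over &str: pushes whole chars, so multi-byte UTF-8
// survives intact, and returns the byte offset where lexing stopped.
fn lex_plain_word(input: &str) -> (String, usize) {
    let mut word = String::new();
    let mut end = input.len();
    for (i, c) in input.char_indices() {
        if c.is_whitespace() {
            end = i; // byte offset of the terminator, safe to slice at
            break;
        }
        word.push(c); // whole char, never a lone UTF-8 byte
    }
    (word, end)
}

fn main() {
    let (w, used) = lex_plain_word("café au lait");
    assert_eq!(w, "café"); // not mangled: é kept as one char
    assert_eq!(used, 5);   // byte offset: 'é' occupies two bytes
    println!("ok");
}
```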

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@base/sh/src/lexer.rs` around lines 125 - 126, The lexer is iterating bytes
and converting each byte to a char (e.g., `c as char`), which corrupts
multi-byte UTF‑8 (e.g., "café"); change the implementation to operate on
character boundaries: either switch the lexer input to a &str and iterate with
char_indices()/chars() preserving byte offsets, or keep the byte buffer but
accumulate slices of bytes for each token and call String::from_utf8 on the
completed token; update the functions/variables referenced in the diff (the
lexer input field `input: &'a [u8]`, `pos`, any places that cast `c as char`,
and the token-accumulation code in the regions noted: ~125, 265-287, 302-307,
321-338, 347-358) so tokens are produced as valid UTF‑8 Strings without per-byte
char casts.

Comment thread base/sh/src/lexer.rs
Comment on lines +139 to +150
    pub fn next_token(&mut self) -> Result<Token, LexError> {
        self.skip_blanks();

        match self.peek() {
            None => Ok(Token::Eof),
            Some(b'\n') => {
                self.advance();
                Ok(Token::Newline)
            }
            Some(b'#') => {
                self.skip_comment();
                self.next_token()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recursive re-entry makes stack use proportional to input shape.

next_token() recurses after comments, and lex_word() recurses again when a leading backslash-newline contributes no bytes. A long run of continuations/comments can overflow the stack before EOF. This is safer as iterative control flow.

Also applies to: 294-295
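The iterative shape this comment suggests can be sketched with a toy tokenizer (not the PR's code): a loop with `continue` replaces the recursive re-entry after a comment, so stack depth stays constant regardless of input shape.

```rust
// Toy peek-past-comments routine: returns the first significant byte and
// its position. `continue` replaces the recursive self.next_token() call.
fn next_significant(bytes: &[u8], mut pos: usize) -> (Option<u8>, usize) {
    loop {
        match bytes.get(pos).copied() {
            None => return (None, pos),
            Some(b'#') => {
                // skip_comment(): consume up to (not including) the newline
                while pos < bytes.len() && bytes[pos] != b'\n' {
                    pos += 1;
                }
                continue; // iterative re-entry, constant stack depth
            }
            Some(b) => return (Some(b), pos),
        }
    }
}

fn main() {
    // after the comment is skipped, the loop resumes at the newline
    let (tok, at) = next_significant(b"# comment\nx", 0);
    assert_eq!((tok, at), (Some(b'\n'), 9));
    println!("ok");
}
```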

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@base/sh/src/lexer.rs` around lines 139 - 150, The code currently uses
recursion in next_token() (calling self.next_token() after skip_comment()) and
in lex_word() when a leading backslash-newline produces no bytes, which can
overflow the stack for many comments/continuations; change both to iterative
control flow: in next_token() wrap the match in a loop and replace recursive
re-entry (after skip_comment()) with continue to re-evaluate peek/skip_blanks,
and in lex_word() when encountering a backslash-newline that yields no output,
loop back to continue lexing instead of calling lex_word() recursively; update
any return sites to break/return from the loop with the appropriate Token or
LexError, preserving existing helper calls like skip_comment(), skip_blanks(),
and the lex_word() logic but with iterative control flow.

Comment thread xtask/src/main.rs
Comment on lines +796 to +843
/// Build the `/bin/sh` binary — the vibix POSIX shell.
///
/// Uses the same out-of-tree `-Z build-std` approach as `std_hello`.
/// The crate lives in `base/sh/` (base system program, not a test).
///
/// Not yet wired into `cargo xtask build` because the in-repo std fork
/// has a pre-existing compile error on the vibix target (E0034 in
/// `sys/thread/vibix.rs`). The function is ready to be called once that
/// is resolved.
#[allow(dead_code)]
fn build_userspace_sh() -> R<PathBuf> {
    let ws = workspace_root();
    let target_spec = ws.join(VIBIX_USERSPACE_TARGET);
    let manifest = ws.join("base/sh/Cargo.toml");
    let library_root = ws.join("library");

    let target_dir = ws.join("target");
    let mut cmd = Command::new("cargo");
    cmd.current_dir(&ws)
        .env("__CARGO_TESTS_ONLY_SRC_ROOT", &library_root)
        .args(["build", "--manifest-path"])
        .arg(&manifest)
        .arg("--target-dir")
        .arg(&target_dir)
        .args([
            "-Z",
            "build-std=std,core,alloc,panic_abort",
            "-Z",
            "build-std-features=compiler-builtins-mem",
            "-Z",
            "unstable-options",
            "-Z",
            "json-target-spec",
            "--target",
        ])
        .arg(&target_spec);
    check(cmd.status()?)?;

    let bin = target_dir
        .join("x86_64-unknown-vibix")
        .join("debug")
        .join("sh");
    if !bin.exists() {
        return Err(format!("sh binary missing at {} after build", bin.display()).into());
    }
    strip_debug(&bin)?;
    Ok(bin)
}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

build_userspace_sh() is still unreachable, so /bin/sh never gets built or staged.

#[allow(dead_code)] is accurate here: nothing in this file dispatches to this helper, and no ISO/rootfs path publishes the resulting binary. That means the PR still doesn’t meet the /bin/sh build-integration objective yet. If the std-fork compiler error keeps the full wiring out of scope, I’d avoid treating sh as a current shipped base program until this path is exercised.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@xtask/src/main.rs` around lines 796 - 843, The build_userspace_sh() helper is
never exercised (marked with #[allow(dead_code)]), so /bin/sh is not built or
staged; either wire it into the existing build/packaging dispatch that runs
other userland builders (call build_userspace_sh() from the same place that
builds other base programs or from the xtask build pipeline entrypoint) so the
function is executed and its binary gets staged, or explicitly remove it from
the shipped set (and delete the allow_dead_code) by gating it behind a
feature/flag or keeping it as a non-shipped helper until the std-fork compiler
error is resolved; locate the function by name build_userspace_sh and update the
caller/packager that orchestrates other userland builds so the binary path
produced by build_userspace_sh() is included in the staging/publish step.

@dburkart
Owner Author

dburkart commented May 5, 2026

CodeRabbit findings — deferred as out-of-scope nits

All three CodeRabbit inline findings are classified as out-of-scope nits per our PR review playbook. They suggest worthwhile improvements but would materially expand this PR beyond its intended goal (initial shell lexer/tokenizer). Summary and rationale for deferral:

  1. Byte-wise lexing corrupts valid UTF-8 words — POSIX shell processing is ASCII-primary. Full UTF-8 awareness is a natural follow-up once the lexer is proven correct for the ASCII subset that /bin/sh needs day-to-day.

  2. Recursive re-entry makes stack use proportional to input shape — In practice, recursion depth is bounded by line length in a shell context. An iterative rewrite is a good hardening task but not required for correctness of the initial implementation.

  3. build_userspace_sh() is still unreachable — This is intentional. Issue #883 (Wire /bin/sh into init and add shell integration test, the capstone issue) wires it into the build system; keeping it unreachable here avoids coupling unrelated build-graph changes to the lexer PR.

All three are candidates for follow-up work tracked under #883 and subsequent shell issues.

@dburkart dburkart merged commit 159d210 into main May 5, 2026
29 of 31 checks passed
@dburkart dburkart deleted the m877-sh-lexer branch May 5, 2026 16:53