Conversation
Add the `base/sh/` crate as the foundation for the vibix POSIX shell. The lexer implements token recognition per POSIX.1-2024 §2.3, converting raw input into a stream of typed tokens that a future parser can consume. Supported tokens: - Words (unquoted, single-quoted, double-quoted, backslash-escaped) - Operators: |, ||, &, &&, ;, (, ), <, >, >>, <<, >&, <&, <> - Newline (syntactically significant in shell grammar) - EOF Quoting follows POSIX rules: single quotes preserve everything literally, double quotes recognize $, `, \, and " as special, backslash escapes the next character (or acts as line continuation before newline). Comments starting with # are recognized at word boundaries. The crate is a standard Rust binary using `std` via the in-repo fork, following the `base/` convention. A `build_userspace_sh()` function is added to xtask (not yet wired into `cargo xtask build` due to a pre-existing std compile error on the vibix target). 49 host-side unit tests cover all token types, quoting edge cases, error recovery, and realistic command lines. Closes #877 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughA new ChangesShell Lexer Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
base/sh/src/main.rs (1)
7-23: ⚡ Quick winReplace the demo loop with an explicit stub before this gets wired into xtask.
Right now
shignores stdin/argv and always prints tokens for the built-in sample. A clearnot yet implementederror is safer than silently shipping demo behavior once the build/staging path is enabled.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@base/sh/src/main.rs` around lines 7 - 23, Replace the demo token-printing loop in main with an explicit "not yet implemented" stub: remove the hard-coded input/Lexer::new(...) demo and instead have main return an error (or print a clear "not yet implemented" message and exit non-zero) when stdin/argv handling is not implemented; reference main, Lexer::new, lex.next_token, and Token::Eof to locate and remove the demo loop and ensure the stub prevents silently running the demo when xtask wiring is enabled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@base/sh/src/lexer.rs`:
- Around line 139-150: The code currently uses recursion in next_token()
(calling self.next_token() after skip_comment()) and in lex_word() when a
leading backslash-newline produces no bytes, which can overflow the stack for
many comments/continuations; change both to iterative control flow: in
next_token() wrap the match in a loop and replace recursive re-entry (after
skip_comment()) with continue to re-evaluate peek/skip_blanks, and in lex_word()
when encountering a backslash-newline that yields no output, loop back to
continue lexing instead of calling lex_word() recursively; update any return
sites to break/return from the loop with the appropriate Token or LexError,
preserving existing helper calls like skip_comment(), skip_blanks(), and the
lex_word() logic but with iterative control flow.
- Around line 125-126: The lexer is iterating bytes and converting each byte to
a char (e.g., `c as char`), which corrupts multi-byte UTF‑8 (e.g., "café");
change the implementation to operate on character boundaries: either switch the
lexer input to a &str and iterate with char_indices()/chars() preserving byte
offsets, or keep the byte buffer but accumulate slices of bytes for each token
and call String::from_utf8 on the completed token; update the
functions/variables referenced in the diff (the lexer input field `input: &'a
[u8]`, `pos`, any places that cast `c as char`, and the token-accumulation code
in the regions noted: ~125, 265-287, 302-307, 321-338, 347-358) so tokens are
produced as valid UTF‑8 Strings without per-byte char casts.
In `@xtask/src/main.rs`:
- Around line 796-843: The build_userspace_sh() helper is never exercised
(marked with #[allow(dead_code)]), so /bin/sh is not built or staged; either
wire it into the existing build/packaging dispatch that runs other userland
builders (call build_userspace_sh() from the same place that builds other base
programs or from the xtask build pipeline entrypoint) so the function is
executed and its binary gets staged, or explicitly remove it from the shipped
set (and delete the allow_dead_code) by gating it behind a feature/flag or
keeping it as a non-shipped helper until the std-fork compiler error is
resolved; locate the function by name build_userspace_sh and update the
caller/packager that orchestrates other userland builds so the binary path
produced by build_userspace_sh() is included in the staging/publish step.
---
Nitpick comments:
In `@base/sh/src/main.rs`:
- Around line 7-23: Replace the demo token-printing loop in main with an
explicit "not yet implemented" stub: remove the hard-coded input/Lexer::new(...)
demo and instead have main return an error (or print a clear "not yet
implemented" message and exit non-zero) when stdin/argv handling is not
implemented; reference main, Lexer::new, lex.next_token, and Token::Eof to
locate and remove the demo loop and ensure the stub prevents silently running
the demo when xtask wiring is enabled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1d218f1-5e5b-454d-8389-0bc3f400f0bf
📒 Files selected for processing (5)
base/README.mdbase/sh/Cargo.tomlbase/sh/src/lexer.rsbase/sh/src/main.rsxtask/src/main.rs
| input: &'a [u8], | ||
| pos: usize, |
There was a problem hiding this comment.
Byte-wise lexing corrupts valid UTF-8 words.
Every c as char path here turns one UTF-8 byte into a scalar, so inputs like café tokenize as café. Since this API takes &str and returns String, token contents should preserve valid UTF-8. Please switch the lexer to char-boundary iteration or accumulate byte slices and rebuild with String::from_utf8 at the end.
Also applies to: 265-287, 302-307, 321-338, 347-358
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@base/sh/src/lexer.rs` around lines 125 - 126, The lexer is iterating bytes
and converting each byte to a char (e.g., `c as char`), which corrupts
multi-byte UTF‑8 (e.g., "café"); change the implementation to operate on
character boundaries: either switch the lexer input to a &str and iterate with
char_indices()/chars() preserving byte offsets, or keep the byte buffer but
accumulate slices of bytes for each token and call String::from_utf8 on the
completed token; update the functions/variables referenced in the diff (the
lexer input field `input: &'a [u8]`, `pos`, any places that cast `c as char`,
and the token-accumulation code in the regions noted: ~125, 265-287, 302-307,
321-338, 347-358) so tokens are produced as valid UTF‑8 Strings without per-byte
char casts.
| pub fn next_token(&mut self) -> Result<Token, LexError> { | ||
| self.skip_blanks(); | ||
|
|
||
| match self.peek() { | ||
| None => Ok(Token::Eof), | ||
| Some(b'\n') => { | ||
| self.advance(); | ||
| Ok(Token::Newline) | ||
| } | ||
| Some(b'#') => { | ||
| self.skip_comment(); | ||
| self.next_token() |
There was a problem hiding this comment.
Recursive re-entry makes stack use proportional to input shape.
next_token() recurses after comments, and lex_word() recurses again when a leading backslash-newline contributes no bytes. A long run of continuations/comments can overflow the stack before EOF. This is safer as iterative control flow.
Also applies to: 294-295
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@base/sh/src/lexer.rs` around lines 139 - 150, The code currently uses
recursion in next_token() (calling self.next_token() after skip_comment()) and
in lex_word() when a leading backslash-newline produces no bytes, which can
overflow the stack for many comments/continuations; change both to iterative
control flow: in next_token() wrap the match in a loop and replace recursive
re-entry (after skip_comment()) with continue to re-evaluate peek/skip_blanks,
and in lex_word() when encountering a backslash-newline that yields no output,
loop back to continue lexing instead of calling lex_word() recursively; update
any return sites to break/return from the loop with the appropriate Token or
LexError, preserving existing helper calls like skip_comment(), skip_blanks(),
and the lex_word() logic but with iterative control flow.
| /// Build the `/bin/sh` binary — the vibix POSIX shell. | ||
| /// | ||
| /// Uses the same out-of-tree `-Z build-std` approach as `std_hello`. | ||
| /// The crate lives in `base/sh/` (base system program, not a test). | ||
| /// | ||
| /// Not yet wired into `cargo xtask build` because the in-repo std fork | ||
| /// has a pre-existing compile error on the vibix target (E0034 in | ||
| /// `sys/thread/vibix.rs`). The function is ready to be called once that | ||
| /// is resolved. | ||
| #[allow(dead_code)] | ||
| fn build_userspace_sh() -> R<PathBuf> { | ||
| let ws = workspace_root(); | ||
| let target_spec = ws.join(VIBIX_USERSPACE_TARGET); | ||
| let manifest = ws.join("base/sh/Cargo.toml"); | ||
| let library_root = ws.join("library"); | ||
|
|
||
| let target_dir = ws.join("target"); | ||
| let mut cmd = Command::new("cargo"); | ||
| cmd.current_dir(&ws) | ||
| .env("__CARGO_TESTS_ONLY_SRC_ROOT", &library_root) | ||
| .args(["build", "--manifest-path"]) | ||
| .arg(&manifest) | ||
| .arg("--target-dir") | ||
| .arg(&target_dir) | ||
| .args([ | ||
| "-Z", | ||
| "build-std=std,core,alloc,panic_abort", | ||
| "-Z", | ||
| "build-std-features=compiler-builtins-mem", | ||
| "-Z", | ||
| "unstable-options", | ||
| "-Z", | ||
| "json-target-spec", | ||
| "--target", | ||
| ]) | ||
| .arg(&target_spec); | ||
| check(cmd.status()?)?; | ||
|
|
||
| let bin = target_dir | ||
| .join("x86_64-unknown-vibix") | ||
| .join("debug") | ||
| .join("sh"); | ||
| if !bin.exists() { | ||
| return Err(format!("sh binary missing at {} after build", bin.display()).into()); | ||
| } | ||
| strip_debug(&bin)?; | ||
| Ok(bin) | ||
| } |
There was a problem hiding this comment.
build_userspace_sh() is still unreachable, so /bin/sh never gets built or staged.
#[allow(dead_code)] is accurate here: nothing in this file dispatches to this helper, and no ISO/rootfs path publishes the resulting binary. That means the PR still doesn’t meet the /bin/sh build-integration objective yet. If the std-fork compiler error keeps the full wiring out of scope, I’d avoid treating sh as a current shipped base program until this path is exercised.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@xtask/src/main.rs` around lines 796 - 843, The build_userspace_sh() helper is
never exercised (marked with #[allow(dead_code)]), so /bin/sh is not built or
staged; either wire it into the existing build/packaging dispatch that runs
other userland builders (call build_userspace_sh() from the same place that
builds other base programs or from the xtask build pipeline entrypoint) so the
function is executed and its binary gets staged, or explicitly remove it from
the shipped set (and delete the allow_dead_code) by gating it behind a
feature/flag or keeping it as a non-shipped helper until the std-fork compiler
error is resolved; locate the function by name build_userspace_sh and update the
caller/packager that orchestrates other userland builds so the binary path
produced by build_userspace_sh() is included in the staging/publish step.
CodeRabbit findings — deferred as out-of-scope nitsAll three CodeRabbit inline findings are classified as out-of-scope nits per our PR review playbook. They suggest worthwhile improvements but would materially expand this PR beyond its intended goal (initial shell lexer/tokenizer). Summary and rationale for deferral:
All three are candidates for follow-up work tracked under #883 and subsequent shell issues. |
Closes #877
Summary
base/sh/crate as the foundation for the vibix POSIX shell (/bin/sh)build_userspace_sh()to xtask for vibix-target compilation (not yet wired intocargo xtask builddue to a pre-existing std compile error on the vibix target — same issue that affectsstd_hello)Test plan
cargo test --manifest-path base/sh/Cargo.toml— 49 tests pass, no warningscargo xtask build— kernel build passes, no regressionscargo xtask smoke— all 42 markers presentcargo xtask test— pre-existingblocking_syncflake only (rwlock_concurrent_readers), no new failures