block · lifeizhou-ap · Mar 18, 2026 · Mar 19, 2026 · Mar 19, 2026 · Mar 19, 2026
diff --git a/.agents/skills/create-e2e-test/SKILL.md b/.agents/skills/create-e2e-test/SKILL.md
@@ -0,0 +1,190 @@
+---
+name: create-e2e-test
+description: Create replayable e2e tests for the Goose desktop app. Use when the user wants to record, generate, or verify browser-based UI tests that can run in CI without an AI agent.
+---
+
+# Create E2E Test
+
+You are an AI agent that creates replayable e2e test scenarios for the Goose desktop app using agent-browser CLI.
+
+## Goal
+
+Given a test scenario in natural language, you will:
+
+1. Explore the app using agent-browser
+2. Record a set of deterministic CLI commands as a batch file that can be replayed without an AI agent
+
+**Do NOT read source code to understand the UI.** Do not read `.tsx`, `.ts`, or `.css` files to find elements. Use `snapshot` to discover what is on the page — that is your only method. The one exception: read source code only when you need to add a `data-testid` attribute.
+
+## App Lifecycle
+
+Every time you need a clean app state — whether starting for the first time, retrying during exploration, or verifying a recording — follow these steps:
+
+1. Use the `e2e-app` skill to stop any running instance and start a new one. Note the **test session name** (e.g., `260320-170823`) and **CDP port**.
+2. Connect agent-browser to the CDP port using the test session name:
+   ```bash
+   pnpm exec agent-browser --session <test-session-name> connect <port>
+   ```
+
+### Agent-browser Session Isolation
+
+agent-browser uses `--session` to isolate browser contexts. This prevents multiple agents or tests from interfering with each other.
+
+- **Agent (exploration + replay)**: always use the current app's test session name (e.g., `--session 260320-170823`). Pass it to **every** agent-browser command and to the replay script via `--browser-session`.
+- **In batch JSON**: do **not** include test session names — the replay script handles this.
+- **CI**: no `--session` flag needed — the replay script defaults to the recording filename (e.g., `settings-dark-mode.batch.json` → `settings-dark-mode`).
+
+All `agent-browser` commands must be run from `ui/desktop` using `pnpm exec agent-browser`.
+
+## Workflow
+
+### Phase 1: Explore and Record
+
+1. Start the app using the App Lifecycle steps above.
+
+2. Walk through the test scenario step by step. For each step:
+   - **Snapshot** — run `snapshot` after each action (and once before the first action) since refs are invalidated by DOM changes
+   - **Locate** — identify the element's `@eN` ref from the snapshot, then convert to a stable locator using the Element Locating Strategy (see Reference)
+   - **Act** — perform the action using the stable locator
+   - **Save** — append the working command to the batch file at `ui/desktop/tests/e2e-tests/recordings/<name>.batch.json`
+
+   If you need a clean app state at any point, restart using the App Lifecycle steps, then replay the saved batch file to catch up before continuing.
+
+   Rules:
+   - Use `wait --load networkidle` before snapshotting slow pages
+   - Check `agent-browser errors` if something seems wrong
+   - Never use `@eN` refs in the recording — convert to stable locators immediately
+
+   Example (assuming the app's test session name is `260320-170823`):
+   ```bash
+   # Snapshot
+   agent-browser --session 260320-170823 snapshot
+   # Output:
+   #   - textbox "Chat input" [ref=e2]
+   #   - button "Send" [ref=e3]
+
+   # Locate — get test-id for @e2
+   agent-browser --session 260320-170823 get attr @e2 data-testid
+   # Output: chat-input
+   agent-browser --session 260320-170823 get count "[data-testid='chat-input']"
+   # Output: 1
+
+   # Act — count is 1, so find testid works
+   agent-browser --session 260320-170823 find testid "chat-input" fill "hello"
+
+   # Snapshot again
+   agent-browser --session 260320-170823 snapshot
+
+   # Locate — get test-id for @e3
+   agent-browser --session 260320-170823 get attr @e3 data-testid
+   # Output: send-button
+   agent-browser --session 260320-170823 get count "[data-testid='send-button']"
+   # Output: 2 — duplicate! scope to active session
+
+   # Act — count > 1, so narrow the selector to target a unique match
+   agent-browser --session 260320-170823 click "[data-active-session='true'] [data-testid='send-button']"
+   ```
+
+3. Review the test scenario step by step and confirm you have a recorded command for each one. If any steps are missing, go back to step 2.
+
+   Example batch file (`ui/desktop/tests/e2e-tests/recordings/<name>.batch.json`):
+
+   ```json
+   [
+     ["wait", "[data-testid='chat-input']"],
+     ["fill", "[data-active-session='true'] [data-testid='chat-input']", "hello"],
+     ["wait", "[data-active-session='true'] [data-testid='send-button']"],
+     ["click", "[data-active-session='true'] [data-testid='send-button']"],
+     ["wait", "--text", "Response"]
+   ]
+   ```
+
+   Do **not** include in the batch file: `snapshot`, `get`, `diff`, `console`, `errors`, `open`, `connect`
+
+   **Never** use `wait <ms>` (e.g., `wait 3000`) in the batch file. Always wait for a specific condition:
+   - `wait "[data-testid='element']"` — wait for an element to appear
+   - `wait --text "some text"` — wait for text to appear
+   - `wait --load networkidle` — wait for page to finish loading
+   - `wait --url "**/path"` — wait for navigation
+
+### Phase 2: Verify the Recording
+
+1. Add `wait` commands before actions on dynamic elements. During Phase 1, you used stable locators that run immediately and may hit elements that haven't rendered yet. Add a `wait` before any action that targets a dynamic element:
+
+   Before:
+   ```bash
+   find testid "chat-response" click    # fails — element not yet on page
+   ```
+
+   After:
+   ```bash
+   wait "[data-testid='chat-response']"
+   find testid "chat-response" click
+   ```
+
+2. Restart the app using the App Lifecycle steps.
+
+3. Replay the recording:
+   ```bash
+   bash ui/desktop/tests/e2e-tests/scripts/replay.sh recordings/<name>.batch.json --connect <port> --browser-session <test-session-name>
+   ```
+   Always pass the current app test session name. Exit code 0 = pass, non-zero = fail.
+
+4. If replay fails, restart the app, explore the failing step using the Phase 1 cycle (snapshot → locate → act → save) to find the fix, update the recording, and go back to step 2.
+
+### Phase 3: Write the Scenario
+
+After the recording is verified, write (or update) a scenario file at `ui/desktop/tests/e2e-tests/scenarios/<name>.md` (same base name as the recording, e.g., `settings-dark-mode.batch.json` → `settings-dark-mode.md`). This is a human-readable description of what the test does — the intent, not the implementation.
+
+- Describe each step in terms of **user actions and expected outcomes**, not selectors or test IDs
+- Keep it concise — one line per step. The file should only contain a title and numbered steps, nothing else
+- The scenario serves as the source of truth for re-recording if the test breaks
+
+Example (`scenarios/settings-dark-mode.md`):
+```markdown
+# Settings: Dark Mode Toggle
+
+1. Open Settings
+2. Navigate to the App tab
+3. Verify the app is in light mode
+4. Switch to dark mode and verify it applies
+5. Switch back to light mode and verify it applies
+```
+
+## Reference
+
+### Element Locating Strategy
+
+**Always** verify uniqueness with `get count` before using any locator. If count > 1, narrow the selector or fall back to the next strategy.
+
+For each element, find a stable locator using this priority:
+
+1. **Semantic locator (preferred)**: use the role and name directly from the snapshot (e.g., `button "Send"` → `find role button --name "Send" click`). Never use a bare role without `--name`.
+   - Count is 1 → use `find role <role> --name "<name>" <action>`
+   - Count > 1 → fall back to step 2
+
+2. **Test ID**: `get attr @eN data-testid` → if exists, use `find testid "<id>" <action>`.
+   - If count > 1 and the element is inside a chat session, scope to `[data-active-session='true'] [data-testid='<id>']`
+   - If still count > 1, use `find first "[data-testid='<id>']" <action>` or `find nth <index> "[data-testid='<id>']" <action>` (0-based index)
+
+3. **Add a data-testid (last resort)**: if neither above works, add a `data-testid` to the source code.
+   - Names must be globally unique and unambiguous. Include the parent component or location, the element type, and its purpose (e.g., `bottom-menu-alert-dot` not `alert-dot`, `session-card` not `card`)
+   - Only add the `data-testid` attribute — do not change any other source code
+   - Note the code change so it can be committed alongside the test
+
+**Never** use `@eN` refs in recorded commands — they are session-specific.
+
+### Assertions
+
+Use `wait` and `is` commands as assertions in the recording:
+
+- `wait --text "Success"` — assert text appears (with timeout)
+- `is visible ".error-message"` — assert element is visible
+- `wait --url "**/dashboard"` — assert navigation happened
+
+### Tips
+
+- Run `pnpm exec agent-browser --help` or `pnpm exec agent-browser <command> --help` to learn unfamiliar commands
+- Start with `wait --load networkidle` after `open` to ensure the page is ready
+- Use `wait --text` over `wait <ms>` — it's more resilient to timing variations
+- Keep recordings short — one user journey per file
+- Name files descriptively: `login-with-email.batch.json`, `send-chat-message.batch.json`
+- The "Chat" nav button toggles the chat list and the "New Chat" entry. It is expanded by default on a fresh app.
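The recording rules in this skill (no exploration-only commands, no fixed `wait <ms>`, no `@eN` refs) can be checked mechanically before a replay. The sketch below is a hypothetical pre-flight check, not part of `replay.sh` or agent-browser. `lint_batch` and `default_browser_session` are illustrative names, and the lint assumes one command array per line, as in the example batch file, with fixed waits written as `["wait", "3000"]` or `["wait", 3000]`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; replay.sh may work differently.

# Derive the default browser session name from a recording path,
# e.g. recordings/settings-dark-mode.batch.json -> settings-dark-mode.
default_browser_session() {
  local file
  file=$(basename "$1")
  printf '%s\n' "${file%.batch.json}"
}

# Print offending lines (with line numbers) and return non-zero if the
# recording breaks any of the rules. Assumes one command array per line.
lint_batch() {
  local batch="$1" rc=0
  # Exploration-only commands must not be recorded.
  if grep -nE '^[[:space:]]*\["(snapshot|get|diff|console|errors|open|connect)"' "$batch"; then
    rc=1
  fi
  # Fixed sleeps such as ["wait", "3000"] are not allowed.
  if grep -nE '^[[:space:]]*\["wait",[[:space:]]*"?[0-9]+"?\]' "$batch"; then
    rc=1
  fi
  # Snapshot refs such as "@e3" are not stable across runs.
  if grep -nE '"@e[0-9]+"' "$batch"; then
    rc=1
  fi
  return "$rc"
}
```

Under these assumptions, `lint_batch ui/desktop/tests/e2e-tests/recordings/<name>.batch.json` would run before the replay step in Phase 2. A line-based `grep` is used to keep the sketch dependency-free; a recording formatted differently would need a real JSON parser.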
diff --git a/.agents/skills/e2e-app/SKILL.md b/.agents/skills/e2e-app/SKILL.md
@@ -0,0 +1,52 @@
+---
+name: e2e-app
+description: Start and stop the Goose Electron app ONLY for e2e testing. Use when you need to launch, manage, or tear down the desktop app for end-to-end tests.
+---
+
+# E2E App Management
+
+Scripts are in `ui/desktop/tests/e2e-tests/scripts/`.
+
+## Starting the App
+
+The start script blocks (runs Electron in foreground), so use `screen` to background it.
+The script self-activates hermit for `pnpm`/`node`, but needs `ANTHROPIC_API_KEY` in the environment.
+
+```bash
+TEST_SESSION_NAME=$(date +"%y%m%d-%H%M%S")
+SCREEN_NAME="e2e-$(date +%s)"
+screen -dmS $SCREEN_NAME bash -c "source ~/.zshrc 2>/dev/null; bash ui/desktop/tests/e2e-tests/scripts/e2e-start.sh $TEST_SESSION_NAME"
+```
+
+Then wait for the port file and verify the app is listening:
+
+```bash
+# Wait for port file and app to be ready (up to 30s)
+for i in $(seq 1 30); do
+  if [[ -f "/tmp/goose-e2e/sessions/$TEST_SESSION_NAME/.port" ]]; then
+    CDP_PORT=$(cat /tmp/goose-e2e/sessions/$TEST_SESSION_NAME/.port)
+    if lsof -i :"$CDP_PORT" &>/dev/null; then
+      echo "App ready — Test session name: $TEST_SESSION_NAME, CDP port: $CDP_PORT"
+      break
+    fi
+  fi
+  sleep 1
+done
+```
+
+If the app doesn't start, inspect the screen session:
+```bash
+screen -ls                    # verify screen session exists
+screen -r $SCREEN_NAME        # attach to see errors (Ctrl-A D to detach)
+```
+
+Common startup failures:
+- `ANTHROPIC_API_KEY must be set`: the key is not in the environment; ensure `~/.zshrc` exports it
+- `pnpm: not found`: hermit activation failed; the start script is expected to activate hermit itself, so run it directly to see why it did not
+- Screen session dies immediately: check `screen -ls`; if no session is listed, run the script directly to see errors
+
+## Stopping the App
+
+```bash
+bash ui/desktop/tests/e2e-tests/scripts/e2e-stop.sh <test-session-name>
+```
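The readiness loop in "Starting the App" falls through silently when the 30 seconds elapse, so a caller cannot tell a slow start from a failed one. A minimal sketch of a variant that fails loudly is shown below. `wait_for_cdp_port` and the `E2E_SESSIONS_DIR` override are hypothetical names introduced here for illustration; the default directory is the one used in the loop above, and the `lsof` listening check from that loop would still follow as a separate step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait for the .port file of a test session and
# print the CDP port, or return non-zero once the timeout is reached.
wait_for_cdp_port() {
  local session="$1" timeout="${2:-30}"
  # E2E_SESSIONS_DIR is an assumed override, useful for local testing.
  local port_file="${E2E_SESSIONS_DIR:-/tmp/goose-e2e/sessions}/$session/.port"
  local i
  for ((i = 0; i < timeout; i++)); do
    # -s: the file exists and is non-empty.
    if [[ -s "$port_file" ]]; then
      cat "$port_file"
      return 0
    fi
    sleep 1
  done
  echo "Timed out after ${timeout}s waiting for $port_file" >&2
  return 1
}
```

A caller could then write `CDP_PORT=$(wait_for_cdp_port "$TEST_SESSION_NAME") || exit 1` and stop early instead of continuing with an empty port.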
diff --git a/.github/workflows/pr-smoke-test.yml b/.github/workflows/pr-smoke-test.yml
@@ -55,15 +55,15 @@ jobs:
         with:
           ref: ${{ github.event.inputs.branch || github.ref }}
 
-      - uses: actions-rust-lang/setup-rust-toolchain@v1
+      - uses: actions-rust-lang/setup-rust-toolchain@150fca883cd4034361b621bd4e6a9d34e5143606  # v1
 
       - name: Install Dependencies
         run: |
           sudo apt update -y
           sudo apt install -y libdbus-1-dev gnome-keyring libxcb1-dev
 
       - name: Cache Rust dependencies
-        uses: Swatinem/rust-cache@v2
+        uses: Swatinem/rust-cache@42dc69e1aa15d09112580998cf2ef0119e2e91ae  # v2
 
       - name: Build Binary for Smoke Tests
         run: |
@@ -83,6 +83,39 @@ jobs:
           path: target/debug/goosed
           retention-days: 1
 
+  e2e-desktop-tests:
+    name: E2E Desktop Tests
+    runs-on: macos-latest
+    needs: changes
+    if: needs.changes.outputs.code == 'true' || github.event_name == 'workflow_dispatch'
+    steps:
+      - name: Checkout Code
+        uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8  # v6.0.1
+        with:
+          ref: ${{ github.event.inputs.branch || github.ref }}
+
+      - uses: actions-rust-lang/setup-rust-toolchain@150fca883cd4034361b621bd4e6a9d34e5143606  # v1
+
+      - name: Cache Rust dependencies
+        uses: Swatinem/rust-cache@42dc69e1aa15d09112580998cf2ef0119e2e91ae  # v2
+
+      - name: Install GNU timeout (if missing)
+        run: command -v timeout || brew install coreutils
+
+      - name: Run E2E Tests
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          GOOSE_DISABLE_KEYRING: 1
+        run: source bin/activate-hermit && just e2e
+
+      - name: Upload E2E Test Results
+        if: always()
+        uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f  # v6.0.0
+        with:
+          name: e2e-test-results
+          path: ui/desktop/tests/e2e-tests/results/
+          retention-days: 7
+
   smoke-tests:
     name: Smoke Tests
     runs-on: ubuntu-latest

diff --git a/Justfile b/Justfile
@@ -77,6 +77,15 @@ release-intel:
     cargo build --release --target x86_64-apple-darwin
     @just copy-binary-intel
 
+copy-goosed BUILD_MODE="release":
+    @if [ -f ./target/{{BUILD_MODE}}/goosed ]; then \
+        echo "Copying goosed binary from target/{{BUILD_MODE}}..."; \
+        cp -p ./target/{{BUILD_MODE}}/goosed ./ui/desktop/src/bin/; \
+    else \
+        echo "goosed binary not found in target/{{BUILD_MODE}}"; \
+        exit 1; \
+    fi
+
 copy-binary BUILD_MODE="release":
     @if [ -f ./target/{{BUILD_MODE}}/goosed ]; then \
         echo "Copying goosed binary from target/{{BUILD_MODE}}..."; \
@@ -464,3 +473,14 @@ build-test-tools:
 record-mcp-tests: build-test-tools
   GOOSE_RECORD_MCP=1 cargo test --package goose --test mcp_integration_test
   git add crates/goose/tests/mcp_replays/
+
+e2e:
+    @echo "Building goosed..."
+    cargo build --bin goosed
+    @just copy-goosed debug
+    @echo "Installing dependencies..."
+    cd ui && pnpm install --frozen-lockfile
+    @echo "Generating API types..."
+    cd ui/desktop && pnpm run generate-api
+    @echo "Running E2E tests..."
+    bash ui/desktop/tests/e2e-tests/scripts/e2e-run-all.sh
diff --git a/ui/desktop/.gitignore b/ui/desktop/.gitignore
@@ -11,3 +11,6 @@ src/bin/goose-npm/
 src/bin/temporal.db
 # Signing credentials
 .env.signing
+
+tests/e2e-tests/results/
+tests/e2e-tests/results-rerun
diff --git a/ui/desktop/package.json b/ui/desktop/package.json
@@ -132,6 +132,7 @@
     "@vitejs/plugin-react": "^5.1.4",
     "@vitest/coverage-v8": "^4.0.18",
     "@vitest/ui": "^4.0.18",
+    "agent-browser": "^0.20.14",
     "autoprefixer": "^10.4.24",
     "electron": "41.0.0",
     "electron-devtools-installer": "^4.0.0",

diff --git a/ui/desktop/src/components/ChatInput.tsx b/ui/desktop/src/components/ChatInput.tsx
@@ -1367,6 +1367,7 @@ export default function ChatInput({
                       size="sm"
                       shape="round"
                       variant="outline"
+                      data-testid="send-button"
                       disabled={isSubmitButtonDisabled}
                       className={`rounded-full px-10 py-2 flex items-center gap-2 ${
                         isSubmitButtonDisabled
@@ -1593,6 +1594,7 @@ export default function ChatInput({
                       variant="ghost"
                       size="sm"
                       className="flex items-center justify-center text-text-primary/70 hover:text-text-primary text-xs cursor-pointer"
+                      data-testid="recipe-action-button"
                     >
                       <ChefHat size={16} />
                     </Button>

diff --git a/ui/desktop/src/components/ChatSessionsContainer.tsx b/ui/desktop/src/components/ChatSessionsContainer.tsx
@@ -46,6 +46,7 @@ export default function ChatSessionsContainer({
             key={session.sessionId}
             className={`absolute inset-0 ${isVisible ? 'block' : 'hidden'}`}
             data-session-id={session.sessionId}
+            data-active-session={isVisible}
           >
             <BaseChat
               setChat={setChat}

diff --git a/ui/desktop/src/components/Layout/CondensedRenderer.tsx b/ui/desktop/src/components/Layout/CondensedRenderer.tsx
@@ -177,6 +177,7 @@ export const CondensedRenderer: React.FC<NavigationRendererProps> = ({
                                 'flex items-center justify-center'
                               )}
                               title="New Chat"
+                              data-testid="nav-new-chat-btn"
                             >
                               <Plus className="w-4 h-4" />
                             </motion.button>

diff --git a/ui/desktop/src/components/Layout/navigation/ChatSessionsDropdown.tsx b/ui/desktop/src/components/Layout/navigation/ChatSessionsDropdown.tsx
@@ -47,7 +47,7 @@ export const ChatSessionsDropdown: React.FC<ChatSessionsDropdownProps> = ({
         className="flex items-center gap-2 px-3 py-2 text-sm rounded-lg cursor-pointer"
       >
         <Plus className="w-4 h-4 flex-shrink-0" />
-        <span>New Chat</span>
+        <span data-testid="nav-start-new-chat">New Chat</span>
       </DropdownMenuItem>
 
       {sessions.length > 0 && <DropdownMenuSeparator className="my-1" />}
@@ -96,7 +96,7 @@ export const ChatSessionsDropdown: React.FC<ChatSessionsDropdownProps> = ({
             className="flex items-center gap-2 px-3 py-2 text-sm rounded-lg cursor-pointer text-text-secondary"
           >
             <History className="w-4 h-4 flex-shrink-0" />
-            <span>Show All</span>
+            <span data-testid="nav-show-all-sessions">Show All</span>
           </DropdownMenuItem>
         </>
       )}