Skip to content

Add agent-driven e2e testing framework for desktop#8032

Open
lifeizhou-ap wants to merge 38 commits intomainfrom
lifei/agent-browser
Open

Add agent-driven e2e testing framework for desktop#8032
lifeizhou-ap wants to merge 38 commits intomainfrom
lifei/agent-browser

Conversation

@lifeizhou-ap
Copy link
Collaborator

@lifeizhou-ap lifeizhou-ap commented Mar 20, 2026

Summary

Created a new e2e testing framework for the Goose desktop app using agent-browser CLI, enabling test creation via natural language and deterministic replay in CI.

Features

  • create-e2e-test and e2e-app skills for creating e2e tests via natural language
  • debug_e2e_failures recipe for local/CI e2e failure diagnosis
  • Scripts to start, stop, and manage isolated Goose desktop instances
  • Parallel test execution with e2e-run-all.sh with logs, videos and failed test screenshot for debugging
  • e2e test recordings covering chat, extensions, recipes, and settings. The tests are json version of agent-browser CLI
  • e2e tests run automatically in CI via pr-smoke-test.yml and generated artifacts with logs, videos and failed test screenshot for debugging

Changes

  • ui/desktop/src/main.ts: E2E mode support — userData isolation, fixed window size, dialog auto-confirm, GOOSE_WORKING_DIR override
  • ui/desktop/tests/e2e-tests/scripts/: e2e-start.sh, e2e-stop.sh, e2e-run-all.sh, replay.sh
  • ui/desktop/tests/e2e-tests/recordings/: 9 batch test files
  • ui/desktop/tests/e2e-tests/scenarios/: Human-readable test scenarios
  • .agents/skills/create-e2e-test/, .agents/skills/e2e-app/: Skills for test authoring
  • workflow_recipes/debug_e2e_failures/: Recipe for local/CI e2e failure diagnosis
  • .github/workflows/pr-smoke-test.yml: e2e CI integration
  • ui/desktop/tests/e2e-tests/scripts/e2e-start.sh: Local dev config via ~/.config/goose/e2e.env

Note

If this pattern works well, we can remove the existing playwright e2e tests as they are already not actively maintained now

* main: (22 commits)
  feat: add gemini-acp provider, update docs on subscription models + improvements to codex (#8000)
  fix(openai): use Responses API for gpt-5.4 (#7982)
  Remove lead/worker provider (#7989)
  chore(release): release version 1.28.0 (#7991)
  Fix empty tool results from resource content (e.g. auto visualiser) (#7866)
  Separate SSE streaming from POST work submission (#7834)
  fix: include token usage in Databricks streaming responses (#7959)
  Optimize tool summarization (#7938)
  fix: overwrite the deprecated googledrive extension config (#7974)
  refactor: remove unnecessary Arc<Mutex> from tool execution pipeline (#7979)
  Revert message flush & test (#7966)
  docs: add Remote Access section with Telegram Gateway documentation (#7955)
  fix: update webmcp blog post metadata image URL (#7967)
  fix: clean up OAuth token cache on provider deletion (#7908)
  fix: hard-coded tool call id in code mode callback (#7939)
  Fix SSE parsers to accept optional space after data: prefix (#7929)
  docs: add GOOSE_INPUT_LIMIT to config-files.md (#7961)
  Add WebMCP for Beginners blog post (#7957)
  Fix download manager (#7933)
  Improve the formatting of tool calls, show thinking, treat Reasoning and Thinking as the same thing (sorry Kant) (#7626)
  ...
* main: (37 commits)
  fix: handle reasoning content blocks in OpenAI-compat streaming parser (#8078)
  chore(acp): build native packages on latest mac (#8075)
  Display delegate sub agents logs in UI (#7519)
  Update tar version to avoid CVE-2026-33056 (#8073)
  refactor: consolidate duplicated dependencies into workspace (#8041)
  tui: set up for publishing via github actions (#8020)
  feat: feature-gate local inference dependencies (#7976)
  feat: ability to manage sub recipes in desktop ui (#6360)
  Tweak the release process: no more merge to main (#7994)
  fix: gemini models via databricks (#8042)
  feat(apps): Pass toolInfo to MCP Apps via hostContext (#7506)
  fix: remove configured marker when deleting oauth provider configuration (#7887)
  docs: add vmware-aiops MCP extension documentation (#8055)
  Show setup instructions for ACP providers in settings modal (#8065)
  deps: replace sigstore-verification with sigstore-verify to kill vulns (#8064)
  feat(acp): add session/set_config and stabilize list, delete and close (#7984)
  docs: Correct `gosoe` typo to `goose` (#8062)
  fix: use default provider and model when provider in session no longer exists (#8035)
  feat: add GOOSE_SHELL env var to configure preferred shell (#7909)
  fix(desktop): fullscreen header bar + always-visible close controls (#8033)
  ...
@lifeizhou-ap lifeizhou-ap marked this pull request as ready for review March 24, 2026 01:36
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1fc2bf84b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


STEP=$((i + 1))
echo "[$STEP/$TOTAL] agent-browser ${GLOBAL_ARGS[*]} ${ARGS[*]} --timeout $CMD_TIMEOUT_MS"
if ! timeout "$CMD_TIMEOUT" pnpm exec agent-browser "${GLOBAL_ARGS[@]}" "${ARGS[@]}"; then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Respect command-specific timeouts in replay

replay.sh hard-caps every agent-browser command to 10 seconds via timeout "$CMD_TIMEOUT" ..., so recorded steps that intentionally request longer waits (for example wait --timeout 60000/120000 in the new extension recordings) will still be killed at 10s. This makes long-running LLM/tool steps fail even when the recording specifies a larger timeout, causing deterministic false failures in the new E2E workflow.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

// In e2e test mode, auto-confirm dialogs so CDP-based tests can proceed
// without needing to interact with native OS dialogs.
if (process.env.ENABLE_PLAYWRIGHT) {
return { response: 1 };

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return the actual affirmative dialog response in E2E mode

In ENABLE_PLAYWRIGHT mode this handler always returns { response: 1 }, but several callers treat index 0 as the confirm action (e.g., KeyboardShortcutsSection conflict/reset dialogs). That means these dialogs are auto-cancelled instead of auto-confirmed during E2E runs, so test behavior diverges from user behavior and some flows become untestable.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current E2E tests (eg: delete recipe) that trigger showMessageBox all use ['Cancel', 'Action'] button order, so response: 1 correctly auto-confirms them. The keyboard shortcut dialogs that use index 0 as confirm aren't used in any E2E test today. We'll revisit if that changes.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bef76bc45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +23 to +25
PORT=$((9300 + RANDOM % 100))
if ! lsof -i :"$PORT" &>/dev/null; then
echo "$PORT"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reserve CDP ports atomically across parallel workers

pick_port does a check-then-use (lsof then echo) on a shared 9300–9399 range, but e2e-run-all.sh launches multiple tests concurrently (xargs -P "$WORKERS"). Two workers can select the same still-unbound port, which makes one session connect to the wrong Electron instance or fail nondeterministically once one process binds first. This creates cross-test contamination and flaky CI outcomes in the new parallel runner.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed this

Comment on lines +86 to +90
if ! CDP_PORT=$(wait_for_app "$TEST_NAME"); then
local DURATION=$(( SECONDS - START_TIME ))
echo "[$TEST_NAME] FAIL — app did not start within 30s (${DURATION}s)"
echo "FAIL ${DURATION}s" > "$STATUS_DIR/$TEST_NAME"
return

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Always stop the test session on startup timeout

When app startup does not become ready within 30s, run_one returns early before reaching the teardown call (e2e-stop.sh). Since the app was launched in a detached screen session, this path can leave orphaned Electron/goosed processes and stale session artifacts, which then interfere with subsequent recordings in the same run (e.g., port conflicts and residual state).

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

@lifeizhou-ap lifeizhou-ap Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dcb343a45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +21 to +23
if [[ -n "$CDP_PORT" ]]; then
lsof -ti :"$CDP_PORT" 2>/dev/null | xargs kill -9 2>/dev/null || true
pkill -9 -f "remote-debugging-port=$CDP_PORT" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop goosed cleanly instead of SIGKILLing Electron

This teardown path force-kills the CDP listener with kill -9, which bypasses Electron shutdown hooks that normally call goosedResult.cleanup() in main.ts. Because goosed is spawned as goosed agent (no session path in argv), the later pattern kill in e2e-stop.sh does not reliably target it, so tests can leave orphan backend processes running and contaminate subsequent runs.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

@lifeizhou-ap lifeizhou-ap Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary shutdown is screen -X quit now, which sends SIGHUP and lets Electron run its shutdown hooks including goosedResult.cleanup(). The kill -9 lines after that are safety nets that only fire on processes that survived the graceful shutdown.

trap 'echo ""; echo "Interrupted, stopping active E2E sessions..."; cleanup_and_exit 130' INT TERM

# Run recordings in parallel with worker limit
printf '%s\n' "${RECORDINGS[@]}" | xargs -P "$WORKERS" -I {} bash -c "run_one '{}' '$STATUS_DIR'" || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate parallel workers when single-instance lock applies

The runner always executes recordings in parallel, but the app still enforces app.requestSingleInstanceLock() on non-macOS in main.ts. On Linux/Windows, only one Electron instance can remain alive, so additional workers will fail startup and be reported as test failures; this needs a platform guard (or forcing --workers 1) to avoid deterministic false negatives outside macOS.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only supports macos at the moment

Comment on lines +102 to +103
timeout "$TIMEOUT" bash "$SCRIPT_DIR/replay.sh" \
"$RECORDING" \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect scenario time budgets before test-level timeout

Each recording is wrapped in a fixed timeout "$TIMEOUT" (default 120s), but new recordings include waits up to 120s after earlier long waits (for example, extension-install-playwright.batch.json has 60s + 60s + 120s waits). This causes the harness to terminate otherwise valid long-running scenarios before their own declared timeouts elapse, producing avoidable TIMEOUT failures.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

@lifeizhou-ap lifeizhou-ap Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this timeout as we already have timeout when running agent browser command

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8281334588

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const goosedResult = await startGoosed({
serverSecret,
dir: dir || os.homedir(),
dir: process.env.GOOSE_WORKING_DIR || dir || os.homedir(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve requested chat directory over env default

createChat now always prefers process.env.GOOSE_WORKING_DIR over the per-window dir argument, so any flow that passes a specific directory (e.g., deep links/new-window creation with dir) is silently ignored whenever that env var is set. This regresses directory-scoped behavior in those launches: users/tests can request a different folder, but startGoosed still starts in the env folder. Restricting the env override to initial defaulting (or to explicit E2E mode) avoids breaking dir-driven window creation.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine now. GOOSE_WORKING_DIR never worked before this PR. This change is specifically for e2e test isolation. Will create a follow-up PR to consolidate the existing duplicated openDir logic so directory resolution is handled in one place.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0430864b99

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


pkill -9 -f "$BASE_DIR" 2>/dev/null || true
pkill -9 -f 'agent-browser-chrome' 2>/dev/null || true
pkill -9 -f 'agent-browser-darwin\|agent-browser-linux' 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use ERE alternation when killing agent-browser helpers

pkill patterns are interpreted as regular expressions (BSD/FreeBSD pgrep/pkill docs describe these as extended regex), so alternation should be |, not \|. With the current pattern, this command matches a literal pipe and does not match agent-browser-darwin or agent-browser-linux, leaving helper processes alive after teardown; those leftovers can interfere with later E2E runs on the same machine.

Useful? React with 👍 / 👎.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0c9c54d77

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

rm -f "$BASE_DIR/.port-locks/$CDP_PORT"
fi

pkill -9 -f "$SESSION_DIR" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Kill goosed explicitly for per-session teardown

This teardown path does not reliably terminate the backend process after force-killing Electron. stop_session only pattern-kills by "$SESSION_DIR", but startGoosed launches goosed as goosed agent with the session path only in cwd (not argv), so that match can miss; and in the normal runner path we call e2e-stop.sh "$TEST_NAME", which exits before the global fallback sweeps at the bottom of the script. Fresh evidence: ui/desktop/src/goosed.ts uses spawnArgs = ['agent'] and cwd: workingDir, so argv-based pkill -f "$SESSION_DIR" is not a reliable cleanup mechanism. When this miss happens, orphan goosed processes can leak into later tests and cause cross-test contamination/flakes.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

screen -X quit is the primary kill — it triggers Electron's graceful shutdown which SIGTERMs goosed. The pkill lines are last-resort fallbacks. We also fixed a screen session name mismatch that was preventing the graceful path from working.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 278f91582c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

uses: Swatinem/rust-cache@42dc69e1aa15d09112580998cf2ef0119e2e91ae # v2

- name: Install GNU timeout (if missing)
run: command -v timeout || brew install coreutils

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make timeout available on macOS before running E2E

On macos-latest, this step only runs brew install coreutils when timeout is missing, but the replay harness later calls timeout directly for every command. Homebrew coreutils installs gtimeout unless the gnubin path is added, so in the common case where timeout is absent the E2E run can still fail with timeout: command not found before tests execute. Please either export the coreutils gnubin directory (or symlink) here, or update the scripts to fall back to gtimeout.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout wasn't available on the macOS runner by default, so this step installs coreutils as a fallback. Once installed, timeout is available on PATH and the scripts work. This has been verified in CI.

Comment on lines +32 to +36
gh run download {run_id} --repo block/goose --name e2e-test-results --dir /tmp/e2e-results 2>&1
# example: result_url: https://github.com/block/goose/actions/runs/23487840554/artifacts/6079302235
# gh run download 23487840554 --repo block/goose --name e2e-test-results --dir /tmp/e2e-results 2>&1
rm -rf ui/desktop/tests/e2e-tests/results-ci
unzip -o /tmp/e2e-results.zip -d ui/desktop/tests/e2e-tests/results-ci

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop nonexistent unzip step after gh artifact download

When results_url is provided, these instructions call gh run download ... --dir /tmp/e2e-results and then immediately unzip /tmp/e2e-results.zip. gh run download already extracts artifacts into the target directory, so this zip file is not produced; the extra unzip command will fail and block Step 1 for CI-artifact debugging runs.

Useful? React with 👍 / 👎.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d38e0721cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +67 to +69
if [[ $waited -ge 120 ]]; then
echo "[$(ts)] WARNING: startup lock stale after 60s, breaking it"
rm -rf "$STARTUP_LOCK"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check lock age before breaking startup lock

The stale-lock logic removes STARTUP_LOCK after a waiter has blocked for 60 seconds, but that does not prove the lock holder crashed. If a legitimate startup takes longer than 60s (e.g., slow electron-forge rebuild), another worker will delete the active lock and start concurrently, reintroducing the .vite/build race this lock is intended to prevent and causing intermittent startup/test failures.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Startup usually takes less than 60s. stale lock blocking CI forever might be worse than a rare race condition.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 872f0273af

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +110 to +114
release_startup_lock
local DURATION=$(( SECONDS - START_TIME ))
echo "[$(ts)] [$TEST_NAME] FAIL — app did not start within 30s (${DURATION}s)"
echo "FAIL ${DURATION}s" > "$STATUS_DIR/$TEST_NAME"
bash "$SCRIPT_DIR/e2e-stop.sh" "$TEST_NAME" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Hold startup lock until timed-out app is fully stopped

On the startup-timeout path, the script releases STARTUP_LOCK before calling e2e-stop.sh. If the timed-out electron-forge process is still rebuilding .vite/build when the 30s wait expires, another worker can immediately acquire the lock and start a second app concurrently, reintroducing the same shared-build race this lock is meant to prevent. Keep the lock held until teardown of the timed-out session completes.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +42 to +43
if [[ ! -d "$SESSIONS_DIR" ]]; then
exit 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run global fallback cleanup even without session dirs

This early exit 0 skips the global fallback pkill cleanup at the bottom of the script whenever /tmp/goose-e2e/sessions is absent. In the normal runner flow, per-test teardown removes session directories, so a later all-sessions stop can bypass fallback cleanup and leave orphan helper processes alive from previous failures, which can interfere with subsequent runs. The script should skip only the per-session loop, not terminate before global cleanup.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@lifeizhou-ap lifeizhou-ap changed the title Lifei/agent browser Add agent-driven e2e testing framework for desktop Mar 25, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc8d2de0e2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +67 to +69
if [[ $waited -ge 120 ]]; then
echo "[$(ts)] WARNING: startup lock stale after 60s, breaking it"
rm -rf "$STARTUP_LOCK"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Break startup lock only once after stale detection

Fresh evidence: because waited is never reset, once a worker has waited 60s it runs rm -rf "$STARTUP_LOCK" on every 0.5s retry. If another worker is legitimately still starting Electron, this loop can repeatedly delete an active lock and let multiple startups run concurrently, which reintroduces the shared .vite/build race this lock is meant to prevent. Limit stale-lock stealing to a single verified stale lock (e.g., based on lock age/owner) before retrying.

Useful? React with 👍 / 👎.

Copy link
Collaborator Author

@lifeizhou-ap lifeizhou-ap Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you put the same comment as above. I have replied with

"Startup usually takes less than 60s. stale lock blocking CI forever might be worse than a rare race condition."

* main:
  Fix user message text silently dropped when message contains both text and image (#8071)
  fix: use pnpm when publishing (#8092)
  Alexhancock/publish npm file format fix (#8091)
  fix: iteration on the publish-npm workflow (#8087)
  chore: ignore unmaintained warning for proc-macro-error (#8084)
  chore: clean up stray recipe.yaml (#8086)
  chore(release): bump version to 1.29.0 (minor) (#8088)
  Update lockfile references (#8085)
  Fix version bump (#8083)
  Add a code review step which uses a short-lived provider token (#7932)
@jh-block jh-block self-assigned this Mar 25, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 912240b7a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@@ -0,0 +1,17 @@
[
["wait", "[data-testid='chat-input']", "--timeout", "60000"],
["eval", "window.electron.emit('add-extension', {}, 'goose://extension?cmd=npx&arg=%40playwright%2Fmcp%40latest&id=playwright&name=Playwright&description=Interact%20with%20web%20pages%20through%20structured%20accessibility%20snapshots%20using%20Playwright')"],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pin Playwright MCP dependency to a fixed version

This recording installs Playwright MCP via @latest, so each CI run can pull different upstream code and silently change behavior or break setup even when this repo is unchanged. That makes the new E2E suite non-reproducible and can create false failures unrelated to the PR under test; please pin a concrete package version in the deeplink command.

Useful? React with 👍 / 👎.

["wait", "[data-testid='nav-new-chat-btn']"],
["find", "testid", "nav-new-chat-btn", "click"],
["wait", "[data-testid='chat-input']"],
["find", "testid", "chat-input", "fill", "Use the Playwright browser to navigate to google.com and search for cats. Use the browser_navigate tool to go to https://www.google.com and then use browser_type and browser_click to search for cats."],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid live Google dependency in deterministic replay

This step drives a real google.com search and then asserts success within a short timeout, which makes the test depend on external network availability, consent/CAPTCHA behavior, and third-party page changes. In CI this can fail even when Goose itself is healthy, so the scenario should target a controlled fixture/local page (or mocked endpoint) to keep replay deterministic.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants