Skip to content

feat(go): implement Go bindings Phase 2 — services, actions, interop tests#147

Open
YuanYuYuan wants to merge 26 commits intomainfrom
dev/go-phase-2
Open

feat(go): implement Go bindings Phase 2 — services, actions, interop tests#147
YuanYuYuan wants to merge 26 commits intomainfrom
dev/go-phase-2

Conversation

@YuanYuYuan
Copy link
Copy Markdown
Collaborator

@YuanYuYuan YuanYuYuan commented Mar 18, 2026

Summary

Key Changes

  • crates/ros-z/src/ffi/service.rs, ffi/action.rs — Rust CGO FFI layer for service and action server/client with cbindgen-generated header additions
  • crates/ros-z-go/rosz/service.goServiceClient (Call, CallWithTimeout) and ServiceServer (callback-based) with structured error codes
  • crates/ros-z-go/rosz/action.goActionClient (SendGoal, GetResult, Cancel) and ActionServer with ServerGoalHandle (feedback publishing, cancellation); goal status tracking via atomic.Int32
  • crates/ros-z-go/rosz/service_typed.go, action_typed.go — typed helpers BuildTypedServiceServer, CallTyped, BuildTypedActionServer, SendTypedGoal, GetTypedResult eliminating CDR boilerplate
  • crates/ros-z-go/interop_tests/ — 18 integration tests (pub/sub, service, action) across Go↔Go, Go↔ROS 2, and Go↔Rust paths; isolated per-test Zenoh router; graceful skip when ROS 2 or Rust binaries are absent
  • crates/ros-z-go/examples/service_client, service_server, action_client, action_server, service_client_errors, action_client_errors, production_service (rate limiting, retry backoff, structured logging)
  • book/go_bindings.md, go_quick_start.md, go_interop_tests.md chapters covering the full Phase 2 API
  • CGO safety: runtime.Pinner contract documented at every uintptr cast; closure.server assigned before ros_z_action_server_create to close a weak-ordering race

Breaking Changes

None

Acknowledgement

This work is sponsored by

Dexory

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 18, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://ZettaScaleLabs.github.io/ros-z/pr-preview/pr-147/

Built to branch gh-pages at 2026-03-25 04:40 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...rates/ros-z/src/dynamic/type_description_client.rs 0.00% 11 Missing ⚠️
Files with missing lines Coverage Δ
crates/ros-z/src/node.rs 57.96% <ø> (+7.30%) ⬆️
...rates/ros-z/src/dynamic/type_description_client.rs 22.59% <0.00%> (-0.16%) ⬇️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@YuanYuYuan YuanYuYuan requested a review from JEnoch March 19, 2026 10:31
@YuanYuYuan YuanYuYuan force-pushed the dev/go-phase-2 branch 4 times, most recently from 09bb1e3 to 4f04113 Compare March 23, 2026 15:33
All three README.md files under ros-z-go now follow the project-wide
minimal pattern: DO NOT EXPAND guard + one-sentence description +
single book link.  Documentation lives in the book; READMEs are
GitHub landing pages only.
- Move closure.server assignment before ros_z_action_server_create so
  the back-reference is visible to any goal callback that fires before
  Build() returns on weak-ordering architectures
- Document the Pin+uintptr_t safety contract at every cast site:
  pinner.Pin(closure) guarantees address stability, making the uintptr
  cast safe; removing Pin would silently break the FFI
- Add doc comment to GetResultWithContext warning that ctx cancellation
  returns immediately but the underlying Rust call continues running
  until the server responds or the client is closed
All Go files used the wrong module path github.com/ZettaScaleLabs/ros-z-go;
the correct path declared in go.mod is
github.com/ZettaScaleLabs/ros-z/crates/ros-z-go.  Any project following
the examples or book snippets verbatim would get an immediate build failure.

Also remove unused-import suppressions (_ = binary.LittleEndian,
_ = fmt.Sprintf) that were masking imports no longer needed after
the typed-API migration.
…rver

The previous example manually serialised/deserialised CDR bytes in both
the goal and execute callbacks, contradicting the library's idiomatic
typed API.  BuildTypedActionServer eliminates all CDR ceremony; goal and
result callbacks now work with typed structs directly, matching the
pattern shown in action_typed.go and already used by BuildTypedServiceServer.
… step

The export_json command was missing action_msgs and unique_identifier_msgs
from its --packages list. The UUID type (from unique_identifier_msgs) is
a referenced type in the Fibonacci_SendGoal service type description.
Without it in the type_descriptions map, the RIHS01 hash was computed
without the UUID entry in referenced_type_descriptions, producing a
different hash than rmw_zenoh_cpp computes from the full type tree.

This caused Go action server/client key expressions to use the wrong
hash, preventing Zenoh queries from routing between the Go FFI and
rmw_zenoh_cpp nodes.
Python startup and Zenoh discovery are slower in GitHub Actions containers.
- Go action server warmup: 500ms → 2s
- Python action server warmup: 2s → 5s
- ros2 action send_goal timeout: 30s → 60s
- go test -timeout: 120s → 300s
…adiness checks

z_srvcli lacked --endpoint/--zenoh-mode flags so ROSZ_CONFIG_OVERRIDE
was silently ignored; Rust service tests connected via multicast (broken
in CI containers) instead of the per-test router.

- Add --zenoh-mode and --endpoint CLI flags to z_srvcli, matching z_pubsub
- Pass router endpoint explicitly instead of relying on env var
- Replace time.Sleep-based server discovery with retry loops that probe
  until the service responds, eliminating the timing dependency
…interop tests

Tests were failing on humble because time.Sleep-based server discovery
was too short for slower CI containers. Replace with retry loops that
probe the server until it responds:

- TestGoActionServerToROS2Client: Go self-call before ros2 action send_goal
- TestROS2ActionServerToGoClient: retry SendGoal until Python server ready
- TestGoServiceServerToROS2Client: Go self-call before ros2 service call
Humble uses supports_type_hash=false, omitting type hashes from Zenoh key
expressions. Building the FFI library without the humble feature caused Go
to use jazzy-style key expressions (with type hashes) that humble's
rmw_zenoh_cpp couldn't match — making all Go↔ROS2 tests fail on humble.

Pass matrix.ros_z_feature to the FFI build so each distro gets the correct
key expression format.
Addresses 11 issues from the go-api-audit report (score 33/50):

- Split Service/Action interfaces: remove Message embedding, add explicit
  GetRequest/GetResponse/GetGoal/GetResult/GetFeedback methods
- Merge ActionSubServiceHashes into Action interface (5 sub-service hash
  methods promoted directly, bolt-on type assertion pattern deleted)
- Unexport Call/CallWithTimeout on ServiceClient (now call/callWithTimeout)
- Add ErrCloseFailed sentinel for Close() error wrapping
- Fix RoszError.Error() format: "rosz[CODE]: msg" → "rosz(CODE): msg"
- Rename contextPinner → contextPinnerHandle in closures (clarity)
- Remove unused min/max helpers
- Replace IsTimeout() with errors.Is(err, ErrTimeout) pattern
- Return nil slices from graph.go on empty (not []string{})
- Fix codegen: emit SendGoalHash/GetResultHash/etc. on generated actions
- Update examples and tests to new API surface
Two classes of errors caught by go vet:

1. CGO callbacks used (*T)(unsafe.Pointer(uintptr(userData))) to recover
   Go pointers passed through C as uintptr_t. This pattern is flagged by
   go vet (unsafeptr) because it can race with GC even though the objects
   are pinned. Fix: store each closure in a cgo.Handle (selfHandle field),
   pass the handle value as userData, recover via cgo.Handle(userData).Value().

2. atomic.Pointer[C.T] fails when C.T is a CGO opaque (incomplete) type
   because Go generics require a complete type. Fix: introduce thin wrapper
   structs (cGoalHandle, cActionClientHandle, cActionServerHandle,
   cServiceClientHandle, cServiceServerHandle) so atomic.Pointer[Wrapper]
   compiles and all Load/Store/Swap call sites unwrap via .p.
scripts/test-go.nu is now the single source of truth for all Go checks.
Justfile targets and both CI workflows delegate to it.

test-go.nu fixes:
- test-vet: exit 1 on failures (was ⚠ warnings), remove unnecessary
  has-ffi-library gate (vet does not link), fix ./testdata/... path
- test-runtime: set CGO_LDFLAGS via with-env (was missing entirely)
- test-integration: fix shell-style VAR=val syntax → with-env, add
  -timeout 300s, demote missing zenohd to a warning (tests fall back
  to the compiled zenoh_router example automatically)

justfile: add vet-go target, delegate test-go-pure/test-go-ffi/vet-go
to nu scripts/test-go.nu; test-go now runs vet before FFI tests

ci.yml go-tests job: replace just test-go with per-step nu script calls
(vet, codegen, FFI) using hustcer/setup-nu@v3; drop just dependency

test.yml interop job: install nushell via apt, add Go vet step before
integration tests, run integration tests via nu scripts/test-go.nu
apt-get does not have a nushell package on Ubuntu 22.04/24.04.
Download the 0.102.0 binary directly from GitHub releases instead,
matching the version used in all other CI jobs.
- action wait test: loop until terminal status instead of assuming
  exactly two watch notifications; execute()+succeed() may fire back-to-
  back and get batched into a single changed() wakeup, causing the second
  wait to hang forever
- type_description_client: use the caller-supplied timeout budget for
  Phase 1 publisher discovery instead of a hardcoded 5×500 ms cap; the
  15 s discovery_timeout was ignored, leaving only 2.5 s for the graph
  to advertise the publisher in CI
The C FFI bridge (crates/ros-z/src/ffi/) is exercised by Go interop
tests which run in a separate CI job and cannot contribute to the Rust
llvm-cov report. Excluding it prevents a spurious 0% patch coverage
failure while keeping the overall coverage gate intact.
The C FFI bridge (crates/ros-z/src/ffi/ and ffi-gated node.rs methods)
cannot be exercised by cargo-llvm-cov: the functions are called from Go
via CGo and tested by the Go interop CI job, not Rust unit tests.

- Mark patch coverage as informational so it reports but does not fail CI
- Keep project-level coverage enforced with 1% regression tolerance
- Add ros-z-tests and --features ros-z/ffi to coverage run so the
  remaining testable diff (type_description_client.rs, action tests)
  is properly instrumented
- Ignore ffi/ and ros-z-go/ from coverage reports
…bols

Combining -p ros-z (with --features ffi) and -p ros-z-tests in the same
cargo llvm-cov invocation causes link errors: the #[no_mangle] extern "C"
FFI symbols get emitted twice — once from the ros-z test binary and once
from ros-z-tests which links ros-z as a dependency.

The ffi/ directory is already excluded in codecov.yml and patch coverage
is informational, so there is no coverage loss.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant