Skip to content

redpanda: report per-record errors from output batch writes#4400

Open
twmb wants to merge 1 commit into
mainfrom
kafka-franz-output-batch-error
Open

redpanda: report per-record errors from output batch writes#4400
twmb wants to merge 1 commit into
mainfrom
kafka-franz-output-batch-error

Conversation

@twmb
Copy link
Copy Markdown
Contributor

@twmb twmb commented May 6, 2026

Summary

  • The redpanda output's WriteBatch returned a single joined error whenever any record in the batch failed, forcing Connect to re-deliver the whole batch — which duplicates every record the broker had already acked. Switch to service.NewBatchError(...).Failed(idx, err), same pattern as the legacy sarama kafka output (internal/impl/kafka/output_sarama_kafka.go:619), so Connect can re-deliver only the actually-failed records.
  • This makes the auto_replay_nacks: false + fallback DLQ pattern (the supported strict-ordering recipe documented on the redpanda input) actually usable for partial batch failures: the DLQ now receives only the failed records instead of duplicates of every record in a partially-failed batch.
  • Also fixes a real data race: the previous code did append(errs, err) from multiple Produce callbacks concurrently. The new code indexes into a preallocated per-record []error slot, so each callback writes to its own index.

Refs #4387.

Test plan

  • go test -count=1 -race -run TestBatchWriterPartialFailures ./internal/impl/kafka/
  • go test -count=1 -short ./internal/impl/kafka/...
  • gofumpt clean
  • go vet ./internal/impl/kafka/...

The new TestBatchWriterPartialFailures is a table-driven unit test exercising:

  • All records skipped via SkipRecord sentinel → returns nil.
  • All records fail DecorateRecordBatchError with all indices.
  • Mixed DecorateRecord failures + SkipRecord skips → BatchError with only the decorate-fail indices (skips do not count as errors).
  • All records fail at the produce-callback layer (closed kgo.Client) → BatchError with all indices.
  • Mixed produce-fail + skip → BatchError with only the produce-fail indices.
  • Mixed produce-fail + decorate-fail → BatchError with both sets of indices.

The franz-go-based redpanda output's WriteBatch returned a single joined
error whenever any record in the batch failed, which forced Connect to
re-deliver the whole batch and duplicate every record the broker had
already acknowledged. Switch to service.NewBatchError(...).Failed(idx,
err) -- the same pattern used by the legacy sarama kafka output -- so
Connect can re-deliver only the actually-failed records. This makes the
auto_replay_nacks: false + fallback DLQ pattern usable for partial
batch failures.

Also fix a concurrent append data race on the shared errs slice by
indexing into a preallocated per-record error slice from each Produce
callback.

Refs #4387.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Commits
LGTM

Review
The change replaces the joined-error return with a service.BatchError that records per-index failures, aligning the franz-go writer with the legacy sarama writer's pattern, and incidentally fixes a data race on the shared errs slice by writing into preallocated per-record slots. Tests exercise decorate-fail, produce-fail (via closed client), skip, and combined paths.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant