
iceberg: skip duplicate-path check on AddDataFiles #4421

Open

Jeffail wants to merge 2 commits into main from t6692

Conversation

@Jeffail Jeffail (Contributor) commented May 12, 2026

Summary

doCommit now passes table.WithoutDuplicateCheck() alongside the existing table.WithoutAutoNameMapping() when calling Transaction.AddDataFiles. By default iceberg-go performs an O(snapshot) scan of every manifest entry in the current snapshot on every AddDataFiles call to detect path collisions with the files being added. The writer in this package already stamps every parquet path with a fresh UUID, so path uniqueness is guaranteed by the caller and the scan is wasted work.
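
A minimal sketch of the shape of the change (not the actual committer source): the two option names and the Transaction.AddDataFiles call come from this PR, while the surrounding variable names, the exact signature, and the error wrapping are assumptions.

```go
// Sketch only: everything except the two table.* options and the
// AddDataFiles call itself is an assumed stand-in for the committer's code.
if err := tx.AddDataFiles(
	ctx,
	newParquetPaths, // every path already carries a fresh UUID stamped by the writer
	snapshotProps,
	table.WithoutAutoNameMapping(), // existing behaviour, unchanged
	table.WithoutDuplicateCheck(),  // new: skip the O(snapshot) duplicate-path scan
); err != nil {
	return fmt.Errorf("adding data files to transaction: %w", err)
}
```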

This addresses ticket T6692, where the customer's Iceberg writer collapsed to approximately 40 messages/sec on a busy table — roughly 150× slower than the same workload on Kafka Connect (around 6,000 messages/sec). Profiling provided by the customer showed 88.6 GB of allocations (46% of total) inside iceberg-go.(*ManifestReader).ReadEntry, with the committer goroutine continuously blocked in doCommit. The dup-check scan is the exclusive driver of that hot path.

The check still runs in iceberg-go for any other caller that does not opt out, so this change is scoped strictly to this writer's commit path.

Measurements

Benchmark added in internal/impl/iceberg/committer_test.go (BenchmarkAddDataFilesDupCheck). Apple M3 Pro, -benchtime=20x, single Transaction.AddDataFiles call against a table pre-seeded with `seed` separate single-file commits:

| seed | dup_check on | dup_check off | speedup |
| --- | --- | --- | --- |
| 10 | 7.55 ms / 6.6 MB / 55K allocs | 1.84 ms / 2.8 MB / 14K allocs | ~4× |
| 100 | 58.9 ms / 39 MB / 407K allocs | 1.98 ms / 3.0 MB / 14K allocs | ~30× |
| 1,000 | 530 ms / 364 MB / 3.9M allocs | 3.56 ms / 5.9 MB / 17K allocs | ~150× |

The dup_check=on cost scales linearly with the number of manifest entries in the current snapshot; the dup_check=off cost is effectively flat. The 150× figure at seed=1,000 lines up with the customer's observed RPCN-vs-Kafka-Connect throughput gap, supporting the diagnosis that this single check accounted for the dominant share of the regression.

Run locally with:

```
go test -bench BenchmarkAddDataFilesDupCheck -benchmem -run='^' -benchtime=20x ./internal/impl/iceberg
```
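
For reference, the benchmark has roughly the following shape (a sketch, not the committed source): seedTable and newDataFilePath are hypothetical helpers standing in for the real harness in committer_test.go, and the exact AddDataFiles signature is assumed.

```go
// Hypothetical sketch of BenchmarkAddDataFilesDupCheck; helper names and
// the AddDataFiles signature are assumptions, not the committed code.
func BenchmarkAddDataFilesDupCheck(b *testing.B) {
	ctx := context.Background()
	for _, seed := range []int{10, 100, 1000} {
		for _, skip := range []bool{false, true} {
			b.Run(fmt.Sprintf("seed=%d/skip_dup_check=%v", seed, skip), func(b *testing.B) {
				tbl := seedTable(b, seed) // pre-seed with `seed` single-file commits
				b.ReportAllocs()
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					tx := tbl.NewTransaction()
					var err error
					if skip {
						err = tx.AddDataFiles(ctx, []string{newDataFilePath(b)}, nil,
							table.WithoutAutoNameMapping(), table.WithoutDuplicateCheck())
					} else {
						err = tx.AddDataFiles(ctx, []string{newDataFilePath(b)}, nil,
							table.WithoutAutoNameMapping())
					}
					if err != nil {
						b.Fatal(err)
					}
				}
			})
		}
	}
}
```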

Why this has not surfaced for other customers

The cost is O(manifest_entries_in_current_snapshot) × commits_per_second × concurrent_pods. For most workloads at least one of those terms is small enough to mask the dup-check overhead entirely. The reporting customer happened to maximise all three at once (a rough back-of-envelope follows the list below):

  1. Manifest entry count was unbounded. The table was created on April 30 from scratch and accumulated approximately a week of small commits with no compaction in the pipeline. Customers running RPCN's iceberg compaction processor, or who have an external Spark/Trino maintenance job, keep the manifest entry count bounded and the per-commit scan remains small.
  2. Commit cadence was very high. The initial deployment used batch.count=10000, period=1s, producing roughly one commit/sec/pod. The dup-check runs per commit, so the cost-per-second scales linearly with commit rate. Customers who batch more aggressively amortise the scan across many more messages.
  3. Multi-pod fan-in on a single hot table. All eight Kafka partitions wrote to the same table with 1-8 pods auto-scaling. As single-pod commits slowed, the inter-pod commit window widened, catalog collisions grew, retries replayed the scan, and the autoscaler added more pods — a compounding death spiral. A workload with one pod per table or with writers sharded across tables would not see this amplification.
  4. The cost is invisible on day one. Empty or freshly-compacted snapshots scan zero entries and complete instantly. Demos, smoke tests, and our standard integration tests do not run a single table long enough for the manifest list to grow to a problematic size, so the regression does not surface during pre-production evaluation.
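
Back-of-envelope using the benchmark figures above (local M3 Pro numbers, so indicative only): at roughly 1,000 manifest entries the duplicate-path scan costs about 530 ms per AddDataFiles call, so at ~1 commit/sec/pod each committer goroutine spends over half of every second inside the scan before any catalog contention. With up to 8 pods targeting the same table, every catalog collision and retry replays that scan, which is how the per-commit cost compounds into the observed order-of-magnitude throughput collapse.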

Test plan

  • TestCommitterSkipsDuplicateCheck added in internal/impl/iceberg/committer_test.go (sketched after this list). It commits the same path twice through the committer and asserts that both calls succeed. Verified locally that the test fails with the expected `cannot add files that are already referenced by table` error when the option is removed.
  • go test -count=1 ./internal/impl/iceberg/... — all packages pass.
  • bin/golangci-lint run ./internal/impl/iceberg/... — clean.
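
A rough sketch of what the new test looks like (not the committed source): newTestCommitter and commitPath are hypothetical helpers standing in for the real committer and catalog setup in committer_test.go.

```go
// Hypothetical sketch of TestCommitterSkipsDuplicateCheck; the helpers
// (newTestCommitter, commitPath) are assumed stand-ins, not project code.
func TestCommitterSkipsDuplicateCheck(t *testing.T) {
	ctx := t.Context()
	c := newTestCommitter(t) // assumed helper wiring the committer to a test catalog

	path := "data/" + uuid.NewString() + ".parquet"

	// First commit of the path succeeds as usual.
	require.NoError(t, commitPath(ctx, c, path))

	// Committing the exact same path again must also succeed because the
	// committer now opts out of the duplicate-path scan. With
	// WithoutDuplicateCheck removed this call fails with "cannot add files
	// that are already referenced by table".
	require.NoError(t, commitPath(ctx, c, path))
}
```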

The default check is an O(snapshot) scan of every manifest entry, which
dominated the commit hot path on busy tables and produced a multiplicative
slowdown over time (T6692). Each parquet file path already carries a fresh
uuid, so the check is redundant for this writer.
@claude claude Bot commented May 12, 2026

Commits
LGTM

Review
Single, well-scoped change: passes table.WithoutDuplicateCheck() to Transaction.AddDataFiles in the iceberg committer. The premise (writer stamps each path with a fresh UUID, so the iceberg-go dup-check scan is redundant) is verified against writer.go#L95. Comment at committer.go#L124-L130 explains the why and references the ticket. New unit test pins the behavioural change (committing the same path twice succeeds) and a benchmark provides the reproduction harness. Test license header, t.Context() usage, and MockResources() usage all match project conventions.

LGTM

gocritic ruleguard wants gerund-form error wrapping. Pre-existing issue
on main blocking unrelated CI runs.
@claude claude Bot commented May 12, 2026

Commits
LGTM

Review
Small, well-targeted perf fix in the iceberg committer plus a drive-by lint fix in kafka. The new test pins the WithoutDuplicateCheck behaviour (it would fail on regression with "cannot add files that are already referenced") and the benchmark is a clean local repro harness. License headers, naming, t.Context() usage, and service.MockResources() patterns all match conventions.

LGTM
