
oracledb_cdc: support cache resources for internal transaction buffer #4409

Draft

josephwoodward wants to merge 1 commit into main from jw/oraclecdctransactioncache

Conversation


josephwoodward (Contributor) commented May 7, 2026

The OracleDB CDC input currently buffers transactions in an internal, in-memory cache: it has to wait to see whether a transaction's rows are committed or rolled back before it can decide how to process them.

This change makes that internal buffer configurable via Connect's cache resources, reducing the connector's memory footprint and improving its reliability for workloads with long-running transactions.

Comment on lines +304 to +312
st.Events[i] = serializedDMLEvent{
Operation: ev.Operation,
Schema: ev.Schema,
Table: ev.Table,
SQLRedo: ev.SQLRedo,
Data: encodeMap(ev.Data),
OldValues: encodeMap(ev.OldValues),
Timestamp: ev.Timestamp,
TransactionID: ev.TransactionID,

int64 values lose precision when round-tripping through JSON.

In encodeVal, an int64 is stored as typedVal{T: typeInt64, V: int64Val} and marshaled as a plain JSON number. When unmarshaled into typedVal{V any}, encoding/json decodes numbers into float64 by default — so on the decode side tv.V is always a float64. Casting float64 back to int64 silently loses precision for any value larger than 2^53 (~9.0e15).

This path is reachable in practice: OracleValueConverter.ConvertValue (internal/impl/oracledb/logminer/sqlredo/valueconverter.go:80) parses bare numeric SQL literals via strconv.ParseInt(str, 10, 64), so an Oracle NUMBER primary key or ID column larger than 2^53 ends up in DMLEvent.Data as an int64. When the buffered transaction is later read back from the cache and committed, the value emitted downstream is no longer the value Oracle wrote.

Suggested fix: decode the JSON using a json.Decoder with UseNumber() (or store int64 as a string in the envelope) so the integer can be reconstructed without going through float64.
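
A self-contained illustration of both the loss and the UseNumber fix; the map below is a stand-in for the PR's typedVal envelope, not the actual type:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

func main() {
    in := int64(9007199254740993) // 2^53 + 1, not representable as float64
    raw, _ := json.Marshal(map[string]any{"v": in})

    // Default decoding into `any`: the number becomes a float64 and the
    // low-order digit is gone by the time we cast back to int64.
    var lossy map[string]any
    _ = json.Unmarshal(raw, &lossy)
    fmt.Println(int64(lossy["v"].(float64)) == in) // false

    // Decoding with UseNumber keeps the digits as a json.Number, which can
    // be parsed back to the exact int64.
    dec := json.NewDecoder(bytes.NewReader(raw))
    dec.UseNumber()
    var exact map[string]any
    _ = dec.Decode(&exact)
    n, _ := exact["v"].(json.Number).Int64()
    fmt.Println(n == in) // true
}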

cache_resource.go#L304-L312

txn, err := unmarshalTransaction(data)
if err != nil {
c.log.Errorf("Failed to deserialize transaction %s: %v", txnID, err)
return nil

No tests for ConnectCacheResource.

This PR adds ~349 lines of new code, including a non-trivial type-tagged JSON serialization layer (encodeVal / decodeVal / encodeMap / decodeMap / marshalTransaction / unmarshalTransaction) and the cache-backed TransactionCache lifecycle (StartTransaction / AddEvent / Commit / Rollback, plus the discarded set and metrics interactions). No unit or integration tests exercise any of it: internal/impl/oracledb/logminer/logminer_test.go and internal/impl/oracledb/integration_test.go make no reference to ConnectCacheResource or the new transaction_cache field.

Per the project's test patterns, please add at least:

  • Unit tests for round-tripping Transaction / DMLEvent through marshalTransaction / unmarshalTransaction covering each value type produced by OracleValueConverter (string, int64, json.Number, []byte, time.Time, nil) — this would also surface the precision issue flagged separately.
  • Behavioral tests that drive StartTransaction / AddEvent / CommitTransaction / RollbackTransaction against a real service.Cache (a memory cache via service.MockResources works) to validate the discarded tracking and the max-events discard path.
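
For the first bullet, a hedged sketch of the rough shape such a test could take; the exact Transaction / DMLEvent field set and the marshalTransaction signature are assumptions based on what is visible in this diff, and testify is assumed to be available:

func TestTransactionRoundTrip(t *testing.T) {
    in := &Transaction{
        Events: []DMLEvent{{
            Operation:     "INSERT",
            Schema:        "APP",
            Table:         "ORDERS",
            TransactionID: "0004.002.00000123",
            Timestamp:     time.Date(2026, 5, 7, 12, 0, 0, 0, time.UTC),
            // One entry per value type OracleValueConverter can produce.
            Data: map[string]any{
                "ID":      int64(9007199254740993), // > 2^53, exercises the precision issue
                "NAME":    "widget",
                "RATE":    json.Number("12.34"),
                "PAYLOAD": []byte{0x01, 0x02},
                "NOTE":    nil,
            },
        }},
    }

    raw, err := marshalTransaction(in)
    require.NoError(t, err)

    out, err := unmarshalTransaction(raw)
    require.NoError(t, err)
    require.Equal(t, in, out)
}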


claude Bot commented May 7, 2026

Commits
LGTM

Review
Two issues found in the new ConnectCacheResource: a precision-loss bug in the JSON value envelope, and the absence of any tests for the new serialization and cache lifecycle code.

  1. int64 values larger than 2^53 silently lose precision when round-tripping through decodeVal, because encoding/json decodes JSON numbers into float64 when unmarshaling into any. This is reachable in practice because OracleValueConverter produces int64 for bare numeric literals, so Oracle large-numeric IDs end up corrupted on the way back out of the cache. See inline comment.
  2. The PR adds ~349 lines of new code (custom typed JSON serialization plus a cache-backed TransactionCache lifecycle) with no unit or integration tests. The project's test patterns call for tests on new components. See inline comment.


Jeffail (Contributor) commented May 12, 2026

Hi @josephwoodward — a few concerns worth surfacing on this change, ordered by severity.

P0 — Correctness

LowWatermarkSCN safe-checkpoint protection is bypassed for cache-backed transactions

In logminer.go:350-360, the safe-checkpoint computation is gated on the *InMemoryCache concrete type:

if cache, ok := lm.txnCache.(*InMemoryCache); ok {
    if lowestOpenSCN := cache.LowWatermarkSCN(redoEvent.TransactionID); ... {
        safeCheckpointSCN = lowestOpenSCN - 1
    }
}

When the new ConnectCacheResource is in use this branch is skipped entirely, and the checkpoint can advance past the start SCN of any concurrently open transaction. For an external persistent cache the in-flight events survive a restart, so the COMMIT at a higher SCN can re-resolve them. For a non-persistent backing — for example a memory cache resource — the buffer is gone after restart, and the events between the missed transaction's start and the advanced checkpoint are permanently lost.

The safe-checkpoint logic should apply to both implementations. The simplest fix is to maintain an in-memory txnID → startSCN side-index inside ConnectCacheResource — it scales with the number of concurrent open transactions, not the size of any single transaction, so it stays bounded.
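
Something along these lines would do it, as a rough sketch only; the field names, the SCN type, and the return shape are illustrative rather than the PR's actual signatures:

type ConnectCacheResource struct {
    // ...existing fields...
    mu        sync.Mutex
    startSCNs map[string]int64 // txnID -> start SCN of each open transaction
}

func (c *ConnectCacheResource) trackStart(txnID string, startSCN int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.startSCNs[txnID] = startSCN
}

// LowWatermarkSCN reports the lowest start SCN among open transactions other
// than excludeTxnID, so callers can clamp the checkpoint below it.
func (c *ConnectCacheResource) LowWatermarkSCN(excludeTxnID string) (int64, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    low, found := int64(0), false
    for id, scn := range c.startSCNs {
        if id == excludeTxnID {
            continue
        }
        if !found || scn < low {
            low, found = scn, true
        }
    }
    return low, found
}

Entries would be added in StartTransaction and removed on commit, rollback, or discard, so the index only ever holds the currently open transactions.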

txnID cache keys are not unique across multiple oracledb_cdc inputs

The cache key at cache_resource.go:50 is "txn:" + txnID. Oracle USN.SLOT.SEQ transaction identifiers are unique only within a single Oracle instance, so two oracledb_cdc inputs pointed at different databases that share one cache resource will silently collide. A configurable key prefix (mirroring the existing checkpoint_cache_key) would resolve this.
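
For illustration, with a hypothetical key-prefix field (the field name below is made up):

// txnKey namespaces transaction entries per input so two oracledb_cdc inputs
// sharing one cache resource cannot collide on Oracle's USN.SLOT.SEQ IDs.
func (c *ConnectCacheResource) txnKey(txnID string) string {
    return c.keyPrefix + "txn:" + txnID
}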

P1 — Reliability / performance

AddEvent is O(N²) in transaction size

Each call to AddEvent performs a Get + JSON decode + append + JSON encode + Set on the entire transaction. For a transaction of N events the total marshal/unmarshal work is on the order of N²/2, and against a remote cache the bytes-on-the-wire grow the same way. This is the worst case for the long-running transaction workload the change is intended to support. At minimum it deserves a note in the documentation; preferably each event would be stored under its own key with a small index, or appended via a list-style cache operation where the underlying implementation supports it.
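
A rough sketch of the per-event-key layout; the key scheme, the plain json.Marshal of the event, and the method signature are all assumptions, and the PR's typed envelope would still be needed for the values:

func (c *ConnectCacheResource) AddEvent(ctx context.Context, txnID string, ev DMLEvent) error {
    countKey := "txn:" + txnID + ":count"

    // Read how many events this transaction already has (missing key == 0).
    n := 0
    if raw, err := c.cache.Get(ctx, countKey); err == nil {
        n, _ = strconv.Atoi(string(raw))
    }

    // Write only the new event, under its own index key, instead of
    // rewriting the whole transaction blob on every append.
    data, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    if err := c.cache.Set(ctx, fmt.Sprintf("txn:%s:ev:%d", txnID, n), data, nil); err != nil {
        return err
    }
    return c.cache.Set(ctx, countKey, []byte(strconv.Itoa(n+1)), nil)
}

Commit would then read the count and fetch the event keys in order; having AddEvent return an error also ties into the point below.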

Cache I/O errors are silently swallowed

The TransactionCache interface methods do not return errors, so Set / Get / Delete failures inside ConnectCacheResource are logged and dropped. With the in-memory implementation this was safe because the operations could not fail. With a network-backed cache a transient Set failure causes the next AddEvent to hit the "not found, creating..." path on GetTransaction, which starts a fresh transaction in the cache and drops every event that came before the failure. The interface likely needs to surface these errors so the read loop can decide on retry / fail semantics.
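
Concretely, the interface could surface errors along these lines; the method names come from this thread, while the parameters and return types are guesses:

type TransactionCache interface {
    StartTransaction(ctx context.Context, txnID string) error
    AddEvent(ctx context.Context, txnID string, ev DMLEvent) error
    CommitTransaction(ctx context.Context, txnID string) ([]DMLEvent, error)
    RollbackTransaction(ctx context.Context, txnID string) error
}

The read loop can then distinguish "transaction genuinely unknown" from "cache unreachable" and choose to retry or fail instead of silently restarting the transaction.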

nil TTL is ambiguous

c.cache.Set(ctx, key, data, nil) at cache_resource.go:73 and :118. Depending on the cache resource implementation nil may mean "no expiry" or "use the cache's configured default." If a user configures Redis (or similar) with a default TTL shorter than their longest open transaction, entries will expire mid-flight and the transaction will be silently corrupted when it eventually commits. Either pass an explicit no-expire sentinel, validate the cache's default TTL at startup, or document the requirement loudly.
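
If the explicit-sentinel route is taken, a sketch, assuming the service.Cache Set signature takes a *time.Duration TTL and that the chosen backend honours a very large explicit value:

// Effectively "never expire": an explicit, very long TTL rather than nil,
// so a backend-level default TTL cannot silently evict an open transaction.
noExpiry := 100 * 365 * 24 * time.Hour
if err := c.cache.Set(ctx, key, data, &noExpiry); err != nil {
    return err
}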

Happy to discuss any of these.
