
oracledb_cdc: support cache resources for internal transaction buffer #4409

Draft

josephwoodward wants to merge 1 commit into main from jw/oraclecdctransactioncache

Conversation


josephwoodward (Contributor) commented May 7, 2026

The OracleDB CDC input currently buffers transactions in an internal, in-memory cache: it has to wait to see whether a transaction's rows are committed or rolled back before it can decide how to process them.

This change makes that internal buffer configurable via Connect's cache resources, reducing the connector's memory footprint and improving its reliability for workloads with long-running transactions.

Comment on lines +304 to +312
st.Events[i] = serializedDMLEvent{
Operation: ev.Operation,
Schema: ev.Schema,
Table: ev.Table,
SQLRedo: ev.SQLRedo,
Data: encodeMap(ev.Data),
OldValues: encodeMap(ev.OldValues),
Timestamp: ev.Timestamp,
TransactionID: ev.TransactionID,

int64 values lose precision when round-tripping through JSON.

In encodeVal, an int64 is stored as typedVal{T: typeInt64, V: int64Val} and marshaled as a plain JSON number. When unmarshaled into typedVal{V any}, encoding/json decodes numbers into float64 by default — so on the decode side tv.V is always a float64. Casting float64 back to int64 silently loses precision for any value larger than 2^53 (~9.0e15).

This path is reachable in practice: OracleValueConverter.ConvertValue (internal/impl/oracledb/logminer/sqlredo/valueconverter.go:80) parses bare numeric SQL literals via strconv.ParseInt(str, 10, 64), so an Oracle NUMBER primary key or ID column larger than 2^53 ends up in DMLEvent.Data as an int64. When the buffered transaction is later read back from the cache and committed, the value emitted downstream is no longer the value Oracle wrote.

Suggested fix: decode the JSON using a json.Decoder with UseNumber() (or store int64 as a string in the envelope) so the integer can be reconstructed without going through float64.
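
A self-contained illustration of both the loss and the UseNumber fix; the map below is a stand-in for the PR's typedVal envelope, not the actual type:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

func main() {
    in := int64(9007199254740993) // 2^53 + 1, not representable as float64
    raw, _ := json.Marshal(map[string]any{"v": in})

    // Default decoding into `any`: the number becomes a float64 and the
    // low-order digit is gone by the time we cast back to int64.
    var lossy map[string]any
    _ = json.Unmarshal(raw, &lossy)
    fmt.Println(int64(lossy["v"].(float64)) == in) // false

    // Decoding with UseNumber keeps the digits as a json.Number, which can
    // be parsed back to the exact int64.
    dec := json.NewDecoder(bytes.NewReader(raw))
    dec.UseNumber()
    var exact map[string]any
    _ = dec.Decode(&exact)
    n, _ := exact["v"].(json.Number).Int64()
    fmt.Println(n == in) // true
}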

cache_resource.go#L304-L312

txn, err := unmarshalTransaction(data)
if err != nil {
c.log.Errorf("Failed to deserialize transaction %s: %v", txnID, err)
return nil

No tests for ConnectCacheResource.

This PR adds ~349 lines of new code, including a non-trivial type-tagged JSON serialization layer (encodeVal / decodeVal / encodeMap / decodeMap / marshalTransaction / unmarshalTransaction) and the cache-backed TransactionCache lifecycle (StartTransaction / AddEvent / Commit / Rollback, plus the discarded set and metrics interactions). No unit or integration tests exercise any of it: internal/impl/oracledb/logminer/logminer_test.go and internal/impl/oracledb/integration_test.go make no reference to ConnectCacheResource or the new transaction_cache field.

Per the project's test patterns, please add at least:

  • Unit tests for round-tripping Transaction / DMLEvent through marshalTransaction / unmarshalTransaction covering each value type produced by OracleValueConverter (string, int64, json.Number, []byte, time.Time, nil) — this would also surface the precision issue flagged separately.
  • Behavioral tests that drive StartTransaction / AddEvent / CommitTransaction / RollbackTransaction against a real service.Cache (a memory cache via service.MockResources works) to validate the discarded tracking and the max-events discard path.
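
For the first bullet, a hedged sketch of the rough shape such a test could take; the exact Transaction / DMLEvent field set and the marshalTransaction signature are assumptions based on what is visible in this diff, and testify is assumed to be available:

func TestTransactionRoundTrip(t *testing.T) {
    in := &Transaction{
        Events: []DMLEvent{{
            Operation:     "INSERT",
            Schema:        "APP",
            Table:         "ORDERS",
            TransactionID: "0004.002.00000123",
            Timestamp:     time.Date(2026, 5, 7, 12, 0, 0, 0, time.UTC),
            // One entry per value type OracleValueConverter can produce.
            Data: map[string]any{
                "ID":      int64(9007199254740993), // > 2^53, exercises the precision issue
                "NAME":    "widget",
                "RATE":    json.Number("12.34"),
                "PAYLOAD": []byte{0x01, 0x02},
                "NOTE":    nil,
            },
        }},
    }

    raw, err := marshalTransaction(in)
    require.NoError(t, err)

    out, err := unmarshalTransaction(raw)
    require.NoError(t, err)
    require.Equal(t, in, out)
}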


claude Bot commented May 7, 2026

Commits
LGTM

Review
Two issues found in the new ConnectCacheResource: a precision-loss bug in the JSON value envelope, and the absence of any tests for the new serialization and cache lifecycle code.

  1. int64 values larger than 2^53 silently lose precision when round-tripping through decodeVal, because encoding/json decodes JSON numbers into float64 when unmarshaling into any. This is reachable in practice because OracleValueConverter produces int64 for bare numeric literals, so Oracle large-numeric IDs end up corrupted on the way back out of the cache. See inline comment.
  2. The PR adds ~349 lines of new code (custom typed JSON serialization plus a cache-backed TransactionCache lifecycle) with no unit or integration tests. The project's test patterns call for tests on new components. See inline comment.


Jeffail (Contributor) commented May 12, 2026

Hi @josephwoodward — a few concerns worth surfacing on this change, ordered by severity.

P0 — Correctness

LowWatermarkSCN safe-checkpoint protection is bypassed for cache-backed transactions

In logminer.go:350-360, the safe-checkpoint computation is gated on the *InMemoryCache concrete type:

if cache, ok := lm.txnCache.(*InMemoryCache); ok {
    if lowestOpenSCN := cache.LowWatermarkSCN(redoEvent.TransactionID); ... {
        safeCheckpointSCN = lowestOpenSCN - 1
    }
}

When the new ConnectCacheResource is in use this branch is skipped entirely, and the checkpoint can advance past the start SCN of any concurrently open transaction. For an external persistent cache the in-flight events survive a restart, so the COMMIT at a higher SCN can re-resolve them. For a non-persistent backing — for example a memory cache resource — the buffer is gone after restart, and the events between the missed transaction's start and the advanced checkpoint are permanently lost.

The safe-checkpoint logic should apply to both implementations. The simplest fix is to maintain an in-memory txnID → startSCN side-index inside ConnectCacheResource — it scales with the number of concurrent open transactions, not the size of any single transaction, so it stays bounded.
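
Something along these lines would do it, as a rough sketch only; the field names, the SCN type, and the return shape are illustrative rather than the PR's actual signatures:

type ConnectCacheResource struct {
    // ...existing fields...
    mu        sync.Mutex
    startSCNs map[string]int64 // txnID -> start SCN of each open transaction
}

func (c *ConnectCacheResource) trackStart(txnID string, startSCN int64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.startSCNs[txnID] = startSCN
}

// LowWatermarkSCN reports the lowest start SCN among open transactions other
// than excludeTxnID, so callers can clamp the checkpoint below it.
func (c *ConnectCacheResource) LowWatermarkSCN(excludeTxnID string) (int64, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    low, found := int64(0), false
    for id, scn := range c.startSCNs {
        if id == excludeTxnID {
            continue
        }
        if !found || scn < low {
            low, found = scn, true
        }
    }
    return low, found
}

Entries would be added in StartTransaction and removed on commit, rollback, or discard, so the index only ever holds the currently open transactions.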

txnID cache keys are not unique across multiple oracledb_cdc inputs

The cache key at cache_resource.go:50 is "txn:" + txnID. Oracle USN.SLOT.SEQ transaction identifiers are unique only within a single Oracle instance, so two oracledb_cdc inputs pointed at different databases that share one cache resource will silently collide. A configurable key prefix (mirroring the existing checkpoint_cache_key) would resolve this.
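
For illustration, with a hypothetical key-prefix field (the field name below is made up):

// txnKey namespaces transaction entries per input so two oracledb_cdc inputs
// sharing one cache resource cannot collide on Oracle's USN.SLOT.SEQ IDs.
func (c *ConnectCacheResource) txnKey(txnID string) string {
    return c.keyPrefix + "txn:" + txnID
}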

P1 — Reliability / performance

AddEvent is O(N²) in transaction size

Each call to AddEvent performs a Get + JSON decode + append + JSON encode + Set on the entire transaction. For a transaction of N events the total marshal/unmarshal work is on the order of N²/2, and against a remote cache the bytes-on-the-wire grow the same way. This is the worst case for the long-running transaction workload the change is intended to support. At minimum it deserves a note in the documentation; preferably each event would be stored under its own key with a small index, or appended via a list-style cache operation where the underlying implementation supports it.
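
A rough sketch of the per-event-key layout; the key scheme, the plain json.Marshal of the event, and the method signature are all assumptions, and the PR's typed envelope would still be needed for the values:

func (c *ConnectCacheResource) AddEvent(ctx context.Context, txnID string, ev DMLEvent) error {
    countKey := "txn:" + txnID + ":count"

    // Read how many events this transaction already has (missing key == 0).
    n := 0
    if raw, err := c.cache.Get(ctx, countKey); err == nil {
        n, _ = strconv.Atoi(string(raw))
    }

    // Write only the new event, under its own index key, instead of
    // rewriting the whole transaction blob on every append.
    data, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    if err := c.cache.Set(ctx, fmt.Sprintf("txn:%s:ev:%d", txnID, n), data, nil); err != nil {
        return err
    }
    return c.cache.Set(ctx, countKey, []byte(strconv.Itoa(n+1)), nil)
}

Commit would then read the count and fetch the event keys in order; having AddEvent return an error also ties into the point below.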

Cache I/O errors are silently swallowed

The TransactionCache interface methods do not return errors, so Set / Get / Delete failures inside ConnectCacheResource are logged and dropped. With the in-memory implementation this was safe because the operations could not fail. With a network-backed cache a transient Set failure causes the next AddEvent to hit the "not found, creating..." path on GetTransaction, which starts a fresh transaction in the cache and drops every event that came before the failure. The interface likely needs to surface these errors so the read loop can decide on retry / fail semantics.
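
Concretely, the interface could surface errors along these lines; the method names come from this thread, while the parameters and return types are guesses:

type TransactionCache interface {
    StartTransaction(ctx context.Context, txnID string) error
    AddEvent(ctx context.Context, txnID string, ev DMLEvent) error
    CommitTransaction(ctx context.Context, txnID string) ([]DMLEvent, error)
    RollbackTransaction(ctx context.Context, txnID string) error
}

The read loop can then distinguish "transaction genuinely unknown" from "cache unreachable" and choose to retry or fail instead of silently restarting the transaction.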

nil TTL is ambiguous

c.cache.Set(ctx, key, data, nil) at cache_resource.go:73 and :118. Depending on the cache resource implementation nil may mean "no expiry" or "use the cache's configured default." If a user configures Redis (or similar) with a default TTL shorter than their longest open transaction, entries will expire mid-flight and the transaction will be silently corrupted when it eventually commits. Either pass an explicit no-expire sentinel, validate the cache's default TTL at startup, or document the requirement loudly.
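
If the explicit-sentinel route is taken, a sketch, assuming the service.Cache Set signature takes a *time.Duration TTL and that the chosen backend honours a very large explicit value:

// Effectively "never expire": an explicit, very long TTL rather than nil,
// so a backend-level default TTL cannot silently evict an open transaction.
noExpiry := 100 * 365 * 24 * time.Hour
if err := c.cache.Set(ctx, key, data, &noExpiry); err != nil {
    return err
}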

Happy to discuss any of these.
