[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199
[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199AntiTopQuark wants to merge 18 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
996adea to
66a8f8e
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
66a8f8e to
fa48c2a
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
fa48c2a to
dc6a3a6
Compare
|
run buildall |
…TSO) Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
dc6a3a6 to
9c9dec0
Compare
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Code Review Summary for PR #61199 — feature Add global monotonically increasing Timestamp Oracle (TSO)
This PR introduces a global TSO (Timestamp Oracle) service for generating monotonically increasing timestamps, integrated into the transaction commit path, publish version flow, and rowset metadata. The feature is experimental and gated by experimental_enable_feature_tso.
Critical Checkpoint Conclusions
1. Goal and correctness:
The PR achieves the stated goal of adding a TSO service with timestamp generation, time-window management, journal persistence, and integration into the commit+publish path. Tests cover the core logic. However, several correctness and compatibility issues are identified below.
2. Modification clarity and focus:
The PR is large (50 files, 2080+ additions) but focused on a single feature. The scope is appropriate for the feature.
3. Concurrency:
TSOService uses a ReentrantLock to protect globalTimestamp. The lock usage is correct — all reads and writes to the physical/logical fields are under the lock. Env.windowEndTSO is an AtomicLong, safely accessed from multiple threads. Minor issue: generateTSO() has a redundant read-back after setLogicalCounter (see inline comment).
4. Lifecycle management:
The TSOService extends MasterDaemon and is started unconditionally in startMasterOnlyDaemonThreads() — even when TSO is disabled. The start() method runs the calibration loop regardless of Config.enable_feature_tso. While runAfterCatalogReady() correctly checks the flag, the unconditional startup wastes resources. See inline comment.
5. Configuration items:
Configs are well-documented with Chinese and English descriptions. enable_feature_tso is correctly marked EXPERIMENTAL and mutable. tso_service_update_interval_ms is mutable=false which is appropriate.
6. Incompatible changes / rolling upgrade:
CRITICAL issues — see inline comments on TPartitionVersionInfo (Thrift required vs optional) and MINIMUM_VERSION_REQUIRED bump.
7. Parallel code paths:
All three commit paths (single-txn, sub-txn, 2PC) consistently use the same getCommitTSO() helper and follow the same pattern. The generatePartitionVersionInfoWhenReport() in LocalTabletInvertedIndex also correctly propagates commitTSO.
8. Test coverage:
- Unit tests cover
TSOTimestampbit manipulation,TSOServicebasic behavior,TSOActionREST endpoint, andDatabaseTransactionMgrcommit TSO integration. - Regression tests cover the API and rowset commit_tso visibility.
- Missing: Negative test for clock backward handling, concurrent
getTSO()stress test, and test for the case where TSO service is disabled butenable_tsotable property is set.
9. Observability:
Six metrics cover clock drift, backward detection, updates, failures, and get success. Log levels are appropriate. Good coverage.
10. Transaction / persistence:
EditLog write/replay for OP_TSO_TIMESTAMP_WINDOW_END is correctly implemented. Image save/load is properly gated by config flags. TransactionState.commitTSO and TableCommitInfo.commitTSO are Gson-serialized (backward compatible).
11. Data writes:
Rowset commit_tso is set atomically during make_visible() under the existing txn lock. No concurrency issue.
12. FE-BE variable passing:
All paths updated: TPartitionVersionInfo, PendingPublishInfoPB, and DiscontinuousVersionTablet all carry commit_tso. However, the Thrift field is required which breaks rolling upgrade (see inline comment).
13. Performance:
No hot-path performance concerns. The lock in generateTSO() is lightweight. The periodic writeTimestampToBDBJE uses EditLog which is already on the critical path.
14. Other issues:
.out.rejfiles committed (merge conflict artifacts)- Regression test
test_tso_api.groovyuses wrong bit masks for validation - Silent fallback to
commitTSO=-1when TSO fetch fails on a TSO-enabled table
Issues Found (by severity)
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | Critical | AgentService.thrift |
TPartitionVersionInfo.commit_tso is required — breaks rolling upgrade from old FE |
| 2 | Critical | FeMetaVersion.java |
MINIMUM_VERSION_REQUIRED bumped to 141 — prevents rollback for experimental feature |
| 3 | High | DatabaseTransactionMgr.java |
Silent commit with commitTSO=-1 when TSO fetch fails on TSO-enabled table |
| 4 | High | *.out.rej files |
Merge conflict reject files committed to repository |
| 5 | Medium | test_tso_api.groovy |
Logical counter extraction uses 12-bit mask (0xFFF) but TSO uses 18-bit logical counter |
| 6 | Medium | TSOService.java |
start() unconditionally runs calibration even when TSO is disabled |
| 7 | Low | TSOService.java |
Redundant read-back in generateTSO() |
| 8 | Low | Config.java |
Trailing blank lines at end of file |
| 3: required Types.TVersionHash version_hash // Deprecated | ||
| 4: required i64 commit_tso | ||
| } | ||
|
|
There was a problem hiding this comment.
[Critical — Rolling Upgrade Breakage]
commit_tso is declared as required, but this field is newly added. During a rolling upgrade, an old FE (without this field) will serialize TPartitionVersionInfo with only 3 fields. The new BE will deserialize the missing required i64 field as 0 (Thrift's default for int64), not as -1 (the application-level sentinel meaning "no TSO").
This causes silent semantic corruption: rowsets will be stamped with commit_tso=0 instead of the intended -1 default (as defined in olap_file.proto with [default = -1]). Any downstream logic checking commit_tso > 0 vs commit_tso == -1 would behave incorrectly.
Fix: Change to 4: optional i64 commit_tso and have the BE check __isset.commit_tso to fall back to -1 when absent. Alternatively, set a default: 4: optional i64 commit_tso = -1.
| // these clause will be useless and we could remove them | ||
| public static final int MINIMUM_VERSION_REQUIRED = VERSION_140; | ||
| public static final int MINIMUM_VERSION_REQUIRED = VERSION_141; | ||
| } |
There was a problem hiding this comment.
[Critical — Prevents Rollback for Experimental Feature]
Bumping MINIMUM_VERSION_REQUIRED from 140 to 141 means:
- Once any FE with this code writes an image (which uses
VERSION_CURRENT=141), no older FE can read it. - Conversely, this FE refuses to load any image with version < 141.
This creates a hard one-way migration gate for a feature that is:
- Explicitly experimental (
varType = VariableAnnotation.EXPERIMENTAL) - Disabled by default (
enable_feature_tso = false) - Guarded by three separate config flags
No code anywhere checks VERSION_141 conditionally — there are zero if (version >= VERSION_141) guards in the codebase. The TSO data is persisted via:
- Gson
@SerializedNamefields (backward-compatible — unknown fields are ignored) - An image module that writes nothing when disabled
- An EditLog opcode only emitted when
enable_tso_persist_journal=true
Fix: Do NOT bump MINIMUM_VERSION_REQUIRED. Only bump VERSION_CURRENT if you need a marker. The module system already handles unknown/empty modules gracefully (MetaReader skips empty modules and has ignore_unknown_metadata_module for unknown ones).
| if (fetched <= 0) { | ||
| LOG.warn("failed to get TSO for txn {}, fallback to -1", | ||
| transactionState.getTransactionId()); | ||
| return tso; |
There was a problem hiding this comment.
[High — Silent Loss of TSO Ordering Guarantee]
When getTSO() returns -1 (exhausted retries, logical counter overflow, or not calibrated), getCommitTSO() silently falls back to -1 and the transaction still commits successfully. The user has explicitly opted into TSO ordering by setting enable_tso=true on the table, but this commit has no valid TSO.
The -1 value is indistinguishable from "TSO was never assigned" (protobuf default), breaking any downstream ordering semantics that depend on valid TSO values.
Consider:
- Throwing an exception / returning an error status to fail the commit when TSO is required but unavailable, OR
- At minimum, adding a prominent WARNING log with the transaction ID and table ID so the issue is observable, AND documenting that TSO is best-effort (not guaranteed).
| @@ -0,0 +1,9 @@ | |||
| diff a/regression-test/data/query_p0/system/test_query_sys_rowsets.out b/regression-test/data/query_p0/system/test_query_sys_rowsets.out (rejected hunks) | |||
| @@ -13,6 +13,7 @@ DATA_DISK_SIZE bigint Yes false \N | |||
There was a problem hiding this comment.
[High — Merge Conflict Artifact]
This .rej file is a Git merge/rebase reject artifact and must not be committed. The corresponding .out file (test_query_sys_rowsets.out) also needs to be updated with the COMMIT_TSO column.
Same applies to test_query_sys_scan_rowsets.out.rej.
| def extractedLogicalCounter = tsoValue & 0xFFFL // 12 bits mask | ||
| assertEquals(logicalCounter, extractedLogicalCounter) | ||
| } finally { | ||
| sql "ADMIN SET FRONTEND CONFIG ('experimental_enable_feature_tso' = '${ret[0][1]}')" |
There was a problem hiding this comment.
[Medium — Wrong Bit Mask in Test Validation]
This test extracts the logical counter using a 12-bit mask (0xFFF), but TSOTimestamp uses an 18-bit logical counter (see LOGICAL_BITS = 18 in TSOTimestamp.java). The correct mask should be 0x3FFFF (18 bits).
Similarly, the comment on line 100 says "46 bits physical time + 6 bits reserved + 12 bits logical counter" — there are no reserved bits. The format is 46 bits physical + 18 bits logical.
The test may still pass by coincidence if the logical counter value fits in 12 bits, but it would fail for counter values >= 4096.
| */ | ||
| @Override | ||
| public synchronized void start() { | ||
| super.start(); |
There was a problem hiding this comment.
[Medium — Unnecessary Work When TSO Disabled]
The start() method unconditionally calls super.start() (spawning the MasterDaemon thread) and runs the calibration loop, regardless of Config.enable_feature_tso. While runAfterCatalogReady() checks the flag, the thread still runs every tso_service_update_interval_ms (50ms by default) doing nothing useful when TSO is disabled.
Consider adding a guard at the top of start():
if (!Config.enable_feature_tso) {
return;
}Note: since the config is mutable, you may also want to support dynamic enable — but the current implementation doesn't handle that correctly anyway (the daemon would already be running but would have skipped calibration).
| } | ||
| long logicalCounter = globalTimestamp.getLogicalCounter(); | ||
| globalTimestamp.setLogicalCounter(logicalCounter + 1); | ||
| logicalCounter = globalTimestamp.getLogicalCounter(); |
There was a problem hiding this comment.
[Low — Redundant Read-back]
Line 407 reads back the logical counter immediately after setting it on line 406, while still under the lock. Since setLogicalCounter stores exactly the value passed, this is equivalent to logicalCounter = logicalCounter + 1. The extra getter call adds no value.
// Current (redundant):
globalTimestamp.setLogicalCounter(logicalCounter + 1);
logicalCounter = globalTimestamp.getLogicalCounter();
// Simplified:
logicalCounter = logicalCounter + 1;
globalTimestamp.setLogicalCounter(logicalCounter);| }) | ||
| public static boolean calc_delete_bitmap_get_versions_waiting_for_pending_txns = true; | ||
|
|
||
|
|
There was a problem hiding this comment.
[Low — Trailing Blank Lines]
Two extra blank lines added at the end of the file before the closing brace. Please remove.
|
/review |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #61198
Related #57921
Problem Summary:
Release note
The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
The service exposes two main methods:
getTSO()– returns a new TSO timestamp for transaction commits.getCurrentTSO()– returns the current TSO without bumping the logical counter.experimental_enable_feature_tso– enables/disables the TSO feature.tso_service_update_interval_ms– interval in milliseconds for the TSO service to update its window.max_update_tso_retry_countandmax_get_tso_retry_count– retry limits for updating and obtaining TSOs.tso_service_window_duration_ms– length of the time window allocated by the TSO service.tso_time_offset_debug_mode– debug offset for the physical time.enable_tso_persist_journalandenable_tso_checkpoint_module– persistence switches for edit log and checkpoint.enable_tsowhich can be configured inCREATE TABLEor modified viaALTER TABLE. Only tables withenable_tso = truegenerate commit TSO for transactions; when disabled, commit_tso remains-1.TransactionStatenow fetches a commit TSO fromTSOServicewhen TSO is enabled and stores it in the transaction state andTableCommitInfo.TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).Rowset::make_visiblenow accepts acommit_tsoparameter and writes it toRowsetMeta.RowsetMetaPBadds a new fieldcommit_tsoto persist commit timestamps.information_schema.rowsetsintroduces a new columnCOMMIT_TSOallowing users to query the commit timestamp for each rowset.A new REST endpoint
/api/tsois added for retrieving current TSO information. It returns a JSON payload containing:window_end_physical_time– end of the current TSO time window.current_tso– the current composed 64‑bit TSO.current_tso_physical_timeandcurrent_tso_logical_counter– the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.New metrics counters (e.g.,
tso_clock_drift_detected,tso_clock_backward_detected,tso_clock_calculated,tso_clock_updated) expose state and health of the TSO service.FeMetaVersion.VERSION_CURRENTis bumped toVERSION_141to indicate the addition of the TSO module.New unit tests verify
TSOTimestampbit manipulation,TSOServicebehavior, commit TSO propagation, and the/api/tsoendpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.Impact and Compatibility
experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.enable_tsototrue. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the newcommit_tsofield./api/tsoto inspect current TSO values. No existing API is modified.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
docs: add TSO-related functions description and documentation. doris-website#3454
Check List (For Reviewer who merge this PR)