[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO) #61199

Open
AntiTopQuark wants to merge 18 commits into apache:master from AntiTopQuark:support_tso

@AntiTopQuark AntiTopQuark commented Mar 11, 2026

What problem does this PR solve?

Issue Number: close #61198

Related #57921

Problem Summary:

Release note

  • Implement a global monotonically increasing Timestamp Oracle (TSO) service that generates unique, monotonically increasing timestamps for transactions.
    The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
    A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
  • Introduce TSOService, a master-only daemon that manages global timestamps.
    The service exposes two main methods:
    • getTSO() – returns a new TSO timestamp for transaction commits.
    • getCurrentTSO() – returns the current TSO without bumping the logical counter.
  • Add multiple configuration properties to control the behavior of the TSO feature:
    • experimental_enable_feature_tso – enables/disables the TSO feature.
    • tso_service_update_interval_ms – interval in milliseconds for the TSO service to update its window.
    • max_update_tso_retry_count and max_get_tso_retry_count – retry limits for updating and obtaining TSOs.
    • tso_service_window_duration_ms – length of the time window allocated by the TSO service.
    • tso_time_offset_debug_mode – debug offset for the physical time.
    • enable_tso_persist_journal and enable_tso_checkpoint_module – persistence switches for edit log and checkpoint.
  • Table property: Introduce enable_tso which can be configured in CREATE TABLE or modified via ALTER TABLE. Only tables with enable_tso = true generate commit TSO for transactions; when disabled, commit_tso remains -1.
  • Transaction and commit integration:
    • During commit, TransactionState now fetches a commit TSO from TSOService when TSO is enabled and stores it in the transaction state and TableCommitInfo.
    • The commit TSO is recorded per partition (via TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).
  • Rowset and meta changes:
    • Rowset::make_visible now accepts a commit_tso parameter and writes it to RowsetMeta.
    • RowsetMetaPB adds a new field commit_tso to persist commit timestamps.
    • information_schema.rowsets introduces a new column COMMIT_TSO allowing users to query the commit timestamp for each rowset.
    • Pending publish tasks, asynchronous publish tasks and other internal structures have been extended to carry commit TSO.
  • External interface:
    A new REST endpoint /api/tso is added for retrieving current TSO information. It returns a JSON payload containing:
    • window_end_physical_time – end of the current TSO time window.
    • current_tso – the current composed 64‑bit TSO.
    • current_tso_physical_time and current_tso_logical_counter – the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.
  • Metrics & observability:
    New metrics counters (e.g., tso_clock_drift_detected, tso_clock_backward_detected, tso_clock_calculated, tso_clock_updated) expose state and health of the TSO service.
  • Meta version:
    FeMetaVersion.VERSION_CURRENT is bumped to VERSION_141 to indicate the addition of the TSO module.
  • Regression & unit tests:
    New unit tests verify TSOTimestamp bit manipulation, TSOService behavior, commit TSO propagation, and the /api/tso endpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.
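
As a rough illustration of the difference between getTSO() and getCurrentTSO() described in the release note, the sketch below keeps a physical part and a logical counter under one lock. It is not the PR's TSOService: the class name TsoSketch, the advancePhysical method, and the constructor are hypothetical, and the 18-bit logical width and the -1 overflow result are assumptions taken from the review discussion on this PR.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a timestamp oracle; not the PR's TSOService. */
final class TsoSketch {
    // Assumed layout: the low 18 bits hold the logical counter,
    // the remaining high bits hold the physical time in milliseconds.
    static final int LOGICAL_BITS = 18;
    static final long MAX_LOGICAL = (1L << LOGICAL_BITS) - 1;

    private final ReentrantLock lock = new ReentrantLock();
    private long physicalMs;
    private long logical;

    TsoSketch(long physicalMs) {
        this.physicalMs = physicalMs;
    }

    /** Returns a new timestamp and advances the logical counter; -1 on overflow. */
    long getTSO() {
        lock.lock();
        try {
            if (logical >= MAX_LOGICAL) {
                return -1L; // counter exhausted until the physical part advances
            }
            logical++;
            return (physicalMs << LOGICAL_BITS) | logical;
        } finally {
            lock.unlock();
        }
    }

    /** Returns the latest timestamp without advancing the logical counter. */
    long getCurrentTSO() {
        lock.lock();
        try {
            return (physicalMs << LOGICAL_BITS) | logical;
        } finally {
            lock.unlock();
        }
    }

    /** Moves the physical part forward and resets the counter; backward moves are ignored. */
    void advancePhysical(long newPhysicalMs) {
        lock.lock();
        try {
            if (newPhysicalMs > physicalMs) {
                physicalMs = newPhysicalMs;
                logical = 0;
            }
        } finally {
            lock.unlock();
        }
    }
}
```

Ignoring backward moves in advancePhysical is one simple way to keep the sequence monotonic in this sketch; the PR's actual handling of clock drift and backward jumps may differ.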

Impact and Compatibility

  • Experimental: the TSO feature is currently guarded by experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.
  • Table compatibility: tables created before enabling TSO remain unaffected unless explicitly modified to set enable_tso to true. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the new commit_tso field.
  • Client API: clients can call /api/tso to inspect current TSO values. No existing API is modified.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.45% (1798/2263)
Line Coverage 64.65% (32249/49881)
Region Coverage 65.60% (16145/24611)
Branch Coverage 56.06% (8611/15360)

@AntiTopQuark
Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.65% (32290/49944)
Region Coverage 65.57% (16170/24659)
Branch Coverage 55.96% (8611/15388)

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.63% (32279/49944)
Region Coverage 65.55% (16165/24659)
Branch Coverage 55.95% (8610/15388)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

@AntiTopQuark
Author

run buildall

…TSO)

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.66% (32294/49944)
Region Coverage 65.59% (16173/24659)
Branch Coverage 55.97% (8612/15388)

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@AntiTopQuark
Author

run buildall

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.24% (1798/2269)
Line Coverage 64.53% (32281/50023)
Region Coverage 65.45% (16165/24699)
Branch Coverage 55.85% (8609/15414)

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.24% (1798/2269)
Line Coverage 64.50% (32297/50076)
Region Coverage 65.41% (16176/24730)
Branch Coverage 55.85% (8621/15436)

u
Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.15% (1796/2269)
Line Coverage 64.48% (32285/50072)
Region Coverage 65.36% (16161/24727)
Branch Coverage 55.81% (8614/15434)

@morningman morningman self-assigned this Mar 17, 2026
@morningman
Contributor

/review

@github-actions bot left a comment

Code Review Summary for PR #61199: [feature](tso) Add global monotonically increasing Timestamp Oracle (TSO)

This PR introduces a global TSO (Timestamp Oracle) service for generating monotonically increasing timestamps, integrated into the transaction commit path, publish version flow, and rowset metadata. The feature is experimental and gated by experimental_enable_feature_tso.

Critical Checkpoint Conclusions

1. Goal and correctness:
The PR achieves the stated goal of adding a TSO service with timestamp generation, time-window management, journal persistence, and integration into the commit+publish path. Tests cover the core logic. However, several correctness and compatibility issues are identified below.

2. Modification clarity and focus:
The PR is large (50 files, 2080+ additions) but focused on a single feature. The scope is appropriate for the feature.

3. Concurrency:
TSOService uses a ReentrantLock to protect globalTimestamp. The lock usage is correct — all reads and writes to the physical/logical fields are under the lock. Env.windowEndTSO is an AtomicLong, safely accessed from multiple threads. Minor issue: generateTSO() has a redundant read-back after setLogicalCounter (see inline comment).

4. Lifecycle management:
The TSOService extends MasterDaemon and is started unconditionally in startMasterOnlyDaemonThreads() — even when TSO is disabled. The start() method runs the calibration loop regardless of Config.enable_feature_tso. While runAfterCatalogReady() correctly checks the flag, the unconditional startup wastes resources. See inline comment.

5. Configuration items:
Configs are well-documented with Chinese and English descriptions. enable_feature_tso is correctly marked EXPERIMENTAL and mutable. tso_service_update_interval_ms is mutable=false which is appropriate.

6. Incompatible changes / rolling upgrade:
CRITICAL issues — see inline comments on TPartitionVersionInfo (Thrift required vs optional) and MINIMUM_VERSION_REQUIRED bump.

7. Parallel code paths:
All three commit paths (single-txn, sub-txn, 2PC) consistently use the same getCommitTSO() helper and follow the same pattern. The generatePartitionVersionInfoWhenReport() in LocalTabletInvertedIndex also correctly propagates commitTSO.

8. Test coverage:

  • Unit tests cover TSOTimestamp bit manipulation, TSOService basic behavior, TSOAction REST endpoint, and DatabaseTransactionMgr commit TSO integration.
  • Regression tests cover the API and rowset commit_tso visibility.
  • Missing: Negative test for clock backward handling, concurrent getTSO() stress test, and test for the case where TSO service is disabled but enable_tso table property is set.

9. Observability:
Six metrics cover clock drift, backward detection, updates, failures, and get success. Log levels are appropriate. Good coverage.

10. Transaction / persistence:
EditLog write/replay for OP_TSO_TIMESTAMP_WINDOW_END is correctly implemented. Image save/load is properly gated by config flags. TransactionState.commitTSO and TableCommitInfo.commitTSO are Gson-serialized (backward compatible).

11. Data writes:
Rowset commit_tso is set atomically during make_visible() under the existing txn lock. No concurrency issue.

12. FE-BE variable passing:
All paths updated: TPartitionVersionInfo, PendingPublishInfoPB, and DiscontinuousVersionTablet all carry commit_tso. However, the Thrift field is required which breaks rolling upgrade (see inline comment).

13. Performance:
No hot-path performance concerns. The lock in generateTSO() is lightweight. The periodic writeTimestampToBDBJE uses EditLog which is already on the critical path.

14. Other issues:

  • .out.rej files committed (merge conflict artifacts)
  • Regression test test_tso_api.groovy uses wrong bit masks for validation
  • Silent fallback to commitTSO=-1 when TSO fetch fails on a TSO-enabled table

Issues Found (by severity)

| # | Severity | File | Issue |
|---|----------|------|-------|
| 1 | Critical | AgentService.thrift | TPartitionVersionInfo.commit_tso is required; breaks rolling upgrade from old FE |
| 2 | Critical | FeMetaVersion.java | MINIMUM_VERSION_REQUIRED bumped to 141; prevents rollback for experimental feature |
| 3 | High | DatabaseTransactionMgr.java | Silent commit with commitTSO=-1 when TSO fetch fails on TSO-enabled table |
| 4 | High | *.out.rej files | Merge conflict reject files committed to repository |
| 5 | Medium | test_tso_api.groovy | Logical counter extraction uses 12-bit mask (0xFFF) but TSO uses 18-bit logical counter |
| 6 | Medium | TSOService.java | start() unconditionally runs calibration even when TSO is disabled |
| 7 | Low | TSOService.java | Redundant read-back in generateTSO() |
| 8 | Low | Config.java | Trailing blank lines at end of file |

3: required Types.TVersionHash version_hash // Deprecated
4: required i64 commit_tso
}


[Critical — Rolling Upgrade Breakage]

commit_tso is declared as required, but this field is newly added. During a rolling upgrade, an old FE (without this field) will serialize TPartitionVersionInfo with only 3 fields. On the new BE, the missing required field is then either rejected outright (Thrift-generated readers typically raise a protocol exception when a required field is absent) or, if the struct is accepted, left at 0 (Thrift's default for i64), not -1 (the application-level sentinel meaning "no TSO").

In the latter case the result is silent semantic corruption: rowsets will be stamped with commit_tso=0 instead of the intended -1 default (as defined in olap_file.proto with [default = -1]). Any downstream logic checking commit_tso > 0 vs commit_tso == -1 would behave incorrectly.

Fix: Change to 4: optional i64 commit_tso and have the BE check __isset.commit_tso to fall back to -1 when absent. Alternatively, set a default: 4: optional i64 commit_tso = -1.
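
The fallback suggested above can be sketched as follows. This is only an illustration in Java of the "check whether the field was set, otherwise use -1" pattern; the real check would live in the BE's C++ code against the Thrift-generated struct. PartitionVersionInfo here is a hand-written, hypothetical stand-in and not generated Thrift code, and the names effectiveCommitTso and NO_TSO are assumptions.

```java
/** Hypothetical sketch of falling back to the -1 sentinel when an optional field is absent. */
final class CommitTsoFallbackSketch {
    /** Application-level sentinel meaning "no TSO assigned". */
    static final long NO_TSO = -1L;

    /** Hand-written stand-in for a deserialized struct with an optional i64 field. */
    static final class PartitionVersionInfo {
        private boolean commitTsoSet; // plays the role of the generated "is set" flag
        private long commitTso;       // stays 0 until explicitly set

        void setCommitTso(long value) {
            this.commitTso = value;
            this.commitTsoSet = true;
        }

        boolean isSetCommitTso() {
            return commitTsoSet;
        }

        long getCommitTso() {
            return commitTso;
        }
    }

    /** Returns the sender's value if present, otherwise the sentinel rather than the raw 0. */
    static long effectiveCommitTso(PartitionVersionInfo info) {
        return info.isSetCommitTso() ? info.getCommitTso() : NO_TSO;
    }
}
```

A message from an old sender never sets the field, so effectiveCommitTso yields -1 even though the raw getter would report 0.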

  // these clause will be useless and we could remove them
- public static final int MINIMUM_VERSION_REQUIRED = VERSION_140;
+ public static final int MINIMUM_VERSION_REQUIRED = VERSION_141;
  }

[Critical — Prevents Rollback for Experimental Feature]

Bumping MINIMUM_VERSION_REQUIRED from 140 to 141 means:

  1. Once any FE with this code writes an image (which uses VERSION_CURRENT=141), no older FE can read it.
  2. Conversely, this FE refuses to load any image with version < 141.

This creates a hard one-way migration gate for a feature that is:

  • Explicitly experimental (varType = VariableAnnotation.EXPERIMENTAL)
  • Disabled by default (enable_feature_tso = false)
  • Guarded by three separate config flags

No code anywhere checks VERSION_141 conditionally — there are zero if (version >= VERSION_141) guards in the codebase. The TSO data is persisted via:

  • Gson @SerializedName fields (backward-compatible — unknown fields are ignored)
  • An image module that writes nothing when disabled
  • An EditLog opcode only emitted when enable_tso_persist_journal=true

Fix: Do NOT bump MINIMUM_VERSION_REQUIRED. Only bump VERSION_CURRENT if you need a marker. The module system already handles unknown/empty modules gracefully (MetaReader skips empty modules and has ignore_unknown_metadata_module for unknown ones).

if (fetched <= 0) {
    LOG.warn("failed to get TSO for txn {}, fallback to -1",
            transactionState.getTransactionId());
    return tso;

[High — Silent Loss of TSO Ordering Guarantee]

When getTSO() returns -1 (exhausted retries, logical counter overflow, or not calibrated), getCommitTSO() silently falls back to -1 and the transaction still commits successfully. The user has explicitly opted into TSO ordering by setting enable_tso=true on the table, but this commit has no valid TSO.

The -1 value is indistinguishable from "TSO was never assigned" (protobuf default), breaking any downstream ordering semantics that depend on valid TSO values.

Consider:

  1. Throwing an exception / returning an error status to fail the commit when TSO is required but unavailable, OR
  2. At minimum, adding a prominent WARNING log with the transaction ID and table ID so the issue is observable, AND documenting that TSO is best-effort (not guaranteed).
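
Option 1 could look roughly like the sketch below: a table that opted into TSO either gets a valid timestamp or the commit fails with an error naming the transaction and table. The class, method, and exception names are hypothetical, the TSO source is passed in as a plain LongSupplier instead of the real TSOService, and whether failing the commit is acceptable is a design decision for the PR author.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a "fail the commit" policy; not the PR's getCommitTSO(). */
final class CommitTsoPolicySketch {
    /** Sentinel stored for tables that did not opt into TSO. */
    static final long NO_TSO = -1L;

    /** Raised when a TSO-enabled table cannot obtain a valid commit timestamp. */
    static final class TsoUnavailableException extends RuntimeException {
        TsoUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Returns NO_TSO for tables without TSO, a valid timestamp for TSO-enabled tables,
     * and throws instead of silently falling back when the fetch fails.
     */
    static long commitTsoFor(boolean tableTsoEnabled, long txnId, long tableId,
                             LongSupplier tsoSource) {
        if (!tableTsoEnabled) {
            return NO_TSO;
        }
        long fetched = tsoSource.getAsLong();
        if (fetched <= 0) {
            throw new TsoUnavailableException(
                    "no valid TSO for txn " + txnId + " on table " + tableId);
        }
        return fetched;
    }
}
```

With this shape, -1 can only ever mean "the table does not use TSO", which keeps the sentinel unambiguous for downstream readers.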

@@ -0,0 +1,9 @@
diff a/regression-test/data/query_p0/system/test_query_sys_rowsets.out b/regression-test/data/query_p0/system/test_query_sys_rowsets.out (rejected hunks)
@@ -13,6 +13,7 @@ DATA_DISK_SIZE bigint Yes false \N

[High — Merge Conflict Artifact]

This .rej file is a rejected-hunk artifact (the kind left behind by patch or git apply --reject when a hunk fails to apply) and must not be committed. The corresponding .out file (test_query_sys_rowsets.out) also needs to be updated with the COMMIT_TSO column.

Same applies to test_query_sys_scan_rowsets.out.rej.

def extractedLogicalCounter = tsoValue & 0xFFFL // 12 bits mask
assertEquals(logicalCounter, extractedLogicalCounter)
} finally {
sql "ADMIN SET FRONTEND CONFIG ('experimental_enable_feature_tso' = '${ret[0][1]}')"

[Medium — Wrong Bit Mask in Test Validation]

This test extracts the logical counter using a 12-bit mask (0xFFF), but TSOTimestamp uses an 18-bit logical counter (see LOGICAL_BITS = 18 in TSOTimestamp.java). The correct mask should be 0x3FFFF (18 bits).

Similarly, the comment on line 100 says "46 bits physical time + 6 bits reserved + 12 bits logical counter" — there are no reserved bits. The format is 46 bits physical + 18 bits logical.

The test may still pass by coincidence if the logical counter value fits in 12 bits, but it would fail for counter values >= 4096.
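
To make the mask mismatch concrete, here is a small sketch of the 46-bit physical plus 18-bit logical layout described above. The class and method names are hypothetical and this is not the PR's TSOTimestamp class; only the bit widths come from the comment.

```java
/** Hypothetical sketch of the 46 + 18 bit layout; not the PR's TSOTimestamp. */
final class TsoLayoutSketch {
    static final int LOGICAL_BITS = 18;
    static final long LOGICAL_MASK = (1L << LOGICAL_BITS) - 1; // 0x3FFFF

    /** Packs the physical time (ms) into the high bits and the counter into the low 18 bits. */
    static long compose(long physicalMs, long logical) {
        return (physicalMs << LOGICAL_BITS) | (logical & LOGICAL_MASK);
    }

    /** Extracts the physical time in milliseconds. */
    static long physical(long tso) {
        return tso >>> LOGICAL_BITS;
    }

    /** Extracts the logical counter using the full 18-bit mask. */
    static long logical(long tso) {
        return tso & LOGICAL_MASK;
    }

    public static void main(String[] args) {
        long tso = compose(1_700_000_000_000L, 5000L);
        System.out.println(logical(tso));  // 5000: the 18-bit mask recovers the counter
        System.out.println(tso & 0xFFFL);  // 904: a 12-bit mask drops the high counter bits
    }
}
```

For any counter below 4096 both masks agree, which is why the test can pass by coincidence.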

*/
@Override
public synchronized void start() {
super.start();

[Medium — Unnecessary Work When TSO Disabled]

The start() method unconditionally calls super.start() (spawning the MasterDaemon thread) and runs the calibration loop, regardless of Config.enable_feature_tso. While runAfterCatalogReady() checks the flag, the thread still runs every tso_service_update_interval_ms (50ms by default) doing nothing useful when TSO is disabled.

Consider adding a guard at the top of start():

if (!Config.enable_feature_tso) {
    return;
}

Note: since the config is mutable, you may also want to support dynamic enable — but the current implementation doesn't handle that correctly anyway (the daemon would already be running but would have skipped calibration).
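
One possible way to support dynamic enable is to calibrate lazily inside the periodic loop rather than once at startup. The sketch below only illustrates that idea: the class name, the tick method, and the AtomicBoolean standing in for the mutable config flag are all hypothetical, and the real calibration logic is reduced to a counter.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of lazy calibration driven by a mutable flag; not the PR's TSOService. */
final class LazyCalibrationSketch {
    /** Stands in for a mutable config flag that can flip at runtime. */
    final AtomicBoolean featureEnabled = new AtomicBoolean(false);

    private boolean calibrated;
    private int calibrations; // counts calibration runs so the behavior is observable

    /** One iteration of the periodic loop. Returns true if TSO work was done. */
    synchronized boolean tick() {
        if (!featureEnabled.get()) {
            calibrated = false; // force a fresh calibration after a later re-enable
            return false;
        }
        if (!calibrated) {
            calibrate();
            calibrated = true;
        }
        return true; // a real service would update its time window here
    }

    synchronized int calibrationCount() {
        return calibrations;
    }

    private void calibrate() {
        calibrations++; // placeholder for the real calibration against persisted state
    }
}
```

This keeps the daemon thread running while the feature is off, so it addresses the dynamic-enable case but not the idle wakeups; combining it with a longer sleep interval when disabled would be a separate choice.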

}
long logicalCounter = globalTimestamp.getLogicalCounter();
globalTimestamp.setLogicalCounter(logicalCounter + 1);
logicalCounter = globalTimestamp.getLogicalCounter();

[Low — Redundant Read-back]

Line 407 reads back the logical counter immediately after setting it on line 406, while still under the lock. Since setLogicalCounter stores exactly the value passed, this is equivalent to logicalCounter = logicalCounter + 1. The extra getter call adds no value.

// Current (redundant):
globalTimestamp.setLogicalCounter(logicalCounter + 1);
logicalCounter = globalTimestamp.getLogicalCounter();

// Simplified:
logicalCounter = logicalCounter + 1;
globalTimestamp.setLogicalCounter(logicalCounter);

})
public static boolean calc_delete_bitmap_get_versions_waiting_for_pending_txns = true;



[Low — Trailing Blank Lines]

Two extra blank lines added at the end of the file before the closing brace. Please remove.

@AntiTopQuark
Author

/review

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.37% (32267/50130)
Region Coverage 65.27% (16162/24760)
Branch Coverage 55.69% (8607/15456)

@AntiTopQuark
Author

run buildall

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.43% (32299/50130)
Region Coverage 65.34% (16177/24760)
Branch Coverage 55.75% (8617/15456)

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 44.25% (200/452) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.73% (19791/37534)
Line Coverage 36.23% (184753/509902)
Region Coverage 32.49% (143059/440261)
Branch Coverage 33.68% (62575/185808)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.67% (26329/36737)
Line Coverage 54.44% (276652/508188)
Region Coverage 51.71% (229764/444307)
Branch Coverage 53.15% (99007/186290)

@AntiTopQuark
Author

run buildall

@AntiTopQuark
Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.40% (32286/50130)
Region Coverage 65.29% (16165/24760)
Branch Coverage 55.70% (8609/15456)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 44.25% (200/452) 🎉
Increment coverage report
Complete coverage report

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.69% (19779/37535)
Line Coverage 36.21% (184650/509908)
Region Coverage 32.47% (142946/440295)
Branch Coverage 33.65% (62531/185804)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.26% (26916/36738)
Line Coverage 56.72% (288268/508194)
Region Coverage 54.00% (239965/444341)
Branch Coverage 55.77% (103899/186286)

Development

Successfully merging this pull request may close these issues.

[Feature] Add a global monotonically increasing timestamp service (TSO) for incremental computation in Doris

4 participants