PHOENIX-7751 : [SyncTable Tool] Feature to validate table data using PhoenixSyncTable tool b/w source and target cluster by rahulLiving · Pull Request #2379 · apache/phoenix

rahulLiving · 2026-02-18T15:38:22Z

No description provided.

tkhurana · 2026-03-12T20:13:18Z

...t/src/main/java/org/apache/phoenix/coprocessorclient/BaseScannerRegionObserverConstants.java

+
+  /**
+   * PhoenixSyncTableTool chunk metadata cell qualifiers. These define the wire protocol between
+   * hoenixSyncTableRegionScanner (server-side coprocessor) and PhoenixSyncTableMapper (client-side


Typo: missing 'P' (should be PhoenixSyncTableRegionScanner).

tkhurana · 2026-03-18T14:15:15Z

...ix-core-server/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java

+
+  public static Long getPhoenixSyncTableFromTime(Configuration conf) {
+    Preconditions.checkNotNull(conf);
+    String value = conf.get(PHOENIX_SYNC_TABLE_FROM_TIME);


Why didn't you use conf.getLong()?

tkhurana · 2026-03-18T14:15:46Z

...ix-core-server/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java

+    conf.setLong(PHOENIX_SYNC_TABLE_TO_TIME, toTime);
+  }
+
+  public static Long getPhoenixSyncTableToTime(Configuration conf) {


Same here: why didn't you use conf.getLong()?
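For context on the two comments above, here is a minimal sketch of the trade-off, not the PR's actual code. Hadoop's `Configuration.getLong(name, defaultValue)` always takes a default, so on its own it cannot report "not set" as `null`, whereas the getters in the diff return a boxed `Long`. The `MiniConf` class and the key string below are hypothetical stand-ins so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class SyncTimeConfigSketch {
  /** Hypothetical stand-in for Hadoop's Configuration; only set/get/getLong are mimicked. */
  public static class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    public void set(String name, String value) {
      props.put(name, value);
    }

    public String get(String name) {
      return props.get(name);
    }

    public long getLong(String name, long defaultValue) {
      String v = props.get(name);
      return v == null ? defaultValue : Long.parseLong(v.trim());
    }
  }

  // Made-up key; the real value of PHOENIX_SYNC_TABLE_FROM_TIME is not shown in the diff.
  public static final String FROM_TIME_KEY = "example.sync.table.from.time";

  /** Nullable variant: null means the property was never set. */
  public static Long fromTimeOrNull(MiniConf conf) {
    String value = conf.get(FROM_TIME_KEY);
    return value == null ? null : Long.valueOf(value);
  }

  /** getLong variant: the caller has to choose a sentinel default. */
  public static long fromTimeOrDefault(MiniConf conf, long defaultValue) {
    return conf.getLong(FROM_TIME_KEY, defaultValue);
  }

  public static void main(String[] args) {
    MiniConf conf = new MiniConf();
    System.out.println(fromTimeOrNull(conf));
    System.out.println(fromTimeOrDefault(conf, -1L));
    conf.set(FROM_TIME_KEY, "1700000000000");
    System.out.println(fromTimeOrNull(conf));
  }
}
```

If a sentinel such as -1 is acceptable for "unset", the `getLong` form is shorter; if callers need to distinguish "unset" from any valid timestamp, the nullable form keeps that information.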

tkhurana · 2026-03-18T14:18:05Z

...ix-core-server/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java

    return configuration.getBoolean(MAPREDUCE_RANDOMIZE_MAPPER_EXECUTION_ORDER,
      DEFAULT_MAPREDUCE_RANDOMIZE_MAPPER_EXECUTION_ORDER);
  }
+


IMO these APIs can live in the PhoenixSyncTableTool class only. They are specific to the sync tool.

I saw that other tools also keep their setters/getters in PhoenixConfigurationUtil.java, so I followed the same pattern. I am okay with moving them.

tkhurana · 2026-03-20T18:53:45Z

...-core-server/src/main/java/org/apache/phoenix/coprocessor/PhoenixSyncTableRegionScanner.java

+        return false;
+      }
+
+      buildChunkMetadataResult(results, isTargetScan);


If we break out early due to a page timeout, won't we have a partial chunk?

It seems that isTargetScan serves a different purpose, or at least the naming can be improved.

> If we break out early due to page timeout won't we have a partial chunk ?

I have kept the source free of partial chunks: whatever can be processed within the page timeout is treated as the source chunk, and the target is scanned with that source chunk size.
We could allow partial chunks for the source, but my thinking was that if chunking takes ~5-10 minutes, it is better not to hit the same server again immediately, so that the server can cool off?

For the target, we always treat the chunk as partial and calculate the final checksum in the Mapper itself once all rows within the boundary have been read.
That is why isTargetScan is synonymous with partialChunk.
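The idea of finishing the checksum in the Mapper can be sketched as below. This is only an illustration under assumptions: it uses a single in-process SHA-256 `MessageDigest`, and it does not show how the real tool carries partial digest state between the region scanner and the Mapper. The class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class RunningChecksumSketch {
  private final MessageDigest digest;

  public RunningChecksumSketch() {
    try {
      // SHA-256 is an assumption; the hash used by the tool is not shown in this thread.
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Folds one partial batch of rows into the running digest without finalizing it. */
  public void addPartial(List<byte[]> rows) {
    for (byte[] row : rows) {
      digest.update(row);
    }
  }

  /** Finalizes the checksum once every row inside the chunk boundary has been added. */
  public byte[] finish() {
    return digest.digest();
  }

  /** Reference value: the checksum when all rows arrive in a single batch. */
  public static byte[] oneShot(List<byte[]> rows) {
    RunningChecksumSketch c = new RunningChecksumSketch();
    c.addPartial(rows);
    return c.finish();
  }
}
```

Because the digest is only finalized in `finish()`, the result does not depend on where the partial batches were split, which is what lets a target scan return early on a page timeout.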

tkhurana · 2026-03-20T19:43:48Z

...-core-server/src/main/java/org/apache/phoenix/coprocessor/PhoenixSyncTableRegionScanner.java

+  private byte[] chunkStartKey = null;
+  private byte[] chunkEndKey = null;
+  private long currentChunkSize = 0L;
+  private long currentChunkRowCount = 0L;


This could be improved by introducing the notion of a chunk object.
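A minimal sketch of what such a chunk object might look like, assuming it only needs to carry the four fields from the diff. The class name `Chunk` and its methods are hypothetical, not taken from the PR.

```java
public class ChunkSketch {
  /** Hypothetical value object; the fields mirror the scanner fields shown in the diff. */
  public static final class Chunk {
    private byte[] startKey; // first row key added; null while the chunk is empty
    private byte[] endKey;   // most recent row key added
    private long sizeBytes;
    private long rowCount;

    /** Records one row: sets the start key once, then advances the end key and totals. */
    public void addRow(byte[] rowKey, long rowSize) {
      if (startKey == null) {
        startKey = rowKey.clone();
      }
      endKey = rowKey.clone();
      sizeBytes += rowSize;
      rowCount++;
    }

    public boolean isEmpty() {
      return rowCount == 0;
    }

    public byte[] getStartKey() {
      return startKey;
    }

    public byte[] getEndKey() {
      return endKey;
    }

    public long getSizeBytes() {
      return sizeBytes;
    }

    public long getRowCount() {
      return rowCount;
    }
  }

  public static void main(String[] args) {
    Chunk chunk = new Chunk(); // a fresh object takes the place of an explicit reset
    chunk.addRow(new byte[] {1}, 40L);
    chunk.addRow(new byte[] {2}, 60L);
    System.out.println(chunk.getRowCount() + " rows, " + chunk.getSizeBytes() + " bytes");
  }
}
```

With the state grouped this way, the scanner would hold one `Chunk` reference instead of four loose fields, and an empty chunk is simply one whose `isEmpty()` is true.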

tkhurana · 2026-03-20T19:48:14Z

...-core-server/src/main/java/org/apache/phoenix/coprocessor/PhoenixSyncTableRegionScanner.java

+          byte[] rowKey = CellUtil.cloneRow(rowCells.get(0));
+          long rowSize = calculateRowSize(rowCells);
+          addRowToChunk(rowKey, rowCells, rowSize);
+          if (!isTargetScan && willExceedChunkLimits(rowSize)) {


So addRowToChunk already adds rowSize to chunkSize, and then willExceedChunkLimits adds rowSize to chunkSize again.
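The double counting described above can be shown with a small sketch. The limit value and both method names are hypothetical, and the real `willExceedChunkLimits` may check more than size; the point is only that once a row has been added, the check should not add its size a second time.

```java
public class ChunkLimitSketch {
  // Hypothetical limit, chosen only to make the arithmetic easy to follow.
  public static final long MAX_CHUNK_BYTES = 100L;

  /** Check as described in the review: rowSize is added again after the row was counted. */
  public static boolean exceedsDoubleCounted(long chunkSizeAfterAdd, long rowSize) {
    return chunkSizeAfterAdd + rowSize > MAX_CHUNK_BYTES;
  }

  /** One possible fix: compare the already-updated total against the limit. */
  public static boolean exceeds(long chunkSizeAfterAdd) {
    return chunkSizeAfterAdd > MAX_CHUNK_BYTES;
  }

  public static void main(String[] args) {
    long rowSize = 60L;
    long chunkSize = 0L;
    chunkSize += rowSize; // what addRowToChunk is said to do
    System.out.println(exceedsDoubleCounted(chunkSize, rowSize));
    System.out.println(exceeds(chunkSize));
  }
}
```

In this example a single 60-byte row fits within the 100-byte limit, yet the double-counted check treats the chunk as 120 bytes and would close it early.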

tkhurana · 2026-03-20T19:51:57Z

...-core-server/src/main/java/org/apache/phoenix/coprocessor/PhoenixSyncTableRegionScanner.java

+  public boolean next(List<Cell> results, ScannerContext scannerContext) throws IOException {
+    region.startRegionOperation();
+    try {
+      resetChunkState();


If you have the notion of a chunk object, then you don't need a reset; you can simply create a new chunk.

tkhurana · 2026-03-20T20:01:35Z

...t/src/main/java/org/apache/phoenix/coprocessorclient/BaseScannerRegionObserverConstants.java

+  /**
+   * PhoenixSyncTableTool scan attributes for server-side chunk formation and checksum
+   */
+  public static final String SYNC_TABLE_CHUNK_FORMATION = "_SyncTableChunkFormation";


Should all of these be named SYNC_TOOL instead?

I have used SyncTableTool for the user-facing class and config, and SyncTable for everything else. Are you recommending moving all classes and configs to SyncTool instead of SyncTable, i.e. PhoenixSyncTableRegionScanner -> PhoenixSyncToolRegionScanner?
I felt SyncTable is more self-explanatory than SyncTool. Alternatively, we could change it to SyncTableTool everywhere?

tkhurana · 2026-03-20T21:31:23Z

...-core-server/src/main/java/org/apache/phoenix/coprocessor/PhoenixSyncTableRegionScanner.java

+            if (chunkStartKey == null) {
+              LOGGER.warn("Paging timed out while fetching first row of chunk, initStartRowKey: {}",
+                Bytes.toStringBinary(initStartRowKey));
+              updateDummyWithPrevRowKey(results, initStartRowKey, includeInitStartRowKey, scan);
+              return true;


Is this ever hit? Even with a 0 page timeout we get at least one row.

Yeah, I was not able to repro it in my integration test. I kept it as a defensive check.

> Even with 0 page timeout we get at least one row

What would this row contain if we couldn't get any row from the table?

tkhurana · 2026-03-21T00:21:49Z

phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/PhoenixSyncTableMapper.java

+  @Override
+  protected void map(NullWritable key, DBInputFormat.NullDBWritable value, Context context)
+    throws IOException, InterruptedException {
+    context.getCounter(PhoenixJobCounters.INPUT_RECORDS).increment(1);


What is the meaning of INPUT_RECORDS in the context of the sync tool?

It indicates the number of mappers created.

tkhurana · 2026-03-21T00:22:26Z

phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/PhoenixSyncTableMapper.java

+
+      if (sourceRowsProcessed > 0) {
+        if (mismatchedChunk == 0) {
+          context.getCounter(PhoenixJobCounters.OUTPUT_RECORDS).increment(1);


What does OUTPUT_RECORDS mean in the context of the sync tool?

The number of mappers processed successfully. We also have FAILED_RECORD for failed mappers.

tkhurana · 2026-03-21T00:45:01Z

...core-server/src/main/java/org/apache/phoenix/mapreduce/PhoenixSyncTableOutputRepository.java

+      + "    TO_TIME BIGINT NOT NULL,\n" + "    START_ROW_KEY VARBINARY_ENCODED,\n"
+      + "    END_ROW_KEY VARBINARY_ENCODED,\n" + "    IS_DRY_RUN BOOLEAN, \n"
+      + "    EXECUTION_START_TIME TIMESTAMP,\n" + "    EXECUTION_END_TIME TIMESTAMP,\n"
+      + "    STATUS VARCHAR(20),\n" + "    COUNTERS VARCHAR(255), \n"


I don't think Counters should have a fixed limit. Just make them VARCHAR so that we can add more counters in the future.

tkhurana · 2026-03-21T00:48:04Z

phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/PhoenixSyncTableOutputRow.java

+
+  public enum Type {
+    CHUNK,
+    MAPPER_REGION


maybe just REGION

tkhurana · 2026-03-21T00:51:42Z

...core-server/src/main/java/org/apache/phoenix/mapreduce/PhoenixSyncTableOutputRepository.java

+
+    String query = "SELECT START_ROW_KEY, END_ROW_KEY FROM " + SYNC_TABLE_CHECKPOINT_TABLE_NAME
+      + " WHERE TABLE_NAME = ?  AND TARGET_CLUSTER = ?"
+      + " AND TYPE = ? AND FROM_TIME = ? AND TO_TIME = ? AND STATUS IN ( ?, ?)";


There are only two possible statuses, so does it make sense to set them in the query? If you don't, then you are only querying PK columns without any filter.
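A sketch of the simplified query the comment points toward, assuming (as the comment implies) that the remaining predicates are all on primary-key columns. The table-name string is a placeholder, since the value of SYNC_TABLE_CHECKPOINT_TABLE_NAME is not shown in the diff, and the helper names are hypothetical.

```java
public class CheckpointQuerySketch {
  // Placeholder; the real value of SYNC_TABLE_CHECKPOINT_TABLE_NAME is not shown in the diff.
  public static final String CHECKPOINT_TABLE = "CHECKPOINT_TABLE_PLACEHOLDER";

  /** Same SELECT as in the diff, minus the STATUS IN (?, ?) predicate. */
  public static String queryWithoutStatus() {
    return "SELECT START_ROW_KEY, END_ROW_KEY FROM " + CHECKPOINT_TABLE
      + " WHERE TABLE_NAME = ? AND TARGET_CLUSTER = ?"
      + " AND TYPE = ? AND FROM_TIME = ? AND TO_TIME = ?";
  }

  /** Counts bind placeholders, to show how many parameters the caller must set. */
  public static int bindCount(String sql) {
    return (int) sql.chars().filter(ch -> ch == '?').count();
  }

  public static void main(String[] args) {
    String sql = queryWithoutStatus();
    System.out.println(sql);
    System.out.println(bindCount(sql) + " bind parameters");
  }
}
```

Dropping the predicate is only safe while every stored status should be returned; if a third status is ever added, the filter would need to come back.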

Rahul Kumar and others added 22 commits August 1, 2025 20:52

connection creation time

3c54c86

Revert "connection creation time"

c97f7e0

This reverts commit 3c54c86.

Revert "Revert "connection creation time""

53e9a3b

This reverts commit c97f7e0.

Merge remote-tracking branch 'upstream/master'

6b75fec

Merge remote-tracking branch 'upstream/master'

6f40ab4

Merge remote-tracking branch 'upstream/master'

7328f93

ITs changes

fd46404

Revert "ITs changes"

58ef6a9

This reverts commit fd46404.

Merge remote-tracking branch 'upstream/master'

6f226f6

PHOENIX-7751 : [SyncTable Tool] Feature to validate table data using …

1ccf4b6

…PhoenixSyncTable tool b/w source and target cluster

revert other changes

e75c6c1

checkstyle fix

a5060ab

checkstyle fix

cffd2e6

checkstyle fix

2ef30e6

adding more ITs

dd18dae

adding more ITs

326e792

misc fix

b7127cc

code comment

f588291

code comment formatting

f81aa56

Adding all UT/ITs

d60104f

Fix tests

359f345

Fix tests

1bcd693

rahulLiving marked this pull request as ready for review March 12, 2026 12:36

Rahul Kumar added 2 commits March 12, 2026 18:08

Merge remote-tracking branch 'upstream/master' into PHOENIX-7751

7904c50

PhoenixConfigurationUtilTest

b9dfd3c

tkhurana reviewed Mar 12, 2026

Rahul Kumar added 2 commits March 13, 2026 19:28

Fix build issues

6c50f95

Some More UTs

b8c00e4

tkhurana reviewed Mar 18, 2026

tkhurana reviewed Mar 20, 2026

tkhurana reviewed Mar 21, 2026
