[SPARK-56125][SQL] Simplify schema calculation for Merge Into Schema Evolution#54934

Open
szehon-ho wants to merge 2 commits into apache:master from szehon-ho:SPARK-55690-merge-pending-schema-changes-refactor

Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Mar 21, 2026

What changes were proposed in this pull request?

Replace `sourceSchemaForSchemaEvolution` with a simpler `pendingChanges`. Also reduce path-based comparisons where possible (where we have the resolved type from the target/source).

Why are the changes needed?

This was suggested by @aokolnychyi after the initial PR was merged. `sourceSchemaForSchemaEvolution` is confusing: it is supposed to be a view of the source schema, pruned to the fields actually referenced by the MERGE INTO statement. It is used by the subsequent logic (which compares it with the target table schema), but it is hard to explain.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran existing tests.

Was this patch authored or co-authored using generative AI tooling?

Yes, Cursor

@szehon-ho szehon-ho changed the title [SPARK-56125][SQL] Simplify schema calculation code for Merge Into Schema Evolution [SPARK-56125][SQL] Simplify schema calculation for Merge Into Schema Evolution Mar 21, 2026
* @param root type schema
* @param path name segments
*/
def fieldExistsAtPath(
Member Author

@szehon-ho szehon-ho Mar 21, 2026


Unfortunately this is still needed, but only for the top-level unresolved reference case.

private def fieldExistsAtPathInternal(
dt: DataType,
parts: Seq[String]): Boolean = {
def checkAndRecurse(
Contributor


This looks correct.
nit: `checkAndRecurse` seems unnecessary, can we inline the logic? Also, can we consider rewriting the recursion using pattern matching on `parts` so the base case is handled in one place?
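A possible shape for that refactor (a sketch only, not the PR's actual code; it assumes the helper walks struct/array/map types with case-insensitive field matching, and that arrays/maps are addressed via the `element`/`key`/`value` segments mentioned later in this diff):

```scala
import org.apache.spark.sql.types._

// Sketch: recursion rewritten to pattern-match on (type, parts),
// so the empty-path base case is handled in exactly one place.
private def fieldExistsAtPathInternal(dt: DataType, parts: Seq[String]): Boolean =
  (dt, parts) match {
    case (_, Seq()) => true // base case: the whole path has been consumed
    case (s: StructType, head +: rest) =>
      // assumption: case-insensitive resolution, matching Spark's default resolver
      s.fields.exists { f =>
        f.name.equalsIgnoreCase(head) && fieldExistsAtPathInternal(f.dataType, rest)
      }
    case (a: ArrayType, "element" +: rest) =>
      fieldExistsAtPathInternal(a.elementType, rest)
    case (m: MapType, "key" +: rest) =>
      fieldExistsAtPathInternal(m.keyType, rest)
    case (m: MapType, "value" +: rest) =>
      fieldExistsAtPathInternal(m.valueType, rest)
    case _ => false // path segment does not match the current type
  }
```

Each recursive case peels one segment off `parts`, so the single `Seq()` case covers termination for all type shapes.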

* @param valueType type of the assignment value at this path (typically source column)
* @param changes accumulator for [[TableChange]] instances
* @param fieldPath qualified path segments for nested columns (`element` / `key` / `value`
* under arrays and mapss)
Contributor


typo? `mapss` → `maps`

val changes = mutable.LinkedHashSet.empty[TableChange]
val failIncompatible: () => Nothing = () =>
throw QueryExecutionErrors.failedToMergeIncompatibleSchemasError(
originalTarget, originalSource, null)
Contributor


nit: `failIncompatible` passes `null` as the cause, so the error only shows the full target/source schemas with no hint about which field path actually conflicts. Since `fieldPath`, `keyType`, and `valueType` are already available at the call site, should we include them in the exception? That would make debugging much easier for deeply nested schemas.
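One way the suggestion could look (a sketch only, assuming the third argument of `failedToMergeIncompatibleSchemasError` is a `Throwable` cause, consistent with the existing call passing `null`; the wrapping exception and its message are hypothetical):

```scala
// Sketch: carry the conflicting field path as the cause instead of null,
// so deeply nested mismatches are easier to debug.
val failIncompatible: () => Nothing = () => {
  val cause = new IllegalArgumentException(
    s"Incompatible types at field path: ${fieldPath.mkString(".")}")
  throw QueryExecutionErrors.failedToMergeIncompatibleSchemasError(
    originalTarget, originalSource, cause)
}
```

Alternatively, a dedicated error class parameter for the field path would surface it directly in the user-facing message rather than in the cause chain.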
