[SPARK-55163][PYTHON][CONNECT] Reuse metadata plans for DataFrames (#54939)
Open
davidgvad wants to merge 6 commits into apache:master
What changes were proposed in this pull request?
Hey everyone! I have been diving deeply into the Spark Connect Python DataFrame implementation, and this PR takes a small but, I believe, promising internal step toward tackling SPARK-55163.
While digging through the code, I noticed that several metadata-oriented paths were redundantly rebuilding the exact same unresolved Connect proto plan from an immutable DataFrame plan. To clean this up and make the path more explicit and reusable, I am introducing a private, cached `_metadata_plan` helper.
I also wanted to make sure our current metadata behavior is really solid and easy to reason about, so I added a bunch of regression tests. The new tests specifically cover:

- `isLocal`, `isStreaming`, `inputFiles`, `explain`, `sameSemantics`, and `semanticHash`
- `schema`/`columns` access on the same object using a single schema RPC
- `_metadata_plan` memoization without requiring full query execution

My goal was not to implement the full client-side metadata cache proposed in SPARK-55163, but to land a focused, low-risk foundational change that makes that future work easier.
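As a rough illustration of the memoization idea, here is a minimal, self-contained sketch. Note that `LogicalPlan`, `to_proto`, and the `_analyze` placeholder are simplified stand-ins invented for this example, not Spark's actual Connect client API:

```python
import functools


class LogicalPlan:
    """Stand-in for Spark Connect's client-side plan node (illustrative only)."""

    def __init__(self, description):
        self.description = description
        self.to_proto_calls = 0  # track rebuilds for the demo

    def to_proto(self):
        # In the real client this serializes the plan to a Connect proto message.
        self.to_proto_calls += 1
        return f"proto({self.description})"


class DataFrame:
    """The DataFrame plan is immutable, so the unresolved proto plan
    can be built once and reused for every metadata-only operation."""

    def __init__(self, plan):
        self._plan = plan

    @functools.cached_property
    def _metadata_plan(self):
        # Built lazily on first metadata access, then memoized on the instance.
        return self._plan.to_proto()

    def is_local(self):
        return self._analyze(self._metadata_plan, "is_local")

    def input_files(self):
        return self._analyze(self._metadata_plan, "input_files")

    def _analyze(self, proto, method):
        # Placeholder for the Analyze RPC to the Connect server.
        return (proto, method)
```

With this shape, calling several metadata methods on the same DataFrame serializes the plan only once, which is the behavior the `_metadata_plan` memoization tests described above would lock in.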
Why are the changes needed?
SPARK-55163 is motivated by the cost of repeated metadata access in Spark Connect.
Before moving on to a broader caching design, it seemed useful to first make the current metadata-resolution path more explicit and reusable. Reusing the unresolved proto plan gives these metadata operations a shared internal path and a clearer place to build on in follow-up work.
This change also avoids rebuilding the same proto plan repeatedly for metadata-only operations on the same immutable DataFrame, while adding regression coverage around the existing schema-caching behavior.
How was this patch tested?
Built Spark locally with Hive enabled:
Ran targeted Spark Connect tests with:
Also ran broader Connect regression coverage with:
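The exact commands were not included above; for reference, a typical sequence for a change in this area might look like the following. The profile flags and test module names are assumptions for illustration, not the commands actually run:

```shell
# Build Spark locally with Hive support enabled (assumed profile flags)
./build/mvn -DskipTests -Phive clean package

# Run targeted Spark Connect DataFrame tests (assumed test name)
python/run-tests --testnames pyspark.sql.tests.connect.test_connect_dataframe

# Run broader Connect regression coverage (assumed module name)
python/run-tests --modules pyspark-connect
```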