
[SPARK-55163][PYTHON][CONNECT] Reuse metadata plans for DataFrames#54939

Open
davidgvad wants to merge 6 commits into apache:master from davidgvad:connect-metadata-plan-reuse

Conversation

@davidgvad

What changes were proposed in this pull request?

Hey everyone! I have been diving deeply into the Spark Connect Python DataFrame implementation, and this PR takes a small but, I believe, promising internal step toward tackling SPARK-55163.

While digging through the code, I noticed that several metadata-oriented paths were redundantly rebuilding the same unresolved Connect proto plan from an immutable DataFrame plan. To clean this up and make the path more explicit and reusable, I am introducing a private cached _metadata_plan helper, which is now shared by:

  • schema resolution
  • isLocal
  • isStreaming
  • inputFiles
  • explain
  • sameSemantics
  • semanticHash
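To illustrate the memoization pattern described above, here is a minimal, self-contained sketch. All names (Plan, to_proto, DataFrame) are illustrative stand-ins, not the actual Spark Connect classes or the exact implementation in this PR:

```python
# Illustrative sketch of a cached metadata-plan helper.
# Plan, to_proto, and DataFrame are hypothetical stand-ins,
# not the real Spark Connect implementation.
import functools


class Plan:
    """Stand-in for an immutable Connect logical plan."""

    def __init__(self):
        self.builds = 0  # count how often the proto plan is built

    def to_proto(self):
        self.builds += 1
        return {"plan": "unresolved-proto"}


class DataFrame:
    def __init__(self, plan):
        self._plan = plan  # immutable, so the proto plan is safe to reuse

    @functools.cached_property
    def _metadata_plan(self):
        # Built once per DataFrame object; metadata paths such as schema
        # resolution, isLocal, isStreaming, etc. would all reuse this.
        return self._plan.to_proto()


plan = Plan()
df = DataFrame(plan)
df._metadata_plan
df._metadata_plan  # second access hits the cache
print(plan.builds)  # -> 1
```

Because the underlying DataFrame plan is immutable, caching the unresolved proto plan per object is safe: any transformation produces a new DataFrame with its own cache.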

I also wanted to make sure our current metadata behavior is really solid and easy to reason about, so I added a bunch of regression tests. The new tests specifically cover:

  • repeated metadata access on the same object reusing a single proto-plan build
  • repeated schema/columns access on the same object using a single schema RPC
  • schema-preserving transformations reusing the cached schema
  • schema-mutating transformations triggering one new schema RPC
  • a plan-only test for _metadata_plan memoization without requiring full query execution

My goal was not to implement the full client-side metadata cache proposed in SPARK-55163, but to land a focused, low-risk foundational change that makes that future work easier.

Why are the changes needed?

SPARK-55163 is motivated by the cost of repeated metadata access in Spark Connect.

Before moving on to a broader caching design, it seemed useful to first make the current metadata-resolution path more explicit and reusable. Reusing the unresolved proto plan gives these metadata operations a shared internal path and a clearer place to build on in follow-up work.

This change also avoids rebuilding the same proto plan repeatedly for metadata-only operations on the same immutable DataFrame, while adding regression coverage around the existing schema-caching behavior.

How was this patch tested?

Built Spark locally with Hive enabled:

build/sbt -Phive clean package

Ran targeted Spark Connect tests with:

/Users/datun/miniconda3/bin/python3 -u ./python/run-tests.py \
  --python-executables /Users/datun/miniconda3/bin/python3 \
  -p 1 \
  --testnames \
"pyspark.sql.tests.connect.test_connect_plan SparkConnectPlanTests.test_metadata_plan_is_cached,\
pyspark.sql.tests.connect.test_connect_dataframe_property SparkConnectDataFramePropertyTests.test_metadata_plan_is_reused_across_metadata_access,\
pyspark.sql.tests.connect.test_connect_dataframe_property SparkConnectDataFramePropertyTests.test_same_object_metadata_uses_single_schema_rpc,\
pyspark.sql.tests.connect.test_connect_dataframe_property SparkConnectDataFramePropertyTests.test_schema_preserving_metadata_reuses_cached_schema,\
pyspark.sql.tests.connect.test_connect_dataframe_property SparkConnectDataFramePropertyTests.test_schema_mutating_metadata_triggers_one_new_schema_rpc,\
pyspark.sql.tests.connect.test_connect_basic,\
pyspark.sql.tests.connect.test_connect_plan"

Also ran broader Connect regression coverage with:

/Users/datun/miniconda3/bin/python3 -u ./python/run-tests.py \
  --python-executables /Users/datun/miniconda3/bin/python3 \
  -p 1 \
  --testnames pyspark.sql.tests.connect.test_connect_dataframe_property
/Users/datun/miniconda3/bin/python3 -u ./python/run-tests.py \
  --python-executables /Users/datun/miniconda3/bin/python3 \
  -p 1 \
  --testnames pyspark.sql.tests.connect.test_connect_readwriter
