[SPARK-55163][PYTHON][CONNECT] Reuse metadata plans for DataFrames (#54939)
Open
davidgvad wants to merge 6 commits into apache:master
What changes were proposed in this pull request?
Hey everyone! I have been diving deeply into the Spark Connect Python DataFrame implementation, and this PR takes a small but, I believe, promising internal step toward tackling SPARK-55163.
While digging through the code, I noticed that several metadata-oriented paths were redundantly rebuilding the exact same unresolved Connect proto plan from an immutable DataFrame plan. To clean this up and make the path more explicit and reusable, I am introducing a private, cached `_metadata_plan` helper.
I also wanted to make sure our current metadata behavior is really solid and easy to reason about, so I added a bunch of regression tests. The new tests specifically cover:

- `isLocal`, `isStreaming`, `inputFiles`, `explain`, `sameSemantics`, and `semanticHash`
- `schema`/`columns` access on the same object using a single schema RPC
- `_metadata_plan` memoization without requiring full query execution

My goal was not to implement the full client-side metadata cache proposed in SPARK-55163, but to land a focused, low-risk foundational change that makes that future work easier.
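As a rough illustration of the memoization idea, here is a minimal, self-contained sketch. Note that `LogicalPlan`, `to_proto`, and the `_analyze` placeholder are simplified stand-ins invented for this example, not Spark's actual Connect client API:

```python
import functools


class LogicalPlan:
    """Stand-in for Spark Connect's client-side plan node (illustrative only)."""

    def __init__(self, description):
        self.description = description
        self.to_proto_calls = 0  # track rebuilds for the demo

    def to_proto(self):
        # In the real client this serializes the plan to a Connect proto message.
        self.to_proto_calls += 1
        return f"proto({self.description})"


class DataFrame:
    """The DataFrame plan is immutable, so the unresolved proto plan
    can be built once and reused for every metadata-only operation."""

    def __init__(self, plan):
        self._plan = plan

    @functools.cached_property
    def _metadata_plan(self):
        # Built lazily on first metadata access, then memoized on the instance.
        return self._plan.to_proto()

    def is_local(self):
        return self._analyze(self._metadata_plan, "is_local")

    def input_files(self):
        return self._analyze(self._metadata_plan, "input_files")

    def _analyze(self, proto, method):
        # Placeholder for the Analyze RPC to the Connect server.
        return (proto, method)
```

With this shape, calling several metadata methods on the same DataFrame serializes the plan only once, which is the behavior the `_metadata_plan` memoization tests described above would lock in.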
Why are the changes needed?
SPARK-55163 is motivated by the cost of repeated metadata access in Spark Connect.
Before moving on to a broader caching design, it seemed useful to first make the current metadata-resolution path more explicit and reusable. Reusing the unresolved proto plan gives these metadata operations a shared internal path and a clearer place to build on in follow-up work.
This change also avoids rebuilding the same proto plan repeatedly for metadata-only operations on the same immutable DataFrame, while adding regression coverage around the existing schema-caching behavior.
How was this patch tested?
Built Spark locally with Hive enabled:
Ran targeted Spark Connect tests with:
Also ran broader Connect regression coverage with:
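The exact commands were not included above; for reference, a typical sequence for a change in this area might look like the following. The profile flags and test module names are assumptions for illustration, not the commands actually run:

```shell
# Build Spark locally with Hive support enabled (assumed profile flags)
./build/mvn -DskipTests -Phive clean package

# Run targeted Spark Connect DataFrame tests (assumed test name)
python/run-tests --testnames pyspark.sql.tests.connect.test_connect_dataframe

# Run broader Connect regression coverage (assumed module name)
python/run-tests --modules pyspark-connect
```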