[Enh]: Add Expr.map_batches to pyspark #3579
pedro-villanueva-bcom wants to merge 11 commits into narwhals-dev:main
Conversation
force-pushed the branch from c95fa6f to e5b315f
I don't understand this test failure: https://github.com/narwhals-dev/narwhals/actions/runs/25050142610/job/73375481418?pr=3579 The coverage test result is also strange: https://github.com/narwhals-dev/narwhals/actions/runs/25050142614/job/73375481310?pr=3579
force-pushed the branch from 491bdd7 to 0d622c0
Hey @pedro-villanueva-bcom - thanks for taking the initiative! I am not sure we should support map_batches for lazy backends, but I am open to seeing how this plays out. Regarding your questions:
I am not sure, but it would not be the first time that something passes for pyspark but not for pyspark-connect.
Coverage is calculated with the SQLFrame backend, so you will need to add a
Any specific reason for this? In my mind (and use case), UDFs are just another type of expression for creating a column. It certainly has performance implications, but in my case there's no other choice (these are mostly statistical functions, like getting a p-value from a column of z-scores). My use case is a library of statistical functions for large datasets that works for pyspark, pyspark-connect, and snowpark. I want to make it work for in-memory backends too, to handle small data and make testing faster. I discovered narwhals and I'm quite happy with it: the syntax is nice (nicer than ibis) and migrating is not that hard.
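To make the use case concrete, here is a hypothetical batch function of the kind described (not code from this PR): a two-sided p-value computed from a column of z-scores. A function like this takes a pandas Series and returns one, so with map_batches it could run unchanged on an in-memory pandas backend or, via pandas UDFs, on pyspark workers.

```python
import math

import pandas as pd


def p_value_from_z(z: pd.Series) -> pd.Series:
    # Two-sided p-value: 2 * (1 - Phi(|z|)), which equals erfc(|z| / sqrt(2))
    return z.abs().apply(lambda v: math.erfc(v / math.sqrt(2)))
```

With the narwhals API this could then be applied as something like `nw.col("z").map_batches(p_value_from_z)` (exact call shape depends on the backend and on parameters such as `return_dtype`).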
changed the title Expr.map _batches to pyspark → Expr.map_batches to pyspark
force-pushed the branch from 1572e5f to 00eaa83
Description
Expr.map_batches can be used when native expressions aren't enough, for example for statistical functions. Pyspark has several types of UDFs, including pandas UDFs, which match map_batches very well. This PR implements map_batches using pandas UDFs. The optional parameter returns_scalar is not supported, as pyspark doesn't allow it. The UDF must return a pandas Series, something that can be converted into one, or a scalar, which will be broadcast to a Series. The only change external to the spark backend is the kind of the map_batches node, which has been changed from ordered to unordered.
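The return-value rules above can be sketched with a small hypothetical helper (an illustration, not the PR's actual implementation): the UDF result is passed through unchanged when it is already a Series, a bare scalar is broadcast to the batch length, and anything else is converted to a Series.

```python
import pandas as pd


def coerce_batch_result(result, batch_len):
    # Hypothetical sketch of the return-value handling described above.
    if isinstance(result, pd.Series):
        return result
    if pd.api.types.is_scalar(result):
        # A bare scalar is broadcast to the length of the batch.
        return pd.Series(result, index=range(batch_len))
    # Array-likes (lists, numpy arrays, ...) are converted to a Series.
    return pd.Series(result)
```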
Additionally, the testing fixture that creates the spark session now sets the PYSPARK_PYTHON env var, so that UDFs run with that Python interpreter (including whatever packages are installed).
What type of PR is this? (check all applicable)
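The fixture change described above can be sketched as follows (assumed shape; the actual fixture lives in the test suite): point PYSPARK_PYTHON at the current interpreter before the SparkSession is created, so UDF workers use the same Python and installed packages as the driver.

```python
import os
import sys


def configure_udf_python() -> None:
    # Make pyspark UDF workers use the driver's interpreter and packages.
    # Must run before the SparkSession is created.
    os.environ["PYSPARK_PYTHON"] = sys.executable
```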
Related issues
Checklist