
[Enh]: Add Expr.map_batches to pyspark#3579

Open
pedro-villanueva-bcom wants to merge 11 commits into narwhals-dev:main from
pedro-villanueva-bcom:add_pyspark_map_batches

Conversation

@pedro-villanueva-bcom
Contributor

Description

Expr.map_batches can be used when native expressions aren't enough, for example for statistical functions. PySpark has several types of UDFs, including pandas UDFs, which match map_batches very well. This PR implements map_batches using pandas UDFs. The optional parameter returns_scalar is not supported, as PySpark doesn't allow it. The UDF must return either a pandas Series, something that can be transformed into one, or a scalar that will be broadcast to one.

The only change external to the Spark backend is the kind of the map_batches node, which has been changed from ordered to unordered.
Additionally, the testing fixture that creates the Spark session now sets the PYSPARK_PYTHON env var so that UDFs run with that Python (including whatever packages are installed).
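The fixture change described above amounts to something like the following (a minimal sketch; the actual fixture code in the PR may differ):

```python
import os
import sys

# Point Spark workers at the interpreter running the test suite, so
# pandas UDFs execute with the same Python and see the same installed
# packages as the driver.
os.environ["PYSPARK_PYTHON"] = sys.executable
```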

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@pedro-villanueva-bcom
Contributor Author

pedro-villanueva-bcom commented Apr 28, 2026

I messed up the commits, sorry. Fixed.

@pedro-villanueva-bcom
Contributor Author

I don't understand this test failure: https://github.com/narwhals-dev/narwhals/actions/runs/25050142610/job/73375481418?pr=3579
It runs fine on my local machine. The error says pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN_OR_STR] Argument col should be a Column or str, got Column., which is really weird.

The coverage check is also strange: https://github.com/narwhals-dev/narwhals/actions/runs/25050142614/job/73375481310?pr=3579
It doesn't run for PySpark, so coverage is below 100% because of the new code. Should I add PySpark to that test?

@pedro-villanueva-bcom pedro-villanueva-bcom force-pushed the add_pyspark_map_batches branch 2 times, most recently from 491bdd7 to 0d622c0 Compare April 28, 2026 14:30
@pedro-villanueva-bcom pedro-villanueva-bcom marked this pull request as draft April 28, 2026 14:31
@FBruzzesi
Member

Hey @pedro-villanueva-bcom - thanks for taking the initiative! I am not sure we should support map_batches for lazy backends, but I am open to seeing how this plays out!

Regarding your questions:

I don't understand this test failure: narwhals-dev/narwhals/actions/runs/25050142610/job/73375481418?pr=3579 It runs fine on my local machine. The error says pyspark.errors.exceptions.base.PySparkTypeError: [NOT_COLUMN_OR_STR] Argument col should be a Column or str, got Column., which is really weird.

I am not sure, but it would not be the first time that something is passing for pyspark but not for pyspark-connect.
It's ok to explicitly fail for spark-connect if you are not able to replicate and fix the issue.

This test coverage test is also strange: narwhals-dev/narwhals/actions/runs/25050142614/job/73375481310?pr=3579 It doesn't run for pyspark, so the coverage is below 100% because of the new code, should I add pyspark to that test?

Coverage is calculated with SQLFrame backend, so you will need to add a # pragma: no cover for the entire method if SQLFrame is not supported.
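If SQLFrame can't support the method, the exclusion could look roughly like this (a hypothetical skeleton, not the actual narwhals class):

```python
class SparkLikeExpr:
    def map_batches(self, function, return_dtype):  # pragma: no cover
        # Coverage is measured against the SQLFrame backend, which has no
        # pandas UDF support, so the whole method is excluded from coverage.
        raise NotImplementedError("map_batches is not supported for SQLFrame")
```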

@pedro-villanueva-bcom
Contributor Author

pedro-villanueva-bcom commented Apr 28, 2026

I am not sure we should support map_batches for lazy backends - I am open to see how this will play out!

Any specific reason for this? In my mind (and in my use case), UDFs are just another type of expression for creating a column. There are performance implications for sure, but in my case there's no other choice (mostly statistical functions, like getting a p-value from a column of z-scores, for example).
Additionally, I wanted to look into how to support aggregation UDFs. I see that Polars has a map_groups function. I also use the PySpark equivalent for, again, statistical summaries that can't be calculated in other ways (though this case is a little less frequent).
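For instance, the p-value case mentioned above could be a plain pandas function passed to map_batches (a sketch using only the standard library's error function; the function name is made up):

```python
import math

import pandas as pd


def p_value_from_z(z: pd.Series) -> pd.Series:
    # Two-sided p-value: p = 2 * (1 - Phi(|z|)), where the standard
    # normal CDF Phi is expressed via the error function erf.
    phi = (1 + (z.abs() / math.sqrt(2)).apply(math.erf)) / 2
    return 2 * (1 - phi)
```

Assuming the API this PR targets, it would then be invoked as something like `nw.col("z").map_batches(p_value_from_z)`.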

My use case is a library of statistical functions for large datasets that works for PySpark, PySpark Connect, and Snowpark. I want to make it work for in-memory backends too, to handle small data and make testing faster. I discovered Narwhals and I'm quite happy with it: the syntax is nice (nicer than Ibis) and migrating is not too hard.
Let me know if you want to talk about this more; happy to do it.

@pedro-villanueva-bcom pedro-villanueva-bcom marked this pull request as ready for review April 29, 2026 10:09
@pedro-villanueva-bcom pedro-villanueva-bcom changed the title [Enh]: Add Expr.map _batches to pyspark [Enh]: Add Expr.map_batches to pyspark Apr 29, 2026


Development

Successfully merging this pull request may close these issues.

[Enh]: Add map_batches to pyspark
