
Accelerate mean() with a dedicated fast sum() path#741

Open
tigercosmos wants to merge 1 commit into solvcon:master from tigercosmos:fast_mean_sum

Conversation


@tigercosmos tigercosmos commented Apr 19, 2026

Motivation

The existing SimpleArray::mean() walks every element through a multi-dimensional index iterator (first_sidx / next_sidx / at(sidx)). That per-element index arithmetic dominates the runtime and makes mean() several times slower than it needs to be, especially on large arrays. This PR continues the work started in #589 and takes another pass at the same goal: make mean() fast without introducing threading or guessed micro-optimizations.

What changes

A new mixin SimpleArrayMixinSum provides sum() with two code paths. The contiguous path walks the buffer with a straight pointer loop so the compiler can auto-vectorize; the strided path walks by the innermost dimension, computing each row's base offset once per outer iteration instead of for every element. mean() now delegates to this sum() and simply divides by size().

The refactor also fixes a latent correctness bug. The previous sum() iterated data(i) by a linear buffer index, so for a non-contiguous array (any slice or transpose) it happily summed the wrong elements. The new strided path respects the stride, and there is a regression test covering it.
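As a rough illustration, the two paths can be sketched in Python (the real implementation is C++ inside SimpleArrayMixinSum; the names and signatures below are hypothetical, and the strided path is shown for the 2-D case only):

```python
def sum_contiguous(buf):
    # Contiguous path: one straight pass over the underlying buffer.
    # In the C++ version this is a raw pointer loop the compiler can
    # auto-vectorize.
    total = 0.0
    for x in buf:
        total += x
    return total


def sum_strided_2d(buf, shape, stride):
    # Strided path (2-D for brevity): walk by the innermost dimension and
    # compute each row's base offset once per outer iteration, instead of
    # doing full index arithmetic per element.
    nrow, ncol = shape
    row_stride, col_stride = stride  # strides in elements, not bytes
    total = 0.0
    for i in range(nrow):
        base = i * row_stride
        for j in range(ncol):
            total += buf[base + j * col_stride]
    return total
```

Note that the strided path visits exactly the elements the view owns, which is what the contiguous linear walk gets wrong on a slice.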

What is deliberately not included

This PR drops the threading and manual loop unrolling from the earlier iteration of #589, following the review feedback on that PR. Threads belong to a larger scheduling system that does not exist yet, and a hand-rolled unroll that funnels every partial into a single accumulator does not actually help the compiler vectorize. Private helpers that do not touch the object's state are marked static, matching the reviewer's request.

Edge cases covered

Empty arrays no longer read past the end of the buffer: sum() returns zero on any shape whose product is zero, and mean() throws on an empty input rather than producing 0/0 (NaN for floats, UB for integers). Ghost-cell behavior is unchanged — both old and new paths include ghost cells, which matches how data() and the old first_sidx iterator behaved.
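A minimal sketch of the empty-input guard described above (hypothetical names; the real check lives in the C++ mean()):

```python
def mean_guarded(values):
    # sum() of an empty array is defined as zero, but mean() must refuse to
    # divide by a zero size instead of silently producing 0/0 (NaN for
    # floats, undefined behavior for integers in C++).
    n = len(values)
    if n == 0:
        raise ValueError("mean(): empty array")
    return sum(values) / n
```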

Benchmarks

Times are microseconds per call, best-of-5 runs of 10 iterations each, then averaged across two independent full runs. Machine: Apple Silicon, Python 3.14, release build. "before" = upstream/master at 3d73a5d; "after" = this branch HEAD.

mean() — 3× to 6× faster

| case | before (µs) | after (µs) | speedup |
| --- | ---: | ---: | ---: |
| 1D contig N=1,000 | 3.26 | 1.00 | 3.26× |
| 1D contig N=100,000 | 311.65 | 74.23 | 4.20× |
| 1D contig N=10,000,000 | 31,105.37 | 6,939.76 | 4.48× |
| 3D contig 100³ | 3,862.07 | 684.52 | 5.64× |
| 3D strided 100³ | 4,285.14 | 873.83 | 4.90× |
| 3D F-contig 100³ | 3,890.77 | 684.21 | 5.69× |
| 1D F-contig N=1e7 | 30,999.39 | 6,898.32 | 4.49× |
| 4D strided 25⁴ | 1,552.80 | 260.47 | 5.96× |

sum() — flat on contiguous, modest cost on strided

| case | before (µs) | after (µs) | ratio (before/after) |
| --- | ---: | ---: | ---: |
| 1D contig N=1,000 | 0.94 | 0.94 | 1.00× |
| 1D contig N=100,000 | 70.68 | 73.73 | 0.96× |
| 1D contig N=10,000,000 | 7,116.49 | 6,925.18 | 1.03× |
| 3D contig 100³ | 803.65 | 688.69 | 1.17× |
| 3D strided 100³ | 800.55 | 968.03 | 0.83× |
| 3D F-contig 100³ | 808.26 | 687.64 | 1.18× |
| 1D F-contig N=1e7 | 7,094.45 | 6,897.08 | 1.03× |
| 4D strided 25⁴ | 299.41 | 257.64 | 1.16× |

Note: the sum() "before" numbers on strided cases reflect the old buggy behavior (wrong data, read linearly along the buffer), so the comparison measures cost, not correctness. The new strided sum is correct; on the 3D strided 100³ case it is somewhat slower than the old buggy linear walk (0.83×) because it now follows the stride properly. On contiguous shapes, sum() is unchanged or slightly faster.
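The correctness point can be demonstrated with NumPy by simulating the old linear walk against a non-contiguous view (illustrative only; the actual bug was in the C++ sum()):

```python
import numpy as np

base = np.arange(27.0).reshape(3, 3, 3)
view = base[::2, ::2, ::2]      # 2x2x2 non-contiguous slice of the buffer
buf = base.reshape(-1)          # the underlying linear buffer

# Old behavior: walk the first size() elements of the linear buffer,
# ignoring the view's strides entirely.
buggy = buf[:view.size].sum()

# New behavior: a stride-aware traversal, which NumPy does natively.
correct = view.sum()

# The two disagree on any non-trivial slice; that is the bug the PR fixes.
```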

How to reproduce the benchmark

The benchmark script and raw results are not part of this PR; they live locally at profiling/bench_mean_compare.py. A minimal reproducer:

import os, time
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")
import numpy as np
import modmesh

def bench(fn, iters=5):
    fn()  # warmup
    ts = []
    for _ in range(iters):
        t0 = time.perf_counter()
        for _ in range(10):
            fn()
        ts.append((time.perf_counter() - t0) / 10)
    return min(ts) * 1e6  # microseconds

cases = []
for N in [1_000, 100_000, 10_000_000]:
    a = np.random.rand(N).astype(np.float64)
    cases.append((f"1D contig N={N}", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(100, 100, 100).astype(np.float64)
cases.append(("3D contig 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(300, 300, 300).astype(np.float64)[::3, ::3, ::3]
cases.append(("3D strided 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.asfortranarray(np.random.rand(100, 100, 100).astype(np.float64))
cases.append(("3D F-contig 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.asfortranarray(np.random.rand(10_000_000).astype(np.float64))
cases.append(("1D F-contig N=1e7", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(50, 50, 50, 50).astype(np.float64)[::2, ::2, ::2, ::2]
cases.append(("4D strided 25^4", a, modmesh.SimpleArrayFloat64(array=a)))

print(f"{'case':<22} {'np.mean us':>12} {'sa.mean us':>12} {'sa.sum us':>12}")
for name, a, sa in cases:
    print(f"{name:<22} {bench(lambda: np.mean(a)):>12.2f} "
          f"{bench(lambda: sa.mean()):>12.2f} {bench(lambda: sa.sum()):>12.2f}")

Procedure to get before/after numbers:

  1. Check out upstream/master, build with make buildext, run the script, record numbers.
  2. Apply this branch on top, rebuild with make buildext, run the same script, record numbers.
  3. Compare — sa.mean should drop by 3–6× on every shape; sa.sum stays about the same on contiguous shapes.
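The speedup and ratio columns in the tables above are simply before/after; a tiny helper reproduces them from the recorded timings (the sample values are rows taken from the mean() table):

```python
def speedup(before_us, after_us):
    # Ratio of old to new time in microseconds; > 1 means the new code
    # is faster.
    return before_us / after_us


# Spot-check two rows of the mean() benchmark table.
contig_100k = round(speedup(311.65, 74.23), 2)     # 1D contig N=100,000
contig_3d = round(speedup(3862.07, 684.52), 2)     # 3D contig 100^3
```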

Relation to #589

This is a rework of #589 based on the review there. It keeps the core idea (dedicated fast sum() path used by mean()), drops the parts the reviewer rejected (threads, unclear unroll), and adds the empty-array guards plus strided-sum correctness fix.

@tigercosmos tigercosmos marked this pull request as draft April 19, 2026 12:38
@tigercosmos tigercosmos force-pushed the fast_mean_sum branch 2 times, most recently from 07ed4a9 to 1ffa8e4 Compare April 28, 2026 20:05
@tigercosmos tigercosmos marked this pull request as ready for review April 28, 2026 20:16
Add SimpleArrayMixinSum with two code paths for sum(): a contiguous
path that walks the buffer directly, and a strided path that walks
by the innermost dimension and computes the row base offset once
per outer iteration. Rewrite mean() as sum() / size() so both share
the same fast path. This also fixes a latent bug where the old sum()
walked data(i) by logical index on non-contiguous arrays and read
the wrong elements.

@tigercosmos tigercosmos left a comment


@yungyuc Please take a look, thanks!

-    return sum / static_cast<value_type>(total);
+        throw std::runtime_error("SimpleArray::mean(): empty array");
+    }
+    return athis->sum() / static_cast<value_type>(n);

Meat.

Comment on lines +190 to +194
if (athis->is_c_contiguous() || athis->is_f_contiguous())
{
return sum_contiguous(athis->data(), n);
}
return sum_strided(athis->data(), athis->shape(), athis->stride());

Properly handle the sum.

Comment thread: tests/test_buffer.py
self.assertEqual(sarr.min(), -2.3)
self.assertEqual(sarr.max(), 9.2)

def test_sum_non_contiguous(self):

Add a few test cases to increase the coverage.

