
Accelerate mean() with a dedicated fast sum() path#741

Open
tigercosmos wants to merge 1 commit into solvcon:master from tigercosmos:fast_mean_sum

Conversation


@tigercosmos tigercosmos commented Apr 19, 2026

Motivation

The existing SimpleArray::mean() walks every element through a multi-dimensional index iterator (first_sidx / next_sidx / at(sidx)). That per-element index arithmetic dominates the runtime and makes mean() several times slower than it needs to be, especially on large arrays. This PR continues the work started in #589 and takes another pass at the same goal: make mean() fast without introducing threading or guessed micro-optimizations.

What changes

A new mixin SimpleArrayMixinSum provides sum() with two code paths. The contiguous path walks the buffer with a straight pointer loop so the compiler can auto-vectorize; the strided path walks by the innermost dimension, computing each row's base offset once per outer iteration instead of for every element. mean() now delegates to this sum() and simply divides by size().

The refactor also fixes a latent correctness bug. The previous sum() iterated data(i) by a linear buffer index, so for a non-contiguous array (any slice or transpose) it happily summed the wrong elements. The new strided path respects the stride, and there is a regression test covering it.
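As a rough illustration, the two paths can be sketched in Python (the real implementation is C++ inside SimpleArrayMixinSum; the names and signatures below are hypothetical, and the strided path is shown for the 2-D case only):

```python
def sum_contiguous(buf):
    # Contiguous path: one straight pass over the underlying buffer.
    # In the C++ version this is a raw pointer loop the compiler can
    # auto-vectorize.
    total = 0.0
    for x in buf:
        total += x
    return total


def sum_strided_2d(buf, shape, stride):
    # Strided path (2-D for brevity): walk by the innermost dimension and
    # compute each row's base offset once per outer iteration, instead of
    # doing full index arithmetic per element.
    nrow, ncol = shape
    row_stride, col_stride = stride  # strides in elements, not bytes
    total = 0.0
    for i in range(nrow):
        base = i * row_stride
        for j in range(ncol):
            total += buf[base + j * col_stride]
    return total
```

Note that the strided path visits exactly the elements the view owns, which is what the contiguous linear walk gets wrong on a slice.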

What is deliberately not included

This PR drops the threading and manual loop unrolling from the earlier iteration of #589, following the review feedback on that PR. Threads belong to a larger scheduling system that does not exist yet, and a hand-rolled unroll that funnels every partial into a single accumulator does not actually help the compiler vectorize. Private helpers that do not touch the object's state are marked static, matching the reviewer's request.

Edge cases covered

Empty arrays no longer read past the end of the buffer: sum() returns zero on any shape whose product is zero, and mean() throws on an empty input rather than producing 0/0 (NaN for floats, UB for integers). Ghost-cell behavior is unchanged — both old and new paths include ghost cells, which matches how data() and the old first_sidx iterator behaved.
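A minimal sketch of the empty-input guard described above (hypothetical names; the real check lives in the C++ mean()):

```python
def mean_guarded(values):
    # sum() of an empty array is defined as zero, but mean() must refuse to
    # divide by a zero size instead of silently producing 0/0 (NaN for
    # floats, undefined behavior for integers in C++).
    n = len(values)
    if n == 0:
        raise ValueError("mean(): empty array")
    return sum(values) / n
```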

Benchmarks

Times are microseconds per call, best-of-5 runs of 10 iterations each, then averaged across two independent full runs. Machine: Apple Silicon, Python 3.14, release build. "before" = upstream/master at 3d73a5d; "after" = this branch HEAD.

mean() — 3× to 6× faster

| case | before (µs) | after (µs) | speedup |
| --- | ---: | ---: | ---: |
| 1D contig N=1,000 | 3.26 | 1.00 | 3.26× |
| 1D contig N=100,000 | 311.65 | 74.23 | 4.20× |
| 1D contig N=10,000,000 | 31,105.37 | 6,939.76 | 4.48× |
| 3D contig 100³ | 3,862.07 | 684.52 | 5.64× |
| 3D strided 100³ | 4,285.14 | 873.83 | 4.90× |
| 3D F-contig 100³ | 3,890.77 | 684.21 | 5.69× |
| 1D F-contig N=1e7 | 30,999.39 | 6,898.32 | 4.49× |
| 4D strided 25⁴ | 1,552.80 | 260.47 | 5.96× |

sum() — flat on contiguous, modest cost on strided

| case | before (µs) | after (µs) | ratio (before/after) |
| --- | ---: | ---: | ---: |
| 1D contig N=1,000 | 0.94 | 0.94 | 1.00× |
| 1D contig N=100,000 | 70.68 | 73.73 | 0.96× |
| 1D contig N=10,000,000 | 7,116.49 | 6,925.18 | 1.03× |
| 3D contig 100³ | 803.65 | 688.69 | 1.17× |
| 3D strided 100³ | 800.55 | 968.03 | 0.83× |
| 3D F-contig 100³ | 808.26 | 687.64 | 1.18× |
| 1D F-contig N=1e7 | 7,094.45 | 6,897.08 | 1.03× |
| 4D strided 25⁴ | 299.41 | 257.64 | 1.16× |

Note: the sum() "before" numbers on strided cases reflect the old buggy behavior (wrong data, read linearly along the buffer), so the comparison measures cost, not correctness. The new strided sum is correct; on the 3D strided 100³ case it is somewhat slower than the old buggy linear walk (0.83×) because it now follows the stride properly. On contiguous shapes, sum() is unchanged or slightly faster.
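The correctness point can be demonstrated with NumPy by simulating the old linear walk against a non-contiguous view (illustrative only; the actual bug was in the C++ sum()):

```python
import numpy as np

base = np.arange(27.0).reshape(3, 3, 3)
view = base[::2, ::2, ::2]      # 2x2x2 non-contiguous slice of the buffer
buf = base.reshape(-1)          # the underlying linear buffer

# Old behavior: walk the first size() elements of the linear buffer,
# ignoring the view's strides entirely.
buggy = buf[:view.size].sum()

# New behavior: a stride-aware traversal, which NumPy does natively.
correct = view.sum()

# The two disagree on any non-trivial slice; that is the bug the PR fixes.
```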

How to reproduce the benchmark

The benchmark script and raw results are not part of this PR; they live locally at profiling/bench_mean_compare.py. A minimal reproducer:

import os, time
os.environ.setdefault("QT_QPA_PLATFORM", "offscreen")
import numpy as np
import modmesh

def bench(fn, iters=5):
    fn()  # warmup
    ts = []
    for _ in range(iters):
        t0 = time.perf_counter()
        for _ in range(10):
            fn()
        ts.append((time.perf_counter() - t0) / 10)
    return min(ts) * 1e6  # microseconds

cases = []
for N in [1_000, 100_000, 10_000_000]:
    a = np.random.rand(N).astype(np.float64)
    cases.append((f"1D contig N={N}", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(100, 100, 100).astype(np.float64)
cases.append(("3D contig 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(300, 300, 300).astype(np.float64)[::3, ::3, ::3]
cases.append(("3D strided 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.asfortranarray(np.random.rand(100, 100, 100).astype(np.float64))
cases.append(("3D F-contig 100^3", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.asfortranarray(np.random.rand(10_000_000).astype(np.float64))
cases.append(("1D F-contig N=1e7", a, modmesh.SimpleArrayFloat64(array=a)))

a = np.random.rand(50, 50, 50, 50).astype(np.float64)[::2, ::2, ::2, ::2]
cases.append(("4D strided 25^4", a, modmesh.SimpleArrayFloat64(array=a)))

print(f"{'case':<22} {'np.mean us':>12} {'sa.mean us':>12} {'sa.sum us':>12}")
for name, a, sa in cases:
    print(f"{name:<22} {bench(lambda: np.mean(a)):>12.2f} "
          f"{bench(lambda: sa.mean()):>12.2f} {bench(lambda: sa.sum()):>12.2f}")

Procedure to get before/after numbers:

  1. Check out upstream/master, build with make buildext, run the script, record numbers.
  2. Apply this branch on top, rebuild with make buildext, run the same script, record numbers.
  3. Compare — sa.mean should drop by 3–6× on every shape; sa.sum stays about the same on contiguous shapes.
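The speedup and ratio columns in the tables above are simply before/after; a tiny helper reproduces them from the recorded timings (the sample values are rows taken from the mean() table):

```python
def speedup(before_us, after_us):
    # Ratio of old to new time in microseconds; > 1 means the new code
    # is faster.
    return before_us / after_us


# Spot-check two rows of the mean() benchmark table.
contig_100k = round(speedup(311.65, 74.23), 2)     # 1D contig N=100,000
contig_3d = round(speedup(3862.07, 684.52), 2)     # 3D contig 100^3
```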

Relation to #589

This is a rework of #589 based on the review there. It keeps the core idea (dedicated fast sum() path used by mean()), drops the parts the reviewer rejected (threads, unclear unroll), and adds the empty-array guards plus strided-sum correctness fix.

@tigercosmos tigercosmos marked this pull request as draft April 19, 2026 12:38
@tigercosmos tigercosmos force-pushed the fast_mean_sum branch 2 times, most recently from 07ed4a9 to 1ffa8e4 Compare April 28, 2026 20:05
@tigercosmos tigercosmos marked this pull request as ready for review April 28, 2026 20:16
Add SimpleArrayMixinSum with two code paths for sum(): a contiguous
path that walks the buffer directly, and a strided path that walks
by the innermost dimension and computes the row base offset once
per outer iteration. Rewrite mean() as sum() / size() so both share
the same fast path. This also fixes a latent bug where the old sum()
walked data(i) by logical index on non-contiguous arrays and read
the wrong elements.

@tigercosmos tigercosmos left a comment


@yungyuc Please take a look, thanks!

-    return sum / static_cast<value_type>(total);
+        throw std::runtime_error("SimpleArray::mean(): empty array");
+    }
+    return athis->sum() / static_cast<value_type>(n);

Meat.

Comment on lines +190 to +194
if (athis->is_c_contiguous() || athis->is_f_contiguous())
{
return sum_contiguous(athis->data(), n);
}
return sum_strided(athis->data(), athis->shape(), athis->stride());

Properly handle the sum.

Comment thread: tests/test_buffer.py
self.assertEqual(sarr.min(), -2.3)
self.assertEqual(sarr.max(), 9.2)

def test_sum_non_contiguous(self):

Add a few test cases to increase the coverage.

