Unify SIMD arithmetic under a shared transform_binary template by tigercosmos · Pull Request #740 · solvcon/modmesh

tigercosmos · 2026-04-19T12:18:16Z

Summary

Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add / sub / mul / div. This PR collapses them into a single transform_binary per backend that takes the operation as an injected functor.

What changed

In the generic backend, the four ops now pass std::plus / std::minus / std::multiplies / std::divides into transform_binary. In the NEON backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers around neon_alias, and std::invocable routes types without a matching vector overload (e.g. int64 for vmulq) to the scalar path at compile time. This replaces the ad-hoc vector_lane > 2 and is_floating_point_v guards.
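To make the generic-backend shape concrete, here is a minimal sketch of what a functor-injected loop could look like. The `sketch` namespace and the exact body are assumptions for illustration; only the call form `transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{})` is taken from the diff in this PR.

```cpp
#include <cstddef>
#include <functional>

namespace sketch
{

// Hypothetical stand-in for the generic transform_binary: applies `op`
// elementwise and writes the results into [dest, dest_end).
template <typename T, typename Op>
inline void transform_binary(T * dest, T const * dest_end, T const * src1, T const * src2, Op op)
{
    T * ptr = dest;
    while (ptr < dest_end)
    {
        *ptr = op(*src1, *src2);
        ++ptr;
        ++src1;
        ++src2;
    }
}

// Each arithmetic op becomes a one-line wrapper that injects a standard functor.
template <typename T>
inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
}

template <typename T>
inline void sub(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    transform_binary<T>(dest, dest_end, src1, src2, std::minus<T>{});
}

} // namespace sketch
```

Under this shape, adding a new elementwise operation would only need another one-line wrapper with a different functor.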

Bugs fixed along the way

Sub-lane UB in NEON. ptr <= dest_end - N_lane formed a pointer before the buffer when the input was shorter than one SIMD lane. Now compares remaining length (dest_end - ptr >= N_lane), and the scalar remainder is inline instead of a recursive call into generic::.
check_between diagnostic. The SIMD body checked the >= max mask first and only looked at < min if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
has_vectype typing. Declared size_t; retyped to bool to match its predicate role.

Tests

tests/test_simd.py pins _simd_feature() == "NEON" on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n=1, 3, 4, 5, 8, 17) for transform_binary, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private modmesh._modmesh._simd_feature() binding exposes the runtime-detected backend.

Follow-up

simd::check_between has inconsistent bound semantics across paths: the NEON SIMD body treats value == max_val as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.
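The inconsistency is easiest to see with two scalar predicates side by side. This is an illustrative sketch, not the project's code: the function names are hypothetical, and returning `end` when every element passes is an assumption made here to keep the example self-contained.

```cpp
#include <cstddef>

namespace sketch
{

// Accepts value == max_val, like the scalar fallback described above.
// Returns the first out-of-range element, or `end` if there is none
// (the `end` convention is an assumption of this sketch).
template <typename T>
T const * check_between_inclusive(T const * start, T const * end, T const & min_val, T const & max_val)
{
    T const * ptr = start;
    while (ptr < end)
    {
        if (*ptr < min_val || *ptr > max_val)
        {
            return ptr;
        }
        ++ptr;
    }
    return end;
}

// Rejects value == max_val, like the SIMD body described above.
template <typename T>
T const * check_between_half_open(T const * start, T const * end, T const & min_val, T const & max_val)
{
    T const * ptr = start;
    while (ptr < end)
    {
        if (*ptr < min_val || *ptr >= max_val)
        {
            return ptr;
        }
        ++ptr;
    }
    return end;
}

} // namespace sketch
```

With the same data and bounds, the two variants disagree exactly on elements equal to `max_val`, which is the behavior gap left for the follow-up change.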

Test plan

make gtest
tests/test_simd.py on aarch64
CI on Linux / macOS / aarch64

🤖 Generated with Claude Code

tigercosmos

@yungyuc The PR is ready for review. Thanks!

tigercosmos · 2026-04-27T20:18:36Z

+        // every correctness check. Kept under an underscore-prefixed name
+        // because detect_simd() only meaningfully reflects the dispatched
+        // backend on aarch64 today; on other targets it would mislead users.
+        mod.def("_simd_feature", &simd_feature_name);


This is for checking whether SIMD is working.

cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

tigercosmos · 2026-04-27T20:19:28Z

+struct vec_add
 {
-    return generic::check_between<T>(start, end, min_val, max_val);
-}
-
-template <typename T, typename std::enable_if_t<type::has_vectype<T>> * = nullptr>
-const T * check_between(T const * start, T const * end, T const & min_val, T const & max_val)
+    template <typename V>
+    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
+};


Key design in this PR.

tigercosmos · 2026-04-27T20:20:45Z

-        constexpr size_t N_lane = type::vector_lane<T>;
+        if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
+        {
+            generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);


T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.

tigercosmos · 2026-04-27T20:24:54Z

+            {
+                vec_t v1 = vld1q(src1);
+                vec_t v2 = vld1q(src2);
+                vst1q(ptr, vec_op(v1, v2));


vec_op is called here.

tigercosmos · 2026-04-27T20:26:56Z

+    if constexpr (!type::has_vectype<T>)
    {
-        return generic::add<T>(dest, dest_end, src1, src2);
+        generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);


The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.
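A lane-count trait of this kind might be sketched as follows. The specializations shown are hypothetical and chosen only for illustration; which scalar types the real backend maps to NEON vector types is not asserted here. The `bool` predicate follows the retyped declaration in this PR.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch
{

namespace detail
{

// Primary template: no vector type, so zero lanes.
template <typename T>
struct vector
{
    static constexpr std::size_t N_lane = 0;
};

// Hypothetical specializations for types that do have a vector form.
template <>
struct vector<std::int32_t>
{
    static constexpr std::size_t N_lane = 4;
};

template <>
struct vector<float>
{
    static constexpr std::size_t N_lane = 4;
};

} // namespace detail

template <typename T>
inline constexpr std::size_t vector_lane = detail::vector<T>::N_lane;

// Predicate typed as bool to match its role.
template <typename T>
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;

// Mirrors the first dispatch check: types without a vector form go scalar.
template <typename T>
constexpr bool takes_scalar_path()
{
    if constexpr (!has_vectype<T>)
    {
        return true;
    }
    else
    {
        return false;
    }
}

} // namespace sketch
```

This check is separate from the `std::invocable` one: it rules out types with no vector register form at all, before any specific operation is considered.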

tigercosmos · 2026-04-27T20:34:49Z


-#include <cstddef>
 #include <arm_neon.h>
+#include <cstddef>


The formatter fixed the include order. I think it should be fine, but let me know if I should revert it.

tigercosmos · 2026-04-27T20:34:59Z


 template <typename T>
-inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
+inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;


Fixed the boolean type.

Good catch.

tigercosmos · 2026-04-27T20:35:40Z

+inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 - *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }


Main design of this PR.

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

tigercosmos · 2026-04-27T20:36:20Z

+        if platform.machine() in ("arm64", "aarch64"):
+            self.assertEqual(feature, "NEON")


This checks that NEON is actually in use, which we did not test before.

tigercosmos · 2026-04-27T20:37:36Z

+            self.skipTest("_simd_feature() = " + feature)
+
+
+class SimdTransformBinaryTC(unittest.TestCase):


Some cases for checking transform_binary functionality.

Why do you isolate this unit test out from test_buffer.py?

The whole SIMD implementation also lives outside the buffer directory, so I think it is worth a new file.

Refs solvcon#646. Replace the four duplicated add/sub/mul/div loops in both simd_generic.hpp and neon/neon.hpp with a single transform_binary template parameterized by a scalar functor (and, on NEON, a vector functor). Adding a new elementwise op now means writing a functor instead of two near-identical scalar-tail loops.

The NEON path also fixes:

* Sub-lane UB: `dest_end - N_lane` formed a pointer before the buffer when the input was shorter than one vector lane. Compare on remaining length instead.
* check_between returned the first too-large lane in a block but not the first too-small one. Inspect both bounds before picking a winner so the diagnostic pointer is deterministic.
* has_vectype was typed `size_t` (silently truthy for any non-zero lane count); retype to `bool` to match its predicate role.
* Use std::invocable to let int64 mul (no vmulq overload) fall back to the scalar path automatically, removing the ad-hoc `vector_lane > 2` and `is_floating_point_v` guards.

Expose a private modmesh._modmesh._simd_feature() so tests/test_simd.py can assert NEON dispatch is actually active on aarch64 -- without that guard, a regression that silently routed everything to the scalar path would still pass every correctness check. The test reaches the binding through `modmesh.core._impl`, which works whether the C++ extension is installed as a top-level module or as a package submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yungyuc · 2026-04-27T22:53:00Z

@KHLee529 Could you please take a look?

KHLee529 · 2026-04-28T10:48:38Z

The unified backend looks nice at first glance. I'll dive into the details later.

yungyuc

Clarify if some for loops can also be replaced with while.
Run performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
Rename test_simd.py to test_buffer_simd.py. We can discuss which name is better.
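A before/after runtime comparison could start from a small harness like the one below. It is a sketch under stated assumptions: `add_loop`, `transform_binary`, and `best_time_ns` are hypothetical names written for this example, and the numbers it produces depend on the compiler, flags, and machine, so it does not substitute for profiling the real build.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

namespace sketch
{

// "Before": a hand-written addition loop.
template <typename T>
void add_loop(T * dest, T const * dest_end, T const * src1, T const * src2)
{
    T * ptr = dest;
    while (ptr < dest_end)
    {
        *ptr = *src1 + *src2;
        ++ptr;
        ++src1;
        ++src2;
    }
}

// "After": the same loop with the operation injected as a functor.
template <typename T, typename Op>
void transform_binary(T * dest, T const * dest_end, T const * src1, T const * src2, Op op)
{
    T * ptr = dest;
    while (ptr < dest_end)
    {
        *ptr = op(*src1, *src2);
        ++ptr;
        ++src1;
        ++src2;
    }
}

// Returns the best-of-`repeat` wall time in nanoseconds for calling `fn`.
template <typename Fn>
long long best_time_ns(Fn && fn, int repeat)
{
    long long best = -1;
    for (int i = 0; i < repeat; ++i)
    {
        auto const t0 = std::chrono::steady_clock::now();
        fn();
        auto const t1 = std::chrono::steady_clock::now();
        long long const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (best < 0 || ns < best)
        {
            best = ns;
        }
    }
    return best;
}

} // namespace sketch
```

Timing both variants over the same buffers with `best_time_ns` and comparing the results gives a first signal. Inspecting the generated assembly remains the more direct way to confirm that the functor call was inlined.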

yungyuc · 2026-04-29T11:16:28Z

+        // every correctness check. Kept under an underscore-prefixed name
+        // because detect_simd() only meaningfully reflects the dispatched
+        // backend on aarch64 today; on other targets it would mislead users.
+        mod.def("_simd_feature", &simd_feature_name);


cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

yungyuc · 2026-04-29T11:21:15Z

+        // Vector loop runs while a full lane still fits. The remaining-count
+        // form keeps the condition valid for buffers shorter than one lane.
+        T const * ptr = start;
+        while (static_cast<size_t>(end - ptr) >= N_lane)


while reads clearer than for.

yungyuc · 2026-04-29T11:21:26Z

-        if (ptr != dest_end)
+
+        // Tail scalar loop for remaining elements
+        for (; ptr < end; ++ptr)


Why not use while here too?

yungyuc · 2026-04-29T11:22:00Z


-#include <cstddef>
 #include <arm_neon.h>
+#include <cstddef>


yungyuc · 2026-04-29T11:22:17Z


 template <typename T>
-inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
+inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;


Good catch.

yungyuc · 2026-04-29T11:24:15Z

+inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 - *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }


I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

yungyuc · 2026-04-29T11:26:38Z

Since most tests are against SimpleArray, I suggest naming the new test file test_buffer_simd.py.

KHLee529

No change requested. Only some comments and questions listed.

KHLee529 · 2026-04-29T11:55:12Z

-template <typename T, typename std::enable_if_t<type::has_vectype<T>> * = nullptr>
-const T * check_between(T const * start, T const * end, T const & min_val, T const & max_val)
+    template <typename V>
+    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }


Can these operator helper functions also be inlined? In my experience profiling SimpleArray SIMD operations, whether the vector operations are inlined has a large impact on performance.

KHLee529 · 2026-04-29T11:57:12Z

+            }
+            while (ptr < dest_end)
+            {
+                *ptr = scalar_op(*src1, *src2);


Nice way to remove the dependency on the generic functions.

KHLee529 · 2026-04-29T12:04:22Z

    {
-        T idx = *ptr;
-        if (idx < min_val || idx > max_val)
+        if (*ptr < min_val || *ptr > max_val)


Could this refinement be slower because of the extra dereference?

KHLee529 · 2026-04-29T12:08:58Z

+            self.skipTest("_simd_feature() = " + feature)
+
+
+class SimdTransformBinaryTC(unittest.TestCase):


Why do you isolate this unit test out from test_buffer.py?

tigercosmos force-pushed the issue646 branch from 30748c1 to 62e2ec3 Compare April 19, 2026 12:21

tigercosmos marked this pull request as draft April 19, 2026 12:25

tigercosmos force-pushed the issue646 branch 2 times, most recently from 488d5cd to 11a6eb1 Compare April 19, 2026 13:09

tigercosmos force-pushed the issue646 branch from 11a6eb1 to d778870 Compare April 27, 2026 20:11

tigercosmos changed the title ~~Refactor SIMD to xsimd-style loop injection~~ Unify SIMD arithmetic under a shared transform_binary template Apr 27, 2026

tigercosmos force-pushed the issue646 branch from d778870 to fcc3b50 Compare April 27, 2026 20:33

tigercosmos marked this pull request as ready for review April 27, 2026 20:38

tigercosmos commented Apr 27, 2026


tigercosmos force-pushed the issue646 branch from 858dced to 8ff634f Compare April 27, 2026 20:47

yungyuc requested review from KHLee529 and yungyuc April 27, 2026 22:53

yungyuc assigned tigercosmos Apr 27, 2026

yungyuc added the performance (Profiling, runtime, and memory consumption) and array (Multi-dimensional array implementation) labels Apr 27, 2026

yungyuc added this to tensor operations Apr 27, 2026

github-project-automation Bot moved this to Todo in tensor operations Apr 27, 2026

yungyuc moved this from Todo to In Progress in tensor operations Apr 27, 2026

yungyuc removed this from tensor operations Apr 27, 2026

yungyuc added this to tabular data processing Apr 27, 2026

yungyuc moved this to In Progress in tabular data processing Apr 27, 2026

yungyuc requested changes Apr 29, 2026


KHLee529 reviewed Apr 29, 2026

