Skip to content

Unify SIMD arithmetic under a shared transform_binary template#740

Open
tigercosmos wants to merge 1 commit intosolvcon:masterfrom
tigercosmos:issue646
Open

Unify SIMD arithmetic under a shared transform_binary template#740
tigercosmos wants to merge 1 commit intosolvcon:masterfrom
tigercosmos:issue646

Conversation

@tigercosmos
Copy link
Copy Markdown
Collaborator

@tigercosmos tigercosmos commented Apr 19, 2026

Summary

Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add / sub / mul / div. This PR collapses them into a single transform_binary per backend that takes the operation as an injected functor.

What changed

In the generic backend, the four ops now pass std::plus / std::minus / std::multiplies / std::divides into transform_binary. In the NEON backend they pass vec_add / vec_sub / vec_mul / vec_div wrappers around neon_alias, and std::invocable routes types without a matching vector overload (e.g. int64 for vmulq) to the scalar path at compile time. This replaces the ad-hoc vector_lane > 2 and is_floating_point_v guards.

Bugs fixed along the way

  • Sub-lane UB in NEON. ptr <= dest_end - N_lane formed a pointer before the buffer when the input was shorter than one SIMD lane. Now compares remaining length (dest_end - ptr >= N_lane), and the scalar remainder is inline instead of a recursive call into generic::.
  • check_between diagnostic. The SIMD body checked the >= max mask first and only looked at < min if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
  • has_vectype typing. Declared size_t; retyped to bool to match its predicate role.

Tests

tests/test_simd.py pins _simd_feature() == "NEON" on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n=1, 3, 4, 5, 8, 17) for transform_binary, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private modmesh._modmesh._simd_feature() binding exposes the runtime-detected backend.

Follow-up

simd::check_between has inconsistent bound semantics across paths: the NEON SIMD body treats value == max_val as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.

Test plan

  • make gtest
  • tests/test_simd.py on aarch64
  • CI on Linux / macOS / aarch64

🤖 Generated with Claude Code

@tigercosmos tigercosmos marked this pull request as draft April 19, 2026 12:25
@tigercosmos tigercosmos force-pushed the issue646 branch 2 times, most recently from 488d5cd to 11a6eb1 Compare April 19, 2026 13:09
@tigercosmos tigercosmos changed the title Refactor SIMD to xsimd-style loop injection Unify SIMD arithmetic under a shared transform_binary template Apr 27, 2026
@tigercosmos tigercosmos marked this pull request as ready for review April 27, 2026 20:38
Copy link
Copy Markdown
Collaborator Author

@tigercosmos tigercosmos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yungyuc The PR is ready for review. Thanks!

// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For checking if simd is working.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

Comment on lines +97 to +101
struct vec_add
{
return generic::check_between<T>(start, end, min_val, max_val);
}

template <typename T, typename std::enable_if_t<type::has_vectype<T>> * = nullptr>
const T * check_between(T const * start, T const * end, T const & min_val, T const & max_val)
template <typename V>
static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key design in this PR.

constexpr size_t N_lane = type::vector_lane<T>;
if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
{
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.

{
vec_t v1 = vld1q(src1);
vec_t v2 = vld1q(src2);
vst1q(ptr, vec_op(v1, v2));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vec_op is called here.

if constexpr (!type::has_vectype<T>)
{
return generic::add<T>(dest, dest_end, src1, src2);
generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.


#include <cstddef>
#include <arm_neon.h>
#include <cstddef>
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formattor fixes the order, I think it should be fine. Let me know if I should revert it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine.


template <typename T>
inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the boolean type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Comment on lines +70 to 73
inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
{
T * ptr = dest;
while (ptr < dest_end)
{
*ptr = *src1 - *src2;
++ptr;
++src1;
++src2;
}
transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main design of this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Comment thread tests/test_simd.py
Comment on lines +45 to +46
if platform.machine() in ("arm64", "aarch64"):
self.assertEqual(feature, "NEON")
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if NEON is working that we didn't test it before.

Comment thread tests/test_simd.py
self.skipTest("_simd_feature() = " + feature)


class SimdTransformBinaryTC(unittest.TestCase):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cases for checking transform_binary functionality.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you isolate this unit test out from test_buffer.py?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole SIMD implementation is also outside buffer directory. I think it worths a new file.

Refs solvcon#646. Replace the four duplicated add/sub/mul/div loops in both
simd_generic.hpp and neon/neon.hpp with a single transform_binary
template parameterized by a scalar functor (and, on NEON, a vector
functor). Adding a new elementwise op now means writing a functor
instead of two near-identical scalar-tail loops.

The NEON path also fixes:

  * Sub-lane UB: `dest_end - N_lane` formed a pointer before the buffer
    when the input was shorter than one vector lane. Compare on
    remaining length instead.
  * check_between returned the first too-large lane in a block but not
    the first too-small one. Inspect both bounds before picking a
    winner so the diagnostic pointer is deterministic.
  * has_vectype was typed `size_t` (silently truthy for any non-zero
    lane count); retype to `bool` to match its predicate role.
  * Use std::invocable to let int64 mul (no vmulq overload) fall back
    to the scalar path automatically, removing the ad-hoc
    `vector_lane > 2` and `is_floating_point_v` guards.

Expose a private modmesh._modmesh._simd_feature() so tests/test_simd.py
can assert NEON dispatch is actually active on aarch64 -- without that
guard, a regression that silently routed everything to the scalar path
would still pass every correctness check. The test reaches the binding
through `modmesh.core._impl`, which works whether the C++ extension is
installed as a top-level module or as a package submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yungyuc
Copy link
Copy Markdown
Member

yungyuc commented Apr 27, 2026

@KHLee529 Could you please take a look?

@yungyuc yungyuc requested review from KHLee529 and yungyuc April 27, 2026 22:53
@yungyuc yungyuc added performance Profiling, runtime, and memory consumption array Multi-dimensional array implementation labels Apr 27, 2026
@yungyuc yungyuc moved this from Todo to In Progress in tensor operations Apr 27, 2026
@yungyuc yungyuc moved this to In Progress in tabular data processing Apr 27, 2026
@KHLee529
Copy link
Copy Markdown
Collaborator

The unified backend look nice in my first glance. I'll dive into details later.

Copy link
Copy Markdown
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Clarify if some for loops can also be replaced with while.
  • Run performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
  • Rename test_simd.py to test_buffer_simd.py. We can discuss which name is better.

// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.

// Vector loop runs while a full lane still fits. The remaining-count
// form keeps the condition valid for buffers shorter than one lane.
T const * ptr = start;
while (static_cast<size_t>(end - ptr) >= N_lane)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while reads clearer than for.

if (ptr != dest_end)

// Tail scalar loop for remaining elements
for (; ptr < end; ++ptr)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using while too?


#include <cstddef>
#include <arm_neon.h>
#include <cstddef>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine.


template <typename T>
inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Comment on lines +70 to 73
inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
{
T * ptr = dest;
while (ptr < dest_end)
{
*ptr = *src1 - *src2;
++ptr;
++src1;
++src2;
}
transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.

Comment thread tests/test_simd.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most tests are against SimpleArray, I suggest to name the new test file as test_buffer_simd.py?

Copy link
Copy Markdown
Collaborator

@KHLee529 KHLee529 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change requested. Only some comments and questions listed.

template <typename T, typename std::enable_if_t<type::has_vectype<T>> * = nullptr>
const T * check_between(T const * start, T const * end, T const & min_val, T const & max_val)
template <typename V>
static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these operator helper functions be also inlined? Based on my experience profiling the speed of SimpleArray SIMD operations, whether the vector operations are inlined impact a lot on the performance

}
while (ptr < dest_end)
{
*ptr = scalar_op(*src1, *src2);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way to remove dependency to generic functions.

{
T idx = *ptr;
if (idx < min_val || idx > max_val)
if (*ptr < min_val || *ptr > max_val)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this refinement potentially slower due to one more dereference execution?

Comment thread tests/test_simd.py
self.skipTest("_simd_feature() = " + feature)


class SimdTransformBinaryTC(unittest.TestCase):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you isolate this unit test out from test_buffer.py?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

array Multi-dimensional array implementation performance Profiling, runtime, and memory consumption

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants