Unify SIMD arithmetic under a shared transform_binary template #740

tigercosmos wants to merge 1 commit into solvcon:master from
Conversation
tigercosmos
left a comment
@yungyuc The PR is ready for review. Thanks!
```cpp
// every correctness check. Kept under an underscore-prefixed name
// because detect_simd() only meaningfully reflects the dispatched
// backend on aarch64 today; on other targets it would mislead users.
mod.def("_simd_feature", &simd_feature_name);
```
For checking whether SIMD is working.
cpp/modmesh/toggle/ may be a more on-topic module for the SIMD check, but it's fine to have it here in buffer.
```cpp
struct vec_add
{
    template <typename V>
    static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
};
```
Key design in this PR.
```cpp
constexpr size_t N_lane = type::vector_lane<T>;
if constexpr (!std::invocable<VecOp, vec_t, vec_t>)
{
    generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
```
T does have a vector type, but the specific VecOp functor can't be called with it. For example, vdivq doesn't exist for integer vector types in NEON, so vec_div{} isn't invocable with int32x4_t.
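This compile-time routing can be sketched without NEON headers. The PR uses the C++20 `std::invocable` concept; the sketch below uses the equivalent C++17 `std::is_invocable_v` for portability, and `no_vec`, `transform_binary_sketch`, and the loop body are illustrative stand-ins rather than the PR's code:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a vector type the functor has no overload for
// (e.g. vec_div with int32x4_t on NEON).
struct no_vec
{
};

template <typename T, typename VecOp, typename ScalarOp>
void transform_binary_sketch(T * dest, T const * dest_end,
                             T const * src1, T const * src2,
                             VecOp vec_op, ScalarOp scalar_op)
{
    using vec_t = no_vec; // the real code would use the NEON vector type of T
    (void)vec_op;
    if constexpr (!std::is_invocable_v<VecOp, vec_t, vec_t>)
    {
        // VecOp cannot take the vector type: run the scalar loop instead.
        for (; dest != dest_end; ++dest, ++src1, ++src2)
        {
            *dest = scalar_op(*src1, *src2);
        }
    }
    // else: the vector path would go here (needs arm_neon.h).
}
```

Because the check happens in `if constexpr`, the non-invocable vector path is never instantiated, so no overload error is emitted for the missing intrinsic.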
```cpp
{
    vec_t v1 = vld1q(src1);
    vec_t v2 = vld1q(src2);
    vst1q(ptr, vec_op(v1, v2));
```
vec_op is called here.
```diff
 if constexpr (!type::has_vectype<T>)
 {
-    return generic::add<T>(dest, dest_end, src1, src2);
+    generic::transform_binary<T>(dest, dest_end, src1, src2, scalar_op);
```
The scalar type T itself has no corresponding NEON vector type (e.g., bool, int64_t). There's no vector register representation at all, so SIMD is impossible.
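A minimal reconstruction of how such a trait can be layered (the names mirror the PR, but the primary template, specializations, and lane counts here are assumptions for illustration, taking 128-bit NEON registers as the baseline):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace detail
{
// Primary template: no vector register form, zero lanes.
template <typename T>
struct vector
{
    static constexpr std::size_t N_lane = 0;
};
// Opt-in specializations for types with a NEON vector type.
template <>
struct vector<float>
{
    static constexpr std::size_t N_lane = 4; // float32x4_t
};
template <>
struct vector<std::int32_t>
{
    static constexpr std::size_t N_lane = 4; // int32x4_t
};
} // namespace detail

// Derived predicate: true exactly when a specialization reports lanes.
template <typename T>
inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
```

Types left on the primary template (such as `bool` here) fall through to the scalar path at compile time.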
```diff
-#include <cstddef>
 #include <arm_neon.h>
+#include <cstddef>
```
The formatter fixed the order; I think it should be fine. Let me know if I should revert it.
```diff
 template <typename T>
-inline constexpr size_t has_vectype = detail::vector<T>::N_lane > 0;
+inline constexpr bool has_vectype = detail::vector<T>::N_lane > 0;
```
Fixed the boolean type.
```diff
 inline void add(T * dest, T const * dest_end, T const * src1, T const * src2)
 {
-    T * ptr = dest;
-    while (ptr < dest_end)
-    {
-        *ptr = *src1 - *src2;
-        ++ptr;
-        ++src1;
-        ++src2;
-    }
+    transform_binary<T>(dest, dest_end, src1, src2, std::plus<T>{});
 }
```
Main design of this PR.
I am not sure if the additional abstraction still generates good SIMD binaries. Please profile to check. If you have time, also check the built assembly.
```python
if platform.machine() in ("arm64", "aarch64"):
    self.assertEqual(feature, "NEON")
```
Checks that NEON is working, which we didn't test before.
```python
    self.skipTest("_simd_feature() = " + feature)
```

```python
class SimdTransformBinaryTC(unittest.TestCase):
```
Some cases for checking transform_binary functionality.
Why do you isolate this unit test out from test_buffer.py?
The whole SIMD implementation is also outside the buffer directory. I think it's worth a new file.
Refs solvcon#646. Replace the four duplicated add/sub/mul/div loops in both simd_generic.hpp and neon/neon.hpp with a single transform_binary template parameterized by a scalar functor (and, on NEON, a vector functor). Adding a new elementwise op now means writing a functor instead of two near-identical scalar-tail loops.

The NEON path also fixes:

* Sub-lane UB: `dest_end - N_lane` formed a pointer before the buffer when the input was shorter than one vector lane. Compare on remaining length instead.
* check_between returned the first too-large lane in a block but not the first too-small one. Inspect both bounds before picking a winner so the diagnostic pointer is deterministic.
* has_vectype was typed `size_t` (silently truthy for any non-zero lane count); retype to `bool` to match its predicate role.
* Use std::invocable to let int64 mul (no vmulq overload) fall back to the scalar path automatically, removing the ad-hoc `vector_lane > 2` and `is_floating_point_v` guards.

Expose a private modmesh._modmesh._simd_feature() so tests/test_simd.py can assert NEON dispatch is actually active on aarch64 -- without that guard, a regression that silently routed everything to the scalar path would still pass every correctness check. The test reaches the binding through `modmesh.core._impl`, which works whether the C++ extension is installed as a top-level module or as a package submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@KHLee529 Could you please take a look?
The unified backend looks nice at first glance. I'll dive into details later.
yungyuc
left a comment
- Clarify if some `for` loops can also be replaced with `while`.
- Run a performance test to compare the runtime before and after the change. List the results to show that the change does not degrade runtime performance.
- Rename `test_simd.py` to `test_buffer_simd.py`. We can discuss which name is better.
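For the requested before/after comparison, a minimal timing harness along these lines could be used (a sketch with made-up names, not the project's benchmark tooling):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time one pass of a binary transform over an n-element buffer and return
// seconds elapsed. Run it on the old and new builds with the same n to
// compare runtimes.
template <typename T, typename Op>
double time_transform(std::size_t n, Op op, T & sink)
{
    std::vector<T> a(n, T(3)), b(n, T(4)), out(n);
    auto const t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = op(a[i], b[i]);
    }
    auto const t1 = std::chrono::steady_clock::now();
    sink = out[n / 2]; // keep the result observable so the loop is not elided
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Repeating the measurement and taking the median would reduce noise; inspecting the generated assembly (as suggested above) confirms whether the vector loop survived the abstraction.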
```cpp
// Vector loop runs while a full lane still fits. The remaining-count
// form keeps the condition valid for buffers shorter than one lane.
T const * ptr = start;
while (static_cast<size_t>(end - ptr) >= N_lane)
```

```cpp
if (ptr != dest_end)
```

```cpp
// Tail scalar loop for remaining elements
for (; ptr < end; ++ptr)
```
Since most tests are against SimpleArray, I suggest naming the new test file test_buffer_simd.py?
KHLee529
left a comment
No change requested. Only some comments and questions listed.
```cpp
template <typename V>
static auto operator()(V a, V b) -> decltype(vaddq(a, b)) { return vaddq(a, b); }
```
Can these operator helper functions also be inlined? In my experience profiling SimpleArray SIMD operations, whether the vector operations are inlined has a large impact on performance.
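One way to nudge the compiler, sketched with a hypothetical functor (the attribute is GCC/Clang-specific and this is not the PR's code; MSVC would use `__forceinline` instead):

```cpp
#include <cassert>

// A stateless functor whose call helper is declared inline and, where the
// compiler supports it, forced inline. Stateless functors with tiny bodies
// are usually inlined at -O2 even without the attribute.
struct vec_add_sketch
{
    template <typename V>
#if defined(__GNUC__) || defined(__clang__)
    __attribute__((always_inline))
#endif
    static inline V apply(V a, V b)
    {
        return a + b;
    }
};
```

Checking the generated assembly for residual call instructions is the reliable way to confirm the intrinsic wrappers disappear.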
```cpp
}
while (ptr < dest_end)
{
    *ptr = scalar_op(*src1, *src2);
```
Nice way to remove the dependency on generic functions.
```diff
 {
-    T idx = *ptr;
-    if (idx < min_val || idx > max_val)
+    if (*ptr < min_val || *ptr > max_val)
```
Is this refinement potentially slower due to one extra dereference?
Summary
Refs #646 (Task 2). The generic and NEON backends each had four near-identical loops for add/sub/mul/div. This PR collapses them into a single `transform_binary` per backend that takes the operation as an injected functor.

What changed
In the generic backend, the four ops now pass `std::plus`/`std::minus`/`std::multiplies`/`std::divides` into `transform_binary`. In the NEON backend they pass `vec_add`/`vec_sub`/`vec_mul`/`vec_div` wrappers around `neon_alias`, and `std::invocable` routes types without a matching vector overload (e.g. int64 for `vmulq`) to the scalar path at compile time. This replaces the ad-hoc `vector_lane > 2` and `is_floating_point_v` guards.

Bugs fixed along the way
- `ptr <= dest_end - N_lane` formed a pointer before the buffer when the input was shorter than one SIMD lane. Now compares remaining length (`dest_end - ptr >= N_lane`), and the scalar remainder is inline instead of a recursive call into `generic::`.
- `check_between` diagnostic. The SIMD body checked the `>= max` mask first and only looked at `< min` if the first was empty, so a later too-large lane could hide an earlier too-small one. Both bounds are now inspected before picking the returned pointer.
- `has_vectype` typing. Declared `size_t`; retyped to `bool` to match its predicate role.

Tests
`tests/test_simd.py` pins `_simd_feature() == "NEON"` on aarch64 so a silent fallback to the scalar path cannot pass unnoticed. It then covers the int32 shape matrix (n = 1, 3, 4, 5, 8, 17) for `transform_binary`, the int64-mul SFINAE fallback, and float sub/mul/div with one block + tail. A new private `modmesh._modmesh._simd_feature()` binding exposes the runtime-detected backend.

Follow-up
`simd::check_between` has inconsistent bound semantics across paths: the NEON SIMD body treats `value == max_val` as out-of-range, while the scalar fallback accepts it. Out of scope here; left for a separate change.

Test plan

- `make gtest`
- `tests/test_simd.py` on aarch64

🤖 Generated with Claude Code
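The follow-up note on `check_between` bound semantics can be illustrated with a tiny sketch (hypothetical helper names, not the project's code): the two paths disagree exactly at the max bound.

```cpp
#include <cassert>

// Mimics the NEON mask, which rejects the max bound itself (>=).
inline bool out_of_range_vector_style(int v, int min_val, int max_val)
{
    return v < min_val || v >= max_val;
}

// Mimics the scalar fallback, which accepts the max bound itself (>).
inline bool out_of_range_scalar_style(int v, int min_val, int max_val)
{
    return v < min_val || v > max_val;
}
```

For any value strictly inside or strictly outside the range the two agree; only `v == max_val` is classified differently, which is why the inconsistency can hide until a boundary value reaches the SIMD path.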