
[Deepin-Kernel-SIG] [linux 6.18-y] [Fromlist] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation #1619

Merged
opsiff merged 3 commits into deepin-community:linux-6.18.y from Avenger-285714:crc64-6.18 on Apr 14, 2026

Conversation


@Avenger-285714 (Member) commented Apr 13, 2026

Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.

The acceleration is implemented using C intrinsics (<arm_neon.h>) rather than raw assembly for better readability and maintainability.

Key highlights of this implementation:

  • Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
  • Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
  • Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.
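As a point of reference, the generic shift-and-XOR baseline that the PMULL path is measured against can be sketched in plain userspace C. The reflected polynomial 0x9a6c9329ac4bc9b5 (normal form 0xad93d23594c93659) and the all-ones init/xorout are the published CRC-64/NVMe parameters; the function below is a sketch, not the kernel's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-64/NVMe polynomial (normal form 0xad93d23594c93659). */
#define CRC64_NVME_POLY 0x9a6c9329ac4bc9b5ULL

/* Generic bitwise update: one shift-and-XOR step per input bit.
 * The caller seeds with ~0 and inverts the result, per the
 * CRC-64/NVMe model (refin/refout, init = xorout = all-ones). */
static uint64_t crc64_nvme_ref(uint64_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC64_NVME_POLY : 0);
	}
	return crc;
}
```

The full checksum of a buffer is `~crc64_nvme_ref(~0ULL, buf, len)`; the standard check string "123456789" should yield 0xae8b14860a799888. Because the update streams state, incremental calls over slices give the same result as one pass.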

Performance results (kunit crc_benchmark on Cortex-A72):

  • Generic (len=4096): ~268 MB/s
  • PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)

Link: https://lore.kernel.org/all/20260329074338.1053550-1-demyansh@gmail.com/
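The length gating and 4KB chunking described above can be sketched in userspace C. The SIMD guard and the PMULL inner routine are kernel-only, so this sketch substitutes a trivial byte-streaming fold for both the accelerated and generic paths (any streaming update gives identical results chunked or one-shot); the names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SZ_4K 4096

/* Byte-streaming stand-in for both crc64_nvme_arm64_c() and
 * crc64_nvme_generic(); any streaming fold is chunk-invariant. */
static uint64_t fold_bytes(uint64_t acc, const uint8_t *p, size_t len)
{
	while (len--)
		acc = (acc * 1099511628211ULL) ^ *p++;
	return acc;
}

/* Mirror of the dispatch shape: take the fast path only for len >= 128,
 * and cap each SIMD section at a 16-byte-aligned chunk of at most 4 KB. */
static uint64_t crc64_nvme_dispatch(uint64_t crc, const uint8_t *p, size_t len)
{
	if (len >= 128) {	/* plus PMULL and may_use_simd() checks in-kernel */
		do {
			size_t chunk = len & ~(size_t)15;

			if (chunk > SZ_4K)
				chunk = SZ_4K;
			crc = fold_bytes(crc, p, chunk); /* under scoped_ksimd() in-kernel */
			p += chunk;
			len -= chunk;
		} while (len >= 128);
	}
	return fold_bytes(crc, p, len);	/* short buffers and the tail */
}
```

For a 5000-byte buffer this runs a 4096-byte chunk, then an 896-byte chunk, and hands the final 8 bytes to the tail path; the result matches a single one-shot fold.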

Summary by Sourcery

Add an ARM64 NEON/PMULL-accelerated CRC64-NVMe implementation and wire it into the generic CRC64 architecture-specific path for capable CPUs.

New Features:

  • Introduce an ARM64 NEON-based inner CRC64-NVMe routine using PMULL intrinsics for accelerated checksum computation.
  • Provide an ARM64-specific crc64_nvme_arch() helper that conditionally uses the PMULL-accelerated path based on buffer size and CPU capabilities.

Build:

  • Hook the new ARM64 CRC64 NEON implementation into the CRC64 build when CONFIG_CRC64_ARCH and CONFIG_ARM64 are enabled, with appropriate compiler flags for PMULL/crypto support.

Copilot AI review requested due to automatic review settings April 13, 2026 12:41
sourcery-ai bot commented Apr 13, 2026

Reviewer's Guide

Adds an ARM64 NEON/PMULL-accelerated CRC64-NVMe implementation and wires it into the generic CRC64 architecture layer, with build flags and a chunked SIMD dispatch path that falls back to the generic implementation for short buffers or when PMULL/SIMD is unavailable.

Sequence diagram for crc64_nvme_arch dispatch and fallback

sequenceDiagram
    participant Caller
    participant crc64_nvme_arch
    participant cpu_have_named_feature
    participant may_use_simd
    participant scoped_ksimd
    participant crc64_nvme_arm64_c
    participant crc64_nvme_generic

    Caller->>crc64_nvme_arch: crc64_nvme_arch(crc, p, len)
    alt len >= 128
        crc64_nvme_arch->>cpu_have_named_feature: cpu_have_named_feature(PMULL)
        cpu_have_named_feature-->>crc64_nvme_arch: has_pmull
        crc64_nvme_arch->>may_use_simd: may_use_simd()
        may_use_simd-->>crc64_nvme_arch: simd_allowed
        alt has_pmull and simd_allowed
            loop while len >= 128
                crc64_nvme_arch->>crc64_nvme_arch: chunk = min(len & ~15, 4KB)
                crc64_nvme_arch->>scoped_ksimd: enter ksimd section
                scoped_ksimd->>crc64_nvme_arm64_c: crc64_nvme_arm64_c(crc, p, chunk)
                crc64_nvme_arm64_c-->>scoped_ksimd: updated_crc
                scoped_ksimd-->>crc64_nvme_arch: leave ksimd section
                crc64_nvme_arch->>crc64_nvme_arch: crc = updated_crc, p += chunk, len -= chunk
            end
        end
    end
    crc64_nvme_arch->>crc64_nvme_generic: crc64_nvme_generic(crc, p, len)
    crc64_nvme_generic-->>crc64_nvme_arch: final_crc
    crc64_nvme_arch-->>Caller: final_crc

Class diagram for CRC64 ARM64 NEON implementation and dispatch

classDiagram
    class Crc64Arm64 {
        +u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len)
        -u64 fold_consts_val[2]
        -u64 bconsts_val[2]
    }

    class Crc64ArchLayer {
        +u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
        +u64 crc64_nvme_generic(u64 crc, const u8 *p, size_t len)
        +u64 crc64_be_arch(u64 crc, const u8 *p, size_t len)
        +u64 crc64_be_generic(u64 crc, const u8 *p, size_t len)
    }

    class CpuFeature {
        +bool cpu_have_named_feature(int feature)
        +const int PMULL
    }

    class SimdSubsystem {
        +bool may_use_simd()
        +scoped_ksimd scoped_ksimd()
    }

    class BuildConfig {
        +CONFIG_CRC64
        +CONFIG_CRC64_ARCH
        +CONFIG_ARM64
    }

    class Objects {
        +crc64_main_o
        +arm64_crc64_neon_inner_o
        +riscv_crc64_lsb_o
        +riscv_crc64_msb_o
        +x86_crc64_pclmul_o
    }

    Crc64ArchLayer --> Crc64Arm64 : uses
    Crc64ArchLayer --> CpuFeature : checks_features
    Crc64ArchLayer --> SimdSubsystem : manages_simd_context
    BuildConfig --> Objects : selects
    Objects --> Crc64Arm64 : links_arm64_neon_path
    Objects --> Crc64ArchLayer : links_common_crc64

File-Level Changes

Introduce ARM64 NEON/PMULL-based inner CRC64-NVMe implementation
  • Add crc64_nvme_arm64_c() that computes CRC64 using ARM NEON intrinsics and PMULL, including folding and Barrett reduction using precomputed constants
  • Implement the main 16-byte folding loop operating on a 128-bit CRC state with vmull_p64/vmull_high_p64 and xor blending of input blocks
  • Handle final x^64 multiplication and Barrett reduction to bring the 128-bit state back to the 64-bit CRC value
lib/crc/arm64/crc64-neon-inner.c
Provide ARM64-specific CRC64 dispatch helper that uses the NEON path when beneficial
  • Declare crc64_nvme_arm64_c prototype and define crc64_nvme_arch() as the ARM64 CRC64-NVMe entry point
  • Gate the NEON/PMULL path on buffer length >= 128, CPU PMULL capability, and may_use_simd()
  • Process data in up-to-4KB aligned SIMD chunks under scoped_ksimd() and fall back to crc64_nvme_generic() for remaining bytes or when SIMD is not used
  • Alias crc64_be_arch to crc64_be_generic to keep big-endian CRC64 using the generic implementation
lib/crc/arm64/crc64.h
Wire the new ARM64 CRC64 implementation into the build system with appropriate compiler flags
  • Add arm64/crc64-neon-inner.o to the crc64 objects when CONFIG_ARM64 and CONFIG_CRC64_ARCH are enabled
  • Remove -mgeneral-regs-only for the NEON inner object and add -ffreestanding and -march=armv8-a+crypto to enable PMULL/crypto instructions
  • Include the toolchain system headers via -isystem for the NEON intrinsic definitions
  • Clarify the CONFIG_CRC64_ARCH conditional with a closing comment
lib/crc/Makefile
lib/crc/Kconfig
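The folding and Barrett-reduction steps summarized in the first change entry follow the standard carry-less-multiply CRC scheme. Sketched in the non-reflected domain for clarity (the reflected CRC64-NVMe code uses shifted variants of these constants): with 128-bit state $S = S_{hi} x^{64} + S_{lo}$ over GF(2) and next 16-byte input block $B$, each fold computes

$S' = S_{hi} \cdot K_1 \oplus S_{lo} \cdot K_2 \oplus B$, with $K_1 = x^{192} \bmod P(x)$ and $K_2 = x^{128} \bmod P(x)$,

so that $S' \equiv S \cdot x^{128} + B \pmod{P(x)}$; each product is one 64x64-bit carry-less multiply (vmull_p64/vmull_high_p64). The final Barrett reduction maps the remaining 128-bit value $R$ to the 64-bit CRC using $\mu = \lfloor x^{128} / P(x) \rfloor$:

$T_1 = \lfloor R / x^{64} \rfloor \cdot \mu$, $\quad T_2 = \lfloor T_1 / x^{64} \rfloor \cdot P(x)$, $\quad \mathrm{crc} = (R \oplus T_2) \bmod x^{64}$.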


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • Consider adding an include guard to lib/crc/arm64/crc64.h to avoid accidental multiple inclusion as this header grows or is reused elsewhere.
  • crc64_nvme_arm64_c is only used from the architecture-specific path; making it file-local (static) and exposing only the inline wrapper in crc64.h would better encapsulate the NEON implementation detail and reduce the chance of unintended external use.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider adding an include guard to lib/crc/arm64/crc64.h to avoid accidental multiple inclusion as this header grows or is reused elsewhere.
- crc64_nvme_arm64_c is only used from the architecture-specific path; making it file-local (static) and exposing only the inline wrapper in crc64.h would better encapsulate the NEON implementation detail and reduce the chance of unintended external use.

## Individual Comments

### Comment 1
<location path="lib/crc/arm64/crc64.h" line_range="1" />
<code_context>
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * CRC64 using ARM64 PMULL instructions
</code_context>
<issue_to_address>
**nitpick (bug_risk):** Consider adding a traditional include guard to the new header.

In this codebase we usually prefer `#ifndef`/`#define` guards over `#pragma once`, and they make multiple inclusion behavior explicit and easier to reason about. Please add a guard consistent with nearby headers.
</issue_to_address>



Copilot AI left a comment


Pull request overview

Adds an ARM64 NVMe CRC64 implementation accelerated with NEON/PMULL to improve throughput vs the generic shift/XOR path, and wires it into the existing CRC64 arch-dispatch mechanism.

Changes:

  • Enables CRC64_ARCH by default on ARM64 and adds an ARM64-specific dispatch header for crc64_nvme().
  • Introduces a NEON/PMULL-based CRC64-NVMe inner implementation (crc64_nvme_arm64_c()), with chunking to cap SIMD critical sections.
  • Updates lib/crc/Makefile to build the new ARM64 object with appropriate compiler flags.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • lib/crc/Makefile: Adds ARM64 CRC64-NVMe NEON inner object and per-file compile flags.
  • lib/crc/Kconfig: Defaults CRC64_ARCH to enabled on ARM64.
  • lib/crc/arm64/crc64.h: Adds ARM64 crc64_nvme_arch() dispatch with PMULL + SIMD gating and chunking.
  • lib/crc/arm64/crc64-neon-inner.c: Implements the PMULL-accelerated CRC64-NVMe update routine using NEON intrinsics.


Comment on lines +17 to +24
if (len >= 128 && cpu_have_named_feature(PMULL) &&
likely(may_use_simd())) {
do {
size_t chunk = min_t(size_t, len & ~15, SZ_4K);

scoped_ksimd()
crc = crc64_nvme_arm64_c(crc, p, chunk);

Copilot AI Apr 13, 2026

scoped_ksimd() does not exist anywhere in this tree (the arm64 SIMD users use kernel_neon_begin() / kernel_neon_end() after may_use_simd()). As written, this will fail to build. Replace this with a kernel_neon_begin()/kernel_neon_end() pair around the accelerated call (and include <asm/neon.h>), keeping the 4KB chunking to bound preempt-disable time.

Comment on lines +42 to +44
CFLAGS_REMOVE_arm64/crc64-neon-inner.o += -mgeneral-regs-only
CFLAGS_arm64/crc64-neon-inner.o += -ffreestanding -march=armv8-a+crypto
CFLAGS_arm64/crc64-neon-inner.o += -isystem $(shell $(CC) -print-file-name=include)
Copilot AI Apr 13, 2026

The per-object flags here hardcode -ffreestanding and -isystem $(CC ... -print-file-name=include), and use -march=armv8-a+crypto. For consistency with existing arm64 NEON/crypto intrinsics build rules, prefer using the arch-exported CC_FLAGS_FPU/CC_FLAGS_NO_FPU (see arch/arm64/Makefile:36-42) and consider -mcpu=generic+crypto (as done for aegis128-neon-inner.o in crypto/Makefile:110-123) instead of a raw -march override.

Suggested change
CFLAGS_REMOVE_arm64/crc64-neon-inner.o += -mgeneral-regs-only
CFLAGS_arm64/crc64-neon-inner.o += -ffreestanding -march=armv8-a+crypto
CFLAGS_arm64/crc64-neon-inner.o += -isystem $(shell $(CC) -print-file-name=include)
CFLAGS_REMOVE_arm64/crc64-neon-inner.o += $(CC_FLAGS_NO_FPU)
CFLAGS_arm64/crc64-neon-inner.o += $(CC_FLAGS_FPU) -mcpu=generic+crypto

@Avenger-285714 (Member, Author)

@opsiff
Linux previously underwent a cross-architecture refactor of its CRC crypto library, so this patch cannot simply be backported to linux-6.6.y.

That said, backporting it does seem necessary, as it can significantly boost NVMe I/O performance in certain scenarios.

I haven't come up with an elegant way to do that backport off the top of my head, so I'm leaving it unaddressed for now.

[Upstream commit 814f541]

Implement the ksimd scoped guard API so that it can be used by code that
supports both ARM and arm64.

Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
ardbiesheuvel and others added 2 commits April 14, 2026 10:19
[Upstream commit c5b91a1]

Encapsulate kernel_neon_begin() and kernel_neon_end() using a 'ksimd'
cleanup guard. This hides the prototype of those functions, allowing
them to be changed for arm64 but not ARM, without breaking code that is
shared between those architectures (RAID6, AEGIS-128)

It probably makes sense to expose this API more widely across
architectures, as it affords more flexibility to the arch code to
plumb it in, while imposing more rigid rules regarding the start/end
bookends appearing in matched pairs.

Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
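The 'ksimd' guard this commit introduces builds on the kernel's cleanup-attribute guard machinery. The underlying pattern can be illustrated with a runnable userspace analogue (the names and the depth counter below are illustrative stand-ins, not the kernel API):

```c
#include <assert.h>

static int simd_depth;	/* stands in for the kernel_neon_begin/end state */

static void simd_begin(void) { simd_depth++; }
static void simd_end(int *unused) { (void)unused; simd_depth--; }

/* Scoped guard: begin on entry, end automatically when the block exits,
 * so the begin/end bookends always appear in matched pairs. */
#define scoped_simd() \
	for (int _guard __attribute__((cleanup(simd_end))) = (simd_begin(), 0); \
	     !_guard; _guard = 1)
```

A caller writes `scoped_simd() { ... }`; the cleanup handler fires on any exit from the scope, which is what lets the arm64 prototypes change behind the guard without touching shared callers.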
Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON
Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR
software implementation is slow, which creates a bottleneck in NVMe and
other storage subsystems.

The acceleration is implemented using C intrinsics (<arm_neon.h>) rather
than raw assembly for better readability and maintainability.

Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency
  spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize
  register spilling.
- Benchmarks show the break-even point against the generic implementation
  is around 128 bytes. The PMULL path is enabled only for len >= 128.

Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)

Signed-off-by: Demian Shulhan <demyansh@gmail.com>
Link: https://lore.kernel.org/all/20260329074338.1053550-1-demyansh@gmail.com/
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
@opsiff opsiff merged commit f39ec81 into deepin-community:linux-6.18.y Apr 14, 2026
10 of 12 checks passed
@deepin-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver for each of the changed files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


6 participants