Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX #31

Open
georges-arm wants to merge 1 commit into aegis-aead:main from georges-arm:georges-arm/aarch64-sha3-use-bcax

@georges-arm
Collaborator

The aegis128l_common.h code contains repeated lines of paired XOR and AND operations, for example:

msg0 = AES_BLOCK_XOR(msg0, AES_BLOCK_AND(state[2], state[3]));

This is suboptimal on Arm because there is no single instruction that performs both the XOR and the AND.

The FEAT_SHA3 extension includes the BCAX (bit-clear and XOR) instruction, which computes XOR(a, AND(b, NOT(c))). This does not quite match the pattern above because c must be negated.

To enable the BCAX instruction to be used, introduce a new AES_INVERT_STATE37 toggle to optionally store state[3] and state[7] as bitwise-negated in aegis128l_common.h. With LLVM 22 this is sufficient to have the compiler automatically make use of the BCAX instructions so there is no need to use them explicitly.
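
As a rough illustration of why storing the operand negated is enough, the sketch below models a single 64-bit lane in portable C. The names `bcax_model`, `xor_and`, `c_stored` and the helper functions are hypothetical and are not taken from the patch; the real code operates on 128-bit Neon vectors through the `AES_BLOCK_*` macros.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of one 64-bit lane of BCAX: a ^ (b & ~c). */
static inline uint64_t bcax_model(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* The XOR-of-AND pattern from the existing code: a ^ (b & c). */
static inline uint64_t xor_and(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & c);
}

/* Returns 1 when BCAX applied to a negated copy of c matches a ^ (b & c). */
static inline int bcax_matches_when_c_is_stored_negated(uint64_t a, uint64_t b,
                                                        uint64_t c)
{
    uint64_t c_stored = ~c; /* hypothetical: the value kept in the state array */
    return bcax_model(a, b, c_stored) == xor_and(a, b, c);
}

/* Small self-check over a few arbitrary bit patterns. */
static inline void bcax_self_check(void)
{
    assert(bcax_matches_when_c_is_stored_negated(0, 0, 0));
    assert(bcax_matches_when_c_is_stored_negated(~0ULL, ~0ULL, ~0ULL));
    assert(bcax_matches_when_c_is_stored_negated(0x0123456789abcdefULL,
                                                 0xfedcba9876543210ULL,
                                                 0x0f0f0f0f0f0f0f0fULL));
}
```

Because the double negation cancels, the two forms agree for every input, so the only remaining cost is keeping every other use of the negated blocks consistent.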

Since state[3] and state[7] are now bitwise-negated, also update aegis128l_neon_sha3.c to add a new AES_ENC1 macro that undoes the bitwise negation as part of the AESE instruction. The compiler will ordinarily try to materialise the all-ones constant here in a sub-optimal way, necessitating the use of inline assembly.
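
One way to read "undoes the bitwise negation as part of the AESE instruction": AESE XORs its two operands before applying the byte substitution and row shift, so supplying an all-ones second operand to a negated block yields the same XOR result as supplying an all-zero operand to the plain block. The sketch below models only that XOR step in portable C. It assumes the non-inverted path passes an all-zero operand, which is not confirmed by the text above, and `block_t` and every function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit block type for illustration only. */
typedef struct {
    uint8_t b[16];
} block_t;

/* Builds a block with every byte set to v. */
static inline block_t block_fill(uint8_t v)
{
    block_t out;
    memset(out.b, v, sizeof out.b);
    return out;
}

/* Bitwise negation of a block. */
static inline block_t block_not(block_t x)
{
    for (int i = 0; i < 16; i++) {
        x.b[i] = (uint8_t) ~x.b[i];
    }
    return x;
}

static inline int block_eq(block_t x, block_t y)
{
    return memcmp(x.b, y.b, sizeof x.b) == 0;
}

/* Models only the initial XOR of AESE, not SubBytes or ShiftRows. */
static inline block_t aese_xor_step(block_t data, block_t key)
{
    for (int i = 0; i < 16; i++) {
        data.b[i] ^= key.b[i];
    }
    return data;
}

/* Returns 1 when (~s) ^ all-ones equals s ^ all-zeros. */
static inline int negation_cancels_in_aese_xor(block_t s)
{
    block_t plain    = aese_xor_step(s, block_fill(0x00));
    block_t inverted = aese_xor_step(block_not(s), block_fill(0xFF));
    return block_eq(plain, inverted);
}
```

Since the later AES steps see identical input in both cases, the negation costs no extra instruction, provided the all-ones constant is already available in a register.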

Benchmarking this on a range of Arm Neoverse platforms with LLVM 22, we see a 5-15% speedup over the existing Neon SHA3 implementation.

@georges-arm georges-arm requested a review from jedisct1 March 12, 2026 16:47
@jedisct1
Collaborator

Nice!

Is it something we can apply to other variants as well?

@georges-arm georges-arm force-pushed the georges-arm/aarch64-sha3-use-bcax branch from cf0950a to 6c6a2ce Compare March 13, 2026 16:44
@georges-arm
Collaborator Author

> Is it something we can apply to other variants as well?

Good point, I think yes! I did a quick test and it shows a speedup in most cases. For the larger cases, LLVM sometimes struggles to generate code for the state arrays without spilling them all to the stack, which ruins performance. I will need to investigate further to see if I can avoid that.

Assuming I can get that to work, I'll aim to put up something similar to this for the other cases some time in the next few weeks.
