Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX #31
Open
georges-arm wants to merge 1 commit into aegis-aead:main from
Conversation
Collaborator
Nice! Is this something we can apply to the other variants as well?
The `aegis128l_common.h` code contains repeated lines of paired XOR and
AND operations, for example:

    msg0 = AES_BLOCK_XOR(msg0, AES_BLOCK_AND(state[2], state[3]));

This is suboptimal on Arm because no single instruction performs both
the XOR and the AND.
The FEAT_SHA3 extension includes the BCAX (bit-clear and XOR)
instruction, which computes `XOR(a, AND(b, NOT(c)))`. However, this does
not quite match the code above because BCAX negates `c`.
To enable the BCAX instruction to be used, introduce a new
`AES_INVERT_STATE37` toggle to optionally store `state[3]` and
`state[7]` as bitwise-negated in `aegis128l_common.h`. With LLVM 22 this
is sufficient to have the compiler automatically make use of the BCAX
instructions so there is no need to use them explicitly.
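The mismatch and the fix described above can be checked with a minimal portable C sketch. It models the operations on 64-bit words rather than 128-bit Neon vectors, and the helper names `xor_and`, `bcax_model` and `bcax_matches_with_negated_operand` are hypothetical; none of them come from the library. `bcax_model` only mirrors the semantics stated in the description, `XOR(a, AND(b, NOT(c)))`.

```c
#include <assert.h>
#include <stdint.h>

/* What the existing code computes for each pair: a ^ (b & c). */
static inline uint64_t xor_and(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & c);
}

/* Scalar model of BCAX as described in the text: a ^ (b & ~c). */
static inline uint64_t bcax_model(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* If the third operand is kept in negated form, the BCAX shape
 * reproduces the original XOR+AND pair exactly. Returns 1 on match. */
static inline int bcax_matches_with_negated_operand(uint64_t a, uint64_t b,
                                                    uint64_t c)
{
    uint64_t c_stored_negated = ~c; /* assumption: how the state is kept */
    return bcax_model(a, b, c_stored_negated) == xor_and(a, b, c);
}
```

The equality holds for every input because `~(~c) == c`, so the only cost of the negated storage is wherever the true value of the state word is needed again.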
Since `state[3]` and `state[7]` are now bitwise-negated, also update
`aegis128l_neon_sha3.c` to add a new `AES_ENC1` macro that undoes the
bitwise negation as part of the AESE instruction. The compiler will
ordinarily try to materialise the all-ones constant here in a
sub-optimal way, necessitating the use of inline assembly.
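As a rough illustration of why an all-ones operand can undo the negation: AESE XORs its two operands before applying the SubBytes and ShiftRows steps, so passing the negated state together with an all-ones block gives the same result as passing the true state with a zero block. The sketch below models only that structure on 64-bit words. `toy_sub_shift` and `aese_model` are hypothetical stand-ins (the toy mixing function is not the real AES transformation), and this is not the `AES_ENC1` macro from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for SubBytes+ShiftRows. Any fixed function works for
 * this demonstration; this is NOT the real AES transformation. */
static inline uint64_t toy_sub_shift(uint64_t x)
{
    x *= 0x9e3779b97f4a7c15ULL;   /* mix the bits */
    return (x << 13) | (x >> 51); /* rotate left by 13 */
}

/* Model of the AESE structure: XOR the operands, then transform. */
static inline uint64_t aese_model(uint64_t data, uint64_t key)
{
    return toy_sub_shift(data ^ key);
}

/* With the state stored negated, an all-ones key restores the true
 * value inside the XOR step. Returns 1 when both paths agree. */
static inline int all_ones_key_undoes_negation(uint64_t s)
{
    uint64_t stored = ~s; /* negated storage, as for state[3]/state[7] */
    return aese_model(stored, ~0ULL) == aese_model(s, 0);
}
```

In this model the negation costs no extra operation, provided the all-ones block is already available in a register, which is the part the description says the compiler handles poorly.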
Benchmarking this on a range of Neoverse platforms with LLVM 22, we see
a 5-15% speedup over the existing Neon SHA3 implementation.
cf0950a to
6c6a2ce
Collaborator
Author
Good point, I think yes! I did a quick test and it seems to show a speedup in most cases. For the larger cases, LLVM sometimes struggles to generate code for the state arrays without spilling everything to the stack, which ruins performance. I will need to investigate further to see if I can avoid that. Assuming I can get that to work, I'll aim to put up something similar for the other cases some time in the next few weeks.