Uses a simple sbox lookup with parallel mixcolumns.
Forward mixcolumns is done as shown below (according to this or this paper, it can be done with either 3 xors + 3 rotations, or with 4 xors + 2 rotations as used here):
tmp = s0 ^ s1 ^ s2 ^ s3
s0` ^= tmp ^ gmul2(s0^s1) // s1^s2^s3^gmul2(s0^s1)
s1` ^= tmp ^ gmul2(s1^s2) // s0^s2^s3^gmul2(s1^s2)
s2` ^= tmp ^ gmul2(s2^s3) // s0^s1^s3^gmul2(s2^s3)
s3` ^= tmp ^ gmul2(s3^s0) // s0^s1^s2^gmul2(s3^s0)
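A minimal C sketch of these identities, assuming (for illustration only, the actual asm register layout may differ) that s[0..3] are 32-bit words each holding one state row, so gmul2() mixes all four columns in parallel:

```c
#include <stdint.h>

/* packed xtime over 4 byte lanes, same as the gmul2() given later in this README */
static uint32_t gmul2(uint32_t in)
{
    uint32_t mask = in & 0x80808080u;
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

/* forward mixcolumns: s0' = 2*s0 ^ 3*s1 ^ s2 ^ s3 = s0 ^ tmp ^ gmul2(s0^s1), etc. */
static void mixcolumns(uint32_t s[4])
{
    uint32_t tmp = s[0] ^ s[1] ^ s[2] ^ s[3];  /* computed from the original rows */
    uint32_t s0_orig = s[0];                   /* s[0] is overwritten before the last row needs it */

    s[0] ^= tmp ^ gmul2(s[0] ^ s[1]);
    s[1] ^= tmp ^ gmul2(s[1] ^ s[2]);
    s[2] ^= tmp ^ gmul2(s[2] ^ s[3]);
    s[3] ^= tmp ^ gmul2(s[3] ^ s0_orig);
}
```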
Inverse mixcolumns is implemented as:
S{2} = gmul2(S{1})
S{4} = gmul2(S{2})
S{8} = gmul2(S{4})
S{9} = S{8} ^ S{1}
S{b} = S{9} ^ S{2}
S{d} = S{9} ^ S{4}
S{e} = S{8} ^ S{4} ^ S{2}
out = S{e} ^ ror8(S{b}) ^ ror16(S{d}) ^ ror24(S{9})
// per output byte: s0{e}^s1{b}^s2{d}^s3{9} | s1{e}^s2{b}^s3{d}^s0{9} | s2{e}^s3{b}^s0{d}^s1{9} | s3{e}^s0{b}^s1{d}^s2{9}
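A minimal C sketch of this chain, assuming one column packed into a 32-bit word with s0 in the low byte (the layout implied by the rotation amounts above); the helper names are illustrative and gmul2() is repeated for self-containment:

```c
#include <stdint.h>

static uint32_t ror(uint32_t x, unsigned n)   /* rotate right, n = 8/16/24 here */
{
    return (x >> n) | (x << (32u - n));
}

static uint32_t gmul2(uint32_t in)            /* packed xtime, same as in the forward sketch */
{
    uint32_t mask = in & 0x80808080u;
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

/* inverse mixcolumns of one column word, following the S{1}..S{e} chain above */
static uint32_t inv_mixcolumn(uint32_t S1)
{
    uint32_t S2 = gmul2(S1);
    uint32_t S4 = gmul2(S2);
    uint32_t S8 = gmul2(S4);
    uint32_t S9 = S8 ^ S1;
    uint32_t Sb = S9 ^ S2;
    uint32_t Sd = S9 ^ S4;
    uint32_t Se = S8 ^ S4 ^ S2;

    return Se ^ ror(Sb, 8) ^ ror(Sd, 16) ^ ror(S9, 24);
}
```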
gmul2() is implemented as:
mask = in & 0x80808080;
out = ((in & 0x7f7f7f7f) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1b);
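Note that mask - (mask >> 7) turns every 0x80 byte of mask into 0x7f without borrowing across byte lanes, so the final AND applies the 0x1b reduction only to bytes whose high bit was set, e.g. gmul2(0x00000080) = 0x0000001b while gmul2(0x00000040) = 0x00000080.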
Faster than CM0_sBOX only when running on a core with a single-cycle multiplier (the multiplier is used for predicated reduction in the mixcolumns multiplication).
Implemented similarly to CM0_sBOX, but with gmul2() implemented as:
out = ((in & 0x7f7f7f7f) << 1) ^ (((in & 0x80808080) >> 7) * 0x1b);
// or an equivalent sequence that performs the shifts first, in order to avoid extra moves
out = ((in << 1) & 0xfefefefe) ^ (((in >> 7) & 0x01010101) * 0x1b);
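As a quick sanity check (not part of the library, purely illustrative), both packed gmul2() variants can be compared on a host against a textbook per-byte xtime():

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t gmul2_masked(uint32_t in)     /* CM0_sBOX variant */
{
    uint32_t mask = in & 0x80808080u;
    return ((in & 0x7f7f7f7fu) << 1) ^ ((mask - (mask >> 7)) & 0x1b1b1b1bu);
}

static uint32_t gmul2_mul(uint32_t in)        /* CM0_FASTMULsBOX variant */
{
    return ((in << 1) & 0xfefefefeu) ^ (((in >> 7) & 0x01010101u) * 0x1b);
}

static uint8_t xtime(uint8_t b)               /* textbook GF(2^8) doubling */
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1b : 0x00));
}

int main(void)
{
    for (uint32_t b = 0; b < 256; b++) {
        uint32_t in  = b * 0x01010101u;       /* replicate the byte into all 4 lanes */
        uint32_t ref = (uint32_t)xtime((uint8_t)b) * 0x01010101u;

        if (gmul2_masked(in) != ref || gmul2_mul(in) != ref) {
            printf("mismatch at byte 0x%02x\n", (unsigned)b);
            return 1;
        }
    }
    printf("both gmul2 variants match xtime for all byte values\n");
    return 0;
}
```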
Cipher function | STM32F0 cycles (0ws/1ws) - CM0_sBOX | STM32F0 cycles (0ws/1ws) - CM0_FASTMULsBOX | STM32L0 cycles (0ws/1ws) - CM0_sBOX | STM32L0 cycles (0ws/1ws) - CM0_FASTMULsBOX |
---|---|---|---|---|
setEncKey<128> | 399/415 | (sBOX) | | |
setEncKey<192> | 375/389 | (sBOX) | | |
setEncKey<256> | 568/586 | (sBOX) | | |
encrypt<128> | 1666/1680 | 1587/1600 | | |
encrypt<192> | 2000/2016 | 1905/1920 | | |
encrypt<256> | 2334/2352 | 2223/2240 | | |
setDecKey<128> | 0 | 0 | 0 | 0 |
setDecKey<192> | 0 | 0 | 0 | 0 |
setDecKey<256> | 0 | 0 | 0 | 0 |
decrypt<128> | 2567/2580 | 2387/2400 | | |
decrypt<192> | 3099/3114 | 2879/2894 | | |
decrypt<256> | 3631/3648 | 3371/3388 | | |
STM32F0 is a Cortex-M0 core (prefetch enabled for 1ws; running without prefetch leads to ~45% performance degradation)
STM32L0 is a Cortex-M0+ core (prefetch enabled for 1ws)
Function | code size in bytes | stack usage in bytes | notes |
---|---|---|---|
CM0_sBOX_AES128_keyschedule_enc | 80 | 16 | uses sbox table |
CM0_sBOX_AES192_keyschedule_enc | 88 | 20(24) | uses sbox table |
CM0_sBOX_AES256_keyschedule_enc | 164 | 32 | uses sbox table |
CM0_sBOX_AES_encrypt | 508 | 40 | uses sbox table |
CM0_sBOX_AES_decrypt | 712 | 40 | uses inv_sbox table |
CM0_FASTMULsBOX_AES_encrypt | 480 | 36(40) | uses sbox table, requires single cycle multiplier |
CM0_FASTMULsBOX_AES_decrypt | 672 | 40 | uses inv_sbox table, requires single cycle multiplier |
Code sizes include pc-relative constants and their padding.
The extra 4 bytes of stack (given in parentheses) come from aligning the stack to 8 bytes on ISR entry.