Don't unroll when u32_muladd64 is too big
The previous commit shows that unrolling has, as expected, a cost in
terms of code size, and depending on the core either improves or
degrades performance. It also notes that naive unrolling was the start
of an experiment. This commit marks the end, by keeping the unrolled
version for DSP cores but not the others.

Non-DSP cores (eg Cortex-M0)
----------------------------

An important factor is whether u32_muladd64 will be inlined even in an
unrolled loop: it obviously should when it's a single instruction such
as umaal (which GCC notices but Clang misses) and it shouldn't when it's
dozens of instructions (such as on DSP-less Arm cores).
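
To make the two shapes concrete, here's what u32_muladd64 boils down to
in each case. The portable one-liner matches the MULADD64_SMALL variant
visible in the diff below; the UMAAL version is only a sketch of the
single-instruction form (the exact operand constraints here are my
assumption):

    #include <stdint.h>

    /* Small shape: x * y + z + t always fits in 64 bits, since
     * (2^32 - 1)^2 + 2 * (2^32 - 1) == 2^64 - 1. */
    static uint64_t u32_muladd64_small(uint32_t x, uint32_t y,
                                       uint32_t z, uint32_t t)
    {
        return (uint64_t) x * y + z + t;
    }

    /* DSP shape: UMAAL Rdlo, Rdhi, Rn, Rm computes
     *     Rdhi:Rdlo = Rn * Rm + Rdlo + Rdhi,
     * ie exactly x * y + z + t in a single instruction. */
    static uint64_t u32_muladd64_umaal(uint32_t x, uint32_t y,
                                       uint32_t z, uint32_t t)
    {
        __asm__ ("umaal %[z], %[t], %[x], %[y]"
                 : [z] "+r" (z), [t] "+r" (t)
                 : [x] "r" (x), [y] "r" (y));
        return ((uint64_t) t << 32) | z;
    }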

In that second case, the cost of making the function call can be higher
than the cost of loop management. This is what we observe on Cortex-M0
as the figures in the previous commit show.

There are 3 components to the cost of making the function call (see the
register map sketched below for the first two):
1. Inputs and outputs need to be in specific registers.
2. Saving/restoring scratch registers.
3. The function call itself (two branches versus only one for the loop,
plus BL is more expensive than B).
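
Concretely, under the AAPCS (the standard Arm calling convention,
recalled here for reference):

    /* AAPCS register use for
     * uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t):
     *   arguments:  x -> r0, y -> r1, z -> r2, t -> r3
     *   result:     low half -> r0, high half -> r1
     *   r0-r3 (and r12) are caller-saved: nothing in them survives a call.
     * So cost (1) is marshalling values into r0-r3 before every call, and
     * cost (2) is saving whatever the caller still needs from them. */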

The cost of (1) and (2) is apparent in this case by comparing what
U288_MULADD_STEP(1) translates to on Cortex-M0 and Cortex-M4:

 // M0
 11c:	6862      	ldr	r2, [r4, #4]
 11e:	000b      	movs	r3, r1
 120:	0028      	movs	r0, r5
 122:	6871      	ldr	r1, [r6, #4]
 124:	f7ff ffd7 	bl	d6 <u32_muladd64>
 128:	6060      	str	r0, [r4, #4]

 // M4
  d4:	6855      	ldr	r5, [r2, #4]
  d6:	6844      	ldr	r4, [r0, #4]
  d8:	fbe1 4365 	umaal	r4, r3, r1, r5
  dc:	6044      	str	r4, [r0, #4]

We can see two additional MOVS in the M0 case.

- The "movs r3, r1" is because the high half of the result, returned in r1,
will be used as the 't' argument in the next call, passed as r3. This
could be avoided by changing the signature of the u32_muladd64 in order
to make 't' the second parameter. However I couldn't find a way to do
that without adding a "movs" somewhere near the end of the function,
which negates expected the benefits of this change in terms of execution
time (but still saved a few bytes).

- The "movs r0, r5" is because u32_muladd64 doesn't preserve its "x"
argument, so the value of u288_muladd's "x" needs to be copied every
time. The only way to get around that would be to manually inline
u32_muladd64 and make it respect a stronger contract than what's
guaranteed by the ABI standard (that is, preserve the value of this one
argument).
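
For reference, the signature change discussed in the first point could
look like this (a hypothetical sketch, not adopted):

    /* Hypothetical variant (NOT adopted): with 't' as the second
     * parameter it is passed in r1 under the AAPCS, the same register
     * that holds the high half of the previous call's result, so the
     * "movs r3, r1" at the call site disappears. */
    static uint64_t u32_muladd64_alt(uint32_t x, uint32_t t,
                                     uint32_t y, uint32_t z)
    {
        return (uint64_t) x * y + z + t;
    }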

Getting rid of the second movs seems like a lot of work, and it's not
even clear it would result in better performance than a loop: it would
avoid cost (1) above, but u32_muladd64 needs 7 registers, and we need
to preserve the values of u288_muladd's "y" and "z" too, so that's 9
values; on M0 (v6-M, Thumb1) most instructions can only use the lower 8
registers, so we'd need to use the stack anyway, ie cost (2) wouldn't
be totally eliminated, and cost (3) would always remain. (Perhaps there
is more hope for a performance improvement on M3, which with Thumb2 can
efficiently use more registers.)

Overall, I decided not to attempt this effort, given its unclear
chances of improving performance, especially as it's clear that code
size would increase compared to a loop even in the best case.

DSP cores (eg Cortex-M4)
------------------------

I hoped to be able to reduce code size by using LDM/STM in the unrolled
body, like this (which would be the body of u288_muladd on those cores):

    __asm__(
        ".syntax unified\n\t"
        "mov    r3, #0\n\t"                        /* r3 = running carry */
        "ldm    %[y]!, {r4, r6, r8, r10}\n\t"      /* load y[0..3], advance y */
        "ldm    %[z], {r5, r7, r9, r11}\n\t"       /* load z[0..3] */
        "umaal  r5, r3, %[x], r4\n\t"              /* z[i] = x*y[i] + z[i] + carry */
        "umaal  r7, r3, %[x], r6\n\t"
        "umaal  r9, r3, %[x], r8\n\t"
        "umaal  r11, r3, %[x], r10\n\t"
        "stm    %[z]!, {r5, r7, r9, r11}\n\t"      /* store z[0..3], advance z */

        "ldm    %[y]!, {r4, r6, r8, r10}\n\t"      /* load y[4..7] */
        "ldm    %[z], {r5, r7, r9, r11, r12}\n\t"  /* load z[4..8], r12 = z[8] */
        "umaal  r5, r3, %[x], r4\n\t"
        "umaal  r7, r3, %[x], r6\n\t"
        "umaal  r9, r3, %[x], r8\n\t"
        "umaal  r11, r3, %[x], r10\n\t"

        "adds   r12, r3\n\t"                       /* z[8] += carry, sets C */
        "stm    %[z], {r5, r7, r9, r11, r12}\n\t"  /* store z[4..8] */
        "mov    %[z], #0\n\t"                      /* return value = the final */
        "adc    %[z], %[z], #0\n\t"                /* carry, taken from C */
        : [y] "+l" (y), [z] "+l" (z)
        : [x] "l" (x)
        : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12", "cc", "memory"
    );
    return (uint32_t) z;

This has the following effects:

- code size: -16 bytes on M4 and A7
- stack usage: +20 bytes on M4 and A7
- performance: 621 -> 625 ms on M4, 55.5 -> 51.7 ms on A7
- source: more asm

This is a mixed picture; in the end I decided not to make that change.

OTOH, I'm keeping the basic unrolling, as the performance impact (+20%
on M4) is worth the +56 bytes of code size IMO (and it has no impact on
stack usage and keeps the amount of asm low).
mpg committed Jan 13, 2022
1 parent bf89286 commit 0211adc
p256-m.c: 21 additions, 10 deletions
@@ -185,6 +185,9 @@ static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t);
 
 /* This macro is used to mark whether an asm implementation is found */
 #undef MULADD64_ASM
+/* This macro is used to mark whether the implementation has a small
+ * code size (ie, it can be inlined even in an unrolled loop) */
+#undef MULADD64_SMALL
 
 /*
  * Currently assembly optimisations are only supported with GCC/Clang for
@@ -218,6 +221,7 @@ static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
     return ((uint64_t) t << 32) | z;
 }
 #define MULADD64_ASM
+#define MULADD64_SMALL
 
 #else /* __ARM_FEATURE_DSP */
 
@@ -297,6 +301,7 @@ static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
 {
     return (uint64_t) x * y + z + t;
 }
+#define MULADD64_SMALL
 #else
 static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
 {
@@ -342,21 +347,27 @@ static uint32_t u288_muladd(uint32_t z[9], uint32_t x, const uint32_t y[8])
 {
     uint32_t carry = 0;
 
-//for (unsigned i = 0; i < 8; i++) {
-#define STEP(i) \
+#define U288_MULADD_STEP(i) \
     do { \
         uint64_t prod = u32_muladd64(x, y[i], z[i], carry); \
         z[i] = (uint32_t) prod; \
        carry = (uint32_t) (prod >> 32); \
     } while( 0 )
-    STEP(0);
-    STEP(1);
-    STEP(2);
-    STEP(3);
-    STEP(4);
-    STEP(5);
-    STEP(6);
-    STEP(7);
+
+#if defined(MULADD64_SMALL)
+    U288_MULADD_STEP(0);
+    U288_MULADD_STEP(1);
+    U288_MULADD_STEP(2);
+    U288_MULADD_STEP(3);
+    U288_MULADD_STEP(4);
+    U288_MULADD_STEP(5);
+    U288_MULADD_STEP(6);
+    U288_MULADD_STEP(7);
+#else
+    for (unsigned i = 0; i < 8; i++) {
+        U288_MULADD_STEP(i);
+    }
+#endif
 
     uint64_t sum = (uint64_t) z[8] + carry;
     z[8] = (uint32_t) sum;
