Don't unroll when u32_muladd64 is too big
The previous commit shows that unrolling has, as expected, a cost in terms
of code size, and depending on the core either improves or degrades
performance. It also notes that naive unrolling was the start of an
experiment. This commit marks the end, by keeping the unrolled version for
DSP cores but not the others.

Non-DSP cores (eg Cortex-M0)
----------------------------

An important factor is whether u32_muladd64 will be inlined even in an
unrolled loop: it obviously should be when it's a single instruction such
as umaal (which GCC notices but Clang misses), and it shouldn't be when
it's dozens of instructions (such as on DSP-less Arm cores). In that
second case, the cost of making the function call can be higher than the
cost of loop management. This is what we observe on Cortex-M0, as the
figures in the previous commit show.

There are 3 components to the cost of making the function call:
1. Inputs and outputs need to be in specific registers.
2. Saving/restoring scratch registers.
3. The function call itself (two branches, only one for the loop, plus BL
   is more expensive than B).

The cost of (1) and (2) is apparent in this case by comparing what
U288_MULADD_STEP(1) translates to on Cortex-M0 and Cortex-M4:

    // M0
    11c:  6862       ldr   r2, [r4, #4]
    11e:  000b       movs  r3, r1
    120:  0028       movs  r0, r5
    122:  6871       ldr   r1, [r6, #4]
    124:  f7ff ffd7  bl    d6 <u32_muladd64>
    128:  6060       str   r0, [r4, #4]

    // M4
     d4:  6855       ldr   r5, [r2, #4]
     d6:  6844       ldr   r4, [r0, #4]
     d8:  fbe1 4365  umaal r4, r3, r1, r5
     dc:  6044       str   r4, [r0, #4]

We can see two additional MOVS in the M0 case.

- The "movs r3, r1" is because the high half of the result, returned in
  r1, will be used as the 't' argument in the next call, passed as r3.
  This could be avoided by changing the signature of u32_muladd64 in
  order to make 't' the second parameter.
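For context, the calling-convention costs above hinge on u32_muladd64's
signature and on how each step feeds the next. Here is a minimal C sketch,
assuming the signature implied by the discussion (x, y, z inputs, 't' as
the last parameter, 64-bit result); the bodies are illustrative
assumptions, not the actual source:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: returns x * y + z + t as a 64-bit value. The sum cannot
 * overflow, since (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1. On DSP cores
 * this is a single umaal; on DSP-less cores the compiler expands the
 * 32x32->64 multiply into dozens of instructions, which is why keeping
 * it out-of-line (and looping around the call) can win there.
 */
static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
{
    return (uint64_t) x * y + z + t;
}

/*
 * Hypothetical shape of U288_MULADD_STEP: the high half of the result
 * becomes the 't' input of the next step -- on M0 that handoff is the
 * "movs r3, r1" discussed above. Uses locals x, y, z, c from the
 * enclosing function.
 */
#define U288_MULADD_STEP(i)                             \
    do {                                                \
        uint64_t r = u32_muladd64(x, y[i], z[i], c);    \
        z[i] = (uint32_t) r;                            \
        c = (uint32_t) (r >> 32);                       \
    } while (0)
```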
  However, I couldn't find a way to do that without adding a "movs"
  somewhere near the end of the function, which negates the expected
  benefits of this change in terms of execution time (but still saved a
  few bytes).

- The "movs r0, r5" is because u32_muladd64 doesn't preserve its "x"
  argument, so the value of u288_muladd's "x" needs to be copied every
  time. The only way to get around that would be to manually inline
  u32_muladd64 and make it respect a stronger contract than what's
  guaranteed by the ABI standard (that is, preserve the value of this one
  argument).

Getting rid of the second movs seems like a lot of work, and it's not
even clear it would result in better performance than a loop: it would
avoid cost (1) above, but u32_muladd64 needs 7 registers, and we need to
preserve the values of u288_muladd's "y" and "z" too, so that's 9 values;
on M0 (v6-M, Thumb1) most instructions can only use the lower 8
registers, so we'd need to use the stack anyway, ie cost (2) wouldn't be
totally eliminated, and cost (3) would always remain. (Perhaps there is
more hope for a performance improvement on M3, which with Thumb2 can
efficiently use more registers.)

Overall, I decided not to attempt this effort, with unclear chances of
improving the performance, especially as it's clear that code size would
increase compared to a loop even in the best case.
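For reference, the looped form kept on these cores can be sketched as
follows. This is a hypothetical sketch, not the actual source: the names
and the 9-word z / 8-word y shapes follow this commit, the carry-out
return follows the asm shown in the next section, and the bodies are my
assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* x * y + z + t; the sum cannot overflow 64 bits */
static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
{
    return (uint64_t) x * y + z + t;
}

/* z[0..8] += x * y[0..7], returning the final carry out */
static uint32_t u288_muladd(uint32_t z[9], uint32_t x, const uint32_t y[8])
{
    uint32_t c = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint64_t r = u32_muladd64(x, y[i], z[i], c);
        z[i] = (uint32_t) r;
        c = (uint32_t) (r >> 32);  /* high half feeds the next step's 't' */
    }
    /* propagate the last carry into the 9th word */
    uint64_t r = (uint64_t) z[8] + c;
    z[8] = (uint32_t) r;
    return (uint32_t) (r >> 32);
}
```

On M0 the loop body is one BL plus the register shuffling shown above;
duplicating the dozens-of-instructions u32_muladd64 body eight times is
what the code-size figures argue against.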
DSP cores (eg Cortex-M4)
------------------------

I hoped to be able to reduce code size by using LDM/STM in the unrolled
body, like this (which would be the body of u288_muladd on those cores):

    __asm__(
        ".syntax unified\n\t"
        "mov r3, #0\n\t"
        "ldm %[y]!, {r4, r6, r8, r10}\n\t"
        "ldm %[z], {r5, r7, r9, r11}\n\t"
        "umaal r5, r3, %[x], r4\n\t"
        "umaal r7, r3, %[x], r6\n\t"
        "umaal r9, r3, %[x], r8\n\t"
        "umaal r11, r3, %[x], r10\n\t"
        "stm %[z]!, {r5, r7, r9, r11}\n\t"
        "ldm %[y]!, {r4, r6, r8, r10}\n\t"
        "ldm %[z], {r5, r7, r9, r11, r12}\n\t"
        "umaal r5, r3, %[x], r4\n\t"
        "umaal r7, r3, %[x], r6\n\t"
        "umaal r9, r3, %[x], r8\n\t"
        "umaal r11, r3, %[x], r10\n\t"
        "adds r12, r3\n\t"
        "stm %[z], {r5, r7, r9, r11, r12}\n\t"
        "mov %[z], #0\n\t"
        "adc %[z], %[z], #0\n\t"
        : [y] "+l" (y), [z] "+l" (z)
        : [x] "l" (x)
        : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
          "cc", "memory"
    );
    return (uint32_t) z;

This has the following effects:
- code size: -16 bytes on M4 and A7
- stack usage: +20 bytes on M4 and A7
- performance: 621 -> 625 ms on M4, 55.5 -> 51.7 ms on A7
- source: more asm

This is a mixed picture; in the end I decided not to make that change.
OTOH, I'm keeping the basic unrolling, as the performance impact (+20% on
M4) is worth the +56 bytes of code size IMO (and it has no impact on
stack usage and keeps the amount of asm low).
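Putting the two halves of the decision together: the unroll-only-on-DSP
selection can be sketched in C with the ACLE feature macro
__ARM_FEATURE_DSP (defined on cores that have umaal). This is a sketch of
the resulting structure under assumed p256-m-style signatures, not the
actual source:

```c
#include <assert.h>
#include <stdint.h>

/* x * y + z + t; single umaal on DSP cores, many instructions otherwise */
static uint64_t u32_muladd64(uint32_t x, uint32_t y, uint32_t z, uint32_t t)
{
    return (uint64_t) x * y + z + t;
}

/* one step of z += x * y; the high half carries into the next 't' */
#define U288_MULADD_STEP(i)                             \
    do {                                                \
        uint64_t r = u32_muladd64(x, y[i], z[i], c);    \
        z[i] = (uint32_t) r;                            \
        c = (uint32_t) (r >> 32);                       \
    } while (0)

/* z[0..8] += x * y[0..7], returning the final carry out */
static uint32_t u288_muladd(uint32_t z[9], uint32_t x, const uint32_t y[8])
{
    uint32_t c = 0;
#if defined(__ARM_FEATURE_DSP)
    /* DSP cores: unroll -- each step compiles down to one umaal */
    U288_MULADD_STEP(0); U288_MULADD_STEP(1);
    U288_MULADD_STEP(2); U288_MULADD_STEP(3);
    U288_MULADD_STEP(4); U288_MULADD_STEP(5);
    U288_MULADD_STEP(6); U288_MULADD_STEP(7);
#else
    /* DSP-less cores: u32_muladd64 is too big to duplicate 8 times */
    for (unsigned i = 0; i < 8; i++)
        U288_MULADD_STEP(i);
#endif
    uint64_t r = (uint64_t) z[8] + c;
    z[8] = (uint32_t) r;
    return (uint32_t) (r >> 32);
}
```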