-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Part2: Vectorize operations for Radix-2 FFT & re-work vectorization part #254
Conversation
2a54e06
to
af01aed
Compare
src/simd_128.h
Outdated
|
||
const m128i F3_m128i_u16 = _mm_set1_epi16(257); // NOLINT(cert-err58-cpp) | ||
const m128i F3minus1_m128i_u16 = _mm_set1_epi16(256); // NOLINT(cert-err58-cpp) | ||
#define F4_u32 _mm_set1_epi32(65537) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we replacing the const
by preprocessor define
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to avoid the NOLINT comment only :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a good reason.
const are typed, preprocessor constant aren't.
const can be printed in GDB, preprocessor constant no (except with some debugging option).
…
So, I would rather keep the NOLINT and the const.
Moreover, the NOLINT can be removed for recent Clang (>= 6), so no reason to replace the const by a define.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested that const
gives worse performances than define
. I check that the use of const
will create __static_initialization_and_destruction_0
. Maybe it decreases the performances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested that const gives worse performances than define
Are you sure? By how much?
the use of const will create __static_initialization_and_destruction_0. Maybe it decreases the performances.
I don't think so, I'm almost sure that this function is only called once, before main
(during the C runtime initialization).
I ran the benchmark (Release mode, SIMD=ON, #samples=1000) on my machine and I haven't got conclusive results that prove a real negative impact from const.
Benchmark results using define
FEC | m | packet size | Enc. lat (µs) | Enc. throughput (MB/s) | Dec. lat (µs) | Dec. throuput (MB/s) |
---|---|---|---|---|---|---|
rs-fnt | 240 | 256 | 5522.86 ± 222.015 | 2373.26 | 10496.5 ± 467.663 | 78.0454 |
rs-fnt | 240 | 512 | 5483.07 ± 239.125 | 2390.48 | 10607 ± 431.391 | 77.2318 |
rs-fnt | 240 | 1024 | 5366.83 ± 230.948 | 2442.26 | 10342.7 ± 433.89 | 79.206 |
rs-fnt | 1008 | 256 | 23123.1 ± 807.104 | 2267.38 | 50222.8 ± 1436.43 | 16.3113 |
rs-fnt | 1008 | 512 | 23857.8 ± 1066.64 | 2197.56 | 47290.8 ± 1941.65 | 17.3226 |
rs-fnt | 1008 | 1024 | 27243.1 ± 1149.77 | 1924.48 | 60185 ± 3605.52 | 13.6114 |
rs-fnt-sys | 240 | 256 | 13804.7 ± 476.495 | 890.131 | 15471.8 ± 559.92 | 52.948 |
rs-fnt-sys | 240 | 512 | 14075.7 ± 534.422 | 872.994 | 15321.9 ± 575.059 | 53.4661 |
rs-fnt-sys | 240 | 1024 | 14343.9 ± 650.119 | 856.669 | 15152.5 ± 580.651 | 54.0637 |
rs-fnt-sys | 1008 | 256 | 65326.7 ± 2384.91 | 790.024 | 70649.7 ± 2187.57 | 11.5952 |
rs-fnt-sys | 1008 | 512 | 72595.2 ± 2733.07 | 710.923 | 73075 ± 3558 | 11.2104 |
rs-fnt-sys | 1008 | 1024 | 80948.2 ± 3173.47 | 637.564 | 90715.6 ± 3893.63 | 9.03042 |
Benchmark results using const
FEC | m | packet size | Enc. lat (µs) | Enc. throughput (MB/s) | Dec. lat (µs) | Dec. throuput (MB/s) |
---|---|---|---|---|---|---|
rs-fnt | 240 | 256 | 5508.43 ± 227.965 | 2379.48 | 10360.4 ± 412.958 | 79.0706 |
rs-fnt | 240 | 512 | 5480.7 ± 234.837 | 2391.52 | 10552.5 ± 432.283 | 77.6305 |
rs-fnt | 240 | 1024 | 5429.65 ± 281.588 | 2414 | 10338.2 ± 427.177 | 79.2404 |
rs-fnt | 1008 | 256 | 23129.9 ± 907.399 | 2266.71 | 49949.6 ± 1596.99 | 16.4005 |
rs-fnt | 1008 | 512 | 24179.1 ± 1458.99 | 2168.35 | 47136.3 ± 3114.64 | 17.3794 |
rs-fnt | 1008 | 1024 | 26870.8 ± 1176.66 | 1951.14 | 59786 ± 3434.83 | 13.7022 |
rs-fnt-sys | 240 | 256 | 13777.5 ± 562.277 | 891.889 | 15263 ± 514.184 | 53.6722 |
rs-fnt-sys | 240 | 512 | 14071.1 ± 536.787 | 873.28 | 15303.4 ± 547.433 | 53.5305 |
rs-fnt-sys | 240 | 1024 | 14326.6 ± 645.201 | 857.702 | 15114.8 ± 674.564 | 54.1986 |
rs-fnt-sys | 1008 | 256 | 65830.4 ± 2277.08 | 783.978 | 70598.5 ± 2168.34 | 11.6036 |
rs-fnt-sys | 1008 | 512 | 72133.8 ± 2794.12 | 715.47 | 73205.2 ± 3214.68 | 11.1905 |
rs-fnt-sys | 1008 | 1024 | 82156.5 ± 2912.75 | 628.186 | 89188.1 ± 3586.39 | 9.18508 |
Comparison
Encoding
define (MB/s) | const (MB/s) | diff (%) |
---|---|---|
2373.26 | 2379.48 | 0.26 |
2390.48 | 2391.52 | 0.04 |
2442.26 | 2414 | -1.16 |
2267.38 | 2266.71 | -0.03 |
2197.56 | 2168.35 | -1.33 |
1924.48 | 1951.14 | 1.39 |
890.131 | 891.889 | 0.2 |
872.994 | 873.28 | 0.03 |
856.669 | 857.702 | 0.12 |
790.024 | 783.978 | -0.77 |
710.923 | 715.47 | 0.64 |
637.564 | 628.186 | -1.47 |
Decoding
define (MB/s) | const (MB/s) | diff (%) |
---|---|---|
78.0454 | 79.0706 | 1.31 |
77.2318 | 77.6305 | 0.52 |
79.206 | 79.2404 | 0.04 |
16.3113 | 16.4005 | 0.55 |
17.3226 | 17.3794 | 0.33 |
13.6114 | 13.7022 | 0.67 |
52.948 | 53.6722 | 1.37 |
53.4661 | 53.5305 | 0.12 |
54.0637 | 54.1986 | 0.25 |
11.5952 | 11.6036 | 0.07 |
11.2104 | 11.1905 | -0.18 |
9.03042 | 9.18508 | 1.71 |
src/simd_256_u16.h
Outdated
{ | ||
const m256i _card = _mm256_set1_epi16(card); | ||
const m256i _card_minus_1 = _mm256_set1_epi16(card - 1); | ||
#define F3_u16 _mm256_set1_epi16(257) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skip the review of this commit since the whole file and thmeir code are delete in another commit.
BTW, what's the point of this commit? Since all the changes are deleted (and another commit reimplement it in another file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering what the role of this commit?
The file is deleted just after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These commits implement new algorithms for modular operations. Next commits are based on it to refactor codes :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not what I'm seeing in the Git history.
SIMD 256 u16 & u32: update
add codes insrc/simd_256_u16.h
SIMD 256 u16 u32: remove useless files
delete the WHOLE file (so all the code is removed).- And finally,
SIMD 256: essential operations for AVX
reimplements the content but differently than fromSIMD 256 u16 & u32: update
By looking at the Git history, it seems like only the 2 last commits useful, the first one (this one and its sibling for SSE) can be removed.
And in fact, if I checkout your branch and remove these 2 commits during an interactive rebase, then everything works, with no conflict.
So you should remove them, they are just confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. I will remove them
- remove useless commits
src/simd_128_u32.h
Outdated
@@ -154,6 +154,17 @@ inline m128i mul_f4(m128i a, m128i b) | |||
return mod_after_multiply_f4(c); | |||
} | |||
|
|||
inline m128i mul_f4_simple(m128i a, m128i b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skip the review of this commit since the whole file and thmeir code are delete in another commit.
BTW, what's the point of this commit? Since all the changes are deleted (and another commit reimplement it in another file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering what the role of this commit?
The file is deleted just after.
src/simd_fnt.h
Outdated
for (unsigned i = start; i < bufs_nb; i += step) { | ||
VecType x1, y1; | ||
VecType x2, y2; | ||
VecType* __restrict p = reinterpret_cast<VecType*>(mem[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid restrict (see my comment about it), or if we really need it then be extra-careful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- test diff performance w/ and w/o __restrict
4946986
to
acc1ee8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slaperche-scality the code is updated. I separate the changes in commits. Thanks for your next reviews.
src/simd_128.h
Outdated
|
||
const m128i F3_m128i_u16 = _mm_set1_epi16(257); // NOLINT(cert-err58-cpp) | ||
const m128i F3minus1_m128i_u16 = _mm_set1_epi16(256); // NOLINT(cert-err58-cpp) | ||
#define F4_u32 _mm_set1_epi32(65537) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested that const
gives worse performances than define
. I check that the use of const
will create __static_initialization_and_destruction_0
. Maybe it decreases the performances.
src/simd_nf4.h
Outdated
|
||
#if defined(__AVX2__) | ||
|
||
inline VecType CAST_TO_DOUBLE(HalfVecType x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- rename
CAST_TO_DOUBLE
3a4c9d0
to
1cbb261
Compare
160a031
to
0830532
Compare
0830532
to
05d3938
Compare
@slaperche-scality could you review it? |
Could you squash all the "rename *" commits into the one that introduced the uppercase names? That would ease the review and keep the |
05d3938
to
d79d7a1
Compare
@slaperche-scality sure, the commits are squashed. Thanks :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review still in progress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
We're almost here.
Several commits can be squashed together:
SIMD 256 u16 u32: remove useless files
SIMD 128 u16 u32: remove useless files
SIMD 128: use template functions
SIMD 256: use template functions
SIMD Basic: use templated essential functions
SIMD 128: add function is_all_zeros
SIMD 256: add function is_all_zeros
SIMD 128: fix is_all_zeros
SIMD 256: remove NF4Type
SIMD NF4: remove NF4Type
FFT_2n.h: compute simd indices
FFT_2n.cpp: remove calculation of indices
FFT_2n.h: define butterfly_ct_two_layers_step_slow
FFT_2n.cpp: use butterfly_ct_two_layers_step_slow
FEC RS FNT: simd indices as member variables
FEC Vectorisation: use FNT's simd indices
SIMD Basic: clang-format fix
SIMD NF4: clang-format fix
src/simd_128.h
Outdated
@@ -81,6 +81,10 @@ inline uint16_t TESTZ(VecType x, VecType y) | |||
{ | |||
return _mm_testz_si128(x, y); | |||
} | |||
inline int is_all_zeros(VecType x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return a bool
since it's a predicate (starts with is_
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's better
-
is_xx
return bool
src/simd_basic.h
Outdated
const VecType hi = (q == F3) ? BLEND8(ZERO, SHIFTR(res, 1), MASK8_LO) | ||
: BLEND16(ZERO, SHIFTR(res, 2), 0x55); | ||
return SUB_MOD(lo, hi, q); | ||
if (is_all_zeros(cmp) == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you apply my comment about returning a bool
for is_all_zeros
, then you can get rid of the == 1
part.
@@ -168,9 +168,15 @@ inline __uint128_t pack(__uint128_t a, uint32_t flag) | |||
|
|||
#if defined(__AVX2__) | |||
|
|||
inline VecType CAST_TO_DOUBLE(HalfVecType x) | |||
inline VecType load_to_reg(HalfVecType x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the _a
and _b
can probably be const, no?
Also, try to avoid prefixing name with _
, this is usually reserved for the compiler and internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- remove underscored from names
379607f
to
4acbf0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reviews. There are two points:
- remove __restrict?
- use const or #define for const register variables
Nice squashing, the PR looks better now. I think we can still squash some:
|
723eb38
to
d1d7a3f
Compare
29358aa
to
9302df1
Compare
ed316e9
to
b53d062
Compare
b53d062
to
d28cb13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just keep the uppercase naming and squash what need to be squash, and then we're good to merge it :)
src/simd_256.h
Outdated
@@ -57,28 +57,26 @@ typedef __m128i HalfVecType; | |||
// with static storage duration may throw an exception that cannot be caught | |||
|
|||
// NOLINTNEXTLINE(cert-err58-cpp) | |||
const VecType F4_U32 = _mm256_set1_epi32(65537); | |||
const VecType vec32_f4 = _mm256_set1_epi32(65537); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should keep the UPPERCASE naming.
They're not #define anymore, but they are still global constants and as entities with a global scope they should be in UPPERCASE.
ed82ce8
to
ed7127e
Compare
The SIMD parts will be re-implemented in next commits
1. Essential operations - simd_128.h contains essential wrappers of SIMD operations on SSE - simd_256.h contains essential wrappers of SIMD operations on AVX 2. Basic operations - simd_basic.h contain basic operations used in following cases, and also operations for RingModN 3. Vectorized operations - simd_fnt.h contains vectorized operations dedicated for FNT - simd_nf4.h contains vectorized operations dedicated for nf4
It implements basic operations that will be used everywhere. It includes also operations for RingModN
- Indices for SIMD parts are computed once in FFT function - Define butterfly_ct_two_layers_step_slow for non-vectorized functions
ed7127e
to
9c47e18
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slaperche-scality thank you 👍 |
The target branch
eh/fft_radix2_and_re_work_sim
would be merged into master when two parts are done: #253 #254Part2