Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Part2: Vectorize operations for Radix-2 FFT & re-work vectorization part #254

Merged
merged 13 commits into from
Dec 20, 2018

Conversation

lamphamsy
Copy link
Contributor

@lamphamsy lamphamsy commented Oct 15, 2018

The target branch eh/fft_radix2_and_re_work_sim would be merged into master when two parts are done: #253 #254

Part2

  • Vectorize operations for Radix-2 FFT
  • Re-work also vectorization part to avoid duplication codes

@lamphamsy lamphamsy changed the title Vectorize operations for Radix-2 FFT & re-work vectorization part Part2: Vectorize operations for Radix-2 FFT & re-work vectorization part Oct 15, 2018
@lamphamsy lamphamsy force-pushed the eh/radix2_fft branch 2 times, most recently from 2a54e06 to af01aed Compare October 19, 2018 13:21
src/fft_2n.cpp Show resolved Hide resolved
src/simd_128.h Outdated

const m128i F3_m128i_u16 = _mm_set1_epi16(257); // NOLINT(cert-err58-cpp)
const m128i F3minus1_m128i_u16 = _mm_set1_epi16(256); // NOLINT(cert-err58-cpp)
#define F4_u32 _mm_set1_epi32(65537)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we replacing the const by preprocessor define?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to avoid the NOLINT comment only :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good reason.

const are typed, preprocessor constant aren't.
const can be printed in GDB, preprocessor constant no (except with some debugging option).

So, I would rather keep the NOLINT and the const.

Moreover, the NOLINT can be removed for recent Clang (>= 6), so no reason to replace the const by a define.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that const gives worse performances than define. I check that the use of const will create __static_initialization_and_destruction_0. Maybe it decreases the performances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that const gives worse performances than define

Are you sure? By how much?

the use of const will create __static_initialization_and_destruction_0. Maybe it decreases the performances.

I don't think so, I'm almost sure that this function is only called once, before main (during the C runtime initialization).

I ran the benchmark (Release mode, SIMD=ON, #samples=1000) on my machine and I haven't got conclusive results that prove a real negative impact from const.

Benchmark results using define

FEC m packet size Enc. lat (µs) Enc. throughput (MB/s) Dec. lat (µs) Dec. throuput (MB/s)
rs-fnt 240 256 5522.86 ± 222.015 2373.26 10496.5 ± 467.663 78.0454
rs-fnt 240 512 5483.07 ± 239.125 2390.48 10607 ± 431.391 77.2318
rs-fnt 240 1024 5366.83 ± 230.948 2442.26 10342.7 ± 433.89 79.206
rs-fnt 1008 256 23123.1 ± 807.104 2267.38 50222.8 ± 1436.43 16.3113
rs-fnt 1008 512 23857.8 ± 1066.64 2197.56 47290.8 ± 1941.65 17.3226
rs-fnt 1008 1024 27243.1 ± 1149.77 1924.48 60185 ± 3605.52 13.6114
rs-fnt-sys 240 256 13804.7 ± 476.495 890.131 15471.8 ± 559.92 52.948
rs-fnt-sys 240 512 14075.7 ± 534.422 872.994 15321.9 ± 575.059 53.4661
rs-fnt-sys 240 1024 14343.9 ± 650.119 856.669 15152.5 ± 580.651 54.0637
rs-fnt-sys 1008 256 65326.7 ± 2384.91 790.024 70649.7 ± 2187.57 11.5952
rs-fnt-sys 1008 512 72595.2 ± 2733.07 710.923 73075 ± 3558 11.2104
rs-fnt-sys 1008 1024 80948.2 ± 3173.47 637.564 90715.6 ± 3893.63 9.03042

Benchmark results using const

FEC m packet size Enc. lat (µs) Enc. throughput (MB/s) Dec. lat (µs) Dec. throuput (MB/s)
rs-fnt 240 256 5508.43 ± 227.965 2379.48 10360.4 ± 412.958 79.0706
rs-fnt 240 512 5480.7 ± 234.837 2391.52 10552.5 ± 432.283 77.6305
rs-fnt 240 1024 5429.65 ± 281.588 2414 10338.2 ± 427.177 79.2404
rs-fnt 1008 256 23129.9 ± 907.399 2266.71 49949.6 ± 1596.99 16.4005
rs-fnt 1008 512 24179.1 ± 1458.99 2168.35 47136.3 ± 3114.64 17.3794
rs-fnt 1008 1024 26870.8 ± 1176.66 1951.14 59786 ± 3434.83 13.7022
rs-fnt-sys 240 256 13777.5 ± 562.277 891.889 15263 ± 514.184 53.6722
rs-fnt-sys 240 512 14071.1 ± 536.787 873.28 15303.4 ± 547.433 53.5305
rs-fnt-sys 240 1024 14326.6 ± 645.201 857.702 15114.8 ± 674.564 54.1986
rs-fnt-sys 1008 256 65830.4 ± 2277.08 783.978 70598.5 ± 2168.34 11.6036
rs-fnt-sys 1008 512 72133.8 ± 2794.12 715.47 73205.2 ± 3214.68 11.1905
rs-fnt-sys 1008 1024 82156.5 ± 2912.75 628.186 89188.1 ± 3586.39 9.18508

Comparison

Encoding

define (MB/s) const (MB/s) diff (%)
2373.26 2379.48 0.26
2390.48 2391.52 0.04
2442.26 2414 -1.16
2267.38 2266.71 -0.03
2197.56 2168.35 -1.33
1924.48 1951.14 1.39
890.131 891.889 0.2
872.994 873.28 0.03
856.669 857.702 0.12
790.024 783.978 -0.77
710.923 715.47 0.64
637.564 628.186 -1.47

Decoding

define (MB/s) const (MB/s) diff (%)
78.0454 79.0706 1.31
77.2318 77.6305 0.52
79.206 79.2404 0.04
16.3113 16.4005 0.55
17.3226 17.3794 0.33
13.6114 13.7022 0.67
52.948 53.6722 1.37
53.4661 53.5305 0.12
54.0637 54.1986 0.25
11.5952 11.6036 0.07
11.2104 11.1905 -0.18
9.03042 9.18508 1.71

src/simd.h Outdated Show resolved Hide resolved
{
const m256i _card = _mm256_set1_epi16(card);
const m256i _card_minus_1 = _mm256_set1_epi16(card - 1);
#define F3_u16 _mm256_set1_epi16(257)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skip the review of this commit since the whole file and thmeir code are delete in another commit.

BTW, what's the point of this commit? Since all the changes are deleted (and another commit reimplement it in another file).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering what the role of this commit?

The file is deleted just after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These commits implement new algorithms for modular operations. Next commits are based on it to refactor codes :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what I'm seeing in the Git history.

  • SIMD 256 u16 & u32: update add codes in src/simd_256_u16.h
  • SIMD 256 u16 u32: remove useless files delete the WHOLE file (so all the code is removed).
  • And finally, SIMD 256: essential operations for AVX reimplements the content but differently than from SIMD 256 u16 & u32: update

By looking at the Git history, it seems like only the 2 last commits useful, the first one (this one and its sibling for SSE) can be removed.

And in fact, if I checkout your branch and remove these 2 commits during an interactive rebase, then everything works, with no conflict.

So you should remove them, they are just confusing.

Copy link
Contributor Author

@lamphamsy lamphamsy Nov 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. I will remove them

  • remove useless commits

@@ -154,6 +154,17 @@ inline m128i mul_f4(m128i a, m128i b)
return mod_after_multiply_f4(c);
}

inline m128i mul_f4_simple(m128i a, m128i b)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skip the review of this commit since the whole file and thmeir code are delete in another commit.

BTW, what's the point of this commit? Since all the changes are deleted (and another commit reimplement it in another file).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering what the role of this commit?

The file is deleted just after.

src/simd_fnt.h Outdated Show resolved Hide resolved
src/simd_fnt.h Show resolved Hide resolved
src/simd_fnt.h Outdated Show resolved Hide resolved
src/simd_fnt.h Outdated Show resolved Hide resolved
src/simd_fnt.h Outdated
for (unsigned i = start; i < bufs_nb; i += step) {
VecType x1, y1;
VecType x2, y2;
VecType* __restrict p = reinterpret_cast<VecType*>(mem[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid restrict (see my comment about it), or if we really need it then be extra-careful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • test diff performance w/ and w/o __restrict

Copy link
Contributor Author

@lamphamsy lamphamsy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slaperche-scality the code is updated. I separate the changes in commits. Thanks for your next reviews.

src/simd_128.h Outdated

const m128i F3_m128i_u16 = _mm_set1_epi16(257); // NOLINT(cert-err58-cpp)
const m128i F3minus1_m128i_u16 = _mm_set1_epi16(256); // NOLINT(cert-err58-cpp)
#define F4_u32 _mm_set1_epi32(65537)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that const gives worse performances than define. I check that the use of const will create __static_initialization_and_destruction_0. Maybe it decreases the performances.

src/simd_128.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated Show resolved Hide resolved
src/simd_fnt.h Outdated Show resolved Hide resolved
src/simd_nf4.h Outdated

#if defined(__AVX2__)

inline VecType CAST_TO_DOUBLE(HalfVecType x)
Copy link
Contributor Author

@lamphamsy lamphamsy Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • rename CAST_TO_DOUBLE

src/simd_nf4.h Outdated Show resolved Hide resolved
src/simd_nf4.h Outdated Show resolved Hide resolved
src/simd_nf4.h Show resolved Hide resolved
@lamphamsy lamphamsy changed the base branch from eh/radix2_fft to eh/fft_radix2_and_re_work_simd October 30, 2018 15:09
@lamphamsy lamphamsy force-pushed the eh/fft_radix2_and_re_work_simd branch from 3a4c9d0 to 1cbb261 Compare October 30, 2018 15:16
@lamphamsy
Copy link
Contributor Author

@slaperche-scality could you review it?

@slaperche-scality
Copy link
Contributor

Could you squash all the "rename *" commits into the one that introduced the uppercase names?
(basically, edit the commit that introduces these functions instead of adding extra commit on the top).

That would ease the review and keep the git log readable 🙂

@lamphamsy
Copy link
Contributor Author

@slaperche-scality sure, the commits are squashed. Thanks :)

Copy link
Contributor

@slaperche-scality slaperche-scality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review still in progress

src/property.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated Show resolved Hide resolved
Copy link
Contributor

@slaperche-scality slaperche-scality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

We're almost here.

Several commits can be squashed together:

SIMD 256 u16 u32: remove useless files
SIMD 128 u16 u32: remove useless files

SIMD 128: use template functions
SIMD 256: use template functions
SIMD Basic: use templated essential functions

SIMD 128: add function is_all_zeros
SIMD 256: add function is_all_zeros
SIMD 128: fix is_all_zeros

SIMD 256: remove NF4Type
SIMD NF4: remove NF4Type

FFT_2n.h: compute simd indices
FFT_2n.cpp: remove calculation of indices

FFT_2n.h: define butterfly_ct_two_layers_step_slow
FFT_2n.cpp: use butterfly_ct_two_layers_step_slow

FEC RS FNT: simd indices as member variables
FEC Vectorisation: use FNT's simd indices

SIMD Basic: clang-format fix
SIMD NF4: clang-format fix

src/simd_basic.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated Show resolved Hide resolved
src/simd_128.h Outdated
@@ -81,6 +81,10 @@ inline uint16_t TESTZ(VecType x, VecType y)
{
return _mm_testz_si128(x, y);
}
inline int is_all_zeros(VecType x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return a bool since it's a predicate (starts with is_).

Copy link
Contributor Author

@lamphamsy lamphamsy Nov 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's better

  • is_xx return bool

src/simd_basic.h Outdated
const VecType hi = (q == F3) ? BLEND8(ZERO, SHIFTR(res, 1), MASK8_LO)
: BLEND16(ZERO, SHIFTR(res, 2), 0x55);
return SUB_MOD(lo, hi, q);
if (is_all_zeros(cmp) == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you apply my comment about returning a bool for is_all_zeros, then you can get rid of the == 1 part.

@@ -168,9 +168,15 @@ inline __uint128_t pack(__uint128_t a, uint32_t flag)

#if defined(__AVX2__)

inline VecType CAST_TO_DOUBLE(HalfVecType x)
inline VecType load_to_reg(HalfVecType x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the _a and _b can probably be const, no?

Also, try to avoid prefixing name with _, this is usually reserved for the compiler and internals.

Copy link
Contributor Author

@lamphamsy lamphamsy Dec 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • remove underscored from names

src/simd_128.h Outdated Show resolved Hide resolved
src/simd_basic.h Outdated Show resolved Hide resolved
src/simd_basic.h Outdated Show resolved Hide resolved
src/simd_basic.h Outdated Show resolved Hide resolved
src/simd_basic.h Outdated Show resolved Hide resolved
@lamphamsy lamphamsy force-pushed the eh/rework_simd branch 2 times, most recently from 379607f to 4acbf0f Compare November 30, 2018 14:19
Copy link
Contributor Author

@lamphamsy lamphamsy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reviews. There are two points:

  • remove __restrict?
  • use const or #define for const register variables

@slaperche-scality
Copy link
Contributor

Nice squashing, the PR looks better now.

I think we can still squash some:

  • "SIMD 256 u16 u32: remove useless files" + "SIMD 256 u16 u32: remove useless files" = "SIMD{128,256}/{u16,u32}: remove useless files"
  • "FFT_2n.h: compute indices for SIMD" + "FFT_2n.cpp: compute indices for SIMD" = "FFT_2n: compute indices for SIMD"
  • "FEC RS FNT: simd indices as member variables" + "FEC Vectorisation: use FNT's simd indices"

@lamphamsy lamphamsy force-pushed the eh/rework_simd branch 2 times, most recently from 723eb38 to d1d7a3f Compare December 2, 2018 03:25
@lamphamsy lamphamsy force-pushed the eh/rework_simd branch 2 times, most recently from 29358aa to 9302df1 Compare December 2, 2018 05:02
@vrancurel vrancurel added the Release 0.1 Initial Release label Dec 14, 2018
Copy link
Contributor

@slaperche-scality slaperche-scality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just keep the uppercase naming and squash what need to be squash, and then we're good to merge it :)

src/simd_256.h Outdated
@@ -57,28 +57,26 @@ typedef __m128i HalfVecType;
// with static storage duration may throw an exception that cannot be caught

// NOLINTNEXTLINE(cert-err58-cpp)
const VecType F4_U32 = _mm256_set1_epi32(65537);
const VecType vec32_f4 = _mm256_set1_epi32(65537);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should keep the UPPERCASE naming.

They're not #define anymore, but they are still global constants and as entities with a global scope they should be in UPPERCASE.

@lamphamsy lamphamsy force-pushed the eh/rework_simd branch 2 times, most recently from ed82ce8 to ed7127e Compare December 20, 2018 12:02
The SIMD parts will be re-implemented in next commits
1. Essential operations

- simd_128.h contains essential wrappers of SIMD operations on SSE
- simd_256.h contains essential wrappers of SIMD operations on AVX

2. Basic operations

- simd_basic.h contain basic operations used in following cases, and
also operations for RingModN

3. Vectorized operations

- simd_fnt.h contains vectorized operations dedicated for FNT
- simd_nf4.h contains vectorized operations dedicated for nf4
It implements basic operations that will be used everywhere.
It includes also operations for RingModN
- Indices for SIMD parts are computed once in FFT function
- Define butterfly_ct_two_layers_step_slow for non-vectorized functions
Copy link
Contributor

@slaperche-scality slaperche-scality left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At long last: this can be merged.
Go 🙂

merge

@lamphamsy
Copy link
Contributor Author

@slaperche-scality thank you 👍

@lamphamsy lamphamsy merged commit 1171bdb into eh/fft_radix2_and_re_work_simd Dec 20, 2018
@lamphamsy lamphamsy mentioned this pull request Jun 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Release 0.1 Initial Release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants