C operator for scalable vector types #13
I'm not a huge fan of mixing apples and oranges (except in sponge cake ;-) ). For fixed-width vectors, it's already not obvious: AVX, for instance, doesn't type the register, so the 'SEW' isn't known and the operator has no clear meaning. For NEON/SVE/V with the more specific (a.k.a. 'better ;-) ') types it can be reasonably meaningful, but then there's the relationship with VL, which is an issue in V:
So, how wide is 'd' now? To a casual reader, it's less than obvious. Both inputs should be 16-wide (assuming VLMAX>=16 for SEW=32/LMUL=1 ...). ... or the compiler can just spew an error message and force the user to write The problem with implicit 'VL' is that it semantically works on instructions, but in many developers' minds, the 'VL' is going to be associated with the data itself (i.e., as an analogy, 'N' is the size of the array and is therefore used as the bound of the loop; not the other way around...). So for this specific issue - I would say, none.
Here are several operators we might need to discuss; this table is organized from the wiki page Operators in C and C++. Arithmetic, comparison, relational, logical, bit-wise and compound assignment operators are controversial, because those relate to #8, how to pass the
So I would like to discuss the other part, operators not listed above. Assume
Pointer type, assume
The point we need to discuss is: should we support those operators with scalable vector types? If so, what are the semantics?
SiFive's implementation: Assume
Pointer type, assume
I believe that if we choose to implement the GCC extension for vector types for these types, perhaps the more reasonable thing to do here is to give them VLMAX semantics, just to agree with the existing practice of using vectors like "big scalars". However, I see how this introduces confusion in the context of implicit vector length, because one could argue that the VL is also implicit in C builtin operations. Also, I think the lack of control over VLMAX by the user makes these extensions not very useful in general (how much are we going to load/store?). So I would be inclined, for now, not to extend C builtin operations to RVV vectors, to avoid introducing legacy. Assignment is a bit of a special case. It seems too fundamental to disallow (otherwise nothing will work), so I'd expect
to copy the whole vector (aligned with my expectation that these "builtin" operations use VLMAX). This is important if we want to preserve the as-if behaviour that an assignment allows the user to "name" a value (and this is why we can replace usages of va with vb in the compiler). If we only copy up to VL, I think we might be breaking this assumption. Does this make sense?
@rofirrim I agree we should support
We should all be aware of the possibility of implementation-defined behavior for inactive and tail elements:
Besides the assignment operator, it would also make sense to me to implement pointer indirection and the arithmetic and comparison operators on VL length. Methinks that users would be more comfortable porting the core of their algorithms if they could retain at least some of their original algebraic syntax.
@ebahapo :
I thought it might kill performance if we define assignment/ Extending the example in my last comment:
For those 3 cases, vcopy, recompute and
Joining a bit late :) I'm working on portable wrappers for intrinsics at Google and heard complaints about verbose code. Where possible, operators are very helpful for readability.

Somewhat related: our goal is to reduce the large cost of implementing and porting by having the same code compile for multiple platforms, including RVV. If the code requests VL-aware or even masked load/store unnecessarily, that would be expensive on other platforms. The same code could be efficient everywhere if we have the main loop using VLMAX, and a second cleanup 'loop' using masks or avl. Thus it would be nice to have VLMAX operators for the first loop, especially if the app does not need a cleanup because it is able to pad inputs/outputs to VLMAX. Does that make sense? (BTW an ARM engineer seemed receptive to such operators for SVE ACLE.)
Hi @jan-wassenberg, sorry, I don't completely follow your proposal. Please let me clarify where I am confused. This thread concerned extending C operators to the new RVV types, which exposes a kind of impedance mismatch between the underlying assembly language and the C abstraction. There was consensus that the assignment operator would operate up to VLMAX, but it was unclear whether these semantics should also apply to the other operators. For example, currently we can express N-by-N matrix multiply (C += A*B) in RVV intrinsics something like this:
It's tempting to extend the
However, like you mention, if these operators are defined to work on VLMAX elements, then we will need to add cleanup code to handle the fringe case, which will use the

EDIT: to be clear, I'm not opposed to the VLMAX semantics --- it doesn't remove any functionality --- I'm just concerned that it doesn't yield a net decrease in the intrinsics programmer's cognitive burden.
Hi @knightsifive , thanks for looking into this. For c += a*b it makes sense to use FMA despite the increased verbosity, but c += a is enough to show the operator.

Let's imagine we take your code, which looks good for RVV, and replace each intrinsic with a wrapper function, then re-implement the wrappers using AVX2. Because the loop relies on VL, each iteration would have to check whether vl==vlmax, because AVX2 has neither VL nor masks and fairly expensive masked load/store. That is wasteful because only the last iteration actually needs it.

Now if we have a VLMAX first loop followed by cleanup, I agree with you that it is more verbose and also larger code. In the AVX2 case, we can still expect a performance benefit. (ICC also generates two such loops even for AVX3.)

In my experience with the JPEG XL image codec, we are often able to arrange for N to be a multiple of VLMAX, or at least make it safe to pretend it is by padding all inputs/outputs. Then we do not need a cleanup loop, and it would be nice if the first loop is able to use the shorter and more readable operators.

Why talk about AVX2 here? I imagine not all software is going to be rewritten specifically for RVV in the VL style. Projects such as OpenCV/JPEG XL already have such wrappers and would hope to write (performance-portable) code only once, not per platform. Is that something we would want to enable for faster adoption and porting?

I am actually not sure the above use case cares whether += uses VL or VLMAX, but I do hope that operators would be included/allowed for readability.
@kito-cheng now that we are moving to explicit VL, it seems a good time to resume this discussion. Can the compiler define operator+= builtin functions? Unfortunately overloading them in normal C++ code is not possible because the arguments (vuint*) are built-in types, not user-defined. |
Apologies for the very late reply. I would prefer to block most operations at first and then relax later if needed. The list of what should be allowed in the first version, in my mind, is:
I think those operators should be supported when the size is known, and maybe we should only support them on VLS types (e.g. int32x8_t), but we haven't discussed this part thoroughly yet, although upstream LLVM has some initial support there.
@kito-cheng
"relax later if needed" would have unfortunate consequences: users of a generic interface (e.g. Highway) would have to use Div etc now instead of operators, and once written I doubt code would be changed back to operators (some risk of introducing mistakes). Is it infeasible to provide a builtin that behaves as if the following were allowed? |
FWIW, Clang (but not GCC) lets you pass its own built-in vector types to RVV intrinsics: https://godbolt.org/z/vTYsPWsMT That is:

```c
#define VECTOR_BITS 256 // use __riscv_v_min_vlen if you don't care (or __riscv_v_fixed_vlen if it is defined)

using fixed_vuint8m1_t = uint8_t __attribute__((vector_size(VECTOR_BITS / 8)));

fixed_vuint8m1_t add(fixed_vuint8m1_t a, fixed_vuint8m1_t b) {
    return __riscv_vadd(a, b, 32);
}
```

works perfectly well provided the type is no larger than an m8 register of the minimum vector length of the compilation target. It also works for other architectures (but still not with GCC). That example isn't interesting because addition is already supported by the compiler, but it does mean you can switch freely to and from intrinsics where necessary. Consequently, you can introduce VL by switching to intrinsics. So in a sense C operator support is already halfway there. The only missing part is the ability to create an unsized vector with
What kinds of C operators should we support for scalable vector types? What are the semantics of C operators on scalable vector types? Should they operate on VLMAX or vl or something else?
What are the behavior and limitations of scalable vector types?