C operator for scalable vector types #13
I'm not a huge fan of mixing apples and oranges (except in sponge cake ;-) ). For fixed-width vectors, it's already not obvious: AVX, for instance, doesn't type the register, so the 'SEW' isn't known and the operator has no clear meaning. For NEON/SVE/V with the more specific (a.k.a. 'better ;-) ') types it can be reasonably meaningful, but then there's the relationship with VL, which is an issue in V:
So, how wide is 'd' now? To a casual reader, it's less than obvious. Both inputs should be 16-wide (assuming VLMAX>=16 for SEW=32/LMUL=1 ...). ... or the compiler can just spew an error message and force the user to write The problem with implicit 'VL' is that it semantically works on instructions, but in many developers' minds, the 'VL' is going to be associated with the data itself (i.e., as an analogy, 'N' is the size of the array and is therefore used as the bound of the loop; not the other way around...). So for this specific issue - I would say, none.
Here are several operators we might need to discuss; this table is organized from the wiki page Operators in C and C++. Arithmetic, comparison, relational, logical, bit-wise and compound assignment operators are controversial, because those relate to #8, how to pass the
So I would like to discuss the other part, operators not listed above. Assume
Pointer type, assume
The point we need to discuss is: should we support those operators with scalable vector types? If so, what are the semantics?
SiFive's implementation: Assume
Pointer type, assume
I believe that if we choose to implement the GCC extension for vector types for these types, perhaps the more reasonable thing to do here is to give them VLMAX semantics, just to agree with the existing practice of using vectors like "big scalars". However, I see how this introduces confusion in the context of implicit vector length, because one could argue that the VL is also implicit in C builtin operations. Also, I think the lack of control over VLMAX by the user makes these extensions not very useful in general (how much are we going to load/store?). So I would be inclined, for now, not to extend C builtin operations to RVV vectors, to avoid introducing legacy. Assignment is a bit of a special case. It seems too fundamental to disallow (otherwise nothing will work), so I'd expect
to copy the whole vector (aligned with my expectation that these "builtin" operations use VLMAX). This is important if we want to preserve the as-if behaviour that an assignment allows the user to "name" a value (and this is why we can replace usages of va with vb in the compiler). If we only copy up to VL, I think we might be breaking this assumption. Does this make sense?
@rofirrim I agree we should support
We should all be aware of the possibility of implementation-defined behavior for inactive and tail elements:
Besides the assignment operator, it would also make sense to me to implement pointer indirection and the arithmetic and comparison operators on VL length. Methinks that users would be more comfortable porting the core of their algorithms if they could retain at least some of their original algebraic syntax.
@ebahapo :
I thought it might kill performance if we define assignment/ Extending the example in my last comment:
For those 3 cases, vcopy, recompute and
Joining a bit late :) I'm working on portable wrappers for intrinsics at Google and heard complaints about verbose code. Where possible, operators are very helpful for readability.

Somewhat related: our goal is to reduce the large cost of implementing and porting by having the same code compile for multiple platforms, including RVV. If the code requests VL-aware or even masked load/store unnecessarily, that would be expensive on other platforms. The same code could be efficient everywhere if we have the main loop using VLMAX, and a second cleanup 'loop' using masks or avl. Thus it would be nice to have VLMAX operators for the first loop, especially if the app does not need a cleanup because it is able to pad inputs/outputs to VLMAX. Does that make sense? (BTW an ARM engineer seemed receptive to such operators for SVE ACLE.)
Hi @jan-wassenberg, sorry, I don't completely follow your proposal. Please let me clarify where I am confused. This thread concerned extending C operators to the new RVV types, which exposes a kind of impedance mismatch between the underlying assembly language and the C abstraction. There was consensus that the assignment operator would operate up to VLMAX, but it was unclear whether these semantics should also apply to the other operators. For example, currently we can express N-by-N matrix multiply (C += A*B) in RVV intrinsics something like this:
It's tempting to extend the
However, like you mention, if these operators are defined to work on VLMAX elements, then we will need to add cleanup code to handle the fringe case, which will use the

EDIT: to be clear, I'm not opposed to the VLMAX semantics --- it doesn't remove any functionality --- I'm just concerned that it doesn't yield a net decrease in the intrinsics programmer's cognitive burden.
Hi @knightsifive , thanks for looking into this. For c += a*b it makes sense to use FMA despite the increased verbosity, but c += a is enough to show the operator.

Let's imagine we take your code, which looks good for RVV, and replace each intrinsic with a wrapper function, then re-implement the wrappers using AVX2. Because the loop relies on VL, each iteration would have to check whether vl==vlmax, because AVX2 has neither VL nor masks and fairly expensive masked load/store. That is wasteful because only the last iteration actually needs it.

Now if we have a VLMAX first loop followed by cleanup, I agree with you that it is more verbose and also larger code. In the AVX2 case, we can still expect a performance benefit. (ICC also generates two such loops even for AVX3.)

In my experience with the JPEG XL image codec, we are often able to arrange for N to be a multiple of VLMAX, or at least make it safe to pretend it is by padding all inputs/outputs. Then we do not need a cleanup loop, and it would be nice if the first loop is able to use the shorter and more readable operators.

Why talk about AVX2 here? I imagine not all software is going to be rewritten specifically for RVV in the VL style. Projects such as OpenCV/JPEG XL already have such wrappers and would hope to write (performance-portable) code only once, not per platform. Is that something we would want to enable for faster adoption and porting?

I am actually not sure the above use case cares whether += uses VL or VLMAX, but I do hope that operators would be included/allowed for readability.
@kito-cheng now that we are moving to explicit VL, it seems a good time to resume this discussion. Can the compiler define operator+= builtin functions? Unfortunately overloading them in normal C++ code is not possible because the arguments (vuint*) are built-in types, not user-defined. |
Apologies for the very late reply. I would prefer to block most operations at first and then relax later if needed. The list of what should be allowed in the first version, in my mind, is:
I think those operators should be supported when the size is known, and maybe we should only support them on VLS types (e.g. int32x8_t), but we haven't discussed this part thoroughly yet, although upstream LLVM has some initial support there.
@kito-cheng
"relax later if needed" would have unfortunate consequences: users of a generic interface (e.g. Highway) would have to use Div etc now instead of operators, and once written I doubt code would be changed back to operators (some risk of introducing mistakes). Is it infeasible to provide a builtin that behaves as if the following were allowed? |
FWIW, Clang (but not GCC) lets you pass its own built-in vector types to RVV intrinsics: https://godbolt.org/z/vTYsPWsMT That is:

```c
#define VECTOR_BITS 256 // use __riscv_v_min_vlen if you don't care (or __riscv_v_fixed_vlen if it is defined)

using fixed_vuint8m1_t = uint8_t __attribute__((vector_size(VECTOR_BITS / 8)));

fixed_vuint8m1_t add(fixed_vuint8m1_t a, fixed_vuint8m1_t b) {
    return __riscv_vadd(a, b, 32);
}
```

works perfectly well provided the type is no larger than an m8 register of the minimum vector length of the compilation target. It also works for other architectures (but still not with GCC). That example isn't interesting because addition is already supported by the compiler, but it does mean you can switch freely to and from intrinsics where necessary. Consequently, you can introduce VL by switching to intrinsics. So in a sense C operator support is already halfway there. The only missing part is the ability to create an unsized vector with
What kinds of C operators should we support for scalable vector types? What are the semantics of C operators on scalable vector types? Should they operate on VLMAX or vl or something else?
What are the behavior and limitations of scalable vector types?