Public Review: Need for whole register (unpredicated) load/stores to facilitate compilers' load/store elimination #378

Open
camel-cdr opened this issue Oct 21, 2024 · 21 comments
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release

Comments

@camel-cdr
Contributor

The absence of whole register load/store instructions was already discussed in the past with the following conclusion:

The conclusion is that the API intentionally omits whole-register loads, stores, and moves. The rationale is that the usual loads/stores/moves provide the same functionality, and a compiler could instead generate whole-register versions if it thought it would be profitable. If a compelling use-case arises in the future, we could introduce new intrinsics in a backwards-compatible way.

I'd like to posit what I think is a "compelling use-case".

Many libraries use a fixed-size SIMD abstraction type that allows code sharing between existing SIMD ISAs (SSE, AVX, NEON, ...): simdjson, vectorscan, ...
This requires the ability to store the vector register state in data structures, which is currently only properly possible via the riscv_rvv_vector_bits attribute extension supported by both gcc and clang, since it requires a size that is fixed and known at compile time.
This attribute isn't standardized, and because it depends on the -mrvv-vector-bits command line argument it can only be used for a single VLEN without potentially major rearchitecting of code structure and build systems.
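
For illustration, here is a minimal sketch of the attribute-based approach, following the usage documented for clang and gcc (where -mrvv-vector-bits defines __riscv_v_fixed_vlen; the type and struct names are made up):

    #include <riscv_vector.h>

    // Requires -mrvv-vector-bits=..., which defines __riscv_v_fixed_vlen
    // and pins the layout of this typedef to exactly one VLEN.
    typedef vuint32m1_t fixed_u32m1
        __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));

    struct SimdU32 {
        fixed_u32m1 v;   // fixed-size, so it can live inside structs and arrays
    };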

An alternative approach for implementing such abstract SIMD types is to have all operations load/store from a fixed width buffer in the structure, and rely on the compiler eliminating redundant load/stores. This avoids having to store the variable length vector register directly in the SIMD class, and allows multiple implementations of this SIMD type for different VLEN. The generated code using these types would have to be runtime dispatched based on the actual VLEN.
It could also be used to e.g. create a SIMD type that can be stored in data structures, and works for VLEN 128, 256 and 512, by making the buffer always 512-bit wide, and just not using the extra bits for VLEN<512. This isn't possible using the riscv_rvv_vector_bits attribute either, because it assumes a single fixed VLEN.
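
A minimal sketch of what such a buffer-backed wrapper looks like (hypothetical names; assumes SEW=32, LMUL=1, and VLEN<=512):

    #include <riscv_vector.h>
    #include <stdint.h>

    typedef struct { uint32_t data[16]; } SimdU32Buf;   // 512-bit backing buffer

    static inline SimdU32Buf simd_add_u32(SimdU32Buf a, SimdU32Buf b) {
        size_t vl = __riscv_vsetvlmax_e32m1();
        vuint32m1_t va = __riscv_vle32_v_u32m1(a.data, vl);   // reload operand a
        vuint32m1_t vb = __riscv_vle32_v_u32m1(b.data, vl);   // reload operand b
        vuint32m1_t vc = __riscv_vadd_vv_u32m1(va, vb, vl);
        SimdU32Buf c;
        __riscv_vse32_v_u32m1(c.data, vc, vl);                // spill the result
        return c;
    }

Chaining such operations produces back-to-back vse32/vle32 pairs that the compiler would ideally eliminate.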

This approach is however unusable, since no current compiler is capable of eliminating redundant predicated load/stores: https://godbolt.org/z/TdajMTMKT
The actual function in the link above shows what the codegen for such an implementation currently looks like, and the expected function simulates what the codegen should be with redundant load/store elimination.
As you can see, even when always using vl=VLMAX, no redundant load/stores are removed.
Since the RVV compiler backends aren't as mature, I also compared how the compilers handle predicated (masked) vs. unpredicated AVX512 load/stores.
There you can observe that redundant predicated AVX512 load/stores also can't be eliminated, but unpredicated ones can.

Hence, I suggest adding unpredicated RVV load/store intrinsics, aka the whole register load/stores, to help facilitate the compilers' load/store elimination in this use case.

@camel-cdr
Contributor Author

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

@topperc
Collaborator

topperc commented Oct 29, 2024

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

@dzaima

dzaima commented Oct 29, 2024

Some LLVM IR surgery to get good codegen, even being VLEN-agnostic: https://godbolt.org/z/jh8YK394T https://godbolt.org/z/ePrGdzEKx

I'd imagine it wouldn't be hard for a compiler to recognize & convert __riscv_vle/__riscv_vse with a constant vl into their native loads/stores, similarly to what's shown here.

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

@camel-cdr
Contributor Author

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that llvm is too clever and generates regular vector loads, because it knows the size of data.
It can eliminate the redundant load/stores, if I change uint32_t data[16] to uint32_t *data to coax it into generating whole register load/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz
Although it doesn't seem to be able to propagate this to subsequent function calls, see foo(), and gcc is still struggling.

@topperc
Collaborator

topperc commented Oct 29, 2024

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that llvm is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant load/stores, if I change uint32_t data[16] to uint32_t *data to coax it into generating whole register load/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz Although it doesn't seem to be able to propagate this to subsequent function calls, see foo(), and gcc is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*xlen bytes. Changing the type to uint32_t *data makes the struct smaller so it fits in a GPR.

@kito-cheng
Collaborator

The core problem you're trying to resolve is using vector intrinsics to implement some fixed-length vector functionality. Honestly, this isn't the focus of the RVV intrinsic spec (at least in version 1.0). A straightforward approach is to use GNU vectors (e.g., typedef int int32x4_t __attribute__ ((vector_size (16)));), which is well supported by both compilers and generates good-quality code. However, the issue is that not all operations are covered by built-in operators in C/C++, so we eventually need to convert GNU types to RVV types, which leads to several redundant load/store operations that are hard to eliminate.
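
To illustrate the round trip described above (hypothetical names; the memcpy stands in for whatever conversion the wrapper ends up performing):

    #include <riscv_vector.h>
    #include <stdint.h>
    #include <string.h>

    typedef uint32_t u32x4 __attribute__((vector_size(16)));

    static inline u32x4 add_then_shift(u32x4 a, u32x4 b) {
        u32x4 sum = a + b;                        // covered by built-in operators
        uint32_t buf[4];
        memcpy(buf, &sum, sizeof buf);            // convert GNU vector -> memory
        vuint32m1_t v = __riscv_vle32_v_u32m1(buf, 4);
        v = __riscv_vsll_vx_u32m1(v, 1, 4);       // stand-in for an RVV-only op
        __riscv_vse32_v_u32m1(buf, v, 4);         // and back to memory again
        u32x4 out;
        memcpy(&out, buf, sizeof out);
        return out;   // these loads/stores are the ones that are hard to eliminate
    }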

This reflects a limitation in the current compiler implementation, as it doesn't handle memory analysis for scalable vector types very effectively.

Returning to the main problem we want to solve: creating an easier programming model for SIMD-style programs while also improving code generation. One idea I have is to provide an alternative set of intrinsics to improve both user experience and code generation quality. Here’s a concrete example:

   int32x4_t a, b, c;
   a = __riscv_vle32(int32x4_t, ptr_a, 4);
   // We could also provide an overloaded version for VLMAX, e.g., __riscv_vle32(int32x4_t, ptr_a);
   // Or simply use: a = *(int32x4_t *)ptr_a;
   // ----
   b = __riscv_vle32(int32x4_t, ptr_b, 4);
   c = __riscv_vadd(int32x4_t, a, b, 4);
   // Alternative syntax: c = a + b;
   // or c = __riscv_vadd(int32x4_t, a, b);
   // ----
   __riscv_vse32(int32x4_t, ptr_c, c);
   // Or: *(int32x4_t *)ptr_c = c;

This approach was discussed in the early stages of the RVV intrinsics, but it wasn’t prioritized, so it didn’t come to fruition.

Another idea I have is to try converting scalable vector types to fixed-length vectors, which might improve code generation quality. However, this would require significant engineering effort, so upstream toolchain compilers may not consider it unless there's strong motivation.

In conclusion, I would say that introducing intrinsics for whole-register vector load/store doesn't truly solve the issue; the real problem lies in the compiler implementation.

@kito-cheng
Collaborator

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that llvm is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant load/stores, if I change uint32_t data[16] to uint32_t *data to coax it into generating whole register load/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz Although it doesn't seem to be able to propagate this to subsequent function calls, see foo(), and gcc is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*xlen bytes. Changing the type to uint32_t *data makes the struct smaller so it fits in a GPR.

FYI: This ABI proposal is trying to resolve this issue:

riscv-non-isa/riscv-elf-psabi-doc#418

@topperc
Collaborator

topperc commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?
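
Purely as an illustration of the shape such builtins could take (these are ordinary declarations standing in for would-be builtins; none of them exist in any compiler today):

    #include <riscv_vector.h>
    #include <stdint.h>

    typedef uint32_t u32x8 __attribute__((vector_size(32)));   // 256-bit GNU vector

    // Hypothetical conversions, legal only when the GNU vector is provably no
    // larger than the RVV type, e.g. u32x8 <-> vuint32m1_t under Zvl256b.
    vuint32m1_t rvv_from_fixed_u32x8(u32x8 x);      /* hypothetical */
    u32x8       rvv_to_fixed_u32x8(vuint32m1_t v);  /* hypothetical */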

@dzaima

dzaima commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?

That's what I noted as an option here before:

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

I don't think there necessarily needs to be a restriction on the relative size. Without the restriction, you could have a 32B or 64B buffer and write generic code over VLEN=128/256/512 (VLEN≥1024 would still work, but would only use the low 512 bits), allowing good VLEN=256/512 performance while still being compatible with VLEN=128. Also noted as an idea here.

@rofirrim
Collaborator

rofirrim commented Nov 18, 2024

@camel-cdr

This avoids having to store the variable length vector register directly in the SIMD class, and allows multiple implementations of this SIMD type for different VLEN. The generated code using these types would have to be runtime dispatched based on the actual VLEN.

It is possible to abuse C++'s type system so that it generates the code we want:

https://godbolt.org/z/rWj3oGx7e

(I presume in this case your classes will be parameterised by VLEN; in this example the class presumes VLEN=512.)

That said, those two conversions might be a bit questionable. I'm not a language lawyer.

EDIT: do not mind the noexcept, they don't seem to matter much.

@jan-wassenberg

Unfortunately I think the type punning there is not allowed (breaks strict aliasing).

@rofirrim
Collaborator

Unfortunately I think the type punning there is not allowed (breaks strict aliasing).

Thanks Jan!

@camel-cdr
Contributor Author

camel-cdr commented Nov 18, 2024

It is possible to abuse C++'s type system so it generates the code we want

https://godbolt.org/z/rWj3oGx7e

Unfortunately this doesn't nest properly: https://godbolt.org/z/jTP1aeT9E

Edit: I do think it's an important blocker for some projects getting RVV support. But if there isn't a relatively easy change to address this use case, it may be preferable to get the 1.0 release done and solve it in the next version.

@rofirrim
Collaborator

rofirrim commented Nov 18, 2024

Unfortunately this doesn't nest properly: https://godbolt.org/z/jTP1aeT9E

Looks like we're thwarted by the default copy constructor that introduces a memcpy. Telling it to copy the vector instead seems to do something reasonable:

https://godbolt.org/z/9cfh1c975

That said, given @jan-wassenberg's comment above, I would be a bit wary of using this approach.

@dzaima

dzaima commented Nov 18, 2024

Unfortunately this doesn't nest properly: https://godbolt.org/z/jTP1aeT9E

That looks like a generic missed optimization around scalable vs fixed-width types; that is, if not fixed, such behaviors would remain even if there were whole-register load/store intrinsics (and, as far as LLVM goes, the vuint32m1_t* load/store convert directly to load/store IR instructions, which are as native as loads/stores get, so optimizations (or lack thereof) around them are what you'd get if there were actual whole-register intrinsics anyway).

But I don't think whole-register loads/stores are the thing to expand on/encourage for use-cases like this; fixed-vl vle/vse do the actually desired operation while staying VLEN-agnostic, and I would imagine it'd be quite trivial for compilers to have special-cased behavior for a constant vl argument, converting them to their internal fixed-width ops (plus reinterprets, which should be simple enough to optimize out where applicable).
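
For concreteness, the constant-vl pattern being described looks like this at the source level (assumed example; vl is the literal 4, so a compiler could in principle lower these to plain 128-bit fixed-width accesses and then run its usual redundant load/store elimination):

    #include <riscv_vector.h>
    #include <stdint.h>

    static inline void add4(uint32_t *dst, const uint32_t *a, const uint32_t *b) {
        vuint32m1_t va = __riscv_vle32_v_u32m1(a, 4);    // constant vl = 4
        vuint32m1_t vb = __riscv_vle32_v_u32m1(b, 4);
        __riscv_vse32_v_u32m1(dst, __riscv_vadd_vv_u32m1(va, vb, 4), 4);
    }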

@camel-cdr
Contributor Author

camel-cdr commented Nov 18, 2024

@rofirrim @jan-wassenberg Can't we just use __attribute__((__may_alias__))?

Here is a quick proof of concept that does 8/16/32/64 bit unsigned add/sub/mul. The type is designed to work with VLEN<=512.

https://godbolt.org/z/oeGE65z4n
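
For reference, a rough sketch of the shape this takes (hypothetical names; it mirrors the pointer-casting approach from earlier in the thread, with may_alias added to sidestep strict aliasing; the real proof of concept is in the link above):

    #include <riscv_vector.h>
    #include <stdint.h>

    // The vector type carries __may_alias__ so that accessing the uint32_t
    // buffer through a pointer to it is not an aliasing violation.
    typedef vuint32m1_t v_u32 __attribute__((__may_alias__));

    typedef struct { uint32_t data[16]; } SimdU32Buf;   // sized for VLEN <= 512

    static inline SimdU32Buf simd_add_u32(SimdU32Buf a, SimdU32Buf b) {
        size_t vl = __riscv_vsetvlmax_e32m1();
        v_u32 va = *(const v_u32 *)a.data;               // whole-register-style load
        v_u32 vb = *(const v_u32 *)b.data;
        SimdU32Buf c;
        *(v_u32 *)c.data = __riscv_vadd_vv_u32m1(va, vb, vl);
        return c;
    }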

This is what I imagined for the dedicated whole register load/store; it seems to work on gcc and clang and produces optimal codegen.

I'd prefer explicit whole register load/store intrinsics, assuming the semantics and the way optimization passes treat them stay the same, as that would be a lot cleaner.

The only thing that could be improved is having all operations, including the load/stores, set vl=512/SEW, so it would also work for VLEN>512, just without using the entire register. But that would require more drastic changes to compilers and/or the spec.

@rofirrim
Collaborator

@rofirrim @jan-wassenberg Can't we just use __attribute__((__may_alias__))?

I can't speak much about the may_alias thing because I don't consider myself very knowledgeable in this area. From the GCC documentation it seems that it might work. Intel intrinsics use this attribute in several places; their usage seems like a way to counter a limitation in the Intel intrinsic design, but I'm not sure exactly what went wrong there. This might be similar to the issue you're facing, as you seem to want to wrap a vector register and then reinterpret it as needed for the operations you need.

Another thing I've seen is that you may be paying a bit too much due to the copy constructors as you're passing something that must copy its internal buffer (so you store to local memory just to be loaded afterwards). Those copies would be the ones simplified if a whole vector load/store intrinsic were mapped to a plain LLVM/GCC IR load/store.

Given that you have only one field, your struct is in practice equivalent to the storage for a vector register. Your mileage may vary of course, but if you're able to use (possibly const) references code generation seems slightly better:

https://godbolt.org/z/PafnY9E1P

Hope this is useful.

@rofirrim
Collaborator

@dzaima

But I don't think whole-register loads/stores are the thing to expand on/encourage for use-cases like this; fixed-vl vle/vse do the actually desired operation while staying VLEN-agnostic, and I would imagine it'd be quite trivial for compilers to have special-cased behavior for a constant vl argument, converting them to their internal fixed-width ops (plus reinterprets, which should be simple enough to optimize out where applicable).

I wonder if we could somehow special case vle/vse intrinsics that operate on the corresponding vlmax<sew><lmul>? 🤔

@rofirrim
Collaborator

Given that the changes to the specification may be substantial, I suggest we postpone this until after 1.0. Thoughts?

@kito-cheng
Collaborator

We may add a few more intrinsics to improve this, but I would prefer to postpone it until after 1.0.

@rofirrim added the "Revisit after v1.0" label on Nov 21, 2024
@rofirrim
Collaborator

I suggest we revisit this after 1.0.
