-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize MaxAndArgmax to all Commutative Operations and Datatypes and all Destination Tensor Sizes #334
base: master
Are you sure you want to change the base?
Conversation
3f9b8c4
to
d80fc0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this is WIP, but I still have a couple of nits I identified.
src/gpuarray/array.h
Outdated
GA_REDUCE_XOR, /* ^ */ | ||
GA_REDUCE_ALL, /* &&/all() */ | ||
GA_REDUCE_ANY, /* ||/any() */ | ||
} ga_reduce_op; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably go in a reduction.h header.
src/gpuarray/array.h
Outdated
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this should also probably go into a reduction.h header.
src/gpuarray_reduction.c
Outdated
strb_appendf(&ctx->s, "typedef %s S;/* The type of the source array. */\n", ctx->srcTypeStr); | ||
strb_appendf(&ctx->s, "typedef %s T;/* The type of the destination array. */\n", ctx->dstTypeStr); | ||
strb_appendf(&ctx->s, "typedef %s A;/* The type of the destination argument array. */\n", ctx->dstArgTypeStr); | ||
strb_appendf(&ctx->s, "typedef %s X;/* The type of the indices: signed 32/64-bit. */\n", ctx->idxTypeStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There has been new development that makes the kernel compile/call code responsible for handling this.
You can always use ga_size for this and everything will work out.
src/gpuarray_reduction.c
Outdated
strb_appends(&ctx->s, "\n"); | ||
strb_appends(&ctx->s, "\n"); | ||
strb_appends(&ctx->s, "\n"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using nvrtc there cannot be any includes except for explicitly supported ones (and non are at the moment).
So this function is probably useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discovered that by myself, but forgot to nuke this function.
src/gpuarray_reduction.c
Outdated
*/ | ||
|
||
static int reduxGetSumInit (int typecode, const char** property){ | ||
if(typecode == GA_POINTER || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STYLE: Space between if and (
src/gpuarray_reduction.c
Outdated
ctx->blockSize [0] = ctx->blockSize [1] = ctx->blockSize [2] = 1; | ||
ctx->gridSize [0] = ctx->gridSize [1] = ctx->gridSize [2] = 1; | ||
ctx->chunkSize [0] = ctx->chunkSize [1] = ctx->chunkSize [2] = 1; | ||
for(i=0;i<MAX_HW_DIMS;i++){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STYLE: space between for and (
src/gpuarray_reduction.c
Outdated
@@ -256,113 +801,492 @@ static int maxandargmaxCheckargs (maxandargmax_ctx* ctx){ | |||
ctx->nds = ctx->src->nd; | |||
ctx->ndr = ctx->reduxLen; | |||
ctx->ndd = ctx->nds - ctx->ndr; | |||
strb_ensure(&ctx->s, 5*1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you preallocate so large a buffer? Is the kernel you're generating really 5kb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the kernel really is ~5KB for the 3D testcases.
src/gpuarray_reduction.c
Outdated
|
||
return ctx->ret; | ||
|
||
return reduxSelectModel(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that it is convenient to do tail calls everywhere like this, but C doesn't have tail-call optimization (or at least not all the time).
It also feels inappropriate that calling reduxCheckArgs ends up compiling the reduction kernel and calling it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my previous version I tried to do all of the error handling in the highest-level function. That doesn't really scale as I add more and more and deeper functions, since each such function has to be able to detect the condition and return, and all its callers have to also detect a condition, abort and pass on the error until the highest level.
So I instead arranged the code in CPS style, and every function can abort the entire GpuArray_reduction()
by tail-calling reduxCleanup()
with whichever is the correct GA_ERROR_CODE
as argument. A nice way of having clean "exception" handling in C.
CPS is somewhat unidiomatic C, but GCC does optimize my tailcalls from callq
to jmpq
in Release mode, or inlines them whole. And there is a net readability win from all the error handling code I've eliminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then at least rename the functions so that it is clear that they do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron Alright, how shall I call them then? reduxDoXYZ()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. Something like reduxCheckArgsAndRest()
? I don't like it very much ...
src/gpuarray_reduction.c
Outdated
static void reduxAppendFuncGetInitVal (redux_ctx* ctx){ | ||
strb_appends(&ctx->s, "/**\n"); | ||
strb_appends(&ctx->s, " * Initial value function.\n"); | ||
strb_appends(&ctx->s, " */\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know it outputting comments in a generated kernel is useful at all. Perhaps you can keep this as an actual comment right here.
src/gpuarray_reduction.c
Outdated
strb_appends(&ctx->s, "}\n"); | ||
strb_appends(&ctx->s, "\n"); | ||
strb_appends(&ctx->s, "\n"); | ||
strb_appends(&ctx->s, "\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a single string. Don't use an append per line, it is mostly inefficient.
@abergeron I've cleaned up the code given your feedback. It's still a WIP and has numerous bugs. However, I'm aware of most of what remains to be done. |
|
||
static inline void strb_init(strb* sb){ | ||
const strb s = STRB_STATIC_INIT; | ||
*sb = s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be a memset(b, 0, sizeof(strb))
. We can assume that it is ok at this level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron I think it would be wise to keep it as is, Don't Repeat Yourself and all that. If STRB_STATIC_INIT
changes in such a way that memset()
no longer suffices, this function will continue working. Otherwise there's two different places that must be changed, and one can forget one or the other.
At any rate, for the present strb
, the compiler can and will optimize memset()
to just
xor eax, eax
movq [sb ], rax
movq [sb+ 8], rax
movq [sb+16], rax
retq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it optimize the current form too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron As a matter of fact, a Release-mode build with GCC inlined strb_init()
, then recognized that strb
, along with everything else I set to zero explicitly in reduxCheckArgs()
, could be memsetted to 0:
Dump of assembler code for function GpuArray_reduction:
0x00007ffff74f5300 <+0>: push %r15
0x00007ffff74f5302 <+2>: xor %eax,%eax # Zero RAX
0x00007ffff74f5304 <+4>: push %r14
0x00007ffff74f5306 <+6>: push %r13
0x00007ffff74f5308 <+8>: push %r12
0x00007ffff74f530a <+10>: mov %rcx,%r12
0x00007ffff74f530d <+13>: mov $0x2f,%ecx # How many words to clear
0x00007ffff74f5312 <+18>: push %rbp
0x00007ffff74f5313 <+19>: mov %edi,%ebp
0x00007ffff74f5315 <+21>: push %rbx
0x00007ffff74f5316 <+22>: sub $0x1b8,%rsp
0x00007ffff74f531d <+29>: test %r12,%r12
0x00007ffff74f5320 <+32>: lea 0x30(%rsp),%rdi # Base pointer for memset
0x00007ffff74f5325 <+37>: lea 0x30(%rsp),%rbx
0x00007ffff74f532a <+42>: rep stos %rax,%es:(%rdi) # Memset to 0.
0x00007ffff74f532d <+45>: mov %ebp,0x30(%rsp)
0x00007ffff74f5331 <+49>: mov %rsi,0x38(%rsp)
0x00007ffff74f5336 <+54>: mov %rdx,0x40(%rsp)
0x00007ffff74f533b <+59>: mov %r12,0x48(%rsp)
0x00007ffff74f5340 <+64>: mov %r8d,0x50(%rsp)
0x00007ffff74f5345 <+69>: mov %r9,0x58(%rsp)
0x00007ffff74f534a <+74>: movq $0x1,0x138(%rsp)
0x00007ffff74f5356 <+86>: movq $0x1,0x150(%rsp)
0x00007ffff74f5362 <+98>: movq $0x1,0x168(%rsp)
0x00007ffff74f536e <+110>: movq $0x1,0x140(%rsp)
0x00007ffff74f537a <+122>: movq $0x1,0x158(%rsp)
0x00007ffff74f5386 <+134>: movq $0x1,0x170(%rsp)
0x00007ffff74f5392 <+146>: movq $0x1,0x148(%rsp)
0x00007ffff74f539e <+158>: movq $0x1,0x160(%rsp)
0x00007ffff74f53aa <+170>: movq $0x1,0x178(%rsp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'm good.
Is there anything missing from this? |
@abergeron Yes; The codegen breaks somewhat at random depending on the kernel's required list of arguments and the tensor dimensionalities involved. The number of cases is large. Where it happens is in cases like generating a |
Ok |
ca571e4
to
51c559d
Compare
@abergeron @nouiz You're going to be very happy:
Now, the Jenkins bot apparently has an older In several places there are notices Similarly, for the pre-scalar/post-scalar op fusions there are notices at the appropriate places for where they will hook in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some style changes and a few nits but overall ok.
I didn't run the tests though.
|
||
|
||
/* Enumerations */ | ||
enum srcb_state{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before {
.
|
||
|
||
/* Functions */ | ||
static inline void srcbInit (srcb* s, strb* sb){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No spaces between name and (
and space before {
.
s->empty = empty; | ||
} | ||
static inline void srcbEndList(srcb* s){ | ||
if(s->numElems == 0){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after if
and space before {
.
* Initialize at runtime an strb. | ||
*/ | ||
|
||
static inline void strb_init(strb* sb){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before {
.
#endif | ||
va_end(ap); | ||
|
||
va_end(apSave); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the va_end inside the else branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The va_end()
is in the correct place.
va_copy()
ing ava_list
is equivalent tova_start()
ing it followed by an equivalent number ofva_arg()
calls.- Every
va_start()
must be matched by ava_end()
. va_copy()
doesn't exist on older Visual Studios, but in the Visual Studios where it's implemented it's implemented as just a straight assignment.
Therefore va_end()
should be unconditional or else it's UB (although on x86-64 it's actually a no-op)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is technically UB to just assign and reuse va_lists like you did.
If we start from the assumption that a va_list is just a pointer on MSVC and we can just copy it then let's see it through and not use va_end either since it doesn't matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not fixed.
src/gpuarray/reduction.h
Outdated
GPUARRAY_PUBLIC int GpuArray_any (GpuArray* dst, | ||
const GpuArray* src, | ||
unsigned reduxLen, | ||
const unsigned* reduxList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these functions could (and probably should) be macros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron One thing that I've really liked about the current design is that I can breakpoint one of these functions without breakpointing on any others. It was very helpful when debugging using the test-suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true, but it costs in code size and number of exported symbols.
For debugging you can always use conditional breakpoints on the op to only stop for those you are interested in. I don't really mind making debugging a bit harder here.
src/gpuarray_reduction.c
Outdated
|
||
static int reduxGetSumInit (int typecode, const char** property){ | ||
if (typecode == GA_POINTER || | ||
typecode == GA_BUFFER){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use typecode < 0. The "special" values will always be below 0.
return GA_UNSUPPORTED_ERROR; | ||
} | ||
|
||
return GA_NO_ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that an additional property on the types for min/max values would be better than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron Doable by hacking the Python generator script a bit.
061c174
to
4188424
Compare
52d8d6e
to
2c4a3cb
Compare
@abergeron @nouiz @lamblin Great success. Code now passes all tests (apparently deterministically, since no failures in multiple tries) on my machine and Key points:
|
6c2abc8
to
6d9e62d
Compare
#endif | ||
va_end(ap); | ||
|
||
va_end(apSave); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not fixed.
* | ||
* It is assumed that at most one axis will ever be of length > 2**31-1. The | ||
* assumption is believed safe because no GPU or similar accelerator presently | ||
* on Earth has the capacity to store or process 2**62-element tensors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is true, what about the counterexample of an array with super-large broadcasted dimensions? Or is there special code to handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abergeron When a user broadcasts a tensor on two or more axes, each of which are of length >= 2^31, he implicitly accepts a computational cost of 2^62 FLOP's. The most powerful GPUs on Earth right now have a throughput of 1e10 float32
additions or multiplications per second, and so would chug through such a tensor of 2^62 elements in a mere 15 years. I'm unaware of a realistic environment and usecase for a 15-year computation.
src/gpuarray_reduction.c
Outdated
int reduxLen; | ||
const int* reduxList; | ||
GpuReductionAttr grAttr; | ||
gpucontext* gpuCtx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have a gpucontext in GpuReductionAttr
/* Source code Generator. */ | ||
strb s; | ||
srcb srcGen; | ||
char kName[256]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need such a large name? Also, why don't you just use a fixed name?
srcb srcGen; | ||
char kName[256]; | ||
char* kSourceCode; | ||
size_t kSourceCodeLen; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need those two members, they are available from either the strb or srcb above.
src/gpuarray_reduction.c
Outdated
/* Workspace */ | ||
if (reduxGenKernelRequiresWspace(gr)){ | ||
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user); | ||
if (reduxGenKern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what is the type of B (because the kernel code is impossible to follow, but since left is 64 bits this will always do a 64-bit modulo. I don't know how often this code is run, but using GA_SIZE would allow this to be 32 bit for smaller arrays.
src/gpuarray_reduction.c
Outdated
/* Workspace */ | ||
if (reduxGenKernelRequiresWspace(gr)){ | ||
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user); | ||
if (reduxGenKern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These pointers need to be tagged with their memory space.
src/gpuarray_reduction.c
Outdated
/* Workspace */ | ||
if (reduxGenKernelRequiresWspace(gr)){ | ||
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user); | ||
if (reduxGenKern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of printing to the output, you should set the error message with the new error API.
src/gpuarray_reduction.c
Outdated
/* Workspace */ | ||
if (reduxGenKernelRequiresWspace(gr)){ | ||
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user); | ||
if (reduxGenKern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is horribly fragile code. How is this a good idea?
src/gpuarray_reduction.c
Outdated
/* Workspace */ | ||
if (reduxGenKernelRequiresWspace(gr)){ | ||
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user); | ||
if (reduxGenKern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have function aliases like this? Can't you just reuse the wrapped function?
It allows initializing at runtime an strb. This can't always be done at compile-time, for instance if it is dynamically allocated.
All tests pass, but currently the codegen is locked to the large code model (the small code model has most of the groundwork laid down but has several extra complexities which haven't yet been implemented, like atomic reduction operators.
Clang and MSVC correctly recognize that all paths to the allegedly- uninitialized variables are, in fact, dominated by their initialization.
40% of tests still failing, and the code has a wierd smell to it that I really don't appreciate.
All tests now pass except summation, which fails to meet tolerance.
- Subtract 0.5 from random numbers, so they sum to 0 in expectation. - Increase tolerance from 1e-5 to 1e-4 just for summation.
doubles the speed.
Now, all the veryhighrank tests pass and the others fail for an unknown reason.
They are overkill but seem to fix the problems with the testcases, at least so far.
There is now not a single -Wdeclaration-after-statement warning origination in that file.
@abergeron In case this interests you I rebased against On Jenkins I see that for some reason or other Given your fp16-related changes I may have to review certain data-access macros, but otherwise this PR is still as good as it was last month. |
Added multiple new reduction ops, enumerated in
enum ga_reduce_op
, and also heavily refactored internally in preparation to add a generator for "small code model" kernels, which will be used for reductions with a small destination.