Generalize MaxAndArgmax to all Commutative Operations and Datatypes and all Destination Tensor Sizes #334

obilaniu · 2017-01-25T04:19:45Z

Added multiple new reduction ops, enumerated in enum ga_reduce_op, and also heavily refactored internally in preparation to add a generator for "small code model" kernels, which will be used for reductions with a small destination.

typedef enum _ga_reduce_op {
	GA_REDUCE_SUM,             /*        +        */
	GA_REDUCE_PROD,            /*        *        */
	GA_REDUCE_PRODNZ,          /*        * (!=0)  */
	GA_REDUCE_MIN,             /*      min()      */
	GA_REDUCE_MAX,             /*      max()      */
	GA_REDUCE_ARGMIN,          /*     argmin()    */
	GA_REDUCE_ARGMAX,          /*     argmax()    */
	GA_REDUCE_MINANDARGMIN,    /* min(), argmin() */
	GA_REDUCE_MAXANDARGMAX,    /* max(), argmax() */
	GA_REDUCE_AND,             /*        &        */
	GA_REDUCE_OR,              /*        |        */
	GA_REDUCE_XOR,             /*        ^        */
	GA_REDUCE_ALL,             /*     &&/all()    */
	GA_REDUCE_ANY,             /*     ||/any()    */
} ga_reduce_op;

abergeron

I know that this is WIP, but I still have a couple of nits I identified.

abergeron · 2017-01-25T18:05:49Z

src/gpuarray/array.h

+	GA_REDUCE_XOR,             /*        ^        */
+	GA_REDUCE_ALL,             /*     &&/all()    */
+	GA_REDUCE_ANY,             /*     ||/any()    */
+} ga_reduce_op;


This should probably go in a reduction.h header.

abergeron · 2017-01-25T18:11:24Z

src/gpuarray/array.h

+
+
+
+


All of this should also probably go into a reduction.h header.

abergeron · 2017-01-25T18:22:13Z

src/gpuarray_reduction.c

+	strb_appendf(&ctx->s, "typedef %s     S;/* The type of the source array. */\n",                ctx->srcTypeStr);
+	strb_appendf(&ctx->s, "typedef %s     T;/* The type of the destination array. */\n",           ctx->dstTypeStr);
+	strb_appendf(&ctx->s, "typedef %s     A;/* The type of the destination argument array. */\n",  ctx->dstArgTypeStr);
+	strb_appendf(&ctx->s, "typedef %s     X;/* The type of the indices: signed 32/64-bit. */\n",   ctx->idxTypeStr);


There has been new development that makes the kernel compile/call code responsible for handling this.

You can always use ga_size for this and everything will work out.

abergeron · 2017-01-25T18:23:38Z

src/gpuarray_reduction.c

+	strb_appends(&ctx->s, "\n");
+	strb_appends(&ctx->s, "\n");
+	strb_appends(&ctx->s, "\n");
+}


Since we are using nvrtc there cannot be any includes except for explicitly supported ones (and none are at the moment).

So this function is probably useless.

I discovered that by myself, but forgot to nuke this function.

abergeron · 2017-01-25T18:25:38Z

src/gpuarray_reduction.c

+ */
+
+static int   reduxGetSumInit               (int typecode, const char** property){
+	if(typecode == GA_POINTER ||


STYLE: Space between if and (

abergeron · 2017-01-25T19:01:18Z

src/gpuarray_reduction.c

-	ctx->blockSize [0] = ctx->blockSize [1] = ctx->blockSize [2] = 1;
-	ctx->gridSize  [0] = ctx->gridSize  [1] = ctx->gridSize  [2] = 1;
-	ctx->chunkSize [0] = ctx->chunkSize [1] = ctx->chunkSize [2] = 1;
+	for(i=0;i<MAX_HW_DIMS;i++){


STYLE: space between for and (

abergeron · 2017-01-25T19:02:27Z

src/gpuarray_reduction.c

@@ -256,113 +801,492 @@ static int   maxandargmaxCheckargs              (maxandargmax_ctx*  ctx){
 	ctx->nds = ctx->src->nd;
 	ctx->ndr = ctx->reduxLen;
 	ctx->ndd = ctx->nds - ctx->ndr;
+	strb_ensure(&ctx->s, 5*1024);


Why do you preallocate so large a buffer? Is the kernel you're generating really 5kb?

Yes, the kernel really is ~5KB for the 3D testcases.

abergeron · 2017-01-25T19:05:34Z

src/gpuarray_reduction.c


-	return ctx->ret;
+
+	return reduxSelectModel(ctx);


I know that it is convenient to do tail calls everywhere like this, but C doesn't have tail-call optimization (or at least not all the time).

It also feels inappropriate that calling reduxCheckArgs ends up compiling the reduction kernel and calling it.

In my previous version I tried to do all of the error handling in the highest-level function. That doesn't really scale as I add more and more and deeper functions, since each such function has to be able to detect the condition and return, and all its callers have to also detect a condition, abort and pass on the error until the highest level.

So I instead arranged the code in CPS style, and every function can abort the entire GpuArray_reduction() by tail-calling reduxCleanup() with whichever is the correct GA_ERROR_CODE as argument. A nice way of having clean "exception" handling in C.

CPS is somewhat unidiomatic C, but GCC does optimize my tailcalls from callq to jmpq in Release mode, or inlines them whole. And there is a net readability win from all the error handling code I've eliminated.

Then at least rename the functions so that it is clear that they do this.

@abergeron Alright, how shall I call them then? reduxDoXYZ()?

I'm not sure. Something like reduxCheckArgsAndRest()? I don't like it very much ...
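
The continuation-passing arrangement discussed in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: every `demo*` name, the error codes and the context struct are hypothetical stand-ins. Each stage either tail-calls the next stage or tail-calls the cleanup function with an error code, so no caller has to inspect and forward return values.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the library's real error codes or context. */
enum { DEMO_NO_ERROR = 0, DEMO_VALUE_ERROR = 1, DEMO_MEMORY_ERROR = 2 };

typedef struct demo_ctx {
    int   n;        /* Requested element count.         */
    int*  buf;      /* Scratch buffer owned by the ctx. */
    long  result;   /* Output of the final stage.       */
} demo_ctx;

static int demoCleanup  (demo_ctx* ctx, int ret);
static int demoAllocate (demo_ctx* ctx);
static int demoCompute  (demo_ctx* ctx);

/* Stage 1: validate, then continue into stage 2 by tail call. */
static int demoCheckArgs(demo_ctx* ctx) {
    if (ctx->n <= 0) {
        return demoCleanup(ctx, DEMO_VALUE_ERROR);
    }
    return demoAllocate(ctx);
}

/* Stage 2: acquire resources, then continue into stage 3. */
static int demoAllocate(demo_ctx* ctx) {
    ctx->buf = malloc((size_t)ctx->n * sizeof(*ctx->buf));
    if (ctx->buf == NULL) {
        return demoCleanup(ctx, DEMO_MEMORY_ERROR);
    }
    return demoCompute(ctx);
}

/* Stage 3: do the work, then finish through the same cleanup path. */
static int demoCompute(demo_ctx* ctx) {
    int i;
    for (i = 0; i < ctx->n; i++) {
        ctx->buf[i]  = i;
        ctx->result += ctx->buf[i];
    }
    return demoCleanup(ctx, DEMO_NO_ERROR);
}

/* Single exit point: releases everything and passes the code through. */
static int demoCleanup(demo_ctx* ctx, int ret) {
    free(ctx->buf);
    ctx->buf = NULL;
    return ret;
}

/* Entry point: the only place that sees the final error code. */
int demoRun(int n, long* out) {
    demo_ctx ctx = {0, NULL, 0};
    int      ret;
    assert(out != NULL);
    ctx.n = n;
    ret   = demoCheckArgs(&ctx);
    if (ret == DEMO_NO_ERROR) {
        *out = ctx.result;
    }
    return ret;
}
```

The reviewer's objection is visible here too: calling `demoCheckArgs()` ends up running the whole pipeline, which its name does not suggest.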

abergeron · 2017-01-25T19:09:31Z

src/gpuarray_reduction.c

+static void  reduxAppendFuncGetInitVal     (redux_ctx*  ctx){
+	strb_appends(&ctx->s, "/**\n");
+	strb_appends(&ctx->s, " * Initial value function.\n");
+	strb_appends(&ctx->s, " */\n");


I don't know if outputting comments in a generated kernel is useful at all. Perhaps you can keep this as an actual comment right here.

abergeron · 2017-01-25T19:10:13Z

src/gpuarray_reduction.c

+	strb_appends(&ctx->s, "}\n");
+	strb_appends(&ctx->s, "\n");
+	strb_appends(&ctx->s, "\n");
+	strb_appends(&ctx->s, "\n");


Make this a single string. Don't use an append per line, it is mostly inefficient.
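
A small sketch of the suggestion, assuming a toy fixed-capacity builder (`demo_sb` and `demo_appends` are hypothetical stand-ins for `strb` and `strb_appends`): adjacent string literals are concatenated by the compiler, so the multi-line text can keep its one-line-per-row layout in the source while costing a single append call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal builder standing in for strb; fixed capacity for brevity. */
typedef struct demo_sb {
    char   buf[256];
    size_t len;
    int    calls;   /* Number of append calls made so far. */
} demo_sb;

static void demo_appends(demo_sb* sb, const char* s) {
    size_t n = strlen(s);
    assert(sb->len + n < sizeof(sb->buf));
    memcpy(sb->buf + sb->len, s, n + 1);   /* Copies the terminating NUL too. */
    sb->len += n;
    sb->calls++;
}

/* One append call per output line, as in the hunk under review. */
static void emitPerLine(demo_sb* sb) {
    demo_appends(sb, "}\n");
    demo_appends(sb, "\n");
    demo_appends(sb, "\n");
    demo_appends(sb, "\n");
}

/* Same text as one literal: the compiler joins adjacent string literals. */
static void emitSingle(demo_sb* sb) {
    demo_appends(sb, "}\n"
                     "\n"
                     "\n"
                     "\n");
}
```

Both functions produce identical text; the second makes one call instead of four.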

obilaniu · 2017-01-26T01:40:10Z

@abergeron I've cleaned up the code given your feedback. It's still a WIP and has numerous bugs. However, I'm aware of most of what remains to be done.

abergeron · 2017-01-30T21:35:44Z

src/util/strb.h

+
+static inline void strb_init(strb* sb){
+  const strb s = STRB_STATIC_INIT;
+  *sb = s;


This can also be a memset(sb, 0, sizeof(strb)). We can assume that it is ok at this level.

@abergeron I think it would be wise to keep it as is, Don't Repeat Yourself and all that. If STRB_STATIC_INIT changes in such a way that memset() no longer suffices, this function will continue working. Otherwise there's two different places that must be changed, and one can forget one or the other.

At any rate, for the present strb, the compiler can and will optimize memset() to just

xor  eax, eax
movq [sb   ], rax
movq [sb+ 8], rax
movq [sb+16], rax
retq

Will it optimize the current form too?

@abergeron As a matter of fact, a Release-mode build with GCC inlined strb_init(), then recognized that strb, along with everything else I set to zero explicitly in reduxCheckArgs(), could be memsetted to 0:

Dump of assembler code for function GpuArray_reduction:
   0x00007ffff74f5300 <+0>:     push   %r15
   0x00007ffff74f5302 <+2>:     xor    %eax,%eax                # Zero RAX
   0x00007ffff74f5304 <+4>:     push   %r14
   0x00007ffff74f5306 <+6>:     push   %r13
   0x00007ffff74f5308 <+8>:     push   %r12
   0x00007ffff74f530a <+10>:    mov    %rcx,%r12
   0x00007ffff74f530d <+13>:    mov    $0x2f,%ecx               # How many words to clear
   0x00007ffff74f5312 <+18>:    push   %rbp
   0x00007ffff74f5313 <+19>:    mov    %edi,%ebp
   0x00007ffff74f5315 <+21>:    push   %rbx
   0x00007ffff74f5316 <+22>:    sub    $0x1b8,%rsp
   0x00007ffff74f531d <+29>:    test   %r12,%r12
   0x00007ffff74f5320 <+32>:    lea    0x30(%rsp),%rdi          # Base pointer for memset
   0x00007ffff74f5325 <+37>:    lea    0x30(%rsp),%rbx
   0x00007ffff74f532a <+42>:    rep stos %rax,%es:(%rdi)        # Memset to 0.
   0x00007ffff74f532d <+45>:    mov    %ebp,0x30(%rsp)
   0x00007ffff74f5331 <+49>:    mov    %rsi,0x38(%rsp)
   0x00007ffff74f5336 <+54>:    mov    %rdx,0x40(%rsp)
   0x00007ffff74f533b <+59>:    mov    %r12,0x48(%rsp)
   0x00007ffff74f5340 <+64>:    mov    %r8d,0x50(%rsp)
   0x00007ffff74f5345 <+69>:    mov    %r9,0x58(%rsp)
   0x00007ffff74f534a <+74>:    movq   $0x1,0x138(%rsp)
   0x00007ffff74f5356 <+86>:    movq   $0x1,0x150(%rsp)
   0x00007ffff74f5362 <+98>:    movq   $0x1,0x168(%rsp)
   0x00007ffff74f536e <+110>:   movq   $0x1,0x140(%rsp)
   0x00007ffff74f537a <+122>:   movq   $0x1,0x158(%rsp)
   0x00007ffff74f5386 <+134>:   movq   $0x1,0x170(%rsp)
   0x00007ffff74f5392 <+146>:   movq   $0x1,0x148(%rsp)
   0x00007ffff74f539e <+158>:   movq   $0x1,0x160(%rsp)
   0x00007ffff74f53aa <+170>:   movq   $0x1,0x178(%rsp)

Ok. I'm good.
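
For reference, the two initialization forms compared in this thread can be put side by side. This is a sketch under an assumed layout: `demo_strb` has three pointer-sized fields, which is consistent with the three 8-byte stores quoted in the thread, but the field names and the `DEMO_STRB_STATIC_INIT` value are illustrative rather than taken from `strb.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout for illustration only; field names are hypothetical. */
typedef struct demo_strb {
    char*  s;   /* Buffer.          */
    size_t l;   /* Length in use.   */
    size_t a;   /* Allocated size.  */
} demo_strb;

#define DEMO_STRB_STATIC_INIT {NULL, 0, 0}

/* Form kept in the PR: copy from the static initializer, so a future
 * change to the initializer is picked up here automatically. */
static void demo_strb_init(demo_strb* sb) {
    const demo_strb s = DEMO_STRB_STATIC_INIT;
    assert(sb != NULL);
    *sb = s;
}

/* Form suggested in review: only equivalent while the initializer
 * stays all-zero. */
static void demo_strb_init_memset(demo_strb* sb) {
    assert(sb != NULL);
    memset(sb, 0, sizeof(*sb));
}
```

With an all-zero initializer both leave the struct in the same state, which is why an optimizing compiler is free to lower the first form to the same stores as the second.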

abergeron · 2017-02-14T04:10:28Z

Is there anything missing from this?

obilaniu · 2017-02-14T04:21:48Z

@abergeron Yes; the codegen breaks somewhat at random depending on the kernel's required list of arguments and the tensor dimensionalities involved. The number of cases is large. It happens in cases like generating a "," to separate arguments to a generated function when part of the argument list is empty (e.g. because the dst tensor is 0-dim, or the kernel doesn't need a dstArg argument, ...), so you get consecutive commas and an angry runtime-compile error.
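
One common way to avoid the consecutive-comma problem described above is to let a small list builder own the separator: it emits the separator only between elements that were actually appended, and skips empty ones. The sketch below is an assumption-laden illustration, not the PR's implementation; all `demoList*` names and the fixed-size buffer are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list builder; fixed capacity for brevity. */
typedef struct demo_list {
    char        buf[256];
    size_t      len;
    size_t      numElems;   /* Elements emitted so far.   */
    const char* sep;        /* Separator between elements. */
} demo_list;

static void demoListAppendRaw(demo_list* l, const char* s) {
    size_t n = strlen(s);
    assert(l->len + n < sizeof(l->buf));
    memcpy(l->buf + l->len, s, n + 1);
    l->len += n;
}

static void demoListBegin(demo_list* l, const char* sep) {
    l->buf[0]   = '\0';
    l->len      = 0;
    l->numElems = 0;
    l->sep      = sep;
}

/* Append one element. NULL or "" is skipped entirely, so an absent
 * argument never leaves a dangling separator behind. */
static void demoListAppend(demo_list* l, const char* elem) {
    if (elem == NULL || elem[0] == '\0') {
        return;
    }
    if (l->numElems > 0) {
        demoListAppendRaw(l, l->sep);
    }
    demoListAppendRaw(l, elem);
    l->numElems++;
}

/* Close the list. If nothing was appended, emit a placeholder instead
 * (e.g. "void" for an empty C parameter list). */
static void demoListEnd(demo_list* l, const char* empty) {
    if (l->numElems == 0) {
        demoListAppendRaw(l, empty);
    }
}
```

Callers then append every potential argument unconditionally and let the builder decide where commas go.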

abergeron · 2017-02-14T04:54:08Z

Ok

obilaniu · 2017-03-05T07:01:43Z

@abergeron @nouiz You're going to be very happy:

ALL reductions are now implemented, using the large code model.
ALL testcases pass.
The code is a bit cleaner after heavy-duty refactoring of the code-gen functions,
.. which required that I add a new utility util/srcgen.h to make printing lists of arguments or summands easier,
... which required that I add strb_appendv(), which is to strb_appendf() as vfprintf() is to fprintf().

Now, the Jenkins bot apparently has an older libcheck than Travis, because I'm using ck_assert_double_eq_tol() and Jenkins vomits all over that while everything is just fine with Travis.

In several places there are notices /* BUG: Small code model not implemented */. At those places (currently unreachable) I must add code-gen for the small code model. This will require great care and some help from the CLUDA language in providing atomic*() portably.

Similarly, for the pre-scalar/post-scalar op fusions there are notices at the appropriate places for where they will hook in.

abergeron

There are some style changes and a few nits but overall ok.

I didn't run the tests though.

abergeron · 2017-03-10T16:04:04Z

src/util/srcgen.h

+
+
+/* Enumerations */
+enum srcb_state{


Space before {.

abergeron · 2017-03-10T16:04:55Z

src/util/srcgen.h

+
+
+/* Functions */
+static inline  void srcbInit  (srcb* s, strb* sb){


No spaces between name and ( and space before {.

abergeron · 2017-03-10T16:06:00Z

src/util/srcgen.h

+	s->empty      = empty;
+}
+static inline  void srcbEndList(srcb* s){
+	if(s->numElems == 0){


Space after if and space before {.

abergeron · 2017-03-10T16:09:44Z

src/util/strb.h

+ * Initialize at runtime an strb.
+ */
+
+static inline void strb_init(strb* sb){


Space before {.

abergeron · 2017-03-10T16:19:35Z

src/util/strb.c

 #endif
-  va_end(ap);
-
+  va_end(apSave);


Move the va_end inside the else branch.

The va_end() is in the correct place.

va_copy()ing a va_list is equivalent to va_start()ing it followed by an equivalent number of va_arg() calls.

Every va_start() must be matched by a va_end().

va_copy() doesn't exist on older Visual Studios, but in the Visual Studios where it's implemented it's implemented as just a straight assignment.

Therefore va_end() should be unconditional or else it's UB (although on x86-64 it's actually a no-op)

It is technically UB to just assign and reuse va_lists like you did.

If we start from the assumption that a va_list is just a pointer on MSVC and we can just copy it then let's see it through and not use va_end either since it doesn't matter.

This is still not fixed.
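
The pairing rule argued over in this thread can be illustrated with a standalone two-pass formatter. This is a sketch in standard C99 and is not the code of `strb_appendv()`; `demo_vformat` and `demo_format` are hypothetical names. The copy is taken before the original list is consumed, and every `va_copy()` is matched by a `va_end()` on every path.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd formatted string, or NULL. */
static char* demo_vformat(const char* fmt, va_list ap) {
    va_list apCopy;
    char*   out;
    int     n;

    va_copy(apCopy, ap);               /* Copy before ap is consumed.   */
    n = vsnprintf(NULL, 0, fmt, ap);   /* First pass: measure length.   */
    if (n < 0) {
        va_end(apCopy);
        return NULL;
    }
    out = malloc((size_t)n + 1);
    if (out != NULL) {
        vsnprintf(out, (size_t)n + 1, fmt, apCopy);   /* Second pass. */
    }
    va_end(apCopy);                    /* Matches va_copy on all paths. */
    return out;
}

/* Variadic front end: to demo_vformat what fprintf is to vfprintf. */
static char* demo_format(const char* fmt, ...) {
    va_list ap;
    char*   out;
    assert(fmt != NULL);
    va_start(ap, fmt);
    out = demo_vformat(fmt, ap);
    va_end(ap);                        /* Matches va_start. */
    return out;
}
```

On toolchains without `va_copy()` a fallback is needed; which fallback is acceptable is exactly the point of disagreement above, so none is shown here.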

abergeron · 2017-03-10T16:22:06Z

src/gpuarray/reduction.h

+GPUARRAY_PUBLIC int GpuArray_any         (GpuArray*       dst,
+                                          const GpuArray* src,
+                                          unsigned        reduxLen,
+                                          const unsigned* reduxList);


All these functions could (and probably should) be macros.

@abergeron One thing that I've really liked about the current design is that I can breakpoint one of these functions without breakpointing on any others. It was very helpful when debugging using the test-suite.

That is true, but it costs in code size and number of exported symbols.

For debugging you can always use conditional breakpoints on the op to only stop for those you are interested in. I don't really mind making debugging a bit harder here.
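
The trade-off in this thread can be shown on a toy reduction. Everything below is a hypothetical stand-in (`demoReduction`, `demoAny`, `DEMO_ALL`), not the library's API: the function wrapper is its own exported symbol and can take a breakpoint, while the macro wrapper adds no symbol and expands to a call of the generic entry point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op codes for illustration. */
typedef enum demo_reduce_op { DEMO_REDUCE_ALL, DEMO_REDUCE_ANY } demo_reduce_op;

/* Single generic entry point: a toy all()/any() over an int array. */
int demoReduction(demo_reduce_op op, int* dst, const int* src, unsigned n) {
    unsigned i;
    int      acc = (op == DEMO_REDUCE_ALL);   /* Identity: 1 for all(), 0 for any(). */
    assert(dst != NULL && (src != NULL || n == 0));
    for (i = 0; i < n; i++) {
        acc = (op == DEMO_REDUCE_ALL) ? (acc && src[i]) : (acc || src[i]);
    }
    *dst = acc;
    return 0;
}

/* Function wrapper: one more exported symbol, but individually breakpointable. */
int demoAny(int* dst, const int* src, unsigned n) {
    return demoReduction(DEMO_REDUCE_ANY, dst, src, n);
}

/* Macro wrapper: no extra symbol; a debugger only sees demoReduction(). */
#define DEMO_ALL(dst, src, n) demoReduction(DEMO_REDUCE_ALL, (dst), (src), (n))
```

With the macro form, stopping only on `all()` requires a conditional breakpoint on `op` inside `demoReduction()`, as the reviewer suggests.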

abergeron · 2017-03-10T16:23:09Z

src/gpuarray_reduction.c

+
+static int   reduxGetSumInit               (int typecode, const char** property){
+	if (typecode == GA_POINTER ||
+	    typecode == GA_BUFFER){


You can use typecode < 0. The "special" values will always be below 0.

abergeron · 2017-03-10T16:24:53Z

src/gpuarray_reduction.c

+		  return GA_UNSUPPORTED_ERROR;
+	}
+
+	return GA_NO_ERROR;


I think that an additional property on the types for min/max values would be better than this.

@abergeron Doable by hacking the Python generator script a bit.

obilaniu · 2017-07-14T18:41:18Z

@abergeron @nouiz @lamblin Great success.

Code now passes all tests (apparently deterministically, since no failures in multiple tries) on my machine and leto13. But still waiting for Travis results to triple-check.

Key points:

- My kernels are fully deterministic given:
  - An identical number of SMs
  - Identical max local workgroup size
  - Type, dimensions and strides of all input/output tensors.
- For all-reduction of a 100M-element float32 vector to a scalar:
  - Global memory bandwidth:
    - On my laptop's GTX 765M, I achieve 86% of the max in-practice bandwidth (~46.6 GB/s).
    - cuDNN does get about 100% though (53 GB/s).
    - Apart from this best-case, Nvidia Visual Profiler reports that a dozen of my generated kernels get in excess of 20 GB/s, most get > 10 GB/s and a few stragglers get > 3 GB/s.
  - Workspace:
    - For that same reduction, however, I use exactly 2*sizeof(TK0)*BlocksSpawned*D bytes of memory, where, in this case, TK0 is float32, BlocksSpawned is 64*NUMBER_OF_SMs and D is the number of writebacks per block, which in this case is 1. On a GTX 765M, this results in 256 blocks of 1024/4=256 threads being spawned, and the workspace size is just 2KB. My workspace size is thus O(k) in the number k of simultaneously-executing threads on the GPU.
    - But cuDNN requests 400MB of workspace for the same reduction.
- cuDNN is limited to tensors with no more than 2**31 - 1 elements and strides (measured in elements) representable in a 32-bit signed integer. I am almost completely unconstrained; Practically the only assumption I make is that the tensor will have at most one axis of length > 2**31 - 1, while the strides (measured in bytes) can be 64-bit integers.
- I enforce a max block size 4 times smaller than the maximum allowed since that gives better performance.
- You may customize almost all types used in my kernel through a new GpuReductionAttr API.
- It is in principle possible to add arbitrary pre/post scalar operations. In fact, my code is capable of generalizing GpuElemwise.
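
The workspace figure quoted in the key points can be checked with a few lines of arithmetic. The function name and parameter names below are hypothetical; the formula is the one stated in the comment, and the SM count of 4 is inferred from the stated 256 blocks at 64 blocks per SM rather than given directly.

```c
#include <assert.h>
#include <stddef.h>

/* Workspace bytes = 2 * sizeof(TK0) * BlocksSpawned * D, with
 * BlocksSpawned = blocksPerSM * numSMs as described in the comment. */
size_t demoWorkspaceBytes(size_t sizeofTK0,
                          size_t numSMs,
                          size_t blocksPerSM,
                          size_t writebacksPerBlock) {
    size_t blocksSpawned = blocksPerSM * numSMs;
    assert(sizeofTK0 > 0);
    return 2 * sizeofTK0 * blocksSpawned * writebacksPerBlock;
}
```

For float32 (4 bytes), 4 SMs, 64 blocks per SM and one writeback per block this gives 2 * 4 * 256 * 1 = 2048 bytes, matching the 2KB stated above. The size depends on the number of blocks launched, not on the 100M-element input.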

abergeron · 2017-07-17T19:18:34Z

src/util/strb.c

 #endif
-  va_end(ap);
-
+  va_end(apSave);


This is still not fixed.

abergeron · 2017-07-18T16:04:57Z

src/gpuarray_reduction.c

+ * 
+ * It is assumed that at most one axis will ever be of length > 2**31-1. The
+ * assumption is believed safe because no GPU or similar accelerator presently
+ * on Earth has the capacity to store or process 2**62-element tensors.


While this is true, what about the counterexample of an array with super-large broadcasted dimensions? Or is there special code to handle that?

@abergeron When a user broadcasts a tensor on two or more axes, each of which is of length >= 2^31, he implicitly accepts a computational cost of 2^62 FLOPs. The most powerful GPUs on Earth right now have a throughput of 1e10 float32 additions or multiplications per second, and so would chug through such a tensor of 2^62 elements in a mere 15 years. I'm unaware of a realistic environment and usecase for a 15-year computation.
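
The estimate in this reply is a single division, sketched below with the throughput left as a parameter. The function name is hypothetical, and the 1e10 operations per second figure is taken from the comment as given rather than verified; the result scales inversely with whatever throughput is assumed.

```c
#include <assert.h>

/* Years needed to touch 2^log2Elems elements at opsPerSecond. */
double demoYearsToProcess(unsigned log2Elems, double opsPerSecond) {
    const double secondsPerYear = 365.25 * 24.0 * 3600.0;
    double       elems = 1.0;
    unsigned     i;
    assert(opsPerSecond > 0.0);
    for (i = 0; i < log2Elems; i++) {
        elems *= 2.0;   /* Powers of two are exact in double at this size. */
    }
    return elems / opsPerSecond / secondsPerYear;
}
```

For 2^62 elements at 1e10 operations per second this is about 4.6e8 seconds, i.e. between 14 and 15 years, in line with the figure quoted above.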

abergeron · 2017-07-18T16:09:26Z

src/gpuarray_reduction.c

-	int             reduxLen;
-	const int*      reduxList;
+	GpuReductionAttr grAttr;
+	gpucontext*      gpuCtx;


You already have a gpucontext in GpuReductionAttr

abergeron · 2017-07-18T16:10:08Z

src/gpuarray_reduction.c

+	/* Source code Generator. */
+	strb             s;
+	srcb             srcGen;
+	char             kName[256];


Why do you need such a large name? Also, why don't you just use a fixed name?

abergeron · 2017-07-18T16:10:53Z

src/gpuarray_reduction.c

+	srcb             srcGen;
+	char             kName[256];
+	char*            kSourceCode;
+	size_t           kSourceCodeLen;


You don't need those two members, they are available from either the strb or srcb above.

abergeron · 2017-07-18T17:28:57Z

src/gpuarray_reduction.c

+	/* Workspace */
+	if (reduxGenKernelRequiresWspace(gr)){
+		fn(gr, GA_BUFFER,  "GLOBAL_MEM char* restrict",         "W",           0, user);
+		if (reduxGenKern


I don't know what the type of B is (because the kernel code is impossible to follow), but since left is 64 bits this will always do a 64-bit modulo. I don't know how often this code is run, but using GA_SIZE would allow this to be 32 bit for smaller arrays.

abergeron · 2017-07-18T17:32:21Z

src/gpuarray_reduction.c

+	/* Workspace */
+	if (reduxGenKernelRequiresWspace(gr)){
+		fn(gr, GA_BUFFER,  "GLOBAL_MEM char* restrict",         "W",           0, user);
+		if (reduxGenKern


These pointers need to be tagged with their memory space.

abergeron · 2017-07-18T17:34:48Z

src/gpuarray_reduction.c

+	/* Workspace */
+	if (reduxGenKernelRequiresWspace(gr)){
+		fn(gr, GA_BUFFER,  "GLOBAL_MEM char* restrict",         "W",           0, user);
+		if (reduxGenKern


Instead of printing to the output, you should set the error message with the new error API.

abergeron · 2017-07-18T17:37:36Z

src/gpuarray_reduction.c

+	/* Workspace */
+	if (reduxGenKernelRequiresWspace(gr)){
+		fn(gr, GA_BUFFER,  "GLOBAL_MEM char* restrict",         "W",           0, user);
+		if (reduxGenKern


This is horribly fragile code. How is this a good idea?

abergeron · 2017-07-18T17:38:35Z

src/gpuarray_reduction.c

+	/* Workspace */
+	if (reduxGenKernelRequiresWspace(gr)){
+		fn(gr, GA_BUFFER,  "GLOBAL_MEM char* restrict",         "W",           0, user);
+		if (reduxGenKern


Why do you have function aliases like this? Can't you just reuse the wrapped function?

obilaniu · 2017-08-27T01:04:19Z

@abergeron In case this interests you I rebased against master, so there are now no conflicts and the codebase now builds, runs and passes tests again.

On Jenkins I see that for some reason or other ck_assert_ptr_nonnull() no longer works properly. It did use to, AFAIK. Was libcheck downgraded? For now I simply re-#define it to another libcheck API, but I'm still surprised.

Given your fp16-related changes I may have to review certain data-access macros, but otherwise this PR is still as good as it was last month.

obilaniu force-pushed the smallredux branch 2 times, most recently from 3f9b8c4 to d80fc0e Compare January 25, 2017 10:16

abergeron previously requested changes Jan 25, 2017

abergeron reviewed Jan 30, 2017

obilaniu force-pushed the smallredux branch 2 times, most recently from ca571e4 to 51c559d Compare March 3, 2017 18:07

obilaniu force-pushed the smallredux branch from 7aff01f to b11fd76 Compare March 5, 2017 07:06

abergeron reviewed Mar 10, 2017

obilaniu force-pushed the smallredux branch from b11fd76 to c6ba860 Compare March 10, 2017 22:27

obilaniu force-pushed the smallredux branch 2 times, most recently from 061c174 to 4188424 Compare May 30, 2017 14:52

obilaniu force-pushed the smallredux branch 2 times, most recently from 52d8d6e to 2c4a3cb Compare June 12, 2017 16:17

obilaniu changed the title ~~Generalize MaxAndArgmax to multiple ops and prepare groundwork for small-destination reductions~~ Generalize MaxAndArgmax to all Commutative Operations and Datatypes and all Destination Tensor Sizes Jul 14, 2017

obilaniu force-pushed the smallredux branch 5 times, most recently from 6c2abc8 to 6d9e62d Compare July 25, 2017 16:23

abergeron reviewed Jul 25, 2017

obilaniu added 3 commits August 26, 2017 18:52

Current status of reduction generalization and small-destination

c3ae76c

support.

Add strb_init() function.

939a115

It allows initializing at runtime an strb. This can't always be done at compile-time, for instance if it is dynamically allocated.

Moved the reduction API to reduction.h.

a21bcb5

obilaniu added 26 commits August 26, 2017 18:59

Massive refactor of kernel codegen.

0949626

Added testcases for all reductions.

8fe9083

All tests pass, but currently the codegen is locked to the large code model (the small code model has most of the groundwork laid down but has several extra complexities which haven't yet been implemented, like atomic reduction operators).

Muzzle incorrect GCC maybe-uninitialized diagnostic.

32bd11d

Clang and MSVC correctly recognize that all paths to the allegedly-uninitialized variables are, in fact, dominated by their initialization.

Current State

19bd939

Current State

1a2df8d

Remove warp axis select.

fffd323

Massive cleanup.

1cfe552

More planning for 2-stage reduction.

2317ca1

Near-complete rewrite based on 1/2-phase code model with workspace.

c3977d8

More fixes.

8fc792b

40% of tests still failing, and the code has a weird smell to it that I really don't appreciate.

Really dumb division bug fixed.

8debf2d

All tests now pass except summation, which fails to meet tolerance.

Fix summation tests:

5f4ec4e

- Subtract 0.5 from random numbers, so they sum to 0 in expectation.
- Increase tolerance from 1e-5 to 1e-4 just for summation.

Add huge sum-reduction and pepper kernel with restrict keyword, it

eb108be

doubles the speed.

Massive Refactor into effectively a lattice engine.

ce9c067

More refactoring.

c9a0389

Now, all the veryhighrank tests pass and the others fail for an unknown reason.

Delete an "initialization" that should not be there.

6fb0793

Added an initialization that WAS needed.

4a17f48

Add a bunch of local_barrier()'s.

328c957

They are overkill but seem to fix the problems with the testcases, at least so far.

Style fixes.

925688c

Muzzle -Wdeclaration-after-statement in check_reduction.c.

8f5250e

There is now not a single -Wdeclaration-after-statement warning originating in that file.

Easy feedback fixes applied.

fac52b6

Add stdargs support to the error API.

f129c69

Deleted recently-removed properties.

76fd38c

Added missing header

0832fa1

For test purposes, create buffer of ULONG rather than unsupported SIZE.

c679474

Bugfix in GpuReduction_new().

ecde75c

obilaniu force-pushed the smallredux branch from 1b18d0e to ecde75c Compare August 27, 2017 00:28

Bugfixes in check_reduction.c

79d3649

obilaniu force-pushed the smallredux branch from 8b4ff54 to 79d3649 Compare August 27, 2017 00:49


