Rework GPU runtime system and copies #1991
Replaces rebasing, I hope.
We are actually still too conservative. A fully dynamic approach is needed.
Adds LMADCopy to Imp and implements generic code generation. Very slow, but at least functional for the C backend.
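To make "generic" concrete, here is a minimal sketch of an element-by-element copy between two strided layouts. It is not the code the compiler generates: the function `lmad_copy` and its parameters are hypothetical names for illustration, and flat Python lists stand in for memory blocks. It visits every index and computes both flat offsets from scratch, which is why such a fallback is slow but works for any pair of layouts.

```python
from itertools import product


def lmad_copy(dst, dst_offset, dst_strides, src, src_offset, src_strides, shape):
    """Copy a `shape`-sized array from `src` to `dst`, one element at a time.

    Hypothetical helper: each array is described by an offset and one stride
    per dimension; the flat position of index (i0, i1, ...) is
    offset + i0*stride0 + i1*stride1 + ...
    """
    for idx in product(*(range(n) for n in shape)):
        dst_pos = dst_offset + sum(i * st for i, st in zip(idx, dst_strides))
        src_pos = src_offset + sum(i * st for i, st in zip(idx, src_strides))
        dst[dst_pos] = src[src_pos]


# A 2x3 array stored row-major (strides 3, 1) copied into a
# column-major destination (strides 1, 2).
src = [0, 1, 2, 3, 4, 5]
dst = [None] * 6
lmad_copy(dst, 0, (1, 2), src, 0, (3, 1), (2, 3))
```

Specialised paths (for example a plain memcpy or a tuned transposition) would be chosen ahead of this fallback whenever the layouts allow it.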
This is no longer just about getting rid of permutations, but also how we handle copies in code generation. The new approach will allow much more dynamic behaviour, with the goal of moving more intelligence to the runtime system, where it is much easier to understand and debug.
This change works quite well, but unfortunately there is one performance regression, on OptionPricing. This regression is due to the compiler now inserting fewer copies, but one of the remaining copies is now a
Oh, a simple manifest-manifest simplification rule solved that, and now OptionPricing is actually quite a bit faster than on master.
This PR removes the explicit tracking of permutation information from LMADs. The motivation is primarily simplicity: the presence of permutations made some of the LMAD functions much more complicated. Further, it was not actually complete: it was perfectly possible to express e.g. a column-major array without actually making use of the permutation mechanism, simply by permuting the strides (and shape) of the LMAD instead.
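The point about column-major arrays can be illustrated with a simplified model. The `LMAD` class below is a hypothetical stand-in (an offset plus one `(shape, stride)` pair per dimension), not the compiler's actual data type; `row_major` and `permute` are likewise illustrative names. Permuting the dimension pairs of a row-major layout yields a column-major one, with no separate permutation field involved.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LMAD:
    """Simplified, hypothetical LMAD: offset plus (shape, stride) per dimension."""
    offset: int
    dims: Tuple[Tuple[int, int], ...]

    def index(self, idx):
        # Flat position of a multi-dimensional index.
        return self.offset + sum(i * stride for i, (_, stride) in zip(idx, self.dims))


def row_major(shape):
    """Build a contiguous row-major LMAD for the given shape."""
    strides = []
    acc = 1
    for n in reversed(shape):
        strides.append(acc)
        acc *= n
    strides.reverse()
    return LMAD(0, tuple(zip(shape, strides)))


def permute(lmad, perm):
    """Reorder the (shape, stride) pairs; nothing else is recorded."""
    return LMAD(lmad.offset, tuple(lmad.dims[p] for p in perm))


# A logically 2x3 array in column-major order: start from a row-major
# 3x2 layout and swap its two dimensions.
col_major = permute(row_major((3, 2)), (1, 0))
```

Here `col_major.dims` is `((2, 1), (3, 2))`: the first index moves in steps of 1 and the second in steps of 2, which is exactly column-major for a 2x3 array.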
The only thing we truly use the permutations for is to detect transpositions during copies. This can be done in another way: check whether the index function basis of the source is a permutation of the destination. An even better solution would be to dynamically check whether the involved LMADs express a transposition. This can easily be done in time quadratic in the rank of the arrays, which is usually very low (and the operations involved are integer comparisons).
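One way such a dynamic check could look is sketched below. This is an assumption about what "express a transposition" means, not the implementation in this PR: a layout is treated as a transposition candidate if its `(shape, stride)` pairs can be reordered into a contiguous row-major layout. The names `find_row_major_perm` and `is_transposition` are hypothetical, offsets are ignored, and dimensions of size 1 with unusual strides are conservatively rejected. The nested loops over the dimensions give the quadratic bound, using only integer comparisons and multiplications.

```python
def find_row_major_perm(dims):
    """Return perm such that [dims[p] for p in perm] is contiguous row-major.

    `dims` is a sequence of (shape, stride) pairs. Returns None if no such
    reordering exists. Runs in O(rank^2) time.
    """
    rank = len(dims)
    used = [False] * rank
    inner_to_outer = []
    expected_stride = 1
    for _ in range(rank):
        found = None
        for k in range(rank):
            if not used[k] and dims[k][1] == expected_stride:
                found = k
                break
        if found is None:
            return None
        used[found] = True
        inner_to_outer.append(found)
        expected_stride *= dims[found][0]
    inner_to_outer.reverse()
    return tuple(inner_to_outer)


def is_transposition(dst_dims, src_dims):
    """True if dst is plain row-major and src is a non-trivially permuted one."""
    identity = tuple(range(len(dst_dims)))
    if len(dst_dims) != len(src_dims):
        return False
    if find_row_major_perm(dst_dims) != identity:
        return False
    src_perm = find_row_major_perm(src_dims)
    return src_perm is not None and src_perm != identity


# Copying a column-major 2x3 source into a row-major 2x3 destination.
dst_dims = ((2, 3), (3, 1))
src_dims = ((2, 1), (3, 2))
```

With these inputs `find_row_major_perm(src_dims)` yields `(1, 0)` and `is_transposition(dst_dims, src_dims)` holds, so a runtime system could dispatch to a dedicated transposition routine instead of the generic element-wise copy.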