Can save 6 trig functions, 9 multiplications and 5 additions by precomputing the orientation info for each q point. In absolute terms, that is about 325k operations on a 128x128 detector. In relative terms, the fcc model uses an additional 4 special functions, 49 multiplications and 16 adds, so this could be a 25% speedup.
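The figures above can be checked with a few lines of arithmetic. This sketch assumes every operation costs one unit (trig calls are in practice more expensive than a multiply) and reads the fcc counts as the work left over after the orientation step; both are assumptions, not something the ticket states.

```python
# Operation counts quoted in the ticket text.
saved_per_point = 6 + 9 + 5    # trig + multiplications + additions
remaining_fcc = 4 + 49 + 16    # special functions + multiplications + adds
n_points = 128 * 128           # q points on a 128x128 detector

total_saved = saved_per_point * n_points

# Fraction of per-point work removed, with every operation weighted equally
# (an assumption made for this estimate only).
fraction_saved = saved_per_point / (saved_per_point + remaining_fcc)

print(total_saved)               # 327680, i.e. roughly the 325k quoted
print(round(fraction_saved, 3))  # 0.225, in line with "could be 25%"
```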
For polydisperse systems, need to precompute for each independent (theta,phi,psi) triple, but this can be done in parallel.
For polydisperse systems, can save a sqrt, 4 multiplies and an add by precomputing q, qxhat and qyhat for each point. Again, this can be done in parallel.
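A minimal sketch of the q precomputation, assuming `qxhat` and `qyhat` are the components of the unit vector along (qx, qy). The function name `precompute_q` and the sample points are hypothetical, and the table is built in plain Python here rather than in parallel on the device.

```python
import math

def precompute_q(qx, qy):
    """Return (q, qxhat, qyhat) for one detector point."""
    q = math.sqrt(qx * qx + qy * qy)
    if q == 0.0:
        # Direction is undefined at the beam centre; return zeros by convention
        # (an assumption of this sketch).
        return 0.0, 0.0, 0.0
    inv_q = 1.0 / q
    return q, qx * inv_q, qy * inv_q

# Hypothetical detector points; each entry is independent of the others,
# so the table could be filled in parallel.
points = [(0.03, 0.04), (0.0, 0.1), (-0.06, 0.08)]
q_table = [precompute_q(qx, qy) for qx, qy in points]
```

Because the table does not depend on the polydispersity parameters, it only needs to be built once per set of q points and can then be reused for every point in the polydispersity loop.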
Could be implemented using global working memory (ticket SasView/sasview#810).
Trac update at 2019/02/14 03:35:20: pkienzle commented:
Testing a model with lots of polydispersity: computation efficiency for the 2-D ellipsoid kernel on an NVIDIA 1080 Ti is 25% of the theoretical maximum.
Can maybe improve performance 20% by prefetching the pd values and weights for the inner loop from global memory to shared memory. Support long pd vectors by introducing an outermost loop that prefetches the next block of the innermost loop whenever all the other loops have exhausted the current block.
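The loop restructuring described above can be sketched in plain Python. This only illustrates the loop order, under the assumption of two polydisperse parameters; the list slice stands in for the copy from global to shared memory, and the function names, `block_size` and the toy kernel are hypothetical.

```python
def weighted_sum_direct(outer, inner, kernel):
    """Reference: plain nested loops over (value, weight) pairs."""
    total = 0.0
    for a, wa in outer:
        for b, wb in inner:
            total += wa * wb * kernel(a, b)
    return total

def weighted_sum_blocked(outer, inner, kernel, block_size=4):
    """Same sum, with the innermost pd vector processed block by block."""
    total = 0.0
    # Outermost loop: advance to the next block of the innermost vector only
    # after all other loops have used up the current block.
    for start in range(0, len(inner), block_size):
        # Stand-in for prefetching values and weights into shared memory.
        block = inner[start:start + block_size]
        for a, wa in outer:
            for b, wb in block:
                total += wa * wb * kernel(a, b)
    return total

outer_pd = [(1.0, 0.2), (2.0, 0.5), (3.0, 0.3)]
inner_pd = [(0.1 * i, 0.1) for i in range(10)]   # longer than one block

def toy_kernel(a, b):
    return a * a + b

direct = weighted_sum_direct(outer_pd, inner_pd, toy_kernel)
blocked = weighted_sum_blocked(outer_pd, inner_pd, toy_kernel)
```

The two functions add the same terms in a different order, so the results agree up to floating point rounding. The block size bounds how much fast memory is needed, regardless of how long the pd vector is.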
A simple experiment replacing the fetching code with a constant shows an improvement to 35% of the theoretical maximum.
Need to shut off the partial dispatch in kernelcl/kernelcuda to achieve maximum performance. It is there to prevent machines from crashing or returning a bad result if the computation kernel takes too long. Without the ability to turn this off, the additional performance will not be relevant.
Floating point operations for the computation are 2.1 TFLOP equivalent, adjusting for the fact that sin, etc. take four cycles rather than 1.
Control flow, etc., adds another 100 instructions to the inner loop, so 55% may be the best we can achieve (equivalent to 0.33s, or a factor of 2+ better than we are currently doing).
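The numbers in this comment can be cross-checked against each other. The sketch below assumes that 0.33 s is the run time at 55% efficiency and that run time scales inversely with efficiency; the current run time and the peak rate it derives are implied by those assumptions, not measurements from the ticket.

```python
work_tflop = 2.1     # equivalent floating point work quoted above
t_at_best = 0.33     # seconds, quoted for 55% efficiency
best_eff = 0.55
current_eff = 0.25

# Time the same work would take at 100% efficiency.
t_ideal = t_at_best * best_eff
# Implied current run time at 25% efficiency.
t_current = t_ideal / current_eff
# Speedup from moving 25% -> 55% efficiency.
speedup = best_eff / current_eff
# Peak rate implied by the quoted work and times, in TFLOP/s.
implied_peak = work_tflop / t_ideal

print(round(t_current, 3))     # 0.726
print(round(speedup, 2))       # 2.2
print(round(implied_peak, 1))  # 11.6
```

The factor of 2.2 matches the "factor of 2+" in the comment.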
Relative improvement for more complicated models will be less. The additional time and complexity to implement and maintain this may not be worthwhile, especially if it only affects a few simple models.
Implementing the orientation precomputation means transforming the kernel into a precompute phase and a compute phase.
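A minimal sketch of what a precompute/compute split could look like. This is not the sasmodels kernel code: the rotation convention (Rz(phi) · Ry(theta) · Rz(psi), angles in radians) and the names `precompute_orientation` and `compute_phase` are assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def precompute_orientation(theta, phi, psi):
    """Precompute phase: all six trig calls happen here, once per orientation."""
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    ss, cs = math.sin(psi), math.cos(psi)
    rz_phi = [[cp, -sp, 0.0], [sp, cp, 0.0], [0.0, 0.0, 1.0]]
    ry_theta = [[ct, 0.0, st], [0.0, 1.0, 0.0], [-st, 0.0, ct]]
    rz_psi = [[cs, -ss, 0.0], [ss, cs, 0.0], [0.0, 0.0, 1.0]]
    r = matmul(matmul(rz_phi, ry_theta), rz_psi)
    # qz = 0 on the detector, so only the first two columns are needed.
    return [(row[0], row[1]) for row in r]

def compute_phase(orientation, qx, qy):
    """Compute phase: multiplies and adds only, no trig per q point."""
    return tuple(rx * qx + ry * qy for rx, ry in orientation)

# One (theta, phi, psi) triple, reused for every q point.
orientation = precompute_orientation(0.3, 0.7, 1.1)
qa, qb, qc = compute_phase(orientation, 0.02, -0.01)
```

With polydispersity in the angles, `precompute_orientation` would run once per independent (theta, phi, psi) triple, and those calls do not depend on one another.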
Migrated from http://trac.sasview.org/ticket/782