
parallelize polydispersity loops (Trac #1230) #393

Open
pkienzle opened this issue Mar 30, 2019 · 2 comments

@pkienzle (Contributor)
There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.

This is particularly important for mcSAS, which needs to evaluate

I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)

where P(q, r) is the sphere form factor, q has length n, and the distribution (w, r) has length m, for a total of m x n evaluations. The current sasmodels code parallelizes over q_j, with the sum over (w_i, r_i) running in serial. Instead, we could break the loop into non-intersecting stripes, first computing

I_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})

with p the number of stripes. Then we can keep n x p processors busy at the same time, at the cost of n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes in parallel, using minimal extra memory (256 x 20 x 8 bytes). It may be faster to use p = 16 so that memory accesses align better.
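
To make the decomposition concrete, here is a minimal OpenCL sketch of the current scheme and the striped scheme side by side. The kernel and buffer names are illustrative, not the actual sasmodels API, and `sphere_form(q, r)` stands in for whatever device function evaluates the sphere form factor P(q, r):

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Assumed to be defined elsewhere: the sphere form factor P(q, r).
double sphere_form(double q, double r);

// Current scheme: one work item per q value, with a serial sum over the
// distribution (m evaluations per work item).
kernel void iq_serial(
    global const double *q, global const double *w, global const double *r,
    const int m, global double *Iq)
{
    const int j = get_global_id(0);     // q index, 0 <= j < n
    double total = 0.0;
    for (int i = 0; i < m; i++)
        total += w[i] * sphere_form(q[j], r[i]);
    Iq[j] = total;
}

// Striped scheme: work item (j, k) accumulates stripe k of the sum for q_j,
// visiting elements k, k+p, k+2p, ... of the distribution, so n x p
// processors stay busy at once.  Launch over an n x p global range.
kernel void iq_stripes(
    global const double *q, global const double *w, global const double *r,
    const int m, global double *Ik)
{
    const int j = get_global_id(0);     // q index, 0 <= j < n
    const int n = get_global_size(0);
    const int k = get_global_id(1);     // stripe index, 0 <= k < p
    const int p = get_global_size(1);
    double total = 0.0;
    for (int i = k; i < m; i += p)      // stripe k of the distribution
        total += w[i] * sphere_form(q[j], r[i]);
    Ik[k * n + j] = total;              // n x p partial sums
}
```

Storing the partial sums as `Ik[k * n + j]` keeps adjacent work items in the j dimension writing adjacent memory, which is the same alignment concern that motivates choosing p = 16.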

Next, turn the problem on its side and compute

I(q_j) = sum_k I_k(q_j)

with one processor for each q value. We could perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p = 16 rather than sixteen, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:

https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf

I'm guessing a reduction of ~4k elements is too small to warrant a fast algorithm.
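
For completeness, a sketch of the simple one-work-item-per-q reduction (again with illustrative names, matching the hypothetical `iq_stripes` layout above); since p is only ~16-20 here, the serial loop over stripes is likely good enough:

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Final reduction: one work item per q value sums its p partial results.
// A tree reduction would cut this from p steps to log2(p), but for p ~ 16
// the bookkeeping probably costs more than it saves.
kernel void iq_reduce(
    global const double *Ik,   // n x p partial sums, stripe-major
    const int p,
    global double *Iq)         // final result, length n
{
    const int j = get_global_id(0);
    const int n = get_global_size(0);
    double total = 0.0;
    for (int k = 0; k < p; k++)
        total += Ik[k * n + j];
    Iq[j] = total;
}
```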

The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.

Migrated from http://trac.sasview.org/ticket/1230

{
    "status": "new",
    "changetime": "2019-02-22T16:28:36",
    "_ts": "2019-02-22 16:28:36.578150+00:00",
    "description": "There is unexploited parallelism in the polydiserpsity calculation.  This limits the speedup available on high end graphics cards which are operating on 1-D dataset, since the current implementation is limited to only using one processor per q value.  A card with 5000 separate processors will be mostly idle.\n\nThis is particularly important for mcSAS, which needs to evaluate\n{{{\nI(q_j) = sum_{i=1}^m w_i P(q_j, r_i)\n}}}\nwhere P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n total evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing\n{{{\nI_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})\n}}}\nwith the number of stripes p, then we can keep n x p processors busy at the same time, and the cost of n x p intermediate results.  With 256 q points, the 5120 processors on an nvidia V100 can compute 20 batches in parallel simultaneously, using minimal extra memory (256 x 20 x 8 bytes).  May be faster to use p=16 so that memory accesses align better.\n\nNext turn the problem on its side, compute the following:\n{{{\nI(q_j) = sum_k I_k(q_j)\n}}}\nwith one processor for each q value.   Can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit.  Looking at the graphs on page 5 of the following:\n\n    https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf\n\nI'm guessing the 4k reductions is too small to warrant a fast algorithm.\n\nThe existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.\n",
    "reporter": "pkienzle",
    "cc": "",
    "resolution": "",
    "workpackage": "McSAS Integration Project",
    "time": "2019-02-19T14:23:34",
    "component": "SasView",
    "summary": "parallelize polydispersity loops",
    "priority": "major",
    "keywords": "",
    "milestone": "SasView 4.3.0",
    "owner": "",
    "type": "defect"
}
@pkienzle (Contributor, Author) commented Mar 30, 2019

Trac update at 2019/02/22 16:28:36: pkienzle commented:

See also ticket http://trac.sasview.org/ticket/1172, which is now #187.

@butlerpd (Member)
I believe this is a sasmodels issue, not a sasview issue, so I am transferring it. If it is a sasview issue, that needs to be spelled out and it should most likely be moved to 5.1. At any rate, I don't think it is still relevant for 4.x.
