
parallelize polydispersity loops (Trac #1230) #393

Open
pkienzle opened this issue Mar 30, 2019 · 2 comments

@pkienzle (Contributor)
There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.

This is particularly important for mcSAS, which needs to evaluate

I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)

where P(q, r) is the sphere form factor, q has length n, and the distribution (w, r) has length m, for a total of m x n evaluations. The current sasmodels code parallelizes over q_j, with the sum over (w_i, r_i) running in serial. Instead, we could break the loop into non-intersecting stripes, first computing

I_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})

with p the number of stripes. Then we can keep n x p processors busy at the same time, at the cost of n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes in parallel, using minimal extra memory (256 x 20 x 8 bytes). It may be faster to use p = 16 so that memory accesses align better.
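
To make the decomposition concrete, here is a minimal OpenCL sketch of the current scheme and the striped scheme side by side. The kernel and buffer names are illustrative, not the actual sasmodels API, and `sphere_form(q, r)` stands in for whatever device function evaluates the sphere form factor P(q, r):

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Assumed to be defined elsewhere: the sphere form factor P(q, r).
double sphere_form(double q, double r);

// Current scheme: one work item per q value, with a serial sum over the
// distribution (m evaluations per work item).
kernel void iq_serial(
    global const double *q, global const double *w, global const double *r,
    const int m, global double *Iq)
{
    const int j = get_global_id(0);     // q index, 0 <= j < n
    double total = 0.0;
    for (int i = 0; i < m; i++)
        total += w[i] * sphere_form(q[j], r[i]);
    Iq[j] = total;
}

// Striped scheme: work item (j, k) accumulates stripe k of the sum for q_j,
// visiting elements k, k+p, k+2p, ... of the distribution, so n x p
// processors stay busy at once.  Launch over an n x p global range.
kernel void iq_stripes(
    global const double *q, global const double *w, global const double *r,
    const int m, global double *Ik)
{
    const int j = get_global_id(0);     // q index, 0 <= j < n
    const int n = get_global_size(0);
    const int k = get_global_id(1);     // stripe index, 0 <= k < p
    const int p = get_global_size(1);
    double total = 0.0;
    for (int i = k; i < m; i += p)      // stripe k of the distribution
        total += w[i] * sphere_form(q[j], r[i]);
    Ik[k * n + j] = total;              // n x p partial sums
}
```

Storing the partial sums as `Ik[k * n + j]` keeps adjacent work items in the j dimension writing adjacent memory, which is the same alignment concern that motivates choosing p = 16.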

Next, turn the problem on its side and compute

I(q_j) = sum_k I_k(q_j)

with one processor for each q value. We could perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p = 16 rather than sixteen, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:

https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf

I'm guessing a reduction of ~4k elements is too small to warrant a fast algorithm.
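
For completeness, a sketch of the simple one-work-item-per-q reduction (again with illustrative names, matching the hypothetical `iq_stripes` layout above); since p is only ~16-20 here, the serial loop over stripes is likely good enough:

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Final reduction: one work item per q value sums its p partial results.
// A tree reduction would cut this from p steps to log2(p), but for p ~ 16
// the bookkeeping probably costs more than it saves.
kernel void iq_reduce(
    global const double *Ik,   // n x p partial sums, stripe-major
    const int p,
    global double *Iq)         // final result, length n
{
    const int j = get_global_id(0);
    const int n = get_global_size(0);
    double total = 0.0;
    for (int k = 0; k < p; k++)
        total += Ik[k * n + j];
    Iq[j] = total;
}
```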

The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.

Migrated from http://trac.sasview.org/ticket/1230

{
    "status": "new",
    "changetime": "2019-02-22T16:28:36",
    "_ts": "2019-02-22 16:28:36.578150+00:00",
    "description": "There is unexploited parallelism in the polydiserpsity calculation.  This limits the speedup available on high end graphics cards which are operating on 1-D dataset, since the current implementation is limited to only using one processor per q value.  A card with 5000 separate processors will be mostly idle.\n\nThis is particularly important for mcSAS, which needs to evaluate\n{{{\nI(q_j) = sum_{i=1}^m w_i P(q_j, r_i)\n}}}\nwhere P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n total evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing\n{{{\nI_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})\n}}}\nwith the number of stripes p, then we can keep n x p processors busy at the same time, and the cost of n x p intermediate results.  With 256 q points, the 5120 processors on an nvidia V100 can compute 20 batches in parallel simultaneously, using minimal extra memory (256 x 20 x 8 bytes).  May be faster to use p=16 so that memory accesses align better.\n\nNext turn the problem on its side, compute the following:\n{{{\nI(q_j) = sum_k I_k(q_j)\n}}}\nwith one processor for each q value.   Can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit.  Looking at the graphs on page 5 of the following:\n\n    https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf\n\nI'm guessing the 4k reductions is too small to warrant a fast algorithm.\n\nThe existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.\n",
    "reporter": "pkienzle",
    "cc": "",
    "resolution": "",
    "workpackage": "McSAS Integration Project",
    "time": "2019-02-19T14:23:34",
    "component": "SasView",
    "summary": "parallelize polydispersity loops",
    "priority": "major",
    "keywords": "",
    "milestone": "SasView 4.3.0",
    "owner": "",
    "type": "defect"
}
@pkienzle (Contributor, Author) commented Mar 30, 2019

Trac update at 2019/02/22 16:28:36: pkienzle commented:

See also ticket http://trac.sasview.org/ticket/1172, which is now #187.

@butlerpd (Member)
I believe this is a sasmodels issue, not a sasview issue, so I am transferring it. If it is a sasview issue, that needs to be spelled out and it should most likely be moved to 5.1. At any rate, I don't think it is still relevant for 4.x.
