There is unexploited parallelism in the polydispersity calculation. This limits the speedup available on high-end graphics cards operating on 1-D datasets, since the current implementation uses only one processor per q value. A card with 5000 separate processors will be mostly idle.
This is particularly important for mcSAS, which needs to evaluate
I(q_j) = sum_{i=1}^m w_i P(q_j, r_i)
where P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing
I_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})
with the number of stripes p. We can then keep n x p processors busy at the same time, at the cost of n x p intermediate results. With 256 q points, the 5120 processors on an NVIDIA V100 can compute 20 stripes simultaneously, using minimal extra memory (256 x 20 x 8 bytes = 40 KiB). It may be faster to use p=16 so that memory accesses align better.
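The striping above can be sketched in plain Python to check that the partial sums recover the serial result. This is only an illustration under stated assumptions, not the sasmodels implementation: the function names are hypothetical, `sphere_form` is a simplified form factor with no contrast, scale or background, and stripes use 0-based indices k, k+p, k+2p, ... rather than the 1-based indexing in the formula.

```python
import math

def sphere_form(q, r):
    """Simplified sphere form factor (V * F)^2; contrast and scale omitted (assumption)."""
    x = q * r
    volume = 4.0 / 3.0 * math.pi * r**3
    # 3 (sin x - x cos x) / x^3 tends to 1 as x -> 0
    amplitude = 1.0 if x == 0.0 else 3.0 * (math.sin(x) - x * math.cos(x)) / x**3
    return (volume * amplitude) ** 2

def serial_iq(q, w, r):
    """I(q_j) = sum_i w_i P(q_j, r_i): parallel over q_j, serial over i."""
    return [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w, r)) for qj in q]

def striped_partials(q, w, r, p):
    """partials[k][j] = sum of w_i P(q_j, r_i) over stripe k (i = k, k+p, k+2p, ...).

    Every (k, j) entry is independent of the others, so on a GPU each one
    could be assigned to its own work item, giving n x p work items in total.
    """
    return [
        [sum(wi * sphere_form(qj, ri) for wi, ri in zip(w[k::p], r[k::p])) for qj in q]
        for k in range(p)
    ]
```

Because the stripes are taken with slices, this sketch also works when m is not a multiple of p; the last few stripes simply hold one fewer term.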
Next, turn the problem on its side and compute the following:
I(q_j) = sum_k I_k(q_j)
with one processor for each q value. We can perhaps do better by summing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:
https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf
I'm guessing that a reduction over 4k values is too small to warrant a fast algorithm.
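For reference, the pairs-then-pairs-of-pairs idea can be sketched as below. This is a serial Python model of the reduction, not GPU code: the name `tree_reduce` is hypothetical, the input is assumed to be a non-empty list of p rows of n partial sums, and the additions inside each step are the ones that would run in parallel on a device.

```python
def tree_reduce(partials):
    """Sum p rows of length n by pairwise combination.

    Returns (totals, steps), where steps counts the sequential rounds needed.
    Assumes at least one row.
    """
    rows = [list(row) for row in partials]
    steps = 0
    while len(rows) > 1:
        half = (len(rows) + 1) // 2
        merged = []
        for k in range(half):
            if k + half < len(rows):
                # Row k absorbs row k + half; all such pairs are independent.
                merged.append([a + b for a, b in zip(rows[k], rows[k + half])])
            else:
                # Odd row count: the middle row is carried forward unchanged.
                merged.append(rows[k])
        rows = merged
        steps += 1
    return rows[0], steps

# p = 16 stripes, n = 4 q values, filled with simple test values
partials = [[float(k + j) for j in range(4)] for k in range(16)]
totals, steps = tree_reduce(partials)
print(steps)  # 4 rounds for p = 16, versus 16 serial additions per q value
```

Whether the four rounds beat a plain serial sum on real hardware depends on synchronization overhead between rounds, which this model does not capture.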
The existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.
{
"status": "new",
"changetime": "2019-02-22T16:28:36",
"_ts": "2019-02-22 16:28:36.578150+00:00",
"description": "There is unexploited parallelism in the polydiserpsity calculation. This limits the speedup available on high end graphics cards which are operating on 1-D dataset, since the current implementation is limited to only using one processor per q value. A card with 5000 separate processors will be mostly idle.\n\nThis is particularly important for mcSAS, which needs to evaluate\n{{{\nI(q_j) = sum_{i=1}^m w_i P(q_j, r_i)\n}}}\nwhere P(q,r) is the sphere form, q is length n and the distribution w,r is length m, for a total of m x n total evaluations. The current sasmodels code does this in parallel over q_j, with the sum over w_i, r_i running in serial. Instead, we could break up the loop into non-intersecting stripes, first computing\n{{{\nI_k(q_j) = sum_{i=1}^{m/p} w_{k+p*i} P(q_j, r_{k+p*i})\n}}}\nwith the number of stripes p, then we can keep n x p processors busy at the same time, and the cost of n x p intermediate results. With 256 q points, the 5120 processors on an nvidia V100 can compute 20 batches in parallel simultaneously, using minimal extra memory (256 x 20 x 8 bytes). May be faster to use p=16 so that memory accesses align better.\n\nNext turn the problem on its side, compute the following:\n{{{\nI(q_j) = sum_k I_k(q_j)\n}}}\nwith one processor for each q value. Can perhaps do better by computing pairs in parallel, then pairs of pairs, requiring four cycles for p=16 rather than 16, though the overhead of managing this may outweigh any benefit. Looking at the graphs on page 5 of the following:\n\n https://www.cl.cam.ac.uk/teaching/1617/AdvGraph/07_OpenCL.pdf\n\nI'm guessing the 4k reductions is too small to warrant a fast algorithm.\n\nThe existing kernel_iq.c would benefit from this, at least for the inner polydispersity loop, if you are willing to tackle it. Generating a specialized kernel for the particular problem of a distribution of spheres in mcSAS will probably be easier.\n",
"reporter": "pkienzle",
"cc": "",
"resolution": "",
"workpackage": "McSAS Integration Project",
"time": "2019-02-19T14:23:34",
"component": "SasView",
"summary": "parallelize polydispersity loops",
"priority": "major",
"keywords": "",
"milestone": "SasView 4.3.0",
"owner": "",
"type": "defect"
}
I believe this is a sasmodels issue, not a sasview issue, so I am transferring it. If this is a sasview issue, it needs to be spelled out and most likely moved to 5.1. At any rate, I don't think it is relevant for 4.x any longer.
Migrated from http://trac.sasview.org/ticket/1230