Performance tuning for 2D calculations (Trac #782) #125

pkienzle · 2019-03-30T10:46:26Z

Precomputing the orientation terms once, instead of recomputing them at every q point, can save 6 trig functions, 9 multiplications and 5 additions per q point. In absolute terms, that is about 325k operations on a 128x128 detector. In relative terms, the fcc model uses an additional 4 special functions, 49 multiplications and 16 adds, so this could be a speedup of roughly 25%.

Need to transform:

    q = sqrt(qx*qx + qy*qy);
    const double qxhat = qx/q;
    const double qyhat = qy/q;
    double sin_theta, cos_theta;
    double sin_phi, cos_phi;
    double sin_psi, cos_psi;
    SINCOS(theta*M_PI_180, sin_theta, cos_theta);
    SINCOS(phi*M_PI_180, sin_phi, cos_phi);
    SINCOS(psi*M_PI_180, sin_psi, cos_psi);
    cos_alpha = cos_theta*cos_phi*qxhat + sin_theta*qyhat;
    cos_mu = (-sin_theta*cos_psi*cos_phi - sin_psi*sin_phi)*qxhat + cos_theta*cos_psi*qyhat;
    cos_nu = (-cos_phi*sin_psi*sin_theta + sin_phi*cos_psi)*qxhat + sin_psi*cos_theta*qyhat;

Into a precompute phase:

    double sin_theta, cos_theta;
    double sin_phi, cos_phi;
    double sin_psi, cos_psi;
    SINCOS(theta*M_PI_180, sin_theta, cos_theta);
    SINCOS(phi*M_PI_180, sin_phi, cos_phi);
    SINCOS(psi*M_PI_180, sin_psi, cos_psi);
    alpha_x = cos_theta*cos_phi;
    alpha_y = sin_theta;
    mu_x = -sin_theta*cos_psi*cos_phi - sin_psi*sin_phi;
    mu_y = cos_theta*cos_psi;
    nu_x = -cos_phi*sin_psi*sin_theta + sin_phi*cos_psi;
    nu_y = sin_psi*cos_theta;

and a compute phase:

    q = sqrt(qx*qx + qy*qy);
    const double qxhat = qx/q;
    const double qyhat = qy/q;
    cos_alpha = alpha_x*qxhat + alpha_y*qyhat;
    cos_mu = mu_x*qxhat + mu_y*qyhat;
    cos_nu = nu_x*qxhat + nu_y*qyhat;

For polydisperse systems, the coefficients need to be precomputed for each independent (theta,phi,psi) triple, but this can be done in parallel.

Polydisperse systems can additionally save a sqrt, 4 multiplies and an add per evaluation by precomputing q, qxhat and qyhat once for each q point. Again, this can be done in parallel.

Could be implemented using global working memory (ticket SasView/sasview#810).

Migrated from http://trac.sasview.org/ticket/782

{
    "status": "new",
    "changetime": "2019-02-14T03:35:20",
    "_ts": "2019-02-14 03:35:20.686629+00:00",
    "description": "Can save 6 trig functions 9 multiplications and 5 additions by precomputing the orientation info for each q point.  In absolute terms, it is 325k operations on a 128x128 detector.  In relative terms, the fcc mode uses an additional 4 special functions, 49 multiplications and 16 adds, so this could be a 25% speed up.  \n\nNeed to transform:\n{{{\n    q = sqrt(qx*qx + qy*qy);\n    const double qxhat = qx/q;\n    const double qyhat = qy/q;\n    double sin_theta, cos_theta;\n    double sin_phi, cos_phi;\n    double sin_psi, cos_psi;\n    SINCOS(theta*M_PI_180, sin_theta, cos_theta);\n    SINCOS(phi*M_PI_180, sin_phi, cos_phi);\n    SINCOS(psi*M_PI_180, sin_psi, cos_psi);\n    cos_alpha = cos_theta*cos_phi*qxhat + sin_theta*qyhat;\n    cos_mu = (-sin_theta*cos_psi*cos_phi - sin_psi*sin_phi)*qxhat + cos_theta*cos_psi*qyhat;\n    cos_nu = (-cos_phi*sin_psi*sin_theta + sin_phi*cos_psi)*qxhat + sin_psi*cos_theta*qyhat;\n}}}\n\nInto a precompute phase:\n{{{\n    double sin_theta, cos_theta;\n    double sin_phi, cos_phi;\n    double sin_psi, cos_psi;\n    SINCOS(theta*M_PI_180, sin_theta, cos_theta);\n    SINCOS(phi*M_PI_180, sin_phi, cos_phi);\n    SINCOS(psi*M_PI_180, sin_psi, cos_psi);\n    alpha_x = cos_theta*cos_phi;\n    alpha_y = sin_theta;\n    mu_x = -sin_theta*cos_psi*cos_phi - sin_psi*sin_phi;\n    mu_y = cos_theta*cos_psi;\n    nu_x = -cos_phi*sin_psi*sin_theta + sin_phi*cos_psi;\n    nu_y = sin_psi*cos_theta;\n}}}\n\nand a compute phase:\n{{{\n    q = sqrt(qx*qx + qy*qy);\n    const double qxhat = qx/q;\n    const double qyhat = qy/q;\n    cos_alpha = alpha_x*qxhat + alpha_y*qyhat;\n    cos_mu = mu_x*qxhat + mu_y*qyhat;\n    cos_nu = nu_x*qxhat + nu_y*qyhat;\n}}}\n\nFor polydisperse systems, need to precompute for each independent (theta,phi,psi) triple, but this can be done in parallel.\n\nFor polydisperse systems, can save a sqrt, 4 multiplies and an add by precomputing q, qxhat and qyhat for each point.  
Again, this can be done in parallel.\n\nCould be implemented using global working memory (ticket #679).",
    "reporter": "pkienzle",
    "cc": "",
    "resolution": "",
    "workpackage": "SasView Bug Fixing",
    "time": "2016-10-14T15:00:42",
    "component": "sasmodels",
    "summary": "Performance tuning for 2D calculations",
    "priority": "minor",
    "keywords": "",
    "milestone": "sasmodels WishList",
    "owner": "",
    "type": "enhancement"
}


pkienzle · 2019-03-30T10:50:10Z

Trac update at 2019/02/14 03:35:20: pkienzle commented:

Playing with a model with lots of polydispersity, the computation efficiency of the 2-D ellipsoid kernel on an NVIDIA 1080 Ti is 25% of the theoretical maximum.

Can maybe improve performance by 20% by prefetching the pd values and weights for the inner loop from global memory to shared memory. Long pd vectors can be supported by introducing an outermost loop that prefetches the next block of the innermost loop whenever all the other loops have exhausted the current block.
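
The loop structure can be sketched on the CPU in plain C. This is only an analogue of the GPU scheme, not the sasmodels kernel: a small local array stands in for shared memory, `memcpy` stands in for the prefetch from global memory, and `PD_BLOCK`, `toy_kernel` and `pd_sum_blocked` are hypothetical names. The outermost loop walks over blocks of the innermost pd vector, and the remaining loops run to completion on each block before the next one is fetched.

```c
#include <stddef.h>
#include <string.h>

#define PD_BLOCK 32  /* hypothetical block size; stands in for the shared-memory budget */

/* Hypothetical stand-in for the model kernel at one (outer, inner) value pair. */
static double toy_kernel(double outer, double inner)
{
    return outer*outer + inner;
}

/* Weighted sum over two pd loops, staging the inner vector block by block. */
static double pd_sum_blocked(size_t n_outer, const double *v_outer, const double *w_outer,
                             size_t n_inner, const double *v_inner, const double *w_inner)
{
    double v_block[PD_BLOCK], w_block[PD_BLOCK];  /* stands in for shared memory */
    double total = 0.0;
    /* Outermost loop: fetch the next block of the innermost pd vector. */
    for (size_t start = 0; start < n_inner; start += PD_BLOCK) {
        const size_t len = (n_inner - start < PD_BLOCK) ? (n_inner - start) : PD_BLOCK;
        memcpy(v_block, v_inner + start, len*sizeof(double));
        memcpy(w_block, w_inner + start, len*sizeof(double));
        /* All other loops exhaust the current block before the next fetch. */
        for (size_t i = 0; i < n_outer; i++) {
            for (size_t k = 0; k < len; k++) {
                total += w_outer[i]*w_block[k]*toy_kernel(v_outer[i], v_block[k]);
            }
        }
    }
    return total;
}
```

Because the blocks are summed in a different order than a straightforward nested loop, the result can differ from the unblocked sum by floating-point rounding.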

A simple experiment replacing the fetching code with a constant shows an improvement to 35% of the theoretical maximum.

Need to shut off the partial dispatch in kernelcl/kernelcuda to achieve maximum performance. It is there to prevent machines from crashing or returning a bad result if the computation kernel takes too long. Without the ability to turn this off, the additional performance will not be relevant.

Test command:

    SAS_OPENCL=cuda sascomp ellipsoid -2d -pars theta_pd=5 phi_pd=5 radius_polar_pd=0.1 radius_equatorial_pd=0.1 theta_pd_n=32 phi_pd_n=32 radius_polar_pd_n=32 radius_equatorial_pd_n=32 -nq=128 radius_polar=250 theta=80 phi=5 -midq -nq=128 -fast -neval=10

Current speed: 0.85s on OpenCL, 0.77s on CUDA.

Expected speed: 0.71s on OpenCL, 0.62s on CUDA.

Floating point operations for the computation are 2.1 TFLOP equivalent, adjusting for the fact that sin, etc. take four cycles rather than 1.

Control flow, etc., adds another 100 instructions to the inner loop, so 55% may be the best we can achieve (equivalent to 0.33s, or a factor of 2+ better than we are currently doing).

Relative improvement for more complicated models will be less. The additional time and complexity to implement and maintain this may not be worthwhile, especially if it only affects a few simple models.

pkienzle · 2022-11-05T03:10:57Z

See notes on #227 (comment)

pkienzle added this to the sasmodels WishList milestone Mar 30, 2019

pkienzle added enhancement Incomplete Migration Migrated from Trac minor SasView Bug Fixing and removed Incomplete Migration labels Mar 30, 2019

pkienzle added SasModels Infrastructure and removed SasView Bug Fixing SasModels Infrastructure labels Apr 1, 2019
