Performance tuning for 2D calculations (Trac #782) #125

Open
pkienzle opened this issue Mar 30, 2019 · 2 comments

Comments

@pkienzle
Contributor

pkienzle commented Mar 30, 2019

Can save 6 trig functions, 9 multiplications and 5 additions per q point by precomputing the orientation info once per orientation instead of at every q point. In absolute terms, that is about 325k operations on a 128x128 detector. In relative terms, the fcc model uses an additional 4 special functions, 49 multiplications and 16 adds, so this could be a 25% speed up.
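
As a rough check on the absolute figure, assuming each of the saved operations is counted once per detector pixel:

$$
128 \times 128 \times (6 + 9 + 5) = 16384 \times 20 = 327\,680 \approx 325\text{k}
$$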

Need to transform:

    q = sqrt(qx*qx + qy*qy);
    const double qxhat = qx/q;
    const double qyhat = qy/q;
    double sin_theta, cos_theta;
    double sin_phi, cos_phi;
    double sin_psi, cos_psi;
    SINCOS(theta*M_PI_180, sin_theta, cos_theta);
    SINCOS(phi*M_PI_180, sin_phi, cos_phi);
    SINCOS(psi*M_PI_180, sin_psi, cos_psi);
    cos_alpha = cos_theta*cos_phi*qxhat + sin_theta*qyhat;
    cos_mu = (-sin_theta*cos_psi*cos_phi - sin_psi*sin_phi)*qxhat + cos_theta*cos_psi*qyhat;
    cos_nu = (-cos_phi*sin_psi*sin_theta + sin_phi*cos_psi)*qxhat + sin_psi*cos_theta*qyhat;

Into a precompute phase:

    double sin_theta, cos_theta;
    double sin_phi, cos_phi;
    double sin_psi, cos_psi;
    SINCOS(theta*M_PI_180, sin_theta, cos_theta);
    SINCOS(phi*M_PI_180, sin_phi, cos_phi);
    SINCOS(psi*M_PI_180, sin_psi, cos_psi);
    alpha_x = cos_theta*cos_phi;
    alpha_y = sin_theta;
    mu_x = -sin_theta*cos_psi*cos_phi - sin_psi*sin_phi;
    mu_y = cos_theta*cos_psi;
    nu_x = -cos_phi*sin_psi*sin_theta + sin_phi*cos_psi;
    nu_y = sin_psi*cos_theta;

and a compute phase:

    q = sqrt(qx*qx + qy*qy);
    const double qxhat = qx/q;
    const double qyhat = qy/q;
    cos_alpha = alpha_x*qxhat + alpha_y*qyhat;
    cos_mu = mu_x*qxhat + mu_y*qyhat;
    cos_nu = nu_x*qxhat + nu_y*qyhat;

For polydisperse systems, need to precompute for each independent (theta,phi,psi) triple, but this can be done in parallel.

For polydisperse systems, can save a sqrt, 4 multiplies and an add by precomputing q, qxhat and qyhat for each point. Again, this can be done in parallel.

Could be implemented using global working memory (ticket SasView/sasview#810).

Migrated from http://trac.sasview.org/ticket/782

{
    "status": "new",
    "changetime": "2019-02-14T03:35:20",
    "_ts": "2019-02-14 03:35:20.686629+00:00",
    "reporter": "pkienzle",
    "cc": "",
    "resolution": "",
    "workpackage": "SasView Bug Fixing",
    "time": "2016-10-14T15:00:42",
    "component": "sasmodels",
    "summary": "Performance tuning for 2D calculations",
    "priority": "minor",
    "keywords": "",
    "milestone": "sasmodels WishList",
    "owner": "",
    "type": "enhancement"
}
@pkienzle
Contributor Author

Trac update at 2019/02/14 03:35:20: pkienzle commented:

Playing with a model with lots of polydispersity, I find that the computation efficiency for the 2-D ellipsoid kernel on an NVIDIA 1080 Ti is 25% of the theoretical maximum.

Could maybe improve performance by 20% by prefetching the pd values and weights for the inner loop from global memory to shared memory. Support long pd vectors by introducing an outermost loop that prefetches the next block of the innermost loop whenever all the other loops have exhausted the current block.
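
The loop structure might be sketched as follows. This is host-side C that only illustrates the blocking, not a GPU kernel: the local arrays stand in for shared memory, and `PD_BLOCK`, `kernel_value`, `pd_sum_blocked` and `pd_sum_direct` are hypothetical names with a placeholder integrand. Only two pd dimensions are shown.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

enum { PD_BLOCK = 32 };  /* hypothetical block size */

/* Placeholder for the model kernel evaluated at one pd point. */
static double kernel_value(double a, double b)
{
    return a*b;
}

/* Blocked version: the outermost loop walks the innermost pd vector in
 * blocks, copies each block into a small buffer, then runs the other
 * loops against that buffer. */
static double pd_sum_blocked(size_t n_outer, const double *v_outer,
                             const double *w_outer, size_t n_inner,
                             const double *v_inner, const double *w_inner)
{
    double block_v[PD_BLOCK], block_w[PD_BLOCK];  /* stand-in for shared memory */
    double total = 0.0;
    assert(v_outer && w_outer && v_inner && w_inner);
    for (size_t start = 0; start < n_inner; start += PD_BLOCK) {
        size_t len = n_inner - start;
        if (len > PD_BLOCK) len = PD_BLOCK;
        /* Prefetch the next block of values and weights. */
        for (size_t k = 0; k < len; k++) {
            block_v[k] = v_inner[start + k];
            block_w[k] = w_inner[start + k];
        }
        /* All other loops run to exhaustion on the current block. */
        for (size_t i = 0; i < n_outer; i++)
            for (size_t k = 0; k < len; k++)
                total += w_outer[i]*block_w[k]
                         *kernel_value(v_outer[i], block_v[k]);
    }
    return total;
}

/* Unblocked reference with the usual loop order, for comparison. */
static double pd_sum_direct(size_t n_outer, const double *v_outer,
                            const double *w_outer, size_t n_inner,
                            const double *v_inner, const double *w_inner)
{
    double total = 0.0;
    for (size_t i = 0; i < n_outer; i++)
        for (size_t j = 0; j < n_inner; j++)
            total += w_outer[i]*w_inner[j]
                     *kernel_value(v_outer[i], v_inner[j]);
    return total;
}
```

The two versions add the same terms in a different order, so results should agree to rounding error rather than bit for bit.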

A simple experiment replacing the fetching code with a constant shows an improvement to 35% of the theoretical maximum.

Need to shut off the partial dispatch in kernelcl/kernelcuda to achieve maximum performance. It is there to prevent machines from crashing or returning a bad result if the computation kernel takes too long. Without the ability to turn this off, the additional performance will not be relevant.

Test command:

    SAS_OPENCL=cuda sascomp ellipsoid -2d -pars theta_pd=5 phi_pd=5 radius_polar_pd=0.1 radius_equatorial_pd=0.1 theta_pd_n=32 phi_pd_n=32 radius_polar_pd_n=32 radius_equatorial_pd_n=32 -nq=128 radius_polar=250 theta=80 phi=5 -midq -nq=128 -fast -neval=10

Current speed: 0.85s on OpenCL, 0.77s on CUDA.

Expected speed: 0.71s on OpenCL, 0.62s on CUDA.

Floating point operations for the computation are 2.1 TFLOP equivalent, adjusting for the fact that sin, etc. take four cycles rather than 1.

Control flow, etc., adds another 100 instructions to the inner loop, so 55% may be the best we can achieve (equivalent to 0.33s, or a factor of 2+ better than we are currently doing).

Relative improvement for more complicated models will be less. The additional time and complexity to implement and maintain this may not be worthwhile, especially if it only affects a few simple models.

@pkienzle
Contributor Author

pkienzle commented Nov 5, 2022

See notes on #227 (comment)
