-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add choice for triangular solver implementation for Ginkgo #585
Conversation
@cameronrutherford, we need to update our pipelines with the latest head from Ginkgo. See related GitLab issue. |
@fritzgoebel, does 2bcae30 requires Ginkgo version with corresponding updates? |
@pelesh no, this only changes hiop code. |
@fritzgoebel Could this by chance be addressed? ginkgo-project/ginkgo#1258 |
The |
As long as that gets fixed in |
This will sure happen, but as the error seems to originate in third-party code we used as a starting point for this effort but will not rely on in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch is well tested and ready to merge. It might be a good idea to get the updated version of Ginkgo on all CI pipelines first.
Pushed updated modules, hopefully tests pass |
This is the error on Marianas. The convex problem is failing with the error below, but the non-convex test case works fine. @cnpetra @nychiang do you have any suggestions what is going on here? 22: Test command: /share/apps/openmpi/4.1.0mlx5.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/113026/build/src/Drivers/Sparse/NlpSparseEx1.exe" "500" "-ginkgo_cuda" "-selfcheck"
22: Test timeout computed to be: 10000000
22: [1675891897.306782] [dl09:11014:0] ib_iface.c:665 UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
22: [dl09.local:11014] pml_ucx.c:273 Error: Failed to create UCP worker
22: ===============
22: Hiop SOLVER
22: ===============
22: Using 1 MPI ranks.
22: ---------------
22: Problem Summary
22: ---------------
22: Total number of variables: 500
22: lower/upper/lower_and_upper bounds: 499 / 1 / 1
22: Total number of equality constraints: 1
22: Total number of inequality constraints: 498
22: lower/upper/lower_and_upper bounds: 498 / 497 / 497
22: LSQ linear solver --- KKT_SPARSE_XDYcYd linsys: MA57 size 1497 cons 499 nnz 3991 (option 'duals_init_linear_solver_sparse' 'auto')
22: iter objective inf_pr inf_du lg(mu) alpha_du alpha_pr linesrch
22: 0 7.6705009e+00 9.980e+00 1.118e+00 -1.00 0.000e+00 0.000e+00 -(-)
22: Setting up Ginkgo solver ...
22: 1 7.4398508e+00 9.733e+00 9.690e+00 -1.00 2.424e-03 2.474e-02 1(s)
22: 2 5.6597411e+01 3.144e-12 7.214e+01 -1.00 1.230e-02 1.000e+00 1(s)
22: 3 5.3831008e+01 2.752e-12 5.103e+01 -1.00 9.011e-01 1.246e-01 1(f)
22: 4 1.1479960e-01 1.084e-12 5.715e+01 -1.00 1.354e-01 6.046e-01 1(f)
22: 5 6.5161913e+00 5.418e-13 2.875e+01 -1.00 4.395e-02 5.000e-01 2(f)
22: 6 7.7841332e+00 1.563e-13 5.185e+00 -1.00 9.927e-01 7.884e-01 1(f)
22: 7 7.8320269e+00 1.368e-13 3.665e+00 -1.00 1.000e+00 1.250e-01 4(f)
22: 8 8.3081034e+00 8.882e-16 2.619e-02 -1.00 1.000e+00 1.000e+00 1(f)
22: 9 3.3051737e+00 8.882e-16 2.171e+00 -2.55 7.838e-01 1.000e+00 1(f)
22: iter objective inf_pr inf_du lg(mu) alpha_du alpha_pr linesrch
22: 10 1.2846460e+00 1.776e-15 7.038e-01 -2.55 6.203e-02 1.000e+00 1(f)
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: - Error code 1
22: - Abs res=-nann - Rel res=-nan
22: - ||rhs||_2=1.14224 ||sol||_2=nan
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: - Error code 1
22: - Abs res=nann - Rel res=nan
22: - ||rhs||_2=nan ||sol||_2=-nan
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22: NlpSparseEx1.exe: /people/svcexasgd/gitlab/113026/src/Optimization/hiopIterate.cpp:380: virtual bool hiop::hiopIterate::takeStep_duals(const hiop::hiopIterate&, const hiop::hiopIterate&, const double&, const double&): Assertion `zl->matchesPattern(nlp->get_ixl())' failed.
22: [dl09:11014] *** Process received signal *** |
@pelesh I tried this problem with MA57 and it seems the outer iterative refinement is not required. |
Fixed the newell variables file and CI should at least run there now. |
Tests pass with IR turned off. I don't have access to marianas and newell machines; I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives? |
I think it happens in this example:
I think this is a false positive, as hiop converges with inf_pr=inf_du=0. |
[edited-removed comment] |
@nychiang is this for MA57? |
@nychiang, @cnpetra: The verbose output is here. @fritzgoebel, please take a look, as well. |
Judging from the detailed output, HiOp enters FR because of the nans. Very likely the linear solve fails and gives nan search direction, since the nans first appear when the residual is computed. I would say that one of the triangular solves fail. BiCGStab IR may make it more likely for this issue to appear since it does multiple backsolves (one with each triangular factor) and it does it for decreasingly small right-hand sides. @fritzgoebel |
Actually it is comparison with |
my comment was not about the false positive issue, but about the source of nans. Just talked to @nychiang and we're pretty sure it originates in the linear solver. |
b0a8b1f
to
b68ab6e
Compare
I rebased onto develop and changed the triangular solver choice to default to the Ginkgo implementation. It still seems to need some regularization, but contrary to the cusparse solver at least does not generate |
@fritzgoebel: You need to rebase this branch to trigger newly configured CI pipelines. The old ones don't work anymore. Once you do that and tests pass, we can merge this PR. |
…erative refinement inside Ginkgo integration
592c712
to
ee374e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All CI pipelines pass. This branch is well tested (it was used for recent HiOp profiling) and it is safe to merge.
This PR adds a parameter
ginkgo_trisolve
to select the triangular solver used in Ginkgo. The options aresparselib
andsyncfree
to distinguish between the vendor library and our own implementation.As discussed with @pelesh it also removes the previously present refinement steps with GMRES.
I also updated the
glu_experimental
branch in Ginkgo to make the triangular solver choice possible, so you will need to update your CI pipelines.