Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] reuse option in Newton Raphson method #206

Closed
wants to merge 5 commits into from

Conversation

frankschae
Copy link
Member

TODOs:

  • add a check for convergence, and automatically compute a new Jacobian if convergence slows or starts to diverge. Do we have to find a good heuristic ourselves or is there a paper that has some details about it?

@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Merging #206 (093490a) into master (c6c875f) will decrease coverage by 0.36%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #206      +/-   ##
==========================================
- Coverage   48.45%   48.10%   -0.36%     
==========================================
  Files          19       19              
  Lines        1814     1846      +32     
==========================================
+ Hits          879      888       +9     
- Misses        935      958      +23     
Files Coverage Δ
src/raphson.jl 57.89% <50.00%> (-15.13%) ⬇️

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@frankschae
Copy link
Member Author

I had a discussion with @FHoltorf regarding an effective heuristic for updating the Jacobian:

  • Reusing the Jacobian becomes advantageous when taking relatively small Newton steps. If the function is smooth, the Jacobian shouldn't exhibit significant changes during these steps.
  • The proposition is to track the step size between Jacobian updates. When the norm of this size exceeds a predefined hyperparameter (HP1), we trigger a Jacobian update.
  • I will also check if we need additional checks, e.g. with respect to the residual.
  • I haven't mapped out the performance as a function of HP1 yet (I saw for a few examples roughly what we would expect, more function evaluations but less Jacobian evaluations).

@ChrisRackauckas
Copy link
Member

That is precisely the thinking of Hairer II's approach. However, with a lot of benchmarking we have found that the optimal HP1 tends to be ... zero a lot of the time. So basically, just take new Jacobians when you start diverging instead of converging. That is a pretty hard heuristic to beat. In theory you could take it earlier if convergence slows, but we haven't found great schema for that. So for the first version I'd keep it simple just based off of a convergence heuristic set to zero.

@ChrisRackauckas
Copy link
Member

You may want to base it off of the reuse heuristics of OrdinaryDiffEq.jl. Take a look at its nonlinear solvers which documents this.

@frankschae
Copy link
Member Author

That is precisely the thinking of Hairer II's approach. However, with a lot of benchmarking we have found that the optimal HP1 tends to be ... zero a lot of the time.

Yeah seems to be the case for the 23 test problems as well (red line below = speed without reuse). I am wondering though why they look so similar.

using NonlinearSolve, LinearAlgebra, LinearSolve, NonlinearProblemLibrary, Test, BenchmarkTools, Plots

problems = NonlinearProblemLibrary.problems
dicts = NonlinearProblemLibrary.dicts

function test_on_library(problems, dicts, ϵ=1e-5)

    samples = 100
    secs = 5

    HP = [1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5]

    # store a list of all Plots
    plts = []

    alg = NewtonRaphson(; reuse=false)
    broken_tests = (1,6) # test problems where method does not converge to small residual

    for (idx, (problem, dict)) in enumerate(zip(problems, dicts))
        title = dict["title"]
        @testset "$title" begin
            # Newton without reuse
            x = dict["start"]
            res = similar(x)
            nlprob = NonlinearProblem(problem, x)
            sol = solve(nlprob, alg, abstol = 1e-18, reltol = 1e-18)
            problem(res, sol.u, nothing)
            broken = idx in broken_tests ? true : false
            @test norm(res)ϵ broken=broken

            balg = @benchmarkable solve(nlprob, $alg, abstol=1e-18, reltol=1e-18)
            talg = run(balg, samples=samples, seconds=secs)

            # Newton with reuse
            ts = []
            for reusetol in HP
                @show reusetol
                sol = solve(nlprob, NewtonRaphson(; reuse=true, reusetol=reusetol), abstol=1e-18, reltol=1e-18)
                problem(res, sol.u, nothing)
                broken = idx in broken_tests ? true : false
                @test norm(res)  ϵ broken = broken


                balg2 = @benchmarkable solve(nlprob, NewtonRaphson(; reuse=true, reusetol=$reusetol), abstol=1e-18, reltol=1e-18)
                talg2 = run(balg2, samples=samples, seconds=secs)

                push!(ts, mean(talg2.times) / mean(talg.times))
            end

            pl = scatter(HP, ts, xaxis=:log, xticks=HP, label=false, xlabel="HP", ylabel="mean(time w/ reuse)/mean(time  wo/ reuse)", title=title)
            hline!([1.0], color = "red", label = false)
            display(pl)
            savefig(pl, "Reuse-Newton-" * title * ".png")
            push!(plts, pl)
        end
    end
    return plts
end

# NewtonRaphson
plts = test_on_library(problems, dicts)

# Merge the plots into a single plot
merged_plot = plot(plts..., layout=(5, 5), size=(2500, 2500), margin=10*Plots.mm)
savefig(merged_plot, "Reuse-Newton.png")

Reuse-Newton

So basically, just take new Jacobians when you start diverging instead of converging. That is a pretty hard heuristic to beat.

So only recomputing when the residual increases?

You may want to base it off of the reuse heuristics of OrdinaryDiffEq.jl. Take a look at its nonlinear solvers which documents this.

I only scanned quickly through
https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/newton.jl
but didn't immediately see where the reuse happens. Is it in there?

@ChrisRackauckas
Copy link
Member

So only recomputing when the residual increases?

Yup.

but didn't immediately see where the reuse happens. Is it in there?

https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/nlsolve.jl#L63-L96

@FHoltorf
Copy link
Contributor

One heuristic (or part of one) that might be interesting to also try and incorporate is to recompute Jacobians when the Newton direction associated with the old Jacobian stops being a descent direction for the residual norm. If for nothing else but the fact that it would leverage the AD system nicely.

One would need to compute the gradient of the residual norm with reverse mode AD but, even though more expensive then just checking the residual, it would make things play nicely with linesearch. (If you don't make any of such checks, I am guessing you could run into trouble if you want to combine Jacobian reuse with linesearch?)

@ChrisRackauckas
Copy link
Member

Ahh yes that would be an interesting approach to try.

@frankschae
Copy link
Member Author

For the gradient of the residual norm, I think we would need to change the default norm.

https://github.com/SciML/DiffEqBase.jl/blob/master/src/common_defaults.jl#L26C1-L33C4 drops the gradient

function res_norm1(u)
    b = f(u, p)
    sqrt(sum(abs2, b) / length(u))
end

function res_norm2(u)
    b = f(u, p)
    DiffEqBase.ODE_DEFAULT_NORM(b, nothing)
end

ForwardDiff.gradient(res_norm2, u0) # => zero(u0)

(Why is it actually sqrt(sum(abs2, b) / length(u)) and not sqrt(sum(abs2, b)) / length(u)?

@ChrisRackauckas
Copy link
Member

(Why is it actually sqrt(sum(abs2, b) / length(u)) and not sqrt(sum(abs2, b)) / length(u)?

That's just matching Hairer, I don't think it really matters which one it is.

https://github.com/SciML/DiffEqBase.jl/blob/master/src/common_defaults.jl#L26C1-L33C4 drops the gradient

We just shouldn't be differentiating the solvers here at all

@ChrisRackauckas
Copy link
Member

how did the AD thing come up?

@frankschae
Copy link
Member Author

One heuristic (or part of one) that might be interesting to also try and incorporate is to recompute Jacobians when the Newton direction associated with the old Jacobian stops being a descent direction for the residual norm. If for nothing else but the fact that it would leverage the AD system nicely.

One would need to compute the gradient of the residual norm with reverse mode AD but, even though more expensive then just checking the residual, it would make things play nicely with linesearch. (If you don't make any of such checks, I am guessing you could run into trouble if you want to combine Jacobian reuse with linesearch?)

Only for testing this idea, which might be necessary for Jacobian reuse if $\Delta x$ is not a decent direction. So I thought we'd need to compare $\Delta x$ (with the reused Jacobian) to $\frac{d ||f(x,p)||} {dx}$.

@ChrisRackauckas
Copy link
Member

ahh yeah, the ODE one purposely changes away from differentiating the norm, which isn't what you want here.

@frankschae
Copy link
Member Author

@ChrisRackauckas: @yonatanwesen, @FHoltorf, and I were talking again about a good example today. Was there any specific ODE system for which the reuse strategy in https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/nlsolve/nlsolve.jl#L63-L96 was particularly good? (To have another starting point.)

@ChrisRackauckas
Copy link
Member

Anything sufficiently large. Bruss with N=32 should do.

@avik-pal
Copy link
Member

SciML/SciMLBenchmarks.jl#796 has a benchmarking script. Drop the sundials and minpack versions and you should be able to test quite rapidly

@avik-pal
Copy link
Member

@avik-pal
Copy link
Member

though their scheme might not be ideal, given how long it takes for bruss

@yonatanwesen
Copy link
Contributor

The results from benchmarking what we currently we have on 2d brusselator problem doesn't show any significant improvement with the jacobian reuse.
brusselator_scaling

@ChrisRackauckas
Copy link
Member

what's the cost breakdown like?

@yonatanwesen
Copy link
Contributor

what's the cost breakdown like?

Do you mean the cost when I profile it?

@ChrisRackauckas
Copy link
Member

Yes share some flamegraphs https://github.com/tkluck/StatProfilerHTML.jl

@yonatanwesen
Copy link
Contributor

@avik-pal
Copy link
Member

Check the stats and trace to see how many times the factorization is reused

@avik-pal
Copy link
Member

#345 supercedes this

@avik-pal avik-pal closed this Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants