Derivatives are different from one run to the next #403

Open

abaillod opened this issue Apr 17, 2024 · 4 comments

@abaillod (Contributor)

Hi!

I realized that evaluating the same derivative twice is not guaranteed to give identical results in some cases. For example, running

surf.dgamma_by_dcoeff_vjp(np.ones((1,))) - surf.dgamma_by_dcoeff_vjp(np.ones((1,))) 

for some surf that is a SurfaceRZFourier instance, I do not get zero as expected; for example, it returns

array([ 0.00000000e+00, -5.68434189e-14,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -3.15544362e-30,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
       -1.57772181e-30,  0.00000000e+00, -6.31088724e-30,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -6.31088724e-30,  6.31088724e-30,  0.00000000e+00,
        0.00000000e+00,  6.31088724e-30,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  6.31088724e-30,
        0.00000000e+00,  6.31088724e-30,  0.00000000e+00,  0.00000000e+00,
        2.52435490e-29,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  3.15544362e-30,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -3.15544362e-30,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  1.57772181e-30,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00])

Now this difference is negligible on its own; however, when running the single-stage example, examples/3_Advanced/single_stage_optimization.py, these errors accumulate, and after 100 iterations the optimum found can differ by 10^-3 in the degrees of freedom from one run to the next, which is an issue.

To be clear: when running the same script twice, with the same initial guess, I obtain different optima.

Is this a known problem? Does anyone know how to have reproducible runs?

abaillod added the bug label on Apr 17, 2024
@andrewgiuliani (Contributor) commented Apr 18, 2024

Hi Antoine,

I wouldn't say that this is a bug; it is simply an artifact of using multithreaded code. The discrepancy is due to OpenMP and goes away when you only allow simsopt to use one thread:
export OMP_NUM_THREADS=1

Specifically, the critical region here is the source of the problem:
https://github.com/hiddenSymmetries/simsopt/blob/38a20cb8a0bcb6aab6d7250e595fd47b5f720ecf/src/simsoptpp/surfacerzfourier.cpp#L609C2-L609C21

The critical region doesn't guarantee that the threads all write to resptr in the same order, and because floating-point addition is not associative, the order of the additions changes the result by a tiny amount. This would be fixed by adding something that guarantees the threads always write in the same order.
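
As a standalone illustration only (a toy program with made-up data, not the simsopt code), the sketch below mimics the same pattern: each thread builds a private partial sum and then adds it to a shared total inside a critical region. Compiled with OpenMP and more than one thread, the last digits of the printed total can change from run to run:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = 1.0 / (i + 1.0);

    double total = 0.0;
    #pragma omp parallel
    {
        // Each thread accumulates a private partial sum.
        double partial = 0.0;
        #pragma omp for
        for (int i = 0; i < n; ++i) partial += x[i];

        // Threads add their partials in whatever order they reach the
        // critical region, so the rounding of 'total' varies between runs.
        #pragma omp critical
        total += partial;
    }
    std::printf("%.17g\n", total);
    return 0;
}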

@abaillod (Contributor, Author)

You are right, setting export OMP_NUM_THREADS=1 solves the problem.

I think it would be worth fixing, as it is currently impossible to get the same result twice, which is obviously an issue for reproducibility. As an example, consider the examples/3_Advanced/single_stage_optimization.py example, and change line 35 to MAXITER_single_stage = 100.

First run gives

...
fun#131: Objective function = 0.2159
fun#132: Objective function = 0.2159
Aspect ratio after optimization: 7.048877300334167
Mean iota after optimization: -0.9001021369022315
Quasisymmetry objective after optimization: 0.21345053416150483
Magnetic well after optimization: -0.14651386077848788
Squared flux after optimization: 8.485373878526792e-05

While running a second time gives

...
fun#121: Objective function = 0.2159
fun#122: Objective function = 0.2159
Aspect ratio after optimization: 7.04875160522698
Mean iota after optimization: -0.9002470575161438
Quasisymmetry objective after optimization: 0.21346284023142553
Magnetic well after optimization: -0.14653190866717544
Squared flux after optimization: 8.81152422891433e-05

Note that the aspect ratio, mean iota, QS error, magnetic well, and squared flux all differ between the two runs. In addition, the first run required 132 iterations to converge, while the second one took only 122. If you increase the weight on the coils, coils_objective_weight = ..., the difference between the two results grows even larger.

Any suggestions on how to fix that? Something like this? https://stackoverflow.com/questions/27199255/openmp-ordering-critical-sections

@andrewgiuliani (Contributor) commented Apr 20, 2024

There are a couple of ways of remedying this.

  1. Replacing the critical regions with something that guarantees an ordering. I haven't tested this, but you could check if something like:
#pragma omp for ordered
for (int t = 0; t < omp_get_num_threads(); ++t) {
    #pragma omp ordered
    {
        for (int i = 0; i < num_dofs(); ++i) {
            resptr[i] += resptr_private[i];
        }
    }
}

might work and fits the bill. I'd also like to see how much this slows things down, since the threads must now wait and write in the same order.

  2. Each thread could save its contribution to a separate array, e.g. resptr_private[thread_id, i], in the multithreaded region of the code. Then, after the multithreaded region is exited, thread 0 sums the contributions into resptr. This approach uses more buffer memory; a sketch follows below.
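
A minimal sketch of this second approach (again a toy program with assumed names such as ndofs and res_private and a dummy kernel, not the actual simsopt code): every thread accumulates into its own row of a buffer, and the rows are summed serially in a fixed order after the parallel region, so the floating-point additions happen in the same order on every run.

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int ndofs = 8;         // stand-in for num_dofs()
    const int npoints = 100000;  // stand-in for the quadrature-point loop
    std::vector<double> res(ndofs, 0.0);

    const int nthreads = omp_get_max_threads();
    // One private row per thread instead of a shared critical region.
    std::vector<double> res_private(nthreads * ndofs, 0.0);

    #pragma omp parallel
    {
        double* row = &res_private[omp_get_thread_num() * ndofs];
        #pragma omp for
        for (int p = 0; p < npoints; ++p)
            for (int i = 0; i < ndofs; ++i)
                row[i] += 1.0 / (p + i + 1.0);  // dummy kernel
        // No critical region needed: each thread only touches its own row.
    }

    // Serial reduction in a fixed (thread-id) order gives a deterministic
    // result, at the cost of nthreads * ndofs extra doubles of buffer memory.
    for (int t = 0; t < nthreads; ++t)
        for (int i = 0; i < ndofs; ++i)
            res[i] += res_private[t * ndofs + i];

    std::printf("%.17g\n", res[0]);
    return 0;
}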

abaillod removed the bug label on Apr 22, 2024
@abaillod (Contributor, Author)

Thank you @andrewgiuliani, what you proposed fixed the issue. When evaluating the same derivative twice, the results now match to machine precision.

However, when running the examples/3_Advanced/single_stage_optimization.py example with MAXITER_single_stage = 1E3, I still get discrepancies. For example, running with sbatch -N 1 --ntasks-per-node=32, I get the following two results:

< fun#1001: Objective function = 0.1230
< fun#1002: Objective function = 0.1230
< Aspect ratio after optimization: 7.033478399904086
< Mean iota after optimization: -0.0007434078919211339
< Quasisymmetry objective after optimization: 0.09954032247447002
< Magnetic well after optimization: -0.001651371860370606
< Squared flux after optimization: 2.2158571643393266e-05
---
> fun#1070: Objective function = 0.1231
> fun#1071: Objective function = 0.1231
> Aspect ratio after optimization: 7.033602930037235
> Mean iota after optimization: -0.0007454067291868244
> Quasisymmetry objective after optimization: 0.09966106730240758
> Magnetic well after optimization: -0.0016879256487739636
> Squared flux after optimization: 2.207314699559636e-05

Interestingly, running with --ntasks-per-node=1, I get the same result on two different runs, down to machine precision, even after 10k iterations.

There is thus an issue with the OpenMP parallelization somewhere else in simsopt. I would be interested to know if anyone has a suggestion for where to look!
