Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

sriharshakandala · 2024-10-14T18:46:24Z

No description provided.

Sbozzolo · 2024-10-14T20:15:16Z

Note that if you seed the random number every time you call radiation, you lose all the randomness properties of your RNG. RNGs require a statistical sample of values to satisfy basic entrophy properties, so we should fix the seed as infrequently as we can. If we do this once every checkpoint, we will reset the seed every ~700 numbers, which is not very much, so will probably also want an option to disable this behavior.

sriharshakandala · 2024-10-14T20:54:54Z

We are using a different seed at every rrtmgp call, based on the dycore time step number. Update flux is typically called every one hour with a dycore step of about 100 sec.
The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 * (>= 224)(i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column * (>= 224)). I believe the percentage cloud cover is about 70%!
Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Sbozzolo · 2024-10-14T21:02:59Z

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!

Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

sriharshakandala · 2024-10-14T21:06:51Z

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!

Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

The user can keep changing the checkpointing frequency during the course of a long simulation! Simulations run for the same integration time using different checkpointing frequency for different fractions of total integration time could see different results!

sriharshakandala · 2024-10-14T21:07:39Z

I do agree that we should have an option to turn off seeding completely if the user chooses to do so!

Sbozzolo · 2024-10-14T21:44:07Z

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!
Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

The user can keep changing the checkpointing frequency during the course of a long simulation! Simulations run for the same integration time using different checkpointing frequency for different fractions of total integration time could see different results!

I agree that it's not a good property to have that the evolution is affected by the checkpoint frequency. My worry is that seeding too often will break the RNG. RNGs are proven to satisfy certain properties only with large number of points generated (in particular with regards to the correlation of the generated points). In the limit where you generate one point per step because you have very little clouds, the RNG will no longer be an RNG but just a predetermined sequence of numbers. So, resetting the random number generator at every radiation step can introduce bias in the random numbers.

If we could save the internal state of the RNG at checkpoint, we would just do that and restart using the same RNG.

This is another solution: we could keep count many random numbers were generated during a simulation, and when we restart the simulation, we restart from the same original seed and throw away that many points (we "spin up" our RNG). In this way, we keep the simulation reproducible without affecting the quality of the random numbers and without having that the frequency of checkpoints affect the evolution. And we can make this feature optional.

One easy way to make the count easy is to have every column/layer always generate random numbers, even when they are not needed. This way, the number of random numbers generated is just determined by how many times radiation is called.

sriharshakandala · 2024-10-14T21:59:21Z

This manual seeding feature is primarily for verifying bit-wise reproducibility for testing purposes.
We should just turn off manual seeding for production runs. I do not believe the goal for our production runs is bitwise reproducibility. We just need reasonable agreement in some statistical sense.
I do not believe we should be generating random numbers when they are not needed.
Checkpoint frequency is not currently linked to rrtmgp call frequency!

Sbozzolo · 2024-10-14T22:14:01Z

This manual seeding feature is primarily for verifying bit-wise reproducibility for testing purposes.
We should just turn off manual seeding for production runs. I do not believe the goal for our production runs is bitwise reproducibility. We just need reasonable agreement in some statistical sense.

I agree. The main use case of this is allowing people to restart a simulation that broke and explore what went wrong.

Checkpoint frequency is not currently linked to rrtmgp call frequency!

We started requiring for the checkpoint frequency to be integer multiple of the rrtgmp frequency:

ClimaAtmos.jl/src/callbacks/get_callbacks.jl

Lines 284 to 285 in 3caa650

    
           if parsed_args["dt_save_state_to_disk"] != "Inf" && 
        
              !CA.isdivisible(dt_save_state_to_disk_dates, dt_rad_ms)

Consider that for our nightly AMIP simulations, we only have ~ 4000 zones, so setting the RNG at every radiation call would lead to a very small run of points, which means that the points will be biased (even by resetting the seed to a new value). Being biased means that we could have more points closer to 0 than to 1.

If the quality of random numbers don't matter much, we can go ahead with resetting the seed every hour, but let's make sure that this gotcha is documented somewhere.

Sbozzolo · 2024-10-14T22:35:53Z

If the quality of random numbers don't matter much, we can go ahead with resetting the seed every hour, but let's make sure that this gotcha is documented somewhere.

I ran some tests and convinced myself that the bias is small, even when we generate only 100 numbers per seed. So, this should be fine.

sriharshakandala · 2024-10-15T00:45:47Z

I updated the above estimate by a factor of (>= 224) since we do this for each of the 224 g-points (shortwave), or 256 g-points (longwave) in each cloudy layer of each column with clouds!

sriharshakandala mentioned this issue Oct 14, 2024

Support reproducible restarts with RRTMGP #3383

Closed

sriharshakandala linked a pull request Oct 14, 2024 that will close this issue

Support reproducible restarts with RRTMGP.jl #3382

Merged

1 task

sriharshakandala closed this as completed in #3382 Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 14, 2024 •

edited

Loading

Sbozzolo commented Oct 14, 2024 •

edited

Loading

sriharshakandala commented Oct 14, 2024 •

edited

Loading

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 15, 2024 •

edited

Loading

Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

Comments

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 14, 2024 • edited Loading

Sbozzolo commented Oct 14, 2024 • edited Loading

sriharshakandala commented Oct 14, 2024 • edited Loading

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

Sbozzolo commented Oct 14, 2024

sriharshakandala commented Oct 15, 2024 • edited Loading

sriharshakandala commented Oct 14, 2024 •

edited

Loading

Sbozzolo commented Oct 14, 2024 •

edited

Loading

sriharshakandala commented Oct 14, 2024 •

edited

Loading

sriharshakandala commented Oct 15, 2024 •

edited

Loading