Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seed the random number generator with the time-step number before calling the radiation solver to guarantee reproducibility. #3384

Closed
Tracked by #3383
sriharshakandala opened this issue Oct 14, 2024 · 10 comments · Fixed by #3382

Comments

@sriharshakandala
Copy link
Member

No description provided.

@Sbozzolo
Copy link
Member

Note that if you seed the random number every time you call radiation, you lose all the randomness properties of your RNG. RNGs require a statistical sample of values to satisfy basic entrophy properties, so we should fix the seed as infrequently as we can. If we do this once every checkpoint, we will reset the seed every ~700 numbers, which is not very much, so will probably also want an option to disable this behavior.

@sriharshakandala
Copy link
Member Author

sriharshakandala commented Oct 14, 2024

  • We are using a different seed at every rrtmgp call, based on the dycore time step number. Update flux is typically called every one hour with a dycore step of about 100 sec.
  • The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 * (>= 224)(i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column * (>= 224)). I believe the percentage cloud cover is about 70%!
  • Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

@Sbozzolo
Copy link
Member

Sbozzolo commented Oct 14, 2024

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!

Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

@sriharshakandala
Copy link
Member Author

sriharshakandala commented Oct 14, 2024

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!

Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

The user can keep changing the checkpointing frequency during the course of a long simulation! Simulations run for the same integration time using different checkpointing frequency for different fractions of total integration time could see different results!

@sriharshakandala
Copy link
Member Author

I do agree that we should have an option to turn off seeding completely if the user chooses to do so!

@Sbozzolo
Copy link
Member

Linking seeding frequency with checkpoint frequency is not natural and, in general, does not guarantee reproducibility!

Why not? We only need to seed at the checkpoint to ensure reproducibility and we need to be able to set the same seed upon restart.

The number of rand calls, after seeding, for each update_flux call is typically of the order 3000 * 4 * 4 * (fraction of cols with clouds) * (average number of cloudy layers per cloudy column) * 2 (i.e. 96,000 * frac of cols with clouds * avg # of cloudy layers per column).

Good!
Even so, we should interfere the least we can to ensure that the properties of the RNG are correctly realized.

The user can keep changing the checkpointing frequency during the course of a long simulation! Simulations run for the same integration time using different checkpointing frequency for different fractions of total integration time could see different results!

I agree that it's not a good property to have that the evolution is affected by the checkpoint frequency. My worry is that seeding too often will break the RNG. RNGs are proven to satisfy certain properties only with large number of points generated (in particular with regards to the correlation of the generated points). In the limit where you generate one point per step because you have very little clouds, the RNG will no longer be an RNG but just a predetermined sequence of numbers. So, resetting the random number generator at every radiation step can introduce bias in the random numbers.

If we could save the internal state of the RNG at checkpoint, we would just do that and restart using the same RNG.

This is another solution: we could keep count many random numbers were generated during a simulation, and when we restart the simulation, we restart from the same original seed and throw away that many points (we "spin up" our RNG). In this way, we keep the simulation reproducible without affecting the quality of the random numbers and without having that the frequency of checkpoints affect the evolution. And we can make this feature optional.

One easy way to make the count easy is to have every column/layer always generate random numbers, even when they are not needed. This way, the number of random numbers generated is just determined by how many times radiation is called.

@sriharshakandala
Copy link
Member Author

  • This manual seeding feature is primarily for verifying bit-wise reproducibility for testing purposes.
  • We should just turn off manual seeding for production runs. I do not believe the goal for our production runs is bitwise reproducibility. We just need reasonable agreement in some statistical sense.
  • I do not believe we should be generating random numbers when they are not needed.
  • Checkpoint frequency is not currently linked to rrtmgp call frequency!

@Sbozzolo
Copy link
Member

This manual seeding feature is primarily for verifying bit-wise reproducibility for testing purposes.
We should just turn off manual seeding for production runs. I do not believe the goal for our production runs is bitwise reproducibility. We just need reasonable agreement in some statistical sense.

I agree. The main use case of this is allowing people to restart a simulation that broke and explore what went wrong.

Checkpoint frequency is not currently linked to rrtmgp call frequency!

We started requiring for the checkpoint frequency to be integer multiple of the rrtgmp frequency:

if parsed_args["dt_save_state_to_disk"] != "Inf" &&
!CA.isdivisible(dt_save_state_to_disk_dates, dt_rad_ms)

Consider that for our nightly AMIP simulations, we only have ~ 4000 zones, so setting the RNG at every radiation call would lead to a very small run of points, which means that the points will be biased (even by resetting the seed to a new value). Being biased means that we could have more points closer to 0 than to 1.

If the quality of random numbers don't matter much, we can go ahead with resetting the seed every hour, but let's make sure that this gotcha is documented somewhere.

@Sbozzolo
Copy link
Member

If the quality of random numbers don't matter much, we can go ahead with resetting the seed every hour, but let's make sure that this gotcha is documented somewhere.

I ran some tests and convinced myself that the bias is small, even when we generate only 100 numbers per seed. So, this should be fine.

@sriharshakandala
Copy link
Member Author

sriharshakandala commented Oct 15, 2024

I updated the above estimate by a factor of (>= 224) since we do this for each of the 224 g-points (shortwave), or 256 g-points (longwave) in each cloudy layer of each column with clouds!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants