Update to recent versions of TensorFlow #90

Open
jeipollack opened this issue Dec 7, 2023 · 38 comments

@jeipollack (Contributor) commented Dec 7, 2023

WaveDiff implements the Rectified Adam optimiser from the TensorFlow Addons library, whose development has stopped (see details in link); minimal maintenance releases will continue until May 2024. As a result, the optimiser is not compatible with the latest versions of TensorFlow (2.11+). Interestingly, 2.9+ is also affected, producing the error reported in Issue #88 when loading a saved checkpoint. The Rectified Adam optimiser is currently not part of the core TensorFlow 2 library. The fix is to use the tf.keras.optimizers.legacy namespace (see here), which allows the old optimizers to keep working.
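
A minimal sketch of that workaround (an illustration assuming TensorFlow 2.11+, not the exact WaveDiff change):

import tensorflow as tf

# Under TF 2.11+, the pre-2.11 optimizer implementations are kept in the
# tf.keras.optimizers.legacy namespace, so the old Adam can be swapped in:
legacy_adam = tf.keras.optimizers.legacy.Adam(learning_rate=1e-2)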

This issue covers a couple of tasks:

  • Re-evaluate the performance of the Adam optimiser versus the Rectified Adam optimiser to see whether the latter is necessary
  • If the Adam optimiser is good enough, update WaveDiff to recent versions of TensorFlow with the new optimiser functionality
  • If not, re-evaluate possible next steps, such as asking the TF community if and when the migration of Rectified Adam will happen and/or developing a custom optimiser
@jeipollack added the enhancement (New feature or request) label on Dec 7, 2023
@nadamoukaddem (Contributor) commented:

After updating the optimizer to the legacy version, running the validation test on shape metrics resulted in the following error:

assert ratio_rmse_e1 < tol
E       assert 2.450450287128092e-09 < 1e-09

Loosening the tolerance to 1e-8 lets that assertion pass, but then this one fails:

assert ratio_rel_rmse_e1 < tol
E       assert 1.647135320670401e-05 < 1e-08

@nadamoukaddem (Contributor) commented Feb 26, 2024

I evaluated the performance using the Adam Optimizer versus the Rectified Adam Optimizer, and here are the results I obtained:
[Image: Adam_Vs_RectifiedAdam — metrics comparison plot]
I used these numbers of training epochs:

# Number of training epochs for training the parametric model parameters per cycle.
n_epochs_params: [15, 15]
# Number of training epochs for training the non-parametric model parameters per cycle.
n_epochs_non_params: [100, 50]

@jeipollack (Contributor, Author) commented:

Thanks @nadamoukaddem for this update. Just to confirm: did you use the same random seed and learning rates for both runs?
Perhaps you could share the complete training configuration file parameters.

@nadamoukaddem (Contributor) commented:

Yes, I used the same random seed and learning rates for both runs. Here is the training_config.yaml file:

training:
  # Run ID name
  id_name: -coherent_euclid_200stars
  # Name of Data Config file
  data_config: data_config.yaml
  # Metrics Config file - enter a file to run metrics evaluation; if empty, run training only
  metrics_config: metrics_config.yaml
  # PSF model parameters
  model_params:
    # Model type. Options are: 'mccd', 'graph', 'poly', 'param', 'poly_physical'.
    model_name: poly

    # Number of wavelength bins to reconstruct polychromatic objects.
    n_bins_lda: 8

    # Downsampling rate to match the oversampled model to the specified telescope's sampling.
    output_Q: 3

    # Oversampling rate used for the OPD/WFE PSF model.
    oversampling_rate: 3

    # Dimension of the pixel PSF postage stamp.
    output_dim: 32

    # Dimension of the OPD/Wavefront space.
    pupil_diameter: 256

    # Boolean to define if we use sample weights based on the noise standard deviation estimation.
    use_sample_weights: True

    # Interpolation type for the physical poly model. Options are: 'none', 'all', 'top_K', 'independent_Zk'.
    interpolation_type: None

    # SED interpolation points per bin.
    sed_interp_pts_per_bin: 0

    # SED extrapolate.
    sed_extrapolate: True

    # SED interpolation kind.
    sed_interp_kind: linear

    # Standard deviation of the multiplicative SED Gaussian noise.
    sed_sigma: 0

    # Limits of the PSF field coordinates for the x axis.
    x_lims: [0.0, 1.0e+3]

    # Limits of the PSF field coordinates for the y axis.
    y_lims: [0.0, 1.0e+3]

    # Hyperparameters for the parametric model
    param_hparams:
      # Set the random seed for TensorFlow initialization.
      random_seed: 3877572

      # Set the parameter for the l2 loss function for the Optical Path Differences (OPD)/WFE.
      l2_param: 0.

      # Zernike polynomial modes to use on the parametric part.
      n_zernikes: 15

      # Max polynomial degree of the parametric part.
      d_max: 2

      # Flag to save the optimisation history for the parametric model.
      save_optim_history_param: true

    # Hyperparameters for the non-parametric model
    nonparam_hparams:
      # Max polynomial degree of the non-parametric part.
      d_max_nonparam: 5

      # Number of graph features.
      num_graph_features: 10

      # L1 regularisation parameter for the non-parametric part.
      l1_rate: 1.0e-8

      # Flag to enable projected learning for DD_features, used with the `poly` or `semiparametric` model.
      project_dd_features: False

      # Flag to reset DD_features, used with the `poly` or `semiparametric` model.
      reset_dd_features: False

      # Flag to save the optimisation history for the non-parametric model.
      save_optim_history_nonparam: true

  # Training hyperparameters
  training_hparams:
    # Batch size
    batch_size: 32

    # Multi-cyclic parameters
    multi_cycle_params:
      # Total number of cycles to perform for training.
      total_cycles: 2

      # Train cycle definition. It can be: 'parametric', 'non-parametric', 'complete', 'only-non-parametric' or 'only-parametric'.
      cycle_def: complete

      # Flag to save all cycles. If True, create a checkpoint at every cycle; if False, only save the checkpoint at the end of training.
      save_all_cycles: True

      # Learning rates for training the parametric model parameters per cycle.
      learning_rate_params: [0.01, 0.004]

      # Learning rates for training the non-parametric model parameters per cycle.
      learning_rate_non_params: [0.1, 0.06]

      # Number of training epochs for training the parametric model parameters per cycle.
      n_epochs_params: [15, 15]

      # Number of training epochs for training the non-parametric model parameters per cycle.
      n_epochs_non_params: [100, 50]

@nadamoukaddem (Contributor) commented Feb 26, 2024

I remember I used the same TensorFlow version, 2.15.0, for both optimizers. It's been a while since I ran these tests, so I need to rerun them.

@jeipollack (Contributor, Author) commented:

Okay. Could you run the tests two or three more times using different random seeds to see if the results are consistent? Make sure to use the same set of seeds for both optimizers.

@nadamoukaddem (Contributor) commented:

I ran the tests twice with different random seeds, and the metrics changed slightly. However, I noticed something strange: whether I used Adam or Rectified Adam, I got the same numbers for the metrics. I'm using TensorFlow 2.15.0.

@jeipollack (Contributor, Author) commented Mar 4, 2024

I don't understand what you mean, as your first statement (the metrics differ) seems contrary to your second (the metrics are the same). Please explain more clearly and post the outputs.

@nadamoukaddem (Contributor) commented:

I have included the configurations I used in the attached text document: configs.yaml, training_config.yaml, and training_config_1.yaml. I didn't make any changes to the other configuration files.
configurations.odt
This is the output I get for the Adam and Rectified Adam optimizers for two random seeds:

Rectified Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

@jeipollack (Contributor, Author) commented Mar 5, 2024

Can you confirm whether you rebuilt the package (pip install .) between changing optimisers?
For added assurance, you can uninstall the wf-psf package, delete associated build directories, and reinstall with the following steps:

cd wf-psf
pip uninstall wf-psf -y
rm -rf build
cd src
rm -rf wf_psf.egg-info
cd ..
pip install .

Otherwise, create two branches: one with Adam and the other with Rectified Adam.

@nadamoukaddem (Contributor) commented:

wf-psf_Adam.log
wf-psf_legacy.log
I ran WaveDiff with TensorFlow 2.11 and the Adam optimizer. The code started the first cycle, then there was an error (see the log file). So I tried the legacy Adam optimizer, but then another error occurred at the end of the cycles.

@jeipollack (Contributor, Author) commented:

The first log looks like a GPU issue. I'm not sure how you're loading TensorFlow 2.11.

The second log states that you need to update the optimizer to be an instance of a TensorFlow 2.11+ compatible optimizer. This means replacing any legacy optimizers with their TensorFlow 2.11+ counterparts. So what worked for 2.15 doesn't work for 2.11.
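
To illustrate (a hedged sketch, not the exact WaveDiff change): in TF 2.11+ the default tf.keras.optimizers.Adam is the new optimizer class, while the old implementation lives under the legacy namespace, and the optimizer used when restoring a checkpoint has to match the one that wrote it.

import tensorflow as tf

# New-style optimizer (the TF 2.11+ default); matches checkpoints written
# with a v2.11+ optimizer.
new_adam = tf.keras.optimizers.Adam(learning_rate=1e-2)

# Legacy optimizer; matches checkpoints written with a pre-2.11 (or
# tensorflow-addons) optimizer.
legacy_adam = tf.keras.optimizers.legacy.Adam(learning_rate=1e-2)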

@nadamoukaddem (Contributor) commented:

If it were a GPU issue, I should have the same problem when using the legacy optimizer.

@jeipollack (Contributor, Author) commented:

True. I'm too busy right now to assist you, but you can try looking online for some clues.

@jeipollack (Contributor, Author) commented:

Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error

But sorry, this is as much as I can help at the moment.

@nadamoukaddem (Contributor) commented:

> Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error
>
> But sorry, this is as much as I can help at the moment.

I understand. Thank you. I'll take a look.

@nadamoukaddem (Contributor) commented:

psf_pytest_tf2.9.log
psf_pytest_tf2.11.log
These are the outputs of the validation tests for the two versions of TensorFlow (2.9.1 and 2.11).

@jeipollack (Contributor, Author) commented:

If you look carefully at your log, you will see that the validation tests did not run. They were skipped.

@jeipollack (Contributor, Author) commented Mar 21, 2024

And it seems you solved your TensorFlow 2.11 issue but didn't update this issue with how you solved it. It's really important to share the solution to a reported problem in the issue.

@nadamoukaddem (Contributor) commented:

No, I haven't solved it yet. I read (link) that the following steps can fix the error:

 # Install NVCC
 conda install -c nvidia cuda-nvcc=11.3.58
 # Configure the XLA cuda directory
 mkdir -p $CONDA_PREFIX/etc/conda/activate.d
 printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 # Copy libdevice file to the required path
 mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
 cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

I couldn't install it via Conda. Is this a good solution that I should keep working on?

@jeipollack (Contributor, Author) commented:

If you didn't solve it, I don't understand how you were able to run the tests for TensorFlow 2.11, unless you ran them on a different system.

You can submit a ticket to Jean-Zay/Idris Support.

@nadamoukaddem (Contributor) commented:

Following the instructions from Idris support, the problem was resolved by adding these lines:

export CUDA_DIR=$CUDA_HOME
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME

@nadamoukaddem (Contributor) commented:

WaveDiff runs without error using TensorFlow 2.11 and the Adam optimizer, but with the Rectified Adam and legacy optimizers, this error appears at the end of the cycles:
ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, which can cause errors. Please update the optimizer referenced in your code to be an instance of

@jeipollack (Contributor, Author) commented:

The message is cut off. Are you stuck here and unsure how to update the optimiser?

@nadamoukaddem (Contributor) commented:

Aren't we looking to compare the performance of the Adam optimizer with that of the Rectified Adam optimizer in this issue?

@jeipollack (Contributor, Author) commented Mar 26, 2024

The complete error message was already reported in #88 by Ezequiel. I tried to reproduce it, but I don't encounter this error when launching test runs. Below is my implementation.

pyproject.toml

[project]
name = "wf_psf"
requires-python = ">=3.9"
authors = [
    { "name" = "Tobias Liaudat", "email" = "[email protected]"},
    { "name" = "Jennifer Pollack", "email" = "[email protected]"},
]
maintainers = [
    { "name" = "Jennifer Pollack", "email" = "[email protected]" },
]

description = 'A software framework to perform Differentiable wavefront-based PSF modelling.'
dependencies = [
    "numpy",
    "scipy",
    "keras",
    "tensorflow",
    "tensorflow-addons",
    "tensorflow-estimator",
    "zernike",
    "opencv-python",
    "pillow",
    "galsim",
    "astropy",
    "matplotlib",
    "seaborn",
]

I looked at the TensorFlow Addons documentation and issues. Their optimizer points to tf.keras.optimizers.legacy.Optimizer, which is the legacy optimiser. So, the WaveDiff implementation of RectifiedAdam remains as is:

# Prepare the optimizers
param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_params[current_cycle - 1]
)
non_param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_non_params[current_cycle - 1]
)
logger.info("Starting cycle {}..".format(current_cycle))
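
A quick way to check this (a sketch, assuming TF 2.11+ and a recent tensorflow-addons):

import tensorflow as tf
import tensorflow_addons as tfa

# If this prints True, RectifiedAdam subclasses the legacy base class and
# will behave like a legacy optimizer under TF 2.11+.
print(issubclass(tfa.optimizers.RectifiedAdam, tf.keras.optimizers.legacy.Optimizer))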

I went through and changed all tf.keras.optimizers.Adam -> tf.keras.optimizers.legacy.Adam.

I decided to use git diff to show the file changes.

diff --git a/src/wf_psf/psf_models/psf_models.py b/src/wf_psf/psf_models/psf_models.py
index 262faaa..f22de1a 100644
--- a/src/wf_psf/psf_models/psf_models.py
+++ b/src/wf_psf/psf_models/psf_models.py
@@ -164,7 +164,7 @@ def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):

     # Define optimizer function
     if optimizer is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
         )

and made these replacements (although I don't think they are used):

diff --git a/src/wf_psf/training/train_utils.py b/src/wf_psf/training/train_utils.py
index 8c1036a..49fc2c5 100644
--- a/src/wf_psf/training/train_utils.py
+++ b/src/wf_psf/training/train_utils.py
@@ -160,7 +160,7 @@ def general_train_cycle(

     # Define optimisers
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate_param,
             beta_1=0.9,
             beta_2=0.999,
@@ -289,7 +289,7 @@ def general_train_cycle(

         # Define optimiser
         if non_param_optim is None:
-            optimizer = tf.keras.optimizers.Adam(
+            optimizer = tf.keras.optimizers.legacy.Adam(
                 learning_rate=learning_rate_non_param,
                 beta_1=0.9,
                 beta_2=0.999,
@@ -364,7 +364,7 @@ def param_train_cycle(

     # Define optimiser
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate,
             beta_1=0.9,
             beta_2=0.999,

Could you try this and report an update?

@nadamoukaddem (Contributor) commented:

I don't have the build_PSF_model function in psf_models.py. Is it in the develop branch?

@jeipollack (Contributor, Author) commented:

No. Either way, you can search for it in your branch and modify it there.

@nadamoukaddem (Contributor) commented:

This function doesn't exist. I can work on what you mentioned in yesterday's meeting so we can discuss this issue later.

@jeipollack (Contributor, Author) commented:

It does exist:

def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):
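
For context, a hypothetical call-site sketch (model_inst stands in for a wf_psf PSF model instance, an assumption on my part; the function builds a default Adam internally when optimizer is None):

import tensorflow as tf

# Hypothetical usage, not from the repo: pass a legacy optimizer explicitly
# instead of relying on the default created inside build_PSF_model.
legacy_adam = tf.keras.optimizers.legacy.Adam(
    learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
)
model = build_PSF_model(model_inst, optimizer=legacy_adam)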

@nadamoukaddem (Contributor) commented:

OK, so it's in tf_psf_field.py and not in psf_models.py.

@jeipollack (Contributor, Author) commented Mar 27, 2024

Yes, I meant for you to search for the function in your branch to find which module it is in. Apologies if that wasn't clear.

I've done some refactoring in the branch where I tested, and I moved the function to psf_models.py. That's related to the new task I asked you to work on yesterday, but I've decided I will work on it myself.

@nadamoukaddem (Contributor) commented:

Thank you. I tried the changes that you made in the tf_psf_field.py file, and I obtained the same results when comparing the two optimizers with TensorFlow 2.9 and 2.11. I extracted the metrics from the metrics-poly-coherent_euclid_200stars.npy file and used these training configuration parameters: training_config.log

[Screenshot from 2024-03-28: metrics comparison]

@jeipollack (Contributor, Author) commented:

Thanks. Can you work on a pull request to update TensorFlow to 2.11 with the Rectified Adam optimiser, as well as the associated package dependencies, i.e. Keras, TensorFlow-Addons, etc.?

Make sure to run the validation tests on training and metrics; you have to run them locally with pytest, as they cannot run during CI.

@nadamoukaddem (Contributor) commented:

The train_test is failing even though I can run WaveDiff normally.
psf_pytest.log

@jeipollack (Contributor, Author) commented:

I am unable to reproduce your error, nor have I ever encountered it.

My steps were:

  • Check out a new branch from develop
  • Apply all the changes noted above concerning the optimizers and the pyproject.toml file
  • Add the module load tensorflow 2.11 and environment commands in the batch script

See the output here: psf_pytest1.txt

@jeipollack (Contributor, Author) commented:

Hi Nada, I can open the PR since I was able to implement the needed change without an issue.

Could you work on #133, which is more urgently needed? Let Tobias and me know if you have questions, either by asking directly in #133 or in #sgs-sdc-fr-psf.

@nadamoukaddem (Contributor) commented:

Hi Jennifer, OK.
