Update to recent versions of TensorFlow #90

Open
jeipollack opened this issue Dec 7, 2023 · 38 comments

@jeipollack (Contributor) commented Dec 7, 2023

WaveDiff implements the Rectified Adam optimiser from the TensorFlow Addons library, whose development has stopped (see details in link); minimal maintenance releases will continue until May 2024. As a result, the optimiser is not compatible with the latest versions of TensorFlow (2.11+). Interestingly, 2.9+ is also affected, producing the error reported in Issue #88 when loading a saved checkpoint. The Rectified Adam optimiser is currently not part of the core TensorFlow 2 library. The fix is to use the tf.keras.optimizers.legacy namespace (see here), which allows the old optimizers to keep working.
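
A minimal sketch of that workaround (an illustration assuming TensorFlow 2.11+, not the exact WaveDiff change):

import tensorflow as tf

# Under TF 2.11+, the pre-2.11 optimizer implementations are kept in the
# tf.keras.optimizers.legacy namespace, so the old Adam can be swapped in:
legacy_adam = tf.keras.optimizers.legacy.Adam(learning_rate=1e-2)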

This issue covers a couple of tasks:

  • Re-evaluate the performance of the Adam optimiser versus the Rectified Adam optimiser to see whether the latter is necessary
  • If the Adam optimiser is good enough, update WaveDiff to recent versions of TensorFlow with the new optimiser functionality
  • If not, re-evaluate possible next steps, such as asking the TF community if and when the migration of Rectified Adam will happen and/or developing a custom optimiser
@jeipollack added the enhancement (New feature or request) label on Dec 7, 2023
@nadamoukaddem (Contributor) commented:

After updating the optimizer to the legacy version, running the validation test on shape metrics resulted in the following error:

assert ratio_rmse_e1 < tol
E       assert 2.450450287128092e-09 < 1e-09

Loosening the tolerance to 1e-8 lets that assertion pass, but then this one fails:

assert ratio_rel_rmse_e1 < tol
E       assert 1.647135320670401e-05 < 1e-08

@nadamoukaddem (Contributor) commented Feb 26, 2024

I evaluated the performance using the Adam Optimizer versus the Rectified Adam Optimizer, and here are the results I obtained:
[Image: Adam_Vs_RectifiedAdam — metrics comparison plot]
I used these numbers of training epochs:

# Number of training epochs for training the parametric model parameters per cycle.
n_epochs_params: [15, 15]
# Number of training epochs for training the non-parametric model parameters per cycle.
n_epochs_non_params: [100, 50]

@jeipollack (Contributor, Author) commented:

Thanks @nadamoukaddem for this update. Just to confirm: did you use the same random seed and learning rates for both runs?
Perhaps you could share the complete training configuration file parameters.

@nadamoukaddem (Contributor) commented:

Yes, I used the same random seed and learning rates for both runs. Here is the training_config.yaml file:

training:
  # Run ID name
  id_name: -coherent_euclid_200stars
  # Name of Data Config file
  data_config: data_config.yaml
  # Metrics Config file - enter a file to run metrics evaluation; if empty, run training only
  metrics_config: metrics_config.yaml
  # PSF model parameters
  model_params:
    # Model type. Options are: 'mccd', 'graph', 'poly', 'param', 'poly_physical'.
    model_name: poly

    # Number of wavelength bins to reconstruct polychromatic objects.
    n_bins_lda: 8

    # Downsampling rate to match the oversampled model to the specified telescope's sampling.
    output_Q: 3

    # Oversampling rate used for the OPD/WFE PSF model.
    oversampling_rate: 3

    # Dimension of the pixel PSF postage stamp.
    output_dim: 32

    # Dimension of the OPD/Wavefront space.
    pupil_diameter: 256

    # Boolean to define if we use sample weights based on the noise standard deviation estimation.
    use_sample_weights: True

    # Interpolation type for the physical poly model. Options are: 'none', 'all', 'top_K', 'independent_Zk'.
    interpolation_type: None

    # SED interpolation points per bin.
    sed_interp_pts_per_bin: 0

    # SED extrapolate.
    sed_extrapolate: True

    # SED interpolation kind.
    sed_interp_kind: linear

    # Standard deviation of the multiplicative SED Gaussian noise.
    sed_sigma: 0

    # Limits of the PSF field coordinates for the x axis.
    x_lims: [0.0, 1.0e+3]

    # Limits of the PSF field coordinates for the y axis.
    y_lims: [0.0, 1.0e+3]

    # Hyperparameters for the parametric model
    param_hparams:
      # Set the random seed for TensorFlow initialization.
      random_seed: 3877572

      # Set the parameter for the l2 loss function for the Optical Path Differences (OPD)/WFE.
      l2_param: 0.

      # Zernike polynomial modes to use on the parametric part.
      n_zernikes: 15

      # Max polynomial degree of the parametric part.
      d_max: 2

      # Flag to save the optimisation history for the parametric model.
      save_optim_history_param: true

    # Hyperparameters for the non-parametric model
    nonparam_hparams:
      # Max polynomial degree of the non-parametric part.
      d_max_nonparam: 5

      # Number of graph features.
      num_graph_features: 10

      # L1 regularisation parameter for the non-parametric part.
      l1_rate: 1.0e-8

      # Flag to enable projected learning for DD_features, used with the `poly` or `semiparametric` model.
      project_dd_features: False

      # Flag to reset DD_features, used with the `poly` or `semiparametric` model.
      reset_dd_features: False

      # Flag to save the optimisation history for the non-parametric model.
      save_optim_history_nonparam: true

  # Training hyperparameters
  training_hparams:
    # Batch size
    batch_size: 32

    # Multi-cyclic parameters
    multi_cycle_params:
      # Total number of cycles to perform for training.
      total_cycles: 2

      # Train cycle definition. It can be: 'parametric', 'non-parametric', 'complete', 'only-non-parametric' or 'only-parametric'.
      cycle_def: complete

      # Flag to save all cycles. If True, create a checkpoint at every cycle; if False, only save the checkpoint at the end of training.
      save_all_cycles: True

      # Learning rates for training the parametric model parameters per cycle.
      learning_rate_params: [0.01, 0.004]

      # Learning rates for training the non-parametric model parameters per cycle.
      learning_rate_non_params: [0.1, 0.06]

      # Number of training epochs for training the parametric model parameters per cycle.
      n_epochs_params: [15, 15]

      # Number of training epochs for training the non-parametric model parameters per cycle.
      n_epochs_non_params: [100, 50]

@nadamoukaddem (Contributor) commented Feb 26, 2024

I remember I used the same TensorFlow version, 2.15.0, for both optimizers. It's been a while since I ran these tests, so I need to rerun them.

@jeipollack (Contributor, Author) commented:

Okay. Could you run the tests two or three more times using different random seeds to see if the results are consistent? Make sure to use the same set of seeds for both optimizers.

@nadamoukaddem (Contributor) commented:

I ran the tests twice with different random seeds, and the metrics changed slightly. However, I noticed something strange: whether I used Adam or Rectified Adam, I got the same numbers for the metrics. I'm using TensorFlow 2.15.0.

@jeipollack (Contributor, Author) commented Mar 4, 2024

I don't understand what you mean, as your first statement (the metrics differ) seems contrary to your second (the metrics are the same). Please explain more clearly and post the outputs.

@nadamoukaddem (Contributor) commented:

I have included the configurations I used in the attached text document: configs.yaml, training_config.yaml, and training_config_1.yaml. I didn't make any changes to the other configuration files.
configurations.odt
This is the output I get for the Adam and Rectified Adam optimizers for two random seeds:

Rectified Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

@jeipollack (Contributor, Author) commented Mar 5, 2024

Can you confirm whether you rebuilt the package (pip install .) between changing optimisers?
For added assurance, you can uninstall the wf-psf package, delete associated build directories, and reinstall with the following steps:

cd wf-psf
pip uninstall wf-psf -y
rm -rf build
cd src
rm -rf wf_psf.egg-info
cd ..
pip install .

Otherwise, create two branches: one with Adam and the other with Rectified Adam.

@nadamoukaddem (Contributor) commented:

wf-psf_Adam.log
wf-psf_legacy.log
I ran WaveDiff with TensorFlow 2.11 and the Adam optimizer. The code started the first cycle, then there was an error (see the log file). So I tried the legacy Adam optimizer, but then another error occurred at the end of the cycles.

@jeipollack (Contributor, Author) commented:

The first log looks like a GPU issue. I'm not sure how you're loading TensorFlow 2.11.

The second log states that you need to update the optimizer to be an instance of a TensorFlow 2.11+ compatible optimizer. This means replacing any legacy optimizers with their TensorFlow 2.11+ counterparts. So what worked for 2.15 doesn't work for 2.11.
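
To illustrate (a hedged sketch, not the exact WaveDiff change): in TF 2.11+ the default tf.keras.optimizers.Adam is the new optimizer class, while the old implementation lives under the legacy namespace, and the optimizer used when restoring a checkpoint has to match the one that wrote it.

import tensorflow as tf

# New-style optimizer (the TF 2.11+ default); matches checkpoints written
# with a v2.11+ optimizer.
new_adam = tf.keras.optimizers.Adam(learning_rate=1e-2)

# Legacy optimizer; matches checkpoints written with a pre-2.11 (or
# tensorflow-addons) optimizer.
legacy_adam = tf.keras.optimizers.legacy.Adam(learning_rate=1e-2)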

@nadamoukaddem (Contributor) commented:

If it were a GPU issue, I should have the same problem when using the legacy optimizer.

@jeipollack (Contributor, Author) commented:

True. I'm too busy right now to assist you, but you can try looking online for some clues.

@jeipollack (Contributor, Author) commented:

Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error

But sorry, this is as much as I can help at the moment.

@nadamoukaddem (Contributor) commented:

> Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error
>
> But sorry, this is as much as I can help at the moment.

I understand. Thank you. I'll take a look.

@nadamoukaddem (Contributor) commented:

psf_pytest_tf2.9.log
psf_pytest_tf2.11.log
These are the outputs of the validation tests for the two versions of TensorFlow (2.9.1 and 2.11).

@jeipollack (Contributor, Author) commented:

If you look carefully at your log, you will see that the validation tests did not run. They were skipped.

@jeipollack (Contributor, Author) commented Mar 21, 2024

And it seems you solved your TensorFlow 2.11 issue but didn't update this issue with how you solved it. It's really important to share the solution to a reported problem in the issue.

@nadamoukaddem (Contributor) commented:

No, I haven't solved it yet. I read (link) that the following steps can fix the error:

 # Install NVCC
 conda install -c nvidia cuda-nvcc=11.3.58
 # Configure the XLA cuda directory
 mkdir -p $CONDA_PREFIX/etc/conda/activate.d
 printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 # Copy libdevice file to the required path
 mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
 cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

I couldn't install it via Conda. Is this a good solution that I should keep working on?

@jeipollack (Contributor, Author) commented:

If you didn't solve it, I don't understand how you were able to run the tests for TensorFlow 2.11, unless you ran them on a different system.

You can submit a ticket to Jean-Zay/Idris Support.

@nadamoukaddem (Contributor) commented:

Following the instructions from Idris support, the problem was resolved by adding these lines:

export CUDA_DIR=$CUDA_HOME
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME

@nadamoukaddem (Contributor) commented:

WaveDiff runs without error using TensorFlow 2.11 and the Adam optimizer, but with the Rectified Adam and legacy optimizers, this error appears at the end of the cycles:
ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, which can cause errors. Please update the optimizer referenced in your code to be an instance of

@jeipollack (Contributor, Author) commented:

The message is cut off. Are you stuck here and unsure how to update the optimiser?

@nadamoukaddem (Contributor) commented:

Aren't we looking to compare the performance of the Adam optimizer with that of the Rectified Adam optimizer in this issue?

@jeipollack (Contributor, Author) commented Mar 26, 2024

The complete error message was already reported in #88 by Ezequiel. I tried to reproduce it, but I don't encounter this error when launching test runs. Below is my implementation.

pyproject.toml

[project]
name = "wf_psf"
requires-python = ">=3.9"
authors = [
    { "name" = "Tobias Liaudat", "email" = "[email protected]"},
    { "name" = "Jennifer Pollack", "email" = "[email protected]"},
]
maintainers = [
    { "name" = "Jennifer Pollack", "email" = "[email protected]" },
]

description = 'A software framework to perform Differentiable wavefront-based PSF modelling.'
dependencies = [
    "numpy",
    "scipy",
    "keras",
    "tensorflow",
    "tensorflow-addons",
    "tensorflow-estimator",
    "zernike",
    "opencv-python",
    "pillow",
    "galsim",
    "astropy",
    "matplotlib",
    "seaborn",
]

I looked at the TensorFlow Addons documentation and issues. Their optimizer points to tf.keras.optimizers.legacy.Optimizer, which is the legacy optimiser. So, the WaveDiff implementation of RectifiedAdam remains as is:

# Prepare the optimizers
param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_params[current_cycle - 1]
)
non_param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_non_params[current_cycle - 1]
)
logger.info("Starting cycle {}..".format(current_cycle))
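
A quick way to check this (a sketch, assuming TF 2.11+ and a recent tensorflow-addons):

import tensorflow as tf
import tensorflow_addons as tfa

# If this prints True, RectifiedAdam subclasses the legacy base class and
# will behave like a legacy optimizer under TF 2.11+.
print(issubclass(tfa.optimizers.RectifiedAdam, tf.keras.optimizers.legacy.Optimizer))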

I went through and changed all tf.keras.optimizers.Adam -> tf.keras.optimizers.legacy.Adam.

I decided to use git diff to show the file changes.

diff --git a/src/wf_psf/psf_models/psf_models.py b/src/wf_psf/psf_models/psf_models.py
index 262faaa..f22de1a 100644
--- a/src/wf_psf/psf_models/psf_models.py
+++ b/src/wf_psf/psf_models/psf_models.py
@@ -164,7 +164,7 @@ def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):

     # Define optimizer function
     if optimizer is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
         )

and made these replacements (although I don't think they are used):

diff --git a/src/wf_psf/training/train_utils.py b/src/wf_psf/training/train_utils.py
index 8c1036a..49fc2c5 100644
--- a/src/wf_psf/training/train_utils.py
+++ b/src/wf_psf/training/train_utils.py
@@ -160,7 +160,7 @@ def general_train_cycle(

     # Define optimisers
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate_param,
             beta_1=0.9,
             beta_2=0.999,
@@ -289,7 +289,7 @@ def general_train_cycle(

         # Define optimiser
         if non_param_optim is None:
-            optimizer = tf.keras.optimizers.Adam(
+            optimizer = tf.keras.optimizers.legacy.Adam(
                 learning_rate=learning_rate_non_param,
                 beta_1=0.9,
                 beta_2=0.999,
@@ -364,7 +364,7 @@ def param_train_cycle(

     # Define optimiser
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate,
             beta_1=0.9,
             beta_2=0.999,

Could you try this and report an update?

@nadamoukaddem (Contributor) commented:

I don't have the build_PSF_model function in psf_models.py. Is it in the develop branch?

@jeipollack (Contributor, Author) commented:

No. Either way, you can search for it in your branch and modify it there.

@nadamoukaddem (Contributor) commented:

This function doesn't exist. I can work on what you mentioned in yesterday's meeting so we can discuss this issue later.

@jeipollack (Contributor, Author) commented:

It does exist:

def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):
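
For context, a hypothetical call-site sketch (model_inst stands in for a wf_psf PSF model instance, an assumption on my part; the function builds a default Adam internally when optimizer is None):

import tensorflow as tf

# Hypothetical usage, not from the repo: pass a legacy optimizer explicitly
# instead of relying on the default created inside build_PSF_model.
legacy_adam = tf.keras.optimizers.legacy.Adam(
    learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
)
model = build_PSF_model(model_inst, optimizer=legacy_adam)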

@nadamoukaddem (Contributor) commented:

OK, so it's in tf_psf_field.py and not in psf_models.py.

@jeipollack (Contributor, Author) commented Mar 27, 2024

Yes, I meant for you to search for the function in your branch to find which module it is in. Apologies if that wasn't clear.

I've done some refactoring in the branch where I tested, and I moved the function to psf_models.py. That's related to the new task I asked you to work on yesterday, but I've decided I will work on it myself.

@nadamoukaddem (Contributor) commented:

Thank you. I tried the changes that you made in the tf_psf_field.py file, and I obtained the same results when comparing the two optimizers with TensorFlow 2.9 and 2.11. I extracted the metrics from the metrics-poly-coherent_euclid_200stars.npy file and used these training configuration parameters: training_config.log

[Screenshot from 2024-03-28: metrics comparison]

@jeipollack (Contributor, Author) commented:

Thanks. Can you work on a pull request to update TensorFlow to 2.11 with the Rectified Adam optimiser, as well as the associated package dependencies, i.e. Keras, TensorFlow-Addons, etc.?

Make sure to run the validation tests on training and metrics; you have to run them locally with pytest, as they cannot run during CI.

@nadamoukaddem (Contributor) commented:

The train_test is failing even though I can run WaveDiff normally.
psf_pytest.log

@jeipollack (Contributor, Author) commented:

I am unable to reproduce your error, nor have I ever encountered it.

My steps were:

  • Check out a new branch from develop
  • Apply all the changes noted above concerning the optimizers and the pyproject.toml file
  • Add the module load tensorflow 2.11 and environment commands in the batch script

See the output here: psf_pytest1.txt

@jeipollack (Contributor, Author) commented:

Hi Nada, I can open the PR since I was able to implement the needed change without an issue.

Could you work on #133, which is more urgently needed? Let Tobias and me know if you have questions, either by asking directly in #133 or in #sgs-sdc-fr-psf.

@nadamoukaddem (Contributor) commented:

Hi Jennifer, OK.
