Dear NequIP developer team,
I am reaching out to report an anomaly encountered while using NequIP (stress_branch), which occurs when I increase the size of the dataset used to build my model. The software seems to function correctly with up to 1000 configurations (energies and forces), but when I move to 2000 I start to see some irregularities.
In the initial phase of training, the metrics reported in the Train section show unusually high, nonsensical values. The Validation metrics, while appearing more coherent, are significantly higher than those obtained with the smaller dataset. Here is a portion of the log file up to the anomalous results:
torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[712200, 1], batch=[712200], cell=[2000, 3, 3], edge_cell_shift=[30711318, 3], edge_index=[2, 30711318], forces=[712200, 3], pbc=[2000, 3], pos=[712200, 3], ptr=[2001], total_energy=[2000, 1])
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(2000)...
Replace string dataset_forces_rms to 55.92239761352539
Replace string dataset_per_atom_total_energy_mean to -9247.5546875
Atomic outputs are scaled by: [B, O, Na, Si, Ca: 55.922398], shifted by [B, O, Na, Si, Ca: -9247.554688].
/ccc/work/cont003/gen7069/brugnolu/My_venv/lib/python3.8/site-packages/nequip/nn/_grad_output.py:196: UserWarning: !! Stresses in NequIP are in BETA and UNDER DEVELOPMENT: please carefully check the sanity of your results and report any (potential) issues on the GitHub
warnings.warn(
Replace string dataset_forces_rms to 55.92239761352539
Initially outputs are globally scaled by: 55.92239761352539, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 132280
! Starting training ...
validation
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
Initialization # Epoch wal LR loss_f loss_e loss f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
! Initial Validation 0 7.828 0.005 0.996 422 423 42.2 55.9 50.2 40.2 20.1 54.6 24 37.8 64.6 52.8 26.3 69.5 31.7 49 3.05e+05 890
Wall time: 7.831220627995208
! Best model 0 423.427
training
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
validation
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
Train # Epoch wal LR loss_f loss_e loss f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
! Train 1 62.458 0.005 8e+24 4.19e+19 8e+24 2.09e+12 1.5e+14 4.3e+12 7.11e+11 4.85e+12 1.16e+12 1.53e+13 5.27e+12 2.04e+14 3.62e+13 2.18e+14 6.67e+13 6.56e+14 2.36e+14 3.29e+12 9.05e+09
! Validation 1 62.458 0.005 36.4 94.8 131 246 339 271 210 499 185 535 340 346 279 635 252 684 439 8.65e+04 286
Wall time: 62.46107454999583
! Best model 1 131.214
I am fairly confident that the problem is not due to outliers in the dataset, as the same behavior occurs with different samplings from the trajectories. The dataset was assembled by taking snapshots from AIMD simulations of boxes of different sizes and compositions, ranging from 250 to 450 atoms. I also don't know whether the fault lies in the stress_branch version: I get the same behavior using both StressForceOutput and ForceOutput in the model builders. Here is the config I used: config.yaml.txt
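For context, the kind of per-frame sanity check I have in mind looks roughly like this (a minimal sketch that reads the trajectory with ASE; the filename is just a placeholder for my actual data, and energies/forces are assumed to be in ASE's default eV and eV/Å):

```python
import numpy as np
from ase.io import read

# "dataset.extxyz" is a placeholder for the trajectory file the dataset was built from.
frames = read("dataset.extxyz", index=":")

max_force = []   # largest absolute force component in each frame
e_per_atom = []  # total energy divided by the number of atoms in each frame
for atoms in frames:
    max_force.append(np.abs(atoms.get_forces()).max())
    e_per_atom.append(atoms.get_potential_energy() / len(atoms))

max_force = np.asarray(max_force)
e_per_atom = np.asarray(e_per_atom)

print("max |F| over all frames:", max_force.max())
print("per-atom energy mean / std:", e_per_atom.mean(), e_per_atom.std())

# Frames whose per-atom energy sits more than 5 sigma from the mean would be suspects.
suspects = np.where(np.abs(e_per_atom - e_per_atom.mean()) > 5 * e_per_atom.std())[0]
print("suspect frame indices:", suspects.tolist())
```

Checks of this sort on the different samplings did not flag anything obviously pathological, which is why I lean away from the outlier explanation.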
I would also like to report an additional issue encountered with an even larger dataset of around 4000 configurations. In this scenario I run into a different problem: either the training stops before completing an epoch due to insufficient memory, or, after reducing batch_size from 5 to 1, the metrics calculated before the end of the first epoch, and all metrics thereafter, turn out to be 0. If the training is not interrupted, it continues until the maximum number of epochs is reached without any variation.
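To confirm whether memory is really the limiting factor, I am thinking of measuring the peak GPU memory around a short run, roughly like this (a minimal sketch; the training call itself is only indicated as a comment):

```python
import torch

# Reset the peak-memory counter before running a few training batches.
torch.cuda.reset_peak_memory_stats()

# ... run one epoch (or a handful of batches) of the actual training here ...

peak = torch.cuda.max_memory_allocated() / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"peak GPU memory: {peak:.2f} GiB of {total:.2f} GiB available")
```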
I believe that both of these issues are related to exceeding the available memory, but for some reason, in the second case, this does not terminate the training. I would be really grateful for any suggestions or solutions to these problems.