Dear NequIP developer team,
I am reaching out to report an anomaly encountered while using NequIP (stress_branch), which occurs when I increase the size of the dataset used to build my model. The software seems to function correctly with up to 1000 configurations (energies and forces), but when I move to 2000 I start to see some irregularities.
In the initial phase of training, the metrics reported in the Train section show unusually high, nonsensical values. The Validation metrics, while appearing more coherent, are significantly higher than those obtained with the smaller dataset. Here is a portion of the log file up to the anomalous results:
torch device: cuda
Processing dataset...
Loaded data: Batch(atomic_numbers=[712200, 1], batch=[712200], cell=[2000, 3, 3], edge_cell_shift=[30711318, 3], edge_index=[2, 30711318], forces=[712200, 3], pbc=[2000, 3], pos=[712200, 3], ptr=[2001], total_energy=[2000, 1])
Cached processed data to disk
Done!
Successfully loaded the data set of type ASEDataset(2000)...
Replace string dataset_forces_rms to 55.92239761352539
Replace string dataset_per_atom_total_energy_mean to -9247.5546875
Atomic outputs are scaled by: [B, O, Na, Si, Ca: 55.922398], shifted by [B, O, Na, Si, Ca: -9247.554688].
/ccc/work/cont003/gen7069/brugnolu/My_venv/lib/python3.8/site-packages/nequip/nn/_grad_output.py:196: UserWarning: !! Stresses in NequIP are in BETA and UNDER DEVELOPMENT: please carefully check the sanity of your results and report any (potential) issues on the GitHub
warnings.warn(
Replace string dataset_forces_rms to 55.92239761352539
Initially outputs are globally scaled by: 55.92239761352539, total_energy are globally shifted by None.
Successfully built the network...
Number of weights: 132280
! Starting training ...
validation
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
Initialization # Epoch wal LR loss_f loss_e loss f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
! Initial Validation 0 7.828 0.005 0.996 422 423 42.2 55.9 50.2 40.2 20.1 54.6 24 37.8 64.6 52.8 26.3 69.5 31.7 49 3.05e+05 890
Wall time: 7.831220627995208
! Best model 0 423.427
training
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
validation
Epoch batch loss loss_f loss_e f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
Train # Epoch wal LR loss_f loss_e loss f_mae f_rmse B_f_mae O_f_mae Na_f_mae Si_f_mae Ca_f_mae psavg_f_mae B_f_rmse O_f_rmse Na_f_rmse Si_f_rmse Ca_f_rmse psavg_f_rmse e_mae e/N_mae
! Train 1 62.458 0.005 8e+24 4.19e+19 8e+24 2.09e+12 1.5e+14 4.3e+12 7.11e+11 4.85e+12 1.16e+12 1.53e+13 5.27e+12 2.04e+14 3.62e+13 2.18e+14 6.67e+13 6.56e+14 2.36e+14 3.29e+12 9.05e+09
! Validation 1 62.458 0.005 36.4 94.8 131 246 339 271 210 499 185 535 340 346 279 635 252 684 439 8.65e+04 286
Wall time: 62.46107454999583
! Best model 1 131.214
I am fairly confident that the problem is not due to outliers in the dataset, as the same behavior occurs with different samplings from the trajectories. The dataset was assembled by taking snapshots from AIMD simulations of boxes of different sizes and compositions, ranging from 250 to 450 atoms. I also don't know whether the fault lies in the stress_branch version: I get the same behavior using both StressForceOutput and ForceOutput in the model builders. Here is the config I used: config.yaml.txt
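For context, the kind of per-frame sanity check I have in mind looks roughly like this (a minimal sketch that reads the trajectory with ASE; the filename is just a placeholder for my actual data, and energies/forces are assumed to be in ASE's default eV and eV/Å):

```python
import numpy as np
from ase.io import read

# "dataset.extxyz" is a placeholder for the trajectory file the dataset was built from.
frames = read("dataset.extxyz", index=":")

max_force = []   # largest absolute force component in each frame
e_per_atom = []  # total energy divided by the number of atoms in each frame
for atoms in frames:
    max_force.append(np.abs(atoms.get_forces()).max())
    e_per_atom.append(atoms.get_potential_energy() / len(atoms))

max_force = np.asarray(max_force)
e_per_atom = np.asarray(e_per_atom)

print("max |F| over all frames:", max_force.max())
print("per-atom energy mean / std:", e_per_atom.mean(), e_per_atom.std())

# Frames whose per-atom energy sits more than 5 sigma from the mean would be suspects.
suspects = np.where(np.abs(e_per_atom - e_per_atom.mean()) > 5 * e_per_atom.std())[0]
print("suspect frame indices:", suspects.tolist())
```

Checks of this sort on the different samplings did not flag anything obviously pathological, which is why I lean away from the outlier explanation.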
I would also like to report an additional issue encountered with an even larger dataset of around 4000 configurations. In this scenario I run into a different problem: either the training stops before completing an epoch due to insufficient memory, or, after reducing batch_size from 5 to 1, the metrics calculated before the end of the first epoch, and all metrics thereafter, turn out to be 0. If the training is not interrupted, it continues until the maximum number of epochs is reached without any variation.
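To confirm whether memory is really the limiting factor, I am thinking of measuring the peak GPU memory around a short run, roughly like this (a minimal sketch; the training call itself is only indicated as a comment):

```python
import torch

# Reset the peak-memory counter before running a few training batches.
torch.cuda.reset_peak_memory_stats()

# ... run one epoch (or a handful of batches) of the actual training here ...

peak = torch.cuda.max_memory_allocated() / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"peak GPU memory: {peak:.2f} GiB of {total:.2f} GiB available")
```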
I believe that both of these issues are related to exceeding the available memory, but for some reason, in the second case, this does not terminate the training. I would be really grateful for any suggestions or solutions to these problems.