Out of memory from the beginning. #374
Unanswered
turbosonics asked this question in Q&A
Replies: 0 comments
My extxyz input contains 11820 frames (194-atom system) of VASP DFT data: ~1200 frames of 300 K AIMD, ~3000 frames of 3000 K MD, ~5000 frames of very-high-temperature AIMD, and ~20 frames of geometry optimizations for the equation of state (each frame is a geometry optimization at an expanded or compressed volume). The system is a multi-component (4 elements) condensed phase in a cubic cell.
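For scale, here is a rough back-of-envelope sketch of how large the raw per-atom arrays in this dataset are. It assumes positions and forces are stored as float64 3-vectors and counts nothing else (no neighbor lists, no model activations, no energies or stresses), so it is only a lower bound on the data itself and says nothing about what the training code allocates. The helper name `raw_array_bytes` is made up for this illustration.

```python
# Back-of-envelope size of the raw per-atom arrays in the dataset.
# Assumption: positions and forces only, stored as float64 (8 bytes each).
N_FRAMES = 11820
N_ATOMS = 194
BYTES_PER_FLOAT64 = 8


def raw_array_bytes(n_frames, n_atoms, n_vector_fields=2,
                    bytes_per_value=BYTES_PER_FLOAT64):
    """Bytes needed for n_vector_fields per-atom 3-vectors over all frames."""
    return n_frames * n_atoms * 3 * n_vector_fields * bytes_per_value


total_atoms = N_FRAMES * N_ATOMS
size_mib = raw_array_bytes(N_FRAMES, N_ATOMS) / 2**20
print(f"{total_atoms} atoms in total, "
      f"about {size_mib:.0f} MiB of positions + forces")
```

Under these assumptions the raw coordinates and forces come to roughly a hundred MiB, which is small compared with the memory of an A100.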
The training job crashes a few minutes after I submit it. This is the end of the error file generated by Slurm on our local facility's server:
The log file contains almost nothing. It printed only this line:
Torch device: cuda
That is all.
I requested 1 node, and each GPU node on our server has 1 GPU (NVIDIA A100). What would be the best way to get around this OOM error? I don't even know which hyperparameter to touch...