Out of memory from the beginning. #374

turbosonics · 2023-09-29T18:53:07Z

My extxyz input geometry contains 11820 frames (194-atom system) of VASP DFT data: ~1200 frames of 300 K AIMD, ~3000 frames of 3000 K MD, ~5000 frames of super-high-temperature AIMD, and ~20 frames of geometry optimizations for the equation of state (each frame is a geometry optimization at an expanded or compressed volume). The system is a multicomponent (4 elements) condensed phase in a cubic cell.
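For scale, here is a rough back-of-envelope sketch of how large this dataset is once loaded. The frame and atom counts come from the description above; everything else is an assumption made for illustration only: float64 storage for positions and forces, one pair of int64 indices per directed neighbor edge, and a hypothetical average of 50 neighbors per atom (the real number depends on the cutoff radius and the density). Whether the training code holds all of this in host memory at once is not confirmed here.

```python
# Back-of-envelope memory estimate for the dataset described in the post.
N_FRAMES = 11820        # from the post
ATOMS_PER_FRAME = 194   # from the post
AVG_NEIGHBORS = 50      # ASSUMPTION: hypothetical, depends on cutoff and density
BYTES_FLOAT64 = 8       # ASSUMPTION: data stored in double precision
BYTES_INT64 = 8         # ASSUMPTION: edge indices stored as 64-bit integers


def total_atoms(n_frames: int, atoms_per_frame: int) -> int:
    """Total number of atoms summed over all frames."""
    return n_frames * atoms_per_frame


def raw_bytes(n_atoms: int) -> int:
    """Bytes for positions (3 floats) plus forces (3 floats) per atom."""
    return n_atoms * 6 * BYTES_FLOAT64


def edge_index_bytes(n_atoms: int, avg_neighbors: int) -> int:
    """Bytes for a neighbor list storing two integer indices per directed edge."""
    return n_atoms * avg_neighbors * 2 * BYTES_INT64


if __name__ == "__main__":
    n_atoms = total_atoms(N_FRAMES, ATOMS_PER_FRAME)
    print(f"total atoms:          {n_atoms}")
    print(f"positions + forces:   {raw_bytes(n_atoms) / 2**20:.1f} MiB")
    print(f"edge indices (guess): {edge_index_bytes(n_atoms, AVG_NEIGHBORS) / 2**30:.2f} GiB")
```

Under these assumptions the coordinates and forces are small, and any per-edge data dominates. That makes the cutoff radius and the number of frames the quantities that matter most for an estimate like this.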

The training job crashes a few minutes after I submit it. The end of the error file generated by Slurm on our local facility server reads:

...
/cm/local/apps/slurm/var/spool/job267378/slurm_script: line 53: 1008019 Killed                  nequip-train ./example_mine.yaml > train.out 2> train.err
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=267378.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The log file contains almost nothing. It printed only this single line:

Torch device: cuda
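One detail worth separating out: the Slurm message above comes from the cgroup out-of-memory handler, which enforces the job's host (CPU) memory limit. GPU memory exhaustion normally shows up as a PyTorch error containing "CUDA out of memory" instead. A minimal sketch of that distinction follows. The function name `classify_oom` is hypothetical, and the string matching is a simple heuristic, not an exhaustive parser.

```python
def classify_oom(log_text: str) -> str:
    """Guess which memory ran out from an error log.

    Returns "host" for a Slurm/cgroup oom-kill, "gpu" for a PyTorch
    CUDA out-of-memory error, and "unknown" otherwise.
    """
    lowered = log_text.lower()
    # Slurm reports cgroup kills with "oom-kill" / "out-of-memory handler".
    if "oom-kill" in lowered or "out-of-memory handler" in lowered:
        return "host"
    # PyTorch reports GPU exhaustion with "CUDA out of memory".
    if "cuda out of memory" in lowered:
        return "gpu"
    return "unknown"


if __name__ == "__main__":
    slurm_line = (
        "slurmstepd: error: Detected 1 oom-kill event(s) in "
        "StepId=267378.batch. Some of your processes may have been "
        "killed by the cgroup out-of-memory handler."
    )
    print(classify_oom(slurm_line))
```

Read this way, the crash in this post points to the host RAM requested for the job rather than to the A100's memory. That would also fit with the log stopping right after the device line, before any training output appears.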

I requested 1 node, and each GPU node on our server has 1 GPU (NVIDIA A100). What would be the best way to get rid of this OOM error? I don't even know which hyperparameter to touch...
