GPU training problem #149
Replies: 3 comments 9 replies
-
Please refer to the tutorials and the usage guide; they explain how to handle GPU and CPU training.
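As a general-purpose sketch of the device-handling pattern (plain PyTorch, not matgl-specific; what the tutorials actually cover is an assumption), selecting a device explicitly and moving both the model and each batch to it avoids mixing CPU and GPU tensors:

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model and each batch explicitly instead of relying on a
# global default device, so host-side code (e.g. NumPy conversions)
# keeps receiving CPU tensors.
model = torch.nn.Linear(4, 1).to(device)
batch = torch.randn(8, 4).to(device)
out = model(batch)
print(out.shape)  # torch.Size([8, 1])
```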
-
Here it is.

> @SmallBearC Thank you again. Would you please leave your email address so we may discuss more?
-
First off, thank you all for posting this; I had the same issue myself. I can confirm that on matgl 0.8.5, even following @SmallBearC's example, running `python trainer.py` fails:

```
File: vasprun_69_01.xml loaded
File: vasprun_69_02.xml loaded
File: vasprun_69_03.xml loaded
90 downloaded from MP.
  0%|          | 0/90 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/Users/myles/Dropbox (MIT)/Research/2023/Fall_2023/VASP/James_Work/Training/GPU_Test/trainer.py", line 63, in <module>
    dataset = M3GNetDataset(
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/data.py", line 256, in __init__
    super().__init__(name=name)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/data.py", line 279, in process
    line_graph = create_line_graph(graph, self.threebody_cutoff)  # type: ignore
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/compute.py", line 141, in create_line_graph
    l_g, triple_bond_indices, n_triple_ij, n_triple_i, n_triple_s = compute_3body(graph_with_three_body)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/compute.py", line 25, in compute_3body
    first_col = g.edges()[0].numpy().reshape(-1, 1)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

The error still happens :( Downgrading to matgl 0.7.1 seems to work until I get past the sanity check: instead of the issue being in `compute_3body`, it is now in the epoch itself when running `python train.py`:
```
File: vasprun_69_01.xml loaded
File: vasprun_69_02.xml loaded
File: vasprun_69_03.xml loaded
90 downloaded from MP.
100%|███████████████████████████████████████████████████████████████████████████████████| 90/90 [00:08<00:00, 11.01it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type              | Params
--------------------------------------------
0 | mae   | MeanAbsoluteError | 0
1 | rmse  | MeanSquaredError  | 0
2 | model | Potential         | 279 K
--------------------------------------------
279 K     Trainable params
0         Non-trainable params
279 K     Total params
1.120     Total estimated model params size (MB)
Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/Users/myles/Dropbox (MIT)/Research/2023/Fall_2023/VASP/James_Work/Training/GPU_Test/train.py", line 99, in <module>
    trainer.fit(model=lit_module, train_dataloaders=train_loader, val_dataloaders=val_loader)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 283, in __iter__
    for idx in self.sampler:
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 165, in __iter__
    yield from map(int, torch.randperm(n, generator=generator).numpy())
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

Here's a zip file containing the test training data and the Python program; please let me know if there's anything else I can attach.
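Both tracebacks end the same way: `.numpy()` is called on a tensor that lives on `cuda:0` (first in `compute_3body`'s edge handling, then in the DataLoader's shuffling sampler), and NumPy cannot read GPU memory directly. Below is a minimal sketch of the remedy the error message itself suggests (copy to host with `Tensor.cpu()` before converting) and of keeping the DataLoader's shuffle generator on the CPU; this illustrates the failure mode rather than being a confirmed matgl fix, and `as_numpy_column` is a hypothetical helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def as_numpy_column(t: torch.Tensor):
    # .numpy() raises TypeError on CUDA tensors; .cpu() copies to host first.
    return t.cpu().numpy().reshape(-1, 1)

edges = torch.arange(6)
if torch.cuda.is_available():
    edges = edges.to("cuda")
print(as_numpy_column(edges).shape)  # (6, 1)

# Keeping the shuffle generator on the CPU means the sampler's
# torch.randperm(...) stays a host tensor, so its .numpy() call works.
dataset = TensorDataset(torch.arange(10).float())
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    generator=torch.Generator(device="cpu"))
print(sum(1 for _ in loader))  # 3 batches of <=4 from 10 samples
```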
-
Hi matgl team, thanks for the powerful tool.
I tried to use the tutorial example (train.py) to train a PES model on an A800 GPU with a 64-core Intel(R) Xeon(R) Platinum 8358 CPU, but I got the following error.
Would you please help me fix it? Thank you.