GPU training problem #149
Replies: 3 comments 9 replies
-
Please refer to the tutorials and the usage guide; they explain how to handle GPU and CPU training.
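As a general-purpose sketch of the device-handling pattern (plain PyTorch, not matgl-specific; what the tutorials actually cover is an assumption), selecting a device explicitly and moving both the model and each batch to it avoids mixing CPU and GPU tensors:

```python
import torch

# Pick the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model and each batch explicitly instead of relying on a
# global default device, so host-side code (e.g. NumPy conversions)
# keeps receiving CPU tensors.
model = torch.nn.Linear(4, 1).to(device)
batch = torch.randn(8, 4).to(device)
out = model(batch)
print(out.shape)  # torch.Size([8, 1])
```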
-
Here it is.

> @SmallBearC Thank you again. Would you please leave your email address so we may discuss more?
-
First off, thank you all for posting this; I had the same issue myself. I can confirm that on matgl 0.8.5, even following @SmallBearC's example, running `python trainer.py` fails:

```
File: vasprun_69_01.xml loaded
File: vasprun_69_02.xml loaded
File: vasprun_69_03.xml loaded
90 downloaded from MP.
  0%|          | 0/90 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/Users/myles/Dropbox (MIT)/Research/2023/Fall_2023/VASP/James_Work/Training/GPU_Test/trainer.py", line 63, in <module>
    dataset = M3GNetDataset(
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/data.py", line 256, in __init__
    super().__init__(name=name)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/data.py", line 279, in process
    line_graph = create_line_graph(graph, self.threebody_cutoff)  # type: ignore
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/compute.py", line 141, in create_line_graph
    l_g, triple_bond_indices, n_triple_ij, n_triple_i, n_triple_s = compute_3body(graph_with_three_body)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/matgl/graph/compute.py", line 25, in compute_3body
    first_col = g.edges()[0].numpy().reshape(-1, 1)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

The error still happens :( Downgrading to matgl 0.7.1 seems to work until I get past the sanity check: instead of the issue being in `compute_3body`, it is now in the epoch itself when running `python train.py`:
```
File: vasprun_69_01.xml loaded
File: vasprun_69_02.xml loaded
File: vasprun_69_03.xml loaded
90 downloaded from MP.
100%|███████████████████████████████████████████████████████████████████████████████████| 90/90 [00:08<00:00, 11.01it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type              | Params
--------------------------------------------
0 | mae   | MeanAbsoluteError | 0
1 | rmse  | MeanSquaredError  | 0
2 | model | Potential         | 279 K
--------------------------------------------
279 K     Trainable params
0         Non-trainable params
279 K     Total params
1.120     Total estimated model params size (MB)
Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/Users/myles/Dropbox (MIT)/Research/2023/Fall_2023/VASP/James_Work/Training/GPU_Test/train.py", line 99, in <module>
    trainer.fit(model=lit_module, train_dataloaders=train_loader, val_dataloaders=val_loader)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 127, in __next__
    batch = super().__next__()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 56, in __next__
    batch = next(self.iterator)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 326, in __next__
    out = next(self._iterator)
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 283, in __iter__
    for idx in self.sampler:
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/data/sampler.py", line 165, in __iter__
    yield from map(int, torch.randperm(n, generator=generator).numpy())
  File "/home/myless/.mambaforge/envs/matgl-gpu/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```

Here's a zip file containing the test training data and the Python program; please let me know if there's anything else I can attach.
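Both tracebacks end the same way: `.numpy()` is called on a tensor that lives on `cuda:0` (first in `compute_3body`'s edge handling, then in the DataLoader's shuffling sampler), and NumPy cannot read GPU memory directly. Below is a minimal sketch of the remedy the error message itself suggests (copy to host with `Tensor.cpu()` before converting) and of keeping the DataLoader's shuffle generator on the CPU; this illustrates the failure mode rather than being a confirmed matgl fix, and `as_numpy_column` is a hypothetical helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def as_numpy_column(t: torch.Tensor):
    # .numpy() raises TypeError on CUDA tensors; .cpu() copies to host first.
    return t.cpu().numpy().reshape(-1, 1)

edges = torch.arange(6)
if torch.cuda.is_available():
    edges = edges.to("cuda")
print(as_numpy_column(edges).shape)  # (6, 1)

# Keeping the shuffle generator on the CPU means the sampler's
# torch.randperm(...) stays a host tensor, so its .numpy() call works.
dataset = TensorDataset(torch.arange(10).float())
loader = DataLoader(dataset, batch_size=4, shuffle=True,
                    generator=torch.Generator(device="cpu"))
print(sum(1 for _ in loader))  # 3 batches of <=4 from 10 samples
```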
-
Hi matgl team, thanks for the powerful tool.
I tried to use the tutorial example (train.py) to train a PES model on an A800 GPU with a 64-core Intel(R) Xeon(R) Platinum 8358 CPU, but I got the following error.
Would you please help me fix it? Thank you.