CUDA out of memory #136
The OOM is likely caused by the sampling configuration using all neighbors. In the current implementation, using all neighbors is not scalable to large graphs for mini-batch training, because the number of sampled neighbors explodes exponentially with the number of GNN layers. We used uniform sampling of 15-10-5 neighbors for training instead. On the main branch, the corresponding setting is the neighbor sampling section of the YAML config; updating it should solve the OOM.
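A minimal sketch of what the uniform-sampling config might look like, assuming the main branch's `train_neighbor_sampling` layout under `model.encoder` (the exact key and option names may differ between versions, so please check the main-branch example configs for the precise schema):

```yaml
model:
  encoder:
    # Replace the all-neighbor entries (type: ALL) with uniform sampling.
    # One entry per GNN layer; a 15-10-5 fanout as used in our training runs.
    # NOTE: key names here are illustrative; verify them against the
    # example configs shipped with the main branch.
    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
```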
Thanks so much for replying. That fixed the OOM. However, there are several new errors when I run:

root@c54d23ae2acd:/working_dir# marius_train examples/configuration/ogbn_paper100m_disk.yaml
[2023-03-13 09:11:20.123] [info] [marius.cpp:41] Start initialization
[03/13/23 09:11:23.806] Initialization Complete: 3.682s
[03/13/23 09:11:23.807] Generating Sequential Ordering
[03/13/23 09:11:23.808] Num Train Partitions: 90
[03/13/23 09:12:51.593] ################ Starting training epoch 1 ################
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from launch_vectorized_kernel at ../aten/src/ATen/native/cuda/CUDALoops.cuh:98 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7c974c41ee in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0xb88 (0x7f7c478c0218 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #2: void at::native::gpu_kernel<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0x31b (0x7f7c478c0deb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #3: <unknown function> + 0x18f68e2 (0x7f7c478a88e2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #4: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x20 (0x7f7c478a9b30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #5: <unknown function> + 0x1a3078d (0x7f7c6f40d78d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2d4d91b (0x7f7c48cff91b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #7: at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) + 0x12b (0x7f7c6f9fa77b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::zero_(at::Tensor&) + 0x83 (0x7f7c6f40dcc3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x2d4b955 (0x7f7c48cfd955 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: at::_ops::zero_::call(at::Tensor&) + 0x9e (0x7f7c6fd5910e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::structured_nll_loss_backward_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::OptionalTensorRef, long, long, at::Tensor const&, at::Tensor const&) + 0x3d (0x7f7c47df5b0d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #12: <unknown function> + 0x2d49a6b (0x7f7c48cfba6b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #13: <unknown function> + 0x2d49b35 (0x7f7c48cfbb35 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #14: at::_ops::nll_loss_backward::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x94 (0x7f7c6fd338e4 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x377a776 (0x7f7c71157776 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x377ae0b (0x7f7c71157e0b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::nll_loss_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&) + 0x1cd (0x7f7c6fd9e5fd in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::generated::NllLossBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x23d (0x7f7c70e8d42d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x3db919b (0x7f7c7179619b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1640 (0x7f7c7178f710 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x698 (0x7f7c71790148 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x8b (0x7f7c7178790b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4f (0x7f7c9532726f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0xd6de4 (0x7f7c9771bde4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: <unknown function> + 0x8609 (0x7f7cb6d50609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x43 (0x7f7cb6e8a133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)

When I set […]

Is this error caused by the wrong preprocess command? I preprocessed the dataset with:

$ marius_preprocess --dataset ogbn_papers100m --output_dir datasets/marius/ogbn_papers100m/ --num_partitions 8192 --sequential_train_nodes
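The trace itself suggests passing CUDA_LAUNCH_BLOCKING=1 for debugging; rerunning the same command with that variable set (a minimal example, using the config file from above) should make the failing kernel show up at the correct frame:

```bash
# Synchronous kernel launches: the device-side assert is reported at the
# call site instead of a later, unrelated CUDA API call.
CUDA_LAUNCH_BLOCKING=1 marius_train examples/configuration/ogbn_paper100m_disk.yaml
```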
Hi, I'm trying to run the ogbn_papers100m dataset using Marius with PARTITION_BUFFER on the main branch, but I cannot find an example for it, so I followed the example in the eurosys_2023_artifact branch and rewrote a YAML file. I have tested that example with fb15k_237 and it works well. However, it didn't work for large datasets such as ogbn_papers100m. I used the following commands for the ogbn_papers100m dataset, and then a CUDAOutOfMemoryError occurred. It seems Marius still tries to allocate the memory on the GPU and does not use the partition buffer?

Could you please tell me whether my YAML configuration is correct? Also, are there any examples of using the partition buffer directly through the Python API rather than through YAML? m.storage.tensor_from_file seems to only support device memory. Thanks for replying.
My environment is the following.