disk space usage problem #258
Comments
cc @matthewdeng what's the best way to debug object store memory usage for XGBoost on Ray? @showkeyjar I think your workload has high object store usage, which triggers spilling: https://docs.ray.io/en/master/ray-core/objects/object-spilling.html. When your disk usage keeps increasing, what's the output of `ray memory`?
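For reference, the linked object-spilling docs show how to redirect spilled objects to a custom directory. A minimal sketch based on those docs; the `/data/spill` path is an illustrative assumption:

```python
import json
import ray

# Spill objects that don't fit in the in-memory object store to a
# directory with more free space than the default /tmp/ray location.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/data/spill"}}
        )
    }
)
```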
@showkeyjar do you have a repro for this? How much training data are you loading, and how much disk space are you seeing consumed?
Are you using Ray Datasets? There's an issue with xgboost-ray we are currently working on that causes the data to be loaded in a suboptimal manner, resulting in too much object store usage.
Thanks for all your advice, @rkooo567 @matthewdeng. 1,395,642 rows of training data, 20 boosting rounds, 15 GB of disk usage. The train code is here: https://github.com/showkeyjar/mymodel/blob/main/train_model_ray.py. @Yard1 no, I convert a pandas DataFrame to a Ray dataset.
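For context, a minimal sketch of that kind of pipeline with xgboost_ray; the input file, label column, objective, and actor count below are illustrative assumptions, not the actual training script:

```python
import pandas as pd
import ray
from xgboost_ray import RayDMatrix, RayParams, train

df = pd.read_parquet("train.parquet")    # hypothetical input file
ds = ray.data.from_pandas(df)            # pandas DataFrame -> Ray dataset

# RayDMatrix wraps the dataset for distributed XGBoost training.
dtrain = RayDMatrix(ds, label="target")  # "target" is an assumed label column

booster = train(
    {"objective": "reg:squarederror"},   # assumed objective
    dtrain,
    num_boost_round=20,
    ray_params=RayParams(num_actors=2),  # assumed actor count
)
```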
I alleviated the problem by using a shell for-loop script to call the Python train code, but I still don't know why a Python for loop causes the disk usage to increase. And I'm sure that the disk increase happens at …
Ray uses a mechanism called object spilling, where objects that cannot fit into the in-memory object store are put on disk instead. Can you run the `ray memory` command and share the output? Also, are you running this on a single machine or on multiple machines?
======== Object references status: 2023-01-16 15:19:13.215008 ========
I'm quite frustrated that this issue hasn't been solved yet, but I found some new information:
Hope it helps.
Based on your output above, it looks like spilling doesn't actually happen. I guess most of the disk usage is from Ray logs?
Is it correct that the disk usage is mostly from …
Yes, it creates a link.
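(The link in question is presumably /tmp/ray/session_latest, the symlink Ray points at the current session directory.) To see which session directories actually hold the space, a small diagnostic sketch, assuming the default /tmp/ray location:

```python
import os

RAY_TMP = "/tmp/ray"  # default Ray temp dir; adjust if _temp_dir was overridden

def dir_size_bytes(path):
    """Sum file sizes under path, skipping symlinks so session_latest isn't double-counted."""
    total = 0
    for root, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if not os.path.islink(os.path.join(root, d))]
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

for entry in sorted(os.listdir(RAY_TMP)):
    p = os.path.join(RAY_TMP, entry)
    if os.path.isdir(p) and not os.path.islink(p):
        print(f"{entry}: {dir_size_bytes(p) / 1e9:.2f} GB")
```

Comparing the sizes of the session_* directories (and their logs vs. spill subdirectories) should show whether logs or spilled objects dominate.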
I found one problem:
If I use xgboost_ray to train multiple models on Linux, the "/tmp/ray/" directory keeps growing,
and if the training data is large, the system disk space runs out quickly.
I tried to fix it with "rm -rf /tmp/ray/", but then the training process got stuck in an endless loop, waiting for a Ray actor forever.
I guess "import xgboost_ray" may do some initialization for Ray,
so I added "import importlib" and tried "importlib.reload(xgboost_ray)", but it did not work.
Please check this issue.
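One way to keep a multi-model loop from accumulating session data is to give each run its own temp directory and delete it only after Ray has shut down (deleting /tmp/ray while Ray is still running is likely what caused the hang). A minimal sketch; the paths, file names, and training parameters are illustrative assumptions:

```python
import shutil
import ray
from xgboost_ray import RayDMatrix, RayParams, train

TEMP_DIR = "/data/ray_tmp"                            # assumed disk with enough space
dataset_paths = ["part_0.parquet", "part_1.parquet"]  # hypothetical inputs

for path in dataset_paths:
    ray.init(_temp_dir=TEMP_DIR)                 # session logs/spill files go here
    dtrain = RayDMatrix(path, label="target")    # "target" is an assumed label column
    train(
        {"objective": "reg:squarederror"},
        dtrain,
        num_boost_round=20,
        ray_params=RayParams(num_actors=2),
    )
    ray.shutdown()                               # stop Ray before touching its files
    shutil.rmtree(TEMP_DIR, ignore_errors=True)  # now it's safe to reclaim the space
```

Reloading xgboost_ray with importlib wouldn't help here: the files under the temp directory are written by the Ray runtime processes, not by the module import, so they persist across a reload.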