memory leak from boosters with CUDA #207

Open
Moelf opened this issue Nov 22, 2024 · 24 comments

@Moelf (Contributor) commented Nov 22, 2024

[ Info: Training machine(XGBoostClassifier(test = 1, …), …).
[ Info: XGBoost: starting training.
┌ Warning: [04:42:46] WARNING: /workspace/srcdir/xgboost/src/common/error_msg.cc:27: The tree method `gpu_hist` is deprecated since 2.0.0. To use GPU training, set the `device` parameter to CUDA instead.
│ 
│     E.g. tree_method = "hist", device = "cuda"
└ @ XGBoost ~/.julia/packages/XGBoost/nqMqQ/src/XGBoost.jl:34
┌ Error: Problem fitting the machine machine(XGBoostClassifier(test = 1, …), …). 
└ @ MLJBase ~/.julia/packages/MLJBase/7nGJF/src/machines.jl:694
[ Info: Running type checks... 
[ Info: Type checks okay. 

XGBoostError: (caller: XGBoosterUpdateOneIter)
[04:42:46] /workspace/srcdir/xgboost/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [04:42:46] /workspace/srcdir/xgboost/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: Caching allocator
- Free memory: 12582912
- Requested memory: 9232383

This happens with less and less "Free memory" reported each time, so something is leaking GPU memory... I'm using XGBoost via MLJ; any clue?

@ExpandingMan (Collaborator)

Could you try to reproduce with XGBoost.jl only? It's hard for me to imagine how the wrapper can cause this: it's really not doing anything during training other than calling the update function from the library; though it's been a while since I've worked on this package so my memory might be hazy.
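
Something along these lines might narrow it down (an untested sketch: it assumes the wrapped libxgboost is ≥ 2.0 so that device = "cuda" is accepted as a booster parameter, and it uses CUDA.memory_status() only to watch free device memory between folds):

using CUDA, XGBoost

X = randn(Float32, 100_000, 20)   # stand-in for one training fold
y = randn(Float32, 100_000)

boosters = Any[]
for i in 1:20
    # keep every returned Booster alive, as the MLJ machines do
    push!(boosters, xgboost((X, y); tree_method = "hist", device = "cuda", num_round = 10))
    GC.gc()
    CUDA.memory_status()   # free device memory should stay roughly flat if nothing leaks
end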

@Moelf (Contributor, Author) commented Nov 22, 2024

I guess my question is what does XGBoost.jl do fundamentally when training / inference on large data -- is there a way to manually tell it to partition more finely?

@Moelf (Contributor, Author) commented Nov 22, 2024

function cross_train(df_all; model = XGBoostClassifier(; tree_method="hist", eta=0.08, max_depth=7, num_round=90), Nfolds=5)
    y, eventNumber, X = unpack(df_all, ==(:proc), ==(:eventNumber), (BDT_input_names))
    eventNumber_modN = mod1.(eventNumber, Nfolds)
    machines = []
    @views for ith_fold in 1:Nfolds
        fold_mask = eventNumber_modN .!= ith_fold
        fold_y = y[fold_mask]
        fold_X = X[fold_mask, :]
        mach_bdt = machine(model, fold_X, fold_y)
        fit!(mach_bdt)

        push!(machines, mach_bdt)
    end
    machines
end

I think my problem is that each MLJ machine holds a reference to the data, and the data probably got moved to CUDA, so it's now effectively a memory leak.

@Moelf (Contributor, Author) commented Nov 22, 2024

3292MiB /  16384MiB

5252MiB /  16384MiB

7468MiB /  16384MiB

...

Setting mach_bdt = machine(model, fold_X, fold_y; cache=false) doesn't help. @ablaom, do you have any suggestion on how to strip away the reference to the CUDA data after the machine is trained?

@ExpandingMan (Collaborator)

I guess my question is what does XGBoost.jl do fundamentally when training / inference on large data -- is there a way to manually tell it to partition more finely?

The library itself doesn't exactly give us a huge amount of leeway: basically you have to put your entire dataset into memory and it runs on it. There is a little bit of utility for larger datasets, but last I looked it was pretty half-baked, at least within the C API. I wonder if memory is leaking when creating lots of Booster objects? The wrapper is of course supposed to ensure this doesn't happen by letting the Julia GC know about them, but there could be a bug; that would actually be a bug in the wrapper and thus solvable, whereas whatever happens to the data and how it's trained once it gets into the C lib is totally mysterious.

@ablaom (Contributor) commented Nov 23, 2024

@ablaom, do you have any suggestion on how to strip away the reference to the CUDA data after the machine is trained?

Uh. You could try:

mach = restore!(serializable(mach))

Methods exported by MLJ and MLJBase, I think.
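
In your cross_train loop that would look roughly like this (an untested sketch; it assumes serializable and restore! are in scope via MLJ/MLJBase, and that dropping the machine's data references is enough for the GPU copies to become collectable):

@views for ith_fold in 1:Nfolds
    fold_mask = eventNumber_modN .!= ith_fold
    fold_y = y[fold_mask]
    fold_X = X[fold_mask, :]
    mach_bdt = machine(model, fold_X, fold_y)
    fit!(mach_bdt)
    # drop the machine's references to the training data
    mach_bdt = restore!(serializable(mach_bdt))
    push!(machines, mach_bdt)
    GC.gc()   # give the GC a chance to release the CUDA buffers
end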

@Moelf (Contributor, Author) commented Nov 23, 2024

According to JuliaAI/MLJBase.jl#750 it doesn't fully remove the reference to the data somehow? But it looks like it helps quite a lot, at least.

w/ the "fix":

2366MiB /  16384M
# GC.gc()
882MiB /  16384M

w/o the "fix":

2818MiB /  16384M
# GC.gc()
2624MiB /  16384M

Moelf closed this as completed Nov 23, 2024
@ablaom (Contributor) commented Nov 24, 2024

JuliaAI/MLJBase.jl#750 it doesn't fully remove the reference to the data somehow? But it looks like it helps quite a lot, at least.

I'm not 100% sure, but I doubt the cited issue applies here.

@trivialfis (Member)

There is a little bit of utility for larger datasets but last I looked

Could you please share what's half-baked? I can help take a look.

@Moelf (Contributor, Author) commented Nov 26, 2024

Anthony's solution worked for MLJ, but if I use XGBoost directly it still runs out of CUDA memory:

function cross_train(df_all; model = (; tree_method="hist", eta=0.08, max_depth=7, num_round=90), Nfolds=5)
    y, eventNumber, X = unpack(df_all, ==(:proc), ==(:eventNumber), (BDT_input_names))
    eventNumber_modN = mod1.(eventNumber, Nfolds)
    machines = []
    @views for ith_fold in 1:Nfolds
        fold_mask = eventNumber_modN .!= ith_fold
        fold_y = y[fold_mask]
        fold_X = X[fold_mask, :]
        # mach_bdt = machine(model, fold_X, fold_y; cache=false)
        bst = xgboost((fold_X, fold_y); model...)
        # fit!(mach_bdt)

        # remove reference to data
        # mach = restore!(serializable(mach_bdt))
        push!(machines, bst)
        GC.gc()
    end
    machines
end

# calling with
model = (;
        tree_method="gpu_hist",
        nthread=10,
        eta=0.2,
        max_depth=10,
        scale_pos_weight = 1,
        max_delta_step = 1,
        num_round=2,
        )

@Moelf (Contributor, Author) commented Nov 26, 2024

I suspect the thing returned by xgboost() somehow still holds a reference to the array?

@ablaom (Contributor) commented Nov 26, 2024

Sorry, this is a little off topic, but if you use EvoTrees.jl, the pure Julia gradient tree boosting implementation, then I expect these sorts of issues either do not occur, or are readily diagnosed and mitigated. EvoTrees.jl may not have all the bells and whistles you're used to but it has had a considerable number of dev-hours put into it (compared to your average Julia package - shout out to @jeremiedb) and is still actively maintained.

@Moelf (Contributor, Author) commented Nov 26, 2024

Yeah, I know EvoTrees and am a big fan of the package, as well as of Jeremie (we met at this year's JuliaCon). Although it looks like Jeremie has been on a long holiday since the summer, going by the GitHub activity.

@jeremiedb

No long holiday on my end unfortunately, just more busy with our firm :)
I don't have much low-hanging fruit on the EvoTrees radar at the moment, other than some loss-function experiments and an alternative logic for the histogram struct, which have stalled. If there are specific pain points, don't hesitate to reach out; it may motivate some sprints!

@ExpandingMan (Collaborator)

@trivialfis Sorry, please don't take that comment (about accommodation for large datasets being half-baked) too seriously; I should have qualified it better. It's been a good while since I've looked at it. I recall that the last time I did I had a lot of trouble with it, but at this point there are a lot of reasons that could have been the case, including me possibly doing something wrong.

Suffice it to say that if I return to it and find something I consider a problem, I will open an issue; but off the top of my head I don't even know that that's needed.

@trivialfis (Member)

No worries, feel free to ping me. We have made some more progress recently on GPU memory usage, external memory, etc. I was wondering what you had run into. I'm quite excited about the upcoming release and would like to fix any remaining issues.

Moelf reopened this Nov 26, 2024
@Moelf (Contributor, Author) commented Nov 26, 2024

Can you reproduce it with my code above? All you need is a big dataframe; partition it in the loop and keep the object returned by xgboost() around in an array.

You should see that you end up with residual VRAM usage.

@trivialfis (Member)

I can look into it tomorrow, but before that: is there a Julia equivalent of https://github.com/dmlc/xgboost/blob/0e48cdcb361aee8459b4d1bbe6ff7bfdbbba4138/python-package/xgboost/training.py#L190 ? It's there to clear out all training-related caches in the booster object. Without a memory pool, GPU memory allocation is quite slow, so we have a lot of caches there.

@trivialfis (Member) commented Nov 27, 2024

@ExpandingMan Could you please help integrate the workaround into the jl package:

using CUDA
using XGBoost
X = cu(randn(100000, 3))
y = randn(100000)

vec = Any[]

for i in 1:10000
    booster = XGBoost.xgboost((X, y), num_round=10)  # no need to use `DMatrix`
    bin = XGBoost.serialize(booster)
    bst2 = Booster(DMatrix[])

    XGBoost.deserialize!(bst2, bin)

    push!(vec, bst2)
end

The trick is simply copying the model so that the new model doesn't have any cache in it.

@ExpandingMan (Collaborator)

Am I to understand that the booster objects need to be GC'd to avoid leaking memory? The wrapper as it currently sits doesn't really recognize this concept. @trivialfis would you be willing to open a more detailed issue about what exactly is going on here?

@trivialfis (Member)

would you be willing to open a more detailed issue about what exactly is going on here?

Let me try to explain it here since it's the cause of this issue.

Internally, the booster has some GPU caches in it, like the buffer for storing gradients, and we need to release these buffers somehow. Right now, the only way to do it outside of libxgboost is to delete the booster from the environment entirely. The snippet does exactly that: it copies the model to a new booster object and discards the old booster object. The freshly created booster object doesn't have any GPU data in it, so there's no accumulated GPU memory usage.
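
In XGBoost.jl terms, the workaround could be wrapped in a small helper along these lines (a sketch only, not an existing API; the name strip_gpu_caches is made up, and it relies only on XGBoost.serialize, XGBoost.deserialize!, and the Booster(DMatrix[]) constructor used in the snippet above):

using XGBoost

# Hypothetical helper: return a copy of `booster` that carries the trained model
# but none of the GPU caches accumulated during training.
function strip_gpu_caches(booster::Booster)
    bin = XGBoost.serialize(booster)   # binary copy of the trained model
    fresh = Booster(DMatrix[])         # empty booster with no cached training data
    XGBoost.deserialize!(fresh, bin)   # load the model into the fresh booster
    return fresh
end

# usage: bst = strip_gpu_caches(xgboost((X, y); num_round = 10))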

ExpandingMan changed the title from "Out of memory on CUDA GPU" to "memory leak from boosters with CUDA" on Dec 2, 2024
@ExpandingMan (Collaborator)

Thanks for the explanation.

So it seems this wrapper is pretty broken with CUDA right now and this needs to be implemented. I should be able to get around to it this week, but if anyone else gets to it first I can merge it. I think the solution for the time being will be to implement @trivialfis's code in xgboost only, though this may introduce additional configuration parameters, and I haven't yet given any thought to how that will be handled.

@ExpandingMan (Collaborator)

@trivialfis just so I understand, do you consider this a bug in libxgboost that needs to be mitigated, or a matter of XGBoost.jl implementing the library's memory model wrong? I ask because this isn't a matter of "lazy GC" or anything like that: the Booster object really is persisting here (right?), so it can't get GC'd; therefore this looks to me more like a bug or missing feature in xgboost. I would guess that the long-term solution would be to add a libxgboost function to clean up whatever caches in the booster need to get cleaned up, and then the wrappers would be responsible for calling it at the appropriate time. Does that seem reasonable, or am I misunderstanding this?

@trivialfis (Member)

You are correct. I think it's not too difficult to have a reset function in the booster object for eliminating most caches. I will keep track of this.
