[c++] Fix `dump_model()` information for root node #6569

neNasko1 · 2024-07-24T14:01:46Z

This PR corrects the output of dump_model() and other dump-related functions such as trees_to_dataframe(). It implements two fixes:

The current Tree::Split implementation incorrectly saves the old leaf output value in the internal_value_ array when called on the root node, so the root is reported with a value and weight of 0 (see "Before" below). This makes inspecting the training process from Python incomplete.

Before:

(Pdb) booster_.trees_to_dataframe()
     tree_index  node_depth node_index left_child right_child parent_index  ... decision_type  missing_direction missing_type     value weight count
0             0           1       0-S0       0-S1        0-S2         None  ...            ==              right         None  0.000000      0   200
1             0           2       0-S1       0-S5        0-S4         0-S0  ...            <=               left         None  0.106573    113   113
2             0           3       0-S5       0-L0        0-L6         0-S1  ...            ==              right         None  0.082122     56    56
3             0           4       0-L0       None        None         0-S5  ...          None               None         None  0.064612     26    26
4             0           4       0-L6       None        None         0-S5  ...          None               None         None  0.097297     30    30

After:

(Pdb) booster_.trees_to_dataframe().head()
   tree_index  node_depth node_index left_child right_child parent_index  ... decision_type  missing_direction missing_type     value weight count
0           0           1       0-S0       0-S1        0-S2         None  ...            ==              right         None  0.081757    200   200
1           0           2       0-S1       0-S5        0-S4         0-S0  ...            <=               left         None  0.106573    113   113
2           0           3       0-S5       0-L0        0-L6         0-S1  ...            ==              right         None  0.082122     56    56
3           0           4       0-L0       None        None         0-S5  ...          None               None         None  0.064612     26    26
4           0           4       0-L6       None        None         0-S5  ...          None               None         None  0.097297     30    30
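As a rough cross-check of the tables above: in both outputs, the value of split node 0-S5 matches the weight-weighted mean of its two leaves. The sketch below reproduces that arithmetic in plain Python using only the numbers printed above. It assumes no regularization or other adjustment that would break this relationship, and `weighted_mean` is a hypothetical helper, not part of LightGBM.

```python
def weighted_mean(values, weights):
    """Return the weight-weighted mean of `values` (hypothetical helper)."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# (value, weight) pairs copied from the trees_to_dataframe() output above.
leaf_l0 = (0.064612, 26)
leaf_l6 = (0.097297, 30)
reported_s5 = 0.082122

expected_s5 = weighted_mean(
    [leaf_l0[0], leaf_l6[0]],
    [leaf_l0[1], leaf_l6[1]],
)

# The printed values are rounded to 6 decimals, so compare with a tolerance.
assert abs(expected_s5 - reported_s5) < 1e-6
print(round(expected_s5, 6))
```

Under the same assumption, a root node with a count of 200 would be expected to carry a nonzero weight, which is what the "After" output shows.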

Stump has no leaf_count inside dump_model() output #5962

neNasko1 · 2024-07-29T11:11:17Z

Currently the CI is not passing as #6574 is blocking.

neNasko1 · 2024-07-29T11:49:02Z

I am open to ideas of ways to test related functionalities.

Tests should now be sufficient for the change.

neNasko1 · 2024-07-30T16:47:43Z

@jameslamb
Could you take a look at the PR, now that the CI is passing?

jameslamb

@shiyu1994 or @guolinke could you help with a review of this?

I'm not sure if this will correctly handle these cases:

custom init_score provided (via Dataset)
boost_from_average=False passed

@neNasko1 could you also look at #5962 and let us know if you think this change would fix the issue @thatlittleboy reported there?

neNasko1 · 2024-08-03T19:02:06Z

Thank you for taking the time to look into the PR and linking a relevant issue.

> I'm not sure if this will correctly handle these cases:
>
> custom init_score provided (via Dataset)
> boost_from_average=False passed

I think those cases are handled, as the results are consistent with what the leaf values report. I also reworked the test to boost from average.

> @neNasko1 could you also look at #5962 and let us know if you think this change would fix the issue @thatlittleboy reported there?

I took the liberty of merging @thatlittleboy's WIP code into mine, which also fixes the issues they reported. I will also change the description of the PR to reflect both fixes.
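For readers consuming dump_model() output, the stump case matters because code that walks `tree_structure` usually has to treat a single-leaf tree specially. The sketch below uses hand-written dictionaries that only mimic the shape of that output (the numbers are made up for illustration), and `count_at_root` is a hypothetical helper. It assumes the key names `leaf_count` and `internal_count`; whether a stump carries `leaf_count` is what #5962 is about.

```python
def count_at_root(tree_structure):
    """Return the number of samples at the root, or None if not recorded.

    Hypothetical helper: a root that was split reports `internal_count`,
    while a stump (a tree that is a single leaf) reports `leaf_count`.
    """
    if "left_child" in tree_structure:
        return tree_structure.get("internal_count")
    return tree_structure.get("leaf_count")


# Hand-written stand-ins for dump_model()["tree_info"][i]["tree_structure"].
# All numbers here are invented for illustration.
stump_without_count = {"leaf_value": 0.5}
stump_with_count = {"leaf_value": 0.5, "leaf_count": 200}
split_root = {
    "internal_value": 0.08,
    "internal_count": 200,
    "left_child": {"leaf_value": 0.1, "leaf_count": 113},
    "right_child": {"leaf_value": 0.05, "leaf_count": 87},
}

print(count_at_root(stump_without_count))  # None
print(count_at_root(stump_with_count))     # 200
print(count_at_root(split_root))           # 200
```

Reading the count with `.get()` keeps such a walker working on dumps where the stump lacks `leaf_count`, at the cost of having to handle `None`.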

StrikerRUS

LGTM!

But I'll keep following the discussion about Dask Ranker test (#6569 (comment)).

neNasko1 · 2024-10-01T12:00:54Z

@jameslamb can you submit a final review on the change, so that we can merge it?

Sorry for the inconvenience!

jameslamb · 2024-10-02T02:04:51Z

I will look when I can. I have spent most of my limited open source time in the last few weeks investigating and fixing multiple difficult, time-sensitive CI issues in this project, and there is yet another one that is still not done and a primary focus for me right now (#6651).

If @StrikerRUS has time to re-review the commits and comments you've pushed since his approval, and if he approves, then my review can be dismissed and this can be merged without another review from me. Otherwise, you will have to be patient a bit longer.

jameslamb · 2024-10-08T01:34:43Z

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/11226953247

Status: success ✔️.

jameslamb

Thanks. I've left some minor suggestions for your consideration, around making the tests stricter and easier to understand.

I've also triggered valgrind checks on this branch, to ensure no new memory-management issues have been introduced by this PR.

Unfortunately, this still needs a bit more investigation before I'm confident in it... I was finally able to investigate your comments in #6569 (comment), and found that the Dask test checking the trees_to_dataframe() output in the presence of init_score really was testing that the init_score wasn't ignored. I am going to try right now to figure out why that was, and if it has implications for this PR.

By the way, most of the commits you've pushed here are not tied to your GitHub account.

Doesn't really matter in this repo because if this is merged we'll squash everything into one commit, and that'll be correctly tied to your account. But just making you aware of it as it might cause problems for you in other GitHub-based projects. You can fix that for future commits like this:

git config --global user.email "${EMAIL}"

replacing ${EMAIL} with an email address tied to your GitHub account.

neNasko1 · 2024-10-08T17:18:49Z

> I will look when I can. I have spent most of my limited open source time in the last few weeks investigating and fixing multiple difficult, time-sensitive CI issues in this project, and there is yet another one that is still not done and a primary focus for me right now (#6651).

Sorry for making it seem like there is some rush around this. I understand that this project is run mainly by volunteers and I do not want to harass any of the maintainers.

> Thanks. I've left some minor suggestions for your consideration, around making the tests stricter and easier to understand.
> I've also triggered valgrind checks on this branch, to ensure no new memory-management issues have been introduced by this PR.

Thanks for the suggestions. All of them are reasonable and are now merged. I grouped them into one commit because I wanted to test that everything is okay.

Again, thanks to everyone for the time spent!

neNasko1 · 2024-10-08T22:49:10Z

@jameslamb

With init_score not provided, the root node's value is a non-0 value, and the test reliably and consistently fails.

Is this related to boost_from_average?

When there is an init_score, boost_from_average does not occur.
After training the first trees (iteration=1), we then add the average to them.

I retract my earlier comment that the test was a no-op. However, after this fix the test needs to change.
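The interaction described above can be illustrated with a toy model. This is a simplified sketch, not LightGBM's implementation: `add_bias` is a hypothetical helper, the numbers are invented, and it assumes that folding the average into the first tree shifts every node value by the same constant.

```python
def add_bias(node, bias):
    """Return a copy of a toy tree with `bias` added to every node value."""
    shifted = {"value": node["value"] + bias, "count": node["count"]}
    for side in ("left", "right"):
        if side in node:
            shifted[side] = add_bias(node[side], bias)
    return shifted


# Toy first tree fitted on residuals (invented numbers).
first_tree = {
    "value": 0.0,
    "count": 4,
    "left": {"value": -1.0, "count": 2},
    "right": {"value": 1.0, "count": 2},
}
label_average = 10.0

# No init_score: the average is folded into the first tree, so the root
# value becomes non-zero.
with_average = add_bias(first_tree, label_average)
print(with_average["value"])  # 10.0

# init_score provided: no average is added, so the tree is left as fitted.
print(first_tree["value"])  # 0.0
```

This is consistent with the observation above that the root value is non-zero when no init_score is provided.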

jameslamb · 2024-10-09T01:25:13Z

Thanks, changes look great. For your other questions, let's please stay in the thread you're quoting, so the conversation can all be grouped together. I've responded there: #6569 (comment)

jameslamb

This is looking great! Just some small suggestions on the new Dask test, and then I think we can merge this. Thanks for all your hard work, I'm glad we'll be able to get this fix in.

jameslamb

This is looking good, thanks very much for the help!

neNasko1 · 2024-10-13T22:26:13Z

> This is looking good, thanks very much for the help!

Thanks for all the time spent! Glad to see it merged 🚀

neNasko1 requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners July 24, 2024 14:01

Atanas Dimitrov and others added 3 commits July 24, 2024 17:46

Fix value calculation in root node

12102cc

Fix dask tests

c933399

Merge branch 'master' into fix-root-values

c240016

Create proper tests

2f1de57

Merge branch 'master' into fix-root-values

273a1df

jameslamb added awaiting review fix labels Jul 29, 2024

jameslamb changed the title ~~[c++] Root internal_value_ is not calculated properly~~ [c++] Fix calculation of internal_value_ for root node Jul 29, 2024

Atanas Dimitrov added 3 commits July 30, 2024 02:10

Test only on cpu

208df85

Merge branch 'fix-root-values' of github.com:neNasko1/LightGBM into f…

130879b

…ix-root-values

Disable new tests for CUDA

48e6b96

jameslamb requested changes Aug 2, 2024

Atanas Dimitrov added 3 commits August 3, 2024 19:10

Merge with microsoft#5964

26b9859

Finish merging with dump_model unification

88e3dec

Improve tests

e1274dc

neNasko1 changed the title ~~[c++] Fix calculation of internal_value_ for root node~~ [c++] Fix dump_model() information for root node Aug 3, 2024

Atanas Dimitrov and others added 4 commits August 4, 2024 20:44

Add linear test for stump

38ee92c

Fix CUDA compilation

3b423de

Merge branch 'master' into fix-root-values

c89e257

Merge branch 'master' into fix-root-values

3de14d9

StrikerRUS approved these changes Sep 5, 2024

Atanas Dimitrov added 3 commits September 17, 2024 16:42

Fix test failing because of accuracy reasons

634b0fc

Fix test_dask::test_init_scores

3fe4577

Decrease size of trees in test

9e3e8ed

Merge branch 'master' of github.com:microsoft/LightGBM into fix-root-…

a01e737

…values

jameslamb requested changes Oct 8, 2024

jameslamb and others added 2 commits October 7, 2024 23:16

add a test on predictions from a model of all stumps

e76d5bc

Comments after code review

0af4631

neNasko1 added 2 commits October 9, 2024 23:52

Small text QOL

04886c0

Add test_predict_stump on dask

15fc3bf

jameslamb self-requested a review October 11, 2024 03:42

jameslamb requested changes Oct 11, 2024

jameslamb mentioned this pull request Oct 11, 2024

WIP: [ci] [dask] test lightgbm.dask on macOS #6677

Draft

Merge branch 'master' into fix-root-values

938cb63

jameslamb mentioned this pull request Oct 11, 2024

Regression result appears to depend highly on the base score (setting init_score to some constant value) #6658

Closed

neNasko1 and others added 2 commits October 11, 2024 16:40

Update tests/python_package_test/test_dask.py

bed5ded

Co-authored-by: James Lamb <[email protected]>

Appease linter

ac01d79

neNasko1 requested a review from jameslamb October 12, 2024 21:57

jameslamb approved these changes Oct 13, 2024

jameslamb merged commit bbeecc0 into microsoft:master Oct 13, 2024
48 checks passed

This was referenced Oct 13, 2024

Stump has no leaf_count inside dump_model() output #5962

Closed

WIP: Unify dump_model output (fixes #5962) #5964

Closed

StrikerRUS removed the awaiting review label Oct 15, 2024
