More verbose errors? #357

Open
BramVanroy opened this issue Aug 31, 2023 · 3 comments
@BramVanroy

Hello

Thank you for your work. I use mip as part of a neural network training pipeline, specifically inside an evaluation metric, smatchpp, in a multi-node, multi-threaded environment. I just found that my training sometimes crashes non-deterministically, but I can't figure out whether the problem lies in my own code or in the smatchpp lib, because the error trace is so hard to read. This is what I see:

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(+0xa154c) [0x7f66976a154c]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/lib64/libstdc++.so.6(+0xa1a01) [0x7f66938a1a01]
/lib64/libstdc++.so.6(+0xad37c) [0x7f66938ad37c]
/lib64/libstdc++.so.6(+0xad3e7) [0x7f66938ad3e7]
/lib64/libstdc++.so.6(+0xad36f) [0x7f66938ad36f]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
/lib64/libstdc++.so.6(+0xdb9d4) [0x7f66938db9d4]
/lib64/libc.so.6(+0x9f802) [0x7f669769f802]
/lib64/libc.so.6(+0x3f450) [0x7f669763f450]

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.

I have no idea how to read this (I am used to Python stack traces). I see references to both mip (at the top) and torch (near the end). So which one caused the error, mip or torch? And how can I pinpoint where the issue lies? Is it possible to get, or implement, more verbose error traces for mip?
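
For anyone trying to read a trace like this one, here is a minimal sketch of how it could be broken down. It assumes the frames follow the glibc `backtrace_symbols` layout, `path(symbol+offset) [address]`, with the innermost frame first. `parse_frames` and `likely_origin` are hypothetical helper names, not part of mip, Cbc, or torch, and the "skip runtime and crash-handler frames" rule is only a heuristic.

```python
import re
from typing import NamedTuple, Optional

# One frame per line: path(symbol+0xoffset) [0xaddress]; the symbol may be empty.
FRAME_RE = re.compile(
    r"^(?P<path>\S+?)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

# Frames in these libraries are usually plumbing (raise, abort, std::terminate).
RUNTIME_PREFIXES = ("libc.so", "libstdc++.so")


class Frame(NamedTuple):
    library: str            # file name of the shared object
    symbol: Optional[str]   # mangled symbol, or None if only an offset is given
    offset: str


def parse_frames(trace: str) -> list[Frame]:
    """Extract frames from a trace, ignoring lines that are not frames."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue
        library = match.group("path").rsplit("/", 1)[-1]
        frames.append(Frame(library, match.group("symbol") or None, match.group("offset")))
    return frames


def likely_origin(frames: list[Frame]) -> Optional[Frame]:
    """Heuristic: first frame that is neither runtime plumbing nor a crash handler."""
    for frame in frames:
        if frame.library.startswith(RUNTIME_PREFIXES):
            continue
        if frame.symbol and "CrashHandler" in frame.symbol:
            continue  # the handler only reports the signal, it did not raise it
        return frame
    return None


# Four frames copied from the trace above.
SAMPLE = """\
ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame.library, frame.symbol or "<no symbol>", frame.offset)
    print("likely origin:", likely_origin(parse_frames(SAMPLE)))
```

On the trace above, this heuristic skips the `CbcCrashHandler` frame and the libc/libstdc++ frames and lands on `ProcessGroupNCCL::WorkNCCL::handleNCCLGuard` in `libtorch_cuda.so`, called from `workCleanupLoop`. That suggests, but does not prove, that the abort was raised on torch's NCCL cleanup thread and only reported by the signal handler that the Cbc library installs.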

@rschwarz
Contributor

I guess that the issue here lies within the Cbc solver (in its shared library), not Python code.

@BramVanroy
Author

@rschwarz Thank you for the reply. Does that mean I should report it elsewhere? What would be the right place?

@ckchow

ckchow commented Oct 23, 2023

https://github.com/coin-or/Cbc/issues (I believe I'm having a similar issue)
