More verbose errors? #357

BramVanroy · 2023-08-31T07:10:18Z

Hello

Thank you for your work. I use mip as part of a neural network training pipeline, specifically inside an evaluation metric (smatchpp) that runs in a multi-node, multi-threaded environment. I just found that my training sometimes crashes non-deterministically, but I can't figure out where the problem lies (in my own code or in the smatchpp lib) because the error trace is so hard to read. This is what I see:

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(+0xa154c) [0x7f66976a154c]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/lib64/libstdc++.so.6(+0xa1a01) [0x7f66938a1a01]
/lib64/libstdc++.so.6(+0xad37c) [0x7f66938ad37c]
/lib64/libstdc++.so.6(+0xad3e7) [0x7f66938ad3e7]
/lib64/libstdc++.so.6(+0xad36f) [0x7f66938ad36f]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
/lib64/libstdc++.so.6(+0xdb9d4) [0x7f66938db9d4]
/lib64/libc.so.6(+0x9f802) [0x7f669769f802]
/lib64/libc.so.6(+0x3f450) [0x7f669763f450]

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.

I have no idea how to read this (I am used to Python stack traces). I see references to both mip (at the top) and torch (near the end). So which one caused the error, mip or torch? And how can I pinpoint where the issue lies? Is it possible to get, or to implement, more verbose error traces for mip?


rschwarz · 2023-08-31T11:32:22Z

I guess that the issue here lies within the Cbc solver (in its shared library), not Python code.
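To check a guess like this against the trace itself, it can help to split each frame into its shared library, symbol, and offset. The following is a minimal sketch, assuming frames have the `path(symbol+0xOFFSET) [0xADDRESS]` shape seen in the trace above. The names `parse_frame` and `summarize` are hypothetical helpers and not part of mip, Cbc, or torch.

```python
import re
from collections import Counter

# Assumed frame shape: /path/to/lib.so(symbol+0xOFFSET) [0xADDRESS]
# The symbol part may be empty when the library has no exported name there.
FRAME_RE = re.compile(
    r"^(?P<lib>\S+?)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

# Three frames copied from the trace in this issue.
SAMPLE = """\
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
"""


def parse_frame(line):
    """Return library, symbol, and offset for one frame, or None if no match."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return {
        "library": match.group("lib").rsplit("/", 1)[-1],  # file name only
        "symbol": match.group("symbol") or None,           # still mangled
        "offset": match.group("offset"),
    }


def summarize(trace):
    """Parse every frame in order (most recent first) and count per library."""
    frames = [f for f in map(parse_frame, trace.splitlines()) if f is not None]
    return frames, Counter(f["library"] for f in frames)


if __name__ == "__main__":
    frames, counts = summarize(SAMPLE)
    for frame in frames:
        print(frame["library"], frame["symbol"])
    print(dict(counts))
```

The symbols stay mangled here; a tool such as `c++filt` can demangle them. For example, `_Z15CbcCrashHandleri` demangles to `CbcCrashHandler(int)`, which reads like the handler that printed the trace. The frames listed below it are the ones that were active before the signal was raised, so they are worth inspecting before deciding where to report the problem.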

BramVanroy · 2023-09-09T07:55:29Z

@rschwarz Thank you for the reply. Does that mean I should report it elsewhere? What would be the right place?

ckchow · 2023-10-23T23:35:55Z

https://github.com/coin-or/Cbc/issues (I believe I'm having a similar issue)
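For anyone who wants Python-level context alongside a native trace like this one, the standard library's `faulthandler` module can dump the Python stack of every thread when a fatal signal such as SIGABRT arrives. This is a minimal sketch and not a mip feature. The helper name `enable_python_crash_traces` and the per-process log file naming are assumptions made for illustration. A native library that installs its own signal handler afterwards may replace the one registered here, so this is not guaranteed to fire in every setup.

```python
import faulthandler
import os
import tempfile


def enable_python_crash_traces(log_dir):
    """Ask faulthandler to write Python tracebacks to a per-process file.

    The file object must stay open for as long as faulthandler is enabled,
    so it is returned to the caller together with its path.
    """
    path = os.path.join(log_dir, f"faulthandler-{os.getpid()}.log")
    log_file = open(path, "w")
    # all_threads=True dumps every Python thread, not only the current one.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file, path


if __name__ == "__main__":
    # Hypothetical usage: call once at the start of each worker process,
    # before the training loop and before any solver is invoked.
    handle, log_path = enable_python_crash_traces(tempfile.gettempdir())
    print("Python crash traces will be written to", log_path)
```

Writing to one file per process ID keeps the output of several workers apart in a multi-node run. The dump only covers Python frames, so it shows what the Python threads were doing at the time of the crash and does not describe the native thread that aborted.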

BramVanroy mentioned this issue Aug 31, 2023

Instability of mip? flipz357/smatchpp#4

Closed
