-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a triaging guide for suspected rocMLIR failures. #3227
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #3227 +/- ##
========================================
Coverage 92.20% 92.20%
========================================
Files 493 493
Lines 19700 19700
========================================
Hits 18164 18164
Misses 1536 1536 ☔ View full report in Codecov by Sentry. |
|
||
There are broadly 3 categories of bugs that can be due to rocMLIR. | ||
|
||
1. [B1]rocMLIR compilation bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another debugging check could be reverting rocMLIR SHA.
If there was a known working SHA before try building MIGraphX against that and see if that works without any problems.
and then try newer rocMLIR SHA and see if it fails or passes with MIGraphX. That could indicate if problem lies in rocMLIR or MIGraphX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah
But also, conversely, holding the rocMLIR SHA constant at the newest one while debugging can be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing rocMLIR SHA feels like something rocMLIR team should be doing...
I m not sure whether it is advisable to try older SHAs where bugs are fixed in the later ones.
Therefore, I think disabling rocMLIR might be enough with the latest SHA in most cases to see whether its rocMLIR vs MIGraphX bug. (Yes, there can be complex cases where rocMLIR offloads creating a different graph structure exposing a MIGraphX bug but with current infrastructure there is no good way to handle that -- we can have a separate discussion how to improve debuggability)
I d go further and say unless there is a warning from the following, we should not be changing rocMLIR SHA in debugging.
AMDMIGraphX/src/targets/gpu/mlir.cpp
Lines 42 to 43 in 3e65032
#if !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) || MLIR_MIGRAPHX_DIALECT_API_VERSION != 4 | |
#warning "Incompatible version of rocMLIR library used, disabling" |
This build is not recommended to merge 🔴 |
🔴bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output |
|
||
There are broadly 3 categories of bugs that can be due to rocMLIR. | ||
|
||
1. [B1]rocMLIR compilation bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah
But also, conversely, holding the rocMLIR SHA constant at the newest one while debugging can be useful
|
||
Then individually create MIGraphX program that only has the MIGRAPHX | ||
module Then indiviually ``driver verify`` them to see which is the | ||
failing module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think MIGraphX is able to do this easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but someone has to do it, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/ROCm/AMDMIGraphX/pull/3182/files#
With this PR it would be possible to dump MIGraphX programs for each module as MXR and then migraphx-driver can driver verify them.
Co-authored-by: Umang Yadav <[email protected]>
Co-authored-by: Umang Yadav <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments only
Co-authored-by: Charlie Lin <[email protected]>
Co-authored-by: Charlie Lin <[email protected]>
Thanks all for reviewing and more importantly suggesting changes in GH interface :) |
This PR adds a doc to help migraphx developers
triage a suspected rocMLIR failure.