Add a triaging guide for suspected rocMLIR failures. #3227

Merged
merged 8 commits into from
Jul 4, 2024

manupak (Contributor) commented Jun 28, 2024

This PR adds a doc to help MIGraphX developers triage a suspected rocMLIR failure.

@manupak manupak requested a review from a team as a code owner June 28, 2024 11:27
@manupak manupak requested a review from umangyadav June 28, 2024 11:27

codecov bot commented Jun 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.20%. Comparing base (d9367cb) to head (8ab1604).
Report is 148 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3227   +/-   ##
========================================
  Coverage    92.20%   92.20%           
========================================
  Files          493      493           
  Lines        19700    19700           
========================================
  Hits         18164    18164           
  Misses        1536     1536           



There are broadly 3 categories of bugs that can be due to rocMLIR.

1. [B1]rocMLIR compilation bug
umangyadav (Member) commented Jun 28, 2024

Another debugging check could be reverting the rocMLIR SHA.
If there is a known working SHA, try building MIGraphX against it and see whether it works without any problems.
Then try a newer rocMLIR SHA and see whether it fails or passes with MIGraphX. That could indicate whether the problem lies in rocMLIR or MIGraphX.
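
The good-SHA/bad-SHA comparison suggested here could be automated as a bisection. The following is only a sketch under stated assumptions: `first_bad_sha` and the `passes` callback are hypothetical names, not part of MIGraphX or rocMLIR. The callback is assumed to build MIGraphX against the given rocMLIR SHA and run the failing test. The list is assumed to be ordered oldest to newest, with the first entry known good and the last entry known bad.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of the first failing SHA.
// Assumptions: `shas` has at least two entries, is ordered oldest to newest,
// shas.front() passes, shas.back() fails, and a failure persists once introduced.
std::size_t first_bad_sha(const std::vector<std::string>& shas,
                          const std::function<bool(const std::string&)>& passes)
{
    std::size_t good = 0;               // index known to pass
    std::size_t bad  = shas.size() - 1; // index known to fail
    while(bad - good > 1)
    {
        const std::size_t mid = good + (bad - good) / 2;
        if(passes(shas[mid]))
            good = mid; // failure was introduced after mid
        else
            bad = mid; // failure was introduced at or before mid
    }
    return bad;
}
```

Each call to `passes` stands for a full rebuild and test run, so the bisection keeps the number of rebuilds logarithmic in the number of candidate SHAs.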

Contributor

Yeah

But also, conversely, holding the rocMLIR SHA constant at the newest one while debugging can be useful

Contributor Author

Changing the rocMLIR SHA feels like something the rocMLIR team should be doing...
I'm not sure it is advisable to try older SHAs when bugs are fixed in later ones.

Therefore, I think disabling rocMLIR with the latest SHA should be enough in most cases to see whether it's a rocMLIR or a MIGraphX bug. (Yes, there can be complex cases where rocMLIR offloads create a different graph structure that exposes a MIGraphX bug, but with the current infrastructure there is no good way to handle that. We can have a separate discussion on how to improve debuggability.)

I'd go further and say that unless there is a warning from the following, we should not be changing the rocMLIR SHA while debugging.

#if !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) || MLIR_MIGRAPHX_DIALECT_API_VERSION != 4
#warning "Incompatible version of rocMLIR library used, disabling"
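
The quoted fragment stops before the conditional is closed. As a rough illustration of how such a guard can both warn and switch a feature off, here is a self-contained sketch. The flag `rocmlir_api_compatible` is a hypothetical name introduced for this example, and the fallback `#define` exists only so the sketch compiles without the rocMLIR headers; neither is taken from the MIGraphX sources.

```cpp
#include <cassert>

// Assumption: a real build receives this macro from the rocMLIR headers.
// It is defined here only so that the sketch compiles on its own.
#ifndef MLIR_MIGRAPHX_DIALECT_API_VERSION
#define MLIR_MIGRAPHX_DIALECT_API_VERSION 4
#endif

#if !defined(MLIR_MIGRAPHX_DIALECT_API_VERSION) || MLIR_MIGRAPHX_DIALECT_API_VERSION != 4
#warning "Incompatible version of rocMLIR library used, disabling"
// Hypothetical flag: records that the MLIR path was compiled out.
constexpr bool rocmlir_api_compatible = false;
#else
// Hypothetical flag: the expected dialect API version was found.
constexpr bool rocmlir_api_compatible = true;
#endif
```

If the build log shows the warning, the library version mismatch is the first thing to resolve before any further triage.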

migraphx-bot (Collaborator) commented Jun 28, 2024

| Test | Batch | Rate new (cab23d) | Rate old (497c27) | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 1,749.69 | 1,748.90 | 0.05% | |
| torchvision-resnet50_fp16 | 64 | 4,078.30 | 4,080.75 | -0.06% | |
| torchvision-densenet121 | 32 | 1,471.95 | 1,467.80 | 0.28% | |
| torchvision-densenet121_fp16 | 32 | 2,520.20 | 2,527.69 | -0.30% | |
| torchvision-inceptionv3 | 32 | 889.21 | 888.98 | 0.03% | |
| torchvision-inceptionv3_fp16 | 32 | 1,483.71 | 1,484.24 | -0.04% | |
| cadene-inceptionv4 | 16 | 412.28 | 412.13 | 0.04% | |
| cadene-resnext64x4 | 16 | 419.25 | 419.33 | -0.02% | |
| slim-mobilenet | 64 | 4,004.68 | 4,004.86 | -0.00% | |
| slim-nasnetalarge | 64 | 100.95 | 100.96 | -0.02% | |
| slim-resnet50v2 | 64 | 1,678.44 | 1,678.94 | -0.03% | |
| bert-mrpc-onnx | 8 | 615.65 | 616.38 | -0.12% | |
| bert-mrpc-tf | 1 | 277.65 | 279.52 | -0.67% | |
| pytorch-examples-wlang-gru | 1 | 326.51 | 326.38 | 0.04% | |
| pytorch-examples-wlang-lstm | 1 | 294.10 | 294.48 | -0.13% | |
| torchvision-resnet50_1 | 1 | 471.91 | 472.05 | -0.03% | |
| cadene-dpn92_1 | 1 | 246.78 | 246.45 | 0.13% | |
| cadene-resnext101_1 | 1 | 204.36 | 203.78 | 0.28% | |
| onnx-taau-downsample | 1 | 205.75 | 206.10 | -0.17% | |
| dlrm-criteoterabyte | 1 | 22.89 | 22.90 | -0.05% | |
| dlrm-criteoterabyte_fp16 | 1 | 42.72 | 42.66 | 0.14% | |
| agentmodel | 1 | 6,117.63 | 7,432.29 | -17.69% | 🔴 |
| unet_fp16 | 2 | 34.28 | 34.31 | -0.07% | |
| resnet50v1_fp16 | 1 | 595.91 | 578.62 | 2.99% | |
| resnet50v1_int8 | 1 | 572.92 | 577.30 | -0.76% | |
| bert_base_cased_fp16 | 64 | 645.75 | 645.72 | 0.00% | |
| bert_large_uncased_fp16 | 32 | 198.84 | 198.85 | -0.01% | |
| bert_large_fp16 | 1 | 117.45 | 117.34 | 0.10% | |
| distilgpt2_fp16 | 16 | 1,211.95 | 1,210.51 | 0.12% | |
| yolov5s | 1 | 301.25 | 293.94 | 2.49% | |
| tinyllama | 1 | 23.32 | 23.30 | 0.09% | |
| vicuna-fastchat | 1 | 132.99 | 133.83 | -0.62% | |
| whisper-tiny-encoder | 1 | 244.20 | 243.92 | 0.11% | |
| whisper-tiny-decoder | 1 | 255.72 | 256.31 | -0.23% | |

This build is not recommended to merge 🔴

migraphx-bot (Collaborator)

✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance
✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance
✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
✅ agentmodel: PASSED: MIGraphX meets tolerance
✅ unet: PASSED: MIGraphX meets tolerance
✅ resnet50v1: PASSED: MIGraphX meets tolerance
✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
✅ bert_large: PASSED: MIGraphX meets tolerance
✅ yolov5s: PASSED: MIGraphX meets tolerance
✅ tinyllama: PASSED: MIGraphX meets tolerance
✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

@jerryyin jerryyin requested review from krzysz00 and pfultz2 June 28, 2024 13:43
@jerryyin jerryyin requested a review from causten June 28, 2024 14:24

Then individually create a MIGraphX program that only has the MIGRAPHX
module. Then individually ``driver verify`` them to see which is the
failing module.
Contributor

I don't think MIGraphX is able to do this easily

Contributor Author

Yes, but someone has to do it, right?

Member

https://github.com/ROCm/AMDMIGraphX/pull/3182/files#

With this PR, it would be possible to dump the MIGraphX program for each module as MXR, and then migraphx-driver can verify each of them.
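
Once the per-module MXR files exist, checking them one by one could look roughly like the sketch below. The helper names `verify_command`, `run_verify`, and `failing_modules` are hypothetical and not part of MIGraphX. The sketch also assumes that the file paths need no shell quoting and that `migraphx-driver verify` exits with a nonzero status when verification fails, which should be confirmed for the driver version in use.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>

// Builds the command line for one dumped module.
// Assumption: the path contains nothing that needs shell quoting.
std::string verify_command(const std::string& mxr_file)
{
    return "migraphx-driver verify " + mxr_file;
}

// Runs the driver and treats a zero exit status as a pass.
// Assumption: the driver reports a verification failure via its exit status.
bool run_verify(const std::string& mxr_file)
{
    return std::system(verify_command(mxr_file).c_str()) == 0;
}

// Returns the files whose verification fails, in input order.
// The check is injectable so the loop can be exercised without the driver.
std::vector<std::string>
failing_modules(const std::vector<std::string>& mxr_files,
                const std::function<bool(const std::string&)>& verify = run_verify)
{
    std::vector<std::string> failed;
    for(const auto& file : mxr_files)
    {
        if(!verify(file))
            failed.push_back(file);
    }
    return failed;
}
```

Calling `failing_modules` with the list of dumped files and the default check returns the candidates to hand over for rocMLIR triage.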

manupak and others added 2 commits July 1, 2024 13:41
@kahmed10 kahmed10 requested a review from CharlieL7 July 3, 2024 15:06
CharlieL7 (Collaborator) left a comment

minor comments only

docs/dev/triage-rocmlir.rst (two outdated review threads, resolved)
manupak and others added 2 commits July 3, 2024 16:43
manupak (Contributor Author) commented Jul 3, 2024

Thanks, all, for reviewing and, more importantly, for suggesting changes in the GH interface :)

@umangyadav umangyadav merged commit cbaa5b4 into develop Jul 4, 2024
40 of 44 checks passed
@umangyadav umangyadav deleted the mlir-triage-doc branch July 4, 2024 17:02