(topic/tracker) faceswap pipeline performance #112

Open

monorimet opened this issue Jan 13, 2025 · 3 comments

@monorimet (Collaborator) commented Jan 13, 2025

This refers to the work in the alibaba_fp16 branch of this repository.

From the fp16-model directory, with an IREE environment set up, run the controlled IP-adapted unet module as follows:

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion=0 \
  --iree-codegen-transform-dialect-library=specs/attention_and_matmul_spec_control.mlir \
  base_ir/stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb \
  --device=hip://0 \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_0.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_1.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_2.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_3.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_4.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_5.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_6.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_7.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_8.npy \
  --device_allocator=caching \
  --parameters=model=splat/controlled_unet.irpa \
  --function=run_forward \
  --benchmark_repetitions=3

Note the compiler flags. We will want to flip --iree-dispatch-creation-enable-aggressive-fusion to 1 once a distributed context bug is fixed; tracked in iree-org/iree#19688.
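As a convenience, the fusion flag could be hoisted into an environment variable so it is easy to flip once iree-org/iree#19688 lands. A minimal bash sketch; the AGGRESSIVE_FUSION variable and the wrapper script itself are my own additions, and every other flag is copied verbatim from the compile command above:

#!/bin/bash
# Hypothetical compile wrapper; AGGRESSIVE_FUSION defaults to 0 until
# iree-org/iree#19688 is resolved, after which it can be overridden to 1.
AGGRESSIVE_FUSION="${AGGRESSIVE_FUSION:-0}"

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion="${AGGRESSIVE_FUSION}" \
  --iree-codegen-transform-dialect-library=specs/attention_and_matmul_spec_control.mlir \
  base_ir/stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb

Once the bug is fixed, invoking it as AGGRESSIVE_FUSION=1 ./compile_cunet.sh (the script name is also hypothetical) re-enables aggressive fusion without editing the command.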

The attention spec is different only because one attention shape needs to be commented out of the tunings.

@MaheshRavishankar noted that this command was also missing a flag for matmul generalization.

Real weights for the controlled unet module are publicly available here: https://sharkpublic.blob.core.windows.net/sharkpublic/sdxl/weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa
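To benchmark against the real weights rather than the splat IRPA, something like the following should work; wget and the local weights/ directory are my own choices here:

# Fetch the public fp16 controlled-unet weights into a local directory.
mkdir -p weights
wget -P weights https://sharkpublic.blob.core.windows.net/sharkpublic/sdxl/weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa

# Then swap the parameters flag in the iree-benchmark-module command above:
#   --parameters=model=weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa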

@MaheshRavishankar (Contributor)

Actually, maybe we don't want the generalization flag for this model. Let's leave it aside for now. This looks fine.

@monorimet (Collaborator, Author)

If we focus on controlnet, I have prepared standalone IR and inputs in the same branch.

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion=0 \
  --iree-codegen-transform-dialect-library=attention_and_matmul_spec.mlir \
  stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb \
  --device=hip://0 \
  --input=@controlnet_0_input_0.npy \
  --input=@controlnet_0_input_1.npy \
  --input=@controlnet_0_input_2.npy \
  --input=@controlnet_0_input_3.npy \
  --input=@controlnet_0_input_4.npy \
  --input=@controlnet_0_input_5.npy \
  --input=@controlnet_0_input_6.npy \
  --input=@controlnet_0_input_7.npy \
  --input=@controlnet_0_input_8.npy \
  --device_allocator=caching \
  --parameters=model=/home/eagarvey/shark-ai/sharktank/sharktank/torch_exports/sdxl/stable_diffusion_xl_base_1_0_controlnet_fp16.irpa \
  --function=run_forward
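Since the nine input flags are easy to mistype, here is a small bash sketch that assembles them in a loop; it assumes the controlnet_0_input_{0..8}.npy files sit in the current working directory, exactly as named in the command above:

# Assemble the nine controlnet input flags programmatically (bash).
INPUT_FLAGS=""
for i in $(seq 0 8); do
  INPUT_FLAGS+=" --input=@controlnet_0_input_${i}.npy"
done

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb \
  --device=hip://0 \
  ${INPUT_FLAGS} \
  --device_allocator=caching \
  --parameters=model=/home/eagarvey/shark-ai/sharktank/sharktank/torch_exports/sdxl/stable_diffusion_xl_base_1_0_controlnet_fp16.irpa \
  --function=run_forward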

tracy profile:
https://sharkpublic.blob.core.windows.net/sharkpublic/ean/control.tracy
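To inspect the capture locally, it can be pulled down with curl (the local filename is arbitrary) and opened in a Tracy profiler build compatible with the version IREE was traced with:

# Download the Tracy capture for local inspection.
curl -L -o control.tracy https://sharkpublic.blob.core.windows.net/sharkpublic/ean/control.tracy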


@monorimet (Collaborator, Author) commented Jan 14, 2025

I recall the txt2img unet latency reaching 160ms on MI308x. Without controlnet, this "ip_adapted" unet module has a latency of ~175ms. It would help to have someone reproduce the above results on a machine that achieves 160ms for the txt2img unet, to verify whether the IP-adapter regresses unet performance. Or point me to a machine and I can spin up there, too.
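For scale, assuming the 160 ms txt2img figure and the ~175 ms ip_adapted figure above are directly comparable, the gap works out to roughly a 9% latency increase:

# (175 - 160) / 160 as a percentage; prints 9.37
echo "scale=2; 100 * (175 - 160) / 160" | bc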
