(topic/tracker) faceswap pipeline performance #112

Open

monorimet opened this issue Jan 13, 2025 · 3 comments

@monorimet (Collaborator) commented Jan 13, 2025

This refers to the work in the alibaba_fp16 branch of this repository.

From the fp16-model directory, with an IREE environment set up, run the controlled IP-adapted unet module as follows:

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion=0 \
  --iree-codegen-transform-dialect-library=specs/attention_and_matmul_spec_control.mlir \
  base_ir/stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb \
  --device=hip://0 \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_0.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_1.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_2.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_3.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_4.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_5.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_6.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_7.npy \
  --input=@sample_inputs/controlled_unet_npys/cunet_in_8.npy \
  --device_allocator=caching \
  --parameters=model=splat/controlled_unet.irpa \
  --function=run_forward \
  --benchmark_repetitions=3

Note the compiler flags. We will want to flip --iree-dispatch-creation-enable-aggressive-fusion to 1 once a distributed context bug is fixed; tracked in iree-org/iree#19688.
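As a convenience, the fusion flag could be hoisted into an environment variable so it is easy to flip once iree-org/iree#19688 lands. A minimal bash sketch; the AGGRESSIVE_FUSION variable and the wrapper script itself are my own additions, and every other flag is copied verbatim from the compile command above:

#!/bin/bash
# Hypothetical compile wrapper; AGGRESSIVE_FUSION defaults to 0 until
# iree-org/iree#19688 is resolved, after which it can be overridden to 1.
AGGRESSIVE_FUSION="${AGGRESSIVE_FUSION:-0}"

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion="${AGGRESSIVE_FUSION}" \
  --iree-codegen-transform-dialect-library=specs/attention_and_matmul_spec_control.mlir \
  base_ir/stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlled_unet_bs1_64_1024x960_fp16_amdgpu_gfx942.vmfb

Once the bug is fixed, invoking it as AGGRESSIVE_FUSION=1 ./compile_cunet.sh (the script name is also hypothetical) re-enables aggressive fusion without editing the command.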

The attention spec is different only because one attention shape needs to be commented out of the tunings.

@MaheshRavishankar noted that this command was also missing a flag for matmul generalization.

Real weights for the controlled unet module are publicly available here: https://sharkpublic.blob.core.windows.net/sharkpublic/sdxl/weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa
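To benchmark against the real weights rather than the splat IRPA, something like the following should work; wget and the local weights/ directory are my own choices here:

# Fetch the public fp16 controlled-unet weights into a local directory.
mkdir -p weights
wget -P weights https://sharkpublic.blob.core.windows.net/sharkpublic/sdxl/weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa

# Then swap the parameters flag in the iree-benchmark-module command above:
#   --parameters=model=weights/stable_diffusion_xl_base_1_0_controlled_unet_dataset_fp16.irpa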

@MaheshRavishankar (Contributor)

Actually, maybe we don't want the generalization flag for this model. Let's leave it aside for now. This looks fine.

@monorimet (Collaborator, Author)

If we focus on controlnet, I have prepared standalone IR and inputs in the same branch.

iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics)' \
  --iree-hal-force-indirect-command-buffers=1 \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=0 \
  --iree-hal-memoization=1 \
  --iree-opt-strip-assertions \
  --iree-opt-outer-dim-concat=1 \
  --iree-hip-waves-per-eu=2 \
  --iree-llvmgpu-enable-prefetch=1 \
  --iree-codegen-gpu-native-math-precision=1 \
  --iree-dispatch-creation-enable-aggressive-fusion=0 \
  --iree-codegen-transform-dialect-library=attention_and_matmul_spec.mlir \
  stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16.mlir \
  -o stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb \
  --device=hip://0 \
  --input=@controlnet_0_input_0.npy \
  --input=@controlnet_0_input_1.npy \
  --input=@controlnet_0_input_2.npy \
  --input=@controlnet_0_input_3.npy \
  --input=@controlnet_0_input_4.npy \
  --input=@controlnet_0_input_5.npy \
  --input=@controlnet_0_input_6.npy \
  --input=@controlnet_0_input_7.npy \
  --input=@controlnet_0_input_8.npy \
  --device_allocator=caching \
  --parameters=model=/home/eagarvey/shark-ai/sharktank/sharktank/torch_exports/sdxl/stable_diffusion_xl_base_1_0_controlnet_fp16.irpa \
  --function=run_forward
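Since the nine input flags are easy to mistype, here is a small bash sketch that assembles them in a loop; it assumes the controlnet_0_input_{0..8}.npy files sit in the current working directory, exactly as named in the command above:

# Assemble the nine controlnet input flags programmatically (bash).
INPUT_FLAGS=""
for i in $(seq 0 8); do
  INPUT_FLAGS+=" --input=@controlnet_0_input_${i}.npy"
done

iree-benchmark-module \
  --module=stable_diffusion_xl_base_1_0_controlnet_bs1_64_960x1024_fp16_amdgpu-gfx942.vmfb \
  --device=hip://0 \
  ${INPUT_FLAGS} \
  --device_allocator=caching \
  --parameters=model=/home/eagarvey/shark-ai/sharktank/sharktank/torch_exports/sdxl/stable_diffusion_xl_base_1_0_controlnet_fp16.irpa \
  --function=run_forward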

tracy profile:
https://sharkpublic.blob.core.windows.net/sharkpublic/ean/control.tracy
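To inspect the capture locally, it can be pulled down with curl (the local filename is arbitrary) and opened in a Tracy profiler build compatible with the version IREE was traced with:

# Download the Tracy capture for local inspection.
curl -L -o control.tracy https://sharkpublic.blob.core.windows.net/sharkpublic/ean/control.tracy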


@monorimet (Collaborator, Author) commented Jan 14, 2025

I recall the txt2img unet latency reaching 160ms on MI308x. Without controlnet, this "ip_adapted" unet module has a latency of ~175ms. It would help to have someone reproduce the above results on a machine that achieves 160ms for the txt2img unet, to verify whether the IP-adapter regresses unet performance. Or point me to a machine and I can spin up there, too.
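For scale, assuming the 160 ms txt2img figure and the ~175 ms ip_adapted figure above are directly comparable, the gap works out to roughly a 9% latency increase:

# (175 - 160) / 160 as a percentage; prints 9.37
echo "scale=2; 100 * (175 - 160) / 160" | bc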
