Is this usable with a serverless architecture? #131

Open
quantumcode-martin opened this issue Jan 18, 2025 · 6 comments
@quantumcode-martin

I want to speed up SAM on A100 as much as possible, but on a serverless GPU architecture.
I can load the model in a cold start and run it during the job.

I made this change:

import torch

# from segment_anything import SamPredictor, sam_model_registry, SamAutomaticMaskGenerator
from segment_anything_fast import (
    SamPredictor,
    sam_model_fast_registry,
    SamAutomaticMaskGenerator,
)

sam = sam_model_fast_registry["vit_h"](checkpoint="models/sam_vit_h_4b8939.pth")

device = "cuda" if torch.cuda.is_available() else "cpu"
sam.to(device)
predictor = SamPredictor(sam)
mask_generator = SamAutomaticMaskGenerator(sam)

(I keep both predictor and mask_generator because I use SAM in two ways: segmenting everything, and prompting with a point.)

My job is taking extremely long, so I am wondering if I am missing something. It looks like autotuning is running. I am not sure I understand what it is doing, but can't I use pre-optimized functions for the A100?
Here are the logs (newest entries first):

2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 11.5340 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_773 0.2693 ms 70.0%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   addmm 0.2693 ms 70.0%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_775 0.2673 ms 70.5%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_776 0.2401 ms 78.5%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_772 0.2365 ms 79.7%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_774 0.2263 ms 83.3%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_781 0.2150 ms 87.6%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_779 0.1925 ms 97.9%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   triton_mm_780 0.1905 ms 98.9%\n
2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 |   bias_addmm 0.1884 ms 100.0%\n
2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 | AUTOTUNE addmm(4096x3840, 4096x1280, 1280x3840)\n
2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.2984 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 |   triton_mm_72 0.1137 ms 69.4%\n
2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 |   triton_mm_68 0.1106 ms 71.3%\n
2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 |   triton_mm_70 0.1085 ms 72.6%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_67 0.0993 ms 79.4%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_71 0.0983 ms 80.2%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_69 0.0932 ms 84.6%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_76 0.0881 ms 89.5%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_74 0.0829 ms 95.1%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   triton_mm_75 0.0809 ms 97.5%\n
2025-01-18  23:13:38.031 | info | kelcmu34r52tw4 |   mm 0.0788 ms 100.0%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 | AUTOTUNE mm(4900x1280, 1280x1280)\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 5.2299 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_49 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_47 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_46 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_45 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_44 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_42 0.0225 ms 95.5%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_54 0.0215 ms 100.0%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_51 0.0215 ms 100.0%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_50 0.0215 ms 100.0%\n
2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 |   triton_bmm_48 0.0215 ms 100.0%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 | AUTOTUNE bmm(14x5600x80, 14x80x14)\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 5.4020 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_37 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_36 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_34 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_33 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_32 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_30 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_29 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_26 0.0215 ms 95.2%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_35 0.0205 ms 100.0%\n
2025-01-18  23:13:16.181 | info | kelcmu34r52tw4 |   triton_bmm_31 0.0205 ms 100.0%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 | AUTOTUNE bmm(14x5600x80, 14x80x14)\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.3003 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_21 0.3236 ms 69.6%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_17 0.3092 ms 72.8%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_19 0.3062 ms 73.6%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_20 0.2857 ms 78.9%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_16 0.2775 ms 81.2%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_18 0.2652 ms 84.9%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_25 0.2540 ms 88.7%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   mm 0.2386 ms 94.4%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_24 0.2263 ms 99.5%\n
2025-01-18  23:13:10.951 | info | kelcmu34r52tw4 |   triton_mm_23 0.2253 ms 100.0%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 | AUTOTUNE mm(4900x1280, 1280x3840)\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 3.3487 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_2 3.2420 ms 14.1%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_0 1.1105 ms 41.2%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_4 0.8735 ms 52.4%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_5 0.8172 ms 56.0%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   convolution 0.7444 ms 61.5%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_3 0.5960 ms 76.8%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_1 0.5591 ms 81.9%\n
2025-01-18  23:13:05.549 | info | kelcmu34r52tw4 |   triton_convolution_6 0.4577 ms 100.0%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 | AUTOTUNE convolution(1x3x1024x1024, 1280x3x16x16)\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.0648 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3470 0.3543 ms 67.1%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3474 0.3512 ms 67.6%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3473 0.3072 ms 77.3%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3469 0.3052 ms 77.9%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3472 0.3041 ms 78.1%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3471 0.2939 ms 80.8%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   mm 0.2760 ms 86.1%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3476 0.2550 ms 93.2%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3478 0.2499 ms 95.1%\n
2025-01-18  23:12:55.248 | info | kelcmu34r52tw4 |   triton_mm_3477 0.2376 ms 100.0%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 | AUTOTUNE mm(4096x5120, 5120x1280)\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.3774 seconds and 0.0000 seconds precompiling\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_91 0.3564 ms 68.4%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_87 0.3471 ms 70.2%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_89 0.3451 ms 70.6%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_90 0.3164 ms 77.0%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_86 0.3144 ms 77.5%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_88 0.3000 ms 81.2%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_95 0.2785 ms 87.5%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_93 0.2550 ms 95.6%\n
2025-01-18  23:12:48.030 | info | kelcmu34r52tw4 |   triton_mm_94 0.2499 ms 97.5%\n
2025-01-18  23:12:48.029 | info | kelcmu34r52tw4 |   mm 0.2437 ms 100.0%\n
2025-01-18  23:12:35.464 | info | kelcmu34r52tw4 | AUTOTUNE mm(4096x1280, 1280x5120)\n
2025-01-18  23:11:37.594 | info | kelcmu34r52tw4 | Started.
2025-01-18  23:11:37.335 | info | kelcmu34r52tw4 | --- Starting Serverless Worker |  Version 1.6.2 ---\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 6.8672 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_790 0.0256 ms 84.0%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_788 0.0246 ms 87.5%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_787 0.0246 ms 87.5%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_795 0.0236 ms 91.3%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_792 0.0236 ms 91.3%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_789 0.0236 ms 91.3%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_798 0.0225 ms 95.5%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_796 0.0225 ms 95.5%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_793 0.0225 ms 95.5%\n
2025-01-18  22:59:33.946 | info | kelcmu34r52tw4 |   triton_bmm_791 0.0215 ms 100.0%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 | AUTOTUNE bmm(64x1024x80, 64x80x64)\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 11.1037 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_775 0.2632 ms 71.2%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_773 0.2611 ms 71.8%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   addmm 0.2611 ms 71.8%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_776 0.2376 ms 78.9%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_772 0.2335 ms 80.3%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_774 0.2212 ms 84.7%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_781 0.2099 ms 89.3%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_779 0.1905 ms 98.4%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   triton_mm_780 0.1894 ms 98.9%\n
2025-01-18  22:59:32.082 | info | kelcmu34r52tw4 |   bias_addmm 0.1874 ms 100.0%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 | AUTOTUNE addmm(4096x3840, 4096x1280, 1280x3840)\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.0688 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_72 0.1126 ms 70.0%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_68 0.1096 ms 72.0%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_70 0.1085 ms 72.6%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_67 0.0973 ms 81.1%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_71 0.0963 ms 81.9%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_69 0.0901 ms 87.5%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_76 0.0881 ms 89.5%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_74 0.0819 ms 96.2%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   triton_mm_75 0.0799 ms 98.7%\n
2025-01-18  22:59:25.214 | info | kelcmu34r52tw4 |   mm 0.0788 ms 100.0%\n
2025-01-18  22:59:14.096 | info | kelcmu34r52tw4 | AUTOTUNE mm(4900x1280, 1280x1280)\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 5.2848 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_45 0.0225 ms 90.9%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_55 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_54 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_53 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_48 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_47 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_44 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_42 0.0215 ms 95.2%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_51 0.0205 ms 100.0%\n
2025-01-18  22:59:14.095 | info | kelcmu34r52tw4 |   triton_bmm_50 0.0205 ms 100.0%\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 | AUTOTUNE bmm(14x5600x80, 14x80x14)\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 5.3798 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 |   triton_bmm_26 0.0215 ms 95.2%\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 |   triton_bmm_40 0.0205 ms 100.0%\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 |   triton_bmm_39 0.0205 ms 100.0%\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 |   triton_bmm_38 0.0205 ms 100.0%\n
2025-01-18  22:59:04.027 | info | kelcmu34r52tw4 |   triton_bmm_35 0.0205 ms 100.0%\n
2025-01-18  22:59:04.026 | info | kelcmu34r52tw4 |   triton_bmm_34 0.0205 ms 100.0%\n
2025-01-18  22:59:04.026 | info | kelcmu34r52tw4 |   triton_bmm_33 0.0205 ms 100.0%\n
2025-01-18  22:59:04.026 | info | kelcmu34r52tw4 |   triton_bmm_32 0.0205 ms 100.0%\n
2025-01-18  22:59:04.026 | info | kelcmu34r52tw4 |   triton_bmm_31 0.0205 ms 100.0%\n
2025-01-18  22:59:04.026 | info | kelcmu34r52tw4 |   triton_bmm_29 0.0205 ms 100.0%\n
2025-01-18  22:58:58.742 | info | kelcmu34r52tw4 | AUTOTUNE bmm(14x5600x80, 14x80x14)\n
2025-01-18  22:58:58.742 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 9.8710 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:58:58.742 | info | kelcmu34r52tw4 |   triton_mm_21 0.3185 ms 69.8%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_17 0.3041 ms 73.1%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_19 0.3000 ms 74.1%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_20 0.2816 ms 78.9%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_16 0.2744 ms 81.0%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_18 0.2652 ms 83.8%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_25 0.2478 ms 89.7%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   mm 0.2355 ms 94.3%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_23 0.2232 ms 99.5%\n
2025-01-18  22:58:58.741 | info | kelcmu34r52tw4 |   triton_mm_24 0.2222 ms 100.0%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 | AUTOTUNE mm(4900x1280, 1280x3840)\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 3.2299 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_2 3.2338 ms 14.2%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_0 1.1110 ms 41.4%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_4 0.8755 ms 52.5%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_5 0.8192 ms 56.1%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   convolution 0.7485 ms 61.4%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_3 0.5990 ms 76.8%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_1 0.5591 ms 82.2%\n
2025-01-18  22:58:53.361 | info | kelcmu34r52tw4 |   triton_convolution_6 0.4598 ms 100.0%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 | AUTOTUNE convolution(1x3x1024x1024, 1280x3x16x16)\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.0157 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3474 0.3482 ms 67.1%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3470 0.3482 ms 67.1%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3469 0.3031 ms 77.0%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3473 0.3011 ms 77.6%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3472 0.2970 ms 78.6%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3471 0.2908 ms 80.3%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   mm 0.2703 ms 86.4%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3476 0.2478 ms 94.2%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3478 0.2468 ms 94.6%\n
2025-01-18  22:58:43.490 | info | kelcmu34r52tw4 |   triton_mm_3477 0.2335 ms 100.0%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 | AUTOTUNE mm(4096x5120, 5120x1280)\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 9.9050 seconds and 0.0000 seconds precompiling\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_91 0.3564 ms 67.8%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_87 0.3410 ms 70.9%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_89 0.3400 ms 71.1%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_90 0.3123 ms 77.4%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_86 0.3082 ms 78.4%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_88 0.2908 ms 83.1%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_95 0.2724 ms 88.7%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_93 0.2499 ms 96.7%\n
2025-01-18  22:58:36.778 | info | kelcmu34r52tw4 |   triton_mm_94 0.2447 ms 98.7%\n
2025-01-18  22:58:36.777 | info | kelcmu34r52tw4 |   mm 0.2417 ms 100.0%\n
2025-01-18  22:58:24.228 | info | kelcmu34r52tw4 | AUTOTUNE mm(4096x1280, 1280x5120)\n
2025-01-18  22:57:26.656 | info | kelcmu34r52tw4 | Started.
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | --- Starting Serverless Worker |  Version 1.6.2 ---\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | \n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | \n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | By pulling and using the container, you accept the terms and conditions of this license:\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | \n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | \n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | CUDA Version 11.8.0\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | \n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | ==========\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | == CUDA ==\n
2025-01-18  22:57:26.357 | info | kelcmu34r52tw4 | ==========\n
2025-01-18  22:57:26.356 | info | kelcmu34r52tw4 | \n
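
To get a rough sense of how much of a cold start goes to autotuning, the "benchmarking takes" lines in the logs above can be summed. This is a minimal sketch that assumes log lines in the format shown; the helper name `total_autotune_seconds` and the `SAMPLE` list are made up for illustration, and the sample holds only a few lines copied from the logs, not the whole run.

```python
import re

# Matches the benchmarking summary lines that appear in the logs above.
BENCH_RE = re.compile(r"AUTOTUNE benchmarking takes ([0-9.]+) seconds")


def total_autotune_seconds(lines):
    """Sum the reported benchmarking time over all matching log lines."""
    total = 0.0
    for line in lines:
        match = BENCH_RE.search(line)
        if match:
            total += float(match.group(1))
    return total


# A few lines copied from the logs above (raw strings keep the literal "\n").
SAMPLE = [
    r"2025-01-18  23:13:41.285 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 11.5340 seconds and 0.0000 seconds precompiling\n",
    r"2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 | AUTOTUNE addmm(4096x3840, 4096x1280, 1280x3840)\n",
    r"2025-01-18  23:13:38.032 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 10.2984 seconds and 0.0000 seconds precompiling\n",
    r"2025-01-18  23:13:26.480 | info | kelcmu34r52tw4 | SingleProcess AUTOTUNE benchmarking takes 5.2299 seconds and 0.0000 seconds precompiling\n",
]

print(f"{total_autotune_seconds(SAMPLE):.4f} seconds spent benchmarking in the sample")
```

Running the same helper over a full log export (one line per entry) gives the benchmarking time reported for that worker. It does not include compile time that is not reported on these lines.
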
@cpuhrsch
Contributor

Yes, so a lot of the optimizations here use compile and autotuning to get good runtime performance. You could try to use torch.export (using the latest nightlies) similar to this code here: https://github.com/pytorch/ao/blob/5d1444bdef6df15eb89c4c5716ede1c5f8677798/examples/sam2_amg_server/compile_export_utils.py#L147

@quantumcode-martin
Author

So the optimizations are running on the first run only?
I'm a bit confused about this:

The package acts like a drop-in replacement for segment-anything.

I thought it would just make it faster, but this compile and autotuning step is making it much slower.
I'll try to reproduce the torch.export approach. Is there a simpler example or function that does only this?

Thanks for the help 🙏

@cpuhrsch
Contributor

@quantumcode-martin - ah, so torch.export lets you save out the code that comes from running with compile and max-autotune. So you run export on a GPU, store the resulting binary somewhere and then load it back up. Like here: https://github.com/pytorch/ao/blob/d0e434c8d825f7ac69e26585cb2ceb002a287f24/examples/sam2_amg_server/cli_on_modal.py#L156-L165 or more generically here: https://github.com/pytorch/ao/blob/d0e434c8d825f7ac69e26585cb2ceb002a287f24/examples/sam2_amg_server/compile_export_utils.py#L285-L308

@cpuhrsch
Contributor

@quantumcode-martin and yes, it is a drop-in replacement, but torch.compile does have compile overhead on the first run. That's just part of how the compiler works, but with export you can do the compiler work once and then store the result so you don't need to redo it on every start. export is new and so it's not part of SAM-fast, but it is part of SAM2-fast :)

@quantumcode-martin
Author

Hey @cpuhrsch thanks a lot for the help!

Sorry if I'm bothering you with noob questions, but I got a bit outside of my comfort zone trying to run this model faster. 😅
I tried to follow what you were saying on an A100 instance, and I confirmed that the first call to set_image indeed takes forever but the following ones are blazing fast. ⚡️

Here is my attempt to compile the model to get it to run fast right after a cold start:

from segment_anything_fast import (
    SamPredictor,
    sam_model_fast_registry,
    SamAutomaticMaskGenerator,
)
import torch
from torch.export import export

sam = sam_model_fast_registry["vit_h"]()

device = "cuda"
sam.to(device)
predictor = SamPredictor(sam)

# compiling
predictor.set_image(image=image)

export(
    predictor,
    "/workspace/exports/",
)

But I get:

ValueError: Expected `mod` to be an instance of `torch.nn.Module`, got <class 'segment_anything_fast.predictor.SamPredictor'>.

I understand the error but I'm not sure about where I can find a torch.nn.Module if it's not predictor.
Again thanks for the precious help 🙏

@cpuhrsch
Contributor

Ah, so you might want to try exporting predictor.model.image_encoder instead. As the code linked above shows, you export and then load back up only a subset of the full predictor, precisely because of the error you hit above :)


2 participants