
feat: add GGML_UNARY_OP_ARGMAX Metal kernel #1019

Merged: 11 commits merged into ggerganov:master from the argmax_metal_kernel branch on Dec 2, 2024

Conversation

PABannier (Contributor)

This PR implements the Metal kernel used for the GGML_UNARY_OP_ARGMAX operation.

It is necessary for Encodec.cpp to run on the Metal backend.

@ggerganov (Owner)

Thanks, will merge this soon. Just need to sync ggerganov/llama.cpp#10238 from llama.cpp first, to avoid resolving conflicts manually.

@PABannier (Contributor, Author)

@ggerganov Thanks for the review. I incorporated your suggested changes:

  1. Taking the strides into account, to avoid asserting that src0 is contiguous;
  2. Parallelizing over threadgroups instead of threads (a rough sketch of this approach follows below).
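For illustration, here is a minimal sketch of a threadgroup-parallel argmax kernel in Metal Shading Language. This is not the exact kernel from the PR: the name, argument layout, and the assumption that the threadgroup size is a power of two are all simplifications.

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: one threadgroup per row. Each thread scans a strided slice of the
// row, then the per-thread results are combined with a tree reduction in
// threadgroup memory. Assumes ntg (threads per threadgroup) is a power of two.
kernel void kernel_argmax_sketch(
        device const char    * src0,
        device       int32_t * dst,
        constant     int64_t  & ne00,                     // elements per row
        constant     uint64_t & nb01,                     // byte stride between rows
        threadgroup  float    * shmem_val [[threadgroup(0)]],
        threadgroup  int32_t  * shmem_idx [[threadgroup(1)]],
        uint tgpig [[threadgroup_position_in_grid]],
        uint tpitg [[thread_position_in_threadgroup]],
        uint ntg   [[threads_per_threadgroup]]) {

    device const float * row = (device const float *)(src0 + tgpig*nb01);

    // per-thread partial argmax over a strided slice of the row
    float   best_val = -INFINITY;
    int32_t best_idx = 0;
    for (int64_t i = tpitg; i < ne00; i += ntg) {
        if (row[i] > best_val) {
            best_val = row[i];
            best_idx = (int32_t) i;
        }
    }

    shmem_val[tpitg] = best_val;
    shmem_idx[tpitg] = best_idx;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // tree reduction: halve the number of active threads each step
    for (uint s = ntg/2; s > 0; s >>= 1) {
        if (tpitg < s && shmem_val[tpitg + s] > shmem_val[tpitg]) {
            shmem_val[tpitg] = shmem_val[tpitg + s];
            shmem_idx[tpitg] = shmem_idx[tpitg + s];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (tpitg == 0) {
        dst[tgpig] = shmem_idx[0];
    }
}
```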

I have a follow-up question regarding the tests/CI: is the argmax Metal kernel tested? Looking at the logs, I only found one place where argmax is mentioned: 19: ARGMAX(type=f32,ne=[10,100,1,1]): not supported [BLAS]. Am I missing something?

@ggerganov (Owner)

> I have a follow-up question regarding the tests/CI: is the argmax Metal kernel tested? Looking at the logs, I only found one place where argmax is mentioned: 19: ARGMAX(type=f32,ne=[10,100,1,1]): not supported [BLAS]. Am I missing something?

You have to enable the ARGMAX op in the Metal backend here:

https://github.com/ggerganov/llama.cpp/blob/3ee6382d48b07b31e64983969c16019490e19740/ggml/src/ggml-metal/ggml-metal.m#L949
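For context, that line sits in the backend's op-support check. A minimal sketch of the enablement, assuming the switch-based structure ggml-metal.m used at the time (the exact function name and surrounding cases may differ across versions):

```c
// in src/ggml-metal/ggml-metal.m, inside the op-support check
// (ggml_metal_supports_op or equivalent):
switch (op->op) {
    // ...
    case GGML_OP_ARGMAX:
        return true; // enable dispatch to the new Metal kernel
    // ...
    default:
        return false;
}
```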

Then you can test like this:

./bin/test-backend-ops -o ARGMAX -b Metal

@PABannier (Contributor, Author)

Just tested the kernel with ./bin/test-backend-ops -o ARGMAX -b Metal and got:

  Device description: Apple M1 Pro
  Device memory: 10922 MB (10916 MB free)

  ARGMAX(type=f32,ne=[10,100,1,1]): OK
  1914/1914 tests passed
  Backend Metal: OK

  ggml_metal_free: deallocating
  Backend 2/3: BLAS
    Skipping
  Backend 3/3: CPU
    Skipping
  3/3 backends passed
  OK

@PABannier (Contributor, Author)

@ggerganov I pushed a SIMD implementation of the kernel. All tests are passing :)
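(For context, "SIMD implementation" here means reducing within a simdgroup via Metal's shuffle intrinsics rather than going through threadgroup memory. A minimal hedged sketch of that reduction step, not the exact kernel code from this PR, assuming a 32-wide simdgroup as on Apple GPUs:)

```metal
#include <metal_stdlib>
using namespace metal;

// Reduce a per-thread (value, index) pair across one simdgroup.
// best_val/best_idx hold each thread's partial argmax on entry.
inline void argmax_simd_reduce(thread float & best_val, thread int32_t & best_idx) {
    for (ushort off = 16; off > 0; off >>= 1) {
        const float   other_val = simd_shuffle_down(best_val, off);
        const int32_t other_idx = simd_shuffle_down(best_idx, off);
        if (other_val > best_val) {
            best_val = other_val;
            best_idx = other_idx;
        }
    }
    // lane 0 of each simdgroup now holds that simdgroup's (value, index) maximum
}
```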

I'd like to benchmark both the non-SIMD and SIMD implementations of the kernel. Is there an existing snippet of code that benchmarks the latency and throughput of the kernel, as @slaren did in ggerganov/llama.cpp#10441?

@slaren (Collaborator) commented Nov 29, 2024

You can obtain the performance measurements with test-backend-ops -o ARGMAX perf. The same tests that I used should already be there, but you can add your own to make_test_cases_perf.
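For illustration, the perf entries follow the same pattern as the existing entries in tests/test-backend-ops.cpp. A hedged sketch, assuming the test_argmax helper takes a type and an ne shape like the other test_case subclasses (the shapes mirror the ones benchmarked below):

```cpp
// inside make_test_cases_perf() in tests/test-backend-ops.cpp
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, { 100, 10, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {1024, 12, 1, 1}));
test_cases.emplace_back(new test_argmax(GGML_TYPE_F32, {5438,  3, 1, 1}));
```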

@PABannier (Contributor, Author) commented Nov 29, 2024

Thanks @slaren! I ran test-backend-ops perf -o ARGMAX -b Metal after adding the 3 test cases in make_test_cases_perf.

Here are the results:

With SIMD vectorization:

ARGMAX(type=f32,ne=[100,10,1,1]):   286720 runs -   3.53 us/run -  3 kB/run - 1.07 GB/s
ARGMAX(type=f32,ne=[1024,12,1,1]):   90112 runs -  11.38 us/run - 48 kB/run - 4.03 GB/s
ARGMAX(type=f32,ne=[5438,3,1,1]):    65536 runs -  16.11 us/run - 63 kB/run - 3.77 GB/s

Without SIMD vectorization:

ARGMAX(type=f32,ne=[100,10,1,1]):    73728 runs -  14.75 us/run -  3 kB/run - 0.26 GB/s
ARGMAX(type=f32,ne=[1024,12,1,1]):   16384 runs - 120.66 us/run - 48 kB/run - 0.38 GB/s
ARGMAX(type=f32,ne=[5438,3,1,1]):     8192 runs - 626.23 us/run - 63 kB/run - 0.10 GB/s
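(Quick sanity check on these numbers: for the 1024x12 case, 48 kB/run = 48*1024 B, and 49152 B / 11.38 us is about 4.32e9 B/s, i.e. about 4.02 GiB/s, matching the reported 4.03 GB/s. The SIMD speedup ranges from roughly 4x on the smallest shape to roughly 39x on the largest.)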

This is indeed much faster! Thanks for the suggestion, @slaren!

It's ok on my side if you want to merge :)

@slaren merged commit 589fed1 into ggerganov:master on Dec 2, 2024 (4 checks passed).
ypapadop-amd pushed a commit to ypapadop-amd/ggml referencing this pull request on Dec 2, 2024, with the squashed commit message:
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmit

* added 3 test cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
@PABannier deleted the argmax_metal_kernel branch on December 3, 2024 at 11:13.