Performance for very small arrays #13
-
Hi! I'm testing various custom CPU array implementations in Julia, and comparing them with stack-allocated and heap-allocated arrays in C. https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8 The test first fills an array with the inverse squares of the integers from 1 to N, and then sums them. This is what it looks like for Bumper.jl:
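(The original snippet was not captured in this copy of the thread. A minimal sketch of what the benchmarked function likely looked like, inferred from the `sumArray_bumper_2` variant in the reply below and assuming Bumper.jl's `@no_escape`/`alloc` API together with LoopVectorization's `@turbo`:)

```julia
using Bumper, LoopVectorization

# Hypothetical reconstruction: identical to sumArray_bumper_2 below,
# except the buffer is looked up via default_buffer() on every call
# rather than being passed in as an argument.
@inline function sumArray_bumper(N)
    buf = default_buffer()   # task-local lookup happens on each call
    @no_escape buf begin
        smallarray = alloc(Float64, buf, N)
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        s = 0.0
        @turbo for i ∈ 1:N
            s += smallarray[i]
        end
    end
    return s
end
```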
I am focusing on values of N from 3 to 100, since for larger N most implementations converge to similar timings (about 10% overhead with respect to C), with the exception of regular Julia arrays, which are generally slower and thus need much larger values of N before the overhead is overshadowed by the actual use of memory. My favourite method would be Bumper, as I think the API is great, but it is the slowest of all the alternatives to standard arrays I'm considering (manually pre-allocating a standard array, MallocArrays from StaticTools, and calling malloc in C). Standard arrays are of course slower than Bumper. Am I doing something wrong? Do you think there could be a way to remove this overhead and approach the performance of, for example, pre-allocated regular arrays? Best,
-
Hi @mdmaas, sorry for the slow reply. The most important thing you can do to speed it up would be to explicitly pass in the buffer you're using. Here's an example:

julia> @inline function sumArray_bumper_2(N, buf=default_buffer())
           @no_escape buf begin
               smallarray = alloc(Float64, buf, N)
               @turbo for i ∈ 1:N
                   smallarray[i] = 1.0 / i^2
               end
               sum = 0.0
               @turbo for i ∈ 1:N
                   sum += smallarray[i]
               end
           end
           return sum
       end;
julia> let N = Ref(5)
           @btime sumArray_bumper($N[])                         # original version
           @btime sumArray_bumper_2($N[])                       # buffer acquired at runtime, then re-used
           @btime sumArray_bumper_2($N[], $(default_buffer()))  # buffer passed in ahead of time
       end;
  26.089 ns (0 allocations: 0 bytes)
  16.936 ns (0 allocations: 0 bytes)
  11.593 ns (0 allocations: 0 bytes)
julia> let N = Ref(20)
           @btime sumArray_bumper($N[])                         # original version
           @btime sumArray_bumper_2($N[])                       # buffer acquired at runtime, then re-used
           @btime sumArray_bumper_2($N[], $(default_buffer()))  # buffer passed in ahead of time
       end;
  31.478 ns (0 allocations: 0 bytes)
  23.941 ns (0 allocations: 0 bytes)
  16.604 ns (0 allocations: 0 bytes)

The reason for this is that in order to safely acquire the buffer, we use task-local storage, which has some overhead.
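(One way to avoid that task-local lookup entirely, sketched under the assumption that Bumper.jl's exported `AllocBuffer` constructor accepts a size in bytes, is to create a buffer once outside the hot loop and pass it to every call:)

```julia
using Bumper

# Create one buffer up front (2^14 bytes is an arbitrary size for this
# benchmark) and reuse it for every call, so default_buffer()'s
# task-local lookup is never hit inside the timed region.
const mybuf = AllocBuffer(2^14)

# sumArray_bumper_2 is the function defined in the reply above.
results = [sumArray_bumper_2(N, mybuf) for N in 3:100]
```

Note that a manually created buffer like this must not be shared across concurrent tasks or threads; the task-local storage behind `default_buffer()` exists precisely to make the default path safe.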