Performance for very small arrays #13
-
Hi! I'm testing various custom CPU array implementations in Julia, and comparing them with stack-allocated and heap-allocated arrays in C. https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8 The test first fills an array with the inverse squares of the integers from 1 to N, and then sums them. This is what it looks like for Bumper.jl:
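(The original snippet was not captured in this copy of the thread. A minimal sketch of what the benchmarked function likely looked like, inferred from the `sumArray_bumper_2` variant in the reply below and assuming Bumper.jl's `@no_escape`/`alloc` API together with LoopVectorization's `@turbo`:)

```julia
using Bumper, LoopVectorization

# Hypothetical reconstruction: identical to sumArray_bumper_2 below,
# except the buffer is looked up via default_buffer() on every call
# rather than being passed in as an argument.
@inline function sumArray_bumper(N)
    buf = default_buffer()   # task-local lookup happens on each call
    @no_escape buf begin
        smallarray = alloc(Float64, buf, N)
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        s = 0.0
        @turbo for i ∈ 1:N
            s += smallarray[i]
        end
    end
    return s
end
```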
I am focusing on values of N from 3 to 100, since for larger N most implementations converge to similar timings (about 10% overhead with respect to C), with the exception of regular Julia arrays, which are generally slower and thus need much larger values of N before the overhead is overshadowed by the actual use of memory. My favourite method would be Bumper, as I think the API is great, but it is the slowest of all the alternatives to standard arrays I'm considering (manually pre-allocating a standard array, MallocArrays from StaticTools, and calling malloc in C). Standard arrays are of course slower than Bumper. Am I doing something wrong? Do you think there could be a way to remove this overhead and approach the performance of, for example, pre-allocated regular arrays? Best,
-
Hi @mdmaas, sorry for the slow reply. The most important thing you can do to speed it up would be to explicitly pass in the buffer you're using. Here's an example:

julia> @inline function sumArray_bumper_2(N, buf=default_buffer())
           @no_escape buf begin
               smallarray = alloc(Float64, buf, N)
               @turbo for i ∈ 1:N
                   smallarray[i] = 1.0 / i^2
               end
               sum = 0.0
               @turbo for i ∈ 1:N
                   sum += smallarray[i]
               end
           end
           return sum
       end;
julia> let N = Ref(5)
           @btime sumArray_bumper($N[])                         # original version
           @btime sumArray_bumper_2($N[])                       # buffer acquired at runtime, then re-used
           @btime sumArray_bumper_2($N[], $(default_buffer()))  # buffer passed in ahead of time
       end;
  26.089 ns (0 allocations: 0 bytes)
  16.936 ns (0 allocations: 0 bytes)
  11.593 ns (0 allocations: 0 bytes)
julia> let N = Ref(20)
           @btime sumArray_bumper($N[])                         # original version
           @btime sumArray_bumper_2($N[])                       # buffer acquired at runtime, then re-used
           @btime sumArray_bumper_2($N[], $(default_buffer()))  # buffer passed in ahead of time
       end;
  31.478 ns (0 allocations: 0 bytes)
  23.941 ns (0 allocations: 0 bytes)
  16.604 ns (0 allocations: 0 bytes)

The reason for this is that in order to safely acquire the buffer, we use task-local storage, which has some overhead.
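(One way to avoid that task-local lookup entirely, sketched under the assumption that Bumper.jl's exported `AllocBuffer` constructor accepts a size in bytes, is to create a buffer once outside the hot loop and pass it to every call:)

```julia
using Bumper

# Create one buffer up front (2^14 bytes is an arbitrary size for this
# benchmark) and reuse it for every call, so default_buffer()'s
# task-local lookup is never hit inside the timed region.
const mybuf = AllocBuffer(2^14)

# sumArray_bumper_2 is the function defined in the reply above.
results = [sumArray_bumper_2(N, mybuf) for N in 3:100]
```

Note that a manually created buffer like this must not be shared across concurrent tasks or threads; the task-local storage behind `default_buffer()` exists precisely to make the default path safe.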