Vectorization problems (and how to fix them) #10
Hey @haampie, I've unfortunately noticed the same issue, and I've been able to reproduce it with very simple loops. There are two things I would like to fix here. It looks like the intent of the changes you're proposing here is to allow programmers to work around these issues by writing explicitly vectorized code.
Just as an example:

```julia
using MultiFloats, VectorizationBase
using VectorizationBase: extractelement, pick_vector_width
using MultiFloats: renormalize

# Transpose M scalar N-limb MultiFloats into one MultiFloat whose limbs are M-wide Vecs.
@generated function MultiFloatOfVec(fs::NTuple{M,MultiFloat{T,N}}) where {T,M,N}
    exprs = [:(Vec($([:(fs[$j]._limbs[$i]) for j = 1:M]...))) for i = 1:N]
    return quote
        MultiFloat(tuple($(exprs...)))
    end
end

# Inverse transform: split a vectorized MultiFloat back into M scalar MultiFloats.
@generated function TupleOfMultiFloat(fs::MultiFloat{Vec{M,T},N}) where {T,M,N}
    exprs = [:(MultiFloat(tuple($([:(extractelement(fs._limbs[$j], $i)) for i = 0:M-1]...)))) for j = 1:N]
    exprs = [:(MultiFloat(tuple($([:(extractelement(fs._limbs[$j], $i)) for j = 1:N]...)))) for i = 0:M-1]
    return quote
        tuple($(exprs...))
    end
end

function trivial_sum(xs)
    t = zero(eltype(xs))
    @inbounds for x in xs
        t += x
    end
    return t
end

function vectorized_sum(xs::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)
    t = zero(MultiFloat{Vec{M,T},N})
    @inbounds for i = 1:M:length(xs)-M+1
        t += MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
    end
    return +(TupleOfMultiFloat(t)...)
end

function trivial_dot(xs, ys)
    t = zero(eltype(xs))
    @inbounds for i = 1:length(xs)
        t += xs[i] * ys[i]
    end
    return t
end

function vectorized_dot(xs::Vector{MultiFloat{T,N}}, ys::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)
    t = zero(MultiFloat{Vec{M,T},N})
    @inbounds for i = 1:M:length(xs)-M+1
        x = MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
        y = MultiFloatOfVec(ntuple(k -> ys[i + k - 1], M))
        t += x * y
    end
    return +(TupleOfMultiFloat(t)...)
end

using BenchmarkTools

random_vec(::Type{MultiFloat{T,N}}, k) where {T,N} =
    [renormalize(MultiFloat(ntuple(i -> rand(T) * eps(T)^(i-1), N))) for _ = 1:k]

function benchmark_dot(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)
    ys = random_vec(T, 4096)
    @show vectorized_dot(xs, ys) - trivial_dot(xs, ys)
    vectorized = @benchmark vectorized_dot($xs, $ys)
    trivial = @benchmark trivial_dot($xs, $ys)
    return vectorized, trivial
end

function benchmark_sum(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)
    @show vectorized_sum(xs) - trivial_sum(xs)
    vectorized = @benchmark vectorized_sum($xs)
    trivial = @benchmark trivial_sum($xs)
    return vectorized, trivial
end
```

Benchmark timings:

```julia
julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = 0.0
(Trial(59.062 μs), Trial(240.739 μs))

julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = 3.1115076389305709e-61
(Trial(21.876 μs), Trial(88.099 μs))

julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -1.3849467926678604e-127
(Trial(328.814 μs), Trial(854.370 μs))

julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.7787690973264271e-62
(Trial(37.992 μs), Trial(128.553 μs))
```

To replicate, install a patched VectorizationBase:
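One caveat worth noting for anyone adapting the blocked loops above: `for i = 1:M:length(xs)-M+1` silently skips the last `length(xs) % M` elements whenever the length is not a multiple of the vector width (the benchmarks use 4096, which is). A dependency-free sketch of the usual fix, with the vector body simulated by a plain inner loop and hypothetical names:

```julia
# Blocked reduction with a scalar tail, mimicking the structure
# vectorized_sum would need for arbitrary lengths. `M` plays the role
# of the SIMD width; the inner loop stands in for the vector body.
function blocked_sum(xs, M)
    t = zero(eltype(xs))
    i = 1
    @inbounds while i + M - 1 <= length(xs)  # full blocks of M elements
        for k = 0:M-1
            t += xs[i + k]
        end
        i += M
    end
    @inbounds while i <= length(xs)          # leftover tail, handled scalar
        t += xs[i]
        i += 1
    end
    return t
end
```

The tail loop costs nothing when the length divides evenly, and avoids silently wrong results when it does not.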
I think this is the best way to do it in Julia.
Some more data points for AVX-512 on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz with vectors of size 2^13:

```julia
julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = -5.9091063153828709e-126
(Trial(124.202 μs), Trial(499.626 μs))

julia> benchmark_sum(Float64x7)
vectorized_sum(xs) - trivial_sum(xs) = -4.5240823300086601e-109
(Trial(100.521 μs), Trial(422.145 μs))

julia> benchmark_sum(Float64x6)
vectorized_sum(xs) - trivial_sum(xs) = 6.7116512220867355e-93
(Trial(78.467 μs), Trial(339.790 μs))

julia> benchmark_sum(Float64x5)
vectorized_sum(xs) - trivial_sum(xs) = -4.7498927053019445e-77
(Trial(47.538 μs), Trial(257.020 μs))

julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = -2.6058876476043531e-60
(Trial(33.474 μs), Trial(187.179 μs))

julia> benchmark_sum(Float64x3)
vectorized_sum(xs) - trivial_sum(xs) = -3.7835058536770061e-44
(Trial(17.097 μs), Trial(121.135 μs))

julia> benchmark_sum(Float64x2)
vectorized_sum(xs) - trivial_sum(xs) = -4.543838814073028e-28
(Trial(9.002 μs), Trial(63.851 μs))
```

and

```julia
julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.7855212332921517e-126
(Trial(555.383 μs), Trial(1.735 ms))

julia> benchmark_dot(Float64x7)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.4846950312643274e-110
(Trial(275.906 μs), Trial(1.280 ms))

julia> benchmark_dot(Float64x6)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 2.7865337663127964e-93
(Trial(200.465 μs), Trial(855.838 μs))

julia> benchmark_dot(Float64x5)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -6.0453179885661112e-77
(Trial(131.219 μs), Trial(558.778 μs))

julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 1.2834969010588605e-60
(Trial(89.520 μs), Trial(300.694 μs))

julia> benchmark_dot(Float64x3)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -3.4331812375958018e-44
(Trial(36.687 μs), Trial(122.643 μs))

julia> benchmark_dot(Float64x2)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.0292258760486853e-28
(Trial(15.735 μs), Trial(63.286 μs))
```

Summary: so, not entirely the 8x speedup I hoped for. I think the issue here is that inlining doesn't work properly; maybe we can inline code by hand? Edit: there was still some issue with bounds checks; after fixing it I'm getting:
Ok, seems like there are no real inlining problems after all. The issue is the codegen for the transform from NTuple{N,MultiFloat{Float64,M}} to MultiFloat{Vec{N,Float64},M}. It's pretty much a matrix transpose in the registers; I've written that kernel before: https://github.com/haampie/FastTranspose.jl/blob/master/src/FastTranspose.jl#L34-L41
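The transform is easy to see with plain tuples: converting M scalar N-limb numbers into N M-wide limb vectors is exactly a transpose of an M-by-N "matrix" of limbs. A dependency-free sketch (tuples standing in for Vec registers; helper names are hypothetical):

```julia
# AoS -> SoA: gather limb i across all M numbers; the i-th inner tuple
# is what a MultiFloatOfVec-style converter would pack into one Vec register.
aos_to_soa(fs) = ntuple(i -> ntuple(j -> fs[j][i], length(fs)), length(first(fs)))

# SoA -> AoS: the inverse transpose, analogous to TupleOfMultiFloat.
soa_to_aos(ls) = ntuple(j -> ntuple(i -> ls[i][j], length(ls)), length(first(ls)))
```

A fast SIMD version does the same rearrangement with in-register shuffles instead of scalar extracts, which is what the FastTranspose.jl kernel linked above provides.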
I've figured out the problem. With an improved transpose kernel I'm getting a 5.40x speedup for dot and a 5.66x speedup for sum for Float64x8 on AVX-512 :) The resulting assembly is quite pretty for dot: https://gist.github.com/haampie/e2e52718208ee7997fc4a34a4649af97
I've improved the register-transpose kernel for general input sizes (for instance Float64x7 requires a 7x8 transpose on AVX-512), and I'm getting the following results:
Not sure this can really be improved. For reference, the struct-of-arrays version, which doesn't require shuffling, has the following speedups vs. scalar:
Another result on AVX-512:
This still requires some hand-vectorization.
Hey @haampie, this is fantastic work! I am especially impressed with your fast transpose kernel. Here is what I am going to do: starting tonight, with the imminent release of MultiFloats.jl v1.0, I am simply going to drop the old approach. Moving forward, I would also like MultiFloats.jl to provide explicitly vectorized implementations of these kernels. By the way, I had a look at your patch to VectorizationBase, and I believe it should be safe to use MultiFloats.jl with it.
Just a few comments: it would be interesting to look into what's necessary to make LLVM autovectorize reductions, because that would be much more generic than implementing a few vectorized routines. This is a good starting point: https://docs.julialang.org/en/v1/devdocs/llvm/#Passing-options-to-LLVM

Another idea I had is to add a simple "struct of arrays" array type to this package: basically a permuted array where the limbs are moved to the last dimension. A minimal working example:

```julia
struct MFArray{T,M,N,TA} <: AbstractArray{MultiFloat{T,M},N}
    A::TA
end

import Base: size, getindex, setindex!, view, IndexStyle
using Base: OneTo
using Base.Cartesian: @ntuple, @nexprs

export MFArray

# The backing array has one extra trailing dimension holding the limbs.
Base.size(A::MFArray) = reverse(Base.tail(reverse(size(A.A))))
Base.IndexStyle(x::MFArray) = IndexStyle(x.A)

@generated function Base.getindex(A::MFArray{T,M,N}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        MultiFloat{T,M}(@ntuple $M j -> A.A[i + (j - 1) * dist])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        @nexprs $M j -> A.A[i + (j - 1) * dist] = v._limbs[j]
        return v
    end
end

@generated function Base.getindex(A::MFArray{T,M,N}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        MultiFloat{T,M}(@ntuple $M j -> A.A[I..., j])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        @nexprs $M j -> A.A[I..., j] = v._limbs[j]
        return v
    end
end

function MFArray(A::AbstractArray{MultiFloat{T,M},N}) where {T,M,N}
    A′ = permutedims(reshape(reinterpret(T, A), M, size(A)...), circshift(OneTo(N+1), -1))
    return MFArray{T,M,N,typeof(A′)}(A′)
end

@inline function view(A::MFArray{T,M,N}, I::Vararg{Any,N}) where {T,M,N}
    B = view(A.A, I..., :)
    return MFArray{T,M,ndims(B)-1,typeof(B)}(B)
end
```

This results in the following:

```julia
julia> function simple_muladd(C, A, B)
           @inbounds @simd ivdep for i ∈ eachindex(A)
               C[i] += A[i] * B[i]
           end
           return nothing
       end
simple_muladd (generic function with 1 method)

julia> @benchmark simple_muladd(C, A, B) setup=(A=MFArray(fill(Float64x4(1.0), 100)); B=MFArray(fill(Float64x4(1.0), 100)); C=MFArray(fill(Float64x4(1.0), 100)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     654.309 ns (0.00% GC)
  median time:      673.235 ns (0.00% GC)
  mean time:        673.953 ns (0.00% GC)
  maximum time:     811.642 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

julia> @benchmark simple_muladd(C, A, B) setup=(A=fill(Float64x4(1.0), 100); B=fill(Float64x4(1.0), 100); C=fill(Float64x4(1.0), 100))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     859.617 ns (0.00% GC)
  median time:      878.817 ns (0.00% GC)
  mean time:        880.429 ns (0.00% GC)
  maximum time:     1.439 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     60
```

The problem however is that it requires …
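The linear-index getindex/setindex! above rely on the limb-major layout: limb j of element i lives at offset i + (j - 1) * dist in the flat backing storage, one contiguous limb plane of length dist per limb. A dependency-free sanity check of that addressing arithmetic (helper names are hypothetical; plain Float64 tuples stand in for MultiFloat limbs):

```julia
# Pack M-limb "numbers" into one flat vector, limb planes back to back,
# using the same i + (j - 1) * dist addressing as MFArray's getindex.
function soa_pack(elements::Vector{NTuple{M,Float64}}) where {M}
    dist = length(elements)
    flat = Vector{Float64}(undef, M * dist)
    for i in 1:dist, j in 1:M
        flat[i + (j - 1) * dist] = elements[i][j]
    end
    return flat
end

# Recover element i: read limb j from plane j at the same offset.
soa_get(flat, dist, i, M) = ntuple(j -> flat[i + (j - 1) * dist], M)
```

Because consecutive elements of one limb plane are adjacent, a SIMD load of M neighboring elements' j-th limbs is a plain contiguous load, which is exactly why this layout needs no shuffles.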
Using an array like I suggested above does have the downside that it sometimes runs into cache thrashing. For instance, I have some code that does reduction to band form (the first step in a parallel Schur decomposition), and it's slow at 512x512 compared to 513x513:

```julia
julia> A = MFArray(rand(Float64x4, 512, 512)); @time to_band!(A, p);
  3.015606 seconds (2 allocations: 352 bytes)

julia> A = MFArray(rand(Float64x4, 513, 513)); @time to_band!(A, p);
  1.789323 seconds (2 allocations: 352 bytes)
```

because loading the limbs of a single element then touches addresses a large power-of-two stride apart. Compare with plain arrays of MultiFloats:

```julia
julia> A = rand(Float64x4, 512, 512); @time to_band!(A, p);
  2.249070 seconds (1 allocation: 336 bytes)

julia> A = rand(Float64x4, 513, 513); @time to_band!(A, p);
  2.052841 seconds (1 allocation: 336 bytes)
```

There's some slowdown here too (I guess when iterating over columns), but it's not as dramatic as above.
As a matter of fact, I actually have a similar struct-of-arrays construct that I started testing privately a long time ago. However, I hesitate to release an array-like construct as part of the public MultiFloats.jl API, since I have no idea how the Julia …
In principle … The only remaining function to implement is … Btw, I'm trying to write a microkernel for tall-and-skinny QR; what I found is that for Q * A it's useful to store Q as an Array{Float64xN} and A with dims permuted (matrix transpose + limbs to the 2nd dimension), so I guess an MFArray like the above isn't the answer to everything.
Closing this old thread since explicit vectorization is now supported in MultiFloats.jl v2.0. I have, for the time being, decided against introducing my own array-like construct.
Current codegen is unfortunately not great; for instance, a trivial loop like this:

…

generates the following inner loop:

…

By relaxing the type restrictions in MultiFloats.jl a bit, and patching VectorizationBase.jl to add another Vec type that doesn't allow certain fast-math ops, it should be possible to get a 4x speedup on AVX2 and potentially 8x on AVX-512 (well, that's the upper bound :p).

A minimal set of changes to allow the use of VectorizationBase is here: haampie@0e83c05

The idea is to work with a type MultiFloat{Vec{M,T},N} and run the basic *, /, +, and - operations on 2, 4, or 8 numbers simultaneously.
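The reason this works at all is that the error-free transformations underlying multi-word arithmetic are branch-free sequences of +, -, and *, so the identical code runs per lane on SIMD registers. A dependency-free illustration (plain tuples stand in for Vec; two_sum here is the standard Knuth kernel, not necessarily the exact sequence MultiFloats.jl uses):

```julia
# two_sum: error-free transformation; s + e equals a + b exactly,
# with s the rounded sum and e the rounding error.
function two_sum(a, b)
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e
end

# The same straight-line code applied lane-wise over tuples, mimicking
# what MultiFloat{Vec{M,T},N} does with genuine SIMD registers.
lanewise_two_sum(as, bs) = map(two_sum, as, bs)
```

Since there are no branches, each SIMD lane performs exactly the operations the scalar version would, which is what makes the vectorized MultiFloat type bit-compatible with the scalar one (absent fast-math reassociation).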