-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove broken register passing optimization. #524
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #524 +/- ##
==========================================
- Coverage 80.31% 78.54% -1.77%
==========================================
Files 39 39
Lines 9604 9542 -62
==========================================
- Hits 7713 7495 -218
- Misses 1891 2047 +156 ☔ View full report in Codecov by Sentry. |
Forgot to include the revert of a4a160f. Updated the branch, so needs re-approval. |
Turns out this isn't sufficient. On this branch, MCPhylo.jl still asserts, suggesting that the bad IR doesn't (only) come from the
As I've already spent too much time on this, I'm going to vote on #523 instead. I'd suggest that people who are actually familiar with this code take a look instead after we've fixed the immediate problem. |
It is still annoying that we'd have to remove this optimization. PR: julia> @time using LoopVectorization
0.272088 seconds (427.03 k allocations: 22.256 MiB, 5.16% gc time, 4.05% compilation time)
julia> function matmul!(C, A, B)
@turbo for n = indices((C,B), 2), m = indices((C,A),1)
Cmn = zero(eltype(C))
for k = indices((A,B), (2,1))
Cmn += A[m,k]*B[k,n]
end
C[m,n] = Cmn
end
end
matmul! (generic function with 1 method)
julia> A = rand(3,3); B = rand(3,3); C = similar(A);
julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
Range (min … max): 20.742 ns … 23.508 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 20.831 ns ┊ GC (median): 0.00%
Time (mean ± σ): 20.852 ns ± 0.155 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▇██▇▅▁ ▂
▆████████▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▇▇▇▇▆▆ █
20.7 ns Histogram: log(frequency) by time 21.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0. Release julia> A = rand(3,3); B = rand(3,3); C = similar(A);
julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
Range (min … max): 19.945 ns … 23.903 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 20.230 ns ┊ GC (median): 0.00%
Time (mean ± σ): 20.276 ns ± 0.226 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▃▆█▃
▂▂▃▄▅▅▆▇████████▅▄▅█▇▄▃▂▂▂▂▂▂▂▂▂▂▂▃▃▅▄▃▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
19.9 ns Histogram: frequency by time 21.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0. It's a small, but consistent benefit.
"japi1_matmul!_3476": # @"japi1_matmul!_3476"
# %bb.0: # %L86.preheader
push rbp
mov rbp, rsp
push r15
push r14
push rbx
and rsp, -32
sub rsp, 128
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp], ymm0
mov qword ptr [rsp + 32], 0
mov qword ptr [rsp + 112], rsi
#APP
mov rax, qword ptr fs:[0]
#NO_APP
mov r14, qword ptr [rax - 8]
mov qword ptr [rsp], 12
mov rax, qword ptr [r14]
mov qword ptr [rsp + 8], rax
mov rax, rsp
mov qword ptr [r14], rax
mov rax, qword ptr [rsi]
mov rcx, qword ptr [rsi + 8]
mov rdx, qword ptr [rsi + 16]
mov r8, qword ptr [rax + 32]
mov r9, qword ptr [rax]
mov rsi, qword ptr [rax + 24]
mov r11, qword ptr [rcx]
mov rbx, qword ptr [rcx + 32]
lea r10, [8*rsi]
mov rdi, qword ptr [rdx]
lea r15, [8*rbx]
mov qword ptr [rsp + 72], r11
mov qword ptr [rsp + 80], rdi
mov qword ptr [rsp + 88], r9
mov qword ptr [rsp + 96], r10
mov qword ptr [rsp + 104], r15
mov qword ptr [rsp + 48], r8
mov qword ptr [rsp + 56], rsi
mov qword ptr [rsp + 64], rbx
mov qword ptr [rsp + 32], rcx
mov qword ptr [rsp + 24], rdx
mov qword ptr [rsp + 16], rax
movabs rax, offset "j__turbo_!_3483"
lea rdi, [rsp + 48]
lea rsi, [rsp + 72]
vzeroupper
call rax
mov rax, qword ptr [rsp + 8]
mov qword ptr [r14], rax
movabs rax, 94218903085064
lea rsp, [rbp - 24]
pop rbx
pop r14
pop r15
pop rbp
ret
.Lfunc_end0:
.size "japi1_matmul!_3476", .Lfunc_end0-"japi1_matmul!_3476" Release: "japi1_matmul!_3809": # @"japi1_matmul!_3809"
# %bb.0: # %L86.preheader
push rbp
mov rbp, rsp
push r15
push r14
push rbx
and rsp, -32
sub rsp, 64
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp], ymm0
mov qword ptr [rsp + 32], 0
mov qword ptr [rsp + 48], rsi
#APP
mov rax, qword ptr fs:[0]
#NO_APP
mov r14, qword ptr [rax - 8]
mov qword ptr [rsp], 12
mov rax, qword ptr [r14]
mov qword ptr [rsp + 8], rax
mov rax, rsp
mov qword ptr [r14], rax
mov rax, qword ptr [rsi]
mov rbx, qword ptr [rsi + 8]
mov r10, qword ptr [rsi + 16]
mov rdi, qword ptr [rax + 32]
mov r9, qword ptr [rax]
mov rsi, qword ptr [rax + 24]
mov rcx, qword ptr [rbx]
mov rdx, qword ptr [rbx + 32]
lea r11, [8*rsi]
mov r8, qword ptr [r10]
lea r15, [8*rdx]
mov qword ptr [rsp + 32], rbx
mov qword ptr [rsp + 24], r10
mov qword ptr [rsp + 16], rax
movabs rax, offset "j__turbo_!_3816"
push r15
push r11
vzeroupper
call rax
add rsp, 16
mov rax, qword ptr [rsp + 8]
mov qword ptr [r14], rax
movabs rax, 94379488305160
lea rsp, [rbp - 24]
pop rbx
pop r14
pop r15
pop rbp
ret
.Lfunc_end0:
.size "japi1_matmul!_3809", .Lfunc_end0-"japi1_matmul!_3809" Note all the extra stores to the stack in the asm from this PR. Mirroring this, the first block within the .text
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
mov rax, r13
mov r8, qword ptr [rdi]
mov rcx, qword ptr [rdi + 8]
mov r11, qword ptr [rdi + 16]
mov r9, qword ptr [rsi]
mov r14, qword ptr [rsi + 8]
mov r15, qword ptr [rsi + 16]
mov r13, qword ptr [rsi + 24]
mov rsi, qword ptr [rsi + 32]
mov rax, qword ptr [rax + 16]
mov rax, qword ptr [rax + 16]
mov rax, qword ptr [rax]
dec r11
mov qword ptr [rbp - 72], rcx
neg cl
and cl, 7
mov r12b, -1
shr r12b, cl
lea rcx, [4*r13]
add rcx, r13
lea rdi, [rsi + 2*rsi]
lea rax, [8*rsi]
sub rax, rsi
lea rdx, [8*r13]
sub rdx, r13
mov qword ptr [rbp - 88], rdx
add r8, -9
imul r8, rsi
lea r10, [rsi + 4*rsi]
lea rbx, [rsi + 8*rsi]
lea rdx, [2*r13]
add rdx, r13
mov qword ptr [rbp - 56], rdx
lea rdx, [r14 + r8]
mov qword ptr [rbp - 136], rbx
add rbx, rdx
mov qword ptr [rbp - 112], rbx
test r8, r8
mov qword ptr [rbp - 64], r11
mov qword ptr [rbp - 80], r9
mov dword ptr [rbp - 44], r12d
js L2378
lea rbx, [8*r13]
add rbx, r13
mov qword ptr [rbp - 120], rbx
kmovd k1, r12d
mov r12, r14
mov rbx, qword ptr [rbp - 72]
mov qword ptr [rbp - 128], rdx
jmp L255
nop dword ptr [rax + rax]
L224: Compared to the release: .text
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
mov r10, r9
mov r9, rcx
mov rax, qword ptr [rbp + 24]
mov r11, qword ptr [rbp + 16]
mov rcx, qword ptr [r13 + 16]
mov r13, r8
mov r8, rdx
mov rcx, qword ptr [rcx + 16]
mov rcx, qword ptr [rcx]
dec r8
mov qword ptr [rbp - 64], rsi
mov ecx, esi
neg cl
and cl, 7
mov r14b, -1
shr r14b, cl
lea rbx, [r11 + 4*r11]
lea rcx, [rax + 2*rax]
lea r12, [8*rax]
sub r12, rax
lea rdx, [8*r11]
sub rdx, r11
mov qword ptr [rbp - 72], rdx
add rdi, -9
imul rdi, rax
lea r15, [rax + 4*rax]
lea rsi, [rax + 8*rax]
lea rdx, [r11 + 2*r11]
mov qword ptr [rbp - 56], rdx
lea rdx, [r13 + rdi]
mov qword ptr [rbp - 96], rsi
add rsi, rdx
mov qword ptr [rbp - 112], rsi
test rdi, rdi
mov dword ptr [rbp - 44], r14d
js L2367
lea rsi, [r11 + 8*r11]
mov qword ptr [rbp - 120], rsi
kmovd k1, r14d
mov r14, r13
mov rdi, qword ptr [rbp - 64]
mov qword ptr [rbp - 104], r9
mov qword ptr [rbp - 128], rdx
jmp L220
nop word ptr cs:[rax + rax]
L192: It's not the biggest difference, but you can see all the extra stores and loads from the stack that this PR introduces. The perfectionist in me doesn't like to see regressions like this, a silent slow decay. But it seems to be broken anyway. |
Yes, that is unfortunate. It's probably possible to specialize the optimization to not only happen for types where it's legal (i.e., not Array), but I'm not familiar enough with the codebase to work on that efficiently (again, my prime motivation right now is to get 1.11 in a somewhat better shape, as we haven't been doing well lately). |
It shouldn't be passing arrays to julia> function matmul!(C, A, B)
@turbo for n = indices((C,B), 2), m = indices((C,A),1)
Cmn = zero(eltype(C))
for k = indices((A,B), (2,1))
Cmn += A[m,k]*B[k,n]
end
C[m,n] = Cmn
end
end
matmul! (generic function with 1 method) because My primary point here is that in the typical case, we should only need this decomposition and reconstruction for the For multithreaded code, it will try and store objects in either stack space, or a per-thread scratch space (use of this per-thread space is locked, so we shouldn't get multiple tasks trying to use the same one). I suspect this code may be problematic. Anyway, thanks for the time you have spent on this. My original concern was that we don't use this in any of our products, while you, Jameson, and others are sporadically paying maintenance costs for this ecosystem. It seems to be doing more harm than good. |
Yet another alternative to a4a160f and #523, now reverting the problematic optimization from #261. IMO this is the least controversial option, as it keeps the functionality of LoopVectorization.jl for 1.11 users.
cc @staticfloat
@chriselrod Please approve this PR (or #523) so that at least CI can run. I would like to get either merged ASAP to get PkgEval in a better state.
Fixes #520, re-fixes #518