Improve static compilation, reduce uses of lubuffer
#39
100x100, old:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 5.40e+09 33.3% # 4.0 cycles per ns
┌ instructions 1.33e+10 33.4% # 2.5 insns per cycle
│ branch-instructions 3.25e+08 33.4% # 2.4% of insns
└ branch-misses 4.59e+06 33.4% # 1.4% of branch insns
┌ task-clock 1.34e+09 100.0% # 1.3 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 8.68e+08 16.7% # 24.0% of dcache loads
│ L1-dcache-loads 3.62e+09 16.7%
└ L1-icache-load-misses 2.00e+05 16.7%
┌ dTLB-load-misses 1.56e+02 16.7% # 0.0% of dTLB loads
└ dTLB-loads 3.63e+09 16.7%
┌ iTLB-load-misses 6.27e+03 33.3% # 319.4% of iTLB loads
└ iTLB-loads 1.96e+03 33.3%
┌ cache-misses 7.42e+03 33.3% # 5.5% of cache refs
└ cache-references 1.34e+05 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
New:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 6.40e+09 33.3% # 3.9 cycles per ns
┌ instructions 1.26e+10 33.4% # 2.0 insns per cycle
│ branch-instructions 2.57e+08 33.4% # 2.0% of insns
└ branch-misses 4.68e+06 33.4% # 1.8% of branch insns
┌ task-clock 1.64e+09 100.0% # 1.6 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 1.19e+09 16.7% # 28.8% of dcache loads
│ L1-dcache-loads 4.13e+09 16.7%
└ L1-icache-load-misses 2.38e+05 16.7%
┌ dTLB-load-misses 1.06e+03 16.7% # 0.0% of dTLB loads
└ dTLB-loads 4.14e+09 16.7%
┌ iTLB-load-misses 1.31e+04 33.3% # 230.8% of iTLB loads
└ iTLB-loads 5.67e+03 33.3%
┌ cache-misses 1.47e+04 33.3% # 15.7% of cache refs
└ cache-references 9.37e+04 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The old version has much better IPC (2.5 vs 2.0). The instruction count was comparable (slightly higher in the old version), but because of the lower IPC, the new version's runtime was >20% worse.
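For reference, the IPC and runtime figures can be rechecked from the raw counters. This is only a small arithmetic sketch: the numbers are copied from the two `@pstats` runs above, and the names `old`, `new`, and `ipc` are made up for illustration, not part of LinuxPerf or TriangularSolve. It gives roughly 2.46 vs 1.97 instructions per cycle and a slowdown of about 22%.

```julia
# Counter values copied from the two @pstats runs (100x100, old vs. new).
# task-clock is reported in nanoseconds (1.34e9 ns is about 1.3 s).
old = (cycles = 5.40e9, instructions = 1.33e10, task_clock_ns = 1.34e9)
new = (cycles = 6.40e9, instructions = 1.26e10, task_clock_ns = 1.64e9)

# Instructions per cycle for one run (hypothetical helper).
ipc(r) = r.instructions / r.cycles

ipc_old = ipc(old)
ipc_new = ipc(new)

# Relative runtime increase of the new version over the old one.
slowdown = new.task_clock_ns / old.task_clock_ns - 1

# How many more instructions the old version executed.
instr_ratio = old.instructions / new.instructions

println("IPC old: ", round(ipc_old; digits = 2))
println("IPC new: ", round(ipc_new; digits = 2))
println("runtime increase: ", round(100 * slowdown; digits = 1), "%")
println("old/new instruction ratio: ", round(instr_ratio; digits = 3))
```

So the new version executes slightly fewer instructions but retires them more slowly, which is consistent with the higher L1-dcache miss rate in the second run.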
Oops, some definite codegen quality bugs here.
This really only exists because of RecursiveFactorization.jl, so without any demonstrations of performance regressions there, this is fine to merge IMO. This is still like 2x faster than MKL at small sizes.
Static compilation and `lubuffer` aren't currently compatible.
Unfortunately, this currently results in a runtime regression. On an M1 Mac, with this PR, for `Float64` matrices:
Main branch:
`rdiv!` is unaffected, because it is not currently using a buffer.
I'll have to look into different approaches.