Improve static compilation, reduce uses of lubuffer #39

Merged: chriselrod merged 6 commits into main from staticallycompileableldiv on Apr 15, 2024

Conversation

chriselrod
Member

Static compilation and lubuffer aren't currently compatible.

Unfortunately, this PR currently results in a runtime regression. On an M1 Mac, for Float64 matrices, this PR gives:

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 100 x 100
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.249 μs … 40.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.748 μs ± 774.675 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▆▃▄▅▃                ▁▁▄▅▅▃▂▁                     ▂▃▂  ▁▂ ▂
  ████████▁▁▃▁▁▁▁▁▃▄▃▁▁▃██████████▇▅▄▅▅▆▄▄▄▄▄▅▄▄▃▅▃▄▃▅█████▇██ █
  34.2 μs       Histogram: log(frequency) by time      37.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 200 x 200
BenchmarkTools.Trial: 3593 samples with 1 evaluation.
 Range (min … max):  274.204 μs … 299.746 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     277.162 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   277.524 μs ±   2.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▆▇█▇▅▂▁                                                  
  ▂▃▄▆█████████▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▂▂▂▁▂▁▂▂▁▁▂▁▁▁▁▁▂▁▁▁▂▂▂ ▃
  274 μs           Histogram: frequency by time          295 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Main branch:

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 100 x 100
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.750 μs … 30.541 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.948 μs ± 316.293 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▁▆ █ ▇▆▆ ▄ ▁                               ▁ ▁         ▁ ▁ ▂
  ▅███▆█▁███▄█▁██▇▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▆▁▇▆█▁█▁▇▆▅▁▅▁▇▆█▁█ █
  23.8 μs       Histogram: log(frequency) by time        25 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 200 x 200
BenchmarkTools.Trial: 5408 samples with 1 evaluation.
 Range (min … max):  182.289 μs … 207.247 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     182.539 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   183.907 μs ±   3.616 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅  ▄▅                   ▂                          ▃▁        ▁
  ██▄▃██▆▇▇▅▅▃▃▅▆▅▆█▇▆▆▃▄▃▃██▄▄▄▆▄▅▅▅▄▁▃▁▁▁▄▃▁▁▁▃▁▃▄▁▁██▆▄▅▃▁▄▇ █
  182 μs        Histogram: log(frequency) by time        198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
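To put a number on the regression, the medians printed above can be turned into old/new time ratios (below 1 means the PR is slower). This is only a sketch of the arithmetic; the helper name `old_over_new` is made up for illustration, and the inputs are the four medians copied from the M1 output.

```julia
# Median times in microseconds, copied from the M1 benchmark output above.
medians = (
    (size = 100, pr = 34.375, main = 23.875),
    (size = 200, pr = 277.162, main = 182.539),
)

# Hypothetical helper: old (main) time divided by new (PR) time.
# Values below 1 mean the PR is slower than main.
old_over_new(main, pr) = main / pr

for m in medians
    ratio = old_over_new(m.main, m.pr)
    slowdown = 100 * (m.pr / m.main - 1)   # percent increase in runtime
    println(m.size, " x ", m.size, ": old/new = ", round(ratio; digits = 2),
            ", PR slower by about ", round(Int, slowdown), "%")
end
```

By these medians the ratio is roughly 0.69 at 100 x 100 and 0.66 at 200 x 200, i.e. the PR takes about 44% and 52% longer on this machine.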

rdiv! is unaffected, because it does not currently use a buffer.

I'll have to look into different approaches.
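The benchmark calls above do not show how `A`, `B`, and `C` were built. A minimal setup sketch follows, assuming square Float64 matrices with random entries; the diagonal shift on `B` is my own assumption to keep the triangular factor well conditioned, not something stated in this thread. Only the `ldiv!` call itself is taken from the benchmarks.

```julia
using LinearAlgebra
using TriangularSolve  # the package this PR modifies

n = 100                    # problem size, matching the "100 x 100" comment above
A = rand(n, n)             # right-hand side (assumed random)
B = rand(n, n) + n * I     # assumed: diagonal shift for a well-conditioned triangle
C = similar(A)             # output buffer

# Same call as in the benchmarks. Val(false) is copied from the source;
# as far as I understand it is the threading switch.
TriangularSolve.ldiv!(C, LowerTriangular(B), A, Val(false))

# Sanity check against the LinearAlgebra reference solve.
@assert C ≈ LowerTriangular(B) \ A
```

To time it, the same call can be wrapped in `@benchmark` with `$`-interpolated arguments, as in the output above.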

@chriselrod
Member Author

[Image: old_over_new_trisolve_times]
Old times/new times for different square problem sizes. >1 means an improvement, <1 means worse.
This is on an Intel 10980XE, which has AVX512.

The arithmetic mean of these ratios was around 0.92, so this PR seems to be an overall performance regression.

@chriselrod
Member Author

chriselrod commented Apr 12, 2024

100x100, old:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               5.40e+09   33.3%  #  4.0 cycles per ns
┌ instructions             1.33e+10   33.4%  #  2.5 insns per cycle
│ branch-instructions      3.25e+08   33.4%  #  2.4% of insns
└ branch-misses            4.59e+06   33.4%  #  1.4% of branch insns
┌ task-clock               1.34e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    8.68e+08   16.7%  # 24.0% of dcache loads
│ L1-dcache-loads          3.62e+09   16.7%
└ L1-icache-load-misses    2.00e+05   16.7%
┌ dTLB-load-misses         1.56e+02   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               3.63e+09   16.7%
┌ iTLB-load-misses         6.27e+03   33.3%  # 319.4% of iTLB loads
└ iTLB-loads               1.96e+03   33.3%
┌ cache-misses             7.42e+03   33.3%  #  5.5% of cache refs
└ cache-references         1.34e+05   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

New:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               6.40e+09   33.3%  #  3.9 cycles per ns
┌ instructions             1.26e+10   33.4%  #  2.0 insns per cycle
│ branch-instructions      2.57e+08   33.4%  #  2.0% of insns
└ branch-misses            4.68e+06   33.4%  #  1.8% of branch insns
┌ task-clock               1.64e+09  100.0%  #  1.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.19e+09   16.7%  # 28.8% of dcache loads
│ L1-dcache-loads          4.13e+09   16.7%
└ L1-icache-load-misses    2.38e+05   16.7%
┌ dTLB-load-misses         1.06e+03   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               4.14e+09   16.7%
┌ iTLB-load-misses         1.31e+04   33.3%  # 230.8% of iTLB loads
└ iTLB-loads               5.67e+03   33.3%
┌ cache-misses             1.47e+04   33.3%  # 15.7% of cache refs
└ cache-references         9.37e+04   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The old version has much better IPC (2.5 vs 2.0). The number of instructions was comparable (slightly higher in the old version), but because of the lower IPC, the new version's runtime was >20% worse.
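For reference, the IPC and runtime figures follow directly from the counters printed above. The arithmetic, using only those reported values:

```latex
\begin{align*}
\mathrm{IPC}_{\text{old}} &= \frac{1.33 \times 10^{10}}{5.40 \times 10^{9}} \approx 2.46,
&
\mathrm{IPC}_{\text{new}} &= \frac{1.26 \times 10^{10}}{6.40 \times 10^{9}} \approx 1.97,
\\[4pt]
\frac{\text{cycles}_{\text{new}}}{\text{cycles}_{\text{old}}} &= \frac{6.40}{5.40} \approx 1.19,
&
\frac{\text{task-clock}_{\text{new}}}{\text{task-clock}_{\text{old}}} &= \frac{1.64}{1.34} \approx 1.22,
\\[4pt]
\frac{\text{instructions}_{\text{new}}}{\text{instructions}_{\text{old}}} &= \frac{1.26}{1.33} \approx 0.95,
&
\frac{\text{L1-dcache-load-misses}_{\text{new}}}{\text{L1-dcache-load-misses}_{\text{old}}} &= \frac{1.19 \times 10^{9}}{8.68 \times 10^{8}} \approx 1.37.
\end{align*}
```

So the new version executes about 5% fewer instructions but needs about 19% more cycles, and the task-clock ratio of about 1.22 is where the ">20% worse" figure comes from. The roughly 37% increase in L1 data-cache load misses may be part of the explanation, though the counters alone do not establish that.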

@chriselrod
Member Author

Oops, some definite codegen quality bugs here.

@chriselrod
Member Author

chriselrod commented Apr 15, 2024

[Image: old_over_new_lu_times]
RecursiveFactorization.lu! old times / new. Less of a performance difference here.

I think I can optimize the small case more, because the bad code was specifically in the handling of the diagonal blocks (which matters more in the small cases).

@chriselrod
Member Author

This really only exists because of RecursiveFactorization.jl, so without any demonstrations of performance regressions there, this is fine to merge IMO.

This is still like 2x faster than MKL at small sizes.

@chriselrod chriselrod merged commit 64f1b07 into main Apr 15, 2024
16 checks passed
@chriselrod chriselrod deleted the staticallycompileableldiv branch April 15, 2024 15:30