Improve static compilation, reduce uses of `lubuffer` #39

chriselrod · 2024-04-12T19:15:31Z

Static compilation and lubuffer aren't currently compatible.

Unfortunately, this currently results in a runtime regression. On an M1 Mac, with this PR, for Float64 matrices:

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 100 x 100
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.249 μs …  40.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.748 μs ± 774.675 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆██▆▃▄▅▃                ▁▁▄▅▅▃▂▁                     ▂▃▂  ▁▂ ▂
  ████████▁▁▃▁▁▁▁▁▃▄▃▁▁▃██████████▇▅▄▅▅▆▄▄▄▄▄▅▄▄▃▅▃▄▃▅█████▇██ █
  34.2 μs       Histogram: log(frequency) by time      37.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 200 x 200
BenchmarkTools.Trial: 3593 samples with 1 evaluation.
 Range (min … max):  274.204 μs … 299.746 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     277.162 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   277.524 μs ±   2.547 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▆▇█▇▅▂▁                                                  
  ▂▃▄▆█████████▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▂▂▂▁▂▁▂▂▁▁▂▁▁▁▁▁▂▁▁▁▂▂▂ ▃
  274 μs           Histogram: frequency by time          295 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Main branch:

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 100 x 100
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.750 μs …  30.541 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.948 μs ± 316.293 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▁▆ █ ▇▆▆ ▄ ▁                               ▁ ▁         ▁ ▁ ▂
  ▅███▆█▁███▄█▁██▇▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▆▁▇▆█▁█▁▇▆▅▁▅▁▇▆█▁█ █
  23.8 μs       Histogram: log(frequency) by time        25 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # 200 x 200
BenchmarkTools.Trial: 5408 samples with 1 evaluation.
 Range (min … max):  182.289 μs … 207.247 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     182.539 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   183.907 μs ±   3.616 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅  ▄▅                   ▂                          ▃▁        ▁
  ██▄▃██▆▇▇▅▅▃▃▅▆▅▆█▇▆▆▃▄▃▃██▄▄▄▆▄▅▅▅▄▁▃▁▁▁▄▃▁▁▁▃▁▃▄▁▁██▆▄▅▃▁▄▇ █
  182 μs        Histogram: log(frequency) by time        198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

rdiv! is unaffected, because it does not currently use a buffer.

I'll have to look into different approaches.
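For readers checking results while trying other approaches, here is a minimal stdlib-only sketch of the reference answer. It assumes that `TriangularSolve.ldiv!(C, LowerTriangular(B), A, Val(false))` writes the solution of `L * C = A` into `C`, which is my reading of the call and not something stated above. The matrix construction (size, `rand`, the diagonal shift) is hypothetical.

```julia
using LinearAlgebra

n = 100                      # matches the 100 x 100 case benchmarked above
A = rand(n, n)               # right-hand side (hypothetical test data)
# Shifting the diagonal by n keeps the triangular system well conditioned (my choice).
B = rand(n, n) + n * I
L = LowerTriangular(B)       # only the lower triangle of B is used

# Reference solve with the standard library: C_ref satisfies L * C_ref ≈ A.
C_ref = L \ A

# Residual check that a candidate implementation's output could be compared against.
residual = norm(L * C_ref - A)
println("residual = ", residual)
```

Comparing a candidate `C` against `C_ref` with `≈` gives a quick correctness check that is independent of the buffer strategy.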

chriselrod · 2024-04-12T22:56:54Z

Ratio of old times to new times for different square problem sizes. >1 means an improvement, <1 means a regression.
This is on an Intel 10980xe, which has AVX512.

Arithmetic mean was around 0.92, so this PR seems to be an overall performance regression.
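As a sketch of how such a summary can be computed: the per-size Intel timings are not listed in this thread, so the snippet below uses the two M1 median timings quoted earlier purely as sample input. The geometric mean is an addition of mine; it is often preferred when averaging ratios, but the figure quoted above is the arithmetic mean.

```julia
using Statistics

# Median timings in μs from the M1 benchmarks earlier in this thread
# (100 x 100 and 200 x 200); used here only as sample input.
old_times = [23.875, 182.539]   # main branch
new_times = [34.375, 277.162]   # this PR

# Ratio > 1 means the PR is faster, < 1 means it is slower.
ratios = old_times ./ new_times

arith_mean = mean(ratios)
# Geometric mean of the ratios (my suggestion, not used in the PR discussion).
geo_mean = exp(mean(log.(ratios)))

println("ratios          = ", round.(ratios; digits = 3))
println("arithmetic mean = ", round(arith_mean; digits = 3))
println("geometric mean  = ", round(geo_mean; digits = 3))
```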

chriselrod · 2024-04-12T23:28:57Z

100x100, old:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               5.40e+09   33.3%  #  4.0 cycles per ns
┌ instructions             1.33e+10   33.4%  #  2.5 insns per cycle
│ branch-instructions      3.25e+08   33.4%  #  2.4% of insns
└ branch-misses            4.59e+06   33.4%  #  1.4% of branch insns
┌ task-clock               1.34e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    8.68e+08   16.7%  # 24.0% of dcache loads
│ L1-dcache-loads          3.62e+09   16.7%
└ L1-icache-load-misses    2.00e+05   16.7%
┌ dTLB-load-misses         1.56e+02   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               3.63e+09   16.7%
┌ iTLB-load-misses         6.27e+03   33.3%  # 319.4% of iTLB loads
└ iTLB-loads               1.96e+03   33.3%
┌ cache-misses             7.42e+03   33.3%  #  5.5% of cache refs
└ cache-references         1.34e+05   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

New:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(TriangularSolve.ldiv!, 100_000, C, LowerTriangular(B), A, Val(false))
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               6.40e+09   33.3%  #  3.9 cycles per ns
┌ instructions             1.26e+10   33.4%  #  2.0 insns per cycle
│ branch-instructions      2.57e+08   33.4%  #  2.0% of insns
└ branch-misses            4.68e+06   33.4%  #  1.8% of branch insns
┌ task-clock               1.64e+09  100.0%  #  1.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.19e+09   16.7%  # 28.8% of dcache loads
│ L1-dcache-loads          4.13e+09   16.7%
└ L1-icache-load-misses    2.38e+05   16.7%
┌ dTLB-load-misses         1.06e+03   16.7%  #  0.0% of dTLB loads
└ dTLB-loads               4.14e+09   16.7%
┌ iTLB-load-misses         1.31e+04   33.3%  # 230.8% of iTLB loads
└ iTLB-loads               5.67e+03   33.3%
┌ cache-misses             1.47e+04   33.3%  # 15.7% of cache refs
└ cache-references         9.37e+04   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The old version has much better IPC (2.5 vs 2.0). The instruction counts were comparable (slightly higher in the old version), so with the lower IPC the new version's runtime was >20% worse.
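The arithmetic behind those figures can be reproduced from the counters in the two reports. This is only a sketch: the values are copied from the tables above, the names `main_stats` and `pr_stats` are mine, and task-clock is treated as nanoseconds (consistent with the "cycles per ns" annotation in the reports).

```julia
# Counter values copied from the two @pstats reports above.
main_stats = (cycles = 5.40e9, instructions = 1.33e10, task_clock_ns = 1.34e9)
pr_stats   = (cycles = 6.40e9, instructions = 1.26e10, task_clock_ns = 1.64e9)

# Instructions per cycle.
ipc(s) = s.instructions / s.cycles

# Relative change of the PR versus main.
instr_ratio = pr_stats.instructions / main_stats.instructions
slowdown    = pr_stats.task_clock_ns / main_stats.task_clock_ns - 1

println("IPC main          = ", round(ipc(main_stats); digits = 2))
println("IPC PR            = ", round(ipc(pr_stats); digits = 2))
println("instruction ratio = ", round(instr_ratio; digits = 3))
println("runtime increase  = ", round(100 * slowdown; digits = 1), "%")
```

Fewer instructions but more cycles is what points at stalls (for example the higher L1 data cache miss rate) and not at extra work.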

chriselrod · 2024-04-13T00:23:45Z

Oops, some definite codegen quality bugs here.

chriselrod · 2024-04-15T05:14:44Z

Ratio of old times to new times for RecursiveFactorization.lu!. There is less of a performance difference here.

I think I can optimize the small case more, because the bad code was specifically in the handling of the diagonal blocks (which matters more in the small cases).

chriselrod · 2024-04-15T15:30:10Z

This really only exists because of RecursiveFactorization.jl, so without any demonstrations of performance regressions there, this is fine to merge IMO.

This is still like 2x faster than MKL at small sizes.

chriselrod added 4 commits April 11, 2024 19:47

work on a statically compileable ldiv implementation

2c81058

Tests pass locally

77a9b0b

Add benchmarks

b7f5935

fix mask in ldiv remainder

0135554

YingboMa approved these changes Apr 12, 2024

chriselrod added 2 commits April 15, 2024 03:40

faster BdivU_small_kern

52532ec

Simplify

b7e9640

chriselrod merged commit 64f1b07 into main Apr 15, 2024
16 checks passed

chriselrod deleted the staticallycompileableldiv branch April 15, 2024 15:30
