
Redesign Multi-threading #159

Open · cortner opened this issue Dec 14, 2022 · 13 comments

@cortner (Member) commented Dec 14, 2022

cc @tjjarvinen

cortner assigned cortner and unassigned cortner Dec 14, 2022

@cortner (Member, Author) commented Dec 14, 2022

I had a request from Noam Bernstein to allow setting the number of threads in each call to forces, etc.

@tjjarvinen (Contributor) commented Dec 14, 2022

Adding comments on the implementation

1

From the Julia documentation:

The value of threadid() may change even within a single iteration.

So using it to index into temporary data is a possible source of bugs!

The correct way to handle temporary data is to use Channels. Create a Channel that holds the temporary data instances. At the beginning of each iteration, take an instance from the Channel; once the task is done, put the instance back into the Channel.

2

Setting the number of threads for each call to forces needs a special implementation. FoldsThreads.jl has a TaskPoolEx executor that can do this.

This means using either Folds.jl or FLoops.jl. The advantage is that several different executors are available, each with different strengths, which opens up several potential use cases. For example, we could use DistributedEx to distribute the force calculation.
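
To make this concrete, here is a minimal sketch of limiting the task count per call with FLoops.jl and FoldsThreads.TaskPoolEx. The forces signature and the site_forces kernel are assumptions for illustration, not this package's actual API:

using FLoops, FoldsThreads

# Sketch only: `site_forces` and this `forces` signature are hypothetical.
function forces(potential, atoms; ntasks = Threads.nthreads())
    F = [zeros(3) for _ in 1:length(atoms)]
    # TaskPoolEx caps the number of tasks used for this particular call
    @floop TaskPoolEx(ntasks = ntasks) for i in 1:length(atoms)
        F[i] = site_forces(potential, atoms, i)   # distinct slots, so no race
    end
    return F
end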

@cortner (Member, Author) commented Dec 14, 2022

On 2: this sounds great, I'm looking forward to learning more.

On 1: this is a bit cryptic for me, so let me see if I understand it correctly. The problem is that a thread could be paused in the middle of executing the body of a for-loop, and then the same thread could start a new iteration of the for-loop body. So now two iterations try to write into the same temporary arrays at the same time.

Correct? This would indeed be entirely unexpected for me, and yes, it could cause plenty of bugs. But I don't like the consequence, which seems to be that I need far more temporary arrays than threads. And we will end up allocating a lot, no?

Are the Channels you are referring to the same as what I call ObjectPools? Or is this something else entirely?

@cortner (Member, Author) commented Dec 14, 2022

I now also think that, if I understood you correctly on 1, my temporary arrays in ObjectPools.jl have the same bug.

@tjjarvinen (Contributor) commented Dec 15, 2022

threadid() was not an issue with earlier Julia versions. It became an issue when the default scheduling was changed from static to dynamic.

Basically, what can happen is:

  • thread 1 starts a task
  • at the beginning the task gets threadid() == 1
  • the task execution is handed to another thread (kernel scheduling, some IO wait time, etc.)
  • thread 2 continues the task
  • thread 1 is given another task
  • the other task also gets threadid() == 1, and now two tasks access the same data

Another problem is that this behaviour can depend on the Julia version, so there is no relying on what threadid() can safely be used for.
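
In code, the problematic pattern looks roughly like this (a minimal sketch with a generic work loop, not code from this package):

# The classic anti-pattern: per-"thread" buffers indexed by threadid().
buffers = [zeros(100) for _ in 1:Threads.nthreads()]

Threads.@threads for i in 1:1000
    # BUG: under dynamic scheduling a task may yield and resume on a
    # different thread, so threadid() is not stable and two tasks can
    # end up writing into the same buffer concurrently.
    buf = buffers[Threads.threadid()]
    fill!(buf, 0.0)
    # ... accumulate work into buf ...
end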

Channels

Channels are a Julia construct for communicating between tasks. They are documented in the asynchronous programming section, but Channels are completely thread-safe and can be used with threads. There is also RemoteChannel for distributed computing.

Here is an example of how to use Channels:

julia> c = Channel(4)
Channel{Any}(4) (empty)

julia> for i in 1:4
         put!(c,i)
       end

julia> Threads.@threads for i in 1:8
           t = take!(c)
           println(i, " ", Threads.threadid(), ",", t)
           put!(c, t)
           sleep(rand())
       end
1 3,3
3 2,2
7 4,4
5 1,1
8 4,3
2 3,2
6 3,4
4 2,1

So allocations are done only once, and only as many instances as there are (worker) threads need to be allocated.

In fact, now that I think about it, we should probably replace ObjectPools.jl with Channels too. It would be more efficient.
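
For concreteness, here is a minimal sketch of such a Channel-backed pool; the with_buffer helper is hypothetical, not an existing ObjectPools.jl API:

# Hypothetical helper: a Channel used as a pool of temporary work arrays.
function with_buffer(f, pool::Channel)
    buf = take!(pool)        # blocks until a buffer is free
    try
        return f(buf)
    finally
        put!(pool, buf)      # always return the buffer, even on error
    end
end

# one buffer per worker thread suffices: at most that many tasks run at once
pool = Channel{Vector{Float64}}(Threads.nthreads())
foreach(_ -> put!(pool, zeros(100)), 1:Threads.nthreads())

Threads.@threads for i in 1:1000
    with_buffer(pool) do buf
        buf .= i             # safe: this task owns `buf` exclusively
    end
end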

@cortner (Member, Author) commented Dec 15, 2022

Thank you for the explanation.

@cortner (Member, Author) commented Dec 15, 2022

Since when has this dynamic scheduling been the default? And is there a way to avoid it? I'll need to send around a warning to people who are using multi-threading with ACE / JuLIP.

EDIT: I see that I simply need to tag new versions with the :static scheduling option inserted. Do you agree this will fix things temporarily?
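
For reference, the temporary fix is just the schedule option on the loop (the schedule argument to @threads is available since Julia 1.5; n and buffers stand in for the surrounding code):

# :static pins each iteration to one thread, so threadid() is constant
# within an iteration and the old buffer-indexing pattern stays safe.
Threads.@threads :static for i in 1:n
    buf = buffers[Threads.threadid()]
    # ... work on buf ...
end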

@tjjarvinen (Contributor) commented

Yes, it will work as a temporary fix.

@tjjarvinen (Contributor) commented

Since when has this dynamic scheduling been the default?

I think it was either v1.6 or v1.7 that introduced dynamic load balancing.

@tjjarvinen (Contributor) commented

I tested the Channel overhead:

julia> using BenchmarkTools

julia> N = 100
100

julia> c = Channel(N)
Channel{Any}(100) (empty)

julia> for i in 1:N; put!(c, rand(3)) end

julia> b = @benchmarkable take!(c) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)

julia> run(b)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  41.000 ns …   1.070 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     83.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.798 μs ± 107.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  41 ns         Histogram: log(frequency) by time       3.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> p = @benchmarkable put!(c, rand(3)) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)

julia> run(p)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   83.000 ns … 14.083 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      84.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   275.830 ns ±  1.441 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  83 ns         Histogram: log(frequency) by time      3.71 μs <

 Memory estimate: 80 bytes, allocs estimate: 1.

To me this sounds like an acceptable overhead.

@tjjarvinen (Contributor) commented Dec 16, 2022

Adding a threaded version of the above:

julia> function f(c, n)
           Threads.@threads for i in 1:n
               t = take!(c)
               put!(c, t)
           end
       end
f (generic function with 1 method)

julia> @benchmark f(c,N)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.875 μs …  1.495 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   56.133 μs ± 45.093 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▅▇▅▄▅▇█▅▄▃▂                                            
  ▁▃▆▆▇████████████▇▇▅▅▄▄▄▅▆▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  15.9 μs         Histogram: frequency by time         158 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 25.
  • 4 threads
  • 100 iterations
  • So divide the time by 25 (100 iterations / 4 threads = 25 per thread)
  • That gives around 1-2 μs latency per take!/put! pair
  • But this is an unrealistic test: no work is done, so the threads are only waiting to access the Channel
  • Dividing by 100 would be more realistic -> ~400 ns latency

@cortner (Member, Author) commented Dec 16, 2022

Thanks for this, I'll think about what a realistic test might look like.

@cortner (Member, Author) commented Dec 28, 2022

Coming back to this, there are three scenarios:

  • expensive, possibly batched MLIPs: that kind of overhead would cost us nothing ...
  • cheap MLIPs targeting 1-10 μs latency, especially for bio: unclear to me?
  • ultra-cheap potentials in the 30-100 ns / atom range: for those this overhead sounds potentially problematic

To test this, why not try it out by multi-threading the evaluation of an EAM or SW potential? That's already in JuLIP. This is the worst-case scenario, and if that scales well then I think we are fine.

In the meantime I'll finish a prototype of the new ACE evaluator, and you can then try it out with that as well?
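
A rough sketch of what such a worst-case test could look like; cheap_site_energy here is a stand-in for an EAM/SW-like kernel, not JuLIP's actual API:

using BenchmarkTools

# placeholder for a very cheap (~100 ns/atom) site-energy kernel
cheap_site_energy(R) = sum(r -> exp(-r), R)

function threaded_energy(Rs)
    E = zeros(length(Rs))
    Threads.@threads for i in eachindex(Rs)
        E[i] = cheap_site_energy(Rs[i])   # each task writes a distinct slot
    end
    return sum(E)
end

serial_energy(Rs) = sum(cheap_site_energy, Rs)

Rs = [rand(30) for _ in 1:10_000]   # fake neighbour distances per atom
@btime serial_energy($Rs)
@btime threaded_energy($Rs)         # overhead shows up if this scales poorly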
