
Redesign Multi-threading #159

Open · cortner opened this issue Dec 14, 2022 · 13 comments

@cortner (Member) commented Dec 14, 2022

cc @tjjarvinen

cortner assigned cortner and unassigned cortner Dec 14, 2022

@cortner (Member, Author) commented Dec 14, 2022

I had a request from Noam Bernstein to allow setting the number of threads in each call to forces, etc.

@tjjarvinen (Contributor) commented Dec 14, 2022

Adding comments on the implementation

1

From the Julia documentation:

The value of threadid() may change even within a single iteration.

So using it to index into temporary data is a possible source of bugs!

The correct way to handle temporary data is to use Channels. Create a Channel that holds the temporary data instances. At the beginning of each iteration, take an instance from the Channel; once the task is done, put the instance back into the Channel.

2

Setting the number of threads for each call to forces needs a special implementation. FoldsThreads.jl has a TaskPoolEx executor that can do this.

This means using either Folds.jl or FLoops.jl. The advantage is that several different executors are available, each with different strengths, which opens up several potential use cases. For example, we could use DistributedEx to distribute the force calculation.
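
To make this concrete, here is a minimal sketch of limiting the task count per call with FLoops.jl and FoldsThreads.TaskPoolEx. The forces signature and the site_forces kernel are assumptions for illustration, not this package's actual API:

using FLoops, FoldsThreads

# Sketch only: `site_forces` and this `forces` signature are hypothetical.
function forces(potential, atoms; ntasks = Threads.nthreads())
    F = [zeros(3) for _ in 1:length(atoms)]
    # TaskPoolEx caps the number of tasks used for this particular call
    @floop TaskPoolEx(ntasks = ntasks) for i in 1:length(atoms)
        F[i] = site_forces(potential, atoms, i)   # distinct slots, so no race
    end
    return F
end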

@cortner (Member, Author) commented Dec 14, 2022

On 2: this sounds great, I'm looking forward to learning more.

On 1: this is a bit cryptic for me, so let me see if I understand it correctly. The problem is that a thread could be paused in the middle of executing the body of a for-loop, and then the same thread could start a new iteration of the for-loop body. So now two iterations try to write into the same temporary arrays at the same time.

Correct? This would indeed be entirely unexpected for me, and yes, it could cause plenty of bugs. But I don't like the consequence, which seems to be that I need far more temporary arrays than threads. And we will end up allocating a lot, no?

Are the Channels you are referring to the same as what I call ObjectPools? Or is this something else entirely?

@cortner (Member, Author) commented Dec 14, 2022

I now also think that, if I understood you correctly on 1, my temporary arrays in ObjectPools.jl have the same bug.

@tjjarvinen (Contributor) commented Dec 15, 2022

threadid() was not an issue with earlier Julia versions. It became an issue when the default scheduling was changed from static to dynamic.

Basically, what can happen is:

  • thread 1 starts a task
  • at the beginning the task gets threadid() == 1
  • the task execution is handed to another thread (kernel scheduling, some IO wait time, etc.)
  • thread 2 continues the task
  • thread 1 is given another task
  • the other task also gets threadid() == 1, and now two tasks access the same data

Another problem is that this behaviour can depend on the Julia version, so there is no relying on what threadid() can safely be used for.
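
In code, the problematic pattern looks roughly like this (a minimal sketch with a generic work loop, not code from this package):

# The classic anti-pattern: per-"thread" buffers indexed by threadid().
buffers = [zeros(100) for _ in 1:Threads.nthreads()]

Threads.@threads for i in 1:1000
    # BUG: under dynamic scheduling a task may yield and resume on a
    # different thread, so threadid() is not stable and two tasks can
    # end up writing into the same buffer concurrently.
    buf = buffers[Threads.threadid()]
    fill!(buf, 0.0)
    # ... accumulate work into buf ...
end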

Channels

Channels are a Julia construct for communicating between tasks. They are documented in the asynchronous programming section, but Channels are completely thread-safe and can be used with threads. There is also RemoteChannel for distributed computing.

Here is an example of how to use Channels:

julia> c = Channel(4)
Channel{Any}(4) (empty)

julia> for i in 1:4
         put!(c,i)
       end

julia> Threads.@threads for i in 1:8
           t = take!(c)
           println(i, " ", Threads.threadid(), ",", t)
           put!(c, t)
           sleep(rand())
       end
1 3,3
3 2,2
7 4,4
5 1,1
8 4,3
2 3,2
6 3,4
4 2,1

So allocations are done only once, and only as many instances as there are (worker) threads need to be allocated.

In fact, now that I think about it, we should probably replace ObjectPools.jl with Channels too. It would be more efficient.
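
For concreteness, here is a minimal sketch of such a Channel-backed pool; the with_buffer helper is hypothetical, not an existing ObjectPools.jl API:

# Hypothetical helper: a Channel used as a pool of temporary work arrays.
function with_buffer(f, pool::Channel)
    buf = take!(pool)        # blocks until a buffer is free
    try
        return f(buf)
    finally
        put!(pool, buf)      # always return the buffer, even on error
    end
end

# one buffer per worker thread suffices: at most that many tasks run at once
pool = Channel{Vector{Float64}}(Threads.nthreads())
foreach(_ -> put!(pool, zeros(100)), 1:Threads.nthreads())

Threads.@threads for i in 1:1000
    with_buffer(pool) do buf
        buf .= i             # safe: this task owns `buf` exclusively
    end
end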

@cortner (Member, Author) commented Dec 15, 2022

Thank you for the explanation.

@cortner (Member, Author) commented Dec 15, 2022

Since when has this dynamic scheduling been the default? And is there a way to avoid it? I'll need to send around a warning to people who are using multi-threading with ACE / JuLIP.

EDIT: I see that I simply need to tag new versions with the :static scheduling option inserted. Do you agree this will fix things temporarily?
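
For reference, the temporary fix is just the schedule option on the loop (the schedule argument to @threads is available since Julia 1.5; n and buffers stand in for the surrounding code):

# :static pins each iteration to one thread, so threadid() is constant
# within an iteration and the old buffer-indexing pattern stays safe.
Threads.@threads :static for i in 1:n
    buf = buffers[Threads.threadid()]
    # ... work on buf ...
end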

@tjjarvinen (Contributor) commented

Yes, it will work as a temporary fix.

@tjjarvinen (Contributor) commented

Since when has this dynamic scheduling been the default?

I think it was either v1.6 or v1.7 that introduced dynamic load balancing.

@tjjarvinen (Contributor) commented

I tested the Channel overhead:

julia> using BenchmarkTools

julia> N = 100
100

julia> c = Channel(N)
Channel{Any}(100) (empty)

julia> for i in 1:N; put!(c, rand(3)) end

julia> b = @benchmarkable take!(c) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)

julia> run(b)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  41.000 ns …   1.070 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     83.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.798 μs ± 107.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  41 ns         Histogram: log(frequency) by time       3.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> p = @benchmarkable put!(c, rand(3)) samples=N
Benchmark(evals=1, seconds=5.0, samples=100)

julia> run(p)
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   83.000 ns … 14.083 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      84.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   275.830 ns ±  1.441 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  83 ns         Histogram: log(frequency) by time      3.71 μs <

 Memory estimate: 80 bytes, allocs estimate: 1.

To me this sounds like an acceptable overhead.

@tjjarvinen (Contributor) commented Dec 16, 2022

Adding a threaded version of the above:

julia> function f(c, n)
           Threads.@threads for i in 1:n
               t = take!(c)
               put!(c, t)
           end
       end
f (generic function with 1 method)

julia> @benchmark f(c,N)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.875 μs …  1.495 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.417 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   56.133 μs ± 45.093 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▅▇▅▄▅▇█▅▄▃▂                                            
  ▁▃▆▆▇████████████▇▇▅▅▄▄▄▅▆▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  15.9 μs         Histogram: frequency by time         158 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 25.
  • 4 threads
  • 100 iterations
  • So divide the time by 25 (100 iterations / 4 threads = 25 per thread)
  • That gives around 1-2 μs latency per take!/put! pair
  • But this is an unrealistic test: no work is done, so the threads are only waiting to access the Channel
  • Dividing by 100 would be more realistic -> ~400 ns latency

@cortner (Member, Author) commented Dec 16, 2022

Thanks for this, I'll think about what a realistic test might look like.

@cortner (Member, Author) commented Dec 28, 2022

Coming back to this, there are three scenarios:

  • expensive, possibly batched MLIPs: that kind of overhead would cost us nothing ...
  • cheap MLIPs targeting 1-10 μs latency, especially for bio: unclear to me?
  • ultra-cheap potentials in the 30-100 ns / atom range: for those this overhead sounds potentially problematic

To test this, why not try it out by multi-threading the evaluation of an EAM or SW potential? That's already in JuLIP. This is the worst-case scenario, and if that scales well then I think we are fine.

In the meantime I'll finish a prototype of the new ACE evaluator, and you can then try it out with that as well?
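
A rough sketch of what such a worst-case test could look like; cheap_site_energy here is a stand-in for an EAM/SW-like kernel, not JuLIP's actual API:

using BenchmarkTools

# placeholder for a very cheap (~100 ns/atom) site-energy kernel
cheap_site_energy(R) = sum(r -> exp(-r), R)

function threaded_energy(Rs)
    E = zeros(length(Rs))
    Threads.@threads for i in eachindex(Rs)
        E[i] = cheap_site_energy(Rs[i])   # each task writes a distinct slot
    end
    return sum(E)
end

serial_energy(Rs) = sum(cheap_site_energy, Rs)

Rs = [rand(30) for _ in 1:10_000]   # fake neighbour distances per atom
@btime serial_energy($Rs)
@btime threaded_energy($Rs)         # overhead shows up if this scales poorly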
