Effect application in our Brunel-Hakim benchmark with heterogeneous delays is very inefficient when the connectivity matrix is not partitioned (`blocks = 1`). Spike propagation, on the other hand, is very efficient in that case (for large network sizes with many spikes per `dt`).
Here is a figure from the benchmark. Yellow: effect application, red: spike propagation, blue: neurons. The top Brian2CUDA bar shows the 1-block setting.
So it would be great if effect application performed well without connectivity matrix partitioning; then we could just use 1 block and be done with it.
The reason effect application is inefficient is that we use only 1 CUDA block per partition to apply the synaptic effects in the current spike queue. For the `blocks = 1` setting, that means we use only 1 CUDA block in total. But this effect application could easily be parallelized further.
We should choose the number of CUDA blocks per connectivity matrix partition based on the total number of partitions, such that the total number of CUDA blocks is as high as possible while staying below the maximum number of active CUDA blocks per SM. Since the number of synapses/bundles in the current spike queue presumably varies between time steps, it might make sense to read the spike queue sizes (the same way we read the number of spiking neurons) and choose the kernel dimensions accordingly.
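A minimal host-side sketch of such a sizing heuristic, written as plain C++ so it can be tested without a GPU. All names here (`choose_effect_dims`, `max_active_blocks`, `queue_size`, ...) are hypothetical and not Brian2CUDA's actual variables; the occupancy limit would in practice come from something like the CUDA occupancy API rather than being passed in directly:

```cpp
#include <algorithm>
#include <cassert>

struct KernelDims {
    int blocks_per_partition;  // grid blocks assigned to each partition
    int total_blocks;          // overall grid size for the launch
};

// Choose how many CUDA blocks to launch per connectivity matrix partition,
// given the device's limit on resident blocks and the current queue size.
KernelDims choose_effect_dims(int num_partitions,
                              int max_active_blocks,  // occupancy limit over all SMs
                              int queue_size,         // synapses/bundles in current spike queue
                              int threads_per_block) {
    // Upper bound: as many blocks per partition as fit below the occupancy limit.
    int per_partition = std::max(1, max_active_blocks / num_partitions);
    // Don't launch more blocks than there is work in the spike queue.
    int work_blocks = (queue_size + threads_per_block - 1) / threads_per_block;
    per_partition = std::min(per_partition, std::max(1, work_blocks));
    return {per_partition, per_partition * num_partitions};
}
```

For example, with `blocks = 1` (a single partition), a limit of 80 active blocks, and a large spike queue, this launches 80 blocks instead of 1; with 15 partitions and a small queue, it shrinks the grid to match the available work.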
This would likely also benefit smaller networks with variable bundle sizes, since idle threads per bundle (for smaller bundles) would matter less if everything is executed in parallel anyway.