Effect application in our Brunel-Hakim benchmark with heterogeneous delays is very inefficient when the connectivity matrix is not partitioned (`blocks = 1`). Spike propagation, on the other hand, is very efficient in that case (for large network sizes with many spikes per `dt`).
Here is a figure from the benchmark. Yellow: effect application, red: spike propagation, blue: neurons. The top Brian2CUDA bar shows the 1-block setting.
So it would be great if effect application performed well without connectivity matrix partitioning; then we could just use 1 block and be done with it.
The reason effect application is inefficient is that we use only 1 CUDA block per partition to apply the synaptic effects in the current spike queue. For the `blocks = 1` setting, that means we use only 1 CUDA block in total. But this effect application could easily be parallelized further.
We should choose the number of CUDA blocks per connectivity matrix partition based on the total number of partitions, such that the total number of CUDA blocks is as high as possible while staying below the maximum number of active CUDA blocks per SM. Since the number of synapses/bundles in the current spike queue presumably varies between time steps, it might make sense to read the spike queue sizes (the same way we read the number of spiking neurons) and choose the kernel dimensions accordingly.
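A minimal host-side sketch of such a sizing heuristic, written as plain C++ so it can be tested without a GPU. All names here (`choose_effect_dims`, `max_active_blocks`, `queue_size`, ...) are hypothetical and not Brian2CUDA's actual variables; the occupancy limit would in practice come from something like the CUDA occupancy API rather than being passed in directly:

```cpp
#include <algorithm>
#include <cassert>

struct KernelDims {
    int blocks_per_partition;  // grid blocks assigned to each partition
    int total_blocks;          // overall grid size for the launch
};

// Choose how many CUDA blocks to launch per connectivity matrix partition,
// given the device's limit on resident blocks and the current queue size.
KernelDims choose_effect_dims(int num_partitions,
                              int max_active_blocks,  // occupancy limit over all SMs
                              int queue_size,         // synapses/bundles in current spike queue
                              int threads_per_block) {
    // Upper bound: as many blocks per partition as fit below the occupancy limit.
    int per_partition = std::max(1, max_active_blocks / num_partitions);
    // Don't launch more blocks than there is work in the spike queue.
    int work_blocks = (queue_size + threads_per_block - 1) / threads_per_block;
    per_partition = std::min(per_partition, std::max(1, work_blocks));
    return {per_partition, per_partition * num_partitions};
}
```

For example, with `blocks = 1` (a single partition), a limit of 80 active blocks, and a large spike queue, this launches 80 blocks instead of 1; with 15 partitions and a small queue, it shrinks the grid to match the available work.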
This would likely also benefit smaller networks with variable bundle sizes, since idle threads per bundle (for smaller bundles) would matter less if everything is executed in parallel anyway.