Bramich

As of 2023-04-17:

  1. I (Dharv) started a new branch of the codebase that focuses on accelerating Bramich using CUDA. It currently only has CUDA acceleration for the convolution step, but I will eventually add CUDA acceleration for the matrix dot product as well; that should bring runtimes well under ~10 seconds per image.

  2. Now for the nasty details: we use 32x32 thread blocks and calculate grid sizes from that fixed block size and the image dimensions. We need 32x32 blocks because NVIDIA caps the number of threads per block at 1024. If you would like to understand what the code is doing: basically we split the image into blocks of pixels and pass them to the GPU, which convolves the small blocks instead of handling the giant image all at once (a sketch of this setup follows the list). I have run several exhaustive tests, and the differences between the two implementations are on the order of 10**(-7), which can be chalked up to floating-point rounding.

  3. Another benefit of this method is that we can scale the kernel size without affecting the runtime at all, which lets us use as big a kernel as we'd like. OIS currently uses 11x11 because that's the largest kernel you can support before facing drastic slowdowns, but with the convolution and dot product acceleration we can choose as large a kernel as we want (under 32x32). A larger kernel basically helps us pinpoint PSFs more accurately and leads to a subtraction with fewer artifacts.

  4. The tensordot product is by far what ends up taking the most time; once we accelerate that, the pipeline should take less than ~10 seconds per image for Bramich (a schematic of the contraction also follows the list).
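
For anyone curious what the block/grid setup from item 2 looks like in code, here is a minimal sketch using Numba's CUDA JIT. The branch may well use a different CUDA binding, and the names here (conv2d_kernel, conv2d_gpu) are illustrative, not the branch's own:

```python
# Minimal sketch of the tiling described above, using Numba's CUDA JIT.
# Function names are illustrative; image and kern are assumed float64.
import numpy as np
from numba import cuda

BLOCK = 32  # NVIDIA caps threads per block at 1024, so 32x32 is the largest square block


@cuda.jit
def conv2d_kernel(image, kern, out):
    # Each thread computes one output pixel; a 32x32 thread block as a
    # whole covers one 32x32 tile of the image.
    x, y = cuda.grid(2)
    if x >= out.shape[0] or y >= out.shape[1]:
        return
    kh, kw = kern.shape
    acc = 0.0
    for i in range(kh):
        for j in range(kw):
            xi = x + i - kh // 2
            yj = y + j - kw // 2
            # Zero-pad at the image borders.
            if 0 <= xi < image.shape[0] and 0 <= yj < image.shape[1]:
                acc += image[xi, yj] * kern[i, j]
    out[x, y] = acc


def conv2d_gpu(image, kern):
    out = np.empty_like(image)
    # Grid size is derived from the fixed 32x32 block size, rounded up so
    # the grid covers the whole image.
    grid = ((image.shape[0] + BLOCK - 1) // BLOCK,
            (image.shape[1] + BLOCK - 1) // BLOCK)
    conv2d_kernel[grid, (BLOCK, BLOCK)](image, kern, out)
    return out
```

Each thread computes one output pixel, and the grid dimensions are rounded up so the fixed 32x32 blocks cover the whole image.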
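
As for the tensordot in item 4: in Bramich's delta-basis formulation an 11x11 kernel has 121 free coefficients, one per kernel pixel, so the fit contracts a stack of basis-convolved images against itself. The schematic below shows that shape; it is an assumption about the general Bramich setup, not the branch's exact code:

```python
# Schematic of the bottleneck contraction (shapes made up for illustration).
import numpy as np

n_basis, h, w = 121, 256, 256       # 11x11 kernel -> 121 basis images
C = np.random.rand(n_basis, h, w)   # stack of basis-convolved reference images
target = np.random.rand(h, w)       # image to match

# Contract over both pixel axes: M is (n_basis, n_basis), b is (n_basis,).
M = np.tensordot(C, C, axes=((1, 2), (1, 2)))
b = np.tensordot(C, target, axes=((1, 2), (0, 1)))
coeffs = np.linalg.solve(M, b)      # least-squares kernel coefficients
```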

This branch is still under development and should not be pulled into main; I will do so when appropriate. In the meantime, if you want to play with it, you can download the branch and set it up just like you would any other branch. One heads-up: this downgrades the numpy version from 1.24.1 to 1.23.5. That doesn't affect the pipeline at all, but I felt I should let everyone know.

As of 2023-04-19:

  1. Optimized a function in the BramichStrategy class that was using for loops to do matrix multiplication. The loop iterated over a 3D array and took the dot product of one 2D slice with another; I replaced the whole loop with a single np.matmul() call that does the same job (see the before/after sketch at the end of this list).

  2. Runtime with the original 11x11 kernel is down to ~15 seconds per image. To put that in perspective, we started at ~50 seconds without the GPU or these optimizations. I ran a case with a 27x27 kernel, which takes ~75 seconds per image. A 27x27 kernel basically lets us see more detail in the subtraction and reduces artifacts; we can sell it as a 'higher-def subtraction' that can be used on any detected transients.

  3. This is still not complete, and I will keep working on making it faster. At this point most of the time is being sunk into executing numpy commands, so making it any faster will need C++ extensions.
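
The change in item 1 boils down to this before/after pattern (shapes and names are made up for illustration, not BramichStrategy's actual variables):

```python
# Before/after sketch of replacing a slice-by-slice loop with one batched call.
import numpy as np

A = np.random.rand(8, 5, 6)   # 3D stack of 2D slices
B = np.random.rand(8, 6, 7)

# Before: Python-level loop, one 2D dot product per slice.
out_loop = np.empty((8, 5, 7))
for k in range(A.shape[0]):
    out_loop[k] = A[k] @ B[k]

# After: one batched call; np.matmul broadcasts over the leading axis.
out_batched = np.matmul(A, B)

assert np.allclose(out_loop, out_batched)
```

np.matmul treats the leading axis of a 3D array as a batch dimension, so the whole stack of 2D dot products happens in one vectorized call instead of a Python-level loop.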