Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA pipeline for computing APR #185

Open
wants to merge 65 commits into
base: master
Choose a base branch
from
Open

CUDA pipeline for computing APR #185

wants to merge 65 commits into from

Conversation

krzysg
Copy link
Member

@krzysg krzysg commented Aug 23, 2024

No description provided.

krzysg added 30 commits August 1, 2022 14:39
…adding), still float number differences between CPU and GPU
krzysg added 23 commits January 9, 2024 11:27
@cheesema
Copy link
Member

👋🏻 Hey! we will check this out. We should catch up. @joeljonsson @krzysg

@krzysg
Copy link
Member Author

krzysg commented Aug 23, 2024

Hi, currently this is something that works and gives exactly same results as CPU implementation and in the end generates LinearAccess structure on GPU (so it does not support old random or sparse data structures).
As you may expect the most problematic was to achieve this "exactly same" level which means that all floating point computation were supposed to work exactly same on CPU and GPU (so sometimes I needed change a little bit CPU impl.).

Anyway I have added a lot of unit tests comparing CPU and GPU to make sure that all going to be same.

What is not implemented:

  • auto_parameters
  • get_apr_cuda is currently doing same steps what get_apr_cpu in APRConverter, I have not touched other places like APRConverterBatch (not sure if this is working (?)) or other parts of APRConverter like get_lrf method (not even sure what it is doing...).

Currently I will fix one test and maybe cleanup little bit CUDA stuff mostly only by moving things around to better places just to not have mess there.

@cheesema
Copy link
Member

Cool thanks! I think develop is stale, and maybe we should point this at main? Then the PR diff will be a bit more rational?

@cheesema
Copy link
Member

Can't wait to give this a go amazing!

LinearAccess structure on GPU (so it does not support old random or sparse data structures).

Good decision those are all that are needed / suited for the GPU anyhow.

@krzysg krzysg changed the base branch from develop to master October 30, 2024 09:42
@krzysg
Copy link
Member Author

krzysg commented Oct 30, 2024

Cool thanks! I think develop is stale, and maybe we should point this at main?

Sure - as we talked lately I have change target branch to 'master'.

@krzysg
Copy link
Member Author

krzysg commented Nov 5, 2024

Hi, there are two parts of LIS that probably require some explanation:

  1. 'bool boundaryReflect' additional parameter to all calc_sat_mean_*

At some point CPU impl. was change to padd/unpadd pixels before running LIS (according reflect_bc_lis parameter). At first I was trying to avoid it for GPU since I was trying to avoid additional mem allocation for padded pixels. And I managed to do that for 1D. Unfortunately it does not work for 2D or 3D cases (which is obviuos since in CPU impl. cals_sat_mean_* when run also change padded pixels so when you run that in Y-dir it has some influance on running later X-dir and so on).
Anyway I decided to leave it as a option for super fast processing of 1D data (without allocation of additional memory, double mem copy for padd and unpadd). So I decided to upgrade also CPU impl. with that 'feature'.

  1. And now we got second part - as you know I really pay attention (no matter if this is important or not) to correct math and having same results for same data even if some differences would be caused by some minor float operations differences only (like because of different order of execution or so). So in case if I have some date put in Y-dir and when I run calc_sat_mean_y I want to have same results as when I put same data transposed (same date in X-dir) and run calc_sat_mean_x. Unfortunately those functions in CPU impl. where giving little bit different results (+/- some float precision on 6th or 7th digit after decimal point - enough to make me angry ;-) ).
    Since I first managed to have proper behavior on GPU code so I have 'equalized' code on CPU side to GPU side. So the good things because of that are:
    (a) code is still doing same thing that old code (+/- mention float precision) but behaves in a same way in all directions (it is already removed from code but during development I wrote tests comparing outputs of both old/new LIS to check if all is correct).
    (b) naming of variables is same of GPU/CPU side so in case of any changes it is easy to fix/update both sides since they look similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants