Optimized transformations by transposing and then multiplying in place #424
base: develop
@pomerlef I'm trying to reproduce the issue locally and on internal CI, but I haven't been able to. On which platform are the Jenkins tests failing?
All.
It seems to be related to 2D transformations. These are all binary tests:
I quickly looked into the issue, and the problem appeared because the results are slightly different with the new method for transforming features. I'm not sure whether this is because:
This issue only pops up on my PC, with a fork of libpointmatcher, when I set the epsilon for error checking to 1e-13. On libpointmatcher from this repo (upstream), it happens with 1e-8. I'm looking into this.
@YoshuaNava could you have a look and resolve the conflicts?
@YoshuaNava could you verify the conflict?
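For readers following the epsilon discussion, here is a minimal sketch of a tolerance-based comparison, assuming the check boils down to a maximum element-wise absolute difference. The helper names `maxAbsDiff` and `approxEqual` are hypothetical and are not taken from libpointmatcher's test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest element-wise absolute difference
// between two buffers of the same size.
double maxAbsDiff(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical helper: true when every element differs by at most epsilon.
bool approxEqual(const std::vector<double>& a,
                 const std::vector<double>& b,
                 double epsilon) {
    return maxAbsDiff(a, b) <= epsilon;
}
```

With a check of this shape, two results that differ only through floating-point rounding can pass at a loose threshold such as 1e-8 and still fail at a tight one such as 1e-13.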
After merging #419 I spent some extra time finding out whether we could further optimize the transformation of descriptors.
Based on that I implemented some snippets:
To benchmark the features transformation: https://godbolt.org/z/ehWfKG
To benchmark the descriptors rotation: https://godbolt.org/z/7xrnjW
I found out that for matrices like the ones we are processing (MxN, with M fairly small, in [1, 10], and N quite large), it's better to transpose first and then apply on the left. Eigen seems to generate faster code when we do this.
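As a rough illustration of the algebra behind this, the sketch below checks that multiplying the transposed features and transposing back gives the same result as the direct product, up to floating-point rounding. It is plain C++ without Eigen so that it compiles standalone. The `Mat` type and the function names are made up for this example, it is not the code from this PR, and it says nothing about the speed-up measured in the benchmarks.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal column-major matrix (hypothetical type, mimicking Eigen's default layout).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[j * rows + i]; }
    double operator()(std::size_t i, std::size_t j) const { return data[j * rows + i]; }
};

// Returns a new matrix holding the transpose of a.
Mat transpose(const Mat& a) {
    Mat t(a.cols, a.rows);
    for (std::size_t j = 0; j < a.cols; ++j)
        for (std::size_t i = 0; i < a.rows; ++i)
            t(j, i) = a(i, j);
    return t;
}

// Plain matrix product a * b.
Mat multiply(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat c(a.rows, b.cols);
    for (std::size_t j = 0; j < b.cols; ++j)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t i = 0; i < a.rows; ++i)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}

// Direct form: T * F, with F holding one point per column.
Mat transformDirect(const Mat& T, const Mat& F) {
    return multiply(T, F);
}

// Transposed form: (F^T * T^T)^T, which equals T * F mathematically.
Mat transformViaTranspose(const Mat& T, const Mat& F) {
    return transpose(multiply(transpose(F), transpose(T)));
}

// Largest element-wise absolute difference between two same-sized matrices.
double maxAbsDiff(const Mat& a, const Mat& b) {
    assert(a.rows == b.rows && a.cols == b.cols);
    double worst = 0.0;
    for (std::size_t k = 0; k < a.data.size(); ++k)
        worst = std::max(worst, std::fabs(a.data[k] - b.data[k]));
    return worst;
}
```

Calling `maxAbsDiff(transformDirect(T, F), transformViaTranspose(T, F))` on a homogeneous transform `T` and a feature matrix `F` should give a value at or near zero. Any nonzero residue comes from the different order of floating-point operations.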
Compared to what we had before:
The proportion of time spent in each operation is still similar to the current one (30-70%), but the absolute time taken by the functions to process the data is shorter.
My guess as to why this happens is that when we transpose we might be loading the matrix into L2/L3 cache (even though we don't enforce in-time evaluation in Eigen), and when the compiler sees applyOnTheRight, it optimizes for an in-place operation on a matrix that is predominantly column-based.
Something curious I found is that with compiler explorer you can try out different compilers, and the code runs just a bit faster with icc.