TFactor: bulkerify GEMV (both MC and GPU) #1214
base: master
Conversation
Force-pushed from 6558780 to bf17fcc
Force-pushed from 37ea75b to 2845339
cscs-ci run, just to check if there are any major problems
LGTM.
Please name new functions in snake case, and rename existing internal functions if possible.
matrix::Matrix<T, Device::GPU> ws_T({nworkspaces * nrefls_step, nrefls_step},
                                    {nrefls_step, nrefls_step});
All the Ts are allocated at scheduling time. Better to reuse them.
Only a few questions, nothing blocking.
Force-pushed from 05dc48d to f83f317
Force-pushed from 0397c9e to d370894
cscs-ci run
Converted to draft in order to prevent merging, since I'm still doing some checks.
Force-pushed from 2f90f23 to d199b88
cscs-ci run
Apart from the fixes committed between the two last CI runs, which certainly fix some problems, one major difference between the CI runs is that the oldest one ran on Eiger while the latest ran on Daint (because of the rebase on master). The oldest run on Eiger had some concerning failures, specifically all the … I will try to investigate why that's the case and why it hasn't happened on Daint.
Force-pushed from bc153ce to 152f8e7
Force-pushed from 152f8e7 to 0d2bd4f
template <Backend backend, Device device, class T>
void computeTFactor(matrix::Panel<Coord::Col, T, device>& hh_panel,
                    matrix::ReadOnlyTileSender<T, Device::CPU> taus,
                    matrix::ReadWriteTileSender<T, device> t,
                    std::vector<matrix::ReadWriteTileSender<T, device>> workspaces,
Very naive question: could you briefly explain why a vector of tile senders is more appropriate here instead of e.g. a round-robin of tiles or a panel?
If I get it correctly, #1214 (comment) is on topic. The main idea is that in this way the caller can manage the workspace by re-using it over multiple calls.
Sorry if I'm not getting the point: can the caller not manage workspaces if the tiles are in a panel?
Answering my own question after a call with @albestro: it is using a round-robin panel for the tile senders; the senders are just accessed by the caller of computeTFactor. From the call we concluded that it may be worth passing the panel itself to computeTFactor and letting it access the senders directly. This may be useful to avoid some splitTiles and unnecessary sender accesses that are currently happening in the caller(s) of computeTFactor. It remains to be seen if it's actually an improvement (in organization and/or performance). Thanks @albestro for the explanation and for looking into this!
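For reference, a minimal sketch of the two interface options discussed above (declarations only; Option A mirrors the signature in this PR with the remaining parameters omitted, while Option B and its workspace_panel parameter are purely illustrative and not actual code in the repository):

```cpp
// Option A (current PR): the caller extracts ReadWrite senders from its
// round-robin workspace panel and passes them explicitly.
template <Backend backend, Device device, class T>
void computeTFactor(matrix::Panel<Coord::Col, T, device>& hh_panel,
                    matrix::ReadOnlyTileSender<T, Device::CPU> taus,
                    matrix::ReadWriteTileSender<T, device> t,
                    std::vector<matrix::ReadWriteTileSender<T, device>> workspaces
                    /* remaining parameters as in the PR omitted */);

// Option B (follow-up idea, hypothetical): pass the workspace panel itself and
// let computeTFactor access (and split) the tile senders it needs directly.
template <Backend backend, Device device, class T>
void computeTFactor(matrix::Panel<Coord::Col, T, device>& hh_panel,
                    matrix::ReadOnlyTileSender<T, Device::CPU> taus,
                    matrix::ReadWriteTileSender<T, device> t,
                    matrix::Panel<Coord::Col, T, device>& workspace_panel);
```

Option B would move the splitTile calls and sender accesses inside computeTFactor, at the cost of coupling its interface to the panel type.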
Force-pushed from 6d6fb9f to 6f05890
Force-pushed from 6f05890 to 3749abc
Force-pushed from 3749abc to ffba9a4
@@ -1063,7 +1075,14 @@ Matrix<T, Device::CPU> ReductionToBand<B, D, T>::call(Matrix<T, D>& mat_a, const
// TODO probably the first one in any panel is ok?
Random thought: I just saw this comment and I'm wondering whether we should have a look and see if we can "merge" this workspace into the other one? @rasolca
In this PR we try to increase the parallelisation of the GEMV step of TFactor, which is by construction serial (all results go into the same block tile), by using workspaces to store partial results and then reducing them before the final TRMV step.
Algorithmically, the main change is that the stepGEMV loop has been replaced with a single stepGEMVAll, and the concept has been applied in a similar way for both backends, MC and GPU: pika::bulk splits the input tiles over multiple tasks, each one stores its partial results in its own workspace, and a single task at the end does the reduction. In order to implement this solution, workspaces for intermediate results have been added (there is another option which does not require any additional workspace nor reduction, and that we might explore for MC in another PR).
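To make the pattern concrete, here is a minimal, self-contained sketch in plain C++ (using std::thread instead of pika::bulk; all names and sizes are illustrative and do not correspond to the actual DLA-Future implementation): each chunk accumulates its GEMV contributions into a private workspace, and a final reduction combines the workspaces into the T block consumed by the TRMV step.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

using Tile = std::vector<double>;  // flattened k x k block, illustrative only

int main() {
  const std::size_t k = 4;         // reflectors in this step (T is k x k)
  const std::size_t ntiles = 16;   // input tiles of the HH panel
  const std::size_t nworkers = 4;  // number of bulk chunks / workspaces

  std::vector<Tile> partial(nworkers, Tile(k * k, 0.0));
  std::vector<std::thread> workers;

  for (std::size_t w = 0; w < nworkers; ++w) {
    workers.emplace_back([&, w] {
      // Each worker owns a disjoint subset of input tiles and accumulates its
      // contributions into its private workspace, so the tasks no longer
      // serialize on a single shared block tile.
      for (std::size_t i = w; i < ntiles; i += nworkers)
        for (std::size_t e = 0; e < k * k; ++e)
          partial[w][e] += 1.0;  // placeholder for the tile-i GEMV update
    });
  }
  for (auto& th : workers)
    th.join();

  // Final reduction: a single task sums the partial workspaces into the
  // T block, which then feeds the TRMV step.
  Tile t_block(k * k, 0.0);
  for (std::size_t w = 0; w < nworkers; ++w)
    for (std::size_t e = 0; e < k * k; ++e)
      t_block[e] += partial[w][e];
}
```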
TODO:
Close #798.
EDIT: since this is conceptually similar to #798 and that PR is going to be closed as soon as this one gets merged, I migrated here the doc fixes that happened there.