There are applications that require scalable global collective communication (e.g., allreduce for matrix-vector multiplication, as in CG). Currently, such reductions are not efficient in TTG with the reduction terminals because they form a star topology and have no notion of collectiveness. TTG should expand its set of collective operations and could even integrate MPI collectives for scalability. It could look like this:
ttg::Edge<void, double> rin, rout;
auto reduce_tt = ttg::coll::reduce(MPI_COMM_WORLD, rin, rout, 1, MPI_SUM, root); // sum over 1 element of type double
auto producer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(rin));
auto consumer_tt = ttg::make_tt(..., ttg::edges(rout), ...); // may distribute the value further
The input and output edges must have key type void because there can be only one concurrent instance per collective TT. When creating the TT we duplicate the communicator so that multiple collective TTs can exist at the same time. The backend will need a way to suspend the task and check whether the operation has completed, so as not to block the thread in MPI.
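As a rough illustration of that non-blocking completion check (not an existing TTG API; pending_collective, post_reduce and collective_done are hypothetical names), the backend could post the MPI non-blocking collective and then poll its request from the progress engine instead of calling the blocking variant:

#include <mpi.h>

// Hypothetical backend-side state for one in-flight collective.
struct pending_collective {
  MPI_Request req;   // request of the posted non-blocking collective
  void* task;        // handle to the suspended task waiting on it (placeholder)
};

// Called from the body of the collective TT: post a non-blocking reduce, then suspend the task.
void post_reduce(const double* in, double* out, int count, int root,
                 MPI_Comm comm, pending_collective& pc) {
  MPI_Ireduce(in, out, count, MPI_DOUBLE, MPI_SUM, root, comm, &pc.req);
}

// Polled from the backend's progress loop; never blocks the thread inside MPI.
bool collective_done(pending_collective& pc) {
  int done = 0;
  MPI_Test(&pc.req, &done, MPI_STATUS_IGNORE);
  return done != 0;  // if true, the suspended task can be resumed
}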
Straightforward operations to consider:

- Reduce and allreduce
- Broadcast (we have ttg::bcast but it's not using the underlying collective)

There should probably be an overload for std::vector for count > 1 (a possible shape is sketched below).

Would need some more thought on how to describe the difference between input and output count (and a use-case):

- Gather and scatter
- Alltoall
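For the count > 1 case, a std::vector overload could let the value itself carry the count. A minimal sketch of what that might look like, assuming a hypothetical ttg::coll::reduce overload that deduces the element count from the vector size at runtime:

ttg::Edge<void, std::vector<double>> vin, vout;
// count is taken from the vector size; only the op and root need to be passed
auto vreduce_tt = ttg::coll::reduce(MPI_COMM_WORLD, vin, vout, MPI_SUM, root);
auto vproducer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(vin));
auto vconsumer_tt = ttg::make_tt(..., ttg::edges(vout), ...);

For gather and scatter the input and output counts differ (e.g., a gather at the root receives count * comm_size elements), which is where the input/output count distinction mentioned above would need to be expressed.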