There are applications that require scalable global collective communication (e.g., allreduce for matrix-vector multiplication, as in CG). Currently, such reductions are not efficient in TTG with the reduction terminals because they form a star topology and have no notion of collectiveness. TTG should expand its set of collective operations and could even integrate MPI collectives for scalability. It could look like this:
ttg::Edge<void, double> rin, rout;
auto reduce_tt = ttg::coll::reduce(MPI_COMM_WORLD, rin, rout, 1, MPI_SUM, root); // sum over 1 element of type double
auto producer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(rin));
auto consumer_tt = ttg::make_tt(..., ttg::edges(rout), ...); // may distribute the value further
The input and output edges must have key type void because there can be only one concurrent instance per collective TT. When creating the TT we duplicate the communicator so that multiple collective TTs can exist at the same time. The backend will need a way to suspend the task and check whether the operation has completed, so as not to block the thread in MPI.
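As a rough illustration of that non-blocking completion check (not an existing TTG API; pending_collective, post_reduce and collective_done are hypothetical names), the backend could post the MPI non-blocking collective and then poll its request from the progress engine instead of calling the blocking variant:

#include <mpi.h>

// Hypothetical backend-side state for one in-flight collective.
struct pending_collective {
  MPI_Request req;   // request of the posted non-blocking collective
  void* task;        // handle to the suspended task waiting on it (placeholder)
};

// Called from the body of the collective TT: post a non-blocking reduce, then suspend the task.
void post_reduce(const double* in, double* out, int count, int root,
                 MPI_Comm comm, pending_collective& pc) {
  MPI_Ireduce(in, out, count, MPI_DOUBLE, MPI_SUM, root, comm, &pc.req);
}

// Polled from the backend's progress loop; never blocks the thread inside MPI.
bool collective_done(pending_collective& pc) {
  int done = 0;
  MPI_Test(&pc.req, &done, MPI_STATUS_IGNORE);
  return done != 0;  // if true, the suspended task can be resumed
}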
Straightforward operations to consider:

- Reduce and allreduce
- Broadcast (we have ttg::bcast but it's not using the underlying collective)

There should probably be an overload for std::vector for count > 1 (a possible shape is sketched below).

Would need some more thought on how to describe the difference between input and output count (and a use-case):

- Gather and scatter
- Alltoall
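For the count > 1 case, a std::vector overload could let the value itself carry the count. A minimal sketch of what that might look like, assuming a hypothetical ttg::coll::reduce overload that deduces the element count from the vector size at runtime:

ttg::Edge<void, std::vector<double>> vin, vout;
// count is taken from the vector size; only the op and root need to be passed
auto vreduce_tt = ttg::coll::reduce(MPI_COMM_WORLD, vin, vout, MPI_SUM, root);
auto vproducer_tt = ttg::make_tt(..., ttg::edges(), ttg::edges(vin));
auto vconsumer_tt = ttg::make_tt(..., ttg::edges(vout), ...);

For gather and scatter the input and output counts differ (e.g., a gather at the root receives count * comm_size elements), which is where the input/output count distinction mentioned above would need to be expressed.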