Overview
The general strategy for implementing large-count operations is to use datatypes. In some cases, this is straightforward, but it appears to be a very poor solution in the case of v-collectives. In order to use the datatype solution for v-collectives, one has to map (counts[],type) to (newcounts[],newtypes[]), which then requires the w-collective, since only it takes a vector of types.
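For concreteness, here is a minimal sketch of that mapping (illustrative code, not from this ticket; a real implementation such as BigMPI is more careful about chunking counts above INT_MAX and about type extents):

```c
#include <limits.h>
#include <mpi.h>

/* Map (counts[], type) to (newcounts[], newtypes[]): every newcount becomes 1
 * and the per-rank size is carried by the datatype, which forces the use of
 * the w-collective, since only it accepts a vector of types. */
void map_counts_to_types(int nproc, const MPI_Count counts[], MPI_Datatype type,
                         int newcounts[], MPI_Datatype newtypes[])
{
    for (int i = 0; i < nproc; i++) {
        if (counts[i] <= INT_MAX) {
            /* small count: a contiguous type suffices */
            MPI_Type_contiguous((int)counts[i], type, &newtypes[i]);
        } else {
            /* large count: INT_MAX-sized chunks plus a remainder */
            MPI_Aint lb, extent, displs[2];
            MPI_Datatype chunks, rem, pair[2];
            int blocklens[2] = { (int)(counts[i] / INT_MAX), 1 };
            MPI_Type_get_extent(type, &lb, &extent);
            MPI_Type_contiguous(INT_MAX, type, &chunks);
            MPI_Type_contiguous((int)(counts[i] % INT_MAX), type, &rem);
            pair[0] = chunks;
            pair[1] = rem;
            displs[0] = 0;
            displs[1] = (counts[i] / INT_MAX) * INT_MAX * extent;
            MPI_Type_create_struct(2, blocklens, displs, pair, &newtypes[i]);
            MPI_Type_free(&chunks);
            MPI_Type_free(&rem);
        }
        MPI_Type_commit(&newtypes[i]);
        newcounts[i] = 1;
    }
}
```

Even after this mapping, however, the w-collective's int byte displacements remain a problem, which is the issue discussed below.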
In fact, we are in the large-count case even if all of the counts are less than INT_MAX, because of the limitations of the offset vector. If the sum of counts[i] up to any i < comm_size exceeds INT_MAX, then displs[i] will overflow. This means that one cannot use any of the v-collectives even for relatively small data sets, e.g. 3 billion floats, which is only 12 GB per process. This is likely to be limiting when implementing 3D FFTs, matrix transposes and I/O aggregation, all of which are likely to use v-collectives.
The displacement issue is exacerbated when falling back to the w-collective, because there the displacements are interpreted in bytes rather than in units of the datatype extent, so there is no way to index beyond 2 GB of data, irrespective of the datatype and the counts.
Below is an example of the displacement problem. Clearly, *in this specific case*, we could use MPI_SCATTER instead, but homogeneous counts were chosen only to make the example simple and readable. A number of trivial modifications would cause this example to require MPI_SCATTERV.
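A minimal sketch of such an example (illustrative; bignum, the homogeneous counts, and the use of MPI_CHAR are choices made here for readability): all of the counts fit comfortably in an int, yet the int displacement vector overflows.

```c
#include <limits.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t bignum = (size_t)INT_MAX + 1;   /* just past 2 GB worth of chars */
    size_t chunk  = bignum / (size_t)size; /* per-process share */

    char *recvbuf = malloc(chunk);
    char *sendbuf = (rank == 0) ? malloc(chunk * (size_t)size) : NULL;

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        counts[i] = (int)chunk;               /* well below INT_MAX */
        displs[i] = (int)((size_t)i * chunk); /* overflows once i*chunk > INT_MAX */
    }

    MPI_Scatterv(sendbuf, counts, displs, MPI_CHAR,
                 recvbuf, (int)chunk, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

The root's memory footprint here is roughly (1+1./size)*bignum bytes, which is the allocation constraint mentioned in the note below.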
Note that the example above will fail even on a 32-bit system (provided the number of processes is large enough that (1+1./size)*bignum bytes can be allocated).
Using the w-collective for large-count v-collectives has these issues:
Calling the w-collectives requires the allocation and assignment of O(Nproc) vectors, which is tedious but certainly not a memory issue if one is in the large-count regime.
One cannot deallocate the argument vectors until the operation completes, which means that the nonblocking case cannot be implemented this way: there is no opportunity to deallocate the temporary vectors in the wait call, and any solution involving generalized requests is almost certainly untenable for most users.
Because MPI_ALLTOALLW takes displacements of type int and interprets them as byte offsets, irrespective of the extent of the datatype (see page 173 of MPI-3), it is hard to index more than 2 GB of data *using any datatype*. There is a workaround using datatypes that encode the offset internally (e.g. via MPI_Type_create_struct), sketched below, but it is far from user-friendly.
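A sketch of that workaround (the helper make_offset_type is hypothetical, not an MPI or BigMPI function):

```c
#include <mpi.h>

/* Fold a byte offset into the datatype itself via MPI_Type_create_struct, so
 * that the displacement actually passed to MPI_Alltoallw can simply be 0 even
 * when the data lives more than 2 GB into the buffer. */
static void make_offset_type(MPI_Aint byte_offset, int count,
                             MPI_Datatype basetype, MPI_Datatype *newtype)
{
    int blocklen = count;
    MPI_Aint displ = byte_offset;   /* MPI_Aint: may exceed INT_MAX, unlike displs[] */
    MPI_Datatype type = basetype;

    MPI_Type_create_struct(1, &blocklen, &displ, &type, newtype);
    MPI_Type_commit(newtype);
}
```

With such types, one would pass counts of 1 and displacements of 0 to MPI_ALLTOALLW, letting each datatype carry the true offset.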
In the absence of proper support in the MPI standard, the most reasonable implementation of large-count v-collectives uses point-to-point, which means that users must make relatively nontrivial changes to their code to support large counts, or else use something like BigMPI, which already implements these functions (vcollectives_x.c). An RMA-based implementation is also possible, but users are unlikely to accept this suggestion.
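For illustration, a point-to-point large-displacement scatterv might look roughly like the sketch below (this is not BigMPI's actual vcollectives_x.c code; counts are assumed to fit in an int here, so only the displacements are widened to MPI_Aint):

```c
#include <stdlib.h>
#include <mpi.h>

int my_scatterv_x(const void *sendbuf, const int counts[], const MPI_Aint displs[],
                  MPI_Datatype type, void *recvbuf, int recvcount,
                  int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* every rank, including the root, posts its receive */
    MPI_Request rreq;
    MPI_Irecv(recvbuf, recvcount, type, root, 0, comm, &rreq);

    if (rank == root) {
        MPI_Aint lb, extent;
        MPI_Type_get_extent(type, &lb, &extent);
        MPI_Request *sreqs = malloc(size * sizeof(MPI_Request));
        for (int i = 0; i < size; i++) {
            /* displs[] is in elements, as in MPI_Scatterv, but the byte
             * offset is computed in MPI_Aint and so does not overflow */
            const char *ptr = (const char *)sendbuf + displs[i] * extent;
            MPI_Isend(ptr, counts[i], type, i, 0, comm, &sreqs[i]);
        }
        MPI_Waitall(size, sreqs, MPI_STATUSES_IGNORE);
        free(sreqs);
    }

    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    return MPI_SUCCESS;
}
```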
One can also map the v-collectives to MPI_Neighbor_alltoallw, but only in a far-from-efficient manner, and this is not particularly useful for the nonblocking case because MPI_Dist_graph_create_adjacent is blocking.
This ticket proposes two possible solutions to the large-count v-collective problem.
Solution: New Function Prototypes
Adding _x versions of the v-collectives and w-collectives that take counts of type MPI_Count and displacement vectors of type MPI_Aint[] is the most direct solution, and it spares users from having to allocate and fill O(Nproc) vectors in the course of mapping to the most general collective available (e.g. MPI_NEIGHBOR_ALLTOALLW). The C bindings for the proposed new functions are given below; these changes have been made in https://github.com/mpiwg-large-count/mpi-standard/tree/large-count-vector-collectives.
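The exact bindings live in that branch; the following is only a sketch of their general shape, with argument names and ordering copied from the existing MPI_Scatterv and MPI_Alltoallv signatures:

```c
int MPI_Scatterv_x(const void *sendbuf, const MPI_Count sendcounts[],
                   const MPI_Aint displs[], MPI_Datatype sendtype,
                   void *recvbuf, MPI_Count recvcount, MPI_Datatype recvtype,
                   int root, MPI_Comm comm);

int MPI_Alltoallv_x(const void *sendbuf, const MPI_Count sendcounts[],
                    const MPI_Aint sdispls[], MPI_Datatype sendtype,
                    void *recvbuf, const MPI_Count recvcounts[],
                    const MPI_Aint rdispls[], MPI_Datatype recvtype,
                    MPI_Comm comm);
```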
Implementation
I do not think that the implementation of these functions within MPICH will be particularly difficult, but I have not yet started working on it. BigMPI implements many of them already.