evaluation.tex

\section{Evaluation}
\label{sec:evaluation}

The performance of LAGraph can only be considered in context of an
implementation of the underlying GraphBLAS library.  This is discussed in
Section~\ref{sec:extensions}, followed by performance results of the new
LAGraph API on the 6 algorithms in the GAP Benchmark
\cite{DBLP:conf/sc/BeamerAP12}.

\subsection{SuiteSparse Extensions}
\label{sec:extensions}

In a prior paper (\cite{DBLP:conf/iiswc/AzadABBCDDDDFGG20}), an early draft of
SS:GrB, (SuiteSparse:GraphBLAS v4.0.0, Aug 2020), was compared with the GAP benchmark
\cite{DBLP:conf/sc/BeamerAP12} and four other graph libraries.  This prior
version of SS:GrB included two primary data structures for its sparse matrices:
compressed sparse vector, and a hypersparse variant
\cite{DBLP:conf/ipps/BulucG08}, both held by row or by column.  It included a
draft implementation of a bitmap data structure that could only be used in a
prototype breadth-first search.  Since then, SuiteSparse:GraphBLAS v4.0.3 has
been released, with full support for bitmap and full matrices for all its
operations.  In an $m$-by-$n$ bitmap matrix, the values are held in a full
array of size $mn$, and another \verb'int8_t' array of size $mn$ holds the
sparsity pattern of the matrix.  A full matrix is a simple dense array of size
$mn$.

The bitmap format is particularly important for the ``pull'' phase of an
algorithm, as used in direction-optimizing breadth-first-search
\cite{DBLP:conf/sc/BeamerAP12,DBLP:conf/icpp/YangBO18}.  The GAP benchmark suite uses this method by
holding its frontier as a bitmap in the pull step and as a list in the push
step. The GAP BFS was typically the fastest BFS amongst the 6 graph
libraries compared in \cite{DBLP:conf/iiswc/AzadABBCDDDDFGG20} (for 4 of the 5
benchmark graphs).  With the addition of the bitmap format to SS:GrB,
LAGraph+SS:GrB is able to come within a factor of 2 or so of the performance of
the highly-tuned BFS GAP benchmark (see the results in the next section), for
those 4 graphs.  At the same time, however, the BFS is very easily expressed in
LAGraph as easy-to-read and easy-to-write code.  This enables non-experts to
obtain a reasonably high level of performance with modest programming effort
when writing graph algorithms.

Additional optimizations added to SS:GrB in the past year include a {\em lazy
sort}.  Normally, SS:GrB keeps its vectors sorted (row vectors in a CSR matrix,
or column vectors if the matrix is held by column), with entries sorted in
ascending order of column or row index, respectively.  This simplifies \
algorithms that operate on a \verb'GrB_Matrix'.  However, some algorithms
naturally produce a jumbled result (matrix multiply in particular), while others
are tolerant of jumbled input matrices.  We thus allow the sort to
be left pending.  The lazy sort joins two other kinds of pending work in
SS:GrB: {\em pending tuples} and {\em zombies}~\cite{DBLP:journals/toms/Davis19}.
A pending tuple is an entry
that is held inside a matrix in an unsorted list, awaiting insertion into the
CSR/CSC format of a \verb'GrB_Matrix'.  A zombie is the opposite: it is an
entry in the CSR/CSC format that has been marked for deletion, but has not yet
been deleted from the matrix.  With  lazy sort, the sort is postponed until
another algorithm requires sorted input matrices.  If the sort is lazy enough,
it might never occur, which is the case for the LAGraph BFS and BC.

Positional binary operators have also been added,
such as the $\grbanysecondi$ semiring, %ANY-SECONDI
which makes the BFS much faster.

\subsection{Performance Results}

We ran our benchmarks on an NVIDIA DGX Station (donated to Texas A\&M by
NVIDIA).  It includes a 20-core Intel(R) Xeon(R)
CPU E5-2698 v4 @ 2.20GHz, with 40 threads.  All codes were compiled with gcc
5.4.0 (-O3).  All default settings were used, which means  hyperthreading
was enabled.  The system has 256GB of RAM in a single socket.
%  There are NUMA effects in a single socket system with 20 cores .... so I deleted this parenthetical comment (no NUMA effects).
LAGraph (Feb 15, 2021) and SuiteSparse:GraphBLAS 4.0.4-draft (Feb 15, 2021) were
used.  The NVIDIA DGX Station includes four P100 GPUs, but no GPUs were used by
this experiment (a GPU-accelerated SS:GrB is in progress).
Table~\ref{table:results} lists the run time (in seconds) for the GAP benchmark
and LAGraph+SS:GrB for the 6 algorithms on the 5 benchmark matrices.
The benchmark matrices are listed in Table~\ref{table:matrices}.

\begin{table}
\begin{center}
\begin{tabular}{|l|rrrrr|}
\hline
Algorithm :    &   \multicolumn{5}{c|}{graph, with run time in seconds}  \\
 package       &   Kron    &   Urand   &   Twitter  &  Web    &    Road  \\
\hline
BC   : GAP     &  31.52    &  46.36    &  10.82     &  3.01   &    1.50  \\
% BC : Jan26   &  26.85    &  31.78    &  10.04     &  9.25   &   51.91  \\
% BC : Feb6    &  24.75    &  30.66    &   9.28     &  8.97   &   43.66  \\
% BC : Feb7    &  24.52    &  30.69    &   9.11     &  8.43   &   34.06  \\
BC   : SS      &  23.61    &  32.69    &   9.25     &  8.20   &   34.40  \\     % Feb14
\hline
BFS  : GAP     &    .31    &    .58    &    .22     &   .34   &     .25  \\
%BFS : Jan26   &    .52    &   1.31    &    .33     &   .67   &    3.33  \\
%BFS : Feb6    &    .52    &   1.20    &    .33     &   .66   &    3.36  \\
BFS  : SS      &    .52    &   1.22    &    .33     &   .66   &    3.32  \\     % Feb7
\hline
PR   : GAP     &  19.81    &  25.29    &  15.16     &  5.13   &    1.01  \\
%PR  : Jan26   &  21.96    &  27.75    &  17.22     &  9.30   &    1.34  \\
%PR  : Feb6    &  21.67    &  27.60    &  17.14     &  9.34   &    1.32  \\
PR   : SS      &  22.17    &  27.71    &  17.21     &  9.30   &    1.34  \\     % Feb7
\hline
CC   : GAP     &    .53    &   1.66    &    .23     &   .22   &     .05  \\
%CC  : Jan26   &   3.42    &   4.59    &   1.48     &  1.97   &    1.00  \\
%CC  : Feb6    &   3.35    &   4.56    &   1.47     &  1.96   &     .97  \\
CC   : SS      &   3.36    &   4.47    &   1.47     &  1.97   &     .98  \\     % Feb7
\hline
SSSP : GAP     &   4.91    &   7.23    &   2.02     &   .81   &     .21  \\
%SSSP: Jan26   &  17.62    &  25.62    &   8.44     &  9.67   &   48.49  \\
%SSSP: Feb6    &  17.58    &  25.60    &   8.18     &  9.60   &   48.24  \\
SSSP : SS      &  17.37    &  25.54    &   8.54     &  9.61   &   46.79  \\     % Feb7
\hline
TC   : GAP     & 374.08    &  21.83    &  79.58     & 22.18   &     .03  \\
%TC  : Jan26   & 943.47    &  34.10    & 242.36     & 35.15   &     .29  \\
%TC  : Feb6    & 922.35    &  33.97    & 238.71     & 34.67   &     .23  \\
TC   : SS      & 917.99    &  34.01    & 239.58     & 34.65   &     .23  \\     % Feb7
\hline
\end{tabular}
\caption{Run time of GAP and LAGraph+SS:GrB
\vspace{-2.5ex}
\label{table:results}}
\end{center}
\end{table}

\begin{table}
\begin{center}
\begin{tabular}{|l|rrr|}
\hline
graph   & nodes        & entries in $A$ & graph kind \\
\hline
Kron    & 134,217,726 &  4,223,264,644 &  undirected   \\
Urand   & 134,217,728 &  4,294,966,740 &  undirected   \\
Twitter &  61,578,415 &  1,468,364,884 &  directed     \\
Web     &  50,636,151 &  1,930,292,948 &  directed     \\
Road    &  23,947,347 &     57,708,624 &  directed     \\
\hline
\end{tabular}
\caption{Benchmark matrices\label{table:matrices}
(\url{https://sparse.tamu.edu/GAP})}
\vspace{-2.5ex}
\end{center}
\end{table}

% Notes: TC in SS is SandiaDot method only

With the addition of the bitmap (needed for the pull step), the
push/pull optimization in BC resulted in a nearly 2x performance gain in the
GraphBLAS method for the largest matrices, as compared to the SS:GrB version
used for the results presented in \cite{DBLP:conf/iiswc/AzadABBCDDDDFGG20}.

With this change, the BC method in LAGraph+SS:GrB is not only expressible in a
simple, elegant code, but it is also faster than the highly-tuned GAP benchmark
method, \verb'bc.cc', for the three largest matrices (1.3x for Kron, 1.5x for
Urand, and 1.2x for Twitter).

% Feb7 update:  the sort is completely lazy in BC.

% Feb6 results: the sort partially lazy in BC.

% For the Jan26 results:
% We expect the BC in LAGraph+SS:GrB to become faster still in the next
% release because we have not yet fully exploited the lazy sort.  The frontier
% matrix $\grbm{F}$ is left jumbled by the lazy sort, but it is sorted right away by a
% subsequent assignment.  Most uses of \verb'GrB_assign' are intolerant of
% jumbled input matrices, so it sorts them on input.  However, \verb'GrB_assign'
% includes about 40 internal variations, a few of which do not actually require
% sorted input matrices.  The particular method used in \verb'GrB_assign' in the
% LAGraph BC method is one of those methods, so this would be simple to exploit.
% With this change, the sort would be so lazy that it would {\em never} occur.
% The frontier would be computed, left jumbled, used in subsequent computations,
% and then recomputed (and thus discarded), just as we currently do in the
% LAGraph BFS.

The bitmap format (which makes push/pull optimization
simple to express, and fast) and the $\grbanysecondi$ %ANY-SECONDI
semiring, the BFS of a
directed or undirected graph is easily expressed in GraphBLAS, and has a
performance that is only about 1.5x to 2x slower than the GAP benchmark.  We expect
the remaining performance gap arises from two issues:

\begin{enumerate}
\item
GAP assumes that the graph has fewer than $2^{32}$ nodes and edges, and
thus uses 32-bit integers throughout.  GraphBLAS is written for larger
problems, and thus relies solely on 64-bit integers.  This cannot be easily
changed in GraphBLAS, but rather than ``fixing'' GraphBLAS to use smaller
integers, the GAP benchmark suite should be updated for larger
graphs.  In the current GAP benchmark graphs, two graphs are
chosen with almost exactly 4 billion edges.  Graphs of current interest in
large data science can easily exceed $2^{32}$ nodes and edges \cite{9286235}.

\item In GraphBLAS, the BFS must be expressed as two calls.  The first computes
$\grbv{q} \grbmask{\grbneg \grbv{p}} = \grbv{q}\grbt \grbm{A}$, and the second updates the parent vector,
$\grbv{p} \grbmask{\grbstr{\grbv{q}}} = \grbv{q}$:

{\footnotesize
\begin{verbatim}
  GrB_vxm (q, p, NULL, semiring, q, A, GrB_DESC_RSC) ;
  GrB_assign (p, q, NULL, q, GrB_ALL, n, GrB_DESC_S) ; \end{verbatim}}

In GAP's \verb'bfs.cc', these two steps are fused, and the
matrix-vector multiplication can write its result directly into the parent vector
\verb'p'.  This could be implemented in a future GraphBLAS library, since the
GraphBLAS API allows for a non-blocking mode where work is queued and done
later, thus enabling a fusion of these two steps.  SS:GrB exploits the
non-blocking mode (for its lazy sort, pending tuples, and zombies) but does not
{\em yet} exploit the fusion of \verb'GrB_vxm' and \verb'GrB_assign'.  We
intend to exploit this in the future.
\end{enumerate}

Note that for the Road graph,
LAGraph+SS:GrB is quite slow for all but PageRank (PR).
The primary reason for this is the high diameter of the Road graph
(about 6980).  This requires 6980 iterations of GraphBLAS in the BFS, each with
a tiny amount of work.  Each call to GraphBLAS does several \verb'malloc' and
\verb'free's, and in some cases the workspace must be initialized.  A future
version of SS:GrB is planned that will eliminate this work entirely, by
implementing an internal memory pool.  There may be other overheads, but we
hope that a memory pool, fusion to fully exploit non-blocking mode, and other
optimizations will address this large performance gap for the Road graph for
these algorithms.

LAGraph+SS:GrB is also up to 3x slower than the GAP for the triangle counting
problem (for all but the Road graph, where it is even slower).  This
performance gap can be eliminated entirely in the future, if the \verb'GrB_mxm'
and \verb'GrB_reduce' are combined in a single fused step, by a full
exploitation of the GraphBLAS non-blocking mode.  The current method computes
$\grbm{C} \grbmask{\grbstr{\grbm{L}}} = \grbm{L}\grbm{U}\grbt$, followed by the reduction of $\grbm{C}$ to a single
scalar.  The matrix $\grbm{C}$ is then discarded.  All that GraphBLAS needs is a fused
kernel that does not explicitly instantiate the temporary matrix $\grbm{C}$.
This is permitted by the GraphBLAS C API Specification, but not yet implemented
in SS:GrB.