Expand and update examples, improve spmm performance (#89)
Changes to examples
* The total least squares examples now run the classical algorithm in
addition to the randomized algorithm. We compare runtime and solution
quality for the two methods.
* I've added three examples for low-rank approximation of sparse
matrices.
   * QB-based SVD of randomly generated synthetic test matrices (rank one
   plus noise).
   * QB-based SVD of any sparse matrix in Matrix Market format.
   * Low-rank QRCP of a sparse matrix in Matrix Market format.

Changes to sparse matrix functionality 
* I've added two kernel implementations for SPMM of row-major data. This
was needed to address cache inefficiencies revealed in the existing SPMM
kernels during low-rank QRCP benchmarking.
* I've removed the requirement that ``A.reserve((int64_t) 10)`` be able to
execute for a matrix ``A`` that's compatible with the ``SpMatrix`` concept.
Apparently that requirement couldn't be satisfied by sparse matrices marked
as ``const``. (A minimal illustration of the issue follows.)
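
For context, here is a minimal, self-contained illustration of the underlying C++ issue (the ``ToyCsr`` type and ``CanReserve`` concept below are hypothetical, not RandBLAS code): a requirement that calls a non-``const`` member function can't be satisfied by a ``const``-qualified matrix type.

```c++
#include <cstdint>

// Hypothetical stand-in for a sparse matrix type.
struct ToyCsr {
    void reserve(int64_t nnz) { /* allocate room for nnz structural nonzeros */ }
};

// Hypothetical concept mirroring the requirement that was removed.
template <typename SpMat>
concept CanReserve = requires(SpMat A) {
    { A.reserve((int64_t) 10) };
};

static_assert(CanReserve<ToyCsr>);         // satisfied
static_assert(!CanReserve<const ToyCsr>);  // a const matrix can't call a non-const reserve()
```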

I've also added two files of developer notes: ``RandBLAS/DevNotes.md``
and ``RandBLAS/sparse_data/DevNotes.md``.
rileyjmurray authored May 30, 2024
1 parent 21700fd commit 9d0a03f
Showing 18 changed files with 1,695 additions and 370 deletions.
39 changes: 39 additions & 0 deletions RandBLAS/DevNotes.md
@@ -0,0 +1,39 @@
# Developer Notes for RandBLAS

This file has discussions of RandBLAS' implementation that aren't (currently) suitable
for the RandBLAS User Guide.

## What's where?

* Our basic random number generation is handled by [Random123](https://github.com/DEShawResearch/random123).
We have small wrappers around Random123 code in ``RandBLAS/base.hh`` and ``RandBLAS/random_gen.hh``.

* ``RandBLAS/dense_skops.hh`` has code for representing and sampling dense sketching operators.
The sampling code is complicated because it supports multi-threaded (yet threading invariant!)
random (sub)matrix generation.

* ``RandBLAS/sparse_skops.hh`` has code for representing and sampling sparse sketching operators.
The sampling code has a customized method for repeatedly sampling from an index set without
replacement, which is needed to quickly generate the structures used in statistically reliable
sparse sketching operators.

* [BLAS++ (aka blaspp)](https://github.com/icl-utk-edu/blaspp) is our portability layer for BLAS.
We actually use very few functions in BLAS at the time of writing (GEMM, GEMV, SCAL, COPY, and
AXPY), but we use its enumerations _everywhere_. Fast GEMM is important for sketching dense
data with dense operators.

* The ``sketch_general`` functions in ``RandBLAS/skge.hh`` are the main entry point for sketching dense data.
These functions are small wrappers around functions with more BLAS-like names:
* ``lskge3`` and ``rskge3`` in ``RandBLAS/skge3_to_gemm.hh``.
* ``lskges`` and ``rskges`` in ``RandBLAS/skges_to_spmm.hh``.
The former pair of functions are just fancy wrappers around GEMM.
The latter pair of functions trigger a far more opaque call sequence, since they rely on sparse
matrix operations.

* There is no widely accepted standard for sparse BLAS operations. This is a bummer because
sparse matrices are super important in data science and scientific computing. In view of this,
RandBLAS provides its own abstractions for sparse matrices (CSC, CSR, and COO formats).
The abstractions can either own their associated data or just wrap existing data (say, data
attached to a sparse matrix in Eigen). RandBLAS has reasonably flexible and high-performance code
for multiplying a sparse matrix and a dense matrix. All code related to sparse matrices is in
``RandBLAS/sparse_data``. See that folder's ``DevNotes.md`` file for details.
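
To make the own-vs-view distinction concrete, here is a deliberately simplified sketch (not RandBLAS' actual classes, whose members and constructors differ) of a CSR type that can either allocate its buffers or alias buffers owned elsewhere:

```c++
#include <cstdint>

// Simplified illustration only; RandBLAS' real CSRMatrix has a different interface.
template <typename T>
struct ToyCsrMatrix {
    int64_t n_rows, n_cols, nnz;
    bool own_memory;
    T       *vals;     // length nnz
    int64_t *rowptr;   // length n_rows + 1
    int64_t *colidxs;  // length nnz

    // Owning constructor: this object allocates (and later frees) its buffers.
    ToyCsrMatrix(int64_t m, int64_t n, int64_t nnz_)
        : n_rows(m), n_cols(n), nnz(nnz_), own_memory(true),
          vals(new T[nnz_]), rowptr(new int64_t[m + 1]()), colidxs(new int64_t[nnz_]) {}

    // Viewing constructor: wraps buffers owned by someone else (e.g., an Eigen sparse matrix).
    ToyCsrMatrix(int64_t m, int64_t n, int64_t nnz_, T *v, int64_t *rp, int64_t *ci)
        : n_rows(m), n_cols(n), nnz(nnz_), own_memory(false),
          vals(v), rowptr(rp), colidxs(ci) {}

    ~ToyCsrMatrix() {
        if (own_memory) { delete [] vals; delete [] rowptr; delete [] colidxs; }
    }
};
```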
76 changes: 76 additions & 0 deletions RandBLAS/sparse_data/DevNotes.md
@@ -0,0 +1,76 @@
# Developer Notes for RandBLAS' sparse matrix functionality

RandBLAS provides abstractions for CSC, CSR, and COO-format sparse matrices.
The following functions use these abstractions:

* ``left_spmm``, which computes a product of a sparse matrix and a dense matrix when the sparse matrix
is the left operand. This function is GEMM-like, in that it allows offsets and transposition flags
for either argument.
* ``right_spmm``, which is analogous to ``left_spmm`` when the sparse matrix is the right operand.
* ``sketch_general``, when called with a SparseSkOp object.
* ``sketch_sparse``, when called with a DenseSkOp object.

Each of those functions is merely a _dispatcher_ of other (lower level) functions. See below for details on
how the dispatching works.

## ``left_spmm`` and ``right_spmm``

These functions are implemented in ``RandBLAS/sparse_data/spmm_dispatch.hh``.

``right_spmm`` is implemented by falling back on ``left_spmm`` with transformed
values for ``opA, opB`` and ``layout``.
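
Here's the identity this fallback relies on (a schematic sketch; the argument handling below is illustrative, not the literal signature):

```c++
// right_spmm performs a GEMM-like update with the sparse matrix A as the RIGHT factor:
//
//     C := alpha * op(B) * op(A) + beta * C.
//
// Transposing both sides gives
//
//     C^T := alpha * op(A)^T * op(B)^T + beta * C^T,
//
// and a dense buffer read in the opposite memory layout *is* the transpose of that matrix.
// So right_spmm flips the layout flags for B and C, toggles the transposition flags opA and
// opB, swaps the roles of the output's row and column counts, and calls left_spmm, which now
// sees the sparse matrix as the LEFT factor.
```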
Here's what happens if ``left_spmm`` is called with a sparse matrix ``A``, a dense input matrix ``B``, and a dense output matrix ``C``.

1. If needed, transposition of ``A`` is resolved by creating a lightweight object for the transpose
called ``At``. This object is just a tool for us to change how we interpret the buffers that underlie ``A``
(see the small illustration after this list).
* If ``A`` is COO, then ``At`` will also be COO.
* If ``A`` is CSR, then ``At`` will be CSC.
* If ``A`` is CSC, then ``At`` will be CSR.

We make a recursive call to ``left_spmm`` once we have our hands on ``At``, so
the rest of ``left_spmm``'s logic only needs to handle un-transposed ``A``.

2. A memory layout is determined for how we'll read ``B`` in the low-level
sparse matrix multiplication kernels.
* If ``B`` is un-transposed then we'll use the same layout as ``C``.
    * If ``B`` is transposed then we'll swap its declared dimensions
      (i.e., we'll swap its reported numbers of rows and columns) and
      we'll tell the kernel to read it in the opposite layout from ``C``.

3. We dispatch a kernel from ``coo_spmm_impl.hh``, ``csc_spmm_impl.hh``,
or ``csr_spmm_impl.hh``. The precise kernel depends on the type of ``A``, the inferred layout for ``B``, and the declared layout for ``C``.
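
Here's a small, self-contained illustration of the buffer reinterpretation from step 1 (illustrative code, not RandBLAS' actual transpose helper):

```c++
#include <cstdint>
#include <cstdio>

int main() {
    // A = [[1, 0, 2],
    //      [0, 3, 0]]  stored in CSR form with zero-based indices.
    double  vals[]    = {1.0, 2.0, 3.0};
    int64_t colidxs[] = {0, 2, 1};
    int64_t rowptr[]  = {0, 2, 3};  // A has 2 rows

    // Reinterpret the same three buffers as a CSC matrix "At" with 3 rows and 2 columns:
    // At.colptr = A.rowptr, At.rowidxs = A.colidxs, At.vals = A.vals. Column j of At then
    // holds exactly the entries of row j of A, so At = A^T without copying anything.
    int64_t *colptr  = rowptr;
    int64_t *rowidxs = colidxs;
    for (int64_t j = 0; j < 2; ++j)
        for (int64_t ell = colptr[j]; ell < colptr[j+1]; ++ell)
            std::printf("At(%lld, %lld) = %g\n", (long long) rowidxs[ell], (long long) j, vals[ell]);
    return 0;
}
```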

## Sketching dense data with sparse operators

Sketching dense data with a sparse operator is typically handled with ``sketch_general``,
which is defined in ``skge.hh``.

If we call this function with a SparseSkOp object, ``S``, we immediately get routed to
a function in ``skges_to_spmm.hh``: either ``lskges`` or ``rskges``. Here's what happens
after we enter one of those functions:

1. If necessary, we sample the defining data of ``S`` with ``RandBLAS::fill_sparse(S)``.

2. We obtain a lightweight view of ``S`` as a COOMatrix and pass that matrix to ``left_spmm``
(if inside ``lskges``) or ``right_spmm`` (if inside ``rskges``).
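
In code-shaped form, the path looks roughly like this (hedged outline; the COO-view helper name is hypothetical, and only ``RandBLAS::fill_sparse`` is an actual RandBLAS function named above):

```c++
// Rough outline of lskges, not the literal implementation:
//
//     if ( /* S's defining data hasn't been sampled yet */ )
//         RandBLAS::fill_sparse(S);
//     auto S_view = coo_view_of(S);  // hypothetical helper: lightweight COOMatrix view, no copy
//     left_spmm( /* ..., */ S_view, /* ..., */ B, /* ..., */ C /* , ... */ );
//
// rskges follows the same two steps but finishes with right_spmm, since the sketching
// operator is applied from the right of the data matrix.
```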


## Sketching sparse data with dense operators

If we call ``sketch_sparse`` with a DenseSkOp, ``S``, and a sparse matrix, ``A``, then we'll get routed to either
``lsksp3`` or ``rsksp3`` in ``sparse_data/sksp3_to_spmm.hh``.

From there, we'll do the following.

1. If necessary, we sample the defining data of ``S``. The way that we do this is a
little more complicated than using ``RandBLAS::fill_dense(S)``, but it's similar
in spirit.

2. We get our hands on the simple buffer representation of ``S``. From there ...
* We call ``right_spmm`` if we're inside ``lsksp3``.
* We call ``left_spmm`` if we're inside ``rsksp3``.

Note that the ``l`` and ``r`` in the ``[l/r]sksp3`` function names
get matched to opposite sides for ``[left/right]_spmm``! This is because all the fancy abstractions in ``S`` have been stripped away by this point in the call sequence, so the "side" in the function name switches
from referring to ``S`` to referring to the sparse matrix ``A``.
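
Spelled out schematically (the scalar and argument handling here is illustrative, not the exact signatures):

```c++
// lsksp3:  sketch := alpha * op(S) * op(A) + ...   (dense S applied from the LEFT of sparse A)
//          The sparse operand A is the RIGHT factor of the product, so lsksp3 calls right_spmm.
//
// rsksp3:  sketch := alpha * op(A) * op(S) + ...   (dense S applied from the RIGHT of sparse A)
//          The sparse operand A is the LEFT factor of the product, so rsksp3 calls left_spmm.
```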

12 changes: 10 additions & 2 deletions RandBLAS/sparse_data/base.hh
@@ -63,6 +63,12 @@ static inline void sorted_nonzero_locations_to_pointer_array(
return;
}

// Idea: change all "const" attributes of SpMatrix to return values from inlined functions.
// Looks like there'd be no collision with function/property names for sparse matrix
// types in Eigen, SuiteSparse, OneMKL, etc. These inlined functions could return
// nominally public members like A._n_rows and A._n_cols, which the user will only change
// at their own peril.

// =============================================================================
/// @verbatim embed:rst:leading-slashes
///
@@ -145,9 +151,11 @@ concept SparseMatrix = requires(SpMat A) {
{ A.n_cols } -> std::convertible_to<const int64_t>;
{ A.nnz } -> std::convertible_to<int64_t>;
{ *(A.vals) } -> std::convertible_to<typename SpMat::scalar_t>;
{ SpMat(A.n_rows, A.n_cols) };
// ^ Is there a better way to require a two-argument constructor?
{ A.own_memory } -> std::convertible_to<const bool>;
{ SpMat(A.n_rows, A.n_cols) }; // Is there a better way to require a two-argument constructor?
{ A.reserve((int64_t) 10) }; // This will always compile, even though it might error at runtime.
// { A.reserve((int64_t) 10) };
// ^ Problem: const SpMat objects fail that check.
};

} // end namespace RandBLAS::sparse_data
60 changes: 60 additions & 0 deletions RandBLAS/sparse_data/csc_spmm_impl.hh
@@ -121,5 +121,65 @@ static void apply_csc_left_jki_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csc_left_kib_rowmajor_1p1(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSCMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);


int num_threads = 1;
#if defined(RandBLAS_HAS_OpenMP)
#pragma omp parallel
{
num_threads = omp_get_num_threads();
}
#endif

// Partition the row range [0, d) into contiguous blocks, one per thread; the last
// block absorbs any remainder. Each thread only updates rows of C in its own block,
// so the rank-1 updates below never write to the same row from two threads.
int* block_bounds = new int[num_threads + 1]{};
int block_size = d / num_threads;
if (block_size == 0) { block_size = 1; }
for (int t = 0; t < num_threads; ++t)
block_bounds[t+1] = block_bounds[t] + block_size;
block_bounds[num_threads] += d % num_threads;

#pragma omp parallel default(shared)
{
#if defined(RandBLAS_HAS_OpenMP)
int t = omp_get_thread_num();
#else
int t = 0;
#endif
int i_lower = block_bounds[t];
int i_upper = block_bounds[t+1];
for (int64_t k = 0; k < m; ++k) {
// Rank-1 update: C[:,:] += A[:,k] @ B[k,:]
const T* row_B = &B[k*ldb];
for (int64_t ell = A.colptr[k]; ell < A.colptr[k+1]; ++ell) {
int64_t i = A.rowidxs[ell];
if (i_lower <= i && i < i_upper) {
T* row_C = &C[i*ldc];
T scale = alpha * A.vals[ell];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
}

delete [] block_bounds;
return;
}

}
#endif
41 changes: 39 additions & 2 deletions RandBLAS/sparse_data/csr_spmm_impl.hh
@@ -24,12 +24,13 @@ static void apply_csr_to_vector_from_left_ik(
int64_t incAv // stride between elements of Av
) {
for (int64_t i = 0; i < len_Av; ++i) {
T Av_i = Av[i*incAv];
for (int64_t ell = rowptr[i]; ell < rowptr[i+1]; ++ell) {
int j = colidxs[ell];
T Aij = vals[ell];
Av[i*incAv] += Aij * v[j*incv];
// ^ if v were a matrix, this could be an axpy with the j-th row of v, accumulated into i-th row of Av.
Av_i += Aij * v[j*incv];
}
Av[i*incAv] = Av_i;
}
}

@@ -86,6 +87,42 @@ static void apply_csr_left_jik_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csr_left_ikb_rowmajor(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSRMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);

// Each thread handles whole rows of C (dynamic schedule over the row index i),
// so no two threads ever write to the same row of C.
#pragma omp parallel default(shared)
{
#pragma omp for schedule(dynamic)
for (int64_t i = 0; i < d; ++i) {
// C[i, 0:n] += alpha * A[i, :] @ B[:, 0:n]
T* row_C = &C[i*ldc];
for (int64_t ell = A.rowptr[i]; ell < A.rowptr[i+1]; ++ell) {
// we're working with A[i,k] for k = A.colidxs[ell]
// compute C[i, 0:n] += alpha * A[i,k] * B[k, 0:n]
T scale = alpha * A.vals[ell];
int64_t k = A.colidxs[ell];
const T* row_B = &B[k*ldb];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
return;
}

} // end namespace RandBLAS::sparse_data::csr

#endif
19 changes: 15 additions & 4 deletions RandBLAS/sparse_data/spmm_dispatch.hh
@@ -110,11 +110,22 @@ void left_spmm(
using RandBLAS::sparse_data::coo::apply_coo_left_jki_p11;
apply_coo_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, ro_a, co_a, B, ldb, C, ldc);
} else if constexpr (is_csc) {
using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
using RandBLAS::sparse_data::csc::apply_csc_left_kib_rowmajor_1p1;
apply_csc_left_kib_rowmajor_1p1(alpha, d, n, m, A, B, ldb, C, ldc);
} else {
using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
}
} else {
using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
using RandBLAS::sparse_data::csr::apply_csr_left_ikb_rowmajor;
apply_csr_left_ikb_rowmajor(alpha, d, n, m, A, B, ldb, C, ldc);
} else {
using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
}

}
return;
}
3 changes: 3 additions & 0 deletions examples/.gitignore
@@ -0,0 +1,3 @@
sparse-data-matrices/*
sparse-low-rank-approx/data-matrices/*
sparse-low-rank-approx/fast-matrix-market/*