Expand and update examples, improve spmm performance #89

Merged · 44 commits · May 30, 2024

Commits (all by rileyjmurray):
8a5e294  update examples (Apr 17, 2024)
8744efa  have data set up for low-rank approx of sparse matrix (Apr 19, 2024)
37087b8  svd example compiles (Apr 23, 2024)
b481b5b  accuracy check and runtime (Apr 26, 2024)
a7b4ab0  directory (Apr 28, 2024)
8e6c035  ignore directory contents (Apr 28, 2024)
74122e1  fast matrix market folder (Apr 28, 2024)
c8b8378  include examples that use fast-matrix-market (Apr 28, 2024)
cb8f8d2  interesting demo using test data from fast-matrix-market (Apr 28, 2024)
8d5ac2d  tweak (Apr 28, 2024)
e19a266  clean up svd example (using a new macro for timing) (Apr 28, 2024)
299cbc3  ternary assignment (Apr 29, 2024)
f66d209  proper QRCP demo (May 1, 2024)
698d272  standalone Eigen example (currently very messy) (May 1, 2024)
2dff390  update eigen example (May 2, 2024)
5676def  remove Eigen example (its sparse QR doesnt do pivoting to reveal nume… (May 2, 2024)
669acbd  restructure examples (May 7, 2024)
4af1221  fix bug in examples (those using lu_stabilize). Create special implem… (May 8, 2024)
1eb7b80  properly check for OpenMP before calling omp_get_thread_num() (May 9, 2024)
db4adcb  fix thread number (May 9, 2024)
6e70336  graphs examples (May 9, 2024)
2408585  check in (May 10, 2024)
18f91eb  check in (May 10, 2024)
cb929a2  fix logging bug (May 10, 2024)
09244d3  debugging (May 24, 2024)
8966842  fixed sparse QRCP example (May 24, 2024)
9a13962  Update lu_row_stabilize so that it errors when necessary. Modulo read… (May 24, 2024)
a86d2c5  clean up notes in SpMatrix concept and remove requirement that preclu… (May 29, 2024)
862121e  remove messy graphs example. Rename fmm_svd.cc (May 29, 2024)
daac3fd  rename qrcp.cc example, update CMakeLists.txt (May 29, 2024)
bb16d37  remove added whitespace (May 29, 2024)
fc25e2d  clean up svd_matrixmarket example (May 29, 2024)
4274ad3  Polish example for low-rank QRCP of sparse matrices (May 29, 2024)
90d6678  blank line at end of file (May 29, 2024)
703ccef  specify μs (microseconds) instead of ms (milliseconds) (May 29, 2024)
a226f21  revise TLS examples (May 29, 2024)
96588fa  change fetcher to a README (May 29, 2024)
a7ab7bb  README and CMakeLists.txt (May 29, 2024)
a91397c  make sure to deallocate before returning (May 29, 2024)
d9eacf9  dev notes (May 29, 2024)
1f8ae46  better devnotes (May 29, 2024)
d0125c1  devnotes (May 29, 2024)
5754ce1  better devnotes (May 30, 2024)
fd5a987  tweak (May 30, 2024)
Files changed (changes from all commits):
39 changes: 39 additions & 0 deletions RandBLAS/DevNotes.md
@@ -0,0 +1,39 @@
# Developer Notes for RandBLAS

This file has discussions of RandBLAS' implementation that aren't (currently) suitable
for the RandBLAS User Guide.

## What's where?

* Our basic random number generation is handled by [Random123](https://github.com/DEShawResearch/random123).
We have small wrappers around Random123 code in ``RandBLAS/base.hh`` and ``RandBLAS/random_gen.hh``.

* ``RandBLAS/dense_skops.hh`` has code for representing and sampling dense sketching operators.
The sampling code is complicated because it supports multi-threaded (yet threading-invariant!)
random (sub)matrix generation; a self-contained sketch of the idea appears at the end of this file.

* ``RandBLAS/sparse_skops.hh`` has code for representing and sampling sparse sketching operators.
The sampling code has a customized method for repeatedly sampling from an index set without
replacement, which is needed to quickly generate the structures used in statistically reliable
sparse sketching operators.

* [BLAS++ (aka blaspp)](https://github.com/icl-utk-edu/blaspp) is our portability layer for BLAS.
We actually use very few BLAS functions at the time of writing (GEMM, GEMV, SCAL, COPY, and
AXPY) but we use its enumerations _everywhere_. Fast GEMM is important for sketching dense
data with dense operators.

* The ``sketch_general`` functions in ``RandBLAS/skge.hh`` are the main entry point for sketching dense data.
These functions are small wrappers around functions with more BLAS-like names:
  * ``lskge3`` and ``rskge3`` in ``RandBLAS/skge3_to_gemm.hh``.
  * ``lskges`` and ``rskges`` in ``RandBLAS/skges_to_spmm.hh``.
The former pair of functions are just fancy wrappers around GEMM.
The latter pair of functions trigger a far more opaque call sequence, since they rely on sparse
matrix operations.

* There is no widely accepted standard for sparse BLAS operations. This is a bummer because
sparse matrices are super important in data science and scientific computing. In view of this,
RandBLAS provides its own abstractions for sparse matrices (CSC, CSR, and COO formats).
The abstractions can either own their associated data or just wrap existing data (say, data
attached to a sparse matrix in Eigen). RandBLAS has reasonably flexible and high-performance code
for multiplying a sparse matrix and a dense matrix. All code related to sparse matrices is in
``RandBLAS/sparse_data``. See that folder's ``DevNotes.md`` file for details.
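The threading-invariance claim above (for ``RandBLAS/dense_skops.hh``) deserves a concrete
illustration. Below is a self-contained toy, not RandBLAS code: ``toy_cbrng`` is a
splitmix-style stand-in for a counter-based generator like Random123's Philox. Because entry
``(i, j)`` is a pure function of ``(seed, i, j)``, any partition of the loop across threads
produces the same matrix.

```c++
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy stand-in for a counter-based RNG: a pure function of (seed, i, j).
static double toy_cbrng(uint64_t seed, uint64_t i, uint64_t j) {
    uint64_t z = seed ^ (i * 0x9E3779B97F4A7C15ULL + j + 1);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return (double)(z ^ (z >> 31)) / (double)UINT64_MAX;  // uniform-ish in [0, 1]
}

int main() {
    const int64_t m = 4, n = 5;
    std::vector<double> S(m * n);
    // However these iterations are split across threads, the result is the same
    // matrix: no entry depends on RNG "state" advanced by other iterations.
    #pragma omp parallel for
    for (int64_t idx = 0; idx < m * n; ++idx)
        S[idx] = toy_cbrng(42, idx / n, idx % n);
    std::printf("S(0,0) = %.6f\n", S[0]);
    return 0;
}
```

This is the property that lets sampling be parallelized without making the result depend on the thread count.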
76 changes: 76 additions & 0 deletions RandBLAS/sparse_data/DevNotes.md
@@ -0,0 +1,76 @@
# Developer Notes for RandBLAS' sparse matrix functionality

RandBLAS provides abstractions for CSC, CSR, and COO-format sparse matrices.
The following functions use these abstractions:

* ``left_spmm``, which computes a product of a sparse matrix and a dense matrix when the sparse matrix
is the left operand. This function is GEMM-like, in that it allows offsets and transposition flags
for either argument.
* ``right_spmm``, which is analogous to ``left_spmm`` when the sparse matrix is the right operand.
* ``sketch_general``, when called with a SparseSkOp object.
* ``sketch_sparse``, when called with a DenseSkOp object.

Each of those functions is merely a _dispatcher_ of other (lower-level) functions. See below for details on
how the dispatching works.
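As a hedged restatement in GEMM notation (the text above doesn't pin down exact parameter lists):
these functions perform updates of the shape ``C = alpha * op(submatrix(A)) * op(B) + beta * C``,
where ``op(.)`` optionally transposes its argument and the submatrix selection reflects the offset
parameters.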

## left_spmm and right_spmm

These functions are implemented in ``RandBLAS/sparse_data/spmm_dispatch.hh``.

``right_spmm`` is implemented by falling back on ``left_spmm`` with transformed
values for ``opA, opB`` and ``layout``.
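One way to see why such a fallback is possible: a matrix stored in one layout, reinterpreted in
the opposite layout, is its transpose. So a product ``C = op(B) * op(A)`` with the sparse matrix
on the right can be computed as ``C^T = op(A)^T * op(B)^T``, which has the sparse matrix on the
left, by flipping layouts and transposition flags. (This is the standard trick; the exact
transformation RandBLAS applies is in the source.)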
Here's what happens if ``left_spmm`` is called with a sparse matrix ``A``, a dense input matrix ``B``, and a dense output matrix ``C``.

1. If needed, transposition of ``A`` is resolved by creating a lightweight object for the transpose
called ``At``. This object is just a tool for us to change how we interpret the buffers that underlie ``A``
(see the standalone illustration after this list).
* If ``A`` is COO, then ``At`` will also be COO.
* If ``A`` is CSR, then ``At`` will be CSC.
* If ``A`` is CSC, then ``At`` will be CSR.

We make a recursive call to ``left_spmm`` once we have our hands on ``At``, so
the rest of ``left_spmm``'s logic only needs to handle un-transposed ``A``.

2. A memory layout is determined for how we'll read ``B`` in the low-level
sparse matrix multiplication kernels.
* If ``B`` is un-transposed then we'll use the same layout as ``C``.
* If ``B`` is transposed then we'll swap its declared dimensions
(i.e., we'll swap its reported numbers of rows and columns) and
we'll tell the kernel to read it in the opposite layout from ``C``.

3. We dispatch a kernel from ``coo_spmm_impl.hh``, ``csc_spmm_impl.hh``,
or ``csr_spmm_impl.hh``. The precise kernel depends on the type of ``A``, the inferred layout for ``B``, and the declared layout for ``C``.
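Here is the standalone illustration promised in step 1. It uses plain arrays rather than RandBLAS
types: the three buffers of a CSR matrix, read as CSC buffers, describe the transpose with no data
movement.

```c++
#include <cstdint>
#include <cstdio>

int main() {
    // A = [[1, 0], [2, 3]] stored in CSR format (2 x 2).
    int64_t rowptr[]  = {0, 1, 3};
    int64_t colidxs[] = {0, 0, 1};
    double  vals[]    = {1.0, 2.0, 3.0};
    // Reinterpret the same buffers as CSC: rowptr plays the role of colptr,
    // and colidxs plays the role of rowidxs. The result describes
    // At = [[1, 2], [0, 3]], i.e., the transpose of A.
    const int64_t *colptr = rowptr, *rowidxs = colidxs;
    for (int64_t j = 0; j < 2; ++j) {              // columns of At
        for (int64_t p = colptr[j]; p < colptr[j+1]; ++p)
            std::printf("At(%lld, %lld) = %g\n", (long long) rowidxs[p], (long long) j, vals[p]);
    }
    return 0;
}
```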

## Sketching dense data with sparse operators

Sketching dense data with a sparse operator is typically handled with ``sketch_general``,
which is defined in ``skge.hh``.

If we call this function with a SparseSkOp object, ``S``, we'd immediately get routed to
a function in ``skges_to_spmm.hh``: either ``lskges`` or ``rskges``. Here's what would happen
after we entered one of those functions:

1. If necessary, we'd sample the defining data of ``S`` with ``RandBLAS::fill_sparse(S)``.

2. We'd obtain a lightweight view of ``S`` as a COOMatrix, and we'd pass that matrix to ``left_spmm``
(if inside ``lskges``) or ``right_spmm`` (if inside ``rskges``).
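A rough sketch of the "lightweight view" idea from step 2, with hypothetical names (RandBLAS'
actual COOMatrix type differs): the view borrows the operator's existing buffers instead of
copying them.

```c++
#include <cstdint>

// Hypothetical non-owning COO view; illustrative only.
template <typename T>
struct coo_view {
    int64_t n_rows, n_cols, nnz;
    const int64_t *rows;  // borrowed, not owned
    const int64_t *cols;  // borrowed, not owned
    const T *vals;        // borrowed, not owned
};

// Wrapping costs O(1): no nonzeros are copied, so handing the view to
// left_spmm or right_spmm adds essentially no overhead.
template <typename T>
coo_view<T> view_as_coo(int64_t m, int64_t n, int64_t nnz,
                        const int64_t *rows, const int64_t *cols, const T *vals) {
    return {m, n, nnz, rows, cols, vals};
}
```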


## Sketching sparse data with dense operators

If we call ``sketch_sparse`` with a DenseSkOp, ``S``, and a sparse matrix, ``A``, then we'll get routed to either
``lsksp3`` or ``rsksp3`` in ``sparse_data/sksp3_to_spmm.hh``.

From there, we'll do the following.

1. If necessary, we sample the defining data of ``S``. The way that we do this is a
little more complicated than using ``RandBLAS::fill_dense(S)``, but it's similar
in spirit.

2. We get our hands on the simple buffer representation of ``S``. From there ...
* We call ``right_spmm`` if we're inside ``lsksp3``.
* We call ``left_spmm`` if we're inside ``rsksp3``.

Note that the ``l`` and ``r`` in the ``[l/r]sksp3`` function names
get matched to opposite sides in ``[left/right]_spmm``! By this point in the call sequence the
abstractions around ``S`` have been stripped away, so the "side" in a function's name refers to
the sparse matrix ``A`` rather than to ``S``. Concretely: ``lsksp3`` computes ``op(S) op(A)``,
which has the sparse matrix as the right-hand operand, hence ``right_spmm``.

12 changes: 10 additions & 2 deletions RandBLAS/sparse_data/base.hh
@@ -63,6 +63,12 @@ static inline void sorted_nonzero_locations_to_pointer_array(
return;
}

// Idea: change all "const" attributes of SpMatrix to values returned from inlined functions.
// Looks like there'd be no collision with function/property names for sparse matrix
// types in Eigen, SuiteSparse, OneMKL, etc. These inlined functions could return
// nominally public members like A._n_rows and A._n_cols, which the user would only change
// at their own peril.

// =============================================================================
/// @verbatim embed:rst:leading-slashes
///
@@ -145,9 +151,11 @@ concept SparseMatrix = requires(SpMat A) {
{ A.n_cols } -> std::convertible_to<const int64_t>;
{ A.nnz } -> std::convertible_to<int64_t>;
{ *(A.vals) } -> std::convertible_to<typename SpMat::scalar_t>;
-{ SpMat(A.n_rows, A.n_cols) };
-// ^ Is there better way to require a two-argument constructor?
+{ A.own_memory } -> std::convertible_to<const bool>;
+{ SpMat(A.n_rows, A.n_cols) }; // Is there a better way to require a two-argument constructor?
+{ A.reserve((int64_t) 10) }; // This will always compile, even though it might error at runtime.
+// { A.reserve((int64_t) 10) };
+// ^ Problem: const SpMat objects fail that check.
};

} // end namespace RandBLAS::sparse_data
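To make the ``SparseMatrix`` concept's requirements concrete, here is a minimal type sketched to
satisfy them. This is purely illustrative: it is not one of RandBLAS' real matrix types, and the
concept's full requirement list includes members elided from the diff above.

```c++
#include <cstdint>

// Hypothetical bare-bones type satisfying the SparseMatrix concept's visible
// requirements: n_rows, n_cols, nnz, vals, own_memory, a two-argument
// constructor, and reserve(int64_t).
template <typename T>
struct TinySpMat {
    using scalar_t = T;
    const int64_t n_rows, n_cols;
    int64_t nnz = 0;
    T *vals = nullptr;
    const bool own_memory = true;
    TinySpMat(int64_t n_rows_, int64_t n_cols_) : n_rows(n_rows_), n_cols(n_cols_) {}
    void reserve(int64_t nnz_) { delete [] vals; vals = new T[nnz_]; nnz = nnz_; }
    ~TinySpMat() { if (own_memory) delete [] vals; }
};

// Usage check (assuming the concept's qualified name):
//   static_assert(RandBLAS::sparse_data::SparseMatrix<TinySpMat<double>>);
```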
60 changes: 60 additions & 0 deletions RandBLAS/sparse_data/csc_spmm_impl.hh
@@ -121,5 +121,65 @@ static void apply_csc_left_jki_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csc_left_kib_rowmajor_1p1(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSCMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);


int num_threads = 1;
#if defined(RandBLAS_HAS_OpenMP)
#pragma omp parallel
{
num_threads = omp_get_num_threads();
}
#endif

int64_t* block_bounds = new int64_t[num_threads + 1]{};
int64_t block_size = d / num_threads;
if (block_size == 0) { block_size = 1; }
for (int t = 0; t < num_threads; ++t)
    block_bounds[t+1] = block_bounds[t] + block_size;
block_bounds[num_threads] += d % num_threads;
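// Partitioning strategy: threads own disjoint, contiguous blocks of rows of C
// (the last block also absorbs the d % num_threads remainder). Every thread
// scans all of A and B, but applies only the updates whose destination row
// falls in its own block, so no two threads ever write to the same row of C.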

#pragma omp parallel default(shared)
{
#if defined(RandBLAS_HAS_OpenMP)
int t = omp_get_thread_num();
#else
int t = 0;
#endif
int64_t i_lower = block_bounds[t];
int64_t i_upper = block_bounds[t+1];
for (int64_t k = 0; k < m; ++k) {
// Rank-1 update: C[:,:] += alpha * A[:,k] @ B[k,:]
const T* row_B = &B[k*ldb];
for (int64_t ell = A.colptr[k]; ell < A.colptr[k+1]; ++ell) {
int64_t i = A.rowidxs[ell];
if (i_lower <= i && i < i_upper) {
T* row_C = &C[i*ldc];
T scale = alpha * A.vals[ell];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
}

delete [] block_bounds;
return;
}

} // end namespace RandBLAS::sparse_data::csc
#endif
41 changes: 39 additions & 2 deletions RandBLAS/sparse_data/csr_spmm_impl.hh
@@ -24,12 +24,13 @@ static void apply_csr_to_vector_from_left_ik(
int64_t incAv // stride between elements of Av
) {
for (int64_t i = 0; i < len_Av; ++i) {
+T Av_i = Av[i*incAv];
for (int64_t ell = rowptr[i]; ell < rowptr[i+1]; ++ell) {
int j = colidxs[ell];
T Aij = vals[ell];
-Av[i*incAv] += Aij * v[j*incv];
-// ^ if v were a matrix, this could be an axpy with the j-th row of v, accumulated into i-th row of Av.
+Av_i += Aij * v[j*incv];
}
+Av[i*incAv] = Av_i;
}
}

@@ -86,6 +87,42 @@ static void apply_csr_left_jik_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csr_left_ikb_rowmajor(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSRMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);

#pragma omp parallel default(shared)
{
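// Each row C[i, :] is computed from row A[i, :] alone, so rows are independent
// units of work; dynamic scheduling balances threads when the number of
// nonzeros per row of A varies.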
#pragma omp for schedule(dynamic)
for (int64_t i = 0; i < d; ++i) {
// C[i, 0:n] += alpha * A[i, :] @ B[:, 0:n]
T* row_C = &C[i*ldc];
for (int64_t ell = A.rowptr[i]; ell < A.rowptr[i+1]; ++ell) {
// we're working with A[i,k] for k = A.colidxs[ell]
// compute C[i, 0:n] += alpha * A[i,k] * B[k, 0:n]
T scale = alpha * A.vals[ell];
int64_t k = A.colidxs[ell];
const T* row_B = &B[k*ldb];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
return;
}

} // end namespace RandBLAS::sparse_data::csr

#endif
19 changes: 15 additions & 4 deletions RandBLAS/sparse_data/spmm_dispatch.hh
@@ -110,11 +110,22 @@ void left_spmm(
using RandBLAS::sparse_data::coo::apply_coo_left_jki_p11;
apply_coo_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, ro_a, co_a, B, ldb, C, ldc);
} else if constexpr (is_csc) {
-using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
-apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
+    using RandBLAS::sparse_data::csc::apply_csc_left_kib_rowmajor_1p1;
+    apply_csc_left_kib_rowmajor_1p1(alpha, d, n, m, A, B, ldb, C, ldc);
+} else {
+    using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
+    apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+}
} else {
-using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
-apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
+    using RandBLAS::sparse_data::csr::apply_csr_left_ikb_rowmajor;
+    apply_csr_left_ikb_rowmajor(alpha, d, n, m, A, B, ldb, C, ldc);
+} else {
+    using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
+    apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+}

}
return;
}
3 changes: 3 additions & 0 deletions examples/.gitignore
@@ -0,0 +1,3 @@
sparse-data-matrices/*
sparse-low-rank-approx/data-matrices/*
sparse-low-rank-approx/fast-matrix-market/*