Expand and update examples, improve spmm performance #89

Merged · 44 commits · May 30, 2024

Commits (all by rileyjmurray):
8a5e294  update examples (Apr 17, 2024)
8744efa  have data set up for low-rank approx of sparse matrix (Apr 19, 2024)
37087b8  svd example compiles (Apr 23, 2024)
b481b5b  accuracy check and runtime (Apr 26, 2024)
a7b4ab0  directory (Apr 28, 2024)
8e6c035  ignore directory contents (Apr 28, 2024)
74122e1  fast matrix market folder (Apr 28, 2024)
c8b8378  include examples that use fast-matrix-market (Apr 28, 2024)
cb8f8d2  interesting demo using test data from fast-matrix-market (Apr 28, 2024)
8d5ac2d  tweak (Apr 28, 2024)
e19a266  clean up svd example (using a new macro for timing) (Apr 28, 2024)
299cbc3  ternary assignment (Apr 29, 2024)
f66d209  proper QRCP demo (May 1, 2024)
698d272  standalone Eigen example (currently very messy) (May 1, 2024)
2dff390  update eigen example (May 2, 2024)
5676def  remove Eigen example (its sparse QR doesnt do pivoting to reveal nume… (May 2, 2024)
669acbd  restructure examples (May 7, 2024)
4af1221  fix bug in examples (those using lu_stabilize). Create special implem… (May 8, 2024)
1eb7b80  properly check for OpenMP before calling omp_get_thread_num() (May 9, 2024)
db4adcb  fix thread number (May 9, 2024)
6e70336  graphs examples (May 9, 2024)
2408585  check in (May 10, 2024)
18f91eb  check in (May 10, 2024)
cb929a2  fix logging bug (May 10, 2024)
09244d3  debugging (May 24, 2024)
8966842  fixed sparse QRCP example (May 24, 2024)
9a13962  Update lu_row_stabilize so that it errors when necessary. Modulo read… (May 24, 2024)
a86d2c5  clean up notes in SpMatrix concept and remove requirement that preclu… (May 29, 2024)
862121e  remove messy graphs example. Rename fmm_svd.cc (May 29, 2024)
daac3fd  rename qrcp.cc example, update CMakeLists.txt (May 29, 2024)
bb16d37  remove added whitespace (May 29, 2024)
fc25e2d  clean up svd_matrixmarket example (May 29, 2024)
4274ad3  Polish example for low-rank QRCP of sparse matrices (May 29, 2024)
90d6678  blank line at end of file (May 29, 2024)
703ccef  specify μs (microseconds) instead of ms (milliseconds) (May 29, 2024)
a226f21  revise TLS examples (May 29, 2024)
96588fa  change fetcher to a README (May 29, 2024)
a7ab7bb  README and CMakeLists.txt (May 29, 2024)
a91397c  make sure to deallocate before returning (May 29, 2024)
d9eacf9  dev notes (May 29, 2024)
1f8ae46  better devnotes (May 29, 2024)
d0125c1  devnotes (May 29, 2024)
5754ce1  better devnotes (May 30, 2024)
fd5a987  tweak (May 30, 2024)
Files changed (changes from all commits):
39 changes: 39 additions & 0 deletions RandBLAS/DevNotes.md
@@ -0,0 +1,39 @@
# Developer Notes for RandBLAS

This file has discussions of RandBLAS' implementation that aren't (currently) suitable
for the RandBLAS User Guide.

## What's where?

* Our basic random number generation is handled by [Random123](https://github.com/DEShawResearch/random123).
We have small wrappers around Random123 code in ``RandBLAS/base.hh`` and ``RandBLAS/random_gen.hh``.

* ``RandBLAS/dense_skops.hh`` has code for representing and sampling dense sketching operators.
The sampling code is complicated because it supports multi-threaded (yet threading-invariant!)
random (sub)matrix generation; a self-contained sketch of the idea appears at the end of this file.

* ``RandBLAS/sparse_skops.hh`` has code for representing and sampling sparse sketching operators.
The sampling code has a customized method for repeatedly sampling from an index set without
replacement, which is needed to quickly generate the structures used in statistically reliable
sparse sketching operators.

* [BLAS++ (aka blaspp)](https://github.com/icl-utk-edu/blaspp) is our portability layer for BLAS.
We actually use very few BLAS functions at the time of writing (GEMM, GEMV, SCAL, COPY, and
AXPY) but we use its enumerations _everywhere_. Fast GEMM is important for sketching dense
data with dense operators.

* The ``sketch_general`` functions in ``RandBLAS/skge.hh`` are the main entry point for sketching dense data.
These functions are small wrappers around functions with more BLAS-like names:
  * ``lskge3`` and ``rskge3`` in ``RandBLAS/skge3_to_gemm.hh``.
  * ``lskges`` and ``rskges`` in ``RandBLAS/skges_to_spmm.hh``.
The former pair of functions are just fancy wrappers around GEMM.
The latter pair of functions trigger a far more opaque call sequence, since they rely on sparse
matrix operations.

* There is no widely accepted standard for sparse BLAS operations. This is a bummer because
sparse matrices are super important in data science and scientific computing. In view of this,
RandBLAS provides its own abstractions for sparse matrices (CSC, CSR, and COO formats).
The abstractions can either own their associated data or just wrap existing data (say, data
attached to a sparse matrix in Eigen). RandBLAS has reasonably flexible and high-performance code
for multiplying a sparse matrix and a dense matrix. All code related to sparse matrices is in
``RandBLAS/sparse_data``. See that folder's ``DevNotes.md`` file for details.
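The threading-invariance claim above (for ``RandBLAS/dense_skops.hh``) deserves a concrete
illustration. Below is a self-contained toy, not RandBLAS code: ``toy_cbrng`` is a
splitmix-style stand-in for a counter-based generator like Random123's Philox. Because entry
``(i, j)`` is a pure function of ``(seed, i, j)``, any partition of the loop across threads
produces the same matrix.

```c++
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy stand-in for a counter-based RNG: a pure function of (seed, i, j).
static double toy_cbrng(uint64_t seed, uint64_t i, uint64_t j) {
    uint64_t z = seed ^ (i * 0x9E3779B97F4A7C15ULL + j + 1);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return (double)(z ^ (z >> 31)) / (double)UINT64_MAX;  // uniform-ish in [0, 1]
}

int main() {
    const int64_t m = 4, n = 5;
    std::vector<double> S(m * n);
    // However these iterations are split across threads, the result is the same
    // matrix: no entry depends on RNG "state" advanced by other iterations.
    #pragma omp parallel for
    for (int64_t idx = 0; idx < m * n; ++idx)
        S[idx] = toy_cbrng(42, idx / n, idx % n);
    std::printf("S(0,0) = %.6f\n", S[0]);
    return 0;
}
```

This is the property that lets sampling be parallelized without making the result depend on the thread count.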
76 changes: 76 additions & 0 deletions RandBLAS/sparse_data/DevNotes.md
@@ -0,0 +1,76 @@
# Developer Notes for RandBLAS' sparse matrix functionality

RandBLAS provides abstractions for CSC, CSR, and COO-format sparse matrices.
The following functions use these abstractions:

* ``left_spmm``, which computes a product of a sparse matrix and a dense matrix when the sparse matrix
is the left operand. This function is GEMM-like, in that it allows offsets and transposition flags
for either argument.
* ``right_spmm``, which is analogous to ``left_spmm`` when the sparse matrix is the right operand.
* ``sketch_general``, when called with a SparseSkOp object.
* ``sketch_sparse``, when called with a DenseSkOp object.

Each of those functions is merely a _dispatcher_ of other (lower-level) functions. See below for details on
how the dispatching works.
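As a hedged restatement in GEMM notation (the text above doesn't pin down exact parameter lists):
these functions perform updates of the shape ``C = alpha * op(submatrix(A)) * op(B) + beta * C``,
where ``op(.)`` optionally transposes its argument and the submatrix selection reflects the offset
parameters.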

## left_spmm and right_spmm

These functions are implemented in ``RandBLAS/sparse_data/spmm_dispatch.hh``.

``right_spmm`` is implemented by falling back on ``left_spmm`` with transformed
values for ``opA, opB`` and ``layout``.
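One way to see why such a fallback is possible: a matrix stored in one layout, reinterpreted in
the opposite layout, is its transpose. So a product ``C = op(B) * op(A)`` with the sparse matrix
on the right can be computed as ``C^T = op(A)^T * op(B)^T``, which has the sparse matrix on the
left, by flipping layouts and transposition flags. (This is the standard trick; the exact
transformation RandBLAS applies is in the source.)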
Here's what happens if ``left_spmm`` is called with a sparse matrix ``A``, a dense input matrix ``B``, and a dense output matrix ``C``.

1. If needed, transposition of ``A`` is resolved by creating a lightweight object for the transpose
called ``At``. This object is just a tool for us to change how we interpret the buffers that underlie ``A``
(see the standalone illustration after this list).
* If ``A`` is COO, then ``At`` will also be COO.
* If ``A`` is CSR, then ``At`` will be CSC.
* If ``A`` is CSC, then ``At`` will be CSR.

We make a recursive call to ``left_spmm`` once we have our hands on ``At``, so
the rest of ``left_spmm``'s logic only needs to handle un-transposed ``A``.

2. A memory layout is determined for how we'll read ``B`` in the low-level
sparse matrix multiplication kernels.
* If ``B`` is un-transposed then we'll use the same layout as ``C``.
* If ``B`` is transposed then we'll swap its declared dimensions
(i.e., we'll swap its reported numbers of rows and columns) and
we'll tell the kernel to read it in the opposite layout from ``C``.

3. We dispatch a kernel from ``coo_spmm_impl.hh``, ``csc_spmm_impl.hh``,
or ``csr_spmm_impl.hh``. The precise kernel depends on the type of ``A``, the inferred layout for ``B``, and the declared layout for ``C``.
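Here is the standalone illustration promised in step 1. It uses plain arrays rather than RandBLAS
types: the three buffers of a CSR matrix, read as CSC buffers, describe the transpose with no data
movement.

```c++
#include <cstdint>
#include <cstdio>

int main() {
    // A = [[1, 0], [2, 3]] stored in CSR format (2 x 2).
    int64_t rowptr[]  = {0, 1, 3};
    int64_t colidxs[] = {0, 0, 1};
    double  vals[]    = {1.0, 2.0, 3.0};
    // Reinterpret the same buffers as CSC: rowptr plays the role of colptr,
    // and colidxs plays the role of rowidxs. The result describes
    // At = [[1, 2], [0, 3]], i.e., the transpose of A.
    const int64_t *colptr = rowptr, *rowidxs = colidxs;
    for (int64_t j = 0; j < 2; ++j) {              // columns of At
        for (int64_t p = colptr[j]; p < colptr[j+1]; ++p)
            std::printf("At(%lld, %lld) = %g\n", (long long) rowidxs[p], (long long) j, vals[p]);
    }
    return 0;
}
```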

## Sketching dense data with sparse operators

Sketching dense data with a sparse operator is typically handled with ``sketch_general``,
which is defined in ``skge.hh``.

If we call this function with a SparseSkOp object, ``S``, we'd immediately get routed to
a function in ``skges_to_spmm.hh``: either ``lskges`` or ``rskges``. Here's what would happen
after we entered one of those functions:

1. If necessary, we'd sample the defining data of ``S`` with ``RandBLAS::fill_sparse(S)``.

2. We'd obtain a lightweight view of ``S`` as a COOMatrix, and we'd pass that matrix to ``left_spmm``
(if inside ``lskges``) or ``right_spmm`` (if inside ``rskges``).
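A rough sketch of the "lightweight view" idea from step 2, with hypothetical names (RandBLAS'
actual COOMatrix type differs): the view borrows the operator's existing buffers instead of
copying them.

```c++
#include <cstdint>

// Hypothetical non-owning COO view; illustrative only.
template <typename T>
struct coo_view {
    int64_t n_rows, n_cols, nnz;
    const int64_t *rows;  // borrowed, not owned
    const int64_t *cols;  // borrowed, not owned
    const T *vals;        // borrowed, not owned
};

// Wrapping costs O(1): no nonzeros are copied, so handing the view to
// left_spmm or right_spmm adds essentially no overhead.
template <typename T>
coo_view<T> view_as_coo(int64_t m, int64_t n, int64_t nnz,
                        const int64_t *rows, const int64_t *cols, const T *vals) {
    return {m, n, nnz, rows, cols, vals};
}
```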


## Sketching sparse data with dense operators

If we call ``sketch_sparse`` with a DenseSkOp, ``S``, and a sparse matrix, ``A``, then we'll get routed to either
``lsksp3`` or ``rsksp3`` in ``sparse_data/sksp3_to_spmm.hh``.

From there, we'll do the following.

1. If necessary, we sample the defining data of ``S``. The way that we do this is a
little more complicated than using ``RandBLAS::fill_dense(S)``, but it's similar
in spirit.

2. We get our hands on the simple buffer representation of ``S``. From there ...
* We call ``right_spmm`` if we're inside ``lsksp3``.
* We call ``left_spmm`` if we're inside ``rsksp3``.

Note that the ``l`` and ``r`` in the ``[l/r]sksp3`` function names
get matched to opposite sides in ``[left/right]_spmm``! By this point in the call sequence the
abstractions around ``S`` have been stripped away, so the "side" in a function's name refers to
the sparse matrix ``A`` rather than to ``S``. Concretely: ``lsksp3`` computes ``op(S) op(A)``,
which has the sparse matrix as the right-hand operand, hence ``right_spmm``.

12 changes: 10 additions & 2 deletions RandBLAS/sparse_data/base.hh
@@ -63,6 +63,12 @@ static inline void sorted_nonzero_locations_to_pointer_array(
return;
}

// Idea: change all "const" attributes of SpMatrix to values returned from inlined functions.
// Looks like there'd be no collision with function/property names for sparse matrix
// types in Eigen, SuiteSparse, OneMKL, etc. These inlined functions could return
// nominally public members like A._n_rows and A._n_cols, which the user would only change
// at their own peril.

// =============================================================================
/// @verbatim embed:rst:leading-slashes
///
@@ -145,9 +151,11 @@ concept SparseMatrix = requires(SpMat A) {
{ A.n_cols } -> std::convertible_to<const int64_t>;
{ A.nnz } -> std::convertible_to<int64_t>;
{ *(A.vals) } -> std::convertible_to<typename SpMat::scalar_t>;
-{ SpMat(A.n_rows, A.n_cols) };
-// ^ Is there better way to require a two-argument constructor?
+{ A.own_memory } -> std::convertible_to<const bool>;
+{ SpMat(A.n_rows, A.n_cols) }; // Is there a better way to require a two-argument constructor?
+{ A.reserve((int64_t) 10) }; // This will always compile, even though it might error at runtime.
+// { A.reserve((int64_t) 10) };
+// ^ Problem: const SpMat objects fail that check.
};

} // end namespace RandBLAS::sparse_data
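To make the ``SparseMatrix`` concept's requirements concrete, here is a minimal type sketched to
satisfy them. This is purely illustrative: it is not one of RandBLAS' real matrix types, and the
concept's full requirement list includes members elided from the diff above.

```c++
#include <cstdint>

// Hypothetical bare-bones type satisfying the SparseMatrix concept's visible
// requirements: n_rows, n_cols, nnz, vals, own_memory, a two-argument
// constructor, and reserve(int64_t).
template <typename T>
struct TinySpMat {
    using scalar_t = T;
    const int64_t n_rows, n_cols;
    int64_t nnz = 0;
    T *vals = nullptr;
    const bool own_memory = true;
    TinySpMat(int64_t n_rows_, int64_t n_cols_) : n_rows(n_rows_), n_cols(n_cols_) {}
    void reserve(int64_t nnz_) { delete [] vals; vals = new T[nnz_]; nnz = nnz_; }
    ~TinySpMat() { if (own_memory) delete [] vals; }
};

// Usage check (assuming the concept's qualified name):
//   static_assert(RandBLAS::sparse_data::SparseMatrix<TinySpMat<double>>);
```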
60 changes: 60 additions & 0 deletions RandBLAS/sparse_data/csc_spmm_impl.hh
@@ -121,5 +121,65 @@ static void apply_csc_left_jki_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csc_left_kib_rowmajor_1p1(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSCMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);


int num_threads = 1;
#if defined(RandBLAS_HAS_OpenMP)
#pragma omp parallel
{
num_threads = omp_get_num_threads();
}
#endif

int64_t* block_bounds = new int64_t[num_threads + 1]{};
int64_t block_size = d / num_threads;
if (block_size == 0) { block_size = 1; }
for (int t = 0; t < num_threads; ++t)
    block_bounds[t+1] = block_bounds[t] + block_size;
block_bounds[num_threads] += d % num_threads;
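// Partitioning strategy: threads own disjoint, contiguous blocks of rows of C
// (the last block also absorbs the d % num_threads remainder). Every thread
// scans all of A and B, but applies only the updates whose destination row
// falls in its own block, so no two threads ever write to the same row of C.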

#pragma omp parallel default(shared)
{
#if defined(RandBLAS_HAS_OpenMP)
int t = omp_get_thread_num();
#else
int t = 0;
#endif
int64_t i_lower = block_bounds[t];
int64_t i_upper = block_bounds[t+1];
for (int64_t k = 0; k < m; ++k) {
// Rank-1 update: C[:,:] += alpha * A[:,k] @ B[k,:]
const T* row_B = &B[k*ldb];
for (int64_t ell = A.colptr[k]; ell < A.colptr[k+1]; ++ell) {
int64_t i = A.rowidxs[ell];
if (i_lower <= i && i < i_upper) {
T* row_C = &C[i*ldc];
T scale = alpha * A.vals[ell];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
}

delete [] block_bounds;
return;
}

} // end namespace RandBLAS::sparse_data::csc
#endif
41 changes: 39 additions & 2 deletions RandBLAS/sparse_data/csr_spmm_impl.hh
@@ -24,12 +24,13 @@ static void apply_csr_to_vector_from_left_ik(
int64_t incAv // stride between elements of Av
) {
for (int64_t i = 0; i < len_Av; ++i) {
+T Av_i = Av[i*incAv];
for (int64_t ell = rowptr[i]; ell < rowptr[i+1]; ++ell) {
int j = colidxs[ell];
T Aij = vals[ell];
-Av[i*incAv] += Aij * v[j*incv];
-// ^ if v were a matrix, this could be an axpy with the j-th row of v, accumulated into i-th row of Av.
+Av_i += Aij * v[j*incv];
}
+Av[i*incAv] = Av_i;
}
}

@@ -86,6 +87,42 @@ static void apply_csr_left_jik_p11(
return;
}

template <typename T, RandBLAS::SignedInteger sint_t>
static void apply_csr_left_ikb_rowmajor(
T alpha,
int64_t d,
int64_t n,
int64_t m,
CSRMatrix<T, sint_t> &A,
const T *B,
int64_t ldb,
T *C,
int64_t ldc
) {
randblas_require(A.index_base == IndexBase::Zero);

randblas_require(d == A.n_rows);
randblas_require(m == A.n_cols);

#pragma omp parallel default(shared)
{
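// Each row C[i, :] is computed from row A[i, :] alone, so rows are independent
// units of work; dynamic scheduling balances threads when the number of
// nonzeros per row of A varies.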
#pragma omp for schedule(dynamic)
for (int64_t i = 0; i < d; ++i) {
// C[i, 0:n] += alpha * A[i, :] @ B[:, 0:n]
T* row_C = &C[i*ldc];
for (int64_t ell = A.rowptr[i]; ell < A.rowptr[i+1]; ++ell) {
// we're working with A[i,k] for k = A.colidxs[ell]
// compute C[i, 0:n] += alpha * A[i,k] * B[k, 0:n]
T scale = alpha * A.vals[ell];
int64_t k = A.colidxs[ell];
const T* row_B = &B[k*ldb];
blas::axpy(n, scale, row_B, 1, row_C, 1);
}
}
}
return;
}

} // end namespace RandBLAS::sparse_data::csr

#endif
19 changes: 15 additions & 4 deletions RandBLAS/sparse_data/spmm_dispatch.hh
@@ -110,11 +110,22 @@ void left_spmm(
using RandBLAS::sparse_data::coo::apply_coo_left_jki_p11;
apply_coo_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, ro_a, co_a, B, ldb, C, ldc);
} else if constexpr (is_csc) {
-using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
-apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
+    using RandBLAS::sparse_data::csc::apply_csc_left_kib_rowmajor_1p1;
+    apply_csc_left_kib_rowmajor_1p1(alpha, d, n, m, A, B, ldb, C, ldc);
+} else {
+    using RandBLAS::sparse_data::csc::apply_csc_left_jki_p11;
+    apply_csc_left_jki_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+}
} else {
-using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
-apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+if (layout_opB == Layout::RowMajor && layout_C == Layout::RowMajor) {
+    using RandBLAS::sparse_data::csr::apply_csr_left_ikb_rowmajor;
+    apply_csr_left_ikb_rowmajor(alpha, d, n, m, A, B, ldb, C, ldc);
+} else {
+    using RandBLAS::sparse_data::csr::apply_csr_left_jik_p11;
+    apply_csr_left_jik_p11(alpha, layout_opB, layout_C, d, n, m, A, B, ldb, C, ldc);
+}

}
return;
}
3 changes: 3 additions & 0 deletions examples/.gitignore
@@ -0,0 +1,3 @@
sparse-data-matrices/*
sparse-low-rank-approx/data-matrices/*
sparse-low-rank-approx/fast-matrix-market/*