Algorithms for Sparse Matrix-Vector Multiplication

Compressed Sparse Row

CSR is an optimized way to store a sparse matrix row by row: only the non-zero values are kept, together with their column indices and a per-row offset array. (Figure: Cache Optimization-20250331112244798)
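
As a concrete illustration (a made-up 4x4 matrix, not the one from the figure), the three CSR arrays look like this; row i occupies positions row_start[i] to row_start[i+1] - 1 of values and col_idx:

/* Hypothetical example matrix:
 *
 *     [ 10  0  0  2 ]
 * A = [  0  3  0  0 ]
 *     [  0  0  0  0 ]
 *     [  4  0  5  0 ]
 */
const double values[]    = { 10.0, 2.0, 3.0, 4.0, 5.0 };  /* non-zeros, row by row       */
const int    col_idx[]   = { 0,    3,   1,   0,   2   };  /* column of each stored value */
const int    row_start[] = { 0, 2, 3, 3, 5 };             /* row i spans [row_start[i], row_start[i+1]);
                                                              the empty row 2 spans [3, 3) */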

Sparse MVM using CSR

/* Computes y = y + A*x for an m-row sparse matrix A in CSR format */
void smvm(int m, const double* values, const int* col_idx,
          const int* row_start, double* x, double* y) {
    int i, j;
    double d;

    /* Loop over m rows */
    for (i = 0; i < m; i++) {
        d = y[i]; /* Scalar replacement since reused */

        /* Loop over non-zero elements in row i */
        for (j = row_start[i]; j < row_start[i + 1]; j++) {
            d += values[j] * x[col_idx[j]];
        }

        y[i] = d;
    }
}
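
For instance, using the hypothetical arrays from the sketch above, the kernel accumulates y = y + A*x:

double x[4] = { 1.0, 1.0, 1.0, 1.0 };
double y[4] = { 0.0, 0.0, 0.0, 0.0 };

smvm(4, values, col_idx, row_start, x, y);
/* y is now { 12.0, 3.0, 0.0, 9.0 } for that example matrix */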

Let’s analyze this code:

  • Spatial locality: with respect to row_start, col_idx and values we have good spatial locality, since they are streamed through sequentially.
  • Temporal locality: with respect to y we have temporal locality (y[i] is kept in the scalar d). Temporal locality with respect to x is poor, because col_idx jumps around.
  • Good storage efficiency for the sparse matrix.
  • But when the matrix happens to be dense, this kernel is about 2x slower than a dense MVM, because of the extra index loads and the indirect access to x.

Block CSR

Sparse Matrix Vector Multiplication-20250331112834561

But with plain CSR we cannot apply blocking optimizations for the cache and registers. Blocked CSR (BCSR) addresses this: the matrix is partitioned into small dense r x c blocks (2 x 2 in the code below), and only one column index is stored per block. A sketch of the resulting arrays follows.
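
As a hedged illustration (a made-up 4x4 matrix, not the one from the figure), the arrays expected by the 2x2 kernel below look like this:

/* Hypothetical example matrix, partitioned into 2x2 blocks:
 *
 *     [ 10  2 |  0  0 ]
 * A = [  0  3 |  0  0 ]
 *     [  0  0 |  4  0 ]
 *     [  0  0 |  5  6 ]
 *
 * Only the two non-zero blocks are stored, each laid out row-major. */
const int    b_row_start[] = { 0, 1, 2 };  /* block row i spans [b_row_start[i], b_row_start[i+1]) */
const int    b_col_idx[]   = { 0, 1 };     /* block-column index of each stored block */
const double b_values[]    = { 10.0, 2.0, 0.0, 3.0,    /* block (0,0), row-major */
                                4.0, 0.0, 5.0, 6.0 };  /* block (1,1), row-major */

Note the explicit zeros stored inside each block: that is exactly the storage and compute overhead mentioned in the bullets after the kernel.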

Blocked Sparse MVM

/* Computes y = y + A*x for a matrix stored in 2x2 blocked CSR (BCSR);
   bm is the number of block rows, i.e. 2*bm scalar rows */
void smvm_2x2(int bm, const int *b_row_start, const int *b_col_idx,
              const double *b_values, double *x, double *y) {
    int i, j;
    double d0, d1, c0, c1;

    /* Loop over bm block rows */
    for (i = 0; i < bm; i++) {
        d0 = y[2 * i];     /* Scalar replacement since reused */
        d1 = y[2 * i + 1];

        /* Dense micro MVM */
        for (j = b_row_start[i]; j < b_row_start[i + 1]; j++, b_values += 2 * 2) {
            c0 = x[2 * b_col_idx[j] + 0]; /* Scalar replacement since reused */
            c1 = x[2 * b_col_idx[j] + 1];

            d0 += b_values[0] * c0;
            d1 += b_values[2] * c0;
            d0 += b_values[1] * c1;
            d1 += b_values[3] * c1;
        }

        y[2 * i] = d0;
        y[2 * i + 1] = d1;
    }
}
  • Temporal locality with respect to both y and x (d0, d1 and c0, c1 are kept in registers for a whole block).
  • Lower storage for the indexes (one column index per block instead of one per non-zero), but higher storage for the data (explicit zeros inside the blocks).
  • Computational overhead due to those explicit zeros.

They discovered that the performance was quite machine dependent; in the beginning they could not see a clear pattern. If you wanted to choose the block size by autotuning, i.e. by actually running the blocked matrix-vector multiplication for every candidate, there would be too many choices: r and c each range from 1 to 12. 🟥 Moreover, the conversion from CSR to blocked CSR is not trivial; the authors estimated it to cost about 10 sparse matrix-vector multiplications, so the total cost of the search would be about 1440 sparse matrix-vector multiplications, which is just too much. See Fast Linear Algebra for the ATLAS model.
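
Spelled out, the search cost is $12 \times 12 = 144$ candidate $(r, c)$ blockings, and converting to each of them costs about $10$ sparse MVMs, i.e. roughly $144 \times 10 = 1440$ sparse MVMs for the whole search.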

Simple model

They created a simple model to compare the performance of the blocked sparse MVM with the standard sparse MVM. For each block size, they measured (once per machine) the speedup on a dense matrix stored in that blocked sparse format, and divided it by the computational overhead (the slowdown caused by the explicit zeros that blocking introduces for the actual matrix). This model has been proven to work quite well.
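
In symbols (my notation, not from the lecture), the model estimates

$\text{speedup}(r, c) \approx \dfrac{\text{speedup on a dense matrix stored in } r \times c \text{ blocked format}}{\text{fill}(r, c)}$, where $\text{fill}(r, c) = \dfrac{\text{stored values, including explicit zeros}}{\text{true non-zeros}}$,

and the $(r, c)$ with the largest estimate is chosen.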

Compression methods for MVM

Sparse MVM is memory bound: every matrix entry is used only once, so performance is limited by how fast the matrix data can be streamed from memory. If you find good ways to compress that data, you can get higher performance!
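
A rough estimate (my numbers, assuming 8-byte values and 4-byte column indices in CSR): each non-zero causes $8 + 4 = 12$ bytes of matrix traffic but only $2$ flops (one multiply, one add), i.e. about $1/6$ flop per byte, far below what the floating-point units could sustain. Shrinking the bytes per non-zero therefore translates almost directly into speedup.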

Value Compression

Sparse Matrix Vector Multiplication-20250331114937698

Index Compression

Sparse Matrix Vector Multiplication-20250331115326387
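
The figures are not reproduced here. As a hedged illustration of the index-compression idea (my own sketch, not necessarily the scheme shown in the lecture), the column indices of a row can be stored as small deltas and decoded on the fly during the MVM:

/* Sketch: column indices stored as unsigned byte deltas within each row.
 * col_delta[row_start[i]] holds the first column of row i, the following
 * entries hold the gaps to the next column; assumes all of these fit in 8 bits. */
void smvm_delta(int m, const double *values, const unsigned char *col_delta,
                const int *row_start, const double *x, double *y) {
    for (int i = 0; i < m; i++) {
        double d = y[i];
        int col = 0;
        for (int j = row_start[i]; j < row_start[i + 1]; j++) {
            col += col_delta[j];       /* decode the next column index */
            d += values[j] * x[col];
        }
        y[i] = d;
    }
}

Compared to plain CSR this cuts the index traffic from 4 bytes to 1 byte per non-zero, at the cost of a small amount of decoding work.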

Pattern-based compression 🟥

Sparse Matrix Vector Multiplication-20250331115128267

Multiple inputs

TODO