Algorithms for Sparse Matrix-Vector Multiplication
Compressed Sparse Row
Compressed Sparse Row (CSR) is an optimized storage format for sparse matrices: values holds the non-zero entries row by row, col_idx holds the column index of each stored entry, and row_start[i] gives the offset of the first non-zero of row i (with row_start[m] equal to the total number of non-zeros).
Sparse MVM using CSR
void smvm(int m, const double* values, const int* col_idx,
          const int* row_start, double* x, double* y) {
    int i, j;
    double d;

    /* Loop over m rows */
    for (i = 0; i < m; i++) {
        d = y[i]; /* Scalar replacement since reused */
        /* Loop over non-zero elements in row i */
        for (j = row_start[i]; j < row_start[i + 1]; j++) {
            d += values[j] * x[col_idx[j]];
        }
        y[i] = d;
    }
}
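As a quick sanity check, here is a minimal driver for the function above; the example matrix and values are my own illustration, not from the lecture.

#include <stdio.h>

int main(void) {
    /* Illustrative 4x4 matrix with 8 non-zeros:
     *   [ 10  0  0  2 ]
     *   [  3  9  0  0 ]
     *   [  0  7  8  7 ]
     *   [  0  0  0  5 ]
     */
    double values[]    = { 10, 2, 3, 9, 7, 8, 7, 5 };
    int    col_idx[]   = { 0, 3, 0, 1, 1, 2, 3, 3 };
    int    row_start[] = { 0, 2, 4, 7, 8 }; /* m + 1 = 5 entries */
    double x[] = { 1, 1, 1, 1 };
    double y[] = { 0, 0, 0, 0 };

    smvm(4, values, col_idx, row_start, x, y);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]); /* prints: 12 12 22 5 */
    return 0;
}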
Let’s analyze this code:
- Spatial locality: with respect to row_start, col_idx, and values we have spatial locality, since all three are traversed sequentially.
- Temporal locality: with respect to y we have temporal locality (but poor temporal locality with respect to $x$, since its accesses depend on the column indices).
- Good storage efficiency for the sparse matrix.
- But it is about 2x slower than dense MVM when the matrix is dense, due to the extra index loads and the irregular accesses to $x$.
Block CSR

With plain CSR we cannot apply blocking optimizations for registers and cache. Block CSR fixes this by storing the matrix as small dense r x c blocks (here 2x2), with a single column index per block:
Blocked Sparse MVM
void smvm_2x2(int bm, const int *b_row_start, const int *b_col_idx,
              const double *b_values, double *x, double *y) {
    int i, j;
    double d0, d1, c0, c1;

    /* Loop over bm block rows */
    for (i = 0; i < bm; i++) {
        d0 = y[2 * i];     /* Scalar replacement since reused */
        d1 = y[2 * i + 1];
        /* Dense micro MVM */
        for (j = b_row_start[i]; j < b_row_start[i + 1]; j++, b_values += 2 * 2) {
            c0 = x[2 * b_col_idx[j] + 0]; /* Scalar replacement since reused */
            c1 = x[2 * b_col_idx[j] + 1];
            d0 += b_values[0] * c0;
            d1 += b_values[2] * c0;
            d0 += b_values[1] * c1;
            d1 += b_values[3] * c1;
        }
        y[2 * i] = d0;
        y[2 * i + 1] = d1;
    }
}
- Temporal locality both with respect to y and x.
- Lower storage for the indexes (one index per block), but higher for the data (explicit zeros inside the blocks are stored).
- Computational overhead due to the stored zeros.
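To make the storage trade-off concrete, here is the same 4x4 example matrix from above stored in 2x2 blocked CSR (my own illustration): one column index now covers four values, but the zeros inside each block are stored and multiplied explicitly.

/* 2x2 blocked CSR for the 4x4 example matrix (blocks stored row-major) */
int    b_row_start[] = { 0, 2, 4 };
int    b_col_idx[]   = { 0, 1, 0, 1 };
double b_values[]    = { 10, 0, 3, 9,   /* block (0,0): [10 0; 3 9] */
                          0, 2, 0, 0,   /* block (0,1): [ 0 2; 0 0] */
                          0, 7, 0, 0,   /* block (1,0): [ 0 7; 0 0] */
                          8, 7, 0, 5 }; /* block (1,1): [ 8 7; 0 5] */
/* 16 stored values for 8 true non-zeros: fill ratio of 2.0 */

Calling smvm_2x2(2, b_row_start, b_col_idx, b_values, x, y) on the same x gives the same y as the CSR version.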
They discovered that the performance was quite machine dependent; at first they could not see a clear pattern. And if you wanted to pick the block size by autotuning, there are too many choices: the block dimensions r and c each range from 1 to 12, giving 144 candidate block sizes. 🟥 The conversion from CSR to blocked CSR is not trivial either: the authors estimated it to cost about 10 sparse matrix-vector multiplications, so the total cost of the exhaustive search would be about 144 x 10 = 1440 matrix-vector multiplications, which is just too much. See Fast Linear Algebra for the ATLAS model.
Simple model
They created a simple model to compare the performance of blocked sparse MVM against the standard sparse MVM without running the full search. For each block size, they measure the performance of the blocked code on a dense matrix stored in sparse format, which gives the machine-dependent speedup; this is then divided by the computational overhead (the slowdown due to the stored zeros, which depends only on the matrix). The block size with the best resulting estimate is chosen. This model has been shown to work quite well.
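A minimal sketch of how such a selection heuristic could look; the table dense_mflops and the routine fill_ratio are my own assumed names, with the ranges taken from the 1-to-12 search space above.

#define RMAX 12
#define CMAX 12

/* Assumed inputs (names are mine):
 * dense_mflops[r][c]: measured speed of r x c blocked MVM on a dense
 *                     matrix in sparse format (machine dependent, one-time)
 * fill_ratio(r, c):   stored entries (incl. explicit zeros) / true non-zeros
 *                     for the given matrix (matrix dependent) */
extern double dense_mflops[RMAX + 1][CMAX + 1];
double fill_ratio(int r, int c);

void pick_block_size(int *best_r, int *best_c) {
    double best = 0.0;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            /* Predicted speed = dense speed / slowdown due to the zeros */
            double estimate = dense_mflops[r][c] / fill_ratio(r, c);
            if (estimate > best) {
                best = estimate;
                *best_r = r;
                *best_c = c;
            }
        }
    }
}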
Compression methods for MVM
Sparse MVM is memory bound, so if you find good ways to compress the data you have to transfer, you can get higher performance!
Value Compression
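One common form of value compression (a sketch under my own assumptions, not necessarily the variant from the lecture): many matrices have far fewer distinct values than non-zeros, so each 8-byte double can be replaced by a 1-byte index into a table of distinct values, reducing the memory traffic of this memory-bound kernel.

#include <stdint.h>

/* Value-compressed CSR MVM (illustrative sketch): val_idx[j] selects the
 * value of the j-th non-zero from a table of at most 256 distinct values. */
void smvm_vc(int m, const double *distinct, const uint8_t *val_idx,
             const int *col_idx, const int *row_start,
             const double *x, double *y) {
    int i, j;
    double d;

    for (i = 0; i < m; i++) {
        d = y[i];
        for (j = row_start[i]; j < row_start[i + 1]; j++) {
            d += distinct[val_idx[j]] * x[col_idx[j]];
        }
        y[i] = d;
    }
}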


Pattern-based compression 🟥

Multiple inputs
TODO