It is wasteful to store zeros elements in a sparse matrix
, especially for incrementally data. When constructing tf-idf
and bag-of-words
features or saving graph ajacent matrix
, non-efficient sparse matrix storage might lead to the memory error
. To circumvent this problems, efficient sparse matrix storage is a choice.
Sparse matrix
A martix is sparse when its sparsity is greater than 0.5, where the sparsity of a matrix is the # of zero-valued elements divided by the total # of elements (which is equal to 1 minus the density of the matrix).[2]
Storage
Each entry in the array represents the element of the matrix and is accessed by the two indices $i$ (row index) and $j$ (colomn index).
Sparse matrix can reduce sustantial memory requirements by storing only the non-zero entries.
Formats can be divided into:
those support efficient modification, such as DOK, LIL or COO, which are typically used to construct the matrices.
those that support efficient access and matrix operations, such as CSR or CSC.
Dictionary of keys (DOK)
- DOK is storing a dictionary that mays
(row, column)
-pairs to the valu of the elements. Elements that are missing from the dictionary are zeros. - DOK based sparse matrix is efficient for constructing sparse matrices incrementally.
- Allows for efficient O(1) access of individual elements. Duplicates are not allowed. Can be efficiently converted to a coo_matrix once constructed.[3]
1 | class scipy.sparse.dok_matrix( |
List of lists (LIL)
- LIL stores one list per row, with each entry containing the column index and the value, where entries are sorted by column index for faster indexing.[4]
- Also efficient for constructing sparse matrices incrementally
1 | class scipy.sparse.lil_matrix( |
Advantages of the LIL format
- supports
flexible slicing
- changes to the matrix sparsity structure are efficient
Disadvantages of the LIL format
- arithmetic operations LIL + LIL are slow (consider CSR or CSC)
- slow column slicing (consider CSC)
- slow matrix vector products (consider CSR or CSC)
Intended Usage
LIL is a convenient format for constructing sparse matrices
once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
consider using theCOO format
when constructing large matrices
Coordinate list (COO)
- COO stores a list of
(row, column, value)
tuples, where entries are sorted first by row index and then by column index. ALso known as theijv
ortriplet
format.[5] - Also efficient for constructing sparse matrices incrementally
A[i[k], j[k]] = data[k]
1 | class scipy.sparse.coo_matrix( |
Advantages of the COO format
- facilitates fast conversion among sparse formats
- permits duplicate entries (see example)
- very fast conversion to and from CSR/CSC formats
Disadvantages of the COO format
- does not directly support
arithmetic operations
andslicing
Intended Usage
- COO is a fast format for constructing sparse matrices
- Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
- By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
Compressed sparse row (CSR, CRS or Yale format)
The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format stores the matrix in three 1-dim arrays (data, col_indices, row_indptr)
, respectively containing non-zero values.
Let NNZ
denote the # of nonzero entries.[6]
- The arrays
data
(non-zero values) andcol_indices
(column indices) are of lengthNNZ
. - The array
row_indptr
has one element per row in the matrix and encodes the index indata
where the given row starts. - The first value of
row_indptr
is always 0 and the last is always set toNNZ
.
Insantiate with:
csr_matrix(D)
, with a dense matrix or rank-2 ndarray Dcsr_matrix(S)
, with another sparse matrix S (equivalent to S.tocsr())csr_matrix((M, N), [dtype])
, to construct an empty matrix with shape (M, N) dtype is optional, defaulting to dtype=’d’.csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].csr_matrix((data, indices, indptr), [shape=(M, N)])
.row = row_indptr[i: i+1]
col = indices[row]
A[row,col] = data[row]
Advantages of the CSC format
- efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
- efficient column slicing
- fast matrix vector products
Disadvantages of the CSC format
- slow column slicing operations (consider CSC)
- changes to the sparsity structure are expensive (consider LIL or DOK)
Compressed sparse column (CSC or CCS)
CSC is similar to CSR except that values are read first by column, storing row indices for each value.
1 | 0, 2, 3, 6]) # col_indptr indptr = np.array([ |
Advantages of the CSC format
- efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
- efficient column slicing
- fast matrix vector products (CSR, BSR may be faster)
Disadvantages of the CSC format
- slow row slicing operations (consider CSR)
- changes to the sparsity structure are expensive (consider LIL or DOK)
Shuffle sparse matrix
When generating batches of training/test data, sklearn.utils.shuffle
can help.
1 | from sklearn.utils import shuffle |
Feed to TensorFlow
Convert to dense matrix
1 | x_dense_batch = x_csr_batch.todense() |
Feed to tf.sparse.placeholder
1 | import tensorflow as tf |
After feeding into the tensorflow computational graph, use tf.sparse
since SparseTensor cannot be directly feed into a Tensor op!
Use
tf.sparse
1
2
3w = tf.Variable(tf.random_normal([dim1, dim2], mean=.0, stddev=.01))
b = tf.Variable(tf.random_normal([dim2], mean=.0, stddev=.01))
sp_x = tf.sparse.sparse_dense_matmul(sparse_input, w) + bConvert the
SparseTensor
to denseTensor
1
2
3
4
5
6
7
8
9
10
11
12
13tf.sparse.to_dense(
sp_input, default_value=None, validate_indices=True, name=None
)
"""
Args:
-------------------
sp_input: SparseTensor
default_value: Scalar value to set for indices not specified in sp_input. Defaults to zero.
validate_indices: boolean (default True). If True, indices are checked to assure sorted lexicographic order and no repeats.
name (optional): a name prefix for returned tensors.
"""
dense_x = tf.sparse.to_dense(sparse_input)