Replies: 4 comments 4 replies
-
Hi @tvercaut. We haven't written any convenience API for the sparse solvers so far; we mostly use them under the hood for our optimizers. I'll try to find time to write a short example in the next few days. You don't really need any of the linearization stuff, but you need to construct the right structures to pass to the solver. Also, the comment about `reset` is there because our optimizers call it internally.
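In the meantime, for concreteness: my understanding is that the "right structures" are essentially a CSR description of `A` (row pointers, column indices, and a batch of values), which the examples further down build with scipy. A minimal sketch of those pieces (the names here are only illustrative):

```python
import numpy as np
import scipy.sparse

# Sketch only: the sparse solvers consume a CSR description of A.
A_csr = scipy.sparse.random(100, 100, density=0.1, format="csr", dtype=np.float64)

row_ptr = A_csr.indptr    # shape (num_rows + 1,): start offset of each row
col_ind = A_csr.indices   # shape (nnz,): column index of each stored entry
values = A_csr.data       # shape (nnz,): batched as (batch_size, nnz) in the examples below
```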
-
One quick comment, just in case you haven't seen this yet. Our solvers work such that, given `A` and `b`, they solve the normal-equations system `A^T A x = A^T b` (optionally with damping) rather than `Ax = b` directly. The underlying function that solves the linear system can be imported directly; the examples below show the relevant imports.
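To make that convention concrete, here is a dense reference of the computation (my own sketch, no damping):

```python
import torch


def dense_normal_equations_solve(A: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Dense reference: solve A^T A x = A^T b for a batch of problems (no damping).

    A: (batch, num_rows, num_cols), b: (batch, num_rows) -> x: (batch, num_cols).
    """
    AtA = A.transpose(-2, -1) @ A
    Atb = (A.transpose(-2, -1) @ b.unsqueeze(-1)).squeeze(-1)
    return torch.linalg.solve(AtA, Atb)
```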
-
I've also asked @maurimo, who developed all the sparse solvers and is also the author of BaSpaCho, to offer some guidance. But, in general, if your goal is to solve a linear system without it being part of an optimization problem, you don't really need to go through the full optimization/linearization machinery. Ideally, a lot of this would be wrapped up in a higher-level API, but we are bandwidth-limited at the moment. We actively welcome community contributions, so we would be more than happy to guide you if you want to take a stab at this.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from theseus.extlib.baspacho_solver import SymbolicDecomposition
from theseus.optimizer.autograd import BaspachoSolveFunction
from theseus.optimizer.linear_system import SparseStructure
from theseus.utils import random_sparse_binary_matrix
device = "cuda"
rng = torch.Generator(device=device)
num_rows = 100
num_cols = 100
fill = 0.1
batch_size = 16
A_skel = random_sparse_binary_matrix(
num_rows, num_cols, fill, min_entries_per_col=1, rng=rng
)
A_val = torch.rand(
(batch_size, A_skel.nnz), dtype=torch.double, device=device, generator=rng
)
b = torch.randn(
(batch_size, num_rows), dtype=torch.double, device=device, generator=rng
)
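# SparseStructure records only the CSR sparsity pattern of A (column indices + row
# pointers); the numeric values live separately in A_val so they can differ per batch element.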
structure = SparseStructure(
A_skel.indices,
A_skel.indptr,
num_rows,
num_cols,
dtype=np.float64,
)
# convert the CSR pattern to GPU tensors for the accelerated A^T A computation
A_row_ptr = torch.tensor(structure.row_ptr, dtype=torch.int64).to(device)
A_col_ind = torch.tensor(structure.col_ind, dtype=torch.int64).to(device)
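# BaSpaCho works on a block (parameter-wise) structure: var_dims lists the size of each
# parameter block (the sizes must sum to num_cols) and var_start_cols their starting columns.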
var_dims = [10, 20, 10, 20, 10, 20, 10]
var_start_cols = np.cumsum([0, *var_dims[:-1]])
# compute block-structure of AtA.
At_mock = structure.mock_csc_transpose()
num_vars = len(var_start_cols)
to_blocks = csr_matrix(
(
np.ones(num_cols),
np.arange(num_cols),
[*var_start_cols, num_cols],
),
(num_vars, num_cols),
)
block_At_mock = to_blocks @ At_mock
block_AtA_mock = (block_At_mock @ block_At_mock.T).tocsr()
block_AtA_mock.sort_indices()
param_size = torch.tensor(var_dims, dtype=torch.int64)
block_struct_ptrs = torch.tensor(block_AtA_mock.indptr, dtype=torch.int64)
block_struct_inds = torch.tensor(block_AtA_mock.indices, dtype=torch.int64)
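# The symbolic decomposition depends only on the block sparsity pattern of AtA, so it can
# be built once and reused while the numeric values (A_val, b) change.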
symbolic_decomposition = SymbolicDecomposition(
param_size, block_struct_ptrs, block_struct_inds, device
)
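# (alpha, beta) parameterize the damping applied to AtA before solving; zeros should give
# an undamped solve.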
alpha = torch.rand(batch_size, device=device, dtype=torch.double, generator=rng)
beta = torch.rand(batch_size, device=device, dtype=torch.double, generator=rng)
x = BaspachoSolveFunction.apply(
A_val,
b,
structure,
A_row_ptr,
A_col_ind,
symbolic_decomposition,
(alpha, beta),
)
print(x)
```
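To sanity-check the result, I densified `A` and compared against a dense solve. The damping convention used below, `AtA_damped = AtA + alpha * diag(AtA) + beta * I`, is my assumption about what `(alpha, beta)` mean and is worth double-checking against test_baspacho_sparse_backward.py:

```python
# Sanity check against a dense solve; reuses the tensors defined above.
# Assumed damping convention: AtA_damped = AtA + alpha * diag(AtA) + beta * I.
A_dense = torch.zeros(batch_size, num_rows, num_cols, dtype=torch.double, device=device)
row_of_entry = torch.repeat_interleave(
    torch.arange(num_rows, device=device), A_row_ptr[1:] - A_row_ptr[:-1]
)
A_dense[:, row_of_entry, A_col_ind] = A_val

AtA = A_dense.transpose(-2, -1) @ A_dense
Atb = (A_dense.transpose(-2, -1) @ b.unsqueeze(-1)).squeeze(-1)
eye = torch.eye(num_cols, dtype=torch.double, device=device)
AtA_damped = AtA + alpha.view(-1, 1, 1) * AtA * eye + beta.view(-1, 1, 1) * eye
x_ref = torch.linalg.solve(AtA_damped, Atb)
print((x - x_ref).abs().max())  # should be ~0 if the assumed convention is right
```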
-
@tvercaut Here is an example for (the non-differentiable) `CusolverLUSolver`:

```python
import numpy as np
# explicit submodule imports (a bare `import scipy` may not expose scipy.io / scipy.sparse.linalg)
import scipy.io
import scipy.sparse
import scipy.sparse.linalg
import torch
from theseus.extlib.cusolver_lu_solver import CusolverLUSolver
from theseus.utils import Timer
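# raefsky4 / cfd2 are matrices from the SuiteSparse Matrix Collection; adjust the path
# below to wherever the .mtx files were downloaded.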
matrix = "raefsky4" # "cfd2"
A_np_coo = scipy.io.mmread(f"sparse/{matrix}/{matrix}.mtx")
A_np_csr = scipy.sparse.csr_matrix(A_np_coo)
b_np = np.random.randn(A_np_coo.shape[1])
batch_size = 8
timer_cpu = Timer("cpu")
with timer_cpu:
for _ in range(batch_size):
scipy.sparse.linalg.spsolve(A_np_csr, b_np)
print(timer_cpu.elapsed_time)
A_row_ptr = torch.tensor(A_np_csr.indptr).cuda()
A_col_ind = torch.tensor(A_np_csr.indices).cuda()
A_val = torch.tensor(A_np_csr.data).cuda().repeat(batch_size, 1)
A_num_rows = A_row_ptr.size(0) - 1
A_num_cols = A_num_rows
b = torch.tensor(b_np).cuda().repeat(batch_size, 1)
timer_gpu = Timer("cuda")
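# CusolverLUSolver.solve overwrites its argument in place, so solve on a copy of b
# rather than on b itself.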
x = b.clone()
with timer_gpu:
slv = CusolverLUSolver(batch_size, A_num_cols, A_row_ptr, A_col_ind)
slv.factor(A_val)
slv.solve(x)
print(timer_gpu.elapsed_time)
b_sol = A_np_csr @ x[0].cpu().numpy()
print(np.linalg.norm(b_np - b_sol))
```
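If your starting point is a PyTorch sparse CSR tensor rather than an `.mtx` file (as in the original question), the same ingredients can be pulled straight out of the tensor. A sketch with placeholder inputs (`A_torch`, `b_torch`); the `int32` casts mirror the scipy indices used above and may need adjusting:

```python
import torch
from theseus.extlib.cusolver_lu_solver import CusolverLUSolver

# Placeholder inputs: a diagonally dominant (hence nonsingular) sparse CSR matrix
# and a dense right-hand side, both on the GPU.
n = 64
A_dense = torch.rand(n, n, dtype=torch.double, device="cuda")
A_dense[A_dense < 0.7] = 0.0
A_dense += n * torch.eye(n, dtype=torch.double, device="cuda")
A_torch = A_dense.to_sparse_csr()
b_torch = torch.rand(n, dtype=torch.double, device="cuda")

batch_size = 1
A_row_ptr = A_torch.crow_indices().to(torch.int32)  # scipy's indptr above is int32
A_col_ind = A_torch.col_indices().to(torch.int32)
A_val = A_torch.values().repeat(batch_size, 1)
b = b_torch.repeat(batch_size, 1)

slv = CusolverLUSolver(batch_size, n, A_row_ptr, A_col_ind)
slv.factor(A_val)
x = b.clone()
slv.solve(x)
print(torch.linalg.norm(A_dense @ x[0] - b_torch))  # residual check
```

Note that LU does not exploit symmetry, so for a matrix that is known to be SPD the Cholesky-based paths (CHOLMOD or BaSpaCho) are likely a better fit.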
-
PyTorch provides some out-of-the-box support for sparse matrices:
https://pytorch.org/docs/stable/sparse.html
However, as discussed for example in pytorch/pytorch#69538, there is limited support for linear algebra operations on these sparse tensors.
I read in the Theseus README that Theseus provides sparse linear solvers (CHOLMOD, LU, BaSpaCho) with GPU support. However, their usage is not clear to me. I couldn't find a simple tutorial for them, but maybe I missed it.
Suppose I have a sparse CSR PyTorch tensor `A` (which I know is SPD) and a dense PyTorch vector `b`, both on the GPU. Is there a simple way of using, say, BaSpaCho to solve for `Ax = b`?
Looking at test_baspacho_sparse_backward.py, I tried things along the lines of the below, but it didn't work directly and it feels more complicated than it ought to be.
Are there any convenience wrappers that I missed and would allow to switch easily between sparse solvers?