Structured Sketching for Linear Systems
Johannes J. Brust and Michael A. Saunders
Version of June 30, 2024. Submitted to the editors Summer 2024. Funding: This work was partially funded by the startup fund at Arizona State University.
Abstract
For linear systems we develop iterative algorithms based on a sketch-and-project approach. By using judicious choices for the sketch, such as the history of residuals, we develop weighting strategies that enable short recursive formulas. The proposed algorithms have a low memory footprint and iteration complexity compared to regular sketch-and-project methods. In a set of numerical experiments the new methods compare well to GMRES, SYMMLQ and state-of-the-art randomized solvers.
Keywords: randomized sketching, CG, GMRES, SYMMLQ, sketch-and-project, Kaczmarz method, data science
MSC codes: 15A06, 15B52, 65F10, 68W20, 65Y20, 90C20
1 Introduction
For data science and scientific computing, consider the solution of large, possibly sparse, general linear systems
$$A x = b, \tag{1}$$
where $A \in \mathbb{R}^{m \times n}$ is a real matrix, the unknowns are $x \in \mathbb{R}^{n}$, and $b \in \mathbb{R}^{m}$ is the right-hand side. We develop methods for general systems with a focus on square and overdetermined problems. Direct algorithms use a factorization of $A$ (Golub and Van Loan [10]). With appropriate pivoting strategies, these methods are very accurate and reliable. Sparse factorizations exist for almost all direct algorithms (Davis [7]). Nevertheless, sparse pivoting strategies are sometimes too costly. When $A$ is only available as a linear operator, or the system is too large or otherwise not suitable for a direct method, then iterative methods are most effective (Saad [17], Barrett et al. [2]). In particular, for modern data-driven applications, system Eq. 1 may constitute a random subset of a larger dataset, or it may be contaminated by noise, so that highly accurate solutions are not needed or even desired. In this context, sketching methods have become popular (Woodruff et al. [20]), especially in the artificial intelligence and machine learning community (Liberty [13]). For some sketching matrix $S \in \mathbb{R}^{m \times r}$ with $r \le m$, a solution of
$$S^T A x = S^T b \tag{2}$$
approximates $x$ in the original system (1). If $r$ is not too large, Blendenpik [1] and LSRN [14] use (2) to construct a preconditioner for LSQR to solve (1). The value of $r$ needs to be set in advance. Higher values of $r$ usually result in higher computational cost, but also improved numerical performance.
The methods of Gower and Richtárik [11] and Richtárik and Takáč [16] solve a sequence of sketched problems
with a different random each time, a symmetric positive definite , and . This process is referred to as a Sketch-and-Project approach.
In Gower and Richtárik’s analysis [11], the convergence rate for solving (1) () depends on the smallest eigenvalue of a certain matrix. Because the convergence depends on a rate that can be arbitrarily close to one, the observed numerical performance may be slow.
1.1 Notation
The integer represents the iteration index. Vector denotes the column of the identity matrix , with dimension depending on the context. The row of is , while the column is . For the solution estimate , the residual vector is , with associated vector . We abbreviate “symmetric positive definite” to spd and “symmetric indefinite” to sid. Lower-case Greek letters represent scalars. For an spd matrix , the scaled 2-norm of an -vector is . The standard 2-norm is .
1.2 Sketching methods
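Based on the idea that the next iterate solves a sketched system, a class of methods can be derived in the form
$$S_k^T A x_k = S_k^T b, \tag{3}$$
$$x_k = x_{k-1} + p_k, \tag{4}$$
for $k = 1, 2, \ldots$, where $S_k$ is $m \times r$ with $r \le m$, and $p_k$ is the update. Note that $S_k$ in Eq. 3 has index $k$ and depends on the iteration. To compute $p_k$, we can substitute Eq. 4 into Eq. 3 and solve the sketched system in the equivalent form
$$S_k^T A p_k = S_k^T r_{k-1}, \qquad r_{k-1} = b - A x_{k-1}. \tag{5}$$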
When the sketch is $S_k = e_i$ (one random column of the identity), the minimum-norm solution of Eq. 3 and hence (5) is
$$p_k = \frac{e_i^T r_{k-1}}{\|A^T e_i\|_2^2}\, A^T e_i .$$
This update and the corresponding iterate define the popular randomized Kaczmarz method (Strohmer and Vershynin [19]). Note that for a symmetric full-rank square matrix $W$ and sketch $S_k = e_i$, the update
$$p_k = \frac{e_i^T r_{k-1}}{e_i^T A W A^T e_i}\, W A^T e_i \tag{6}$$
is also a valid solution of Eq. 5 (provided the denominator is nonzero). Nevertheless, to the best of our knowledge, an update like Eq. 6 remains largely unexplored. Importantly, there exist infinitely many updates like Eq. 6 because there exist arbitrarily many $W$'s.
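The single-row update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `randomized_kaczmarz`, the uniform row sampling, the fixed iteration count, and the small Gaussian test problem are all assumptions made here for the example.

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Project the iterate onto one randomly chosen equation per step."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.einsum("ij,ij->i", A, A)  # squared 2-norm of each row
    for _ in range(iters):
        i = rng.integers(m)                     # uniform row choice (assumption)
        r_i = b[i] - A[i] @ x                   # i-th residual component
        x = x + (r_i / row_norms_sq[i]) * A[i]  # minimum-norm correction
    return x

# Small consistent test problem (hypothetical data).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x = randomized_kaczmarz(A, b)
kaczmarz_error = np.linalg.norm(x - x_true)
```

Each step touches only one row of the matrix, which is why the method is cheap per iteration but may need many iterations.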
1.3 Contributions
In the context of sketch-and-project methods, this article develops judicious weights $B$, which we view as implicit preconditioning strategies. Instead of using a uniform weight across all problems (e.g., $B = I$ for all problems) we consider weights for different types of linear system. For instance, if $A$ is symmetric we consider weighting by the matrix itself (i.e., $B = A$). Each specific choice of $B$ results in a different method. When $A$ is symmetric indefinite, we develop a new method that performs well compared to state-of-the-art random or deterministic methods. For general $A$, we develop a $B$ that results in a nested algorithm. Numerical experiments on relatively large sparse systems demonstrate that the proposed method is effective compared to GMRES and alternative weightings.
2 PLSS
The Projected Linear Systems Solver (PLSS) (Brust and Saunders [6]) is a family of methods that allows the use of deterministic or random sketches. The main assumptions about the sketch are its size and rank at iteration $k$:
$$S_k \in \mathbb{R}^{m \times k}, \qquad \operatorname{rank}(S_k) = k. \tag{7}$$
Arbitrary sketches (7) with updates computed from (17) (see below) lead to an iteration that enjoys a finite termination property: convergence to a solution of in at most iterations in exact arithmetic. This result differs from convergence for the expected difference with a rate in [11], because it guarantees a solution in finitely many steps in exact arithmetic. We emphasize that the finite termination property also applies when the sketch is random. Note too that conditions (7) for the sketch are general. In particular, the sketch can be generated each iteration from scratch as in conventional methods [11, 16], where a random normal matrix is recomputed at each iteration. In these conventional implementations the random normal sketches do not expand as but use a constant subspace size . We believe this is the reason why the methods in [11, 16] don’t have the finite termination property.
Instead of recomputing the sketch at every iteration, another practical possibility is to expand the sketch recursively by appending one new column $s_k$:
$$S_k = [\, S_{k-1} \;\; s_k \,].$$
Of course, a direct implementation of expanding sketches results in growing memory and computational complexity. However, for judicious sketch choices we can develop a very efficient recursion.
To use the sketch (7) in a practical method, note that the solution of a consistent linear system (square, overdetermined or underdetermined) can be computed via the iteration
$$x_k = x_{k-1} + p_k, \tag{8}$$
where $p_k$ solves the block system
$$\begin{bmatrix} B & A^T S_k \\ S_k^T A & 0 \end{bmatrix} \begin{bmatrix} p_k \\ \lambda_k \end{bmatrix} = \begin{bmatrix} 0 \\ S_k^T r_{k-1} \end{bmatrix}. \tag{9}$$
Here, $B$ is an arbitrary nonsingular symmetric matrix and $\lambda_k$ is an auxiliary vector that does not have to be explicitly computed. System Eq. 9 has a unique solution when $S_k^T A B^{-1} A^T S_k$ has full rank. It corresponds to the first-order optimality conditions for the optimization problem (17) below. (Linear equality constrained optimization is discussed in [5, 4].) We emphasize that iteration Eq. 8–Eq. 9 applies to general linear systems $Ax = b$. The iteration is parametrized by a symmetric parameter matrix $B$ and by the choice of sketching matrix $S_k$.
2.1 Update formulas
Initially assume that $S_k^T A W A^T S_k$ has full rank, where $W = B^{-1}$, so that its inverse exists. Solving Eq. 9 gives the explicit formula
$$p_k = W A^T S_k \left( S_k^T A W A^T S_k \right)^{-1} S_k^T r_{k-1}. \tag{10}$$
This is the basis for straightforward randomized solvers in which $S_k$ is chosen as a random sketch. For instance, the approaches of Gower and Richtárik [11] use random normal sketches with rank parameter $r$. Because these methods need to solve with the $r \times r$ matrix $S_k^T A W A^T S_k$, typically $r$ is a relatively small integer. However, for small $r$ the method can converge only slowly.
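As a concrete illustration of the explicit update, the following Python sketch evaluates one sketch-and-project step with a random normal sketch and checks that the new iterate satisfies the sketched system. The helper name `sketch_update`, the problem sizes, and the default weight (the identity) are assumptions chosen for the example, not part of the original text.

```python
import numpy as np

def sketch_update(A, r, S, W=None):
    """Return p = W A^T S (S^T A W A^T S)^{-1} S^T r for a given sketch S."""
    n = A.shape[1]
    W = np.eye(n) if W is None else W   # identity weight unless one is supplied
    G = W @ (A.T @ S)                   # n-by-r matrix W A^T S
    M = S.T @ A @ G                     # small r-by-r system matrix
    return G @ np.linalg.solve(M, S.T @ r)

# Hypothetical consistent test problem with a rank-3 Gaussian sketch.
rng = np.random.default_rng(0)
m, n, r_size = 12, 8, 3
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)
x = np.zeros(n)
S = rng.standard_normal((m, r_size))

p = sketch_update(A, b - A @ x, S)
x_new = x + p
sketched_residual = np.linalg.norm(S.T @ (b - A @ x_new))
```

The only linear solve involves the small $r \times r$ matrix, which is why the sketch size is usually kept modest in this style of method.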
2.2 Finite termination
In contrast to using a fixed-size sketch, with sketches Eq. 7 we can prove finite termination for iteration Eq. 8–Eq. 9. We emphasize that the sketch can be a random matrix, such as a standard normal (Gaussian) matrix, or it could come from a deterministic process. In the following we describe some of the intuition for proving finite termination. To keep matters simple we analyse the situation of a nonsingular square system ($m = n$). (Detailed results for square, rectangular and singular problems are in [6].) Suppose iteration Eq. 8–Eq. 9 has not yet converged. This means that at iteration $k = n$, the sketch is a full-rank square matrix $S_n \in \mathbb{R}^{n \times n}$. Therefore, the update formula is
$$p_n = W A^T S_n \left( S_n^T A W A^T S_n \right)^{-1} S_n^T r_{n-1}.$$
By the assumption that both $A$ and $S_n$ are square (and nonsingular), the inverse factors as $(S_n^T A W A^T S_n)^{-1} = S_n^{-1} A^{-T} W^{-1} A^{-1} S_n^{-T}$, so the update formula implies
$$p_n = A^{-1} r_{n-1},$$
and hence
$$x_n = x_{n-1} + A^{-1} (b - A x_{n-1}) = A^{-1} b.$$
Since $x_n$ is the solution of (1) (for square $A$) we conclude that the iteration converges in at most $n$ iterations, independent of whether the sketch is random or deterministic. The main assumption is that the sketch satisfies the rank condition Eq. 7. Finite termination is a useful and desirable property because it leads to efficient methods.
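The finite termination argument can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the identity is used as the weight, the sketch grows by one random Gaussian column per iteration, and the update is computed as the minimum-norm solution of the sketched system via `numpy.linalg.lstsq` (which coincides with the explicit formula for an identity weight and a full-rank sketch). The test matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # nonsingular square test matrix
b = rng.standard_normal(n)

x = np.zeros(n)
S = np.empty((n, 0))                               # sketch starts with no columns
residual_norms = [np.linalg.norm(b - A @ x)]
for k in range(1, n + 1):
    # Expand the sketch by one random column, so S has k columns at step k.
    S = np.hstack([S, rng.standard_normal((n, 1))])
    r = b - A @ x
    # Minimum-norm p with S^T A p = S^T r (identity weight).
    p = np.linalg.lstsq(S.T @ A, S.T @ r, rcond=None)[0]
    x = x + p
    residual_norms.append(np.linalg.norm(b - A @ x))

final_residual = residual_norms[-1]
```

After $n$ steps the sketch is square and nonsingular, so the sketched system is equivalent to the original one and the residual drops to rounding level even though every column was random.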
2.3 Sketch variations
To keep computational cost low, the original versions of PLSS consider mainly simple weights for and its inverse , such as or . On the other hand, the sketch in PLSS is developed meticulously, typically by recursively expanding the previous sketch. Different choices for the columns in (7) are
(11) |
Each of these variations results in a different method. However, the effects of the weighting matrix remain largely unexplored.
2.4 Orthogonal updates
We note that when the product is a multiple of , say , for some scalar , then the updates in (10) are orthogonal with respect to the weighting matrix :
(12) |
In order to see this orthogonality, suppose is the matrix of identity columns, so that . From (10) this means that and therefore
In other words, the updates are orthogonal with respect to the inner product defined by , so that (12) is valid.
2.5 PLSS residual
When the sketch consists of previous residuals, i.e., $s_i = r_{i-1}$ for $1 \le i \le k$, and $B$ is spd, remarkably the update formula (10) simplifies to a short one-step recurrence (we refer to this method as PLSS residual). Further, note that all residuals (with this sketch) are orthogonal. In particular, the residuals satisfy
$$r_k = b - A x_k = r_{k-1} - A p_k,$$
and therefore the second block row in (9) results in
$$S_k^T r_k = S_k^T r_{k-1} - S_k^T A p_k = 0. \tag{13}$$
As $S_k$ contains all previous residuals, (13) implies orthogonality of all residuals. This is an important property of PLSS with residual sketches. For instance, the product $S_k^T r_{k-1}$ simplifies to $\|r_{k-1}\|_2^2\, e_k$. Using this property and further simplifications, we can reduce the explicit formula in (10) to a short recursion ([6, Theorem 1]):
(14) |
where
Method (14) is extremely efficient compared to the full update formula (10), even though they are mathematically equivalent (provided that $S_k$ is built from past residuals).
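The orthogonality of the residuals can be observed directly by running the explicit (long) update with a residual sketch. The following Python sketch does not implement the short recurrence; it assumes an identity weight, stores all residuals as sketch columns, and computes each update as the minimum-norm solution of the sketched system. The test matrix and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # hypothetical nonsingular matrix
b = rng.standard_normal(n)

x = np.zeros(n)
residuals = []
for k in range(1, n):                   # stop one step before termination
    r = b - A @ x
    residuals.append(r)
    S = np.column_stack(residuals)      # sketch = history of residuals
    sketched_rhs = S.T @ r              # should equal ||r||^2 times e_k
    p = np.linalg.lstsq(S.T @ A, sketched_rhs, rcond=None)[0]
    x = x + p

R = np.column_stack(residuals)
gram = R.T @ R                          # Gram matrix of all stored residuals
off_diag = gram - np.diag(np.diag(gram))
max_off_diag = np.max(np.abs(off_diag))
```

The Gram matrix is diagonal up to rounding, and the last sketched right-hand side has a single nonzero entry, which is the structure that makes a short recurrence possible.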
2.6 PLSS Kaczmarz
A generalization of the randomized Kaczmarz method can be developed from PLSS when the sketching matrix is augmented with one random identity column at each iteration: for . The sketch is then and the second block row from (9) implies that . Using the definition , we define a diagonal matrix
The generalized randomized Kaczmarz method is then given (after simplification of (10)) by the updates [6]
(15) |
where
3 New methods
Here we develop new weighting strategies depending on the properties of $A$. Since $B$ is symmetric, system Eq. 9 corresponds to the first-order optimality conditions of a certain optimization problem. The left plot of Fig. 1 shows the updates generated from a minimization process with an spd $B$, while the right plot corresponds to a symmetric indefinite $B$. Our discussion starts with spd and least-squares problems before describing symmetric indefinite systems and then general (nonsymmetric) square systems. From now on we use the notation
(16) | ||||
3.1 Symmetric positive definite
When $B$ is spd, the block system Eq. 9 corresponds to the optimality conditions of the optimization problem
$$\min_{p} \; \tfrac{1}{2}\, p^T B p \quad \text{subject to} \quad S_k^T A p = S_k^T r_{k-1}. \tag{17}$$
The choice $B = I$ gives the minimum-norm objective $\tfrac{1}{2} \|p\|_2^2$. For the solution to be a global minimizer of (17), $B$ must be positive semidefinite. Further, when $B$ and $S_k^T A$ have full rank, $p_k$ is the unique minimizer. When $B$ is any spd matrix, the objective in Eq. 17 can equivalently be written as $\tfrac{1}{2} \|p\|_B^2$ (thus $B$ can be interpreted as a weighting). Recall that with a general sketch, the solution of Eq. 9 is given by Eq. 10. However, when previous residuals make up the sketch columns, the update reduces to the short recurrence Eq. 14. If $A$ is spd, we would like to exploit this property. Possible choices for $W = B^{-1}$ are $A$ or $A^{-1}$, which are both spd.
3.1.1 Symmetric positive definite: Method 1
Since in recurrence Eq. 14 is expressed in terms of , an immediate choice of weighting could be , so that . The only difficulty for implementing the recurrence in this case is computing the scalar
as it depends on (which is not available). Nevertheless, we can exploit the short recurrence Eq. 14 to deduce a relation for . Premultiplying
by and using the notation Eq. 16, we see that satisfies
With this relation a short recursive algorithm can be developed that does not require any computations with $A^{-1}$. First, this method is based on residuals for the sketch; hence it derives from the relations in Eq. 14. Second, we specify the arbitrary weight to be $W = A$ and ensure that all scalars in Eq. 16 can be computed.
Algorithm 1: PLSS (spd ) | ||
Given: | ||
for | ||
end |
A few notes about Algorithm 1. First, it is designed for symmetric $A$ and therefore any occurrences of $A^T$ can be replaced by $A$. Second, it has low memory usage: only six vectors of size $n$ are updated. Third, it is computationally efficient, especially compared to randomized algorithms that compute the update via Eq. 10 (see [11, 16]). Compared to other iterative methods it is only moderately efficient, using three matvec operations per iteration. Also, it possesses the finite termination property because the sketch consists of the history of residuals, by virtue of its derivation from (14).
3.1.2 Symmetric positive definite: Method 2
We can develop a second algorithm by choosing $B = A$ so that $W = A^{-1}$. At first this choice seems to make recursion Eq. 14 difficult to compute because the vector
$$W A^T r_{k-1} = A^{-1} A^T r_{k-1}$$
is needed in the update. However, when $A$ is symmetric we have
$$A^{-1} A^T r_{k-1} = A^{-1} A\, r_{k-1} = r_{k-1}.$$
In other words, the computations with the inverse cancel out. The scalars can also be computed directly:
where . Because satisfies
can be found to follow the relation
The algorithm using this weighting follows.
Algorithm 2: PLSS (spd ) | ||
Given: | ||
for | ||
end |
As before, $A^T$ can be replaced by $A$. The memory footprint is low (storing and updating only five vectors), and importantly, the algorithm is very efficient: each iteration needs only one matvec operation. Again the method enjoys finite termination. Finally, the residuals generated by this algorithm are equivalent to those from CG [12].
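The stated equivalence with CG can be illustrated numerically. The sketch below does not reproduce Algorithm 2; instead it evaluates the explicit update (10) with the inverse of the matrix as the weight and a residual sketch, which for symmetric matrices reduces to $p_k = S_k (S_k^T A S_k)^{-1} S_k^T r_{k-1}$, and compares the iterates with a textbook CG loop. Function names, the normalization of sketch columns (which does not change the update), and the test matrix are assumptions made for this example.

```python
import numpy as np

def cg_iterates(A, b, steps):
    """Textbook conjugate gradient iterates starting from zero."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    out = []
    for _ in range(steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        out.append(x.copy())
    return out

def explicit_residual_sketch_iterates(A, b, steps):
    """Explicit update with weight A^{-1} (symmetric A) and residual sketch."""
    x = np.zeros_like(b)
    cols, out = [], []
    for _ in range(steps):
        r = b - A @ x
        cols.append(r / np.linalg.norm(r))   # column scaling leaves p unchanged
        S = np.column_stack(cols)
        p = S @ np.linalg.solve(S.T @ A @ S, S.T @ r)
        x = x + p
        out.append(x.copy())
    return out

# Hypothetical small spd test problem.
rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)
b = rng.standard_normal(n)

cg_x = cg_iterates(A, b, steps=5)
sketch_x = explicit_residual_sketch_iterates(A, b, steps=5)
max_gap = max(np.linalg.norm(u - v) for u, v in zip(cg_x, sketch_x))
```

Both loops enforce that the new residual is orthogonal to all previous residuals within the same Krylov subspace, which is why the iterates agree to rounding error; the short recurrence obtains the same result without storing the sketch.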
3.1.3 Comparison: Symmetric positive definite
Detailed comparisons of various methods are reported in Section 4 (Numerical Experiments). Here we provide some intuition on the efficacy of the previous algorithms. We include a regular randomized method, where the sketch is computed as a standard normal random matrix and the update is obtained by explicitly evaluating Eq. 10. The algorithms are applied to a small spd matrix with $n = 10$. The random sketch has a fixed size $r$, and all methods are initialized with the same zero vector $x_0 = 0$. The results are in Table 1. We make a few observations about the table. First note the finite termination property. PLSS with residual sketches has finite termination, because all residuals are orthogonal by construction. The methods converge to the solution in at most $n$ iterations. One can see this taking effect for $k = 10$, where the residual norms drop rapidly to convergence tolerances. In contrast, the iteration with random normal sketches converges based on a rate and typically converges much more slowly in practice. Further, the computational costs of the PLSS algorithms are very low because of the short recurrences. On the other hand, the standard Randn algorithm needs to perform updates via the costly formula Eq. 10. Finally, PLSS (Algorithm 2) tends to converge much more rapidly than PLSS (Algorithm 1). Since Algorithm 2 uses only one matrix-vector multiply per iteration as opposed to three for Algorithm 1, we view the former as the better of the two.
$k$ | Randn. | PLSS (Algorithm 1) | PLSS (Algorithm 2) |
---|---|---|---|
0 | 4.1653e+04 | 4.1653e+04 | 4.1653e+04 |
1 | 1.5470e+04 | 1.3794e+04 | 6.6381e+03 |
2 | 1.0068e+04 | 1.5185e+04 | 1.1624e+03 |
3 | 6.9911e+03 | 3.9838e+03 | 3.2431e+02 |
4 | 2.9346e+03 | 2.6166e+03 | 7.5577e+01 |
5 | 1.8590e+03 | 1.4479e+03 | 1.6274e+01 |
6 | 1.3115e+03 | 9.2966e+02 | 1.7461e+00 |
7 | 1.0382e+03 | 3.0200e+02 | 2.3732e-01 |
8 | 1.0255e+03 | 9.5402e+01 | 2.1046e-02 |
9 | 7.0308e+02 | 2.2631e+01 | 1.5050e-03 |
10 | 6.9981e+02 | 2.2689e-09 | 3.1511e-16 |
3.2 Least squares
When problem Eq. 1 is overdetermined with $m > n$, the system is typically inconsistent. To find the least-squares solution we have to solve the normal equations $A^T A x = A^T b$. A simple and often effective approach is to apply Algorithm 2 to this symmetric system, i.e., with $A^T A$ in place of $A$ and $A^T b$ in place of $b$. Recall that Algorithm 2 uses only one matvec per iteration, here with $A^T A$, which can be implemented as two products, one with $A$ and one with $A^T$. In particular, the product $A^T A v$ (for some vector $v$) is implemented as $A^T (A v)$.
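The idea of working with the normal equations without ever forming the product matrix can be sketched as follows. Because the text states that Algorithm 2 generates the same residuals as CG, a textbook CG loop is used here as a stand-in; it is not the authors' Algorithm 2. The function name, tolerance, and random test data are assumptions.

```python
import numpy as np

def cg_normal_equations(A, b, tol=1e-12, max_iter=200):
    """CG on A^T A x = A^T b, applying A^T A as two matrix-vector products."""
    n = A.shape[1]
    x = np.zeros(n)
    r = A.T @ b                  # normal-equations residual at x = 0
    p = r.copy()
    rho = r @ r
    rho0 = rho
    for _ in range(max_iter):
        if rho <= (tol ** 2) * rho0:
            break
        q = A.T @ (A @ p)        # (A^T A) p without forming A^T A
        alpha = rho / (p @ q)
        x = x + alpha * p
        r = r - alpha * q
        rho_new = r @ r
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

# Hypothetical inconsistent overdetermined problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 6))
b = rng.standard_normal(40)
x_ls = cg_normal_equations(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
ls_gap = np.linalg.norm(x_ls - x_ref)
```

Evaluating $A^T (A v)$ keeps the cost at two sparse products per iteration and avoids the fill-in that forming the product matrix explicitly could cause.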
3.3 Symmetric indefinite
When $B$ is symmetric but indefinite, the block system Eq. 9 still characterizes the critical points of an optimization problem. However, the objective is not a norm anymore because it can assume negative values. Therefore, the solution $p_k$ to Eq. 9 is typically not the minimizer. (This is good because the minimizer is typically unbounded.) Broadly, in this context $p_k$ is related to a saddle point in a quadratic programming problem. For an example, one can compare the updates generated with an indefinite weighting in Fig. 1. Independent of everything else, as long as the sketch remains full-rank, finite termination is maintained (even with indefinite $B$). Because $p_k$ can't be interpreted as a minimizer in the indefinite case, we develop the update using an analogy to spd $B$. We view $B$ as an implicit preconditioner with the main purpose of simplifying computations. Recall the short recurrence Eq. 14 for the update with residual sketches (independent of the definiteness of $B$). Since the recurrence involves the product $W A^T r_{k-1}$, a significant simplification occurs when $W = A^{-1}$, so that $W A^T r_{k-1} = r_{k-1}$ for symmetric $A$. This means we choose $B = A$, even if $A$ is indefinite. The result is an algorithm that is computationally equivalent to Algorithm 2. We summarize the method as follows.
Algorithm 3: PLSS (sid ) |
Given: |
Apply Algorithm 2 |
Note that Algorithm 2 (and hence Algorithm 3) does not require any square roots, and thus there is no concern about square roots of negative quantities in the indefinite case causing breakdown. Since appears in the denominator, we potentially have to guard for this value becoming too small. However, implies and therefore this quantity is small only when the method has converged. Further, the denominator in must be guarded from becoming zero. With in from Algorithm 2 (and the notation Eq. 16) we see that
The denominator is zero if .
A possible remedy is to restart the method from should this condition arise. (In the numerical experiments we don’t observe breakdowns from this condition.) Algorithm 3 is efficient because it uses only one matvec per iteration. It is equivalent to CG but applicable to symmetric indefinite systems, whereas CG is typically designed for spd matrices. In Section 4.1 (Experiment I: Indefinite Systems) we observe that PLSS () (Algorithm 3) is robust and fast (see Tables 2 and 3).
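To make the idea of guarding a denominator concrete, the following Python sketch applies the textbook CG recurrences to symmetric indefinite systems and reports a breakdown when the curvature term is numerically zero. This is only an analogy: the text says Algorithm 3 is equivalent to CG, but the exact denominator used in the paper is not reproduced here, so the guarded quantity $p^T A p$, the threshold `eps`, and the status strings are assumptions. The restart remedy is only flagged, not implemented.

```python
import numpy as np

def cg_with_guard(A, b, tol=1e-12, eps=1e-14):
    """CG recurrences for symmetric A with a guard on the curvature p^T A p."""
    n = b.size
    x = np.zeros(n)
    r = np.asarray(b, dtype=float).copy()
    p = r.copy()
    rho = r @ r
    if np.sqrt(rho) <= tol:
        return x, "converged"
    for _ in range(n):
        Ap = A @ p
        curvature = p @ Ap
        if abs(curvature) <= eps * (p @ p):
            return x, "breakdown"        # caller could restart from x here
        alpha = rho / curvature          # may be negative for indefinite A
        x = x + alpha * p
        r = r - alpha * Ap
        rho_new = r @ r
        if np.sqrt(rho_new) <= tol:
            return x, "converged"
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x, "max_iter"

b = np.array([1.0, 1.0])
x_ok, status_ok = cg_with_guard(np.diag([1.0, -2.0]), b)    # indefinite, solvable
x_bad, status_bad = cg_with_guard(np.diag([1.0, -1.0]), b)  # zero curvature at step 1
```

In the first system a negative step length appears but the iteration still terminates in two steps; in the second the very first direction has zero curvature, which is the kind of situation a restart is meant to handle.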
3.4 General square systems
When $A$ is a general square matrix, we can't directly apply the strategies for $W$ from Sections 3.1 and 3.3. Specifically, it is not possible to choose $W = A$ or $W = A^{-1}$ as before because $A$ is not necessarily symmetric. It is fine to set $W = I$, which results in the original PLSS algorithm. Another approach is to consider the symmetric weighting $W = (A^T A)^{-1}$. Substituting in the update Eq. 8 (with residual sketches), we obtain an update that involves the product $(A^T A)^{-1} A^T r_{k-1}$.
The main difficulty is evaluating $(A^T A)^{-1} A^T r_{k-1}$. In fact, if this product could be computed exactly, the method based on this choice for $W$ would converge in one iteration:
$$x_1 = x_0 + (A^T A)^{-1} A^T r_0 = x_0 + A^{-1} r_0 = A^{-1} b.$$
However, this product is typically not computed exactly. A possibility is to apply PLSS (Algorithm 2) as a subproblem solver inside an outer loop in order to evaluate it approximately. This approach has two nested loops and we don’t want the inner loop to take many iterations. Therefore, we include a second stopping tolerance to terminate the inner loop early. The resulting Algorithm 4 uses PLSS (Algorithm 2) represented as the function in Fig. 2 with its own stopping tolerance.
function plss() | ||
while | ||
end |
Algorithm 4: PLSS (general ) | ||
Given: | ||
Solve using plss() | ||
for | ||
Solve using plss() in Fig. 2 | ||
end |
A few remarks about Algorithm 4. First, inside the loop, plss is called with an initial guess of . This is not essential, and the zero starting point is a valid option. However, when the residuals don’t change rapidly for every iteration it can be advantageous to use information from the previous solution. The tolerances are only one implementation that force increasingly accurate solutions. Other sequences are possible. Further, if PLSS () could solve the initial system to full accuracy, Algorithm 4 would only need one iteration. However, as this is unrealistic, the algorithm will typically use multiple iterations and its convergence depends on how well the subproblem solver approximates . When matvec products with can be computed cheaply, the overall approach in Algorithm 4 can be useful. Numerical experiments for general square are in Section 4.2.
4 Numerical experiments
Our algorithms are implemented in MATLAB and Python 3.9. The numerical experiments are carried out in MATLAB 2023a on a Linux machine with a 13th Gen Intel Core i9-13900KS (24 cores) processor and 128 GB RAM, and on a laptop with an Apple M2 Max chip. For comparisons, we use the randomized algorithms of [11], the algorithms from [6, github.com/johannesbrust/PLSS], SYMMLQ [15] and GMRES [18]. All codes are available in the public domain [3]. The stopping criterion is either . Unless otherwise specified, the iteration limit is . We label the PLSS Algorithm with and as PLSS and PLSS respectively.
4.1 Experiment I
The problems in this experiment are large square consistent symmetric systems. However, the matrices may be indefinite and/or rank-deficient. For example, problem bcsstm36 is neither positive definite nor full-rank. The convergence criterion is and the iteration limit is set to . Table 2 gives a detailed comparison of the solver outcomes. We note that PLSS with weighting and SYMMLQ solve all but one problem to the specified tolerance. In terms of computational time, PLSS with the new weighting is the fastest overall. Fig. 3 summarizes the computational times of the four methods using performance profiles [9]. These profiles allow for a direct comparison of different solvers based on computational times or iterations. Specifically, the performance metric on test problems is given by
where is the “output” (i.e., iterations or time) of “solver” on problem , and denotes the total number of solvers for a given comparison. This metric measures the proportion of how close a given solver is to the best result.
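For readers unfamiliar with performance profiles, the following Python sketch computes the standard profile of Dolan and Moré [9]: for each solver, the fraction of problems on which its cost is within a factor $\tau$ of the best cost on that problem. The exact metric used in the experiments may differ from this standard form; the function name `performance_profile`, the use of `np.inf` to mark failures, and the toy timing data are assumptions.

```python
import numpy as np

def performance_profile(T, taus):
    """T[p, s] is the cost (time or iterations) of solver s on problem p.

    Failures are marked with np.inf; every problem is assumed to be solved
    by at least one solver. Returns an array of shape (len(taus), n_solvers).
    """
    T = np.asarray(T, dtype=float)
    best = T.min(axis=1, keepdims=True)   # best cost on each problem
    ratios = T / best                     # performance ratios, always >= 1
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

# Toy data: three problems, two solvers; solver 2 fails on the last problem.
T = np.array([[1.0, 2.0],
              [4.0, 2.0],
              [3.0, np.inf]])
profile = performance_profile(T, taus=[1.0, 2.0, 4.0])
```

The value at $\tau = 1$ is the fraction of problems on which a solver is the fastest, and the limit for large $\tau$ is the fraction of problems it solves at all.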
4.2 Experiment II
The problems in this experiment are medium-size general square systems. The convergence tolerance is and the iteration limit is . Table 3 gives a detailed comparison of the solver outcomes. PLSS with weighting solves the most problems to the specified tolerance. For reference we include GMRES [18] with restarts every 500 iterations. Figure 4 summarizes the computational times of the four methods.
4.3 Experiment III
This experiment tests Algorithm 4 on an unsymmetric matrix (West0479) [8] that is difficult for iterative solvers when no preconditioner is used. The matrix comes from a chemical engineering process via A. Westerberg. Even though the system is small ($n = 479$), the condition number is large.
For comparison, we include three random normal solvers with . We run Algorithm 4 with inner iterations limit . The convergence tolerance for all methods is . As Algorithm 4 has cheap inner and costly outer iterations, a comparison based on iterations only may not be informative. Instead we compare computation times. Figure 5 shows the outcomes. We see that Rand and Algorithm 4 (PLSS with ) perform well. Importantly, Algorithm 4 scales to larger problems because the subproblem solver has low inner iterations complexity. Every iteration of the randomized algorithm needs operations.
5 Conclusion
Structured sketch-and-project methods for linear systems have been developed here. The methods are characterized by a finite termination property for both random and deterministic sketches. When the history of past residuals forms the sketch, we exploit a short recurrence to develop effective weighting schemes. The techniques enable us to incorporate information from the linear system to obtain an implicit preconditioning. In numerical experiments on large sparse problems, the proposed methods compare well to state-of-the-art deterministic and random solvers.
Acknowledgments
We are grateful for fruitful discussions after the presentation in Session 3B: Randomized Algorithms at the 18th Copper Mountain Conference on Iterative Methods, April 14–19, 2024, Frisco, CO.
References
- [1] H. Avron, P. Maymounkov, and S. Toledo, Blendenpik: Supercharging LAPACK’s least-squares solver, SIAM J. Sci. Comput., 32 (2010), pp. 1217–1236.
- [2] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. Van der Vorst, Templates for the solution of linear systems: building blocks for iterative methods, SIAM, 1994.
- [3] J. J. Brust, Code for Algorithm PLSS and test programs. https://github.com/johannesbrust/PLSS, 2022.
- [4] J. J. Brust, R. F. Marcia, and C. G. Petra, Large-scale quasi-Newton trust-region methods with low-dimensional linear equality constraints, Comput. Optim. Appl., 74 (2019), pp. 669–701.
- [5] J. J. Brust, R. F. Marcia, C. G. Petra, and M. A. Saunders, Large-scale optimization with linear equality constraints using reduced compact representation, SIAM J. Sci. Comput., 44 (2022), pp. A103–A127.
- [6] J. J. Brust and M. A. Saunders, PLSS: A projected linear systems solver, SIAM J. Sci. Comput., 45 (2023), pp. A1012–A1037.
- [7] T. A. Davis, Direct Methods for Sparse Linear Systems, SIAM, Philadelphia, 2006.
- [8] T. A. Davis, Y. Hu, and S. Kolodziej, SuiteSparse matrix collection. https://sparse.tamu.edu/, 2015–present.
- [9] E. Dolan and J. Moré, Benchmarking optimization software with performance profiles, Math. Program., 91 (2002), pp. 201–213.
- [10] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, third ed., 1996.
- [11] R. M. Gower and P. Richtárik, Randomized iterative methods for linear systems, SIAM J. Matrix Anal. Appl., 36 (2015), pp. 1660–1690, https://doi.org/10.1137/15M1025487.
- [12] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
- [13] E. Liberty, Simple and deterministic matrix sketching, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 581–588.
- [14] X. Meng, M. A. Saunders, and M. W. Mahoney, LSRN: A parallel iterative solver for strongly over- or underdetermined systems, SIAM J. Sci. Comput., 36 (2014), pp. C95–C118, https://doi.org/10.1137/120866580.
- [15] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal., 12 (1975), pp. 617–629, https://doi.org/10.1137/0712047.
- [16] P. Richtárik and M. Takáč, Stochastic reformulations of linear systems: Algorithms and convergence theory, SIAM J. Matrix Anal. Appl., 41 (2020), pp. 487–524, https://doi.org/10.1137/18M1179249.
- [17] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, 2003.
- [18] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. and Statist. Comput., 7 (1986), pp. 856–869, https://doi.org/10.1137/0907058.
- [19] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, Journal of Fourier Analysis and Applications, 15 (2009), pp. 262–278.
- [20] D. P. Woodruff et al., Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science, 10 (2014), pp. 1–157.