Quasi Cauchy Optimizer

This article contains some notes about the quasi Cauchy optimizer implementation. The method discussed here is a member of the quasi Newton family. It minimizes a function \(f(x)\) that maps an n-vector \(x\) to a scalar, such as a loss function used in machine learning algorithms. The Hessian is approximated by a diagonal matrix satisfying the weak secant equation. Because a diagonal matrix has the same memory footprint as a vector, the method is well suited for high-dimensional problems (e.g. neural networks).

Weak Secant Equation

The weak secant equation is \(s^{T} \cdot y = s^{T} \cdot B_{+} \cdot s\) with the update step \(s=x_{+}-x\), the gradient difference \(y=\nabla f(x_{+})-\nabla f(x)\) and the curvature condition \(s^{T} \cdot y \gt 0\). The \(n \times n\) matrix \(B_{+}\) is an approximation of the Hessian of \(f(x)\) that satisfies the weak secant equation. As in Newton's method, the update step is \(s=-B^{-1} \cdot \nabla f(x)\), where the steepest-descent direction \(-\nabla f(x)\) is scaled and rotated by \(B^{-1}\). For more details see the paper by Zhu et al.
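To make these quantities concrete, here is a minimal sketch (assuming NumPy and a hypothetical gradient function grad_f) that computes \(s\), \(y\) and the two sides of the weak secant equation for a diagonal approximation \(B_{+}\) stored as a vector:

```python
import numpy as np

def weak_secant_residual(x, x_new, grad_f, B_diag):
    """Return s^T y - s^T B_+ s for a diagonal B_+ stored as a vector.

    A value of zero means B_+ satisfies the weak secant equation.
    """
    s = x_new - x                      # update step
    y = grad_f(x_new) - grad_f(x)      # gradient difference
    return s @ y - s @ (B_diag * s)    # s^T y - s^T B_+ s
```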

Scaled Identity Update

We can use a scaled identity matrix \(D=d \cdot I\) for the Hessian approximation \(B_{+}\). Inserting it into the weak secant equation gives \(s^{T} \cdot y= s^{T} \cdot d \cdot I \cdot s\). Solving for the scalar \(d\) we get \(d=\frac{s^{T} \cdot y}{s^{T} \cdot s}\). This Hessian approximation uniformly scales all steepest-descent dimensions by \(\frac{1}{d}\) in the update step.
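A minimal sketch of this update, assuming NumPy arrays for \(s\), \(y\) and the gradient:

```python
import numpy as np

def scaled_identity_update(s, y):
    """Solve s^T y = s^T (d * I) s for the scalar d."""
    return (s @ y) / (s @ s)

def step_scaled_identity(grad, d):
    """Update step s = -(1/d) * grad f(x): uniform scaling of all dimensions."""
    return -grad / d
```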

Diagonal Update

Instead of the scaled identity matrix, we can also use a general diagonal matrix \(D\) for the Hessian approximation. The weak secant equation does not determine a diagonal matrix uniquely, so we narrow down the solution \(D_{+}\) to be the one most similar to the previous matrix \(D\). The similarity is measured as: \[\| D_{+}-D \| = \| U \| = \sqrt{\sum_{i}{u_{ii}^2}} \] Taking the smallest possible update \(U\) keeps as much information as possible from previous iterations. This gives the optimization problem \(\min \| U \| \) with the weak secant equation \(s^T \cdot (D+U) \cdot s = s^T \cdot y\) as constraint. The Lagrange function is: \[L(U, \lambda) = \frac{1}{2} \cdot \| U \|^2 + \lambda \cdot (s^T \cdot (D+U) \cdot s - s^T \cdot y) \] Setting the partial derivative with respect to a diagonal element \(u_{ii}\) to zero gives \(u_{ii} = -\lambda \cdot s_i^2\); substituting this into the constraint (the partial derivative with respect to \(\lambda\)) and solving for \(\lambda\) yields the update: \[u_{ii} = \frac{s^T \cdot y - s^T \cdot D \cdot s}{\sum_{j}{s_j^4}} \cdot s_i^2\] The resulting Hessian approximation \(D_{+}=D+U\) scales each steepest-descent dimension individually by \(\frac{1}{d_{ii}+u_{ii}}\) in the update step.
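A minimal sketch of this diagonal update, assuming NumPy and the diagonal of \(D\) stored as a vector:

```python
import numpy as np

def diagonal_update(D_diag, s, y):
    """Diagonal of D_+ = D + U with
    u_ii = (s^T y - s^T D s) / sum_j s_j^4 * s_i^2."""
    residual = s @ y - s @ (D_diag * s)        # s^T y - s^T D s
    u = residual / np.sum(s ** 4) * s ** 2     # element-wise u_ii
    return D_diag + u

def step_diagonal(grad, D_diag):
    """Update step s = -D^{-1} * grad f(x): per-dimension scaling."""
    return -grad / D_diag
```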

Experiments

The results can be reproduced with the quasi Cauchy optimizer implementation. The implementation uses line search. Further, it clips the Hessian approximation to positive values as a crude way to ensure moving along a descent direction (a minimal sketch of such a loop is shown after the tables below). Here only a small subset of the results is shown. Comparing the diagonal update and the scaled identity update on multiple test functions shows the following:

Fig. 1: Path towards the minimum of the Beale test function. Left: diagonal update. Right: scaled identity update.

Update           Error   Iteration Count
Diagonal         0.000   172
Scaled Identity  0.000   33
Table 1: Beale test function.

Update           Error   Iteration Count
Diagonal         0.383   185
Scaled Identity  0.649   501 (max. reached)
Table 2: 50-dimensional polynomial.
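The line search and the clipping to positive values mentioned above could fit together roughly as in the following sketch. It is not the referenced implementation, just a minimal loop assuming NumPy, a crude backtracking line search, and a small positive floor eps for the clipped diagonal:

```python
import numpy as np

def minimize_quasi_cauchy(f, grad_f, x0, max_iter=500, eps=1e-8, tol=1e-6):
    """Minimal quasi Cauchy loop: diagonal Hessian approximation,
    backtracking line search, and clipping of the diagonal to positive
    values so the search direction stays a descent direction."""
    x = np.asarray(x0, dtype=float)
    D = np.ones_like(x)                           # start with the identity
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        direction = -g / np.clip(D, eps, None)    # s = -D^{-1} * grad f(x)
        t = 1.0                                   # crude backtracking line search
        while f(x + t * direction) > f(x) and t > 1e-10:
            t *= 0.5
        s = t * direction
        x_new = x + s
        y = grad_f(x_new) - g
        if s @ y > 0:                             # only update on positive curvature
            D = D + (s @ y - s @ (D * s)) / np.sum(s ** 4) * s ** 2
        x = x_new
    return x
```

The guard on \(s^T \cdot y > 0\) skips the diagonal update whenever the curvature condition from the weak secant equation does not hold, so the clipped diagonal stays meaningful.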

Conclusion

The diagonal update derived from the weak secant equation scales each steepest-descent dimension separately. This update performs better for high-dimensional functions whose scale varies across dimensions. Otherwise, uniform scaling performs better.

Harald Scheidl, 2020