Welcome to the new version of CaltechAUTHORS. Login is currently restricted to library staff. If you notice any issues, please email coda@library.caltech.edu
Published December 2022 | Submitted + Accepted Version
Journal Article Open

Stochastic Mirror Descent on Overparameterized Nonlinear Models

Abstract

Most modern learning problems are highly overparameterized, i.e., have many more model parameters than the number of training data points. As a result, the training loss may have infinitely many global minima (parameter vectors that perfectly "interpolate" the training data). It is therefore imperative to understand which interpolating solutions we converge to, how they depend on the initialization and learning algorithm, and whether they yield different test errors. In this article, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which stochastic gradient descent (SGD) is a special case. Recently, it has been shown that for overparameterized linear models, SMD converges to the closest global minimum to the initialization point, where closeness is in terms of the Bregman divergence corresponding to the potential function of the mirror descent. With appropriate initialization, this yields convergence to the minimum-potential interpolating solution, a phenomenon referred to as implicit regularization. On the theory side, we show that for sufficiently- overparameterized nonlinear models, SMD with a (small enough) fixed step size converges to a global minimum that is "very close" (in Bregman divergence) to the minimum-potential interpolating solution, thus attaining approximate implicit regularization. On the empirical side, our experiments on the MNIST and CIFAR-10 datasets consistently confirm that the above phenomenon occurs in practical scenarios. They further indicate a clear difference in the generalization performances of different SMD algorithms: experiments on the CIFAR-10 dataset with different regularizers, ℓ₁ to encourage sparsity, ℓ₂ (SGD) to encourage small Euclidean norm, and ℓ∞ to discourage large components, surprisingly show that the ℓ∞ norm consistently yields better generalization performance than SGD, which in turn generalizes better than the ℓ₁ norm.

Additional Information

© 2021 IEEE. Manuscript received June 8, 2020; revised December 7, 2020 and May 10, 2021; accepted May 24, 2021. This work was supported in part by the National Science Foundation under Grant ECCS-1509977, in part by Qualcomm Inc., in part by the NASA's Jet Propulsion Laboratory through the President and Director's Fund, in part by Amazon Web Services Inc., and in part by PIMCO, LLC. This article was presented in part at the 2019 International Conference on Machine Learning (ICML) Generalization Workshop, Long Beach, CA, USA.

Attached Files

Accepted Version - Stochastic_Mirror_Descent_on_Overparameterized_Nonlinear_Models.pdf

Submitted - 1906.03830.pdf

Files

1906.03830.pdf
Files (3.3 MB)
Name Size Download all
md5:14881abbd2f1d5ce9ed44fff329d3738
1.7 MB Preview Download
md5:d8e07a7cf9391b184054827e99f8bf0a
1.6 MB Preview Download

Additional details

Created:
August 20, 2023
Modified:
October 20, 2023