# Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method

NeurIPS 2020.

Keywords:

Lipschitz continuous, projected subgradient method, minimization oracle, efficient subgradient, FO calls, support vector machines

Abstract:

We consider the classical setting of optimizing a nonsmooth Lipschitz continuous convex function over a convex constraint set, when having access to a (stochastic) first-order oracle (FO) for the function and a projection oracle (PO) for the constraint set. It is well known that, to achieve ε-suboptimality in high dimensions, Θ(ε−2) FO calls are necessary.

Introduction

- When queried at a point x, FO returns a subgradient of f at x and PO returns the projection of x onto X.
- Finding an ε-suboptimal solution for this problem requires Ω(ε−2) FO calls in the worst case, when the dimension d is large [64].
- This lower bound is tightly matched by the projected subgradient method (PGD).
- PGD uses one PO call after every FO call, resulting in O(ε−2) PO calls.
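The one-PO-call-per-FO-call pattern of PGD can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the diminishing step size, iterate averaging, and the toy ℓ1 objective below are standard textbook choices:

```python
import numpy as np

def projected_subgradient(subgrad, project, x0, steps, G, R):
    """Projected subgradient method (PGD).

    subgrad(x) -> a subgradient of f at x     (one FO call)
    project(x) -> Euclidean projection onto X (one PO call)
    G: Lipschitz constant of f; R: bound on the diameter of X.
    """
    x = x0
    avg = np.zeros_like(x0)
    for t in range(1, steps + 1):
        g = subgrad(x)                 # FO call
        eta = R / (G * np.sqrt(t))     # standard diminishing step size
        x = project(x - eta * g)       # PO call: FO and PO calls stay 1:1
        avg += x
    return avg / steps                 # averaged iterate: O(GR/sqrt(T)) gap

# Toy example: minimize the nonsmooth f(x) = ||x||_1 over the Euclidean unit ball.
f_subgrad = np.sign                    # a subgradient of ||x||_1
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
x_hat = projected_subgradient(f_subgrad, proj_ball, np.ones(5), 2000,
                              G=np.sqrt(5), R=2.0)
```

Because the projection sits inside the loop, the PO calls complexity of PGD necessarily matches its O(ε−2) FO calls complexity; decoupling the two counts is exactly what MOPES is after.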

Highlights

- In this paper, we consider the nonsmooth convex optimization (NSCO) problem with a first-order oracle (FO) and a projection oracle (PO), defined as:

NSCO: min f(x), s.t. x ∈ X,

FO(x) ∈ ∂f(x), and PO(x) = P_X(x) = argmin_{y∈X} ‖y − x‖, (1)

where f : R^d → R is a convex Lipschitz-continuous function and X ⊆ R^d is a convex constraint set.
- The cost of a PO call is often higher than that of an FO call. This begs a natural question, which surprisingly is largely unexplored in the general nonsmooth optimization setting: can we design an algorithm whose PO calls complexity is significantly better than the optimal FO calls complexity O(ε−2)?
- We introduce MOreau Projection Efficient Subgradient (MOPES) and show that it is guaranteed to find an ε-suboptimal solution for any constrained nonsmooth convex optimization problem using O(ε−1) PO calls and optimal O(ε−2) Stochastic First-order Oracle (SFO) calls
- Our MOPES method guarantees a significantly better PO calls complexity (PO-CC) than the projected subgradient method (PGD), while remaining independent of dimension
- We assume that the function is accessed with a first-order oracle (FO) and the set is accessed with either a projection oracle (PO) or a linear minimization oracle (LMO)
- We introduce MOPES, and show that it finds an ε-suboptimal solution with O(ε−2) FO calls and O(ε−1) PO calls
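To make the two set oracles concrete (this is an illustration, not code from the paper), here are a PO and an LMO for the ℓ1 ball. The LMO only needs an argmax over coordinates, while the Euclidean projection requires a sort-based scheme in the style of Duchi et al. [26], which is why LMO calls are typically much cheaper:

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    """LMO for the l1 ball: argmin_{||y||_1 <= r} <g, y> is a signed vertex."""
    i = np.argmax(np.abs(g))
    y = np.zeros_like(g)
    y[i] = -radius * np.sign(g[i])
    return y

def po_l1(x, radius=1.0):
    """PO for the l1 ball: Euclidean projection via soft-thresholding,
    with the threshold found by the sort-and-scan scheme of [26]."""
    if np.abs(x).sum() <= radius:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                 # magnitudes, descending
    css = np.cumsum(u) - radius
    rho = np.nonzero(u > css / np.arange(1, len(u) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)               # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

g = np.array([3.0, -1.0, 0.5])
v = lmo_l1(g)                          # a vertex of the ball: [-1., 0., 0.]
p = po_l1(np.array([0.8, 0.7, -0.2]))  # projection; ||p||_1 == 1
```

For the ℓ1 ball both oracles happen to be cheap; for sets like the nuclear-norm ball the gap is dramatic (top singular vector pair vs. a full SVD), which motivates the LMO setting.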

Results

- The authors present the main results: first the main ideas (Section 3.1), then the results for the PO and LMO settings (Sections 3.2 and 3.3, respectively).

- The authors are interested in the NSCO problem (1).
- In Figure 2, the authors plot the mean sub-optimality gap f − f∗ of the iterates against the number of LMO and FO calls, respectively, used to obtain that iterate.
- In both plots, while MOPES/MOLES and the baselines have comparable FO-CC, MOPES/MOLES is significantly more efficient in the number of PO/LMO calls, matching Theorems 1 and 2.

Conclusion

- The authors study a canonical problem in optimization: minimizing a nonsmooth Lipschitz continuous convex function over a convex constraint set.
- The authors assume that the function is accessed with a first-order oracle (FO) and the set is accessed with either a projection oracle (PO) or a linear minimization oracle (LMO).
- The authors introduce MOLES, and show that it finds an ε-suboptimal solution with O(ε−2) FO and LMO calls
- This is optimal in both the number of FO calls and the number of LMO calls.
- This resolves a question left open since [84] on designing the optimal Frank-Wolfe type algorithm for nonsmooth functions


- Table 1: Comparison of SFO (3), PO (1) & LMO (2) calls complexities of our methods and state-of-the-art algorithms, and corresponding lower bounds, for finding an approximate minimizer of a d-dimensional NSCO problem (1). We assume that f is convex and G-Lipschitz continuous, and is accessed through a stochastic subgradient oracle with a variance of σ2. Requires using a minibatch of appropriate size; †approximates projections of PGD with the FW method (FW-PGD, see Appendix B.2).
- Table 2: Projection: comparison of PO/MO and SFO calls complexities (PO-CC and SFO-CC).
- Table 3: Linear minimization oracle: LMO and SFO calls complexities (LMO-CC and SFO-CC) of various methods for a d-dimensional ℓ1-norm-constrained SVM with n training samples. SFO uses a batch size of b = o(n). SP+VR-MP combines ideas from the Semi-Proximal [41] and Variance-Reduced [16] Mirror-Prox methods. Our MOLES outperforms other nonsmooth methods in LMO-CC while still maintaining O(1/ε2) SFO-CC. Complexities of methods based on the smooth minimax reformulation scale adversely with n or d.

Related work

- Nonsmooth convex optimization: Nonsmooth convex optimization has been the focal point of several research works for the past few decades. [64] provided an information-theoretic lower bound of Ω(ε−2) FO calls to obtain an ε-suboptimal solution to the general problem. This bound is matched by the PGD method, introduced independently by [34] and [59], which also implies a PO-CC of O(ε−2). Recently, several faster PGD-style methods [50, 78, 87, 48] have been proposed that exploit more structure in the objective, e.g., when the function is a sum of a smooth function and a nonsmooth function for which a proximal operator is available [8]. But, to the best of our knowledge, such works do not explicitly address PO-CC and are mainly concerned with optimizing FO-CC. Thus, for worst-case nonsmooth functions, these methods still suffer from O(ε−2) PO-CC.

Smoothed surrogates: Smoothing the nonsmooth function is another common approach [62, 66]. In particular, randomized smoothing [27, 9] techniques have been successful in bringing down FO-CC w.r.t. ε, but such improvements come at the cost of dimension factors. For example, [27, Corollary 2.4] provides a randomized smoothing method that has O(d1/4/ε) PO-CC and O(ε−2) FO-CC. Our MOPES method guarantees a significantly better PO-CC than PGD that is still independent of dimension.
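The core idea of randomized smoothing can be sketched as follows; this is a generic estimator in the spirit of [27], not the paper's method, and the Gaussian perturbation scale `u` and sample count are illustrative:

```python
import numpy as np

def smoothed_subgrad(subgrad, x, u=0.1, samples=8, seed=None):
    """Randomized-smoothing gradient estimator.

    Averaging subgradients at Gaussian perturbations of x gives an
    unbiased gradient estimate of the smoothed surrogate
    f_u(x) = E[f(x + u * Z)], Z ~ N(0, I), which is differentiable
    even when f is not.
    """
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((samples, x.size))
    return np.mean([subgrad(x + u * z) for z in zs], axis=0)

# The kinks of f(x) = ||x||_1 are blurred: at x = 0 the averaged
# estimate is close to 0 instead of jumping between +/-1.
g = smoothed_subgrad(np.sign, np.zeros(3), u=0.1, samples=512, seed=0)
```

The price of this smoothing is the dimension dependence noted above: the smoothness constant of f_u degrades with d, which is where factors like d1/4 enter the PO-CC of [27].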

References

- J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4575–4583, 2016.
- B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 146–155. JMLR. org, 2017.
- F. Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129, 2015.
- F. Bach, R. Jenatton, J. Mairal, G. Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
- K. Balasubramanian and S. Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pages 3455–3464, 2018.
- N. Bansal and A. Gupta. Potential-function proofs for first-order methods. arXiv preprint arXiv:1712.04581, 2017.
- H. H. Bauschke, M. N. Dao, and S. B. Lindstrom. Regularizing with bregman–moreau envelopes. SIAM Journal on Optimization, 28(4):3208–3228, 2018.
- A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
- A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
- A. Ben-Tal, S. Bhadra, C. Bhattacharyya, and A. Nemirovski. Efficient methods for robust classification under uncertainty in kernel matrices. Journal of Machine Learning Research, 13 (Oct):2923–2954, 2012.
- D. P. Bertsekas. Nonlinear Programming. Athena Scientific Belmont, 2 edition, 1999.
- C. M. Bishop. Pattern recognition and machine learning. springer, 2006.
- P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998.
- G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms. Journal of Machine Learning Research, 20(71):1–42, 2019.
- J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982, 2010.
- Y. Carmon, Y. Jin, A. Sidford, and K. Tian. Variance reduction for matrix games. In Advances in Neural Information Processing Systems, pages 11377–11388, 2019.
- J. Chen, T. Yang, Q. Lin, L. Zhang, and Y. Chang. Optimal stochastic strongly convex optimization with a logarithmic number of projections. arXiv preprint arXiv:1304.5504, 2013.
- L. Chen, C. Harshaw, H. Hassani, and A. Karbasi. Projection-free online optimization with stochastic gradient: From convexity to submodularity. In International Conference on Machine Learning, pages 814–823, 2018.
- Y. Chen, Y. Shi, and B. Zhang. Optimal control via neural networks: A convex approach. In International Conference on Learning Representations, 2018.
- Y. Chen, Y. Shi, and B. Zhang. Input convex neural networks for optimal voltage regulation. arXiv preprint arXiv:2002.08684, 2020.
- A. Clark and Contributors. Pillow: Python image-processing library, 2020. URL https://pillow.readthedocs.io/en/stable/. Documentation.
- K. L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):1–30, 2010.
- B. Cox, A. Juditsky, and A. Nemirovski. Decomposition techniques for bilinear saddle point problems and variational inequalities with affine monotone operators. Journal of Optimization Theory and Applications, 172(2):402–435, 2017.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
- O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
- J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l 1-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pages 272–279, 2008.
- J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
- M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
- R. M. Freund and P. Grigas. New analysis and results for the frank–wolfe method. Mathematical Programming, 155(1-2):199–230, 2016.
- D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.
- D. Garber and E. Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015, 2015.
- G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-wolfe algorithms for saddle point problems. In Artificial Intelligence and Statistics, pages 362–371. PMLR, 2017.
- G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-wolfe splitting via augmented lagrangian method. In International Conference on Artificial Intelligence and Statistics, pages 1456–1465, 2018.
- A. A. Goldstein. Convex programming in hilbert space. Bulletin of the American Mathematical Society, 70(5):709–710, 1964.
- J. H. Hammond. Solving asymmetric variational inequality problems and systems of equations with generalized nonlinear programming algorithms. PhD thesis, Massachusetts Institute of Technology, 1984.
- Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for normregularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.
- H. Hassani, A. Karbasi, A. Mokhtari, and Z. Shen. Stochastic conditional gradient++: (non-)convex minimization and continuous submodular maximization. arXiv preprint arXiv:1902.06992, 2019.
- E. Hazan and S. Kale. Projection-free online learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1843–1850, 2012.
- E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
- E. Hazan and E. Minasyan. Faster projection-free online learning. arXiv preprint arXiv:2001.11568, 2020.
- N. He and Z. Harchaoui. Semi-proximal mirror-prox for nonsmooth composite minimization. In Advances in Neural Information Processing Systems, pages 3411–3419, 2015.
- N. He and Z. Harchaoui. Stochastic semi-proximal mirror-prox. Workshop on Optimization for Machine Learning, 2015. URL https://opt-ml.org/papers/OPT2015_paper_27.pdf.
- J. Howard. Imagenette, 2019. URL https://github.com/fastai/imagenette. Github repository with links to dataset.
- P. J. Huber. Robust statistical procedures, volume 68. SIAM, 1996.
- M. Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning, pages 427–435, 2013.
- P. Jain, O. D. Thakkar, and A. Thakurta. Differentially private matrix completion revisited. In International Conference on Machine Learning, pages 2215–2224. PMLR, 2018.
- B. Kulis, M. A. Sustik, and I. S. Dhillon. Low-rank kernel learning with bregman matrix divergences. Journal of Machine Learning Research, 10(Feb):341–376, 2009.
- A. Kundu, F. Bach, and C. Bhattacharya. Convex optimization over intersection of simple sets: improved convergence rate guarantees via an exact penalty approach. In International Conference on Artificial Intelligence and Statistics, pages 958–967. PMLR, 2018.
- S. Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
- S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
- S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate frank-wolfe optimization for structural svms. In Proceedings of the 30th international conference on machine learning, pages 53–61, 2013.
- J. Lafond, H.-T. Wai, and E. Moulines. On the online frank-wolfe algorithms for convex and non-convex optimizations. arXiv preprint arXiv:1510.01171, 2015.
- G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
- G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550, 2013.
- G. Lan. Gradient sliding for composite optimization. Mathematical Programming, 159(1-2): 201–235, 2016.
- G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
- G. Lan, Z. Lu, and R. D. Monteiro. Primal-dual first-order methods with O(1/ε) iterationcomplexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.
- G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1965–1974, 2017.
- E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1–50, 1966.
- F. Locatello, A. Yurtsever, O. Fercoq, and V. Cevher. Stochastic frank-wolfe for composite convex minimization. In Advances in Neural Information Processing Systems, pages 14246– 14256, 2019.
- M. Mahdavi, T. Yang, R. Jin, S. Zhu, and J. Yi. Stochastic gradient descent with only one projection. In Advances in Neural Information Processing Systems, pages 494–502, 2012.
- J. J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Ser. A Math., 255:2897–2899, 1962.
- J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.
- A. S. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, 1 edition, 1983.
- Y. Nesterov. Introductory lectures on convex programming volume I: Basic course. Lecture notes, 1998.
- Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103 (1):127–152, 2005.
- Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
- Y. Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 171(1-2):311–330, 2018.
- Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). In Dokl. akad. nauk Sssr, volume 269, pages 543–547, 1983.
- Q. Nguyen. Efficient learning with soft label information and multiple annotators. PhD thesis, University of Pittsburgh, 2014.
- B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.
- F. Pierucci, Z. Harchaoui, and J. Malick. A smoothing approach for composite conditional gradient with nonsmooth loss. Technical report, [Research Report] RR-8662, INRIA Grenoble, 2014.
- S. N. Ravi, M. D. Collins, and V. Singh. A deterministic nonsmooth frank wolfe algorithm with coreset guarantees. Informs Journal on Optimization, 1(2):120–142, 2019.
- M. I. Razzak. Sparse support matrix machines for the classification of corrupted data. PhD thesis, Queensland University of Technology, 2019.
- S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Stochastic frank-wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
- A. K. Sahu, M. Zaheer, and S. Kar. Towards gradient free and projection free stochastic optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3468–3477, 2019.
- M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in neural information processing systems, pages 1458–1466, 2011.
- O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79, 2013.
- N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in neural information processing systems, pages 1329–1336, 2005.
- K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659– 12670, 2019.
- P. Tseng. Accelerated proximal gradient methods for convex optimization. Technical report, University of Washington, Seattle, 2008. URL https://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf.
- R. Vinter and H. Zheng. Some finance problems solved with nonsmooth optimization techniques. Journal of optimization theory and applications, 119(1):1–18, 2003.
- Z. Wang, X. He, D. Gao, and X. Xue. An efficient kernel-based matrixized least squares support vector machine. Neural Computing and Applications, 22(1):143–150, 2013.
- D. White. Extension of the frank-wolfe algorithm to concave nondifferentiable objective functions. Journal of optimization theory and applications, 78(2):283–301, 1993.
- L. Wolf, H. Jhuang, and T. Hazan. Modeling appearances with low-rank svm. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–6. IEEE, 2007.
- J. Xie, Z. Shen, C. Zhang, B. Wang, and H. Qian. Efficient projection-free online methods with stochastic recursive gradient. In AAAI, pages 6446–6453, 2020.
- T. Yang and Q. Lin. RSG: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research, 19(1):236–268, 2018.
- T. Yang, Q. Lin, and L. Zhang. A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3901–3910. JMLR. org, 2017.
- I. E.-H. Yen, X. Lin, J. Zhang, P. Ravikumar, and I. Dhillon. A convex atomic-norm approach to multiple sequence alignment and motif discovery. In International Conference on Machine Learning, pages 2272–2280, 2016.
- K. Yosida. Functional analysis. Springer Verlag, 1965.
- L. Zhang, T. Yang, R. Jin, and X. He. O(log t) projections for stochastic optimization of smooth and strongly convex functions. In International Conference on Machine Learning, pages 1121–1129, 2013.
- T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.
- J. Zhu, S. Rosset, R. Tibshirani, and T. J. Hastie. 1-norm support vector machines. In Advances in neural information processing systems, pages 49–56, 2004.
