Journal of the Operations Research Society of China ›› 2020, Vol. 8 ›› Issue (2): 249-294. DOI: 10.1007/s40305-020-00309-6



Optimization for Deep Learning: An Overview

Ruo-Yu Sun1,2   

  • 1. Department of Industrial and Enterprise Systems Engineering, and affiliated with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA;
    2. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  • Received: 2019-12-18 Revised: 2020-04-13 Online: 2020-06-30 Published: 2020-07-07
  • Contact: Ruo-Yu Sun E-mail: ruoyus@illinois.edu

Abstract: Optimization is a critical component in deep learning. We argue that optimization for neural networks is an interesting topic for theoretical research for several reasons. First, its tractability despite non-convexity is an intriguing question and may greatly expand our understanding of tractable problems. Second, classical optimization theory falls far short of explaining many phenomena observed in neural network training. Therefore, we would like to understand the challenges and opportunities from a theoretical perspective and review the existing research in this field. First, we discuss the issue of gradient explosion/vanishing and the more general issue of an undesirable spectrum, and then discuss practical solutions including careful initialization, normalization methods and skip connections. Second, we review generic optimization methods used in training neural networks, such as stochastic gradient descent and adaptive gradient methods, together with existing theoretical results. Third, we review existing research on the global issues of neural network training, including results on the global landscape, mode connectivity, the lottery ticket hypothesis and the neural tangent kernel.
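To make the abstract's mention of "stochastic gradient descent and adaptive gradient methods" concrete, the following is a minimal illustrative sketch (not taken from the paper) contrasting a plain SGD update with an Adam-style adaptive update on a toy one-dimensional quadratic. The function `f(w) = (w - 3)^2`, the learning rate, and the hyperparameter values are all assumptions chosen for illustration.

```python
import math

def sgd_step(w, grad, lr=0.1):
    """Plain (stochastic) gradient descent: move against the gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step: exponential moving averages of the gradient (m)
    and its square (v), with bias correction, give an adaptive step size."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_sgd = w_adam = 0.0
m = v = 0.0
for t in range(1, 201):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3))
    w_adam, m, v = adam_step(w_adam, 2 * (w_adam - 3), m, v, t)

print(round(w_sgd, 4), round(w_adam, 4))  # both iterates approach w* = 3
```

Note the design difference the survey discusses at length: SGD uses one global step size, while adaptive methods rescale each coordinate by a running estimate of gradient magnitude, which changes both practical behavior and the convergence theory.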

Key words: Deep learning, Non-convex optimization, Neural networks, Convergence, Landscape
