Journal of the Operations Research Society of China, 2020, Vol. 8, Issue 2: 249-294. DOI: 10.1007/s40305-020-00309-6
• Special Issue: Mathematical Optimization: Past, Present and Future •

Optimization for Deep Learning: An Overview

Ruo-Yu Sun
Received: 2019-12-18
Revised: 2020-04-13
Online: 2020-06-30
Published: 2020-07-07
Contact: Ruo-Yu Sun, E-mail: ruoyus@illinois.edu
Cite this article as: Ruo-Yu Sun. Optimization for Deep Learning: An Overview [J]. Journal of the Operations Research Society of China, 2020, 8(2): 249-294.
Related Articles:
[1] Jiang Hu, Xin Liu, Zai-Wen Wen, Ya-Xiang Yuan. A Brief Introduction to Manifold Optimization [J]. Journal of the Operations Research Society of China, 2020, 8(2): 199-248.
[2] Hai-Miao Zhang, Bin Dong. A Review on Deep Learning in Medical Image Reconstruction [J]. Journal of the Operations Research Society of China, 2020, 8(2): 311-340.
[3] Yin Zhang. Convergence of a Class of Stationary Iterative Methods for Saddle Point Problems [J]. Journal of the Operations Research Society of China, 2019, 7(2): 195-204.
[4] Zhong-Ming Wu, Min Li. An LQP-Based Symmetric Alternating Direction Method of Multipliers with Larger Step Sizes [J]. Journal of the Operations Research Society of China, 2019, 7(2): 365-383.
[5] Yun-Da Dong, Hai-Bin Zhang, Huan Gao. On Globally Q-Linear Convergence of a Splitting Method for Group Lasso [J]. Journal of the Operations Research Society of China, 2018, 6(3): 445-454.
[6] Nancy Baygorrea, Erik Alex Papa Quiroz, Nelson Maculan. On the Convergence Rate of an Inexact Proximal Point Algorithm for Quasiconvex Minimization on Hadamard Manifolds [J]. Journal of the Operations Research Society of China, 2017, 5(4): 457-467.
[7] Morteza Kimiaei, Susan Ghaderi. A New Restarting Adaptive Trust-Region Method for Unconstrained Optimization [J]. Journal of the Operations Research Society of China, 2017, 5(4): 487-507.
[8] Feng-Mei Tang, Ping-Liang Huang. On the Convergence Rate of a Proximal Point Algorithm for Vector Function on Hadamard Manifolds [J]. Journal of the Operations Research Society of China, 2017, 5(3): 405-417.
[9] Yi-Ju Wang, Guang-Lu Zhou. A Hybrid Second-Order Method for Homogenous Polynomial Optimization over Unit Sphere [J]. Journal of the Operations Research Society of China, 2017, 5(1): 99-.
[10] Yi-Yong Li, Qing-Zhi Yang, Xi He. A Method with Parameter for Solving the Spectral Radius of Nonnegative Tensor [J]. Journal of the Operations Research Society of China, 2017, 5(1): 3-.
[11] Suvra Kanti Chakraborty, Geetanjali Panda. Two-Phase-SQP Method with Higher-Order Convergence Property [J]. Journal of the Operations Research Society of China, 2016, 4(3): 385-.
[12] Ya-Feng Liu, Rui Diao, Feng Ye, Hong-Wei Liu. An Efficient Inexact Newton-CG Algorithm for the Smallest Enclosing Ball Problem of Large Dimensions [J]. Journal of the Operations Research Society of China, 2016, 4(2): 167-.
[13] Jun-Kai Feng, Hai-Bin Zhang, Cao-Zong Cheng, Hui-Min Pei. Convergence Analysis of L-ADMM for Multi-block Linear-Constrained Separable Convex Minimization Problem [J]. Journal of the Operations Research Society of China, 2015, 3(4): 563-.
[14] Xue Zhang, Xiao-Qun Zhang. A Note on the Complexity of Proximal Iterative Hard Thresholding Algorithm [J]. Journal of the Operations Research Society of China, 2015, 3(4): 459-.
[15] Xiang-Feng Wang. On the Convergence Rate of a Class of Proximal-Based Decomposition Methods for Monotone Variational Inequalities [J]. Journal of the Operations Research Society of China, 2015, 3(3): 347-.