An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning

doi:10.1007/s40305-023-00453-9

摘要/Abstract

Abstract: Numerous intriguing optimization problems arise as a result of the advancement of machine learning. The stochastic first-ordermethod is the predominant choicefor those problems due to its high efficiency. However, the negative effects of noisy gradient estimates and high nonlinearity of the loss function result in a slow convergence rate. Second-order algorithms have their typical advantages in dealing with highly nonlinear and ill-conditioning problems. This paper provides a review on recent developments in stochastic variants of quasi-Newton methods, which construct the Hessian approximations using only gradient information. We concentrate on BFGS-based methods in stochastic settings and highlight the algorithmic improvements that enable the algorithm to work in various scenarios. Future research on stochastic quasi-Newton methods should focus on enhancing its applicability, lowering the computational and storage costs, and improving the convergence rate.

Key words: Stochastic quasi-Newton methods, BFGS, Large-scale machine learning

中图分类号:

. [J]. Journal of the Operations Research Society of China, 2023, 11(2): 245-275.

Tian-De Guo, Yan Liu, Cong-Ying Han. An Overview of Stochastic Quasi-Newton Methods for Large-Scale Machine Learning[J]. Journal of the Operations Research Society of China, 2023, 11(2): 245-275.

参考文献

[1] Robbins, H., Monro, S.:A stochastic approximation method. Ann. Math. Stat. 22(3), 400-407(1951)
[2] Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.:On optimization methods for deep learning. In:International Conference on Machine Learning. pp. 265-272. ACM, New York, USA (2011)
[3] Bottou, L.:Stochastic gradient learning in neural networks. In:Proceedings of Neuro-Nımes. 91(1991)
[4] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.:Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278-2324(1998)
[5] Bottou, L., Bousquet, O.:The tradeoffs of large scale learning. In:Neural Information Processing Systems. vol. 20. Curran Associates, Inc. (2007). https://proceedings.neurips.cc/paper/2007/file/0d3180d672e08b4c5312dcdafdf6ef36-Paper.pdf
[6] Zinkevich, M., Weimer, M., Li, L., Smola, A.:Parallelized stochastic gradient descent. In:Neural Information Processing Systems. vol. 23. Curran Associates, Inc. (2010). https://proceedings.neurips.cc/paper/2010/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf
[7] Qian, N.:On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145-151(1999)
[8] Loizou, N., Richtárik, P.:Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Comput. Optim. Appl. 77(3), 653-710(2020)
[9] Nesterov, Y.:A method for unconstrained convex minimization problem with the rate of convergence O(1/k2). Doklady Akademia Nauk USSR 269, 543-547(1983)
[10] Duchi, J., Hazan, E., Singer, Y.:Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 257-269(2011)
[11] Zeiler, M.D.:Adadelta:an adaptive learning rate method. arXiv:1212.5701(2012)
[12] Kingma, D.P., Ba, J.:Adam:a method for stochastic optimization. arXiv:1412.6980(2014)
[13] Sutskever, I., Martens, J., Dahl, G., Hinton, G.:On the importance of initialization and momentum in deep learning. In:International Conference on Machine Learning. pp. 1139-1147. PMLR (2013)
[14] Yuan, K., Ying, B., Sayed, A.H.:On the influence of momentum acceleration on online learning. J. Mach. Learn. Res. 17(1), 6602-6667(2016)
[15] Lan, G.:An optimal method for stochastic composite optimization. Math. Progr. 133(1), 365-397(2012)
[16] Ghadimi, S., Lan, G.:Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i:A generic algorithmic framework. SIAM J. Optim. 22(4), 1469-1492(2012)
[17] Bottou, L., Curtis, F.E., Nocedal, J.:Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223-311(2018)
[18] Gower, R.M., Schmidt, M., Bach, F., Richtarik, P.:Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968-1983(2020)
[19] Schmidt, M., Roux, N.L., Bach, F.:Minimizing finite sums with the stochastic average gradient. Math. Progr. 162(5), 1-30(2013)
[20] Roux, N., Schmidt, M., Bach, F.:A stochastic gradient method with an exponential convergence rate for finite training sets. In:Neural Information Processing Systems. vol. 25. Curran Associates, Inc. (2012). https://proceedings.neurips.cc/paper/2012/file/905056c1ac1dad141560467e0a99e1cfPaper.pdf
[21] Defazio, A., Bach, F., Lacoste-Julien, S.:Saga:a fast incremental gradient method with support for non-strongly convex composite objectives. In:Neural Information Processing Systems. vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/ede7e2b6d13a41ddf9f4bdef84fdc737-Paper.pdf
[22] Johnson, R., Zhang, T.:Accelerating stochastic gradient descent using predictive variance reduction. In:Neural Information Processing Systems. vol. 26. Curran Associates, Inc. (2013). https://proceedings.neurips.cc/paper/2013/file/ac1dd209cbcc5e5d1c6e28598e8cbbe8-Paper.pdf
[23] Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.:SARAH:a novel method for machine learning problems using stochasticrecursive gradient.In:International Conference onMachine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 2613-2621. PMLR (2017). https://proceedings.mlr.press/v70/nguyen17b.html
[24] Xu, P., Roosta, F., Mahoney, M.W.:Second-order optimization for non-convex machine learning:an empirical study. In:SIAM International Conference on Data Mining. pp. 199-207. SIAM (2020)
[25] Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.:On the use of stochastic hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977-995(2011)
[26] Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.:Sample size selection in optimization methods for machine learning. Math. Progr. 134(1), 127-155(2012)
[27] Erdogdu, M.A., Montanari, A.:Convergence rates of sub-sampled newton methods. In:Neural Information Processing Systems. vol. 28. Curran Associates, Inc. (2015). https://proceedings.neurips.cc/paper/2015/file/404dcc91b2aeaa7caa47487d1483e48a-Paper.pdf
[28] Roosta-Khorasani, F., Mahoney, M.W.:Sub-sampled Newton methods. Math. Progr. 174(1), 293-326(2019)
[29] Xu, P., Yang, J., Roosta, F., Ré, C., Mahoney, M.W.:Sub-sampled newton methods with non-uniform sampling.In:NeuralInformation Processing Systems. vol. 29. CurranAssociates,Inc.(2016).https://proceedings.neurips.cc/paper/2016/file/55c567fd4395ecef6d936cf77b8d5b2b-Paper.pdf
[30] Bollapragada, R., Byrd, R.H., Nocedal, J.:Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545-578(2019)
[31] Wang, J., Zhang, T.:Improved optimization of finite sums with minibatch stochastic variance reduced proximal iterations. arXiv:1706.07001(2017)
[32] Pilanci, M., Wainwright, M.J.:Newton sketch:a near linear-time optimization algorithm with linearquadratic convergence. SIAM J. Optim. 27(1), 205-245(2017)
[33] Berahas, A.S., Bollapragada, R., Nocedal, J.:An investigation of Newton-sketch and subsampled Newton methods. Optim. Methods Softw. 35(4), 661-680(2020)
[34] Hestenes, M.R., Stiefel, E.:Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49(6), 409-436(1952)
[35] Agarwal, N., Bullins, B., Hazan, E.:Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(116), 1-40(2017)
[36] Bordes, A., Bottou, L., Gallinari, P.:Sgd-qn:careful quasi-newton stochastic gradient descent. J. Mach. Learn. Res. 10(3), 1737-1754(2009)
[37] Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.:A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008-1031(2016)
[38] Moritz, P., Nishihara, R., Jordan, M.:A linearly-convergent stochastic L-BFGS algorithm. In:International Conference on Artificial Intelligence and Statistics. pp. 249-258. PMLR (2016)
[39] Gower, R., Goldfarb, D., Richtarik, P.:Stochastic block BFGS:squeezing more curvature out of data. In:International Conference on Machine Learning. pp. 1869-1878. PMLR (2016)
[40] Martens, J., et al.:Deep learning via hessian-free optimization. Int. Conf. Mach. Learn. 27, 735-742(2010)
[41] Martens, J., Sutskever, I.:Learning recurrent neural networks with hessian-free optimization. In:International Conference on Machine Learning. pp. 1033-1040. ACM, New York, USA (2011)
[42] Schraudolph, N.N.:Fast curvature matrix-vector products for second-order gradient descent. Neural Comput. 14(7), 1723-1738(2002)
[43] Botev, A., Ritter, H., Barber, D.:Practical Gauss-Newton optimisation for deep learning. In:International Conference on Machine Learning. pp. 557-565. PMLR (2017)
[44] Martens, J., Grosse, R.:Optimizing neural networks with Kronecker-factored approximate curvature. In:International Conference on Machine Learning. pp. 2408-2417. PMLR (2015)
[45] Grosse, R., Martens, J.:A Kronecker-factored approximate fisher matrix for convolution layers. In:International Conference on Machine Learning. pp. 573-582. PMLR (2016)
[46] Amari, S.I.:Natural gradient works efficiently in learning. Neural Comput. 10(2), 251-276(1998)
[47] Roux, N., Manzagol, P.A., Bengio, Y.:Topmoumoute online natural gradient algorithm. In:Neural Information Processing Systems. vol. 20. Curran Associates, Inc. (2007). https://proceedings.neurips.cc/paper/2007/file/9f61408e3afb633e50cdf1b20de6f466-Paper.pdf
[48] Martens, J.:New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 21(146), 1-76(2020)
[49] George, T., Laurent, C., Bouthillier, X., Ballas, N., Vincent, P.:Fast approximate natural gradient descent in a kronecker-factored eigenbasis. In:Neural Information Processing Systems. pp. 9573-9583. NIPS'18, Curran Associates Inc., Red Hook, NY, USA (2018)
[50] Gupta, V., Koren, T., Singer, Y.:Shampoo:preconditioned stochastic tensor optimization. In:International Conference on Machine Learning. pp. 1842-1850. PMLR (2018)
[51] Chen, M., Gao, K., Liu, X., Wang, Z., Ni, N., Zhang, Q., Chen, L., Ding, C., Huang, Z., Wang, M., Wang, S., Yu, F., Zhao, X., Xu, D.:THOR, trace-based hardware-driven layer-oriented natural gradient descent computation. AAAI Conf. Artif. Intell. 35(8), 7046-7054(2021)
[52] Schraudolph, N.N., Yu, J., Günter, S.:A stochastic quasi-Newton method for online convex optimization. In:Artificial intelligence and statistics. pp. 436-443. PMLR (2007)
[53] Gower, R., Kovalev, D., Lieder, F., Richtarik, P.:RSN:randomized subspace Newton. In:Neural Information Processing Systems. vol. 32. Curran Associates, Inc. (2019)
[54] Hanzely, F., Doikov, N., Nesterov, Y., Richtarik, P.:Stochastic subspace cubic Newton method. In:International Conference on Machine Learning. pp. 4027-4038. PMLR (2020)
[55] Xu, P., Roosta, F., Mahoney, M.W.:Newton-type methods for non-convex optimization under inexact Hessian information. Math. Progr. 184(1), 35-70(2020)
[56] Mokhtari, A., Ribeiro, A.:RES:regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 62(23), 6089-6104(2014)
[57] Wang, X., Ma, S., Goldfarb, D., Liu, W.:Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927-956(2017)
[58] Mokhtari, A., Eisen, M., Ribeiro, A.:IQN:an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670-1698(2018)
[59] Mokhtari, A., Ribeiro, A.:Stochastic quasi-Newtonmethods. Proc.IEEE 108(11), 1906-1922(2020)
[60] Jalilzadeh, A., Nedić, A., Shanbhag, U.V., Yousefian, F.:A variable sample-size stochastic quasinewton method for smooth and nonsmooth stochastic convex optimization. Math. Op. Res. 47(1), 690-719(2022)
[61] Shi, H.J.M., Xie, Y., Byrd, R., Nocedal, J.:A noise-tolerant quasi-Newton algorithm for unconstrained optimization. SIAM J. Optim. (2022). https://doi.org/10.1137/20M1373190
[62] Martens, J., Sutskever, I.:Training Deep and Recurrent Networks with Hessian-Free Optimization, pp. 479-535. Springer, Berlin (2012)
[63] Cho, M., Dhir, C., Lee, J.:Hessian-free optimization for learning deep multidimensional recurrent neural networks. In:Neural Information Processing Systems. vol. 28. Curran Associates, Inc. (2015). https://proceedings.neurips.cc/paper/2015/file/a86c450b76fb8c371afead6410d55534-Paper.pdf
[64] Haider, A., Zhang, C., Kreyssig, F.L., Woodland, P.C.:A distributed optimisation framework combining natural gradient with hessian-free for discriminative sequence training. Neural Netw. 143, 537-549(2021)
[65] Roux, N.L., Fitzgibbon, A.:A fast natural Newton method. In:International Conference on Machine Learning. pp. 623-630. ICML'10, Omnipress, Madison, WI, USA (2010)
[66] Ba, J., Grosse, R., Martens, J.:Distributed second-order optimization using Kronecker-factored approximations. In:International Conference on Learning Representations (2017)
[67] Khalfan, H.F., Byrd, R.H., Schnabel, R.B.:A theoretical and experimental study of the symmetric rank-one update. SIAM J. Optim. 3(1), 1-24(1993)
[68] Conn, A.R., Gould, N.I., Toint, P.L.:Convergence of quasi-newton matrices generated by the symmetric rank one update. Math. Progr. 50(1), 177-195(1991)
[69] Jorge, N., Stephen, J.W.:Numerical Optimization. Springer, Berlin (2006)
[70] Davidon, W.C.:Varible metric method for minimization. Tech. rep., Argonne National Lab., Lemont, Ill. (1959). https://doi.org/10.2172/4222000,https://www.osti.gov/biblio/4222000
[71] Fletcher, R., Powell, M.J.:A rapidly convergent descent method for minimization. Comput. J. 6(2), 163-168(1963)
[72] Greenstadt, J.:Variations on variable-metric methods. (with discussion). Math. Comput. 24(109), 1-22(1970)
[73] Broyden, C.G.:Quasi-newton methods and their application to function minimisation. Math. Comput. 21(99), 368-381(1967)
[74] Fletcher, R.:A new approach to variable metric algorithms. Comput. J. 13(3), 317-322(1970)
[75] Goldfarb, D.:A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23-26(1970)
[76] Shanno, D.F.:Conditioning of quasi-newton methods for function minimization. Math. Comput. 24(111), 647-656(1970)
[77] Sherman, J., Morrison, W.J.:Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 21(1), 124-127(1950)
[78] Nocedal, J.:Updating quasi-newton matrices with limited storage. Math. Comput. 35(151), 773-782(1980)
[79] Liu, D.C., Nocedal, J.:On the limited memory BFGS method for large scale optimization. Math. Progr. 45(1-3), 503-528(1989)
[80] Wills, A.G., Schön, T.B.:Stochastic quasi-Newton with line-search regularisation. Automatica 127, 109503(2021)
[81] Mokhtari, A., Ribeiro, R.:Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16(98), 3151-3181(2015)
[82] Goldfarb, D., Ren, Y., Bahamou, A.:Practical quasi-newton methods for training deep neural networks. In:Neural Information Processing Systems. vol. 33, pp. 2386-2396. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/192fc044e74dffea144f9ac5dc9f3395-Paper. pdf
[83] Lucchi, A., McWilliams, B., Hofmann, T.:A variance reduced stochastic newton method. arXiv:1503.08316(2015)
[84] Konečnỳ, J., Richtárik, P.:Semi-stochastic gradient descent methods. arXiv:1312.1666(2013)
[85] Zhao, R., Haskell, W.B., Tan, V.Y.F.:Stochastic L-BFGS:improved convergence rates and practical acceleration strategies. IEEE Trans. Signal Process. 66(5), 1155-1169(2018)
[86] Gao, W., Goldfarb, D.:Block BFGS methods. SIAM J. Optim. 28(2), 1205-1231(2018)
[87] Kovalev, D., Mishchenko, K., Richtárik, P.:Stochastic newton and cubic newton methods with simple local linear-quadratic rates. arXiv:1912.01597(2019)
[88] Chen, J., Yuan, R., Garrigos, G., Gower, R.M.:San:stochastic average newton algorithm for minimizing finite sums. arXiv:2106.10520(2021)
[89] Berahas, A.S., Nocedal, J., Takac, M.:A multi-batch L-BFGS method for machine learning. In:Neural Information Processing Systems. vol. 29. Curran Associates, Inc. (2016). https://proceedings.neurips.cc/paper/2016/file/8ebda540cbcc4d7336496819a46a1b68-Paper.pdf
[90] Berahas, A.S., Takáč, M.:A robust multi-batch L-BFGS method for machine learning. Optim. Methods Softw. 35(1), 191-219(2020)
[91] Pearlmutter, B.A.:Fast exact multiplication by the hessian. Neural Comput. 6(1), 147-160(1994)
[92] Keskar, N.S., Berahas, A.S.:AdaQN:an adaptive quasi-Newton algorithm for training RNNs. In:Machine Learning and Knowledge Discovery in Databases. pp. 1-16. Springer International Publishing, Cham (2016)
[93] Berahas, A.S., Jahani, M., Richtárik, P., Takáč, M.:Quasi-Newton methods for machine learning:forget the past, just sample. Optim. Methods Softw. (2021). https://doi.org/10.1080/10556788.2021.1977806
[94] Chen, H., Wu, H.C., Chan, S.C., Lam, W.H.:A stochastic quasi-Newton method for large-scale nonconvex optimization with applications. IEEE Trans. Neural Netw. Learn. Syst. 31(11), 4776-4790(2020)
[95] Lotfi, S., de Ruisselet, T.B., Orban, D., Lodi, A.:Stochastic damped L-BFGS with controlled norm of the Hessian approximation. arXiv:2012.05783(2020)
[96] Wills, A., Schön, T.:Stochastic quasi-newton with adaptive step lengths for large-scale problems. arXiv:1802.04310(2018)
[97] Wills, A., Schön, T.B., Jidling, C.:A fast quasi-Newton-type method for large-scale stochastic optimisation. IFAC-PapersOnLine 53(2), 1249-1254(2020)
[98] Gower, R.M., Gondzio, J.:Action constrained quasi-newton methods. arXiv:1412.8045(2014)
[99] Hennig, P.:Probabilistic interpretation of linear solvers. SIAM J. Optim. 25(1), 234-260(2015)
[100] Gower, R.M., Richtárik, P.:Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380-1409(2017)
[101] Ma, X.:Apollo:an adaptive parameter-wised diagonal quasi-newton method for nonconvex stochastic optimization. In:International Conference on Learning Representations (2020)
[102] Dennis, J.E., Jr., Wolkowicz, H.:Sizing and least-change secant methods. SIAM J. Numer. Anal. 30(5), 1291-1314(1993)
[103] Wills, A.G., Schön, T.B.:On the construction of probabilistic newton-type algorithms.In:Conference on Decision and Control (CDC). pp. 6499-6504(2017). https://doi.org/10.1109/CDC.2017.8264638
[104] Zhou, C., Gao, W., Goldfarb, D.:Stochastic adaptive quasi-newton methods for minimizing expected values. In:International Conference on Machine Learning. pp. 4150-4159(2017)
[105] Nesterov, Y.:Introductory Lectures on Convex Optimization:A Basic Course, vol. 87. Springer, Berlin (2003)
[106] Mahsereci, M., Hennig, P.:Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 1-59(2017)
[107] Bollapragada, R., Byrd, R., Nocedal, J.:Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312-3343(2018)
[108] Balles, L., Romero, J., Hennig, P.:Coupling adaptive batch sizes with learning rates. In:Uncertainty in Artificial Intelligence. pp. 675-684. Curran Associates, Inc. (2017)
[109] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.:Accurate, large minibatch sgd:training imagenet in 1 hour. arXiv:1706.02677(2017)
[110] De, S., Yadav, A.K., Jacobs, D.W., Goldstein, T.:Automated inference with adaptive batches. In:International Conference on Artificial Intelligence and Statistics. vol. 54, pp. 1504-1513. PMLR (2017). http://proceedings.mlr.press/v54/de17a.html
[111] Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.J., Tang, P.T.P.:A progressive batching L-BFGS method for machine learning. In:International Conference on Machine Learning. pp. 620-629. PMLR (2018)
[112] Xie, Y., Byrd, R.H., Nocedal, J.:Analysis of the BFGS method with errors. SIAM J. Optim. 30(1), 182-209(2020)
[113] Sonnenburg, S., Franc, V., Yom-Tov, E., Sebag, M.:Pascal large scale learning challenge (2008)
[114] Bordes, A., Bottou, L., Gallinari, P., Chang, J., Smith, S.A.:Erratum:SGDQN is less careful than expected. J. Mach. Learn. Res. 11(77), 2229-2240(2010)
[115] Yousefian, F., Nedić, A., Shanbhag, U.V.:Stochastic quasi-Newton methods for non-strongly convex problems:convergence and rate analysis. In:Conference on Decision and Control (CDC). pp. 4496-4503. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CDC.2016.7798953
[116] Yousefian, F., Nedic, A., Shanbhag, U.V.:A smoothing stochastic quasi-newton method for nonlipschitzian stochastic optimization problems. In:Winter Simulation Conference (WSC). pp. 2291-2302. IEEE, Las Vegas, NV (2017). https://doi.org/10.1109/WSC.2017.8247960
[117] Jalilzadeh, A., Nedić, A., Shanbhag, U.V., Yousefian, F.:A variable sample-size stochastic quasinewton method for smooth and nonsmooth stochastic convex optimization. In:Conference on Decision and Control (CDC). pp. 4097-4102. IEEE, Institute of Electrical and Electronics Engineers Inc. (2018). https://doi.org/10.1109/CDC.2018.8619209
[118] Yousefian, F., Nedić, A., Shanbhag, U.V.:On stochastic and deterministic quasi-Newton methods for nonstrongly convex optimization:asymptotic convergence and rate analysis. SIAM J. Optim. 30(2), 1144-1172(2020)
[119] Chen, W., Wang, Z., Zhou, J.:Large-scale l-bfgs using mapreduce. In:Neural Information Processing Systems. vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/e49b8b4053df9505e1f48c3a701c0682-Paper.pdf
[120] Najafabadi, M.M., Khoshgoftaar, T.M., Villanustre, F., Holt, J.:Large-scale distributed L-BFGS. J. Big Data 4(1), 22(2017)
[121] Jahani, M., Nazari, M., Rusakov, S., Berahas, A.S., Takáč, M.:Scaling up quasi-newton algorithms:communication efficient distributed SR1. In:Machine Learning, Optimization, and Data Science, pp. 41-54. Springer, Cham (2020)
[122] Liu, Y., Gao, Y., Yin, W.:An improved analysis of stochastic gradient descent with momentum. In:Neural Information Processing Systems. vol. 33, pp. 18261-18271. Curran Associates Inc. (2020)
[123] Ghadimi, S., Lan, G.:Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Progr. 156(1), 59-99(2016)
[124] Ninomiya, H.:Neural network training based on quasi-Newton method using Nesterov's accelerated gradient. In:IEEE Region 10 Conference (TENCON). pp. 51-54. IEEE, Singapore (2016). https://doi.org/10.1109/TENCON.2016.7847957
[125] Indrapriyadarsini, S., Mahboubi, S., Ninomiya, H., Asai, H.:A stochastic quasi-Newton method with Nesterov's accelerated gradient. In:Machine Learning and Knowledge Discovery in Databases. pp. 743-760. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-46150-8_43
[126] Chang, D., Sun, S., Zhang, C.:An accelerated linearly convergent stochastic L-BFGS algorithm. IEEE Trans. Neural Netw. Learn. Syst. 30(11), 3338-3346(2019)
[127] Yasuda, S., Mahboubi, S., Indrapriyadarsini, S., Ninomiya, H., Asai, H.:A stochastic variance reduced Nesterov's accelerated quasi-Newton method. In:IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 1874-1879. IEEE, Boca Raton, FL, USA (2019). https://doi.org/10.1109/ICMLA.2019.00301
[128] Gao, H., Huang, H.:Faster stochastic second order method for large-scale machine learning models. In:SIAM International Conference on Data Mining (SDM). pp. 405-413. Proceedings, Society for Industrial and Applied Mathematics (2021). https://doi.org/10.1137/1.9781611976700.46
[129] Indrapriyadarsini, S., Mahboubi, S., Ninomiya, H., Kamio, T., Asai, H.:Accelerating symmetric rank-1 quasi-newton method with nesterov's gradient for training neural networks. Algorithms 15(1), 6(2022)
[130] Yu, J., Vishwanathan, S., Günter, S., Schraudolph, N.N.:A quasi-newton approach to nonsmooth convex optimization problems in machine learning. The J. Mach. Learn. Res. 11, 1145-1200(2010)
[131] Beck, A.:First-Order Methods in Optimization. SIAM, New Delhi (2017)
[132] Wang, J., Zhang, T.:Utilizing second order information in minibatch stochastic variance reduced proximal iterations. The J. Mach. Learn. Res. 20(1), 1578-1633(2019)
[133] Wang, X., Wang, X., Yuan, Y.X.:Stochastic proximal quasi-Newton methods for non-convex composite optimization. Optim. Methods Softw. 34(5), 922-948(2019)
[134] Yang, M., Milzarek, A., Wen, Z., Zhang, T.:A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math. Progr. (2021). https://doi.org/10.1007/s10107-021-01629-y
[135] Li, D.H., Fukushima, M.:On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM J. Optim. 11(4), 1054-1064(2001)
[136] Powell, M.J.:Algorithms for nonlinear constraints that use lagrangian functions. Math. Progr. 14(1), 224-248(1978)
[137] Curtis, F.:A self-correcting variable-metric algorithm for stochastic optimization. In:International Conference on Machine Learning. pp. 632-641. PMLR (2016)
[138] Yao, Z., Gholami, A., Shen, S., Mustafa, M., Keutzer, K., Mahoney, M.:Adahessian:an adaptive second order optimizer for machine learning. AAAI Conf. Artif. Intell. 35(12), 10665-10673(2021)
[139] Nair, V., Hinton, G.E.:Rectified linear units improve restricted boltzmann machines. In:International Conference on Machine Learning. p. 807-814. ICML'10, Omnipress, Madison, WI, USA (2010)
[140] Jahani, M., Nazari, M., Tappenden, R., Berahas, A., Takac, M.:SONIA:a symmetric blockwise truncated optimization algorithm. In:International Conference on Artificial Intelligence and Statistics. pp. 487-495. PMLR (2021)
[141] Yang, M., Xu, D., Chen, H., Wen, Z., Chen, M.:Enhance curvature information by structured stochastic quasi-Newton methods. In:Computer Vision and Pattern Recognition. pp. 10654-10663(2021)
[142] Ren, Y., Goldfarb, D.:Kronecker-factored quasi-newton methods for convolutional neural networks. arXiv:2102.06737(2021)
[143] Mokhtari, A., Daneshmand, H., Lucchi, A., Hofmann, T., Ribeiro, A.:Adaptive Newton method for empirical risk minimization to statistical accuracy. In:Neural Information Processing Systems. vol. 29. Curran Associates, Inc. (2016)
[144] Jin, Q., Mokhtari, A.:Exploiting local convergence of quasi-Newton methods globally:adaptive sample size approach. In:Neural Information Processing Systems. p. 12(2021)
[145] Boggs, P.T., Byrd, R.H.:Adaptive, limited-memory BFGS algorithms for unconstrained optimization. SIAM J. Optim. 29(2), 1282-1299(2019)
[146] Berahas, A.S., Curtis, F.E., Zhou, B.:Limited-memory BFGS with displacement aggregation. Math. Progr. (2021). https://doi.org/10.1007/s10107-021-01621-6
[147] Andrychowicz, M., Denil, M., Gómez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford, B., de Freitas, N.:Learning to learn by gradient descent by gradient descent. In:Neural Information Processing Systems. vol. 29. Curran Associates, Inc. (2016). https://proceedings.neurips.cc/paper/2016/file/fb87582825f9d28a8d42c5e5e5e8b23d-Paper.pdf
[148] Hochreiter, S., Younger, A.S., Conwell, P.R.:Learning to learn using gradient descent. In:International Conference on Artificial Neural Networks. pp. 87-94. Springer (2001)
[149] Egidio, L.N., Hansson, A., Wahlberg, B.:Learning the step-size policy for the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm. In:International Joint Conference on Neural Networks (IJCNN). pp. 1-8(2021). https://doi.org/10.1109/IJCNN52387.2021.9534194