Abstract
Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scale have made natural gradient optimization methods a reasonable alternative to stochastic gradient descent (SGD) and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher information matrix to improve per-iteration progress when optimizing an objective function. Here we propose a scalable K-FAC algorithm and investigate K-FAC’s applicability to large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. We evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than standard first-order optimizers on GPU clusters across a variety of scales.
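To make the preconditioning step concrete, below is a minimal NumPy sketch, not the paper's implementation, of how K-FAC preconditions the gradient of a single fully connected layer: the layer's Fisher block is approximated as the Kronecker product of an input-activation covariance and an output-gradient covariance, so the inverse is applied through two small factor inverses rather than the full matrix. The function name, damping value, and shapes are illustrative assumptions.

```python
import numpy as np

def kfac_precondition(acts, grads_out, grad_W, damping=1e-3):
    """Precondition grad_W for one fully connected layer via Kronecker factors.

    acts:      (batch, in_features)  layer inputs
    grads_out: (batch, out_features) gradients of the loss w.r.t. layer outputs
    grad_W:    (out_features, in_features) gradient of the loss w.r.t. weights
    """
    batch = acts.shape[0]
    # Kronecker factors of the layer's Fisher block: F ~= A (x) G
    A = acts.T @ acts / batch            # (in_features, in_features)
    G = grads_out.T @ grads_out / batch  # (out_features, out_features)
    # Tikhonov damping keeps the small factors well conditioned and invertible.
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    # (A (x) G)^{-1} vec(grad_W) == vec(G^{-1} grad_W A^{-1}), so only the two
    # small factor inverses are ever formed, never the full Fisher matrix.
    return G_inv @ grad_W @ A_inv

# Example usage with random data (shapes only; not a training loop).
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 64))   # batch of 32, 64 input features
g = rng.standard_normal((32, 10))   # output gradients for 10 outputs
grad_W = g.T @ a / 32               # averaged weight gradient
preconditioned = kfac_precondition(a, g, grad_W)
```

In a distributed setting, the factor construction and inversion for different layers can be assigned to different workers, and the relatively expensive inverses can be refreshed less often than the gradients themselves; this is roughly the flavor of the layer-wise distribution and update-decoupling strategies referred to in the abstract.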