
Deep Neural Network Training With Distributed K-FAC

Published: 01 December 2022

Abstract

Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time, yet maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scale have made natural gradient optimization methods a reasonable alternative to stochastic gradient descent and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher Information Matrix to improve per-iteration progress when optimizing an objective function. Here we propose a scalable K-FAC algorithm and investigate K-FAC’s applicability to large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. We evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9–25% less time than standard first-order optimizers on GPU clusters across a variety of scales.
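For a single fully connected layer with weights W, K-FAC approximates that layer's block of the Fisher Information Matrix as the Kronecker product of two much smaller covariance factors: A, built from the layer's inputs, and G, built from the gradients with respect to the layer's outputs. Because (A ⊗ G)^-1 vec(∇W) = vec(G^-1 ∇W A^-1), the preconditioned gradient follows from two small matrix inverses rather than one large one. The sketch below is a minimal illustration of this per-layer preconditioning in PyTorch; the function name, tensor shapes, and damping value are assumptions for exposition, not the paper's implementation.

```python
# Minimal, illustrative K-FAC preconditioning for one fully connected layer.
# A sketch under assumed shapes and names, not the paper's implementation;
# the paper additionally explores inverse-free evaluation, whereas explicit
# inverses are used here only for clarity.
import torch

def kfac_precondition(grad_w, a, g, damping=1e-3):
    """Precondition a linear layer's weight gradient with Kronecker factors.

    grad_w : (out_dim, in_dim) gradient of the loss w.r.t. the weights
    a      : (batch, in_dim)   layer inputs (activations)
    g      : (batch, out_dim)  gradients w.r.t. the layer's outputs
    """
    batch = a.shape[0]
    # Kronecker factors: A ~ E[a a^T], G ~ E[g g^T]; the layer's Fisher block
    # is approximated by the Kronecker product A (x) G.
    A = a.t() @ a / batch                      # (in_dim, in_dim)
    G = g.t() @ g / batch                      # (out_dim, out_dim)
    # Tikhonov damping keeps the small factors well conditioned and invertible.
    eye_A = torch.eye(A.shape[0], dtype=A.dtype, device=A.device)
    eye_G = torch.eye(G.shape[0], dtype=G.dtype, device=G.device)
    A_inv = torch.inverse(A + damping * eye_A)
    G_inv = torch.inverse(G + damping * eye_G)
    # (A (x) G)^-1 vec(grad_w) = vec(G^-1 grad_w A^-1)
    return G_inv @ grad_w @ A_inv
```

This per-layer structure is what makes the strategies named in the abstract natural: factor computation and inversion partition cleanly by layer for distribution across workers, and the factors need not be refreshed on every iteration, which is the intent of the dynamic K-FAC update decoupling the paper explores.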


Published in: IEEE Transactions on Parallel and Distributed Systems, Volume 33, Issue 12, December 2022 (1246 pages)

Publisher: IEEE Press

ISSN 1045-9219. © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.