DOI: 10.1145/3392717.3392774
Research Article

Characterization and identification of HPC applications at leadership computing facility

Published: 29 June 2020

ABSTRACT

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential for running large HPC applications. The Petascale era of supercomputing began in 2008, when the first machines exceeded one petaflops, and with the arrival of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, a machine's theoretical computing capability (i.e., peak FLOPS) is not the only meaningful target when designing a supercomputer, because the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation.

To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing a task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O libraries.
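A minimal sketch of this kind of per-job log correlation, assuming simplified, illustrative log schemas (column names such as job_id, run_seconds, io_seconds, fatal_events, and failed are hypothetical, not the facility's actual formats): records from the scheduler, I/O, and hardware-event subsystems are joined on a common job identifier, and simple derived quantities such as the I/O-time fraction and the conditional probability of task failure given a FATAL event fall out of the merged table.

# Hedged sketch: correlate per-job records from separate subsystem logs.
# All column names below are illustrative assumptions.
import pandas as pd

def correlate_logs(jobs: pd.DataFrame, io: pd.DataFrame, ras: pd.DataFrame) -> pd.DataFrame:
    """Merge scheduler, I/O, and hardware-event summaries into one per-job table."""
    df = jobs.merge(io, on="job_id", how="left").merge(ras, on="job_id", how="left")
    df[["io_seconds", "fatal_events"]] = df[["io_seconds", "fatal_events"]].fillna(0)
    df["io_fraction"] = df["io_seconds"] / df["run_seconds"]
    return df

def fatal_to_failure_probability(df: pd.DataFrame) -> float:
    """Estimate P(task failed | at least one FATAL hardware event during the job)."""
    hit = df[df["fatal_events"] > 0]
    return float(hit["failed"].mean()) if len(hit) else float("nan")

if __name__ == "__main__":
    # Tiny synthetic example so the sketch runs end to end.
    jobs = pd.DataFrame({"job_id": [1, 2, 3], "run_seconds": [3600, 7200, 1800], "failed": [0, 1, 0]})
    io = pd.DataFrame({"job_id": [1, 2], "io_seconds": [40, 300]})
    ras = pd.DataFrame({"job_id": [2, 3], "fatal_events": [1, 2]})
    merged = correlate_logs(jobs, io, ras)
    print(merged[["job_id", "io_fraction"]])
    print("P(failure | FATAL) =", fatal_to_failure_probability(merged))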

Based on holistic insights into the applications gained through combined analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features, and then train machine learning models to identify HPC applications or to group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
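A minimal sketch of such a fingerprinting pipeline, assuming synthetic stand-in features (names such as mpi_fraction and files_opened are illustrative) and an off-the-shelf random forest rather than the exact models used in the paper: engineered per-job features are standardized, embedded in 2-D with t-SNE to check visually whether runs of the same application cluster together, and then used to train a classifier that predicts the application label.

# Hedged sketch: fingerprint features -> t-SNE embedding -> application classifier.
# Feature names and data are synthetic placeholders, so the printed accuracy is
# meaningless here; the point is the shape of the pipeline.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
features = pd.DataFrame({
    "mpi_fraction": rng.random(300),          # fraction of runtime spent in MPI
    "read_gib_per_rank": rng.gamma(2.0, 1.0, 300),
    "write_gib_per_rank": rng.gamma(2.0, 1.0, 300),
    "files_opened": rng.integers(1, 1000, 300),
})
labels = rng.integers(0, 5, 300)              # stand-in application identifiers

X = StandardScaler().fit_transform(features)

# 2-D embedding for visual sanity-checking of the engineered features.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("t-SNE embedding shape:", embedding.shape)

# Train and evaluate a classifier that maps fingerprints to application labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))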
