ABSTRACT
High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, when the first machines achieved performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, because the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation.
To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing a task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O.
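To make the correlative analysis concrete, the following is a minimal sketch, not the paper's actual pipeline, of joining a scheduler log with a per-job I/O summary (such as one derived from Darshan) using pandas. The file names and column names here are hypothetical, chosen only to illustrate how per-job records from two subsystems can be combined and summarized.

```python
# Minimal sketch (assumed inputs, not the authors' pipeline): correlate two
# per-job log sources by job ID and summarize I/O behavior with pandas.
import pandas as pd

# Scheduler log: one row per job (hypothetical columns).
jobs = pd.read_csv("scheduler_jobs.csv")      # job_id, runtime_s, nodes, exit_status
# I/O summary log: one row per job (hypothetical columns).
io = pd.read_csv("io_summary.csv")            # job_id, posix_bytes, mpiio_bytes, io_time_s

merged = jobs.merge(io, on="job_id", how="inner")

# Fraction of runtime spent in file I/O, and whether the job used MPI-IO at all.
merged["io_frac"] = merged["io_time_s"] / merged["runtime_s"]
merged["uses_mpiio"] = merged["mpiio_bytes"] > 0

print(merged[["io_frac", "uses_mpiio"]].describe())
print("Share of jobs with POSIX-only I/O:", 1 - merged["uses_mpiio"].mean())
```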
Based on the holistic insights into applications gained through combined and co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
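As an illustration of the fingerprinting workflow, here is a minimal sketch under assumed feature and label names: engineered per-job features are embedded with t-SNE to check visually whether jobs of the same application cluster together, and a gradient-boosted classifier (a stand-in for the paper's models) is trained to identify the application from its fingerprint.

```python
# Minimal sketch of fingerprint-based application identification.
# Feature and label columns are hypothetical, not the paper's exact set.
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("job_fingerprints.csv")      # engineered features + an 'app' label
features = ["io_frac", "mpi_time_frac", "bytes_per_node", "msg_rate", "nodes"]
X, y = df[features].values, df["app"].values

# 2-D t-SNE embedding: used to eyeball whether same-application jobs cluster.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Supervised identification of the application from its fingerprint.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("identification accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```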