ABSTRACT
High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, when the first machines achieved performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, because the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation.
To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing a task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O.
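To make the correlative analysis concrete, the following is a minimal sketch, not the paper's actual pipeline, of joining a scheduler log with a per-job I/O summary (such as one derived from Darshan) using pandas. The file names and column names here are hypothetical, chosen only to illustrate how per-job records from two subsystems can be combined and summarized.

```python
# Minimal sketch (assumed inputs, not the authors' pipeline): correlate two
# per-job log sources by job ID and summarize I/O behavior with pandas.
import pandas as pd

# Scheduler log: one row per job (hypothetical columns).
jobs = pd.read_csv("scheduler_jobs.csv")      # job_id, runtime_s, nodes, exit_status
# I/O summary log: one row per job (hypothetical columns).
io = pd.read_csv("io_summary.csv")            # job_id, posix_bytes, mpiio_bytes, io_time_s

merged = jobs.merge(io, on="job_id", how="inner")

# Fraction of runtime spent in file I/O, and whether the job used MPI-IO at all.
merged["io_frac"] = merged["io_time_s"] / merged["runtime_s"]
merged["uses_mpiio"] = merged["mpiio_bytes"] > 0

print(merged[["io_frac", "uses_mpiio"]].describe())
print("Share of jobs with POSIX-only I/O:", 1 - merged["uses_mpiio"].mean())
```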
Based on the holistic insights into applications gained through combined and co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
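As an illustration of the fingerprinting workflow, here is a minimal sketch under assumed feature and label names: engineered per-job features are embedded with t-SNE to check visually whether jobs of the same application cluster together, and a gradient-boosted classifier (a stand-in for the paper's models) is trained to identify the application from its fingerprint.

```python
# Minimal sketch of fingerprint-based application identification.
# Feature and label columns are hypothetical, not the paper's exact set.
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("job_fingerprints.csv")      # engineered features + an 'app' label
features = ["io_frac", "mpi_time_frac", "bytes_per_node", "msg_rate", "nodes"]
X, y = df[features].values, df["app"].values

# 2-D t-SNE embedding: used to eyeball whether same-application jobs cluster.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Supervised identification of the application from its fingerprint.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("identification accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```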