Abstract
Reinforcement learning (RL) has been increasingly applied to cluster scheduling in high-performance computing (HPC). A key challenge for RL-driven scheduling is state representation for the RL agent, i.e., capturing the essential features of a dynamic scheduling environment for decision making. Existing state encoding approaches either omit critical scheduling information or suffer from poor scalability. In this study, we present SEM (Scalable and Efficient encoding Model) for general RL-driven scheduling in HPC. It captures the system resource state and the waiting job state, both of which are critical for scheduling, and encodes them into a fixed-size vector that serves as the agent's input. A typical agent is built on a deep neural network, and its training/inference cost grows rapidly with the size of its input. Since production HPC systems contain a large number of compute nodes, directly encoding each system resource would cripple the scalability of the RL agent. SEM uses two techniques to transform the system resource state into a compact vector, allowing it to represent a large number of system resources in a vector of only 100–200 elements. Our trace-based simulations demonstrate that, compared to existing state encoding methods, SEM achieves a 9X training speedup and a 6X inference speedup while maintaining comparable scheduling performance.
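The abstract does not spell out SEM's two transformation techniques, but the core idea, mapping per-node and per-job state onto a vector whose length is independent of cluster size, can be illustrated with a generic sketch. The grouping scheme, slot counts, and field layout below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def encode_state(node_available, waiting_jobs, job_slots=50, resource_bins=10):
    """Map a variable-sized cluster state onto a fixed-size vector.

    node_available : per-node booleans (True = node is free)
    waiting_jobs   : list of (requested_nodes, wait_time) tuples
    """
    node_available = np.asarray(node_available, dtype=float)
    n_nodes = len(node_available)

    # Resource state: instead of one entry per node, summarize availability
    # as a small histogram over equal-sized node groups, so the vector
    # length stays fixed whether the cluster has 100 or 10,000 nodes.
    groups = np.array_split(node_available, resource_bins)
    resource_vec = np.array([g.mean() for g in groups])  # fraction free per group

    # Job state: encode the first `job_slots` waiting jobs, zero-padded,
    # with requested node counts normalized by cluster size.
    job_vec = np.zeros(job_slots * 2)
    for i, (req, wait) in enumerate(waiting_jobs[:job_slots]):
        job_vec[2 * i] = req / n_nodes
        job_vec[2 * i + 1] = wait

    # Fixed output length: resource_bins + 2 * job_slots.
    return np.concatenate([resource_vec, job_vec])

# 100-node cluster with the first 90 nodes free and two waiting jobs.
state = encode_state([True] * 90 + [False] * 10,
                     [(16, 3.5), (128, 0.2)])
```

Because the resource portion aggregates rather than enumerates nodes, the agent's input dimension (here 110) does not grow with the machine, which is what keeps the downstream neural network's training and inference cost bounded.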
Index Terms
- Encoding for Reinforcement Learning Driven Scheduling (auto-classified)