Abstract
Reinforcement learning (RL) has been increasingly applied to cluster scheduling in high-performance computing (HPC). A key challenge for RL-driven scheduling is state representation for the RL agent, i.e., capturing the essential features of a dynamic scheduling environment for decision making. Existing state encoding approaches either omit critical scheduling information or suffer from poor scalability. In this study, we present SEM (Scalable and Efficient encoding Model) for general RL-driven scheduling in HPC. It captures the system resource state and the waiting job state, both of which are critical for scheduling, and encodes them into a fixed-size vector that serves as the agent's input. A typical agent is built on a deep neural network, and its training/inference cost grows rapidly with the size of its input. Since production HPC systems contain a large number of compute nodes, directly encoding each system resource would cripple the scalability of the RL agent. SEM uses two techniques to transform the system resource state into a compact vector, allowing it to represent a large number of system resources in a vector of only 100–200 elements. Our trace-based simulations demonstrate that, compared to existing state encoding methods, SEM achieves a 9X training speedup and a 6X inference speedup while maintaining comparable scheduling performance.
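The abstract does not spell out SEM's two transformation techniques, but the core idea, mapping per-node and per-job state onto a vector whose length is independent of cluster size, can be illustrated with a generic sketch. The grouping scheme, slot counts, and field layout below are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def encode_state(node_available, waiting_jobs, job_slots=50, resource_bins=10):
    """Map a variable-sized cluster state onto a fixed-size vector.

    node_available : per-node booleans (True = node is free)
    waiting_jobs   : list of (requested_nodes, wait_time) tuples
    """
    node_available = np.asarray(node_available, dtype=float)
    n_nodes = len(node_available)

    # Resource state: instead of one entry per node, summarize availability
    # as a small histogram over equal-sized node groups, so the vector
    # length stays fixed whether the cluster has 100 or 10,000 nodes.
    groups = np.array_split(node_available, resource_bins)
    resource_vec = np.array([g.mean() for g in groups])  # fraction free per group

    # Job state: encode the first `job_slots` waiting jobs, zero-padded,
    # with requested node counts normalized by cluster size.
    job_vec = np.zeros(job_slots * 2)
    for i, (req, wait) in enumerate(waiting_jobs[:job_slots]):
        job_vec[2 * i] = req / n_nodes
        job_vec[2 * i + 1] = wait

    # Fixed output length: resource_bins + 2 * job_slots.
    return np.concatenate([resource_vec, job_vec])

# 100-node cluster with the first 90 nodes free and two waiting jobs.
state = encode_state([True] * 90 + [False] * 10,
                     [(16, 3.5), (128, 0.2)])
```

Because the resource portion aggregates rather than enumerates nodes, the agent's input dimension (here 110) does not grow with the machine, which is what keeps the downstream neural network's training and inference cost bounded.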
Index Terms
- Encoding for Reinforcement Learning Driven Scheduling (auto-classified)