
Encoding for Reinforcement Learning Driven Scheduling

Published: 12 January 2023

Abstract

Reinforcement learning (RL) is being exploited for cluster scheduling in the field of high-performance computing (HPC). One of the key challenges for RL-driven scheduling is state representation for the RL agent, i.e., capturing the essential features of a dynamic scheduling environment for decision making. Existing state-encoding approaches either lack critical scheduling information or suffer from poor scalability. In this study, we present SEM (Scalable and Efficient encoding Model) for general RL-driven scheduling in HPC. It captures both system resource state and waiting-job state, both of which are critical information for scheduling, and encodes this information into a fixed-sized vector that serves as input to the agent. A typical agent is built on a deep neural network, and its training/inference cost grows rapidly with the size of its input. Production HPC systems contain a large number of compute nodes, so directly encoding each system resource would lead to poor scalability of the RL agent. SEM uses two techniques to transform the system resource state into a small-sized vector, and is hence capable of representing a large number of system resources in a vector of length 100–200. Our trace-based simulations demonstrate that, compared to existing state-encoding methods, SEM achieves a 9X training speedup and a 6X inference speedup while maintaining comparable scheduling performance.
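The abstract does not detail SEM's two encoding techniques, but the core requirement it describes — summarizing an arbitrary number of nodes and waiting jobs into a fixed-length vector — can be sketched as follows. This is a hypothetical illustration under simple assumptions, not the paper's actual model: node availability is compressed into a histogram over remaining busy time, and the waiting queue is truncated or zero-padded to a fixed window, so the vector length depends on the encoding parameters rather than the cluster size.

```python
def encode_state(node_busy, queue, n_bins=8, t_max=100.0, window=4):
    """Encode cluster + queue state as a fixed-length vector (illustrative only).

    node_busy: remaining busy time per node (0.0 means free); any length.
    queue: waiting jobs as (requested_nodes, requested_walltime) tuples.
    """
    n = max(len(node_busy), 1)
    # Histogram over remaining busy time: length n_bins, not len(node_busy),
    # so 100 or 100,000 nodes produce the same-sized representation.
    hist = [0.0] * n_bins
    for t in node_busy:
        b = min(int(t / t_max * n_bins), n_bins - 1)
        hist[b] += 1.0 / n  # fraction of nodes falling in this bin
    # First `window` waiting jobs (normalized features), zero-padded.
    jobs = []
    for req_nodes, req_time in queue[:window]:
        jobs += [req_nodes / n, req_time / t_max]
    jobs += [0.0] * (2 * window - len(jobs))
    return hist + jobs
```

With the defaults above, the vector has length n_bins + 2 * window = 16 whether the cluster has four nodes or a thousand, which is the scalability property the abstract attributes to SEM.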


Published in

Job Scheduling Strategies for Parallel Processing: 25th International Workshop, JSSPP 2022, Virtual Event, June 3, 2022, Revised Selected Papers
June 2022, 266 pages
ISBN: 978-3-031-22697-7
DOI: 10.1007/978-3-031-22698-4

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

Publisher: Springer-Verlag, Berlin, Heidelberg