Deploy an Auto-Scaling HPC Cluster with Slurm
Welcome to the Google Qwiklab for running a Slurm cluster on Google Cloud! By the end of this lab you should have a solid understanding of how to provision and operate an auto-scaling Slurm cluster.
Google Cloud teamed up with SchedMD to release a set of tools that make it easier to launch the Slurm workload manager on Compute Engine, and to expand your existing cluster dynamically when you need extra resources. This integration was built by the experts at SchedMD in accordance with Slurm best practices.
Figure: Basic architectural diagram of a stand-alone Slurm cluster in Google Cloud.
Slurm is one of the leading workload managers for HPC clusters around the world. Slurm provides an open-source, fault-tolerant, and highly scalable workload management and job scheduling system for small and large Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions:
It allocates exclusive or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
It arbitrates contention for resources by managing a queue of pending work.
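All three functions are visible in a minimal batch script. The sketch below is illustrative (the script name, job name, and node count are examples, not part of this lab):

```shell
#!/bin/bash
# hello.sbatch -- illustrative minimal Slurm batch script
#SBATCH --job-name=hello        # name shown in the queue
#SBATCH --nodes=2               # request access to 2 compute nodes (function 1)
#SBATCH --time=00:05:00         # duration of the allocation
#SBATCH --output=hello_%j.out   # %j expands to the job ID

# Slurm starts and monitors the work on the allocated nodes (function 2);
# srun launches one copy of the command per node here.
srun hostname
```

Submitting it with `sbatch hello.sbatch` places it in the queue of pending work, where Slurm arbitrates contention for nodes (function 3).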
In this lab, you will learn how to:
- Use Google Cloud's Deployment Manager Service.
- Run a job using Slurm.
- Query cluster information and monitor running jobs in Slurm.
- Autoscale nodes to accommodate specific job parameters and requirements.
- Find help with Slurm.
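As a preview of the querying and monitoring you will do later, these are the standard Slurm CLI tools used throughout the lab (the job ID shown is an example):

```shell
sinfo                  # list partitions and node states (idle, alloc, down, ...)
squeue                 # show pending and running jobs
squeue -u $USER        # restrict the listing to your own jobs
scontrol show job 42   # detailed state of a single job (42 is an example job ID)
```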
Create a Deployment Manager Deployment
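The cluster is created from the SchedMD-provided templates via Google Cloud's Deployment Manager. A typical command sequence looks like the following sketch; the deployment name and config filename are placeholders, and the repository layout may differ by version:

```shell
# Clone SchedMD's Slurm on Google Cloud scripts
git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp

# Edit the YAML config to set your zone, machine types, and maximum
# node counts, then create the deployment:
gcloud deployment-manager deployments create slurm-cluster \
    --config slurm-cluster.yaml
```

Deployment Manager then provisions the controller, login node, and any static compute nodes defined in the config.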
Run a Slurm Job
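From the login node, submitting work generally takes one of two forms. The script name below is illustrative:

```shell
# Submit a batch script; Slurm prints the assigned job ID
sbatch hello.sbatch

# Or run a command interactively across 2 nodes
srun -N2 hostname
```

`squeue` shows the job while it is pending or running, and the output file lands in the submission directory when it completes.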
Scale a Slurm Cluster
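Autoscaling can be observed by requesting more nodes than are currently running; the Slurm-GCP integration then provisions new Compute Engine instances to satisfy the request. The node count here is illustrative:

```shell
# Ask for 4 nodes when fewer are currently up; the job pends while
# the integration's resume logic creates the missing instances
sbatch -N4 --wrap="srun hostname"

# Watch the new nodes appear and change state in the partition listing
watch sinfo
```

After the configured idle timeout, unused dynamic nodes are suspended and their instances deleted, so you only pay for compute while jobs need it.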