Deploy an Auto-Scaling HPC Cluster with Slurm

Welcome to the Google Qwiklab for running a Slurm cluster on Google Cloud! By the end of this lab you should understand how to provision and operate an auto-scaling Slurm cluster.

Google Cloud teamed up with SchedMD to release a set of tools that make it easier to launch the Slurm workload manager on Compute Engine, and to expand your existing cluster dynamically when you need extra resources. This integration was built by the experts at SchedMD in accordance with Slurm best practices.

If you're planning on using the Slurm on Google Cloud integrations, or if you have any questions, please consider joining our Google Cloud & Slurm Community Discussion Group!

About Slurm


Figure: Basic architectural diagram of a stand-alone Slurm cluster in Google Cloud.

Slurm is one of the leading workload managers for HPC clusters around the world. It provides an open-source, fault-tolerant, and highly scalable workload management and job scheduling system for small and large Linux clusters. Slurm requires no kernel modifications to operate and is relatively self-contained. As a cluster workload manager, Slurm has three key functions:

  1. It allocates exclusive or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

  2. It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.

  3. It arbitrates contention for resources by managing a queue of pending work.
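
To make the three functions above concrete, here is a minimal toy sketch in Python. It is not how Slurm is implemented and does not use any Slurm API. The names `Job`, `ToyScheduler`, and `simulate` are hypothetical, and the strict first-in, first-out policy is an assumption chosen for simplicity (Slurm's real scheduling policies are more sophisticated). The sketch only shows how allocation, execution tracking, and a pending queue fit together.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int  # number of compute nodes the job requests
    duration: int      # number of time steps the job holds its nodes


@dataclass
class ToyScheduler:
    """Illustrative FIFO scheduler; a teaching aid, not Slurm's algorithm."""
    total_nodes: int
    free_nodes: int = field(init=False)
    pending: deque = field(default_factory=deque)  # queue of waiting jobs
    running: dict = field(default_factory=dict)    # name -> [job, steps left]
    log: list = field(default_factory=list)        # (time, event, job name)

    def __post_init__(self):
        self.free_nodes = self.total_nodes

    def submit(self, job):
        # Function 3: contention is arbitrated by queueing pending work.
        self.pending.append(job)

    def step(self, now):
        # Function 2: monitor running work and release nodes on completion.
        for name in list(self.running):
            entry = self.running[name]
            entry[1] -= 1
            if entry[1] <= 0:
                self.free_nodes += entry[0].nodes_needed
                del self.running[name]
                self.log.append((now, "finish", name))
        # Function 1: allocate nodes to the job at the head of the queue,
        # for as long as enough nodes are free.
        while self.pending and self.pending[0].nodes_needed <= self.free_nodes:
            job = self.pending.popleft()
            self.free_nodes -= job.nodes_needed
            self.running[job.name] = [job, job.duration]
            self.log.append((now, "start", job.name))


def simulate(total_nodes, jobs, max_steps=100):
    """Submit all jobs at time 0 and step until the cluster is idle."""
    sched = ToyScheduler(total_nodes)
    for job in jobs:
        sched.submit(job)
    for now in range(max_steps):
        sched.step(now)
        if not sched.pending and not sched.running:
            break
    return sched.log
```

For example, `simulate(2, [Job("a", 2, 2), Job("b", 1, 1), Job("c", 1, 1)])` starts job `a` immediately on both nodes, keeps `b` and `c` in the queue until `a` finishes, and then runs them side by side on one node each.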


In this lab, you will learn how to:

  • Use Google Cloud's Deployment Manager service.
  • Run a job using Slurm.
  • Query cluster information and monitor running jobs in Slurm.
  • Autoscale nodes to accommodate specific job parameters and requirements.
  • Find help with Slurm.

