This article provides an in-depth overview and comparison of three popular schedulers—Slurm Workload Manager, IBM Platform Load Sharing Facility (LSF), and Kubernetes kube-scheduler functionality.
In this article, you will learn:
What Is Job Scheduling and How Important Is It for HPC?
Job scheduling is the process of determining what tasks your system is running and on which resources. In an HPC system, thousands of jobs and nodes may be operating at a single time. Without job scheduling, the tasks that users are trying to perform cannot be properly matched to the available resources.
HPC systems use job schedulers to manage job operations. Schedulers are programs that accept, schedule, and monitor jobs.These utilities enable operations teams to initiate and manage jobs manually or automatically through job control language statements. Manual management and monitoring are performed through a graphical user interface (GUI) or a command line interface (CLI).
The purpose of job schedulers is to:
- Minimize the length of time jobs wait in a queue
- Maximize job throughput to ensure as many jobs as possible are running simultaneously
- Optimize resource utilization to maximize ROI
When schedulers function effectively, these tools help ensure that the technical debt created by HPC systems is steadily decreased. This is essential considering the cost required to implement large scale HPC systems. Schedulers also help ensure that workloads are completed as quickly as possible, reducing bottlenecks for operations pending job results. In the case of machine learning, this means faster model training and faster time to market.
What Is Slurm?
Slurm is an open source job scheduling tool that you can use with Linux-based clusters. It is designed to be highly-scalable, fault-tolerant, and self-contained. Slurm does not require any kernel modifications for use.
When implemented, Slurm performs the following tasks:
- Assigns users to compute nodes. This access can be non-exclusive, with shared resources, or exclusive, with resources limited to a single user.
- Provides a framework for initiating, performing, and monitoring work on the assigned nodes. Work is typically managed as parallel jobs run on multiple nodes.
- Manages the queue of pending work and determines which job will be assigned to the node next.
Slurm also includes the option for extension through plugins. You can build custom plugins through the API or use a variety that are already prepared, including for:
- Authentication
- Job completion logging
- Multi-category security
- Power management
- Topology based scheduling
Architecture
The Slurm scheduler architecture is based on a modular approach that enables you to customize your deployment to suit your infrastructure. The main component is a centralized manager (slurmctld), which monitors work and resources. This manager is backed up by a failover copy to ensure continued operations.
On each compute node in your system, there is a daemon (slurmd) that is controlled by the manager. This daemon functions like a remote shell and provides hierarchical, fault-tolerant communications to other nodes and the manager.
If you are using a database, there is a daemon (slurmdbd) that is used to record information across your clusters. There is also a daemon (slurmrestd) that enables you to interact with the Slurm REST API.
Within Slurm, there are various user commands that you can use to manage your various components. These include:
- scontrol—enables you to monitor and modify cluster state and configuration information.
- sinfo—provides a report of system status.
- squeue—provides a report of job status.
- sacct—provides a report of running and completed jobs and job steps.
- srun—enables you to start jobs.
- sview—provides a graphic report of job status, system status, and network topology.
- sacctmgr—enables you to manage your database, including validating users and accounts, and identifying clusters.
- scancel—enables you to stop running or queued jobs.
Below is a diagram of how these various components and tools interact in a Slurm deployment.
For more information, check out the Slurm project page:
What Is LSF Session Scheduler?
IBM Platform Load Sharing Facility (LSF) is a platform designed for workload management in distributed, high performance computing (HPC) deployments. The LSF Session Scheduler is a scheduler for this platform that enables you to run batches of jobs on one set of resources. It offers low-latency execution of jobs based on a hierarchical scheduling model.
Session Scheduler is designed specifically for managing short-duration jobs, such as job arrays with parametric execution or list processes. For mixed length or longer jobs, you are better using traditional job creation, scheduling, and initiation methods, like job chunking.
The benefit of Session Scheduler is that it enables you to submit multiple tasks as a single LSF job. This ability improves the performance and throughput of the standard scheduler by reducing the number of job scheduling decisions that need to be made.
Additionally, implementing Session Scheduler grants you the following benefits:
- Minimizes latency for short jobs
- Optimizes system performance and cluster utilization
- Assigns resources based on established LSF policies
- Maintains existing job starters, resource limits, and pre and post-execution programs
- Capable of managing more than 50k jobs per user and thousands of users
When you use Session Scheduler, the process (ssched) is run similar to a parallel job and is dynamically scheduled. Each process that is created is responsible for one workload and is limited to the assigned resources. During operation, the scheduler dispatches jobs as task arrays or task definition files to the assigned execution agents until the batch is complete.
Below is a diagram of how the Session Scheduler accepts jobs from the master host and dispatches tasks.
For more information, check out the documentation:
https://www.ibm.com/docs/en/SSWRJV_10.1.0/lsf_welcome/lsf_kc_ss.html
What Is the Kubernetes Scheduler?
Kubernetes is a popular open source orchestration solution for container-based workloads. With Kubernetes, these workloads can be effectively managed in ways similar to traditional HPC clustering methods, though Kuberenetes alone does not offer all of the scheduling capabilities of Slurm - such as batch scheduling and gang scheduling - which we will discuss later in this post.
Kubernetes is based on clusters of nodes (either physical or virtual machines) that are controlled by a master. Each node hosts a group of pods (containers). These pods share resources in the node and exist in a local network. This network enables pods to communicate with each other while still containing isolated workloads or applications.
Below is a diagram of how Kubernetes Master relates to nodes and some of the components contained in each.
kube-scheduler
The default scheduler for Kubernetes deployments is kube-scheduler. This scheduler is run as part of the control plane. When you use Kubernetes, pods are frequently created and destroyed. When you create a new, the scheduler is needed to assign that pod to a node. Based on specified resource requirements and available resources, kube-scheduler locates a suitable node.
Locating a suitable node requires filtering nodes according to scheduling requirements. Any nodes that meet the requirements (called feasible nodes) are surfaced and scored to determine the best match. Scoring is based on multiple factors, including hardware, software, and policy restraints, data locality, affinity and anti-affinity definitions, resource requirements, and inter-workload interference.
Once a match is found, the pod is scheduled and a notification is sent to the API server. This notification makes the node accessible. If no nodes are available, the pod remains queued with the scheduler until a node is available.
For more information, check out the documentation:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
Choosing the Right Scheduler for HPC and AI Workloads
Historically, HPC implementations have primarily been used to run simulations. These systems were used to model complex systems in an attempt to make predictions about real-world events. For example, models often involved seismic modeling, computational chemistry, or financial risk management calculations.
As HPC technologies have advanced, however, the types of workloads that are run on clustered resources has begun to vary greatly. Modern compute-intensive workloads include training machine learning models, performing distributed analytics, and processing streaming data. This additional workload type and purpose has also created a need for different types of scheduling to optimize workloads.
HPC Schedulers Compared: Slurm vs LSF vs Kubernetes Scheduler
To choose the scheduler that is right for you, you need to compare each scheduler’s capabilities and determine which best meets your needs. Below is a comparison of Slurm vs LSF vs Kubernetes Scheduler. Although there are other options, these are a good place to start.
kube-scheduler vs Slurm
Slurm and kube-scheduler are similar in that both tools are the default for their given environments. Slurm is the go-to scheduler for managing the distributed, batch-oriented workloads typical for HPC. kube-scheduler is the go-to for the management of flexible, containerized workloads and microservices.
Slurm is a strong candidate due to its ability to integrate with common frameworks. For example, the Application Launch and Provisioning System (ALPS) framework, which enables you to manage runtimes and launch applications at scale.
In contrast, Kubernetes, and thus kube-scheduler, enables you to better manage cloud-native technologies and container-based microservices which can be scaled more flexibly than traditional applications.
LSF vs Kubernetes
While Slurm and kube-scheduler overlap in some way, it is more challenging to compare LSF Scheduler and kube-scheduler. This is because the Kubernetes and LSF platforms were designed for different problems and workloads.
The LSF platform was created for running diverse, finite, distributed workloads with flexible resource sharing. It supports parallel and serial batch jobs, including multi-step workflows and parametric jobs. Although it can run containerized workloads, this is not its primary purpose.
In contrast, Kubernetes was created to support running long-lived, highly available, scalable services. For example, API services, mobile backends, databases, and web-stores. Workloads run on Kubernetes are containerized and are typically deployed as part of a larger application.
Despite these differences in purpose, there are a few features that can be compared. LSF, and therefore Session Scheduler, provides more granular control over resource selection and workload placement. With it you can define resource requirement expressions based on boolean operators. This enables you to submit jobs with consideration for service level agreements tied to deadlines, throughput, or velocity. Session Scheduler can efficiently share resources regardless of job run-times and make thousands of scheduling decisions per second. These capabilities create a focus on throughput which is often critical for HPC workloads.
Specific Considerations for AI Workloads
Though Slurm, LSF and Kubernetes all have some overlap from a scheduling and orchestration perspective, each has some challenges when it comes to AI workloads.
The characteristics of deep learning workloads are very similar to those of HPC workloads and so we find some organizations using HPC workload managers like Slurm and LSF to operate their deep learning clusters. However, HPC cluster managers are not part of the data science ecosystem that is being developed around containers and Kubernetes, making it very difficult to integrate data science platforms like Kubeflow to such environments. In addition, Slurm and LSF are complicated to use and are difficult to maintain properly.
Kubernetes, on the other hand, is simpler to use and integrates with common data science workflows but is missing the batch system capabilities that Slurm and LSF have. Enterprises and academic institutions want to use Kubernetes in their deep learning training farms however they find it very inefficient. Typically such Kubernetes clusters are poorly managed, resources are left idle for too long, and users find themselves limited in the compute power they can consume, resulting in typical cluster utilization at around ~20% and highly limited data science productivity.
Comparison - The Bottom Line: Three Solutions Head to Head
Slurm
- The most popular scheduler for managing distributed, batch-oriented HPC workloads
- Integrates well with common HPC frameworks
- Complex to use and maintain, particularly with containerized workloads
LSF
- Built for running diverse, finite, distributed workloads with flexible resource sharing
- Sensitive to factors like timeliness, affinity, and topology
- Complex to use and maintain, particularly with containerized workloads
Kubernetes (kube-scheduler)
- The go-to solution for flexible, containerized workloads
- Part of the cloud-native ecosystem
- Integrates well with common container-based technologies
- Requires plugins for key scheduling features of Slurm and LSF, like topology awareness and batch system capabilities
Automate Job Scheduling with Run:ai
If key features of Slurm and LSF, like batch system capabilities, are necessary for your AI workloads, Run:ai’s Scheduler is a simple plug-in to Kubernetes that enables optimized orchestration of high-performance containerized workloads. The Run:ai platform includes:
- High-performance for scale-up infrastructures - pool resources and enable large workloads that require considerable resources to coexist efficiently with small workloads requiring fewer resources.
- Batch scheduling – workloads can start, pause, restart, end, and then shut down, all without any manual intervention. Plus, when a container terminates, the resources are released and can be allocated to other workloads for greater system efficiency.
- Topology awareness— inter-resource and inter-node communication enable consistent high performance of containerized workloads.
- Gang scheduling – containers can be launched together, start together, and end together for distributed workloads that need considerable resources.
Run:ai simplifies Kubernetes scheduling for AI and HPC workloads, helping researchers accelerate their productivity and the quality of their work. Learn more about the Run:ai Kubernetes Scheduler.