HPC

What is HPC?

MSI's High Performance Computing (HPC) systems are designed with high-speed networks, high-performance storage, GPUs, and large amounts of memory in order to support some of the most compute- and memory-intensive programs developed today.

MSI currently has three main HPC systems: Agate, Mesabi, and Mangi. Mesabi, the oldest cluster, is an HP Linux system featuring a large number of nodes with Intel processors that are tightly integrated via a very high-speed communication network. Mangi is a heterogeneous HPE cluster featuring nodes with AMD processors, some with NVIDIA V100 GPUs, tightly integrated via a high-speed InfiniBand network. The newest cluster, Agate, put into service in April 2022, has many nodes with some of the newest AMD processors and a substantial number of nodes with the latest NVIDIA A100 and A40 GPUs.

What Can I Do with HPC?

MSI's HPC systems have direct access to high-performance storage and many of MSI's software resources, including popular programming languages such as Python, R, MATLAB, and C compilers. This integration creates a computational environment that is flexible and powerful enough to accommodate a wide range of research needs. Researchers from departments across the University use MSI's HPC resources daily to accelerate their research.
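
For example, much of this software is typically accessed through environment modules; the module name below is only illustrative, and module avail shows what is actually installed on a given system:

    # List the software modules available on the current system
    module avail

    # Load an illustrative module and check it
    # (exact names and versions vary by system)
    module load python3
    python3 --version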

How Do I Access the HPC systems?

The first step to accessing MSI's HPC systems is to become an MSI user. From there, MSI's HPC systems are primarily accessed via a terminal interface, and many of our users write custom programs to run complex analyses. MSI also provides interactive access to the HPC systems through NICE, iPython Notebook, and interactive MATLAB options.
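
As a minimal sketch, terminal access is a single SSH connection from your own machine. The hostname below follows MSI's cluster naming, but check MSI's current connection instructions; "yourusername" is a placeholder:

    # Connect to the Agate cluster with your MSI username
    ssh yourusername@agate.msi.umn.edu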

 

HPC Fairshare Scheduling

HPC Fairshare Scheduling using Slurm

MSI uses a fairshare job scheduler to try to ensure that a mix of jobs from all users can utilize any given HPC resource efficiently and fairly. The Slurm scheduler tracks system usage behind the scenes and uses this information to modify the fairshare priority. The goal of fairshare is to increase the scheduling priority of jobs belonging to groups that are below their fairshare targets, and decrease the priority of jobs belonging to groups whose usage exceeds their fairshare targets. In general, this means that when a group has recently used a large amount of resources, the priorities of their waiting jobs will be negatively affected until their usage decreases to their fairshare target once again.
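
To see where your group stands relative to its fairshare target, Slurm's sshare command reports each association's share and recent usage. A minimal sketch, assuming a standard Slurm setup ("mygroup" is a placeholder for your group's account name):

    # Show the fairshare standing of your own associations
    sshare -U

    # Show every user under a specific group account
    sshare -a -A mygroup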
 
When scheduling jobs and calculating priorities of waiting jobs, there are many factors to consider, and fairshare is only one such factor. MSI also uses queue time - the time that a job has been waiting to run - to affect the priority of any given job. The longer a job waits, the more the queue time factor will add to the job's priority. The job's requested walltime, relative to the maximum walltime on the resource where the job is waiting, also affects the job's priority.

Fairshare Priority Breakdown

To help groups determine what is impacting their job priority, the command sprio -u $USER gives a breakdown of the three main components used to calculate job priority: AGE, JOBSIZE, and FAIRSHARE. Each of these components can be described as follows:
  • AGE component -  This component of priority will increase as a job waits in the queue. Each group has a limited number of jobs that will gain AGE priority at a given time. The longer a job waits in the queue, the higher priority it will be given due to this AGE component.
  • JOBSIZE component - This component of priority is based on the number of resources requested. A larger job will receive a higher priority. 
  • FAIRSHARE component - The FAIRSHARE component is adjusted based on the recent usage of the job owner's group. Fairshare usage is proportional to the resources used by the group. CPU, memory, and GPU resources are tracked separately and have different weights for fairshare. For example, a single hour on a V100 GPU counts the same as 88 CPU·hours or 602 GB·hours of memory. The resources for each job are summed and multiplied by the duration of the job (see the worked example after this list). The impact on priority decays over time, so a recently completed job has a greater impact on priority than a job that completed last week.
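
As a worked example of the weighting above (the job shape here is hypothetical): a 10-hour job using 22 CPU cores, 301 GB of memory, and one V100 GPU accrues 22 × 10 = 220 CPU·hours, 301 × 10 = 3,010 GB·hours (3,010 / 602 × 88 = 440 CPU·hour equivalents), and 1 × 10 = 10 GPU·hours (10 × 88 = 880 CPU·hour equivalents), or about 1,540 CPU·hour equivalents of fairshare usage in total. To see the priority breakdown for your own pending jobs:

    # Break down the priority of your pending jobs into
    # AGE, JOBSIZE, and FAIRSHARE components
    sprio -u $USER

    # The same breakdown with each factor normalized to [0, 1]
    sprio -u $USER -n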
 
Additionally, the Slurm scheduler is configured to first try to schedule jobs requesting a large amount of resources, and then schedule smaller jobs around the larger jobs. Jobs requesting a large amount of resources need to reserve those resources in order to run, and they cannot run until sufficient resources are free. Because it is undesirable to leave resources idle, the scheduler uses smaller jobs to fill in the gaps created by the reservations of the large jobs. This scheduling behavior is called "backfill." Accurate estimates of wall clock time for your jobs, especially small jobs, will help the scheduler start them promptly.
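
In practice, the walltime estimate is the --time value in your job script. A minimal sketch (the partition name, resource requests, module name, and script name are placeholders, not MSI-specific recommendations):

    #!/bin/bash
    #SBATCH --time=02:00:00       # an accurate walltime estimate helps backfill
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16g
    #SBATCH --partition=small     # placeholder partition name

    module load python3           # illustrative module name
    python3 my_analysis.py        # placeholder analysis script

Requesting 2 hours for a job that reliably finishes in 1.5 lets the scheduler slot it into small backfill windows that a 24-hour request would never fit.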
 
MSI understands that no one wants to wait. It is also true that no scheduling policy can guarantee that no one will wait - only impossibly large machines can guarantee that - so we use fairshare to try to ensure a mix of jobs from all users can utilize the resources efficiently and fairly. We monitor queues and often adjust parameters to get better turnaround times on jobs. Your comments are always welcome.
 
Job Limits

Slurm enforces resource limits for each research group. Each group is limited to 10,000 CPU cores, 8 V100 GPUs, and 40 TB of memory. If your job is waiting in the queue, it may be limited by the resources being used by other users in your group.
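
To check whether jobs from others in your group are consuming these limits, you can list all jobs charged to your group's account ("mygroup" is a placeholder for your group's account name):

    # Show running and pending jobs charged to your group's account
    squeue -A mygroup

    # Show only your own jobs
    squeue -u $USER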