HPC
MSI's High Performance Computing (HPC) systems are designed with high-speed networks, high-performance storage, GPUs, and large amounts of memory to support some of the most compute- and memory-intensive programs developed today.
MSI currently has three main HPC systems: Agate, Mesabi, and Mangi. Mesabi, the oldest cluster, is an HP Linux system featuring a large number of nodes with Intel processors that are tightly integrated via a very high-speed communication network. Mangi is a heterogeneous HPE cluster featuring nodes with AMD processors, and some nodes with NVIDIA V100 GPUs, that are tightly integrated via a high-speed InfiniBand network. The newest cluster, Agate, put into service in April 2022, has many nodes with some of the newest AMD processors and a substantial number of nodes with the latest NVIDIA A100 and A40 GPUs.
MSI’s HPC systems have direct access to high-performance storage and many of MSI's software resources, including popular programming languages such as Python, R, and Matlab, as well as C compilers. This integration creates a computational environment that is flexible and powerful enough to accommodate a wide range of research needs. Researchers from departments across the University use MSI’s HPC resources daily to accelerate their research.
The first step to accessing MSI’s HPC systems is to become an MSI user. From there, MSI’s HPC systems are primarily accessed via a terminal interface, and many of our users write custom programs to run complex analyses. MSI also provides interactive access to the HPC systems through NICE, iPython Notebook, and interactive MATLAB options.
HPC Fairshare Scheduling using Slurm
Fairshare Priority Breakdown
- AGE component - This component of priority increases as a job waits in the queue: the longer a job waits, the higher the priority it gains from this component. Each group has a limited number of jobs that can accrue AGE priority at any given time.
- JOBSIZE component - This component of priority is based on the amount of resources requested; a larger job receives a higher priority.
- FAIRSHARE component - The FAIRSHARE component is adjusted based on the recent usage of a group's members. Fairshare usage is proportional to the resources used by the group. CPU, memory, and GPU resources are tracked separately and carry different weights for Fairshare: for example, a single hour on a V100 GPU counts the same as 88 CPU·hours or 602 GB·hours of memory. The resources for each job are summed and multiplied by the duration of the job. The impact on priority decays over time, so a recently completed job has a greater impact on priority than a job that completed last week (see the sketch after this list).
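The weighting above implies that each resource-hour is converted into a common unit (CPU·hour equivalents) before being summed. The short Python sketch below works through that arithmetic for a hypothetical job. The billing weights are derived from the V100 example above, while the 7-day decay half-life and the function names are illustrative assumptions, not MSI's actual Slurm configuration.

```python
# A sketch of fairshare usage accounting, assuming resources are billed
# in CPU-hour equivalents. Weights follow the example above (1 V100
# GPU-hour = 88 CPU-hours = 602 GB-hours of memory); the 7-day decay
# half-life is an assumed value, not MSI's actual setting.
CPU_WEIGHT = 1.0
V100_GPU_WEIGHT = 88.0
MEM_GB_WEIGHT = 88.0 / 602.0  # one GB-hour in CPU-hour equivalents

def job_usage(cpus, mem_gb, v100_gpus, hours):
    """Weighted resource usage of one job: resources are summed,
    then multiplied by the job's duration."""
    weighted = (cpus * CPU_WEIGHT
                + mem_gb * MEM_GB_WEIGHT
                + v100_gpus * V100_GPU_WEIGHT)
    return weighted * hours

def decayed_usage(usage, days_since_end, half_life_days=7.0):
    """Older jobs count less toward fairshare: usage is halved for
    every half-life that has elapsed since the job finished."""
    return usage * 0.5 ** (days_since_end / half_life_days)

# A 1-hour job using 4 CPUs, 16 GB of memory, and one V100 GPU.
usage = job_usage(cpus=4, mem_gb=16, v100_gpus=1, hours=1)
print(f"usage if run today:     {usage:.1f} CPU-hour equivalents")
print(f"usage if run last week: {decayed_usage(usage, 7):.1f}")
```

Run today, this job charges roughly 94 CPU·hour equivalents against the group's fairshare; after a week of decay (under the assumed half-life) the same job counts only half as much, which is why recent jobs dominate a group's FAIRSHARE priority.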