Summary
Most MSI systems use job partitions to efficiently and fairly manage when computations are executed. A job partition is an automated waiting list for use of a particular set of computational hardware. When computational jobs are submitted to a job partition, they wait in line until the appropriate resources become available. Different job partitions have different resources and limitations, so when submitting a job it is very important to choose a partition whose resources and limitations suit the particular calculation.
This document outlines factors to consider when choosing a job partition. It applies to all MSI systems and is best used in conjunction with the partitions page, which lists the resource limits for each partition.
Guidelines
There are several important factors to consider when choosing a job partition for a specific program or custom script. In most cases, jobs are submitted via Slurm scripts as described in Job Submission and Scheduling (Slurm).
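For example, a minimal batch script might look like the following; the partition name, resource requests, module, and program here are placeholders and should be replaced with values appropriate to your job:

#!/bin/bash -l
#SBATCH --time=1:00:00      # requested walltime
#SBATCH --ntasks=1          # number of cores
#SBATCH --mem=4gb           # requested memory
#SBATCH -p small            # partition name (placeholder)
module load python3         # load needed software (placeholder)
python myscript.py          # run your program (placeholder)

The script is submitted with sbatch (for example, sbatch myjob.sh), and its status can be checked with squeue.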
Overall System
Each MSI system contains job partitions managing sets of hardware with different resource and policy limitations. MSI currently has three primary systems: the cluster Agate, the cluster Mesabi, and Mesabi's expansion Mangi. These clusters have a wide variety of partitions suitable for many different job types, including some Federated partitions for jobs that are flexible and can run on any of the clusters. The Agate and Mesabi Interactive partitions are primarily used for interactive software that is graphical in nature, and for short-term testing of jobs that will eventually run on a regular batch partition. Which system to choose depends largely on which system has partitions appropriate for your software or script. Examine the partitions page to determine the most appropriate system.
Job Walltime (--time=)
The job walltime is the time from the start to the finish of a job (as you would measure it using a clock on a wall), not including time spent waiting to run. This is in contrast to cputime, which measures the cumulative time all cores spent working on a job. Different job partitions have different walltime limits, and it is important to choose a partition whose walltime limit is high enough for your job to complete. Jobs that exceed their requested walltime are killed by the system to make room for other jobs. Walltime limits are maximums only; you can always request a shorter walltime, which will reduce the amount of time your job waits in the partition before it starts. If you are unsure how much walltime your job will need, start with the partitions with shorter walltime limits and only move to others if needed.
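For example, a job expected to run for roughly six hours might request a somewhat longer walltime as a safety margin; the value below is only illustrative:

#SBATCH --time=8:00:00      # hours:minutes:seconds (days-hours:minutes:seconds is also accepted)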
Job Nodes and Cores (--nodes= and --ntasks=)
Many calculations can use multiple cores, or (less often) multiple nodes, to improve calculation speed. Certain job partitions have maximum or minimum values for the number of nodes and cores a job may use. If Node Sharing is enabled for a partition, you can request fewer cores than exist on an entire node. If Node Sharing is not enabled, then you must request resources equivalent to a multiple of an entire node. Mesabi's large partition does not allow Node Sharing.
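For example, a multithreaded program using 8 cores on a single node of a partition with Node Sharing enabled might request the following (the numbers are illustrative):

#SBATCH --nodes=1       # all cores on a single node
#SBATCH --ntasks=8      # 8 cores

On a partition without Node Sharing, the request must instead cover whole nodes (for example, an --ntasks value equal to the full core count of a node).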
Job Memory (--mem=)
The amount of memory a job requires is an important factor when choosing a partition. The largest amount of memory (RAM) that can be requested for a job is limited by the memory on the hardware associated with that partition. Mesabi has two partitions (ram256g and ram1t) with high-memory hardware; the largest-memory hardware is available through the amd2tb partition.
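For example, a job that needs roughly 50 GB of RAM might include the directive below; the value is illustrative, and --mem specifies the memory required per node:

#SBATCH --mem=64gb      # memory per node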
User and Group Limitations
To efficiently share resources, many partitions have limits on the number of jobs or cores a particular user or group may simultaneously use. If a workflow requires many jobs to complete, it can be helpful to choose partitions which will allow many jobs to run simultaneously.
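Standard Slurm commands can show how many of your jobs are running or waiting; the partition and group names below are placeholders:

squeue -u $USER -p small     # your jobs in the small partition (placeholder name)
squeue -A mygroup            # jobs charged to the mygroup account (placeholder name)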
Special Hardware
Some partitions contain nodes with special hardware, GPU accelerators and solid-state scratch drives being the most common. If a calculation needs to use special hardware, then it is important to choose a partition with the correct hardware available. Furthermore, those partitions may require additional resources to be specified (e.g., A100 GPU nodes require "--gres=gpu:a100:1").
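For example, a job that needs one A100 GPU might include directives like the following; the partition name is a placeholder, and the correct GPU partition should be taken from the partitions page:

#SBATCH -p gpu-partition         # GPU partition (placeholder name)
#SBATCH --gres=gpu:a100:1        # request one A100 GPU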
Partition Congestion
At certain times particular partitions may become overloaded with submitted jobs. In such a case, it can be helpful to send jobs to partitions with lower utilization (node status). Sending jobs to lower utilization partitions can decrease wait time and improve throughput. Care must be taken to make sure calculations will fit within partition limitations.
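Partition load can be checked from the command line with standard Slurm tools; the partition name below is a placeholder:

sinfo -p small                       # node states in the small partition (placeholder name)
squeue -p small --state=PENDING      # jobs currently waiting in that partition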
Preemptable Partitions
The preempt and preempt-gpu partitions are special partitions that allow the use of idle interactive resources. Jobs submitted to these partitions may be killed at any time to make room for an interactive job, so care must be taken to use them only for jobs that can easily restart after being killed. An example job is shown below.
#!/bin/bash -l
#SBATCH --time=24:00:00          # walltime
#SBATCH --mem=20gb               # memory request
#SBATCH -n 12                    # number of cores
#SBATCH --requeue                # allow the job to be requeued if preempted
#SBATCH -p preempt-gpu           # preemptable GPU partition
#SBATCH --gres=gpu:k40:1         # request one K40 GPU
module load singularity
singularity exec --nv \
  /home/support/public/singularity/gromacs_2018.2.sif \
  gmx mdrun -s benchMEM.tpr -cpi state.cpi -append