Partition Configuration and Limits

Partitions

The scheduler on Nova is the Slurm workload manager.  To see the current partition configuration, issue:

scontrol show partitions

Most job submissions do not require specifying a partition, and those jobs will run in the primary nova partition.  Partitions are no longer used to separate out special hardware (e.g., GPU or high-memory nodes).

The current nova partitions are:

  • nova - This is the primary partition.  Jobs in this partition may be time-sliced, meaning that Slurm may schedule two jobs to run on the same CPU resources and alternate which job is actively running.  Both jobs are held in memory, so the suspend/restart operation is very fast.  GPU jobs are not time-sliced.  Time spent in a suspended state does not count against the job's time limit, although its wall-clock time may be up to 2x longer than requested.
  • interactive - This partition is for interactive jobs, such as those launched with salloc or via Nova OnDemand.  To facilitate interactive sessions, this partition does not have time-slicing enabled.
  • reserved - This partition is for jobs which require access to special-purpose nodes, such as those purchased with funding sources that restrict their usage to specific users/groups or specific uses.  Nodes in this partition may be used by any group with access to the partition, but each group is limited to the quantity of resources they have contributed (e.g., if a group buys 12 nodes with a grant and adds them to the reserved partition, their jobs may run on any node in the partition, but they cannot run more than 12 nodes' worth of jobs at once).
  • scavenger - This partition allows users to utilize additional resources.  Users may go beyond their limits on the nova partition as well as leverage idle resources in the reserved partition.  Note: jobs in the scavenger partition may be killed and requeued at ANY time.  Submitting jobs to the scavenger partition requires adding -q scavenger to the salloc command or sbatch file (see the example after this list).
  • instruction - This partition is for class usage.  It contains 11 nodes exclusively for classroom usage (8 compute nodes and 3 nodes with 8x A100 GPUs each) and an additional 8 general-compute nodes.
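
For example, a minimal sbatch script targeting the scavenger partition might look like the sketch below (the node, task, and time requests and the program name are placeholders, and the partition is specified explicitly alongside the QOS):

#!/bin/bash
#SBATCH -p scavenger
#SBATCH -q scavenger
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 1-00:00:00
# The job may be killed and requeued at any time, so the workload should be
# restartable or otherwise tolerant of interruption.
./my_program.sh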

 

Each partition allows jobs up to 31 days long.
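
For example, a job requesting that maximum would use the time flag -t 31-00:00:00 (days-hours:minutes:seconds) in its salloc command or sbatch script.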

To see current partition state, such as how many nodes are idle, issue:

sinfo 


Resource Limits

Besides partition limits, each group is subject to maximum resource limits.  There are two categories of limits: absolute maximums and runtime limits.

Classes utilizing the instruction partition are only limited by resources in the instruction partition.

Groups with access to the reserved partition submit jobs with per-job QOSes which limit their group's resource usage to match the nodes they have placed in the reserved partition (see the example below).
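
For instance, assuming a group's reserved-partition QOS is named mygroup-reserved (a hypothetical name; the actual QOS name is assigned when the nodes are added) and that the partition is specified explicitly, a submission might look like:

sbatch -p reserved -q mygroup-reserved job_script.sh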

Note that jobs are always limited by the available hardware on the cluster.  Just because a group has significant resources set in their limits does not guarantee that the cluster has any particular quantity of idle resources available for immediate usage.


Absolute-Maximum Limits

Each group on the cluster is limited to the following across all of their running jobs:

  • 3600 CPUs
  • 38,000,000 MB (~38 TB) RAM
  • The greater of 2 or 3x the group's depreciated GPU count (see the worked example below).
    • Depreciation of resources is described in the Runtime Limits section.  Fractional values are rounded up to the next integer.
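
As a worked example with hypothetical numbers: a group that purchased 4 GPUs 24 months ago has a depreciated GPU count of 4 * (60 - 24)/60 = 2.4 GPUs, so its absolute-maximum GPU limit is the greater of 2 and 3 * 2.4 = 7.2, which rounds up to 8 GPUs.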


These limits correspond to approximately 25% of the current total Nova compute resources, and are larger than the entirety of the original Nova cluster as-purchased in 2018.


Runtime Limits

Starting in January 2024, Nova uses a new framework for setting runtime limits for each group.  We use the GrpTRESRunMins option in Slurm, which limits the resource-minutes available to a group's running jobs rather than just the absolute resource count.  This means the scheduler looks at the CPU-minutes of a running job rather than just the number of CPUs being used by the job.  Megabyte-minutes and GPU-minutes are also considered.  This allows groups to run larger jobs so long as the time limits of those jobs are sufficiently short and the resources requested by the jobs fall beneath the absolute-maximum limits discussed above.
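
As an illustration, a running job that requested 100 CPUs with a 10-hour time limit represents 100 * 600 = 60,000 CPU-minutes, and that amount must fit within the group's runtime limit alongside its other running jobs.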

Note that this only limits running jobs.  It does not impose a limit on the ability to submit jobs to the queue.

Runtime limits for a group are calculated as the sum of the community allocation granted to every group and the depreciated value of resources purchased by the group.  Purchased resources are depreciated linearly over 5 years, so for the first 5 years after a purchase the group has additional resources available in excess of the community allocation.  At the end of the 5 years the group still has the foundational community allocation available to them.

The community allocation is:

  • 64 CPUs
  • 500,000 MB (~500 GB) RAM
  • 2 GPUs

 

The runtime limits appear large because they are based on having the ability to run a month-long job which uses all of the resources (community allocation + depreciated purchases) available to the group.

For a concrete example, consider the CPU-minutes limit for a group that purchased a 48-CPU node 18 months ago.  The depreciated value of that node is 48 * (60 - 18)/60 = 33.6 CPUs.  Adding in the 64-CPU community allocation brings that up to 97.6 CPUs.  To get our limit, we multiply

97.6 CPUs * 44640 Minutes/month * 1.1 = 4,792,550.4 CPU-minutes

The factor of 1.1 is included to provide headroom to run a few small jobs while the month-long job is running.  This calculated limit is rounded up to the nearest integer and then set in the scheduler.

For this example group with 4,792,551 CPU-minutes available to them, the trade-off between job size and allowable runtime would look like this:

 
Number of CPUs          Allowed Time Limit
1                       31 days (partition time-limit)
4                       31 days (partition time-limit)
8                       31 days (partition time-limit)
...
97                      31 days (partition time-limit)
128                     26 days
256                     13 days
512                     6.5 days
3600 (absolute-max)     0.92 days
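
The values in this table follow from dividing the group's limit by the CPU count and converting minutes to days.  For example, for 256 CPUs the allowed time can be checked on the command line with:

echo "4792551 / 256 / 1440" | bc -l

which gives approximately 13.0 days.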

Note that when running multiple jobs, the CPU count in the first column is shared among them.  For example, if this group can run 512 CPUs for 6.5 days, they can divide that up as:

  • 1x 512-CPU job for 6.5 days
  • 2x 256-CPU jobs, each for 6.5 days
  • 4x 128-CPU jobs, each for 6.5 days
  • 1x 128-CPU job for 13 days, plus 1x 256-CPU job for 6.5 days
  • etc.
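
As a quick check against the example limit above, the second option works out to 2 * 256 CPUs * 6.5 days * 1440 minutes/day = 4,792,320 CPU-minutes, which fits just under the group's 4,792,551 CPU-minute limit.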

 

To query the runtime limits available to your group, run:

sacctmgr show assoc where account=${YOUR_SLURM_ACCOUNT} format=GrpTRESRunMin%60

You will see output like:

cpu=4763088,gres/gpu=44352,mem=36956554272M

Note that the units are CPU-minutes, GPU-minutes, and MB-minutes.
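
As a rough conversion, the CPU figure above corresponds to 4,763,088 / 1440 ≈ 3,308 CPU-days, i.e., roughly 106 CPUs running continuously for the full 31-day partition limit.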

These runtime limits can be applied at each level of the Slurm account hierarchy.  For most groups, this is not an issue.  For groups where a parent entity has purchased resources (e.g., a departmental or college-level purchase for the benefit of all groups within that unit), it is possible that although a research group is not hitting its own runtime limits, it may still be limited by the usage of sibling groups within the higher-level unit.

For example, the fictitious Quantum Studies department purchases resources for its faculty to use and has a runtime limit of 10,000,000 CPU-minutes.  The department says each faculty member may use up to 75% of the departmental limit.  Professor Alice's group has jobs running which are using 6,500,000 CPU-minutes (65% of the departmental limit).  Professor Bob's running jobs are therefore limited to 3,500,000 CPU-minutes (35% of the departmental limit) because Alice's and Bob's combined usage is not allowed to exceed their department's limit.


Checking Resource Usage

Users may check the current resource-usage on Nova using the slurm-tres-usage script on the Nova login node.  Run slurm-tres-usage -h to see all available options.  

When run without arguments, the script returns a tree-structured view of resource usage, shown as a percentage of the runtime limits used by each Slurm account.  There are 3 pairs of columns, one pair each for CPU, memory, and GPU resources.  Each pair shows the percent of the runtime limits used by running jobs and the percent represented by pending jobs.

Note that due to nuances of the scheduler algorithm it is sometimes possible for groups to utilize more than 100% of their runtime limits.  System administrators may also adjust runtime limits from time to time in an effort to better balance system utilization.  This typically takes the form of increasing runtime limits to allow more jobs to run when nodes are idle.  Limits may be reduced to their expected values at any time.  If a runtime limit is decreased, any jobs already running will be allowed to finish.

Slurm accounts with no running or pending jobs are not displayed in the output.  All groups, including those with 0 current usage, can be displayed using the --all flag.

Helpful flags for use with slurm-tres-usage:

  • --abs - Display information based on the absolute-maximum limits rather than the runtime limits.  For example, CPU count rather than CPU-minutes.
  • -c - Display detailed information about CPU resources.  
  • -m - Display detailed information about memory resources.
    • -m --gb - Display memory units in GB
    • -m --tb - Display memory units in TB 
  • -g - Display detailed information about GPU resources
  • --hours - Display runtime limit information in hours.  For example, CPU-hours rather than CPU-minutes.
  • --days - Display runtime limit information in days.  For example, CPU-days rather than CPU-minutes.  Due to the scale of the limits involved, days is often a suitable time-unit for viewing limit information.
  • -A <account> - Display usage information for the specified account and its children.  

 

Helpful sample commands:

  • slurm-tres-usage --abs -cmg --gb
    • Display absolute usage for all 3 resource types, and show memory information in GB.
    • This is a good way to see the magnitude of cluster usage.
  • slurm-tres-usage -c --days
    • Display CPU runtime limits, using units of CPU-days.
  • slurm-tres-usage -m --days --gb
    • Display Memory runtime limits using units of GB-days.
  • slurm-tres-usage --abs -g
    • Display absolute GPU usage (i.e., how many GPU cards are being used by each account).

 


Requesting Specific Hardware

Job Constraints

If a job requires specific hardware it can make use of the constraint flag, -C / --constraint, to specify required capabilities.  The constraints which may be used on Nova are:

Constraint     Description
intel          Nodes with Intel processors
amd            Nodes with AMD processors
avx512         Nodes with support for AVX-512 instructions
skylake        Nodes with Intel Skylake processors
icelake        Nodes with Intel Icelake processors
epyc-7502      Nodes with AMD EPYC 7502 processors
epyc-9654      Nodes with AMD EPYC 9654 processors
nova18         Nodes purchased in 2018
nova21         Nodes purchased in 2021
nova22         Nodes purchased in 2022
nova23         Nodes purchased in 2023
nova24         Nodes purchased in 2024

Most jobs will not be sensitive to the hardware they run on.  A job should request a specific constraint when, for example, it uses binaries compiled with processor-specific optimizations.  If a job is using an application compiled for Intel processors, the job should include:

-C intel

or

#SBATCH -C intel
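
In context, the constraint is simply one more directive in the batch script.  A minimal sketch (the resource requests and program name are placeholders):

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 2:00:00
#SBATCH -C intel
# Run a binary built with Intel-specific optimizations
./my_intel_optimized_program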


GPU Resources

Anyone on the cluster can use GPUs.  To request a GPU, specify the number of GPU cards needed.  Use the Slurm script generator for Nova to create the job script.  If an A100 GPU is needed, this should be specified in the --gres option, e.g.:

salloc -N 1 -n 6 --gres gpu:a100:1 -t 1:00:00 -p interactive

To get good performance, one should not request more than 8 cores per A100 GPU card.  Most software does not need more than 5 cores.  If your job does require more than 8 cores per GPU, you can add "--gres-flags=disable-binding" to the job script or the salloc/srun/sbatch command, but be aware that performance may be degraded (see the sketch below).
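
For instance, the relevant sbatch directives for a job that genuinely needs 12 cores alongside a single A100 might look like the sketch below (the core count and time limit are placeholders):

#SBATCH -N 1
#SBATCH -n 12
#SBATCH --gres=gpu:a100:1
# Allow more than 8 cores per A100 at the cost of possible performance loss
#SBATCH --gres-flags=disable-binding
#SBATCH -t 12:00:00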

Note that the A100 nodes in the interactive partition have an even lower limit on the number of CPUs per GPU.  Without disabling the binding, one can only request 6 cores per A100 in the interactive partition.
