Partition Configuration and Limits

Partitions

The scheduler on Nova is the Slurm workload manager.  To see the current partition configuration, issue:

scontrol show partitions

Most job submissions do not require specifying a partition, and those jobs will run in the primary nova partition.  Partitions are no longer used to separate out special hardware (e.g., GPU or high-memory nodes).

The current nova partitions are:

  • nova - This is the primary partition.  Jobs in this partition may be time-sliced, meaning that Slurm may schedule two jobs to run on the same CPU resources and alternate which job is actively running.  Both jobs are held in memory, so the suspend/restart operation is very fast.  GPU jobs are not time-sliced.  Time spent in a suspended state does not count against the job's time limit, although its wall-clock time may be up to 2x longer than requested.
  • interactive - This partition is for interactive jobs, such as those launched with salloc or via Nova OnDemand.  To facilitate interactive sessions, this partition does not have time-slicing enabled.
  • reserved - This partition is for jobs which require access to special-purpose nodes, such as those purchased with funding sources that restrict their usage to specific users/groups or specific uses.  Nodes in this partition may be used by any group with access to the partition, but each group is limited to the quantity of resources they have contributed (e.g., if a group buys 12 nodes with a grant and adds them to the reserved partition, their jobs may run on any node in the partition, but they cannot run more than 12 nodes' worth of jobs at once).
  • scavenger - This partition allows users to utilize additional resources.  Users may go beyond their limits on the nova partition as well as leverage idle resources in the reserved partition.  Note: jobs in the scavenger partition may be killed and requeued at ANY time.  Submitting jobs to the scavenger partition requires adding -q scavenger to the salloc command or sbatch file (see the example after this list).
  • instruction - This partition is for class usage.  It contains 11 nodes exclusively for classroom usage (8 compute nodes and 3 nodes with 8x A100 GPUs each) and an additional 8 general-compute nodes.
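
For example, a minimal sbatch script targeting the scavenger partition might look like the sketch below (the node, task, and time requests and the program name are placeholders, and the partition is specified explicitly alongside the QOS):

#!/bin/bash
#SBATCH -p scavenger
#SBATCH -q scavenger
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 1-00:00:00
# The job may be killed and requeued at any time, so the workload should be
# restartable or otherwise tolerant of interruption.
./my_program.sh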

 

Each partition allows jobs up to 31 days long.
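
For example, a job requesting that maximum would use the time flag -t 31-00:00:00 (days-hours:minutes:seconds) in its salloc command or sbatch script.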

To see current partition state, such as how many nodes are idle, issue:

sinfo 


Resource Limits

Besides partition limits, each group is subject to maximum resource limits.  There are two categories of limits: absolute maximums and runtime limits.

Classes utilizing the instruction partition are only limited by resources in the instruction partition.

Groups with access to the reserved partition submit jobs with per-job QOSes which limit their group's resource usage to match the nodes they have placed in the reserved partition (see the example below).
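
For instance, assuming a group's reserved-partition QOS is named mygroup-reserved (a hypothetical name; the actual QOS name is assigned when the nodes are added) and that the partition is specified explicitly, a submission might look like:

sbatch -p reserved -q mygroup-reserved job_script.sh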

Note that jobs are always limited by the available hardware on the cluster.  Just because a group has significant resources set in their limits does not guarantee that the cluster has any particular quantity of idle resources available for immediate usage.


Absolute-Maximum Limits

Each group on the cluster is limited to the following across all of their running jobs:

  • 3600 CPUs
  • 38,000,000 MB (~38 TB) RAM
  • The greater of 2 or 3x the group's depreciated GPU count (see the worked example below).
    • Depreciation of resources is described in the Runtime Limits section.  Fractional values are rounded up to the next integer.
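
As a worked example with hypothetical numbers: a group that purchased 4 GPUs 24 months ago has a depreciated GPU count of 4 * (60 - 24)/60 = 2.4 GPUs, so its absolute-maximum GPU limit is the greater of 2 and 3 * 2.4 = 7.2, which rounds up to 8 GPUs.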


These limits correspond to approximately 25% of the current total Nova compute resources, and are larger than the entirety of the original Nova cluster as-purchased in 2018.


Runtime Limits

Starting in January 2024, Nova uses a new framework for setting runtime limits for each group.  We use the GrpTRESRunMins option in Slurm, which limits the resource-minutes available to a group's running jobs rather than just the absolute resource count.  This means the scheduler looks at the CPU-minutes of a running job rather than just the number of CPUs being used by the job.  Megabyte-minutes and GPU-minutes are also considered.  This allows groups to run larger jobs so long as the time limits of those jobs are sufficiently short and the resources requested by the jobs fall beneath the absolute-maximum limits discussed above.
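
As an illustration, a running job that requested 100 CPUs with a 10-hour time limit represents 100 * 600 = 60,000 CPU-minutes, and that amount must fit within the group's runtime limit alongside its other running jobs.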

Note that this only limits running jobs.  It does not impose a limit on the ability to submit jobs to the queue.

Runtime limits for a group are calculated as the sum of the community allocation granted to every group and the depreciated value of resources purchased by the group.  Purchased resources are depreciated linearly over 5 years, so for the first 5 years after a purchase the group has additional resources available in excess of the community allocation.  At the end of the 5 years the group still has the foundational community allocation available to them.

The community allocation is:

  • 64 CPUs
  • 500,000 MB (~500 GB) RAM
  • 2 GPUs

 

The runtime limits appear large because they are based on having the ability to run a month-long job which uses all of the resources (community allocation + depreciated purchases) available to the group.

For a concrete example, consider the CPU-minutes limit for a group that purchased a 48-CPU node 18 months ago.  The depreciated value of that node is 48 * (60 - 18)/60 = 33.6 CPUs.  Adding in the 64-CPU community allocation brings that up to 97.6 CPUs.  To get our limit, we multiply

97.6 CPUs * 44640 Minutes/month * 1.1 = 4,792,550.4 CPU-minutes

The factor of 1.1 is included to provide headroom to run a few small jobs while the month-long job is running.  This calculated limit is rounded up to the nearest integer and then set in the scheduler.

For this example group with 4,792,551 CPU-minutes available to them, the trade-off between job size and allowable runtime would look like this:

 
Number of CPUs          Allowed Time Limit
1                       31 days (partition time-limit)
4                       31 days (partition time-limit)
8                       31 days (partition time-limit)
...
97                      31 days (partition time-limit)
128                     26 days
256                     13 days
512                     6.5 days
3600 (absolute-max)     0.92 days
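
The values in this table follow from dividing the group's limit by the CPU count and converting minutes to days.  For example, for 256 CPUs the allowed time can be checked on the command line with:

echo "4792551 / 256 / 1440" | bc -l

which gives approximately 13.0 days.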

Note that when running multiple jobs, the CPU count in the first column is shared among them.  For example, if this group can run 512 CPUs for 6.5 days, they can divide that up as:

  • 1x 512-CPU job for 6.5 days
  • 2x 256-CPU jobs, each for 6.5 days
  • 4x 128-CPU jobs, each for 6.5 days
  • 1x 128-CPU job for 13 days, plus 1x 256-CPU job for 6.5 days
  • etc.
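
As a quick check against the example limit above, the second option works out to 2 * 256 CPUs * 6.5 days * 1440 minutes/day = 4,792,320 CPU-minutes, which fits just under the group's 4,792,551 CPU-minute limit.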

 

To query the runtime limits available to your group, run:

sacctmgr show assoc where account=${YOUR_SLURM_ACCOUNT} format=GrpTRESRunMin%60

You will see output like:

cpu=4763088,gres/gpu=44352,mem=36956554272M

Note that the units are CPU-minutes, GPU-minutes, and MB-minutes.
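
As a rough conversion, the CPU figure above corresponds to 4,763,088 / 1440 ≈ 3,308 CPU-days, i.e., roughly 106 CPUs running continuously for the full 31-day partition limit.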

These runtime limits can be applied at each level of the Slurm account hierarchy.  For most groups, this is not an issue.  For groups where a parent entity has purchased resources (e.g., a departmental or college-level purchase for the benefit of all groups within that unit), it is possible that although a research group is not hitting its own runtime limits, it may still be limited by the usage of sibling groups within the higher-level unit.

For example, the fictitious Quantum Studies department purchases resources for its faculty to use and has a runtime limit of 10,000,000 CPU-minutes.  The department says each faculty member may use up to 75% of the departmental limit.  Professor Alice's group has jobs running which are using 6,500,000 CPU-minutes (65% of the departmental limit).  Professor Bob's running jobs are therefore limited to 3,500,000 CPU-minutes (35% of the departmental limit) because Alice's and Bob's combined usage is not allowed to exceed their department's limit.


Checking Resource Usage

Users may check the current resource-usage on Nova using the slurm-tres-usage script on the Nova login node.  Run slurm-tres-usage -h to see all available options.  

When run without arguments, the script returns a tree-structured view of resource usage, shown as a percentage of the runtime limits used by each Slurm account.  There are 3 pairs of columns, one pair each for CPU, memory, and GPU resources.  Each pair shows the percent of the runtime limits used by running jobs and the percent represented by pending jobs.

Note that due to nuances of the scheduler algorithm it is sometimes possible for groups to utilize more than 100% of their runtime limits.  System administrators may also adjust runtime limits from time to time in an effort to better balance system utilization.  This typically takes the form of increasing runtime limits to allow more jobs to run when nodes are idle.  Limits may be reduced to their expected values at any time.  If a runtime limit is decreased, any jobs already running will be allowed to finish.

Slurm accounts with no running or pending jobs are not displayed in the output.  All groups, including those with 0 current usage, can be displayed using the --all flag.

Helpful flags for use with slurm-tres-usage:

  • --abs - Display information based on the absolute-maximum limits rather than the runtime limits.  For example, CPU count rather than CPU-minutes.
  • -c - Display detailed information about CPU resources.  
  • -m - Display detailed information about memory resources.
    • -m --gb - Display memory units in GB
    • -m --tb - Display memory units in TB 
  • -g - Display detailed information about GPU resources
  • --hours - Display runtime limit information in hours.  For example, CPU-hours rather than CPU-minutes.
  • --days - Display runtime limit information in days.  For example, CPU-days rather than CPU-minutes.  Due to the scale of the limits involved, days is often a suitable time-unit for viewing limit information.
  • -A <account> - Display usage information for the specified account and its children.  

 

Helpful sample commands:

  • slurm-tres-usage --abs -cmg --gb
    • Display absolute usage for all 3 resource types, and show memory information in GB.
    • This is a good way to see the magnitude of cluster usage.
  • slurm-tres-usage -c --days
    • Display CPU runtime limits, using units of CPU-days.
  • slurm-tres-usage -m --days --gb
    • Display Memory runtime limits using units of GB-days.
  • slurm-tres-usage --abs -g
    • Display absolute GPU usage (i.e., how many GPU cards are being used by each account).

 


Requesting Specific Hardware

Job Constraints

If a job requires specific hardware it can make use of the constraint flag, -C / --constraint, to specify required capabilities.  The constraints which may be used on Nova are:

Constraint     Description
intel          Nodes with Intel processors
amd            Nodes with AMD processors
avx512         Nodes with support for AVX-512 instructions
skylake        Nodes with Intel Skylake processors
icelake        Nodes with Intel Icelake processors
epyc-7502      Nodes with AMD EPYC 7502 processors
epyc-9654      Nodes with AMD EPYC 9654 processors
nova18         Nodes purchased in 2018
nova21         Nodes purchased in 2021
nova22         Nodes purchased in 2022
nova23         Nodes purchased in 2023
nova24         Nodes purchased in 2024

Most jobs will not be sensitive to the hardware they run on.  A job should request a specific constraint when, for example, it uses binaries compiled with processor-specific optimizations.  If a job is using an application compiled for Intel processors, the job should include:

-C intel

or

#SBATCH -C intel
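
In context, the constraint is simply one more directive in the batch script.  A minimal sketch (the resource requests and program name are placeholders):

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -t 2:00:00
#SBATCH -C intel
# Run a binary built with Intel-specific optimizations
./my_intel_optimized_program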


GPU Resources

Anyone on the cluster can use GPUs.  To request a GPU, specify the number of GPU cards needed.  Use the Slurm script generator for Nova to create the job script.  If an A100 GPU is needed, this should be specified in the --gres option, e.g.:

salloc -N 1 -n 6 --gres gpu:a100:1 -t 1:00:00 -p interactive

To get good performance, one should not request more than 8 cores per A100 GPU card.  Most software does not need more than 5 cores.  If your job does require more than 8 cores per GPU, you can add "--gres-flags=disable-binding" to the job script or the salloc/srun/sbatch command, but be aware that performance may be degraded (see the sketch below).
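
For instance, the relevant sbatch directives for a job that genuinely needs 12 cores alongside a single A100 might look like the sketch below (the core count and time limit are placeholders):

#SBATCH -N 1
#SBATCH -n 12
#SBATCH --gres=gpu:a100:1
# Allow more than 8 cores per A100 at the cost of possible performance loss
#SBATCH --gres-flags=disable-binding
#SBATCH -t 12:00:00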

Note that the A100 nodes in the interactive partition have an even lower limit on the number of CPUs per GPU.  Without disabling the binding, one can only request 6 cores per A100 in the interactive partition.
