Nova Research Cluster Memorandum of Understanding

Background

Recognizing that High Performance Computing (HPC) is an enabler of discovery and innovation, Iowa State University (ISU) is investing in the HPC@ISU initiative. Its goal is to leverage campus HPC resources and expertise by combining separate activities into a unified HPC effort, ensuring that needed computational resources are available to the campus research community. This sustainable model for HPC is funded through a combination of university and grant support.

We acquired a large research cluster, called CyEnce, through an NSF MRI grant received in 2012. The Condo cluster was acquired in 2015. Nova, acquired in 2018 with an MRI grant and expanded in 2021 with another MRI grant, incorporates the MRI cluster into the condo model. In this model, faculty use grant funds (or departmental funds, startup funds, institutional education support, etc.) to purchase compute nodes in a coordinated all-campus purchase.

University sources (called the base funds) cover the costs associated with space, utilities, racks (to house compute processors and storage), networking, software, and systems support staff. This in effect represents almost a one-to-one match to the funds spent by researchers, departments, and colleges on behalf of their researchers. The base funds are provided through a combination of HPC funds from the VPR office and the colleges, so this approach effectively splits the cost evenly between institutional and faculty sources. The cluster is maintained by central staff. Participants will be able to use computational cycles in proportion to their investment; however, at any given time, researchers may also use idle cycles across the cluster.

 

Equipment and purchase

Research groups will be required to purchase both nodes and storage.  Purchases will be made through ITS.  

 

Compute nodes

  1. Compute nodes will come in three varieties. Each must be purchased in sets as indicated below.
    1. Base thin nodes (4 in a set)
      1. ECC memory (default 512GB)
      2. Two sockets, each with a 2.6GHz 32-core Intel® Xeon® 8358 processor (Ice Lake)
      3. Redundant hot-swappable power supplies
      4. About 1.5TB of local scratch disk space
      5. 5-year warranty
    2. Accelerator nodes (one in a set)
      1. ECC memory (default 512GB)
      2. Two 24-core AMD EPYC 7413 processors
      3. Eight NVIDIA A100 GPUs
      4. About 1.5TB of local scratch disk space
      5. 5-year warranty
    3. Large memory node (one in a set)
      1. ECC memory (2TB or 3TB)
      2. Two sockets, each with a 2.6GHz 32-core Intel® Xeon® 8358 processor (Ice Lake)
      3. About 1.5TB of local scratch disk space
      4. 5-year warranty

Storage

1)      Long-term production storage is purchased in TB increments.  

2)      For each production partition, there is the option of an equivalently sized non-production partition on cheaper storage. Each production partition will be synced nightly to its non-production partition (see the sketch following this list).

3)      In case of hardware or file system failure on a production partition, the files from the non-production partition will be copied back to the production system. This provides access to files which have not changed since the last sync. Though not an enterprise-level backup, it does provide a level of data integrity; it is limited by synchronization transfer rates and utilization. Users should store a copy of any mission-critical data off-site.

4)      Research groups may choose to purchase only production storage, without the equivalent non-production storage. In that case, the local backup described in item 3 will not be provided.

5)      Note: Storage amounts are quoted after ZFS RAID-Z3 is applied, in TB where 1TB = 10^12 bytes. The final usable space will be somewhat smaller than these round figures, as some of the space is used by the file system, and the amount of file-system overhead depends on the pattern of use. Automatic on-the-fly compression is used both to allow more storage and to speed up access to the spinning disks.
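
As a point of reference for items 2 and 3 above, a nightly sync of this kind could resemble the sketch below. This is only an illustration: the paths, the group name, and the rsync-based approach are assumptions, not a description of the actual tooling ITS uses.

    # Illustrative sketch only: mirror a production partition to its
    # non-production partner once a night (e.g., run from cron or a timer).
    import subprocess

    PRODUCTION = "/work/example_group/"        # hypothetical production partition
    NON_PRODUCTION = "/backup/example_group/"  # hypothetical non-production copy

    # --archive preserves file metadata; --delete keeps the copy an exact
    # mirror, which is why only files unchanged since the last sync can be
    # recovered after a failure.
    subprocess.run(
        ["rsync", "--archive", "--delete", PRODUCTION, NON_PRODUCTION],
        check=True,
    )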

Purchasing

6)      Purchases may be made as a collaboration of multiple researchers or grants.

Purchases made with more than one fund account should have a single researcher or staff member in charge of the purchase, and ITS will work with the financial officer for that department.

Hardware Return request

7)      Purchasers of full sets of storage (a storage server plus 2 JBODs totaling about 270TB) may have the storage returned to them should they so wish.

  1. The warranty will no longer apply.  
  2. No infrastructure will be returned.

8)      Purchasers of a complete set of compute resources may have those resources returned to them should they so wish.

  1. The warranty will no longer apply. 
  2. ITS staff will format the hard drives of returned systems to ensure we abide by software licensing restrictions. 

9)      Purchases made with multiple POs may be returned to the researcher in charge.  

End of Warranty

10)      Compute nodes that are out of warranty are moved into the Free Tier.

End of life

11)      Equipment falls under the policy at https://www.policy.iastate.edu/policy/equipment/disposal 

Operation

Shared Resource

12)   This cluster is a shared resource; the cluster infrastructure components are purchased with shared funds. The compute nodes will become part of that shared resource pool, and all of the nodes will be utilized by all contributing research groups. This sharing of resources is fundamental to a well-functioning cluster. Its primary benefit is that it gives researchers access to a larger resource than they could otherwise afford, allowing them to solve larger problems than they otherwise could.

13) To prevent one research group from dominating the cluster, each group is limited to eight times the computing resources purchased by the group or to half of the cluster, whichever is less. That means that for every purchased node, an equivalent of 8 nodes can be used across all running jobs by users in the group, as illustrated in the sketch below.
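
As an example of this cap (the cluster size below is an illustrative assumption, not Nova's actual node count):

    # Hypothetical helper, not part of the cluster software: compute the
    # maximum node-equivalents a group may use at once.
    def group_node_limit(purchased_nodes: int, cluster_nodes: int) -> float:
        # 8x the group's purchase, capped at half of the cluster.
        return min(8 * purchased_nodes, cluster_nodes / 2)

    # A group that purchased 10 nodes on a 120-node cluster may use
    # min(8 * 10, 120 / 2) = 60 node-equivalents across its running jobs.
    print(group_node_limit(10, 120))  # 60.0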

 

Logistics

Connectivity

14)   Access to the cluster will require the use of two-factor authentication. This will provide an additional layer of security to help prevent a system compromise and avoid down time associated with recovery from such an event.

15)   The login node will be for compiling and job submission, not for running long or large programs; such programs should be submitted through the job queuing system. Data transfers will go through a data transfer node to maintain interactive performance on the login node (see the sketch below).
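
The sketch below shows one way a transfer through the data transfer node might look. It is an assumption-laden illustration: the hostname is passed in as an argument because the actual data transfer node hostname and storage paths are not specified in this document.

    # Minimal sketch: copy a local directory to cluster storage via the data
    # transfer node (DTN) using rsync over ssh, rather than the login node.
    import subprocess
    import sys

    def push(dtn_host: str, local_dir: str, remote_dir: str) -> None:
        # rsync -av preserves metadata and lists the files transferred.
        subprocess.run(
            ["rsync", "-av", local_dir, f"{dtn_host}:{remote_dir}"],
            check=True,
        )

    if __name__ == "__main__":
        # e.g.  python push_data.py <dtn-hostname> ./inputs/ /work/mygroup/
        push(sys.argv[1], sys.argv[2], sys.argv[3])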

16)   The cluster will only be accessible from on campus or via the Iowa State VPN.

Queue

17)   A job scheduling system (SLURM) is used to reserve nodes for the exclusive use of the jobs running on them; it also enforces fair-share policies. Other than for some special queues, jobs do not need to be submitted to a specific queue; they will be routed to an appropriate queue based on the resources requested (see the example batch script below).
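
For illustration, a minimal batch script might look like the sketch below. It is written in Python only to keep the examples in this document in one language; sbatch reads the #SBATCH comment directives regardless of the interpreter named in the shebang. The job name and resource requests are assumptions (64 tasks per node matches the two-socket, 32-core-per-socket nodes described earlier), and no partition is specified because the scheduler routes the job based on the requested resources.

    #!/usr/bin/env python3
    #SBATCH --job-name=example_job      # illustrative name
    #SBATCH --nodes=2                   # request two full nodes
    #SBATCH --ntasks-per-node=64        # one task per core (2 x 32 cores)
    #SBATCH --time=02:00:00             # wall-clock limit
    # Submit with:  sbatch example_job.py
    import subprocess

    # Placeholder workload: report the hostnames of the nodes allocated to
    # this job, one line per task launched by srun.
    subprocess.run(["srun", "hostname"], check=True)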

18)   Complete utilization is not possible, particularly without job cancellation policies. Typically, job queuing delays do not become a problem until system utilization reaches about 67%. At that level, some jobs cannot obtain enough resources to start immediately, but they will start later when the system becomes less busy, typically during off-peak night and weekend hours. For this reason, we will consider the system fully loaded at a utilization of 67% of all available node hours.

19)   Usage is managed by the fair-share scheduler in SLURM. A group's fair-share value is based on the amount paid for the compute equipment that is not in the Free Tier. When jobs are waiting, they are assigned a priority, and the jobs with the highest priority run first. Additionally, there are limits on the number of jobs running from a single user to allow for better sharing within a group. Users can check their group's current fair-share standing as sketched below.
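
For reference, SLURM's standard sshare utility reports a group's fair-share factor and recent usage; the sketch below simply wraps that command for the invoking user.

    # Minimal sketch: print the fair-share table for the current user.
    # sshare is the standard reporting tool that ships with SLURM.
    import getpass
    import subprocess

    # -l adds the normalized usage and fair-share columns; -u restricts the
    # output to the named user's associations.
    subprocess.run(["sshare", "-l", "-u", getpass.getuser()], check=True)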

20)   Some grants require that nodes be available at any time for grant projects. These nodes are placed in restrictive partitions and are made available to other users on the cluster through a scavenger partition, which must be explicitly specified when submitting a job (see the sketch below). Jobs in the scavenger partition do not count against a group's usage, but they can be killed if the nodes are needed in the restrictive partitions.
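
Submitting to the scavenger partition could look like the sketch below; it assumes the partition is literally named "scavenger" and reuses the hypothetical batch script from the earlier example.

    # Minimal sketch: explicitly direct a job to the scavenger partition.
    import subprocess

    subprocess.run(
        ["sbatch", "--partition=scavenger", "example_job.py"],
        check=True,
    )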

21)   The classroom cluster has been combined with the research cluster. A limited set of nodes is shared between the short classroom partition and the long, suspendible research partition, allowing short classroom jobs to interrupt long research jobs for up to 15 minutes. Additional memory has been purchased for those nodes to keep suspended jobs in memory, minimizing the interruption. This allows for better utilization of cluster resources.

22)   The system is meant for HPC workloads, not for use as a fast workstation. The queues will be geared towards parallel computing using full nodes, and it is recommended that only parallel applications be run on the cluster.

Prioritization

23)   Groups which have used a smaller percentage of their share over the last 90 days (i.e., have not been using the system very much) will have a higher priority. This allows groups with large balances to use more of their balance. Typically, this has no discernible effect on scheduling until utilization exceeds 40%.

24)   Jobs from groups which have used their allocation should not impede jobs from groups which have not. To that end, the fair-share algorithm assigns lower priority to jobs from groups which have used their allocation than to jobs from groups which have not. This balances usage over time toward the percentage of the resources each group has provided to the cluster. It should allow groups to use their allocation while still letting the system serve groups which have already used theirs when the system would otherwise be idle. If this proves insufficient, we will either direct such jobs to a low-priority queue or prevent them from running.

Free Tier

25) Nodes out of warranty are placed into the Free Tier, which is open for community use (no cluster equipment purchase is required).

Scratch Disk Space

26)   A parallel filesystem will be created for each job using the 1.5TB NVMe SSDs on each compute node. This filesystem exists only for the duration of the job, so data must be copied into it at the start of the job and copied back out at the end (see the sketch below). It provides a parallel filesystem that scales in size and performance with the number of nodes in the job. The large memory nodes will instead have their own scratch SSD RAID arrays of about 10TB; these are not a parallel filesystem.
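
The stage-in / compute / stage-out pattern this implies might look like the sketch below. The JOB_SCRATCH environment variable, the paths, and the my_app executable are placeholders; the actual location of the per-job parallel filesystem on Nova is not specified here.

    #!/usr/bin/env python3
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=64
    #SBATCH --time=04:00:00
    # Sketch of staging data into and out of the per-job parallel filesystem.
    # Submit with:  sbatch staged_job.py
    import os
    import shutil
    import subprocess

    scratch = os.environ["JOB_SCRATCH"]               # placeholder for the per-job filesystem
    inputs = os.path.expanduser("~/project/inputs")   # data on long-term storage
    results = os.path.expanduser("~/project/results")

    # Stage in: copy input data onto the job's parallel scratch filesystem.
    shutil.copytree(inputs, os.path.join(scratch, "inputs"))

    # Compute: the (hypothetical) application reads from and writes to scratch.
    subprocess.run(
        ["srun", os.path.expanduser("~/project/my_app"),
         os.path.join(scratch, "inputs"), os.path.join(scratch, "results")],
        check=True,
    )

    # Stage out: copy results back before the job (and its filesystem) ends.
    shutil.copytree(os.path.join(scratch, "results"), results, dirs_exist_ok=True)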