Memorandum of Understanding

Nova Research Cluster Memorandum of Understanding

Background

Recognizing that High Performance Computing (HPC) is an enabler of discovery and innovation, ISU is investing in the HPC@ISU initiative. Its goal is to leverage campus HPC resources and expertise by combining separate activities into a unified HPC effort, ensuring that needed computational resources are available to the campus research community. This sustainable model for HPC is funded through a combination of university and grant support.

We acquired a large research cluster, called CyEnce, through an NSF-MRI grant received in 2012. The Condo cluster was acquired in 2015. Nova, acquired in 2018, incorporates the MRI cluster into the condo model. In this model, faculty use grant funds (or departmental funds, startup funds, institutional education support, etc.) to purchase compute nodes in a coordinated all-campus purchase. The university sources (called the base funds) will fund costs associated with space, utilities, racks (to house compute processors and storage), network, software, and systems support staff. The base funds are provided through a combination of HPC funds provided by the VPR office and colleges. In effect, this represents almost a one-to-one match to the funds being spent by researchers, departments, and colleges, so the cost is split about evenly between institutional and faculty sources. The cluster is maintained by the central staff. Participants will be able to use computational cycles in proportion to their investment. In addition, at any given time, researchers will be able to use idle cycles across the cluster.

 

Equipment and purchase

Research groups will be required to purchase both nodes and storage. Purchases will be made through ITS. Purchases can be made now or at any point within the 2019 fiscal year, at fixed negotiated prices, up to the capacity of the machine (about 300 nodes). Additional nodes may be added in the 2020 fiscal year provided there is space available. The same fixed price cannot be guaranteed for fiscal year 2020.

 

Compute nodes

  1. Compute nodes will come in three varieties. Each must be purchased in sets as indicated below.
    1. Base thin nodes (4 in a set) 
      1. ECC memory (3 options)
        1. 192GB using 16GB DIMMs (optimal memory performance, but no expansion)
        2. 192GB using 32GB DIMMs (about 66% of optimal performance, but can expand to 384GB)
        3. 384GB using 32GB DIMMs (optimal performance, no expansion)
      2. Two sockets, each with an 18-core 2.3GHz Intel® Xeon® Gold 6140 processor (Skylake)
      3. Redundant hot-swappable power supply
      4. About 1.5TB of local scratch disk space
      5. 5-year warranty
    2. Accelerator nodes (1 in a set) 
      1. Same as base node with 192GB using 32GB DIMMs, but with larger power supplies capable of powering Volta GPUs
      2. One or two NVIDIA Volta GPUs
      3. About 1.5TB of local scratch disk space
      4. 5-year warranty
    3. Large memory nodes (1 in a set)
      1. ECC memory (2 options)
        1. 1.5TB
        2. 3TB
      2. Four sockets, each with a 16-core 2.1GHz Intel Xeon Gold 6130 processor
      3. About 10TB of local scratch disk space
      4. 5-year warranty

Storage

2)      Long-term production storage is purchased in TB increments. For each production partition, there is an equivalently sized non-production partition on cheaper storage. Each production partition will be synced nightly to its non-production partition.

3)      In case of hardware or file system failure on a production partition, the files from the non-production partition will be copied back to the production system. This will restore access to files as they were at the last sync; files which have not changed since then are recovered intact. Although this is not enterprise-level backup, it does provide a level of data integrity. It is limited by synchronization transfer rates and utilization. Users should store a copy of any mission-critical data off-site.

4)      Note: Storage amounts are quoted after ZFS RAID-Z3 is applied, in TB where 1 TB = 10^12 bytes. Final usable space will be somewhat smaller than these round figures because some of the space is used by the file system itself, and the amount of file system overhead depends on the pattern of use. Automatic on-the-fly compression is used both to allow more storage and to speed up access to spinning disks.
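Because the storage amounts above are quoted in decimal TB, the figure a tool reports in binary units will look smaller even before file system overhead is counted. The following is a minimal sketch of that conversion only. The function name is hypothetical, and the 270 TB figure is simply the approximate full storage set size mentioned elsewhere in this memorandum. It does not model ZFS overhead or compression.

```python
# Convert a capacity quoted in decimal terabytes (1 TB = 10**12 bytes)
# into binary tebibytes (1 TiB = 2**40 bytes).
# Illustrative arithmetic only; file system overhead is not modeled.

BYTES_PER_TB = 10**12
BYTES_PER_TIB = 2**40


def tb_to_tib(terabytes: float) -> float:
    """Return the size in TiB of a capacity quoted in decimal TB."""
    return terabytes * BYTES_PER_TB / BYTES_PER_TIB


if __name__ == "__main__":
    # Approximate size of a full storage set (storage server plus 2 JBODs).
    print(f"270 TB = {tb_to_tib(270):.1f} TiB")  # prints "270 TB = 245.6 TiB"
```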

Purchasing

5)      Purchases may be made as a collaboration of researchers or grants.

  1. Purchases made with more than one fund account should have a single researcher or staff member in charge of the purchase, and ITS will work with the financial officer for that department.

Hardware return request

6)      Purchasers of full sets of storage (a storage server plus 2 JBODs totaling about 270TB) may have the storage returned to them should they so wish.

  1. The warranty will no longer apply.  
  2. No infrastructure will be returned.

7)      Purchasers of a complete set of compute resources may have those resources returned to them should they so wish.

  1. The warranty will no longer apply. 
  2. ITS staff will format the hard drives of returned systems to ensure we abide by software licensing restrictions. 

8)      Purchases made with multiple purchase orders (POs) may be returned to the researcher in charge.

End of life

9)      Equipment falls under the policy at http://www.policy.iastate.edu/policy/equipment/disposal 

Operation

Shared Resource

10)   This new cluster is a shared resource: the cluster infrastructure components are purchased with shared funds, and the compute nodes will become part of that shared resource pool. All of the nodes will be available to all contributing research groups. This sharing of resources is fundamental to a well-functioning cluster. The primary benefit of this method is that it gives researchers access to a larger resource than they would otherwise be able to afford, allowing them to solve larger problems than they otherwise could.

Logistics

Connectivity

11)   Access to the cluster will require the use of two-factor authentication. This will provide an additional layer of security to help prevent a system compromise and avoid down time associated with recovery from such an event.

12)   The login node will be for compiling and job submission, not for running long or large programs. Such programs should be submitted to the job queuing system. Data transfer will be through a data transfer node to maintain interactive performance on the login node.

13)   The cluster will be accessible only from on campus or via the Iowa State VPN.

Queue

14)   A job scheduling system (SLURM) will be used to reserve nodes for the exclusive use of the jobs running on them.  It will also enforce fair-sharing policies.  Other than some special queues, jobs do not need to be submitted to a specific queue, but will be routed to an appropriate queue based on the resources requested.

15)   Complete utilization is not possible, particularly without job cancellation policies. Typically, job queuing delays do not present a problem until system utilization reaches about 67%. At this level, some jobs cannot obtain enough resources to start immediately, but will start later when the system becomes less busy, typically during off-peak night and weekend hours. For this reason, we will consider the system to be fully loaded at a utilization of 67% of all available node hours.

16)   Usage is managed by the fair share scheduler in SLURM. Larger purchases increase a group’s share of the resource.  When jobs are waiting, they are assigned a priority to run, and the jobs with highest priority run first.  Additionally, there are limits on the jobs running from a single user to allow for better sharing within a group.

17)   The system is meant for high-performance computing, not as a fast workstation. The queues will be geared towards parallel computing using full nodes. It is recommended that only parallel applications be run on the cluster.
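To make the "fully loaded" threshold in item 15 concrete, the following minimal sketch works out the node hours it implies. It assumes a machine of about 300 nodes (the stated capacity) and, as a simplifying assumption not spelled out in this memorandum, that a group's share is proportional to the number of nodes it purchased. The function names and the 30-day window are hypothetical.

```python
# Node-hour arithmetic for the 67% "fully loaded" threshold in item 15.
# Assumption: a group's share is proportional to the nodes it purchased.

HOURS_PER_DAY = 24
FULL_LOAD_FRACTION = 0.67  # utilization at which the system counts as fully loaded


def fully_loaded_node_hours(nodes: int, days: int,
                            fraction: float = FULL_LOAD_FRACTION) -> float:
    """Node hours the whole cluster delivers over `days` when fully loaded."""
    return nodes * days * HOURS_PER_DAY * fraction


def group_node_hours(group_nodes: int, total_nodes: int, days: int) -> float:
    """Proportional share of the fully loaded node hours for one group."""
    return fully_loaded_node_hours(total_nodes, days) * group_nodes / total_nodes


if __name__ == "__main__":
    total = 300  # approximate capacity of the machine
    print(f"Cluster, 30 days: {fully_loaded_node_hours(total, 30):,.0f} node hours")
    print(f"4-node group, 30 days: {group_node_hours(4, total, 30):,.1f} node hours")
```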

Prioritization

18)   Groups which have used a smaller percentage of their share over the last 90 days (i.e. have not been using the system very much) will have a higher priority.  This will allow groups with large balances to use more of their balance.  Typically, this has no discernible effect on scheduling until utilization exceeds 40%.

19)   Jobs from groups which have used their allocation should not impede jobs from groups which have not. To that end, the fair share algorithm assigns lower priority to jobs from groups which have used their allocation than to jobs from groups which have not. Over time, this balances usage toward the percentage of the resources that each group has provided to the cluster. This should allow groups to use their allocation, and allow the system to provide service for groups which have used their allocation when the system is idle. If this is insufficient, we will either direct such jobs to a low priority queue or prevent such jobs from running.
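The prioritization described in items 18 and 19 can be illustrated with a small sketch. It uses a factor of the form 2^(-(usage/share)/damping), modelled on the classic Slurm fair-share formula; the memorandum does not state which fair-share algorithm or parameters the cluster actually uses, so the formula choice, the group names, and all numbers here are assumptions for illustration only. Usage is taken as each group's fraction of recent cluster usage, and share as its fraction of the purchased nodes.

```python
# Illustrative fair-share ranking: groups that have used less of their
# share get a higher factor and therefore run first.
# Assumption: factor = 2 ** (-(usage / share) / damping); the cluster's
# real scheduler configuration is not specified in this document.


def fair_share_factor(usage_fraction: float, share_fraction: float,
                      damping: float = 1.0) -> float:
    """Return a value in (0, 1]; 1.0 means no recent usage."""
    if share_fraction <= 0 or damping <= 0:
        raise ValueError("share_fraction and damping must be positive")
    return 2.0 ** (-(usage_fraction / share_fraction) / damping)


def rank_groups(groups: dict[str, tuple[float, float]]) -> list[str]:
    """Order group names from highest to lowest fair-share factor.

    `groups` maps a name to (usage_fraction, share_fraction).
    """
    return sorted(groups,
                  key=lambda name: fair_share_factor(*groups[name]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical groups: (fraction of recent usage, fraction of nodes purchased)
    groups = {
        "group_a": (0.25, 0.50),  # used half of its share
        "group_b": (0.50, 0.25),  # used twice its share
        "group_c": (0.25, 0.25),  # used exactly its share
    }
    for name in rank_groups(groups):
        print(name, round(fair_share_factor(*groups[name]), 3))
```

With these made-up numbers, the group that has used half of its share ranks first and the group that has used twice its share ranks last, which matches the intent of items 18 and 19.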

Scratch Disk Space

20)   A parallel filesystem will be created for each job using the 1.5TB NVMe SSDs on each compute node. It exists only for the duration of the job, so input data must be copied into it at the start of the job and results copied back out before the job ends. This provides a parallel filesystem which scales in size and performance with the number of nodes in a job. The large memory nodes will instead have their own scratch SSD RAID array of about 10TB, which is not a parallel filesystem.

21)   We do not plan on having a cluster-wide parallel file system outside of that created for each job.
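Because the per-job scratch filesystem in item 20 disappears when the job ends, a job typically follows a stage-in, compute, stage-out pattern. The following is a minimal sketch of that pattern using only the Python standard library. How the scratch location is exposed to a job is not stated in this memorandum, so the `scratch_root` argument, the directory names, and the `uppercase_files` stand-in computation are all hypothetical.

```python
# Stage-in / compute / stage-out sketch for a scratch area that only
# exists during the job. All paths and names are hypothetical.
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_with_scratch(input_dir: Path, result_dir: Path, scratch_root: Path,
                     compute: Callable[[Path, Path], None]) -> None:
    """Copy inputs to scratch, run `compute` there, copy results back."""
    work_in = scratch_root / "input"
    work_out = scratch_root / "output"
    shutil.copytree(input_dir, work_in)  # stage in (work_in must not exist yet)
    work_out.mkdir()
    compute(work_in, work_out)           # all heavy I/O happens on scratch
    # Stage out before the job ends; dirs_exist_ok requires Python 3.8+.
    shutil.copytree(work_out, result_dir, dirs_exist_ok=True)


def uppercase_files(src: Path, dst: Path) -> None:
    """Stand-in computation: write an upper-cased copy of each text file."""
    for path in src.iterdir():
        if path.is_file():
            (dst / path.name).write_text(path.read_text().upper())


if __name__ == "__main__":
    # Temporary directories stand in for long-term storage and job scratch.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "in").mkdir()
        (base / "in" / "a.txt").write_text("hello")
        (base / "scratch").mkdir()
        run_with_scratch(base / "in", base / "results", base / "scratch",
                         uppercase_files)
        print((base / "results" / "a.txt").read_text())
```

Anything left only on scratch when the job finishes is lost, so the stage-out step belongs inside the job itself rather than in a later, separate step.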