Nova

 

The Nova cluster originally consisted of compute nodes with multicore Intel Skylake Xeon processors, 1.5 TB or 11 TB of fast local NVMe storage, and 192 GB, 384 GB, or 3 TB of memory. Five of those nodes also have one or two NVIDIA Tesla V100-32GB GPU cards. In 2021 the cluster was expanded with AMD nodes, each with two 32-core AMD EPYC 7502 processors, 1.5 TB of fast local NVMe storage, and 528GB of memory. The new GPU nodes additionally have four NVIDIA A100 80GB GPU cards.

The three service nodes are the login node, the data transfer node, and the management node.

Large shared storage consists of four file servers and eight JBODs, configured to provide 338 TB of backed-up storage per server.

All nodes and storage are connected via a Mellanox EDR (100 Gbps) InfiniBand switch.

 

Detailed Hardware Specification

Number of Nodes | Processors per Node | Cores per Node | Memory per Node | Interconnect | Local $TMPDIR Disk | Accelerator Card | CPU-Hour Cost Factor
--- | --- | --- | --- | --- | --- | --- | ---
72 | Two 18-Core Intel Skylake 6140 | 36 | 192 GB | 100G IB | 1.5 TB | N/A | 1.0
40 | Two 18-Core Intel Skylake 6140 | 36 | 384 GB | 100G IB | 1.5 TB | N/A | 1.2
28 | Two 24-Core Intel Skylake 8260 | 48 | 384 GB | 100G IB | 1.5 TB | N/A | 1.2
2 | Two 18-Core Intel Skylake 6140 | 36 | 192 GB | 100G IB | 1.5 TB | 2x NVIDIA Tesla V100-32GB | 2.7
1 | Two 18-Core Intel Skylake 6140 | 36 | 192 GB | 100G IB | 1.5 TB | 1x NVIDIA Tesla V100-32GB | 2.7
2 | Two 18-Core Intel Skylake 6140 | 36 | 384 GB | 100G IB | 1.5 TB | 2x NVIDIA Tesla V100-32GB | 3.0
1 | Four 16-Core Intel 6130 | 64 | 3 TB | 100G IB | 11 TB | N/A | 6.2
2 | Four 24-Core Intel 8260 | 96 | 3 TB | 100G IB | 1.5 TB | N/A | 3.0
40 | Two 32-Core AMD EPYC 7502 | 64 | 512 GB | 100G IB | 1.5 TB | N/A |
15 | Two 32-Core AMD EPYC 7502 | 64 | 512 GB | 100G IB | 1.5 TB | 4x NVIDIA A100 80GB |
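As a rough illustration of how the CPU-Hour Cost Factor column might be applied, the sketch below assumes that a job is charged cores × wall-clock hours × cost factor. That formula is an assumption and is not stated in the specification above, and the function name charged_cpu_hours is hypothetical. Only the core counts and cost factors come from the table.

```python
def charged_cpu_hours(cores: int, wall_hours: float, cost_factor: float) -> float:
    """Return the charge in CPU-hours under the ASSUMED formula
    cores * wall-clock hours * cost factor (not confirmed by the source)."""
    if cores < 0 or wall_hours < 0 or cost_factor < 0:
        raise ValueError("cores, wall_hours and cost_factor must be non-negative")
    return cores * wall_hours * cost_factor


# A full 36-core node for 10 hours: a 192 GB node (factor 1.0)
# compared with a 384 GB node (factor 1.2). Factors are from the table.
standard = charged_cpu_hours(36, 10, 1.0)
large_memory = charged_cpu_hours(36, 10, 1.2)
print(f"192 GB node: {standard:.1f}  384 GB node: {large_memory:.1f}")
```

Under this assumption, the same job costs 20% more on a 384 GB node than on a 192 GB node, so requesting only the memory a job needs would keep the charge lower.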

 

 

The HPC group schedules regular maintenance every three months to update system software and to perform other tasks that require downtime.

The date of the next maintenance is listed in the message of the day displayed at login (when ssh-ing to the cluster).

Note: Queued jobs will not start if they cannot complete before the maintenance begins. In the output of the squeue command, the reason shown for those jobs is (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
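The rule in the note can be sketched as a simple time comparison. This is a minimal illustration, assuming a job is eligible to start only if the current time plus its requested time limit does not run past the start of the maintenance window. The function name and the dates are hypothetical, and the scheduler's actual logic may consider more than this.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now: datetime,
                                 time_limit: timedelta,
                                 maintenance_start: datetime) -> bool:
    """Return True if a job started at `now` with the requested `time_limit`
    would finish no later than `maintenance_start` (assumed rule)."""
    return now + time_limit <= maintenance_start


# Hypothetical dates: maintenance begins 24 hours from "now".
maintenance_start = datetime(2030, 1, 15, 8, 0)
now = datetime(2030, 1, 14, 8, 0)

# A 12-hour job fits before the outage; a 48-hour job does not.
print(can_start_before_maintenance(now, timedelta(hours=12), maintenance_start))
print(can_start_before_maintenance(now, timedelta(hours=48), maintenance_start))
```

Under this assumption, a job that really needs less time than it requests may be able to run before the outage if it is submitted with a shorter time limit.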