FAQ: Frequently Asked Questions

 

 

I would like to get access to the HPC facilities at ISU. How do I do that, and how much will it cost?

Compute time cannot be purchased on the clusters listed at http://www.hpc.iastate.edu/systems. How you obtain access depends on the cluster.

The education cluster hpc-class is available for classes. The instructor of record can request access to hpc-class for themselves and the students on the class list via http://www.hpc.iastate.edu/access. Students who would like to use hpc-class for their graduate research can sign up for a 699 class (Research for Thesis or Dissertation).

Access to the CyStorm cluster is available free of charge. Use the web form at http://www.hpc.iastate.edu/access to request access to CyStorm.

CyEnce can be used only by the PIs on the NSF MRI grant that was used to purchase the CyEnce cluster (Award # 1229081) and by their students. The PIs can request access for their students by sending an email to hpc-help@iastate.edu.

Condo can be used only by those who purchased nodes and storage on the Condo cluster. These researchers can request access for their students and colleagues by sending an email to hpc-help@iastate.edu. If your group would like to purchase nodes on the Condo cluster, submit the form at https://hpc-ga.its.iastate.edu/condo_purchase. Some departments (LAS, Mathematics and Statistics) have purchased nodes and storage for their researchers. To obtain an account on those nodes:

  • For LAS, send an email to researchit@iastate.edu
  • For Mathematics, send an email to Cliff Bergman
  • For Statistics, send an email to Michael Brekke

 

 

I need help using clusters. Who do I contact?

 

If you need help, please send an email to hpc-help@iastate.edu. In the email, specify the name of the cluster. When sending email from a non-iastate address, also specify your ISU NetID. If applicable, add the following information:

  • The command you ran (qsub ...)
  • The directory you ran it from
  • The job ID
  • Any output you received
  • Any additional information that would help us help you.

 

I bought a new phone or erased Google Authenticator (GA). How can I reset GA?

If you have registered your phone number in the system, you can reset your GA at https://hpc-ga.its.iastate.edu/reset/ . Note that you must be on campus or connected to the ISU VPN to access this page. To register a phone number at which you can receive text messages, run phone-collect.sh on condo2017 for the Condo cluster or on discovery for the CyEnce cluster.

 

What are ssh and scp? Where do I find them?

ssh is a program for logging into a remote machine and for executing commands on a remote machine. To use our clusters you will need to ssh to them. If your computer is running Linux or macOS, you already have the ssh command (open a terminal and type "ssh <name_of_the_cluster>"). If your login name on your computer is different from your login name on the cluster (which is the same as your University NetID), add your NetID to the ssh command: "ssh <NetID>@<name_of_the_cluster>" or "ssh -l <NetID> <name_of_the_cluster>". The scp command is used to copy files between machines on a network. It is also available on Linux and Mac computers.

If you are using a Windows computer, you probably don't have ssh and scp available. In that case you can download the free SSH client PuTTY and the free SFTP/SCP client WinSCP. Some people use the SFTP client FileZilla, which is also available for macOS. Both WinSCP and FileZilla have a graphical interface.

 

FileZilla does not ask me for a verification code. How do I connect to my account on a cluster that uses Google Authenticator?

In FileZilla, click on the "File" menu and choose "Site Manager". In the Site Manager window click on "New Site", enter the Host (e.g. cystorm.its.iastate.edu), set Protocol to SFTP, set Logon Type to Interactive, and type your NetID in the User field. You will also want to open the "Transfer Settings" tab, check "Limit number of simultaneous connections" and set the "maximum number of connections" to "1". If you do not do this, you will be prompted for your verification code and password many times. Click on Connect and a small window will open, showing a message from the cluster prompting for the verification code. In the Password field enter the 6-digit number generated by the GA running on your mobile device. Next you will be prompted for your password.

 

What is the queue structure? Why does my job sit in the queue instead of starting to run?

When you submit a job with the qsub command, it goes into a routing queue, which places the job into the appropriate queue based on the number of nodes and the walltime requested. To see the current queue structure on a specific cluster, issue 'qstat -q'. The Lm column shows the maximum number of jobs that can run simultaneously in each queue. Queues may also have other limits on the number of jobs per user and per group. The ' qmgr -c "p s" ' command provides full information about the queues.

Sometimes you may see more jobs running in a queue than this limit. The extra jobs are started by the metascheduler to maximize system utilization when the system is relatively idle. For more details about the queue structure and the metascheduler, refer to the "Queues and Scheduling Structure" section of the user guides for specific clusters, available at http://www.hpc.iastate.edu/guides .

 

I need a specific software package to be installed. What do I do?

First, check whether the software is already installed and can be loaded as a module. To do so, issue the 'module avail' command. To load a specific module, issue 'module load <module_name>'. You can also unload, swap and purge modules. See the output of 'module --help' or 'man module' for additional information.

Group-specific software should be installed in the group working directory, where all members of the group have access. Most software packages don't need superuser privileges to be installed. These generic instructions work in many cases:

  • use 'wget URL_TO_TAR_FILE' to download source code (e.g. wget https://julialang.s3.amazonaws.com/bin/linux/x64/0.4/julia-0.4.5-linux-x86_64.tar.gz)
  • use 'tar xvzf your_tar_file' to unpack *.tar.gz file that you downloaded, usually a new directory will appear in the current directory
  • cd to the unpacked directory, and run './configure --prefix=/your_work_directory' (software will be installed in this directory)
  • issue 'make' to compile software
  • issue 'make install' to install compiled code in the directory specified in the --prefix option on the configure command
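
The unpack, configure, compile and install steps above can be collected into one helper. This is only a sketch: the function name build_from_tarball is hypothetical, and it assumes the package follows the common ./configure, make, make install convention and unpacks into a single top-level directory, which not every package does.

```shell
#!/bin/bash
# build_from_tarball TARBALL PREFIX
# Hypothetical helper: unpack TARBALL, then configure, compile and
# install it under PREFIX (for example your group working directory).
build_from_tarball() {
    local tarball="$1" prefix="$2"
    local build_dir top

    # Unpack into a scratch directory so the current directory stays clean.
    build_dir=$(mktemp -d) || return 1
    tar xzf "$tarball" -C "$build_dir" || return 1

    # Assumption: the tarball unpacks into one top-level directory.
    top=$(ls "$build_dir" | head -n 1)

    # Run the build in a subshell so the caller's directory is unchanged.
    (
        cd "$build_dir/$top" &&
        ./configure --prefix="$prefix" &&
        make &&
        make install
    )
}

# Typical use after downloading the source with wget:
#   build_from_tarball your_tar_file /your_work_directory
```

The function returns a non-zero status as soon as one step fails, so a failed configure does not go on to run make.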

If you need help installing software, send an email to hpc-help@iastate.edu.

If you're part of LAS, you may send a request to researchit@iastate.edu, and the LAS IT staff may install your software package as a module.

 

How can I use Matlab on a cluster?

Matlab is installed on all clusters. To use it you need to load the matlab module. Since we don't have the Distributed Compute Manager license, users can run Matlab on only one node, not across multiple nodes. Remember NOT to run Matlab on the head node. Instead, generate a job script using the scriptwriter for the cluster where you want to run Matlab (see the appropriate User Guide at http://www.hpc.iastate.edu/guides). In the script add the following commands:

module load matlab/matlab-R2015a

matlab  -nodisplay -nosplash -nodesktop < program.m > matlab_output.log

Replace program.m with the name of your Matlab program. The output of the matlab command will be redirected to the file matlab_output.log. If a different version of Matlab is needed, replace matlab-R2015a with the appropriate matlab module name.
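
For orientation, a complete job script built around these two commands might look like the sketch below. It assumes a qsub-based (PBS/Torque) scheduler; the resource requests (1 node, 16 cores, 2 hours) and the file name matlab_job.sh are placeholders, and the scriptwriter for your cluster remains the authoritative source for the header lines.

```shell
#!/bin/bash
# Write a single-node Matlab job script to matlab_job.sh.
# The #PBS values below are illustrative placeholders only.
cat > matlab_job.sh <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=2:00:00

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

module load matlab/matlab-R2015a
matlab -nodisplay -nosplash -nodesktop < program.m > matlab_output.log
EOF

# Then, on the head node of the cluster, submit it with:
#   qsub matlab_job.sh
```

Only the qsub command runs on the head node; Matlab itself starts on the compute node assigned to the job.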

 

How can I use R on a cluster?

To use the R that is installed on the clusters, you may need to load an R module. Remember NOT to run R on the head node. Instead, generate a job script using the scriptwriter for the cluster where you want to run R (see the appropriate User Guide at http://www.hpc.iastate.edu/guides). In the script add the following commands:

module load LAS/R

R CMD BATCH --no-save program.R R_output.log

Replace program.R with the name of your R program. The output of the R command will be written to the file R_output.log. The "module load" command above sets up the environment for the default version of R. If a different version of R is needed, replace LAS/R with the appropriate R module name. To see the available modules, issue "module avail".

 

I need an R package to be installed. What do I do?

Most R packages can be installed in your home directory, which is mounted on all nodes of a cluster. To use the R that is installed on the clusters, you may need to load an R module:

module load LAS/R/3.2.3

To run your R programs, follow the instructions under "How can I use R on a cluster?" and remember NOT to run R on the head node. To install an R package, however, issue the R command on the head node:

R

Within R issue the following command:

install.packages("name_of_the_package", repos="http://cran.r-project.org")

(replace name_of_the_package in the command above with the name of your package). R will issue a warning that "/shared/software/LAS/R/3.2.3/lib64" is not writable and ask whether you would like to use a personal library instead. Reply "y" to this and to the following question on whether you would like to create a personal library ~/R/x86_64-pc-linux-gnu-library/3.2 .

If you want to use a location other than the default, for example ~/local/R_libs/, you need to create the directory first:

mkdir -p ~/local/R_libs

Then type the following command inside R:

install.packages("name_of_the_package", repos="http://cran.r-project.org", lib="~/local/R_libs/")

To use libraries installed in this non-default location, create a file .Renviron in your home directory and add the following line to the file (lines in .Renviron have the form name=value, without "export"):

R_LIBS=~/local/R_libs/

To see the directories where R searches for libraries, use the following R command:

.libPaths();

I need to use Python on a cluster

Each cluster has at least one and usually several versions of Python installed. To see which ones are available, use:

module avail

Then load the module for the version you want to use. If you don't need a particular version, often

module load python

is enough. After loading the module, in many of them (but not all) you can issue

pip list

to see a list of the Python packages that are installed in that module. Unfortunately, each Python module has its own separate package list. Some were added for specific needs and don't even include pip.

 

I need a Python package that isn't installed. What do I do?

With easy_install you can do:

easy_install --prefix=$HOME/local package_name

This will install into $HOME/local/lib/pythonX.Y/site-packages (the 'local' folder is a name many people use, but you may specify any folder you have permission to write into).

You will need to manually create $HOME/local/lib/pythonX.Y/site-packages and add it to your PYTHONPATH environment variable; otherwise easy_install will complain. Run the command above once to find the correct value for X.Y.

To add the directory to your PYTHONPATH, use:

export PYTHONPATH=$PYTHONPATH:$HOME/local/lib/pythonX.Y/site-packages
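
The directory creation and the PYTHONPATH update can be combined in a small shell function. This is a sketch under assumptions: the function name add_user_site is hypothetical, and the version string passed to it (2.7 in the usage line) is only an example that you replace with the X.Y reported by easy_install.

```shell
#!/bin/bash
# add_user_site PREFIX X.Y
# Hypothetical helper: create PREFIX/lib/pythonX.Y/site-packages and
# append it to PYTHONPATH for the current shell session.
add_user_site() {
    local prefix="$1" version="$2"
    local site="$prefix/lib/python$version/site-packages"

    mkdir -p "$site"

    # ${PYTHONPATH:+...} adds the ':' separator only when PYTHONPATH
    # already has a value, which avoids a stray empty entry.
    export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site"
}

# Example (the version number is an assumption):
#   add_user_site "$HOME/local" 2.7
#   easy_install --prefix=$HOME/local package_name
```

The setting lasts only for the current session; to make it permanent, the same export line can be placed in the .bashrc file in your home directory.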

How can I move files between the cluster and my personal computer or another cluster?

There are several ways to transfer files. The user guides for specific clusters (see http://www.hpc.iastate.edu/guides) describe how to use scp in the section "How to logon and transfer files". On Condo and CyEnce, the data transfer nodes (condodtn and stream, respectively) should be used to transfer large files or many files. Remember to copy large amounts of data directly to your group working directory, not to your home directory.

The file storage service MyFiles (see https://www.it.iastate.edu/services/storage/myfiles) can also be used to transfer files. MyFiles is mounted on the hpc-class, CyStorm, CyEnce and Condo clusters.

 

How can I access MyFiles from the clusters?

Log in to the data transfer node of the cluster (cystormdtn on CyStorm, condodtn on Condo, stream on CyEnce), issue 'kinit' and enter your university password. On hpc-class you don't need to log in to a separate node to access MyFiles. After entering your password, you will be able to cd to /myfiles/<deptshare>, where <deptshare> is your departmental share name; this is usually the short name of your department or college, such as engr or las.

 

My compile or make commands on the head node fail with a segmentation fault. What should I do?

It is possible that the system resource limits on the head node are too strict to compile your programs. Try logging in to the data transfer node (condodtn or stream) and issuing your commands there. If that does not help, send an email to hpc-help@iastate.edu.

 

How do I make specific modules be automatically loaded on the compute nodes assigned to my job?

There are several ways to do this. You can include module commands in your job script. If you would like a specific set of modules to be loaded every time you log in to a cluster, as well as on the compute nodes, add the necessary module commands to the .bashrc file in your home directory. If you already have the necessary modules loaded and environment variables set, the -V option of the qsub command exports your entire current environment to the batch job.

 

I get "module: command not found" error when trying to run my job. Are modules installed on the clusters?

Yes, modules are installed on all clusters. If you see the list of available modules when issuing 'module avail' on the head node, but the 'module load' command in the job script fails with the message "module: command not found", your default shell may not be bash. To see the current shell, issue 'echo $0'. If your default shell is tcsh or csh, you can either change it to bash or replace '#!/bin/bash' in your job script with '#!/bin/csh'. To change the default shell for hpc-class, go to https://asw.iastate.edu/cgi-bin/acropolis/user/shell.

 

What do I do when the job runtime exceeds max queue time?

Option 1: Get the answers faster:

  • Use the fastest library routines. E.g. if dense linear routines are used in the code to solve systems of linear equations, a large increase in speed may be possible by linking with the vendor supplied routines. Link with the MKL library rather than non-optimized libraries.
  • Change to a more efficient algorithm. This is the best since you get your answers quicker. HPC group can help you with numerical aspects and some algorithm choices, but you would need to supply the modeling knowledge.
  • Go parallel. The program can be rewritten to use MPI. This often takes a long time but usually gives the best performance. The program can also be modified with OpenMP directives to perform portions of the program in parallel, and compiled to use all cores in a single node. The speedup is limited to the number of cores, and if not done well can even slow down a program.

Option 2: Use checkpointing.

Major production codes are checkpointed. In checkpointing, you periodically save the state of the program in a restart file. Whenever you run your program, it reads the restart file to pick up from the last checkpoint. The advantage of this is that there is no limit on the total amount of time you can use. Barring disk crashes or total loss of the machine, your total runtime is unlimited: you just keep submitting the same job, and each run starts from where the previous one left off. There is overhead associated with each checkpoint, and any time executed after the last checkpoint is lost whenever the job is stopped.
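
The restart-file pattern can be illustrated with a toy example. In the sketch below the "computation" is just a running sum, and the function name, its arguments and the restart file layout are all hypothetical; a real code would save its own state (arrays, time step, random seeds) in whatever format it can read back. The third argument stands in for the queue time limit by stopping the run after a fixed number of steps.

```shell
#!/bin/bash
# run_with_checkpoint RESTART_FILE TOTAL_STEPS STEPS_PER_JOB
# Toy model of a checkpointed program: the state is a step counter
# and a running sum. All names are illustrative assumptions.
run_with_checkpoint() {
    local restart_file="$1" total="$2" per_job="$3"
    local step=0 sum=0 steps_run=0

    # Resume from the last checkpoint if a restart file exists.
    if [ -f "$restart_file" ]; then
        read -r step sum < "$restart_file"
    fi

    while [ "$step" -lt "$total" ] && [ "$steps_run" -lt "$per_job" ]; do
        step=$(( step + 1 ))
        sum=$(( sum + step ))        # stand-in for the real computation
        steps_run=$(( steps_run + 1 ))

        # Write to a temporary file and rename it, so a job killed
        # mid-write never leaves a truncated restart file behind.
        echo "$step $sum" > "$restart_file.tmp"
        mv "$restart_file.tmp" "$restart_file"
    done

    echo "step=$step sum=$sum"
}

# Three "job submissions" of at most 4 steps each finish a 10-step run:
#   run_with_checkpoint restart.dat 10 4   # prints step=4 sum=10
#   run_with_checkpoint restart.dat 10 4   # prints step=8 sum=36
#   run_with_checkpoint restart.dat 10 4   # prints step=10 sum=55
```

This toy checkpoints after every step; a real program would checkpoint less often to balance the overhead of each checkpoint against the work lost when the job is stopped.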

 

How can I find my account balance? What happens when my account balance goes negative?

Condo and CyEnce

Your group account balance is printed each time you log in. The monthly report can also be found in your group working directory, /work/<group_name>/monthly_balance .

You might see your account balance heading toward zero quickly if you are using large numbers of processors per job or running large numbers of jobs. Your group account balance should only go negative if your group has used more computer time than the amount of compute time represented by the compute nodes that your group purchased. On Condo the monthly amount is 1/12 of the number of hours in a year times the number of compute nodes your group purchased. On CyEnce the six major PI groups had 15% of node-hours each, except for bioinformatics, which has 25%. Some PI groups stayed together, others asked for separate groups, and the percentages were divided proportionately.
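
As a worked example of the Condo rule, the monthly amount can be computed as follows. This is a sketch: it assumes a 365-day year (8760 hours) and expresses the result in node-hours, and the function name is hypothetical; the figures printed at login are the authoritative ones.

```shell
#!/bin/bash
# monthly_node_hours NODES
# Print 1/12 of the hours in a year times the number of purchased nodes.
# Assumes a 365-day year; the function name is illustrative only.
monthly_node_hours() {
    local nodes="$1"
    local hours_per_year=$(( 365 * 24 ))    # 8760
    echo $(( hours_per_year * nodes / 12 ))
}

monthly_node_hours 4    # a group that purchased 4 nodes: prints 2920
```

Each purchased node therefore contributes 730 hours per month under these assumptions.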

The policy is implemented as follows:

Groups that have a negative balance:

  • are allowed to continue to run, but
  • have a lower number of total nodes that they can compete for
  • have lower limits on the number of jobs per user and per group than groups that have a positive account balance.

Groups with positive balances are running under the relaxed scheduling policies approved by the HPC Committee in November 2013, which allowed for increased system utilization.

This means that groups with positive balances are given priority in use of the system. If the system is relatively idle, other policies for short-duration jobs (24 hours or less) are invoked, which allow all groups to use the idle resources. This is explained more fully in: Condo Queues and Scheduling Structure and CyEnce Queues and Scheduling Structure.

 

How do I debug my program?

Programs can appear to have no bugs with some sets of data because all paths through the program may not be executed. To debug your program, it is important to try and test your program with a variety of different data sets so that (hopefully) all paths in your program can be tested for errors.

Some bugs can be found at compile time and/or run time by using compiler debugging options. We suggest using the following debugging options for the Intel compilers. (These compilers are available on all clusters.)

  • The "-g" option produces symbolic debug information that allows one to use a debugger.
  • The "-debug" option turns on many of the debugging capabilities of the compiler.
  • The "-check all" option turns on all run-time error checking for Fortran.
  • The "-check-pointers=rw" option enables checking of all indirect accesses through pointers and all array accesses for Intel's C/C++ compiler but not for Fortran.
  • The "-check-uninit" option enables uninitialized variable checking for Intel's C/C++ compiler (this functionality is included in Intel's Fortran "-check all" option).
  • The "-traceback" option causes the compiler to generate extra information in the object file to provide source code traceback information when a severe error occurs at run time.

Intel’s MPI Checker provides special checking for MPI programs. To use MPI Checker add the “-check-mpi” option to the mpirun command.

Note that the options above will likely slow the program down, so they are normally used only while debugging.

You can also use a debugger to debug a program. On the hpc-class, Condo and CyEnce clusters we recommend using Allinea's DDT debugger. See the instructions on how to use DDT in the Guides section of this site. The following debuggers are also available: Intel's Inspector XE (inspxe-cl and inspxe-gui), GNU's gdb and PGI's pgdbg (PGI is available only on CyEnce and hpc-class).

When a program does not complete and coredumpsize is set to a non-zero value, one or more core files may be produced and placed in the working directory. If the limit on coredumpsize is not large enough to hold the program's memory image, the core file produced will not be usable. To allow useful core files to be produced, enter "unlimit coredumpsize" (csh or tcsh) or "ulimit -c unlimited" (bash). A debugger can then be used to analyze the core file(s).

 

I'm a Windows/Mac user. How do I learn to use Unix?

There are several ways to learn how to use Unix. You can take the Math/ComS/CprE 424X class (Introduction to High Performance Computing) or a workshop organized by the Genome Informatics Facility. You can also take an online class or simply search the Internet. For a quick introduction to Unix, try the following Tutorial. You may also find the following materials useful: Slides from Workshop: Basic UNIX for Biologists and Notes from the How to Use Condo, CyEnce and CyStorm Workshop.

 

I get a "Disk quota exceeded" error message when trying to remove files. What can I do?

It looks like the disk is 100% full. Try replacing one of the files that you want to remove with an almost empty file by issuing the following command: "echo > file", where file is the name of your file. This command overwrites the file with a single newline character, so its size becomes 1 byte. Now the disk has some room, and you should be able to remove all the files that you want without getting the "Disk quota exceeded" error message.
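
The same idea can be wrapped in a small helper. This is a sketch: the function name free_and_remove is hypothetical, and unlike the procedure above it overwrites every file it is given before deleting them, which is harmless because the files are about to be removed anyway.

```shell
#!/bin/bash
# free_and_remove FILE...
# Hypothetical helper: shrink each file to 1 byte, then delete it.
free_and_remove() {
    local f
    for f in "$@"; do
        # Overwriting in place releases the file's disk blocks;
        # echo leaves only a single newline character behind.
        echo > "$f"
    done
    rm -f -- "$@"
}

# Example:
#   free_and_remove old_output.dat old_restart.dat
```

Double-check the file names before running it, since the contents are destroyed even if the final rm were to fail.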

My ssh session dies with the message "Write failed: Broken pipe" after a few minutes of inactivity. How do I fix this?

If you're logging in from a Mac or Linux machine, create a directory called .ssh in your home directory if it does not already exist. In that directory create a file called config with these contents:

Host *
ServerAliveInterval 60

Run "chmod 0600 config" in that directory after creating the file; otherwise ssh may complain about the file's permissions and refuse to use it.
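
These steps can also be scripted. The sketch below assumes a bash shell on your own Mac or Linux machine; the function name setup_keepalive and its optional directory argument are hypothetical. It appends to an existing config file instead of overwriting it, and does nothing if a ServerAliveInterval setting is already present.

```shell
#!/bin/bash
# setup_keepalive [DIR]
# Hypothetical helper: add a keepalive setting to DIR/config.
# DIR defaults to ~/.ssh.
setup_keepalive() {
    local ssh_dir="${1:-$HOME/.ssh}"
    local config="$ssh_dir/config"

    # Create the directory if needed and keep it private.
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"

    # Append the two lines only if no keepalive setting exists yet.
    if ! grep -q '^[[:space:]]*ServerAliveInterval' "$config" 2>/dev/null; then
        printf 'Host *\n    ServerAliveInterval 60\n' >> "$config"
    fi

    # Same effect as running "chmod 0600 config" by hand.
    chmod 600 "$config"
}

# Example:
#   setup_keepalive
```

New ssh sessions pick up the setting automatically; sessions that are already open are not affected.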

If you are using PuTTY on Windows, look in the PuTTY configuration for your host. Under Connection you will see "Seconds between keepalives"; set this to 60.

New to Slurm?  Here are some useful commands.

This page has a summary of many useful Slurm commands: https://rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/