Conda Environments

Overview

Conda is a package manager widely used in data science. It allows unprivileged (non-administrative) Linux or macOS users to search for, fetch, install, upgrade, use, and manage supported open-source software packages and programming languages, libraries, and environments (primarily Python and R, but also others such as Perl, Java, and Julia) in a directory they have write access to. Conda also allows users to create reproducible scientific software environments that can be recreated outside of ISU clusters.

Many open-source scientific software packages are available:

The Bioconda channel contains thousands of software packages that are useful for bioinformatics.

Setup

To install and use conda packages, a conda installer is needed. Miniconda is a free minimal installer for conda. On ISU clusters, Miniconda is available via an environment module. Issue "module spider miniconda" to see all available versions. The miniconda2 module includes Python 2, while miniconda3 includes Python 3. Since Python 2 is no longer supported, we recommend loading the miniconda3 module. To begin, start an interactive session on a compute node with the salloc command and then load miniconda3:

salloc #desired options

module load miniconda3

Note: You can see all available versions of miniconda on ISU HPC clusters with:

module spider miniconda

(Optional one-time setup for bioconda users) 

If you plan on installing software primarily from the bioconda channel, before using conda for the first time, you may wish to configure conda per the bioconda documentation to search for software packages in the conda-forge, bioconda, and defaults channels (in that order):

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Otherwise, the conda-forge and then bioconda channels must be specified every time software is installed via conda install or conda create:

conda install -c conda-forge -c bioconda SOFTWARE_PACKAGE1 SOFTWARE_PACKAGE2...

Installing Software

Software can be installed into separate environments (directories) that are managed separately. At least one environment must be created before installing software using the miniconda environment module. 

Some potential locations for conda environments housing conda packages on ISU HPC Clusters include:

  • Home Directory - Note: some Conda packages (with dependencies) can take gigabytes of storage space. Since home directory quotas are low, it is NOT recommended to install conda packages and envs in the home directory.
  • A user-specified directory within the user's work directory - Note: Creating environments in work directories will allow other users of that directory to use the environment with the same source activate command. 

Managing Conda Cache 

The default location for Conda to cache files is the user's home directory, which can rapidly fill and cause issues. This behavior can be changed by setting the pkgs_dirs entry in the .condarc file or setting the CONDA_PKGS_DIRS environment variable. First, to see the current cache directory, issue:

conda info

The package cache entry will display the current package cache directories. The config file entry displays the location of the user .condarc file. Editing/creating the pkgs_dirs entry in the .condarc file will change the cache directory:

#Open the user .condarc file. By default it is located at $HOME/.condarc.
vim /path/to/.condarc

#Edit the file to include the pkgs_dirs entry with desired cache directory:
pkgs_dirs:
  - /path/to/desired/cache/directory
#Save and exit the file

#Use conda info to confirm change:
conda info

Another way to change the cache directory is to set the CONDA_PKGS_DIRS environment variable. To do this, issue:

#Export the variable:
export CONDA_PKGS_DIRS=/path/to/desired/cache/directory

#Confirm change with conda info:
conda info
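
The export above lasts only for the current shell session and assumes the directory already exists. As a minimal sketch, the lines below create the cache directory if it is missing and then export the variable. MY_CONDA_CACHE is a hypothetical variable name used only for illustration, and the fallback path is a placeholder chosen so the snippet runs anywhere; substitute a directory in your work storage.

```shell
# Hypothetical cache location: replace with a directory in your work storage,
# e.g. /work/mydir/my_pkg_cache. The fallback below is only a placeholder.
cache_dir="${MY_CONDA_CACHE:-${TMPDIR:-/tmp}/my_pkg_cache}"

# Create the directory first so that it is certain to exist.
mkdir -p "$cache_dir"

# Point conda at it for this shell session and anything started from it.
export CONDA_PKGS_DIRS="$cache_dir"
echo "CONDA_PKGS_DIRS is now $CONDA_PKGS_DIRS"
```

To apply the setting in every session, the same export line could be added to your shell startup file (for example $HOME/.bashrc).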

Pip Installs in a Conda Environment 

Software that can only be installed with pip may also be installed in a Conda environment by using pip in the environment. While issues can arise, per the Conda guide for using pip in a Conda environment, there are some best practices to follow to reduce their likelihood, namely:

  • Use pip only after conda package installs
  • Use conda environments for isolation (don't perform pip installs in the "root" environment)
  • Recreate the entire environment if changes are needed after pip packages have been installed
  • Use the --no-cache-dir option for pip installation commands to prevent pip filling your home directory with cached data

An install command that follows the last recommendation would look like:

 python3 -m pip install <package> --no-cache-dir
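
The practices above can be partly enforced with a small shell wrapper. The sketch below is one possible approach, not an official tool: safe_pip_install is a hypothetical helper name, and it relies on the CONDA_PREFIX and CONDA_DEFAULT_ENV variables that conda sets when an environment is activated.

```shell
# Sketch of a guard around pip (safe_pip_install is a hypothetical helper name).
# conda sets CONDA_PREFIX and CONDA_DEFAULT_ENV when an environment is activated.
safe_pip_install() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "No conda environment is active; activate one before using pip." >&2
        return 1
    fi
    if [ "${CONDA_DEFAULT_ENV:-}" = "base" ] || [ "${CONDA_DEFAULT_ENV:-}" = "root" ]; then
        echo "Refusing to pip install into the base (root) environment." >&2
        return 1
    fi
    # --no-cache-dir keeps pip from caching downloads in the home directory.
    python3 -m pip install --no-cache-dir "$@"
}
```

With an environment activated, safe_pip_install <package> behaves like the install command shown above; without one, it prints a message and installs nothing.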

Best Practices

Use Interactive Sessions

Use an interactive session on a compute node to install software with Conda to avoid slowing down the login node for everyone.

salloc -n 2 -N 1 -t 180
#Modify salloc options as desired

Use Storage With Large Quotas

Use storage other than the home directory for conda environments and packages. Using your home directory can fill its limited space. See the Managing Conda Cache section for directions on changing the default caching behavior.
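
To see whether conda or pip data is already consuming home directory space, something like the following sketch can help. It assumes the default locations on Linux ($HOME/.conda for conda and $HOME/.cache/pip for pip); conda_home_usage is a hypothetical helper name, and the paths will differ if you have already reconfigured either tool.

```shell
# Report how much space conda and pip data occupy in the home directory.
# These are the default locations on Linux; yours may differ if reconfigured.
conda_home_usage() {
    for dir in "$HOME/.conda/pkgs" "$HOME/.conda/envs" "$HOME/.cache/pip"; do
        if [ -d "$dir" ]; then
            du -sh "$dir"
        else
            echo "$dir: not present"
        fi
    done
}

conda_home_usage
```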

Configure Environment With a .condarc File

The conda configuration file allows customization of conda. The location of the file can be determined by running conda info and checking the config file entry. 

In the Managing Conda Cache section, it was used to configure the location Conda caches downloaded files to avoid filling home directories. It is highly recommended to set that behavior when using ISU HPC clusters. 

Another common use of the .condarc file is setting the channels that will be used to search for packages to install. For example, to add the bioconda and defaults channels a user would run:

conda config --add channels defaults
conda config --add channels bioconda

#NOTE: If any channels are listed in the .condarc file, they override the conda defaults, and conda searches only the channels listed there. That is why defaults was included in this example.
#NOTE: The order in which channels are listed in the file determines the order in which conda searches them.

To specify the directories where conda environments are located (official documentation here), a user can set the envs_dirs key. To set the key, open the .condarc file and add or edit the envs_dirs entry. An example would look like:

envs_dirs:
  - ~/my-envs
  - /path/to/conda/envs 
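
Putting the settings discussed in this section together, a complete .condarc might look like the draft written by the sketch below. All paths are hypothetical placeholders, and the file is written to a scratch location rather than to $HOME/.condarc so that nothing existing is overwritten.

```shell
# Write an example configuration to a draft file for review. All paths are
# hypothetical placeholders; the draft location is chosen only for illustration.
draft="${TMPDIR:-/tmp}/condarc.example"

cat > "$draft" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
pkgs_dirs:
  - /work/mydir/my_pkg_cache
envs_dirs:
  - /work/mydir/envs
EOF

echo "Draft written to $draft; review it before copying it to \$HOME/.condarc."
```

After the file is in place, conda info can be used to confirm that the package cache and environment directories changed as expected.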

To see the many additional configuration options, check the official .condarc user guide here.

Example 1: Installing Trinity into a home directory

Note: Due to the home directory's limited size, it is NOT the recommended installation destination. 

Start an interactive session to prevent overloading the login node:

salloc #desired options

Load the latest miniconda module, if you haven't already, and create an environment called trinityenv:

module load miniconda3

conda create --name trinityenv

Note: the conda create command used above lacks the --prefix option and will thus create the environment in your home directory ($HOME/.conda/envs/), which is not recommended due to home quota sizes.

To activate the environment (and update environment variables such as PATH that are required to use software installed into this environment):

source activate trinityenv

An indicator that the environment successfully activated is that the command prompt will be prepended with the name of the environment within parentheses:

(trinityenv) [username@cluster ~]$

Now that you are inside the trinityenv environment, install software into this environment with:

conda install <package_name> <package_name> <package_name>

For example, install the Trinity transcriptome assembler with:

conda install -c bioconda trinity
...
Proceed ([y]/n?) y
...

After installation, the Trinity executable will be in your PATH. This can be checked with:

type Trinity

To exit the environment:

source deactivate

After deactivating the trinityenv environment, Trinity is no longer in your PATH:

#Test to see if Trinity is in your PATH
type Trinity 

Example 2: Installing Trinity into a /work directory

Load miniconda and then create the environment with the --prefix option to choose a filepath. Activating will require the path to the environment as shown below:

salloc #desired options
module load miniconda3
conda create --prefix /work/mydir/trinityworkenv
#...
source activate /work/mydir/trinityworkenv

Software can then be installed into the environment.  

Note: Conda environments in shared directories have the potential to be used by others with the same source activate command. 

Note: Conda first downloads packages into a package cache directory. By default, the package cache is in your home directory ($HOME/.conda/pkgs). If installing a large amount of software that may cause the home directory quota to be exceeded, you can configure another directory to be the package cache by adding a pkgs_dirs list to the $HOME/.condarc file.

pkgs_dirs:
  - /work/mydir/my_pkg_cache
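
The creation and installation steps of this example can also be collected into a single command. The sketch below is only one way to do it: create_work_env is a hypothetical helper name, the paths are placeholders, and it assumes the miniconda3 module is already loaded in an interactive session.

```shell
# create_work_env is a hypothetical helper; it wraps conda create with the
# --prefix and -c options discussed earlier in this guide.
# Usage: create_work_env /work/mydir/trinityworkenv trinity
create_work_env() {
    if [ "$#" -lt 1 ]; then
        echo "usage: create_work_env ENV_PATH [PACKAGE...]" >&2
        return 2
    fi
    env_path="$1"
    shift
    # Create the environment at the given path and install the listed
    # packages in one step; --yes answers the "Proceed" prompt.
    conda create --yes --prefix "$env_path" -c conda-forge -c bioconda "$@"
}
```

The resulting environment is activated the same way as before, with source activate followed by the environment path.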

Example 3: Using a Conda Environment in an OOD Jupyter Notebook

Conda environments can be used in an Open OnDemand Jupyter Notebook on ISU Clusters. For this example, we will create an environment in a work directory. To begin, start an interactive session to prevent overloading the login node, load miniconda3, and create the environment:

salloc #desired options
module load miniconda3
conda create --prefix /work/mydir/mycondaenv

Next you will need to activate the environment:

source activate /work/mydir/mycondaenv 

Once the environment is activated (indicated by the environment name in parentheses preceding the bash prompt), install ipykernel, a package that provides the IPython kernel for Jupyter:

conda install -c anaconda ipykernel

When ipykernel is installed, the config that Open OnDemand needs can be created in the home directory with:

python3 -m ipykernel install --user --name "mycondaenv"
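
To confirm that the config was written, you can look for the kernel spec file. The sketch below assumes the default per-user Jupyter data location on Linux ($HOME/.local/share/jupyter); the location differs if JUPYTER_DATA_DIR or XDG_DATA_HOME is set. show_kernel_spec is a hypothetical helper name.

```shell
# Print the kernel spec registered under the given name, if one exists.
# Assumes the default per-user location on Linux; it differs if
# JUPYTER_DATA_DIR or XDG_DATA_HOME is set.
show_kernel_spec() {
    kernel_json="$HOME/.local/share/jupyter/kernels/$1/kernel.json"
    if [ -f "$kernel_json" ]; then
        echo "Kernel spec found:"
        cat "$kernel_json"
    else
        echo "No kernel spec at $kernel_json"
    fi
}

show_kernel_spec mycondaenv
```

The python path listed in the file should point inside the conda environment, which is how Jupyter knows which interpreter to start.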

Once the config is created, the environment can be used by:

  • Logging in to the appropriate cluster's Open OnDemand. See guide here
  • Creating a Jupyter Notebook Session
    • Select Interactive Apps
    • Select Jupyter Notebook from the list of apps
    • Select desired compute and partition settings. 
    • Select Launch
    • Once it starts, select Connect to Jupyter
  • In the Jupyter Notebook Launcher, select mycondaenv from the Notebook or Console section. It can also be selected as the kernel when choosing the kernel for a new notebook or console. 

Managing Environments

See the official Conda Documentation for managing environments for a complete list of commands.

To list environments that have been created in your home directory:

conda env list

To list software packages in an environment:

conda list --name env_name

Or, for environments created elsewhere:

conda list --prefix /path/to/env

Tip: For reproducibility, a list of all packages/versions in an environment can be exported to an environment file, which can be used to recreate the environment (e.g., by another user, or on another system) or archived with analysis results. This makes it easy for you or anyone else to re-run your analysis on any system and is also a record of the exact software environment you used for your analysis.
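
As a rough sketch of the export and recreate steps described in this tip, the two helpers below wrap the conda env export and conda env create subcommands. The function names and paths are hypothetical, and the sketch assumes the miniconda3 module is loaded.

```shell
# Hypothetical helper names; both wrap standard conda env subcommands.

# Write the package list of the environment at path $1 to the file $2.
export_env() {
    conda env export --prefix "$1" > "$2"
}

# Build a new environment at path $2 from the environment file $1.
recreate_env() {
    conda env create --file "$1" --prefix "$2"
}

# Example (paths are placeholders):
#   export_env /work/mydir/trinityworkenv environment.yml
#   recreate_env environment.yml /work/otherdir/trinityworkenv
```

Keeping the exported file alongside your analysis results gives you a record of the package versions that were installed at the time.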

To remove an environment in your home directory:

conda env remove --name env_name

To remove an environment located elsewhere:

rm -rf /path/to/env

To remove packages not used by any environment, as well as tarballs downloaded into the conda package cache ($HOME/.conda/pkgs):

conda clean --all