Using the new RSPH cluster (September 2020)
All of the following tips are based on using the terminal on macOS. I believe they will also work on any Linux system. On Windows, you need to install some kind of Unix-like environment, such as Cygwin.
Basic information
Here is some information provided by RSPH IT. Read it carefully, because the system uses a new job scheduler.
If you want to access the cluster from outside the School of Public Health (this includes using a laptop on the Emory wifi), you will need to connect through the Emory VPN.
Login to the RSPH cluster
The address of the RSPH cluster is `clogin01.sph.emory.edu`. The login command is

`ssh -X userid@clogin01.sph.emory.edu`

where `userid` needs to be replaced by your login name (your Emory ID).
I usually create an alias by adding the following line to my `.bash_profile`:

`alias cluster="ssh -X hwu30@clogin01.sph.emory.edu"`

Then I can log in to the cluster by typing `cluster` in the terminal.
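Alternatively, a `Host` entry in your SSH config gives the same shortcut without touching `.bash_profile`. A minimal sketch (the name `cluster` is arbitrary; `ForwardX11 yes` is the config-file equivalent of `-X`):

```
# ~/.ssh/config
Host cluster
    HostName clogin01.sph.emory.edu
    User hwu30
    ForwardX11 yes
```

With this in place, `ssh cluster` logs in, and `scp somefile cluster:~` works as well.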
Password-less logins using SSH
It’s annoying to have to type your password every time you log in or scp to/from the cluster. Fortunately, there is a solution. Follow these steps to set up a password-less login.
- Create public/private keys. First check whether you have `id_rsa` and `id_rsa.pub` in the `.ssh` folder in your home directory. Note it’s a hidden directory, which can be seen by typing `ls -a`. If they exist, skip this step. Otherwise, type `ssh-keygen -t rsa` in the terminal and those files will be generated.
- Set up logins. First copy your public key (`id_rsa.pub`) to the remote host: `scp .ssh/id_rsa.pub userid@clogin01.sph.emory.edu:~`. Now log in to the cluster and cd to the `.ssh` directory. Add the public key from your computer to the end of your `authorized_keys` file and set the correct permissions by typing the following commands at the terminal:

```
cat ../id_rsa.pub >> authorized_keys
chmod 600 authorized_keys
```
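If your machine has `ssh-copy-id` (most Linux distributions include it; on macOS it ships with recent versions or can be installed via Homebrew), it performs the copy and the `authorized_keys` append in one step:

```
# copies your public key and appends it to ~/.ssh/authorized_keys on the cluster
ssh-copy-id userid@clogin01.sph.emory.edu
```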
Transfer data from the old cluster to the new one
Use scp to copy files over. For example, I can use the following script to copy a whole directory. If the files are large, you can also submit a job to do the transfer.
```
#!/bin/bash
#SBATCH --partition=day-long-cpu
scp -r hwu30@hpc4.sph.emory.edu:SourceDir TargetDir
```
For this to run successfully, you also need to set up a password-less login between the new and old clusters (add the `id_rsa.pub` line from the new cluster to the `authorized_keys` file on the old cluster). Otherwise, the system will prompt for a password and the submitted job cannot run.
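As an alternative to `scp`, `rsync` skips files that already exist on the target, so re-submitting the job resumes an interrupted transfer instead of starting over. A sketch, with the same placeholder directory names as above:

```
#!/bin/bash
#SBATCH --partition=day-long-cpu
# -a preserves permissions and timestamps; -v prints each file as it is copied
rsync -av hwu30@hpc4.sph.emory.edu:SourceDir/ TargetDir/
```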
This is also a good opportunity to reorganize your files.
The job scheduler on the new cluster
The new cluster uses SLURM as its job scheduler, instead of the Sun Grid Engine (SGE) used on the old cluster. A few basic SGE commands and their corresponding SLURM commands are:
- `qsub` → `sbatch`
- `qstat` → `squeue`
- `qdel` → `scancel`
- `qlogin` → `srun --pty bash`
For a more comprehensive list, please see this SGE to SLURM conversion page.
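In day-to-day use, the mapping looks like this (the job script name and job ID are placeholders):

```
sbatch myjob.sh      # submit a job script      (SGE: qsub myjob.sh)
squeue -u hwu30      # list your queued jobs    (SGE: qstat)
scancel 12345        # cancel job 12345         (SGE: qdel 12345)
srun --pty bash      # interactive session      (SGE: qlogin)
```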
Environment for bioinformatics group members
We have a group, compbio, created for all members of the bioinformatics group. If you belong to this group you’ll have access to some shared data and software.
Run `groups userid` to check your group memberships. For example, I can see mine:
```
[hwu30@clogin01 ~]$ groups hwu30
hwu30 : hpcusers compbio
```
So I belong to the following groups: `hpcusers` and `compbio`.
By default, all users in the compbio group should be able to see each other’s files (read permission, but not write permission).
Disk space
- We currently have around 150T of storage under the `/projects` mount. Useful commands for checking disk usage are:
  - `df -h`: report disk space usage
  - `du -h --max-depth=1`: report file space usage
The disk space is not as limited as before, but all members still need to manage their disk usage carefully.
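For example, to see how full the mount is and which of your sub-directories take the most space (the path shown is mine; use your own):

```
df -h /projects                                    # overall usage of the mount
du -h --max-depth=1 /projects/compbio/users/hwu30  # size of each sub-directory
```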
Set up your working directories
- All group users should set up their own working directory under `/projects/compbio/users`. For example, my directory is `/projects/compbio/users/hwu30`.
- You can create a symbolic link in your home directory by running the following command (in your home directory): `ln -s /projects/compbio/users/hwu30 projects`. This creates a link named `projects` in my home directory, which in fact points to `/projects/compbio/users/hwu30`.
- Try to be organized in managing your projects. Create directories and sub-directories, as in the sketch below.
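Putting the steps together, a minimal setup could look like the following (replace `hwu30` with your own userid; the sub-directory layout is just one example):

```
mkdir -p /projects/compbio/users/hwu30          # create your working directory
ln -s /projects/compbio/users/hwu30 ~/projects  # symbolic link in your home
# one possible per-project layout
mkdir -p ~/projects/myproject/{data,code,results}
```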
Shared resources
We have shared software, libraries, and data, mostly located at `/projects/compbio`. In particular:
- `/projects/compbio/bin` has a number of often-used binary software tools for genetic/genomic data analysis. You can add the following line to your `.bash_profile` (in your home directory) so that the software installed there can be accessed from anywhere: `PATH=$PATH:/projects/compbio/bin` (see the quick check after this list).
- `/projects/compbio/data` has some useful shared data, including index files for alignment, reference genomes, etc.
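After adding the `PATH` line, open a new terminal (or re-read the profile) and confirm the shared tools are visible:

```
source ~/.bash_profile       # re-read the profile in the current shell
echo $PATH                   # should now include /projects/compbio/bin
ls /projects/compbio/bin     # list the shared tools
```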
Using R
- As of September 2020, the latest R (version 4.0) and Bioconductor (version 3.11) are installed.
- Note that R cannot run on the head node directly. You must first run `module load R` to load the R module, and then run R (see the sketch after this list). Read the HPC Getting Started Guide for details.
- The R library directory for the group is at `/projects/compbio/Rlib`. You need to set up the R library directory by adding the following line to your `.Rprofile` file (note: it’s a hidden text file in your home directory; if you don’t have it, just create one): `.libPaths(c("/projects/compbio/Rlib", .libPaths()))`. After this, run `.libPaths()` in R to make sure you have the correct path.
- To submit an R job to the scheduler, you need to create a `.sh` file and put in some commands. The description in the guide is not accurate. The shell script should look like the following (assuming you want to run `run.R`):

```
#!/bin/bash
#SBATCH --job-name=run.R
#SBATCH --partition=day-long-cpu
module purge
module load R
srun R CMD BATCH --no-save run.R
```

Assuming the script is called `runR.sh`, use `sbatch runR.sh` to submit the job. You can use `squeue` to view the job status.
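Putting the interactive pieces together, one way to get R running on a compute node and verify the library path (a sketch, not from the official guide):

```
srun --pty bash              # get a shell on a compute node
module load R                # make R available
Rscript -e '.libPaths()'     # /projects/compbio/Rlib should be listed first
```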
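And the batch workflow, assuming the script above is saved as `runR.sh`:

```
sbatch runR.sh       # submit the job
squeue -u hwu30      # check its status (use your own userid)
# after it finishes, R CMD BATCH writes the console output to run.Rout
cat run.Rout
```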