Using the new RSPH cluster (September 2020)
All following tips are based on using the terminal on macOS. I believe they will work on any Linux system. For Windows, one needs to install some type of Unix-like environment, such as Cygwin.
Basic information
Here is some information provided by RSPH IT:
Read it carefully because the system uses a new job scheduler.
If you want to access the cluster from outside the School of Public Health (this includes using a laptop on Emory wifi), you will need to connect through the Emory VPN.
Login to the RSPH cluster
The address for the RSPH cluster is clogin01.sph.emory.edu.
The login command is
ssh -X userid@clogin01.sph.emory.edu
Here userid needs to be replaced by your login (your EmoryID).
I usually create an alias by adding the following line to my .bash_profile:
alias cluster="ssh -X hwu30@clogin01.sph.emory.edu"
So I can log in to the cluster by typing cluster in the terminal.
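A convenient alternative to the shell alias (my own suggestion, not part of the IT instructions) is to add a Host entry to your ~/.ssh/config, so that ssh, scp, and rsync all recognize the short name cluster. A minimal sketch, assuming the same user ID as above:
# Append a host entry to ~/.ssh/config (create the file if it does not exist)
cat >> ~/.ssh/config <<'EOF'
Host cluster
    HostName clogin01.sph.emory.edu
    User hwu30
    ForwardX11 yes
EOF
After this, ssh cluster and scp somefile cluster:~ both work.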
Password-less logins using SSH
It’s annoying to have to type in your password every time you log in or scp to/from the cluster. Fortunately there is a solution. Follow these steps to set up a password-less login.
- Create public/private keys. First check whether you have id_rsa and id_rsa.pub in the .ssh folder in your home directory. Note it’s a hidden directory, and can be seen by typing ls -a. If these files exist, skip this step. Otherwise, type ssh-keygen -t rsa in the terminal and those files will be generated.
- Set up logins. First copy your public key (id_rsa.pub) to the remote host: scp .ssh/id_rsa.pub userid@clogin01.sph.emory.edu:~. Now log in to the cluster and cd to the .ssh directory. Add the public key from your computer to the end of your authorized_keys file and set the correct permissions by typing the following commands at the terminal:
cat ../id_rsa.pub >> authorized_keys
chmod 600 authorized_keys
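If your machine has the ssh-copy-id utility (it ships with recent macOS and most Linux distributions), the copy-and-append steps above can be done in one command. A sketch, using the same userid placeholder as above:
# ssh-copy-id appends your public key to ~/.ssh/authorized_keys on the remote
# host and sets the permissions for you
ssh-copy-id userid@clogin01.sph.emory.edu

# Test: this should now log you in without asking for a password
ssh userid@clogin01.sph.emory.edu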
Transfer data from the old cluster to the new one
Use the scp command to copy files over. For example, I can use the following commands to copy a whole directory. If the file sizes are large, you can also submit a job for the file transfer.
#!/bin/bash
#SBATCH --partition=day-long-cpu
scp -r hwu30@hpc4.sph.emory.edu:SourceDir TargetDir
In order for this to run successfully, you also need to set up a password-less login between the new and old clusters (add the id_rsa.pub line from the new cluster to the authorized_keys file on the old cluster). Otherwise, the system will prompt for a password, and the submitted job cannot run.
This is also a good opportunity to reorganize your files.
The job scheduler on the new cluster
The new cluster uses SLURM as the job scheduler, instead of Sun Grid Engine (SGE) on the old cluster. A few basic SGE commands and their corresponding SLURM commands are:
qsub – sbatch
qstat – squeue
qdel – scancel
qlogin – srun --pty bash
For a more comprehensive list, please see this SGE to SLURM conversion page.
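For example, to work interactively on a compute node instead of the head node, and to monitor or cancel jobs (the partition name here is just the one used in the example above; run sinfo to see what is available):
# List available partitions and nodes
sinfo

# Request an interactive shell on a compute node
srun --partition=day-long-cpu --pty bash

# Check your own jobs, and cancel one by its job ID if needed
squeue -u $USER
scancel <jobid>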
Environment for bioinformatics group members
We have a group compbio created for all members of the bioinformatics group. If you belong to this group, you’ll have access to some data and software.
Run groups userid to check your group membership. For example, I can see my group memberships:
[hwu30@hpc4 ~]$ groups hwu30
hwu30 : hpcusers compbio
So I belong to the following groups: hpcusers and compbio.
By default, all users in the compbio group should be able to see each other’s files (read permission, but not write permission).
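If a group member cannot read something of yours, you can grant group read access yourself. A sketch, using my directory from the examples below (replace the path with your own):
# Give the group read access to files and execute (traverse) access to directories
chmod -R g+rX /projects/compbio/users/hwu30

# Check ownership, group, and permissions
ls -l /projects/compbio/users/hwu30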
Disk space
- We currently have around 150T of storage, under the /projects mount. Useful commands for checking disk usage are:
  df -h: report disk space usage
  du -h --max-depth=1: report file space usage
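For example (a sketch using my own directory; substitute your own path):
# Free and used space on the /projects mount
df -h /projects

# Size of each top-level directory under my working directory, largest last
du -h --max-depth=1 /projects/compbio/users/hwu30 | sort -h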
The disk space is not as limited as before, but all members still need to be careful in managing the disk usage.
Set up your working directories
- All group users should set up their own working directory under /projects/compbio/users. For example, my directory is /projects/compbio/users/hwu30.
- You can create a symbolic link in your home directory by running the following command (in your home directory):
ln -s /projects/compbio/users/hwu30 projects
This creates a directory projects in my home, which is in fact /projects/compbio/users/hwu30.
- Try to be organized in managing your projects. Create directories and sub-directories; a minimal sketch follows this list.
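A minimal sketch of setting this up from scratch (the project name and sub-directory layout are just an example, not a required convention; replace hwu30 with your own user ID):
# Create your working directory and link it into your home as "projects"
mkdir -p /projects/compbio/users/hwu30
ln -s /projects/compbio/users/hwu30 ~/projects

# One directory per project, with sub-directories for code, data, and results
mkdir -p ~/projects/myproject/{code,data,results}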
Shared resources
We have shared software/libraries/data, mostly located at /projects/compbio. In particular:
- /projects/compbio/bin has a number of often-used binary software tools for genetic/genomic data analysis. You can add the following line to your .bash_profile (in the home directory), so that the software installed here can be accessed from anywhere (see the sketch after this list):
PATH=$PATH:/projects/compbio/bin
- /projects/compbio/data has some useful shared data, including index files for alignment, reference genomes, etc.
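A quick way to apply and check the PATH change from the first bullet (samtools here is only a hypothetical example of a tool that might be installed in the shared bin directory):
# Append the shared bin directory to PATH for the current session
export PATH=$PATH:/projects/compbio/bin

# Verify the directory is on PATH and that tools there are found
echo $PATH
which samtools   # hypothetical tool name; try any program you expect in /projects/compbio/bin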
Using R
- As of September 2020, the latest R (version 4.0) and Bioconductor (version 3.11) are installed.
- Note that R cannot run on the head node directly. You must first run module load R to load the R module, and then run R. Read the HPC Getting Started Guide for details.
- The R library directory for the group is at /projects/compbio/Rlib. You need to set up the R library directory by adding the following line to your .Rprofile file. Note: it's a hidden text file in your home directory. If you don't have it, just create one:
.libPaths( c("/projects/compbio/Rlib", .libPaths()) )
After this, run .libPaths() in R to make sure you have the correct path.
- To submit an R job to the scheduler, you need to create a .sh script and put some commands in it. The description in the guide is not accurate. The shell script should look like this (assuming you want to submit run.R):
#!/bin/bash
#SBATCH --job-name=run.R
#SBATCH --partition=day-long-cpu
module purge
module load R
srun R CMD BATCH --no-save run.R
Assume the script is called runR.sh; use sbatch runR.sh to submit the job. You can use squeue to view the job status.
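After submitting, a typical check looks like this. R CMD BATCH writes the console output of run.R to run.Rout in the submission directory, and sbatch writes its own log to slurm-<jobid>.out by default:
# Submit the job and note the job ID it prints
sbatch runR.sh

# Monitor your jobs
squeue -u $USER

# Once it finishes, inspect the R output and the SLURM log
cat run.Rout
ls slurm-*.out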