There are occasions when using a compute node's local scratch disk is beneficial: it provides additional (temporary) storage, and its physical proximity to the compute node makes I/O operations more efficient. Once the job has completed, you will need to know the name of that particular node in order to go there to view or retrieve files. The following shows you the ways to do that.
To access the local scratch space of the node you are currently on, simply refer to /scratch. The absolute path of a file on a compute node has the form /net/scc-AB#/mydir/myfile, where scc-AB# is the node name, with A being the node type, B the chassis letter, and # the node number; the login nodes are simply named scc1..scc4. For file management purposes, it is recommended that you store files in a user-created sub-directory, /net/scc-AB#/scratch/userID.
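For example, from a login node you can reach a compute node's scratch space through its /net path. The node name scc-fc3 below is purely illustrative (it happens to be the node used in the examples later on this page):

scc1% mkdir -p /net/scc-fc3/scratch/$USER
scc1% ls /net/scc-fc3/scratch/$USER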
Because the nodes assigned to a given batch job are selected at runtime, their names must be derived from the environment variable $PE_HOSTFILE, which becomes available after the batch job has started and nodes have been assigned. Demonstrated below is an example of an MPI batch script that redirects the output of each rank (process) to an output file on the local node's scratch disk.
- MPI batch script
#!/bin/csh
#
# Example OGS script for running mpi jobs
#
# Submit job with the command: qsub script
#
# Note: A line of the form "#$ qsub_option" is interpreted
#       by qsub as if "qsub_option" was passed to qsub on
#       the commandline.
#
# Set the hard runtime (aka wallclock) limit for this job,
# default is 2 hours. Format: -l h_rt=HH:MM:SS
#$ -l h_rt=2:00:00
#
# Invoke the mpi Parallel Environment for N processors.
# There is no default value for N, it must be specified.
#$ -pe mpi_4_tasks_per_node 4

# Merge stderr into the stdout file, to reduce clutter.
#$ -j y

# Request email sent to your BU login ID when the job is completed or aborted
#$ -m ae

## end of qsub options

# By default, the script is executed in the directory from which
# it was submitted with qsub. You might want to change directories
# before invoking mpirun...
# cd somewhere

# $PE_HOSTFILE contains the following info for each node assigned to job:
#   scc-AB#.scc.bu.edu Ncores
# AB is 2-letter designation of assigned node, # is a number (e.g., fc3)
# Ncores is number of cores for node. Total cores = NSLOTS (see below)

# If needed, create folder for local scratch on each node
foreach i (`awk '{print $1}' $PE_HOSTFILE | sed 's/\.scc\.bu\.edu//'`)
   printf "\n ***** $i *****\n"
   mkdir -p /net/$i/scratch/$USER
end

# MPI_PROGRAM is the name of the MPI executable
set MPI_PROGRAM = local_scratch_example

# Run executable with "mpirun -np $NSLOTS $MPI_PROGRAM arg1 arg2 ..."
# where NSLOTS is set by SGE to N as defined in "-pe Q N"
mpirun -np $NSLOTS $MPI_PROGRAM
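For reference, for the 4-task request above, $PE_HOSTFILE would contain a single line of the form shown below. The node name is illustrative, and the scheduler may append further queue-related columns after the core count:

scc-fc3.scc.bu.edu 4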
- Sample FORTRAN MPI program
On the scc1.bu.edu login node, when you cd to /scratch, you land in the scratch space of that login node. Similarly, when /scratch is referenced inside a batch job, it points to the scratch of the node assigned at runtime. Here is a FORTRAN MPI example (C works in similar fashion) that uses this default convention to write the output of each rank to its compute node's scratch disk. In addition, rank 0 computes the sum of the ranks and writes it to scratch.

      Program local_scratch_example
      implicit none
      include "mpif.h"
      integer p, total, ierr, master, myid, my_int, dest, tag
      character*40 filename
      data master, tag, dest/0, 0 ,0/

c**Starts MPI processes ...
      call MPI_Init(ierr)                             ! starts MPI
      call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)  ! get process id
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)     ! get # procs

C**define and open output file for each rank on local node scratch
      write(filename,"('/scratch/kadin/myoutput.dat.',i1)")myid
      open(unit=11, file=filename,form='formatted',status='unknown')

      my_int = myid    ! result of local proc

C**write local output to its own output file
      write(11,"('Contents of process ',i2,' is ',i8)")myid,my_int

      call MPI_Reduce(my_int, total, 1, MPI_INTEGER, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)

      if(myid .eq. master) then
        write(11,*)'The sum is =', total   ! writes total to master
      endif

      close(11)
      call MPI_Finalize(ierr)              ! MPI finish up ...
      end
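As a rough sketch of how the pieces fit together, the program could be compiled and the job submitted from a login node as shown below. The compiler wrapper (mpif90) and file names are assumptions for illustration; the exact modules and wrappers available on the SCC may differ:

scc1% mpif90 -o local_scratch_example local_scratch_example.f
scc1% qsub batch_script_name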
- OpenMP applications
With trivial changes to the Parallel Environment request (e.g., -pe omp 4 instead of -pe mpi_4_tasks_per_node 4), the above batch script is also applicable to OpenMP-type shared-memory codes; a sketch of the changes follows.
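A minimal sketch of the modified directives and run line, assuming a hypothetical shared-memory executable named my_openmp_program; the rest of the script, including the creation of the local scratch directory, stays the same. Setting the thread count from $NSLOTS is one common convention, not a requirement:

# Request the OpenMP (shared memory) Parallel Environment with 4 cores
#$ -pe omp 4

# Use all assigned slots as OpenMP threads and run the executable directly (no mpirun)
setenv OMP_NUM_THREADS $NSLOTS
./my_openmp_program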
- Ways to find out which node(s) your batch job ran on
The first three methods described below are convenient if you just need to know, after the job has completed, which node was used. The third method is specifically for MPI applications. The fourth method must be used if you need the information during runtime to do something with it, such as creating local directories on the assigned node's scratch storage; for this, you will need to extract the node information from the $PE_HOSTFILE environment variable in your batch script.

- When the job is in the run state, qstat returns the node used for the job:

scc1% qstat -u kadin
1912322 ... kadin     r   [date] [time]   budge@scc-je1.scc.bu.edu     4
You have to write the node name down for future reference — an extra step if you don’t usually log your submitted jobs.
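If you would rather not copy the name by hand, one possible approach, sketched here with the job ID from the example above and a hypothetical log file name, is to append the relevant qstat line to a file while the job is running:

scc1% qstat -u kadin | grep 1912322 >> ~/job_nodes.log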
- The assigned node information is available in the email the batch system sends you. By default, the SCC does not send you email, so you need to add

#$ -m ae

to your batch script for it to send you mail.

- This method is valid for MPI jobs only. While a job of this type is running, and after it has completed, a batch output file with a .poJobID suffix contains the node information. Each line in the file represents one core of a node. For example, if you ask for 4 MPI cores, you will see 4 lines with the same node name (such as scc-fc3). If you request more than 16 cores, you will see more than one distinct node name in the list. Here is an example for a 4-core job:

scc1% more batch_script_name.po837
-catch_rsh /usr/local/...pool/scc-fc3/active_jobs/837.1/pe_hostfile
scc-fc3
scc-fc3
scc-fc3
scc-fc3
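For a larger job spread over several nodes, one quick way to reduce the .po listing to the distinct node names is shown below; based on the 4-core example above it would report the single node scc-fc3 (the job ID 837 is taken from that example):

scc1% grep '^scc-' batch_script_name.po837 | sort -u
scc-fc3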
- This method is valid for omp-type shared-memory multicore jobs. A parallel job of this type produces an empty .po file and hence the above method is not useful. Insert the following in your batch script:

foreach i (`awk '{print $1}' $PE_HOSTFILE | sed 's/\.scc\.bu\.edu//'`)
   printf "\n The node used for job is $i.\n"
end
When the job is done, the batch output file, for example Tseng.o1920719, will contain a line similar to this:

scc1% more Tseng.o1920719
. . . .
The node used for job is scc-fc3.
. . . .
- Access output files on the scratch disks
Using one of the above methods to identify the compute node(s) used for a batch job, you can get to your output files. For the MPI example, method 3 appears the most natural to use. The .po file indicates that scc-fc3 was used to run the job; the 4 recurrences of the same node name correspond to the node's 4 cores.

scc1% cd /net/scc-fc3/scratch
Shown below are the example MPI program's rank 0 and rank 2 output files (all ranks write to the scc-fc3 node's scratch):

scc1% cd /net/scc-fc3/scratch/kadin
scc1% ls
myoutput.dat.0  myoutput.dat.1  myoutput.dat.2  myoutput.dat.3
scc1% more myoutput.dat.0
Contents of process  0 is        0
The sum is =           6
scc1% more myoutput.dat.2
Contents of process  2 is        2
You can delete files or directories in scratch the same way you handle other files and directories, for example:

scc1% rm -r /net/scc-fc3/scratch/kadin

By default, files and directories residing on a compute node's scratch disk are removed after 30 days.
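Since scratch files are eventually purged, you may want to copy results you need to keep to more permanent storage first; a minimal sketch, with the destination directory chosen purely for illustration:

scc1% mkdir -p ~/results
scc1% cp /net/scc-fc3/scratch/kadin/myoutput.dat.* ~/results/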