Estimating Memory Demands

It can be challenging to estimate how much memory your job will require before submission. Benchmarking tests for specific applications can provide a guide, but initially it is best to run your job and review the SCC’s job status reports. Each job is allocated virtual memory throughout its runtime; this is the amount of memory the job needs in order to run, and it can be checked with three commands: qstat, top, and qacct. qstat and top let you monitor your jobs’ processes in real time, while qacct provides a full report after a job has finished. Guidelines for submitting batch jobs with large memory requirements are available here.

qstat

qstat is an SGE command that reports the status of jobs submitted to the cluster. To see more details of a specific job running on the cluster, run qstat with the -j job_ID flag, where job_ID is the ID assigned to your job; it can be found with qstat -u userID.

scc % qstat -u userID
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4717015 0.10072 my_job1   userID      r     03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu              8
4717016 0.10072 my_job2   userID      r     03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu              8

scc % qstat -j 4717015
==============================================================
job_number:                 4717015
exec_file:                  job_scripts/4717015
submission_time:            Thu Mar  1 09:34:35 2018
owner:                      userID
...
job_name:                   my_job1
stdout_path_list:           NONE:NONE:/projectnb/scv/userID/scripts/log/
jobshare:                   0
env_list:                   PATH=/projectnb/scv/userID/scripts:/bin:/usr/bin:/usr/local/sbin:/usr/sbi
job_args:                   sub001
script_file:                my_job1.qsub
parallel environment:  omp8 range: 8
verify_suitable_queues:     2
project:                    scv
usage    1:                 cpu=7:58:19, mem=5319.88221 GBs, io=66.36036, vmem=15.690G, maxvmem=15.886G
scheduling info:            (Collecting of scheduler job information is turned off)

The usage 1 line contains maxvmem, which reports the maximum virtual memory the job has used so far. In this example, my_job1 has needed just under 16GB of total memory (maxvmem=15.886G) over roughly its first 8 hours of accumulated CPU time.
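As a rough sizing based on this output, the job runs on 8 slots and peaks just under 16GB, so requesting 2GB per core across 8 cores would cover it. The directives below are an illustrative sketch, not the job’s actual submission script; the project and job names are placeholders:

#!/bin/bash -l
#$ -P my_project
#$ -N my_job1
#$ -l mem_per_core=2G
#$ -pe omp 8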

top

top is a command that shows the active processes on a system. To see your job’s processes on the compute node where it is running, you will need to run top on that node, which you can do by connecting to it with ssh. The compute node running your job can be identified from the qstat -u userID output.

scc % qstat -u userID
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4717015 0.10072 my_job1   userID      r     03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu              8
4717016 0.10072 my_job2   userID      r     03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu              8

scc % ssh -t scc-pf2 'top -u userID'
top - 14:37:07 up 40 days, 16:19,  7 users,  load average: 0.11, 0.21, 0.14
Tasks: 418 total,   2 running, 416 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.3%us,  0.1%sy,  0.0%ni, 98.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132064072k total, 110925644k used, 21138428k free,   358992k buffers
Swap:  8388604k total,    33376k used,  8355228k free, 107315220k cached

   PID USER      PR  NI  VIRT   RES  SHR S %CPU %MEM    TIME+  COMMAND
 37182 userID    20   0 13396  1416  852 R  3.9  0.0   0:00.03 top
 36370 userID    20   0 77648  4756 1080 S  0.0  0.0   0:01.99 sshd
 36510 userID    20   0 10.7g  2.1g  78m S  0.0  1.3   0:22.97 my_matrix1.m
 36510 userID    20   0 10.7g  2.1g  78m S  0.0  1.3   0:22.97 my_matrix2.m
 36510 userID    20   0 10.7g  2.1g  78m S  0.0  1.3   0:22.97 my_matrix3.m
 36510 userID    20   0 10.7g  2.1g  78m S  0.0  1.3   0:22.97 my_matrix4.m
 36371 userID    20   0  9676  1916 1384 S  0.0  0.0   0:00.03 bash
 36475 userID    20   0 30944  5456 2708 S  0.0  0.0   0:00.08 fslwish8.4
 36502 userID    20   0 13432  1232  904 S  0.0  0.0   0:00.05 freeview
 36504 userID    20   0  600m   59m  34m S  0.0  0.0   0:05.91 freeview.bin
 36510 userID    20   0 3872m  412m  78m S  0.0  0.3   0:22.97 MATLAB
 37181 userID    20   0 92872  1840  872 S  0.0  0.0   0:00.00 sshd

Note: In this example the compute node is scc-pf2; replace it with the compute node allocated to your job, which is reported in the ‘queue’ column of the qstat -u userID output.
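If you prefer to extract the host name from the queue column directly, a small pipeline like the following can do it (a sketch that assumes the default qstat column layout and the job name my_job1 from the example above):

scc % qstat -u userID | grep my_job1 | awk '{print $8}' | cut -d@ -f2
scc-pf2.scc.bu.edu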

VIRT and RES report, for each process, the total allocated (virtual) memory and the actual physical (resident) memory, respectively. In this example, four MATLAB scripts are running in parallel: my_matrix1.m, my_matrix2.m, my_matrix3.m, and my_matrix4.m. Each of these processes has been allocated 10.7GB of virtual memory, so you would need to request a minimum of about 44GB for four cores, or 11GB per core:

#!/bin/bash -l
#$ -P my_project
#$ -N my_matlab_job
#$ -l mem_per_core=11G
#$ -pe omp 4
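Assuming these directives are saved in a file such as my_matlab_job.qsub (a hypothetical name), the job can then be submitted with:

scc % qsub my_matlab_job.qsub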

qacct

qacct is an SGE command that reports accounting information for jobs that have finished. To view summaries of your recent jobs, run it with your userID:

scc % qacct -o userID -d 1 -j
==============================================================
qname        p-int
hostname     scc-pi2.scc.bu.edu
group        scv
owner        userID
project      scv
department   defaultdepartment
jobname      my_job
jobnumber    4035924
...
qsub_time    Thu Jan 25 14:45:36 2018
start_time   Thu Jan 25 14:46:15 2018
end_time     Fri Jan 26 02:46:16 2018
granted_pe   NONE
slots        8
...
cpu          202.390
mem          7478.277
io           0.348
iow          0.000
maxvmem      63.953G
...

Note: In this example, -d specifies the number of days of job summaries to view (here, the past day). See man qacct for more details.
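Because the full qacct report is long, it can be convenient to filter it down to the fields used for memory sizing, for example (using the job number from the report above):

scc % qacct -j 4035924 | grep -E 'slots|maxvmem'
slots        8
maxvmem      63.953G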

The slots field reports the number of cores requested for this job and maxvmem reports the maximum virtual memory used by the job. In this example, my_job would need to request 64GB of total memory, or 8GB per core, to run optimally:

#!/bin/bash -l
#$ -P my_project
#$ -N my_job
#$ -l mem_per_core=8G
#$ -pe omp 8
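After resubmitting with the larger request, you can confirm that the scheduler recorded it by checking the running job’s resource list (a sketch; substitute the new job’s ID):

scc % qstat -j job_ID | grep mem_per_core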