Estimating Memory Demands
It can be challenging to estimate how much memory your job will require before you submit it. Benchmarking tests are available for some applications and can serve as a guide, but initially it is best to run your job and review the SCC’s job status reports. Each job is allocated virtual memory throughout its runtime. Virtual memory is the amount of memory the job needs in order to run, and you can check it with three commands: qstat, top, and qacct. qstat and top let you monitor your job’s processes in real time, while qacct gives a full report once a job has finished. Guidelines for submitting batch jobs with large memory requirements are available here.
qstat
qstat is an SGE command that reports the status of jobs submitted to the cluster. To see more details about a specific job running on the cluster, run qstat with the -j job_ID flag, where job_ID is the ID assigned to your job. You can find it with qstat -u userID.
scc % qstat -u userID
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4717015 0.10072 my_job1 userID r 03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu 8
4717016 0.10072 my_job2 userID r 03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu 8
scc % qstat -j 4717015
==============================================================
job_number: 4717015
exec_file: job_scripts/4717015
submission_time: Thu Mar 1 09:34:35 2018
owner: userID
...
job_name: my_job1
stdout_path_list: NONE:NONE:/projectnb/scv/userID/scripts/log/
jobshare: 0
env_list: PATH=/projectnb/scv/userID/scripts:/bin:/usr/bin:/usr/local/sbin:/usr/sbi
job_args: sub001
script_file: my_job1.qsub
parallel environment: omp8 range: 8
verify_suitable_queues: 2
project: scv
usage 1: cpu=7:58:19, mem=5319.88221 GBs, io=66.36036, vmem=15.690G, maxvmem=15.886G
scheduling info: (Collecting of scheduler job information is turned off)
The usage 1 line contains maxvmem, which reports the maximum virtual memory the job has used so far. In this example, my_job1 has peaked at 15.886G after roughly 8 hours of CPU time (cpu=7:58:19), so it requires about 16GB of total memory.
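If you only want the maxvmem figure rather than the whole report, the usage line can be filtered in the shell. The following is a minimal sketch: the function name extract_maxvmem is hypothetical, and the pattern assumes the usage line is formatted as in the listing above.

```shell
#!/bin/bash
# Print the maxvmem value found in `qstat -j` output read from standard input.
# Assumes a usage line of comma-separated key=value pairs, as shown above.
extract_maxvmem() {
    # Keep only the text after "maxvmem=" up to the next comma or whitespace.
    sed -n 's/.*maxvmem=\([^,[:space:]]*\).*/\1/p' | head -n 1
}

# Example use on the cluster (job ID taken from the listing above):
#   qstat -j 4717015 | extract_maxvmem
```

Running this periodically while the job is active shows how the peak grows over time, which helps when deciding how much memory to request for the next submission.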
top
top is a command that shows the active processes on a system. To see your active processes on the compute node your job is running on, you need to run top on that compute node. You can do this by remotely accessing the node with ssh. The compute node running your job can be identified with the qstat -u userID command.
scc % qstat -u userID
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4717015 0.10072 my_job1 userID r 03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu 8
4717016 0.10072 my_job2 userID r 03/01/2018 09:35:08 p8@scc-pf2.scc.bu.edu 8
scc % ssh -t scc-pf2 'top -u userID'
top - 14:37:07 up 40 days, 16:19, 7 users, load average: 0.11, 0.21, 0.14
Tasks: 418 total, 2 running, 416 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.3%us, 0.1%sy, 0.0%ni, 98.5%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 132064072k total, 110925644k used, 21138428k free, 358992k buffers
Swap: 8388604k total, 33376k used, 8355228k free, 107315220k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37182 userID 20 0 13396 1416 852 R 3.9 0.0 0:00.03 top
36370 userID 20 0 77648 4756 1080 S 0.0 0.0 0:01.99 sshd
36510 userID 20 0 10.7g 2.1g 78m S 0.0 1.3 0:22.97 my_matrix1.m
36510 userID 20 0 10.7g 2.1g 78m S 0.0 1.3 0:22.97 my_matrix2.m
36510 userID 20 0 10.7g 2.1g 78m S 0.0 1.3 0:22.97 my_matrix3.m
36510 userID 20 0 10.7g 2.1g 78m S 0.0 1.3 0:22.97 my_matrix4.m
36371 userID 20 0 9676 1916 1384 S 0.0 0.0 0:00.03 bash
36475 userID 20 0 30944 5456 2708 S 0.0 0.0 0:00.08 fslwish8.4
36502 userID 20 0 13432 1232 904 S 0.0 0.0 0:00.05 freeview
36504 userID 20 0 600m 59m 34m S 0.0 0.0 0:05.91 freeview.bin
36510 userID 20 0 3872m 412m 78m S 0.0 0.3 0:22.97 MATLAB
37181 userID 20 0 92872 1840 872 S 0.0 0.0 0:00.00 sshd
Note: In this example, the compute node is scc-pf2, which will need to be changed to the compute node allocated to your job. This is reported in the ‘queue’ column of the qstat -u userID output.
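The node name can also be derived from the queue column in a script. This is a minimal sketch, assuming the queue entry has the form queue@host.domain as in the listing above. The helper name node_from_queue is hypothetical.

```shell
#!/bin/bash
# Convert a queue entry such as "p8@scc-pf2.scc.bu.edu" into a short host name.
node_from_queue() {
    local queue_instance=$1
    local host=${queue_instance#*@}   # drop everything up to and including "@"
    printf '%s\n' "${host%%.*}"       # drop the domain part, leaving e.g. "scc-pf2"
}

# Example use (replace userID and the queue entry with your own values):
#   ssh -t "$(node_from_queue 'p8@scc-pf2.scc.bu.edu')" 'top -u userID'
```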
VIRT and RES represent the total amount of allocated memory (virtual) and the actual physical memory in use (resident) for each process, respectively. In this example, four MATLAB scripts are running in parallel: my_matrix1.m, my_matrix2.m, my_matrix3.m, and my_matrix4.m. Each of these processes has been allocated 10.7GB of memory. Rounding up to 11GB per process, you would need to request a minimum of 44GB for four cores, or 11GB per core:
#!/bin/bash -l
#$ -P my_project
#$ -N my_matlab_job
#$ -l mem_per_core=11G
#$ -pe omp 4
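The arithmetic behind this request can be sketched in the shell. The helper below is hypothetical (mem_request is not an SCC or SGE command) and assumes one process per core, with every process showing about the same VIRT value in top.

```shell
#!/bin/bash
# Given the VIRT size of one process (in GB) and the number of cores,
# print a per-core request rounded up to a whole GB and the resulting total.
mem_request() {
    awk -v virt="$1" -v cores="$2" 'BEGIN {
        per_core = int(virt)
        if (per_core < virt) per_core++      # round up to the next whole GB
        printf "mem_per_core=%dG total=%dG\n", per_core, per_core * cores
    }'
}

mem_request 10.7 4    # prints: mem_per_core=11G total=44G
```

Rounding up per process, rather than rounding the total, keeps the request consistent with the mem_per_core option, which is specified per core.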
qacct
scc % qacct -o userID -d 1 -j
==============================================================
qname p-int
hostname scc-pi2.scc.bu.edu
group scv
owner userID
project scv
department defaultdepartment
jobname my_job
jobnumber 4035924
...
qsub_time Thu Jan 25 14:45:36 2018
start_time Thu Jan 25 14:46:15 2018
end_time Fri Jan 26 02:46:16 2018
granted_pe NONE
slots 8
...
cpu 202.390
mem 7478.277
io 0.348
iow 0.000
maxvmem 63.953G
...
Note: In this example, -d is the number of days of job summaries you want to view. See man qacct for more details.
The slots field reports the number of cores requested for this job, and maxvmem reports the maximum virtual memory the job used. In this example, my_job would need to request 64GB of total memory, or 8GB per core, to run optimally:
#!/bin/bash -l
#$ -P my_project
#$ -N my_job
#$ -l mem_per_core=8G
#$ -pe omp 8
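The same per-core figure can be computed directly from a qacct report. This is a minimal sketch, not an SCC-provided tool: the function name per_core_from_qacct is hypothetical, and it assumes the input holds a single job record whose maxvmem value carries a "G" suffix, as in the example above.

```shell
#!/bin/bash
# Read `qacct -j` output on standard input and print a per-core memory
# request: maxvmem divided by slots, rounded up to a whole GB.
per_core_from_qacct() {
    awk '
        $1 == "slots"   { slots = $2 + 0 }
        $1 == "maxvmem" { sub(/G$/, "", $2); maxvmem = $2 + 0 }
        END {
            if (slots <= 0) exit 1           # no usable slots line found
            per_core = int(maxvmem / slots)
            if (per_core * slots < maxvmem) per_core++   # round up
            printf "%dG\n", per_core
        }
    '
}

# Example use on the cluster (job number taken from the report above):
#   qacct -j 4035924 | per_core_from_qacct
```

For the report shown above (slots 8, maxvmem 63.953G) this yields 8G, matching the mem_per_core value in the job script.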