The Shared Computing Cluster (SCC) implements several automatic “process reapers” to enforce policy. These detect and terminate processes or batch jobs that use resources beyond the job request or that make inefficient use of resources. Actions taken by the process reapers are reported to the owner of the impacted process via email. Research Computing is available to assist researchers in optimizing their workflows and batch job specifications.
The Login Node Process Reaper
The SCC Login Nodes are the primary connection point for researchers using the SCC. These nodes can be used for administrative tasks and light work; long-running or high-CPU tasks should be run as a batch job.
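This process reaper limits each process on the login nodes to 15 minutes of CPU time. As the example message states, a process is terminated when its total CPU time is greater than 15 minutes and also greater than 25% of its lifetime.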
Example Message
To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC1 |
The following process, running on SCC1, has been terminated because it exceeded the limits for interactive use. An interactive process is killed if its total CPU time is greater than 15 minutes and greater than 25% of its lifetime. Processes which may exceed these limits should be submitted through the batch system.
See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.
COMMAND      STATE  PID   PPID  TIME(min.)  RATE(%)  SIZE  RSS   START TIME
processname  S      5912  8049  18 + 0      37       2883  2385  05/07 11:23
Please email help@scc.bu.edu for assistance.
|
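The two conditions quoted in the message can be written as a small check. This is only a sketch of the rule as the message words it, not the reaper's actual implementation; the function name and parameters are hypothetical.

```python
def exceeds_login_limits(cpu_minutes, lifetime_minutes,
                         max_cpu_minutes=15.0, max_rate_percent=25.0):
    """Return True if a process meets both conditions quoted in the message.

    Hypothetical helper: the thresholds come from the example message,
    everything else is an illustration.
    """
    if lifetime_minutes <= 0:
        return False
    # RATE(%) is CPU time expressed as a percentage of the process lifetime.
    rate_percent = 100.0 * cpu_minutes / lifetime_minutes
    return cpu_minutes > max_cpu_minutes and rate_percent > max_rate_percent


# 18 CPU minutes at a 37% rate corresponds to a lifetime of about 48.6 minutes.
print(exceeds_login_limits(18, 48.6))  # True: over 15 minutes and over 25%
print(exceeds_login_limits(18, 120))   # False: only 15% of its lifetime
print(exceeds_login_limits(10, 12))    # False: under 15 CPU minutes
```

A process that is mostly waiting for input, such as an editor left open for hours, accumulates little CPU time and stays under both limits.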
The CPU Limit Process Reaper
Compute nodes should only run processes associated with jobs, and jobs should use only the resources requested by the job submission. You can learn about process/slot requests on our Submitting Batch Jobs page. This process reaper terminates processes that are not associated with a job (e.g., processes started by connecting directly to a compute node over SSH) and jobs that use more processors than requested.
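One common way to stay within the requested processors is to cap a program's thread count at the number of slots the job was given. The sketch below assumes the scheduler exports the slot count in the NSLOTS environment variable, as Grid Engine does, and that the program honors OMP_NUM_THREADS; neither variable is named on this page, and the helper name is hypothetical.

```python
import os


def thread_limit(env=None, default=1):
    """Return the slot count from NSLOTS, or `default` when it is missing or invalid.

    Assumption: the batch system sets NSLOTS inside a job.
    """
    env = os.environ if env is None else env
    try:
        slots = int(env.get("NSLOTS", default))
    except ValueError:
        slots = default
    return max(1, slots)


# Set this before importing numerical libraries, which usually read the
# variable once at startup.
os.environ["OMP_NUM_THREADS"] = str(thread_limit())
```

Libraries that manage their own thread pools may need their own setting; this sketch covers only the OpenMP variable.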
Example Message
To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE |
The following batch job, running on SCC_COMPUTE_NODE, has been terminated because it was using 17.1 processors but was allocated only 16. Please resubmit the job using an appropriate PE specification.
See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.
job JOBNUMBER owner: username pe: omp16 type: "Single node batch" slots: 16
sge_gid: 1000902 job_pid: 5407
cputime: 97 min. rate: 1711.70% starttime: 04/25 16:14:20
COMMAND   STATE  PID   PPID  TIME(min.)  RATE(%)  SIZE  RSS  START TIME
process0  R      5634  5411  10          1827     1832  114  04/25 16:19
process1  S      5411  5410  0 + 87      1535     725   39   04/25 16:14
Please email help@scc.bu.edu for assistance.
|
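The "17.1 processors" figure in the message follows from the reported rate: a rate of 100% corresponds to one fully used processor. A minimal sketch of that arithmetic is below. The function names are hypothetical, and the plain "greater than" comparison is an assumption, since the page does not state what tolerance the reaper applies.

```python
def processors_in_use(rate_percent):
    """Convert a CPU rate in percent to an equivalent number of processors."""
    return rate_percent / 100.0


def over_allocation(rate_percent, slots):
    """Return True if the job uses more processors than it was allocated.

    Assumption: a strict comparison with no tolerance.
    """
    return processors_in_use(rate_percent) > slots


print(round(processors_in_use(1711.70), 1))  # 17.1
print(over_allocation(1711.70, 16))          # True
print(over_allocation(1580.0, 16))           # False
```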
The Idle GPU Process Reaper
Interactive sessions and batch jobs should make effective use of specialized resources, like GPUs, when they are requested. You can learn about the use of GPUs on our GPU Computing page. On Shared resources and some Buy-in resources, this process reaper terminates a job if all of its requested GPUs remain idle for two hours.
Example Message
To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE |
The following batch job, running on SCC_COMPUTE_NODE, has been terminated because all of its requested GPUs remained idle for 2 hours. Please ensure your software makes effective use of GPU resources and resubmit the job using an appropriate GPU specification.
See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.
COMMAND STATE PID PPID START TIME
python S 1411234 1415123 12/04 19:03:40
Please email help@scc.bu.edu for assistance.
|
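The rule applies only when every requested GPU is idle for the whole period. The sketch below illustrates that logic on a list of utilization samples. How the reaper actually measures idleness is not described here, so the sample format, the "0% utilization means idle" definition, and the function names are all assumptions.

```python
def idle_minutes(samples):
    """Return the minutes elapsed since any GPU was last seen busy.

    `samples` is a time-ordered list of (minute, [utilization_percent, ...])
    tuples, one utilization value per requested GPU (hypothetical format).
    """
    if not samples:
        return 0
    latest = samples[-1][0]
    last_busy = None
    for minute, utilizations in samples:
        if any(u > 0 for u in utilizations):
            last_busy = minute
    start = samples[0][0] if last_busy is None else last_busy
    return latest - start


def all_gpus_idle(samples, window_minutes=120):
    """Return True if no GPU showed activity for at least `window_minutes`."""
    return idle_minutes(samples) >= window_minutes


never_used = [(m, [0, 0]) for m in range(0, 121, 30)]
one_gpu_busy = [(0, [0, 0]), (30, [0, 0]), (60, [0, 0]), (90, [0, 75]), (120, [0, 0])]
print(all_gpus_idle(never_used))    # True: both GPUs idle for 120 minutes
print(all_gpus_idle(one_gpu_busy))  # False: one GPU was busy at minute 90
```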
The Unassigned GPU Process Reaper
GPUs are only accessible through batch jobs and batch jobs should use only the GPUs they are assigned. You can learn about use of GPUs and the $CUDA_VISIBLE_DEVICES variable on our GPU Computing page.
This process reaper enforces GPU assignment: a process is terminated if it uses a GPU without being part of a batch job, or if it uses a GPU that is not assigned to its batch job.
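A job can check which GPUs it has been assigned by reading $CUDA_VISIBLE_DEVICES. The sketch below assumes the variable holds a comma-separated list of integer device indices; other forms, such as device UUIDs, are simply ignored. The helper names are hypothetical.

```python
import os


def assigned_gpus(env=None):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES as integers.

    Assumption: the value is a comma-separated list of integer indices.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(tok) for tok in value.split(",") if tok.strip().isdigit()]


def is_assigned(gpu_index, env=None):
    """Return True if `gpu_index` is among the GPUs assigned to the job."""
    return gpu_index in assigned_gpus(env)


job_env = {"CUDA_VISIBLE_DEVICES": "0,1"}
print(assigned_gpus(job_env))   # [0, 1]
print(is_assigned(2, job_env))  # False: gpu 2 is not assigned to this job
print(assigned_gpus({}))        # []: no GPUs assigned outside a GPU job
```

Software that selects a device by a hard-coded number, rather than from the devices made visible to the job, is a typical way to end up on an unassigned GPU.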
Example Message for Case 1: A non-batch process accesses a GPU
To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE |
The following process, running on SCC_COMPUTE_NODE, has been terminated because it was using gpu 0, but it was not associated with a batch job. Only processes which are part of a batch job are allowed to use gpus.
COMMAND STATE PID PPID START TIME
python S 1415191 1415019 12/04 15:23:10
Please email help@scc.bu.edu for assistance.
|
Example Message for Case 2: A batch job process accesses a GPU that is not assigned to it
To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE |
The following process, running on SCC_COMPUTE_NODE, has been terminated because it was using gpu 2, which was not assigned to its associated batch job, JOB_NUMBER. Batch jobs are only allowed to use the gpus assigned to them via the $CUDA_VISIBLE_DEVICES environment variable.
See https://www.bu.edu/tech/support/research/software-and-programming/programming/multiprocessor/gpu-computing/#CUDAVISIBLE for more information.
COMMAND STATE PID PPID START TIME
python S 1320228 1320137 12/04 13:16:18
Please email help@scc.bu.edu for assistance. |