Process Reaper and Policy Enforcement : TechWeb : Boston University

The Shared Computing Cluster (SCC) implements several automatic “process reapers” to enforce policy. These detect and terminate processes or batch jobs that use resources beyond the job request or that make inefficient use of resources. Actions taken by the process reapers are reported to the owner of the impacted process via email. Research Computing is available to assist researchers in optimizing their workflows and batch job specifications.

The Login Node Process Reaper

The SCC login nodes are the primary connection point for researchers using the SCC. These nodes are intended for administrative tasks and light work; long-running or CPU-intensive tasks should be run as batch jobs. This process reaper enforces a limit of 15 minutes of CPU time on each process running on a login node. As the example message below states, a process is terminated only if its CPU time also exceeds 25% of its lifetime.

Example Message

To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC1

The following process, running on SCC1, has been terminated because it exceeded the limits for interactive use. An interactive process is killed if its total CPU time is greater than 15 minutes and greater than 25% of its lifetime. Processes which may exceed these limits should be submitted through the batch system. See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.

COMMAND      STATE  PID   PPID  TIME(min.)  RATE(%)  SIZE  RSS   START TIME
processname  S      5912  8049  18 + 0      37       2883  2385  05/07 11:23

Please email help@scc.bu.edu for assistance.
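The rule quoted in the example message can be illustrated with a short sketch. It is not the reaper's actual implementation: the function name, its parameters, and the assumption that the rate is CPU time divided by process lifetime are hypothetical and serve only to show how the two thresholds combine.

```python
def exceeds_interactive_limits(cpu_minutes: float,
                               lifetime_minutes: float,
                               cpu_limit_minutes: float = 15.0,
                               min_rate_percent: float = 25.0) -> bool:
    """Return True if a process matches the rule quoted in the message.

    Assumption: the rate is CPU time as a percentage of wall-clock lifetime.
    """
    if lifetime_minutes <= 0:
        return False
    rate_percent = 100.0 * cpu_minutes / lifetime_minutes
    # Both conditions must hold: more than 15 CPU minutes AND more than 25% rate.
    return cpu_minutes > cpu_limit_minutes and rate_percent > min_rate_percent


# The process in the example message: 18 CPU minutes at a 37% rate.
print(exceeds_interactive_limits(18, 18 / 0.37))
# A mostly idle editor session: 18 CPU minutes spread over 2 hours (15% rate).
print(exceeds_interactive_limits(18, 120))
```

Under these assumptions, a long-lived but mostly idle process stays under the 25% rate and is not terminated, while a short burst of computation stays under the 15-minute limit.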

The CPU Limit Process Reaper

Compute nodes should run only processes associated with jobs, and jobs should use only the resources requested in the job submission. You can learn about process/slot requests on our Submitting Batch Jobs page. This process reaper terminates processes that are not associated with a job (e.g., processes started by connecting directly to a compute node over SSH) and jobs that use more processors than requested.
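One way to stay within the requested slots is to size thread or worker pools from the scheduler's allocation rather than from the node's core count. The sketch below assumes the batch system exports the Grid Engine `NSLOTS` environment variable to the job; the helper name `allocated_slots` is hypothetical and not part of any SCC tooling.

```python
import os
from typing import Mapping, Optional


def allocated_slots(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the number of slots granted to the job, defaulting to 1.

    Assumption: the scheduler sets NSLOTS in the job environment.
    """
    env = os.environ if env is None else env
    raw = env.get("NSLOTS", "1")
    try:
        slots = int(raw)
    except ValueError:
        # Unparseable value: fall back to the most conservative choice.
        return 1
    return max(slots, 1)


if __name__ == "__main__":
    n = allocated_slots()
    # Use n, not os.cpu_count(), when creating thread or process pools.
    print(f"Using {n} worker(s)")
```

Defaulting to a single worker when the variable is missing keeps the program within the smallest possible request.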

Example Message

To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE

The following batch job, running on SCC_COMPUTE_NODE, has been terminated because it was using 17.1 processors but was allocated only 16. Please resubmit the job using an appropriate PE specification. See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.

job JOBNUMBER
owner: username  pe: omp16  type: "Single node batch"  slots: 16
sge_gid: 1000902  job_pid: 5407  cputime: 97 min.  rate: 1711.70%  starttime: 04/25 16:14:20

COMMAND   STATE  PID   PPID  TIME(min.)  RATE(%)  SIZE  RSS  START TIME
process0  R      5634  5411  10          1827     1832  114  04/25 16:19
process1  S      5411  5410  0 + 87      1535     725   39   04/25 16:14

Please email help@scc.bu.edu for assistance.
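The figures in the example message are consistent with reading the job's `rate` as a percentage of one processor, so that 1711.70% corresponds to roughly 17.1 processors. The following sketch shows that arithmetic; the function names are hypothetical, and whether the reaper applies any tolerance above the slot count is not stated, so a strict comparison is assumed.

```python
def processors_used(rate_percent: float) -> float:
    """Convert a CPU rate in percent of one processor to a processor count."""
    return rate_percent / 100.0


def over_allocation(rate_percent: float, slots: int) -> bool:
    """Return True if the job uses more processors than it was allocated.

    Assumption: a strict comparison with no tolerance margin.
    """
    return processors_used(rate_percent) > slots


# Values from the example message: rate 1711.70%, 16 slots requested.
print(round(processors_used(1711.70), 1))
print(over_allocation(1711.70, 16))
```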

The Idle GPU Process Reaper

Interactive sessions and batch jobs should make effective use of specialized resources, like GPUs, when they are requested. You can learn about the use of GPUs on our GPU Computing page. This process reaper terminates a job if all of its requested GPUs remain idle for two hours. It applies on Shared resources and some Buy-in resources.

Example Message

To: username@bu.edu
From: root <root@scc.bu.edu>
Subject: Message from the process reaper on SCC_COMPUTE_NODE

The following batch job, running on SCC_COMPUTE_NODE, has been terminated because all of its requested GPUs remained idle for 2 hours. Please ensure your software makes effective use of GPU resources and resubmit the job using an appropriate GPU specification. See https://www.bu.edu/tech/support/research/system-usage/running-jobs for more information.

COMMAND  STATE  PID      PPID     START TIME
python   S      1411234  1415123  12/04 19:03:40

Please email help@scc.bu.edu for assistance.
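The condition described above, that every requested GPU has been idle for the full period, can be sketched as follows. This is an illustration only: how the reaper measures GPU activity is not documented here, and the function name and the `last_active` mapping (GPU index to the time it was last observed busy) are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def all_gpus_idle_too_long(last_active: Dict[int, datetime],
                           now: datetime,
                           limit: timedelta = timedelta(hours=2)) -> bool:
    """Return True only if every requested GPU has been idle for `limit`.

    Assumption: `last_active` maps each requested GPU to its last busy time.
    """
    if not last_active:
        # No GPUs requested, so the idle-GPU rule does not apply.
        return False
    return all(now - seen >= limit for seen in last_active.values())


now = datetime(2024, 12, 4, 21, 30)
# One of two GPUs was busy five minutes ago, so the job is not flagged.
print(all_gpus_idle_too_long(
    {0: now - timedelta(hours=3), 1: now - timedelta(minutes=5)}, now))
```

Because the rule requires all GPUs to be idle, a job that keeps even one of its requested GPUs busy would not be flagged under this reading.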

The Unassigned GPU Process Reaper

GPUs are accessible only through batch jobs, and batch jobs should use only the GPUs they are assigned. You can learn about the use of GPUs and the $CUDA_VISIBLE_DEVICES variable on our GPU Computing page. This process reaper enforces GPU assignment: jobs and processes that use a GPU not assigned to the job are terminated, including processes that access a GPU outside of any batch job.
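A program can check which GPUs it has been given by reading `CUDA_VISIBLE_DEVICES` instead of selecting devices by hard-coded index. The sketch below assumes the variable holds a comma-separated list of device identifiers; the helper name `assigned_gpus` is hypothetical.

```python
import os
from typing import List, Mapping, Optional


def assigned_gpus(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    Identifiers are kept as strings because they may be indices or UUIDs.
    An unset or empty variable yields an empty list.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return [token.strip() for token in value.split(",") if token.strip()]


if __name__ == "__main__":
    gpus = assigned_gpus()
    if gpus:
        print("Assigned GPUs:", ", ".join(gpus))
    else:
        print("No GPUs assigned; do not attempt to use one.")
```

Treating an empty result as "no GPU available" keeps a process from touching a device that was not assigned to its job.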

Example Message for Case 1: A non-batch process accesses a GPU

Example Message for Case 2: A batch job process accesses a GPU that is not assigned to it.