The Shared Computing Cluster (SCC) is different from your home computer or laptop in several ways. The SCC is “Shared” among several thousand faculty, researchers, and students who all use it. Changes to the central systems that operate the SCC can only be made by administrators; otherwise, individual users might accidentally make changes that adversely affect everyone else. It is a “Cluster” because the SCC isn’t one “big” computer but many “normal”-sized computers all connected together. The individual computers are often called “nodes”. You may use part of a node, an entire node, or even multiple nodes for a computing task. To make this work there are special subsystems that support all the things you might want to do: a batch job scheduler, a software module system, a networked file system, etc.
Below are answers to a number of questions that students in courses utilizing the SCC commonly have.
How do I use the SCC?
If you want to work interactively or with graphical applications, SCC OnDemand is an easy web-based interface. If you have a single big compute task, or many smaller tasks that can run independently, submitting batch jobs is the best way to start them.
You can think of the SCC like a big hotel. The “lobby” of the SCC is the login nodes. These are special nodes that are connected to the internet and are the gateway to the SCC. If the login nodes are the lobby, then the compute nodes are the rooms you can “rent” (though with no monetary charge here). Just like at a hotel, if you want to do any large task you need to “rent a room”; trying to run compute tasks on a login node will result in your task being killed. You should instead request an interactive session or submit a batch job to run on a compute node. Basic tasks that use little computing power, like editing text files or copying files and folders around, are fine to do on the login nodes.
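As a sketch, requesting an interactive session (“renting a room”) from a login node looks roughly like this. The project name myproject and the resource values are placeholders, and the command is only printed here rather than run, since qrsh exists only on the SCC; see our documentation for the exact options your task needs:

```shell
# 'myproject' is a placeholder for your own project name.
PROJECT=myproject

# Ask for 4 CPU cores for 2 hours; run the printed command on a login node.
CMD="qrsh -P $PROJECT -pe omp 4 -l h_rt=2:00:00"
echo "$CMD"
```

When the session starts you are placed on a compute node, and anything you run there counts against the cores and time you requested.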
Where should I put my data, files, etc.? How do I access my files?
Everyone has a “home directory” limited to 10 GB of disk space. There is also “Project Disk Space”, an additional area you can work in with more space available. Each project you are a part of has its own disk space. These files and folders aren’t stored on a particular node; they are stored centrally on the networked file system (NFS). Generally speaking, anything stored on the NFS is accessible from *any* login or compute node. Any changes you make to your files during an OnDemand session or batch job are written to the NFS and persist even after your session/job ends. See our File Storage page for more information.
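A quick way to see how close you are to the 10 GB home-directory limit, plus a sketch of moving data out of it (the project path below is an assumption; myproject stands in for your own project’s directory):

```shell
# Report the total size of your home directory (limited to 10 GB on the SCC):
du -sh "$HOME"

# Moving large data into Project Disk Space frees up home-directory space.
# 'myproject' is a placeholder; ask your PI or TF for the real path:
#   mv ~/large_dataset /projectnb/myproject/$USER/
```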
How do I use or install software on the SCC?
We have many hundreds of software applications already installed on the SCC. These are accessed via the module system.
It is also possible to install software yourself. Remember: the SCC isn’t like your laptop; often you can’t exactly follow the software’s installation instructions. In particular, using admin commands like “sudo” or trying to modify the SCC’s central systems will fail. However, you can install software into your Project Disk Space. For installing Python software (including conda environments), please refer to the instructions on our website.
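As a sketch, a user-space Python install into Project Disk Space can look like this. The project path and package name are placeholders, and the pip commands are printed rather than executed here, since the path only exists on the SCC:

```shell
# 'myproject' is a placeholder for your own project's directory.
PREFIX=/projectnb/myproject/$USER/pylibs

# pip's --target option installs into the given directory instead of the
# system location (which you cannot modify on the SCC):
echo "pip install --target=$PREFIX requests"

# Python then needs to be told where to find the installed packages:
echo "export PYTHONPATH=$PREFIX:\$PYTHONPATH"
```

For conda environments, the preferred setup is described in the Python instructions on our website.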
I want to use PyTorch or Tensorflow. What do I do?
You should use the software module system. To see available versions:
module avail tensorflow
module avail pytorch
These libraries are difficult to install correctly on a cluster like the SCC, so do not install them yourself. If you require a version that is not available, contact your course TF for assistance. The SCC modules support both CPU and GPU execution, and GPUs will be used automatically if you requested them for your compute job.
If you are installing pre-defined conda environments, for example one that you have downloaded from GitHub, double-check that the environment.yaml file does not include PyTorch or Tensorflow libraries. The conda environment can use the module versions once the modules are loaded.
Tensorflow compute jobs (batch or interactive) should always request at least 2 cores, as that is the minimum number of cores it will use, even when using GPUs.
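After loading one of the modules (for example, module load pytorch on the SCC), a short check script confirms the GPU you requested is visible. This PyTorch-flavored sketch is written to a file so you can run it inside a batch or interactive job; the filename is just an example:

```shell
# Write a small GPU-visibility check to run within your job:
cat > gpu_check.py <<'EOF'
import torch  # provided by the loaded pytorch module on the SCC

# True only when the job was granted a GPU:
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
EOF
```

Inside a GPU job you would then run it with python gpu_check.py after loading the module.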
Why would I want to submit a batch job on the SCC? How do I submit one?
Imagine you have a computation that will take 5 days. Alternatively, imagine you have an analysis you need to run on each of 1000 data files. In both cases you do not want to manually set up and wait for these jobs to start and finish. Instead, you will want to automate the process by creating a “batch job submission script” that runs your job automatically. For more details, see the page on Submitting your Batch Job.
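As a minimal sketch, a batch job submission script looks like the following. The project name, resource values, and script names are placeholders; the “#$” lines are directives read by the scheduler when you submit with qsub:

```shell
# Write an example batch script ('myproject' and the resources are placeholders):
cat > myjob.sh <<'EOF'
#!/bin/bash -l
#$ -P myproject          # the project to run the job under
#$ -pe omp 4             # request 4 CPU cores
#$ -l h_rt=12:00:00      # request 12 hours of running time
#$ -N my_analysis        # a name for the job

module load python3      # load whatever software the job needs
python my_analysis.py
EOF

# On a login node you would then submit it with:
#   qsub myjob.sh
```

Once submitted, the scheduler queues the job and runs it on a compute node without any further action from you.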
Why does my job take so long to start!?
The SCC is shared with the rest of the BU research computing community. To ensure fairness, jobs sit in a “queue” until it is their turn in line and the resources they requested are available. If your job is waiting, either it is not your turn yet or the resources you asked for are not free yet. The more CPU cores, GPUs, memory, and running time you request, the longer you may need to wait. Don’t ask for more than what you need to run your compute task.
Why was my job killed!?
When you submit your job you make a request for a particular number of CPU cores, an amount of physical memory (RAM), and a running time. This happens either implicitly, if you don’t specify these options and are given the defaults (1 CPU core, 4 GB of RAM per requested core, 12 hours), or explicitly via your own request. Exceeding any of these requests can result in your job being killed. Make sure the software you run stays within the limits of the resources you requested.
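Explicit requests are just options passed to qsub (or placed in “#$” lines in your script). A sketch, with the command printed rather than run since qsub exists only on the SCC; the specific values, and the mem_per_core option name, follow our qsub documentation but are illustrative here:

```shell
# 8 cores, 8 GB of RAM per core, and 24 hours instead of the defaults:
REQUEST="-pe omp 8 -l mem_per_core=8G -l h_rt=24:00:00"
echo "qsub $REQUEST myjob.sh"
```

If your job is killed for exceeding a limit, raising the corresponding request and resubmitting is the usual fix.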
I heard GPUs can help my computations complete much more quickly. How can I use them on the SCC?
We have documentation on our website that describes the qsub options for requesting GPUs. GPUs can also be selected for interactive use in the OnDemand interface. The great majority of GPU jobs do not need more than 1 GPU.
Note that use of the PyTorch library with a GPU requires that you select a compute capability of 6.0 or greater with the option “-l gpu_c=6.0”. This option is available in a drop-down menu in OnDemand for interactive jobs.
When your job (compute or interactive) is running, the GPU assigned to your job will automatically be used by GPU-enabled libraries such as Tensorflow or PyTorch. Your code should always reference GPU #0 if it requires any explicit GPU selection.
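To illustrate the “always reference GPU #0” rule, here is a PyTorch-flavored sketch written to a file you could run inside a GPU job (the filename is just an example):

```shell
# Write an example script that always selects device 0 within a job:
cat > use_gpu.py <<'EOF'
import torch

# The GPU assigned to your job always appears as device 0; fall back to
# the CPU when the job was not granted a GPU:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.ones(3, device=device)  # create the tensor directly on that device
print(x.device)
EOF
```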
What specific things should I know about using Python/R/MATLAB on the SCC?
Are there examples of how to use XYZ on the SCC?
Why is my home directory full?
Do you have a bunch of undeleted files in your Trash? Delete them from within an OnDemand desktop.
Do you have a bunch of files saved in your home directory? Move them to Project Disk Space.
Did you install a conda environment in your home directory? Delete it, create a .condarc file, and reinstall in Project Disk Space. More details are available.
Did you install Python packages in your home directory? Delete them from your ~/.local directory and reinstall them in Project Disk Space. More details are available.
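For the conda case, a .condarc that keeps environments and package caches in Project Disk Space can look like the following sketch. The myproject paths are placeholders, and the file is written to a local example name here; on the SCC the real file lives at ~/.condarc:

```shell
# Write an example .condarc ('myproject' paths are placeholders):
cat > condarc.example <<'EOF'
envs_dirs:
  - /projectnb/myproject/condaenvs
pkgs_dirs:
  - /projectnb/myproject/condapkgs
EOF
```

With these settings in place, new environments and downloaded packages land in Project Disk Space instead of filling your 10 GB home directory.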
What are the first things to check when something isn’t working or it used to work, but doesn’t now?
Is your home directory full? It only has 10 GB of space and so fills quickly. Free up space there and instead use Project Disk Space.
Do you automatically load SCC modules? Do you automatically load a conda environment? Do you automatically run scripts? Everyone has a special file in their home directory called “.bashrc”. Every command in your .bashrc is run automatically each time you create a session or run a job on the SCC. Doing any of the above actions automatically in your .bashrc is not recommended because it may break things. Edit your .bashrc to comment out or remove these actions.
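As a sketch, the fix usually looks like this; the module and conda lines stand in for whatever your .bashrc actually contains, and the content is written to a local example file here rather than your real ~/.bashrc:

```shell
# Commenting out (rather than deleting) makes it easy to restore a line later:
cat > bashrc.example <<'EOF'
# module load python3          # commented out: load modules per-job instead
# conda activate my_env        # commented out: activate per-session instead
EOF
```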