Many deep learning applications can be trained on a single GPU, while others require multiple GPUs. In general, SCC users are limited to using a maximum of 4 GPUs on a single node unless they have access to reserved buy-in resources. If your application requires more GPUs than are available on the SCC, we recommend requesting resources from ACCESS. For information about getting credits to access these resources, please see this page. We advise using the NCSA Delta system for distributed multi-node GPU computations. The full list of resource providers can be found here.

Each of these clusters uses the SLURM workload manager to schedule batch jobs. We provide an overview of each system's hardware below. For specific details on using SLURM on these clusters and for complete hardware specifications, please follow the documentation links in each section. At the end of this page we provide a link to a GitHub repository containing two example codes that demonstrate how to run multi-node, multi-GPU distributed computations on the Delta system.
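On SLURM-managed clusters, a training script launched with one task per GPU can discover its place in the job from environment variables that SLURM sets for each task. The sketch below is a generic, hedged illustration only (it is not taken from any of these clusters' documentation, and it assumes the batch script exports MASTER_ADDR and MASTER_PORT for the rendezvous); it shows how a PyTorch script might initialize its process group from those variables.

```python
# Hypothetical sketch: initializing torch.distributed from SLURM environment
# variables when the job is launched with one task per GPU (e.g. via srun).
# MASTER_ADDR and MASTER_PORT are assumed to be exported by the batch script.
import os
import torch
import torch.distributed as dist

def init_distributed():
    # SLURM exposes the global task index, total task count, and the
    # task index local to the node through these environment variables.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    # Bind this process to one GPU on its node.
    torch.cuda.set_device(local_rank)

    # NCCL is the usual backend for multi-GPU, multi-node training.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size, local_rank
```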


Delta

The University of Illinois NCSA Delta system is designed to run applications on GPU nodes or hybrid CPU-GPU nodes. The Delta system consists of 5 node types; the 4 GPU node types relevant here are:

Number of nodes | GPUs per node | GPU type | Memory per GPU
100 | 4 | NVIDIA A40 | 48 GB
100 | 4 | NVIDIA A100 | 40 GB
6 | 8 | NVIDIA A100 | 40 GB
1 | 8 | AMD MI100 | 32 GB

Users can log on to the Delta system by following the instructions at this link. Delta users can SSH to command-line login nodes; alternatively, there is an Open OnDemand interface similar to the one on the SCC.

Example GPU codes for Delta

Follow this GitHub link for documentation on the example codes.
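The linked repository contains the authoritative examples. Purely as a generic sketch (not the contents of that repository, and using placeholder model, data, and hyperparameters), a training script that has initialized its process group as shown earlier typically wraps its model in DistributedDataParallel and shards the data with a DistributedSampler:

```python
# Generic sketch of a distributed training loop; the model and data are toys.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def train(local_rank, epochs=2):
    device = torch.device(f"cuda:{local_rank}")

    # Toy model standing in for a real network; DDP synchronizes gradients
    # across all processes in the job.
    model = nn.Linear(32, 1).to(device)
    model = DDP(model, device_ids=[local_rank])

    # Synthetic dataset; DistributedSampler gives each process a disjoint shard.
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across processes here
            opt.step()
```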

 

Additional ACCESS GPU resources

Rockfish

The JHU Rockfish system is a community-shared cluster housed at the Maryland Advanced Research Computing Center (MARCC) in Baltimore. The available GPU nodes are:

Number of nodes | GPUs per node | GPU type | Memory per GPU
18 | 4 | NVIDIA A100 | 40 GB
6 | 4 | NVIDIA A100 | 80 GB

FASTER

The TAMU FASTER system is a Dell x86 HPC cluster consisting of 180 compute nodes. Researchers with allocations on FASTER can request up to 10 composable GPUs, meaning that GPU resources are attached to a compute node on the fly. The GPU types that can be composed onto the compute nodes are:

Number of GPUs | GPU type | Memory per GPU
200 | NVIDIA T4 | 16 GB
40 | NVIDIA A100 | 40 GB
8 | NVIDIA A10 | 24 GB
4 | NVIDIA A30 | 24 GB
8 | NVIDIA A40 | 48 GB
