Intel Compiler Flags : TechWeb : Boston University

Intel provides compilers that generate highly optimized code for its CPUs. As with all compilers, programs compiled with optimization should have their output double-checked for accuracy. If the numeric output is incorrect or lacks the desired accuracy, less-aggressive compile options should be tried. The following table summarizes some relevant commands on the SCC:

Command	Description
module avail intel	List available versions of the Intel compiler.
module load intel/2023.1	Load a particular version.
icc	C compiler.
icpc	C++ compiler.
ifort	Fortran compiler.
icx	New generation C compiler. (intel/2023.1 and newer only)
icpx	New generation C++ compiler. (intel/2023.1 and newer only)
ifx	New generation Fortran compiler. (intel/2023.1 and newer only)

Intel compiler modules after 2023.1 will only support the new generation compilers (icx, icpx, ifx) as Intel is retiring the older ones (icc, icpc, ifort).
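
Because the classic compilers are being retired, a build script may want to prefer the new generation compiler and fall back to the classic one only when the loaded module does not provide it. The following is a minimal sketch of that idea. The helper name pick_compiler is hypothetical and is not part of the Intel tools or the SCC environment.

```shell
# Hypothetical helper: print the first command from the argument list
# that is available in the current environment.
pick_compiler() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the new generation C compiler, then the classic one.
if CC=$(pick_compiler icx icc); then
    echo "Using C compiler: $CC"
else
    echo "No Intel C compiler found; load an intel module first." >&2
fi
```

The same helper could be called with icpx icpc or ifx ifort for C++ and Fortran builds.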

All compilers have manuals available, for example:

man ifx
man icpx

Intel also has a document that makes recommendations for optimization options.

General Compiler Optimization Flags

The Intel compilers' optimization flags deliberately mimic many of those used with the GNU family of compilers. The basic optimization flags are summarized below. Using these flags does not result in any incompatibility between CPU architectures. Note that using the Intel compiler is not recommended when the program will be run on AMD processors, because the resulting executables show lackluster performance there.

Flag	Description
-O	Optimized compile.
-O2	More extensive optimization. Recommended by Intel for general use.
-O3	More aggressive than -O2 with longer compile times. Recommended for codes with loops involving intensive floating-point calculations.
-Ofast	-O3 plus some extras.
-ipo	Interprocedural optimization, a step that examines function calls between files when the program is linked. This flag must be used both when compiling and when linking. Compile times are very long with this flag; however, depending on the application, there may be appreciable performance improvements when combined with the -O* flags.
-mtune=processor	This flag does additional tuning for specific processor types; however, it does not generate extra SIMD instructions, so there are no architecture compatibility issues. The tuning involves optimizations for processor cache sizes, preferred ordering of instructions, and so on. The useful values for processor on the SCC are: broadwell, haswell, ivybridge, sandybridge, or cascadelake.
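
Since the more aggressive levels such as -O3 and -Ofast can change floating-point results, one way to follow the advice about double-checking output is to build a reference executable with a low optimization level, build a second one with the aggressive flags, and compare what they print. The sketch below assumes each program writes one number per line. The helper name compare_outputs and the file names are hypothetical, and the tolerance should be chosen to suit the application.

```shell
# Hypothetical helper: compare_outputs REF NEW TOL
# Assumes each file holds one number per line. Succeeds (exit 0) only when
# both files have the same number of lines and every value in NEW is within
# TOL of the value in REF, measured relative to max(|REF value|, 1).
compare_outputs() {
    awk -v tol="$3" '
        BEGIN { n = 0; m = 0; bad = 0 }
        NR == FNR { ref[FNR] = $1; n = FNR; next }   # first file: reference values
        {
            m = FNR
            diff = $1 - ref[FNR]
            if (diff < 0) diff = -diff
            scale = (ref[FNR] < 0) ? -ref[FNR] : ref[FNR]
            if (scale < 1) scale = 1
            if (diff > tol * scale) {
                printf "line %d differs: %s vs %s\n", FNR, ref[FNR], $1
                bad = 1
            }
        }
        END { if (m != n) bad = 1; exit bad }
    ' "$1" "$2"
}

# Example use (mycode.c and the output file names are placeholders):
#   icx -O0    mycode.c -o mycode_ref && ./mycode_ref > ref.txt
#   icx -Ofast mycode.c -o mycode_opt && ./mycode_opt > opt.txt
#   compare_outputs ref.txt opt.txt 1e-9 || echo "optimized output differs"
```

If the comparison fails, a less aggressive flag such as -O2 can be tried and the check repeated.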

Flags to Specify SIMD Instructions

These flags will produce executables that contain specific SIMD instructions, which may affect compatibility with compute nodes on the SCC.

Flag	Description
-xHost	Must be used with at least -O2. Creates an executable that uses SIMD instructions based on the CPU that is compiling the code. Not recommended as compiling on a newer architecture compute node results in a program that cannot run on older architectures.
-fast	A combination of -Ofast, -ipo, -static (for static linking), and -xHost.
-march	Must be used with at least -O2 and specifies the type of SIMD instructions to be generated. When combined with the -ax flag this sets the minimum SIMD instruction set. Also note that when the compiled software runs on an AMD processor the value specified by this flag is used even if the processor supports other instruction sets. The values for this flag mimic those from the GNU compilers: avx, avx2, and a large number of avx512 flags. There is an alternate form of this flag, -x, which uses the options given below with -ax. However, code compiled with -x will not execute at all on AMD processors, so it is not recommended.
-axarch	This must be used with at least -O2 and -march. The -march flag will produce specific SIMD instructions, and additional SIMD instructions can be supported by adding the -axarch flag. Every function that can be compiled with SIMD instructions will have separate copies created for each instruction set. At runtime the executable auto-detects CPU instruction support and selects which version to run. Compile times can be very long, as functions are compiled multiple times over, and the resulting binary will be large. The useful values for arch on the SCC are: AVX, CORE-AVX2, and CORE-AVX512. Several instruction sets can be included with this flag when comma-separated. For example: icx -c -O3 -mavx -axCORE-AVX2,CORE-AVX512 mycode.cpp
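
Because -xHost and -fast target the CPU of the machine doing the compiling, it can help to see which SIMD instruction sets that CPU reports before choosing flags. On Linux these appear on the flags line of /proc/cpuinfo. The sketch below is one way to query them. The helper name cpu_has_flag is hypothetical, and the names avx, avx2, and avx512f are the Linux kernel's feature names rather than Intel compiler option values.

```shell
# Hypothetical helper: cpu_has_flag FLAG [CPUINFO_FILE]
# Succeeds when FLAG appears as a whole word on the first "flags" line.
# CPUINFO_FILE defaults to /proc/cpuinfo (Linux only).
cpu_has_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null \
        | tr ' \t' '\n\n' \
        | grep -qx "$1"
}

# Report the SIMD levels discussed in the table for the current node.
for simd in avx avx2 avx512f; do
    if cpu_has_flag "$simd"; then
        echo "$simd: supported on this node"
    else
        echo "$simd: not reported on this node"
    fi
done
```

Running this on a login node and inside a job on a compute node shows whether the two differ, which is the situation where -xHost causes trouble.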

Default Optimization Behavior

Most open source programs that compile from source code use the -O2 or -O3 flags. This will result in fast code that can run on any compute node on the SCC. The -fast flag can be problematic (due to its inclusion of the -xHost flag) when used on the login nodes, as they have Broadwell architecture CPUs which support AVX2 instructions. Codes compiled there with -fast will only be able to execute on Broadwell architecture compute nodes on the SCC.

Recommendations

Most codes will be well-optimized with the -O2 or -O3 flags. Programs that involve intensive floating-point calculations inside of loops can additionally be compiled with the -xarch flag. For maximum cross-compatibility across the SCC compute nodes and probable highest performance a combination of flags should be used:

icc -Ofast -mavx -axCORE-AVX2,CORE-AVX512 -c mycode.cpp

If benchmarking and testing of the compiled code show no improvement with the SIMD flags (-mavx and -ax in the command above), they can be removed to improve compilation times.
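
A rough benchmark of this kind can be done by timing repeated runs of two builds of the same code. The sketch below is one simple approach with whole-second resolution, so it is only meaningful for runs that take at least several seconds in total. The helper name time_runs and the executable names are hypothetical.

```shell
# Hypothetical helper: time_runs COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times, discarding its standard output, and prints the
# total elapsed wall-clock time in whole seconds. Stops early with a
# non-zero status if any run fails.
time_runs() {
    count=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null || return 1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo $((end - start))
}

# Example use, with placeholder names for two builds of the same code:
#   time_runs 5 ./mycode_plain
#   time_runs 5 ./mycode_simd
```

Both builds should be timed on the same type of compute node so that the comparison reflects the flags rather than the hardware.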

Note that selecting specific SIMD instructions with the -xarch flag alone will restrict compatibility with compute nodes unless the job is submitted with this qsub flag: -l cpu_arch=compatible_arch. The compatible_arch value is an architecture name that matches the SIMD instructions. In this example a code is compiled with AVX instructions and a Haswell architecture CPU is requested with qsub:

icc -Ofast -mavx mycode.cpp -o mycode
qsub -l cpu_arch=haswell -b y mycode

If a code is relatively small in scope it can be compiled as part of a queue job. For example, a job that is submitted to run on a Buy-in node equipped with an Ivybridge architecture CPU could be compiled with tunings for that node. As a precaution the source is copied into $TMPDIR:

Example Batch Script to Recompile on a Compute Node

#!/bin/bash -l
#$ -l cpu_arch=ivybridge
module load intel/2016

# Copy the source to $TMPDIR to avoid interaction
# with other jobs running
cp -R /projectnb/myproject/mysource $TMPDIR

cd $TMPDIR/mysource

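# Compile each source file with tuning for the Ivybridge node running this job
icc -Ofast -mtune=ivybridge -xHost -c file1.c
icc -Ofast -mtune=ivybridge -xHost -c file2.c
# Link the object files and the math library into the executable
icc -o myexe file1.o file2.o -lm

# Run the freshly built executable from the current directory
./myexe arg1 arg2 ....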