clu2's notes: Intel compiler suite reference card

Intel Compiler Suite

icc ecc	C compiler driver. ecc is the old name of icc for Itanium architecture. (e stands for Electron, the code name of Intel's C compiler for Itanium)
icpc	C++ compiler driver.
ifort efc	Fortran compiler driver. efc is the old name of ifort for Itanium architecture.
mcpcom	C/C++ compiler (Macro CPlus COMpiler ?)
fortcom	Fortran compiler
svcpcom	C/C++ source verifier/checker
svfortcom	Fortran source verifier/checker
fpp	Fortran preprocessor
idb	Debugger
codecov	Code-coverage and test-prioritization tool
xiar	Library archiver. It has the same functionality as `ar` in the GNU Binutils, but is meant for objects compiled with Interprocedural Optimization (-ipo).
xild	Linker. It has the same functionality as `ld` in the GNU Binutils, but is meant for objects compiled with Interprocedural Optimization (-ipo).

File extensions

.c	C source files.
.h	C/C++ header files.
.i	Preprocessed C source files.
.C/cc/cxx/c++/cp/cpp/CPP	C++ source files.
.H/hh/hxx/h++/hp/hpp/HPP/tcc	C++ header files.
.s	Assembler code.
.d	Dependency files. They contain rules suitable for Makefile describing the dependencies of the source file. Created by -MD option.

Now the compiler...

The default compiler options can be placed in the icc.cfg file (for C compiler), icpc.cfg file (for C++ compiler), or ifort.cfg file (for Fortran compiler). These files can also be pointed to by ICCCFG, ICPCCFG, and IFORTCFG environmental variables.

Compile

-c	Compile .c and assemble .s. NO linking.
-Idir	Also search dir for header files. This can also be controlled by environmental variables `C_INCLUDE_PATH` and `CPLUS_INCLUDE_PATH`.
-S	Compile .c into assembly code .s. NO linking.
-fsource-asm	Make the generated assembly code more readable.
-E	Run preprocessor only. The output is sent to stdout.
-C	When running preprocessor, don't discard comments in the program.
-dM	When used with -E option, display definitions of all built-in macros, e.g. gcc -E -dM - < /dev/null
-o file	Place output in file
-dryrun	Display the programs invoked by the driver, but do not compile.
--version -dumpversion	Print the version number.
-dumpmachine	Print the machine info.
@file	Read command-line options from file. The options read are inserted in place of the original @file option.
-Bprefix	Search prefix for Intel compiler executables. The equivalent environmental variable is `GCC_EXEC_PREFIX`.
-print-multi-lib	Print the search directories of system libraries

C/C++ dialect

-ansi -strict-ansi	Strictly ISO C90 standard. In particular, C programs can't use C++ style "//" comments and inline keyword. `__INTEL_STRICT_ANSI__` will be defined.
-std=s	Determine the language standard. s can be c89, c99, gnu89, gnu++98 (default), c++0x ...
-openmp	Enable OpenMP. The run-time library libiomp will be used. `_OPENMP` will be defined.
-openmp-link static -openmp-link dynamic	Whether OpenMP library should be statically or dynamically linked.
-openmp-stubs	Compile an OpenMP program into serial code. All OpenMP directives are ignored.
-openmp-task model	Which OpenMP tasking model to be used. model can be intel or omp (default).
-openmp-report n	Show diagnostic information when compiling OpenMP programs. n specifies detail levels.
-openmp-threadprivate type	Specify the type of threadprivate implementation. type can be legacy (default) or compat (so it will be compatible with other compilers).
-par-num-threads=n	The number of threads to use. This option overrides the environmental variable `OMP_NUM_THREADS`. It must be used together with either -parallel or -openmp.
-par-affinity	Set the threads' CPU affinity. This option overrides the environmental variable `KMP_AFFINITY`. It must be used together with either -parallel or -openmp.
-pthread	Enable pthread support.
-mkl	Enable Intel Math Kernel library.
-ipp	Enable Intel Performance Primitives library.
-use-intel-optimized-headers	Use headers of Intel Performance Primitives library.
-tbb	Enable Intel Threading Building Blocks library.
-cilk-serialize	Enable serialization of Intel Cilk Plus code.
-funsigned-char -fsigned-char	Whether by default char is signed or unsigned.

Fortran specific

-heap-arrays 0

Put automatic arrays on heap instead of stack. If your Fortran code crashes because of large arrays, this could help.

Interoperability with GCC

-gcc-name=dir
-cxxlib[=dir]

When used together, specify the full path of gcc C++ libraries.

-cxxlib is on by default

Alternatively, one can set environmental variables GXX_ROOT and GXX_INCLUDE

-gcc-version=nnn Compatibility with gcc version nnn, e.g. 340 means gcc 3.4

-gxx-name=dir Specify the full path of g++

Preprocessor

-Dname -Dname=value	Predefine the macro name, with value 1, or with the specified value
-Uname	Un-define the (built-in or -D defined) macro name
-M -MM	Output a rule (to stdout) suitable for Makefile describing the dependencies of the source file. -MM only outputs header files not in the system header directories. This option implies -E option.
-MF file	The output of -M is written to file. This can also be controlled by environmental variable `DEPENDENCIES_OUTPUT`.
-MD	The same as -M -MF combined, but doesn't imply -E. *.d files will be generated.
-Wp,opt	Pass opt to the C preprocessor.

Warning messages

-Wall	Enable all warnings.
-w	Suppress all warnings.
-Werror	Treat warnings as errors.
-Winline	Warn if a function can't be inlined by compiler but is declared as such in the program.

Link

-Ldir	Also search dir for library files. This can also be controlled by environmental variable `LIBRARY_PATH`.
-llibrary	Link to liblibrary. The linker searches libraries and object files in the order they are specified, so foo.o -lz bar.o searches library z after foo.o but before bar.o. If bar.o refers to functions in z, then -lz must appear AFTER bar.o.
-s	Remove all symbol information from the executable
-static	Produce statically linked executable
-shared -fPIC -rdynamic	Produce shared libraries. For details, see here. -rdynamic is needed for some uses of dlopen or to allow obtaining backtraces from within a program.
-nostartfiles	Don't link to the standard startup files (so the start point of a program is not main, but _start). To compile crt1.o, one has to use this option. Also see here for examples.
-nodefaultlibs	Don't link to the standard system libraries (e.g. libgcc.a).
-nostdlib	Don't link to the standard system libraries (e.g. libgcc.a) or startup files.
-static-libgcc -shared-libgcc	Whether libgcc should be statically or dynamically linked.
-static-intel -shared-intel	Whether Intel-provided libraries should be statically or dynamically linked.
-Wl,opt	Pass opt to the linker.
-Wl,-t -Wl,--print-map	Enable linker to output trace/link map information.
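-Wl,--start-group -Wl,--end-group	The archives listed between this pair are searched repeatedly by the linker until no new undefined references can be resolved. Useful when libraries have circular dependencies.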

Debugging

-g	Produce debugging information.
-trapuv	Initializes stack local variables to an unusual value to aid error detection.
-debug all	Produce complete debugging information.
-debug parallel	Produce debugging information for the thread data sharing and reentrant call detection of the Intel Parallel Debugger Extension.
-diag-enable sc-parallel	Enables analysis of parallelization in source code (parallel lint diagnostics).
-diag-enable group	Enables messages of diagnostic group, which can be vec, thread, par, openmp, driver, ...
-save-temps	Save all temporary/intermediate files produced during compiling.
-parallel-source-info	Emit source code location when OpenMP or auto-parallelization code is generated.

Profiling

-openmp-profile	Produce profiling information for OpenMP. A text file named guide.gvs ("Generated Values and Statistics") will be created after the run; the file name can be changed through the `KMP_STATSFILE` environmental variable. See here for the explanation of guide.gvs.
-prof-gen=srcpos	Use together with codecov for code coverage analysis.

Optimization

-O0 -Od	Don't optimize.
-O1	Optimize. When any optimization option is used, `__OPTIMIZE__` is defined.
-O -O2	Optimize even more. This is the default.
-O3	Optimize yet more. In particular, it will try to inline a function whenever possible.
-Os	Optimize for code size. This enables all -O2 optimizations that don't increase code size.
-fast	The same as `-ipo -O3 -no-prec-div -static -xHost`. Note the -static part.
-opt-report -opt-report-level n	Show diagnostic information about optimization during compilation. n specifies detail levels, default is 1.
-fp-model model	If model is `fast=1` or `fast=2` then optimize floating-point arithmetic aggressively at cost in accuracy or consistency. Other possible values for model are `precise`, `except`, `strict`, `source` (round intermediate results to source-defined precision), `double` (round intermediate results to double precision), and `extended` (round intermediate results to extended precision).
-pcn	Round the significand to n bits, n can be 32, 64, 80.
-fp-speculation=mode	Enable compiler to speculate on floating-point operations. mode can be fast (default mode if any optimization is on), safe, strict, or off.
-fp-port	Round floating-point results after floating-point operations.
-fp-stack-check	Generate extra code after every function call to ensure that the FPU register stack is in the expected state.
-fp-relaxed	Generate fast but less accurate code sequences for math functions (Itanium only).
-ftz	(Enabled automatically by -O3) Enable the DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) bits in the SSE control/status register (MXCSR). The effect is that denormal results from floating-point calculations are set to 0, and denormal values used as input to floating-point instructions are treated as 0. See here for details.
-fpen	(Fortran) Set n to 0 to enable floating-point invalid, divide-by-0, overflow exceptions. Set n to 3 (default) to disable all floating-point exceptions.
-fast-transcendentals	Generate fast but less accurate code for transcendental functions.
-prec-div -no-prec-div	(Fortran) Whether or not to generate slow but more accurate code for floating-point divide.
-prec-sqrt -no-prec-sqrt	(Fortran) Whether or not to generate slow but more accurate code for floating-point square root.
-unrolln	Unroll the loop at most n times. To disable, use n=0.
-prof-gen -prof-use -prof-dir	Profile guided optimization (PGO).
-ipo	Link time/Inter-procedural optimization.
-parallel -par-threshold n	Automatically parallelize loops. n (0 to 100) sets the threshold for parallelizing a loop, based on the estimated probability that parallel execution will be profitable; the default is 75.
-par-report n	Show diagnostic information about automatic loop parallelization during compilation. n specifies detail levels.
-guide n	Provide guidance/advice for automatic vectorization/parallelization/data transformation when -parallel option is used. n sets the verbosity.
-vec -vec-threshold n	Automatically vectorize loops. n (0 to 100) sets the threshold for vectorizing a loop, based on the estimated probability that vectorization will be profitable; the default (which is also the maximum) is 100.
-vec-report n	Show diagnostic information about loop vectorization during compilation. n specifies detail levels, default is 1.
-march=cpu	Generate code for specific cpu, which can be core2, pentium4, or pentium3.
-mtune=cpu	Tune for specific cpu, e.g. core2, pentium4, pentium, pentiumpro, pentium-mmx, itanium, itanium2, etc. The default is pentium4 on x86_64 and itanium2 on IA64.
-xsimd -axsimd	Generate code for specific SSE/SIMD extensions. simd can be avx, sse4.2, sse4.1, ssse3, sse3, sse3_atom, sse2. -ax is similar to -x, except it also generates the non-SIMD-specific code (a=automatic processor dispatch).
-minstruction=movbe	Generate movbe instruction if needed.
-opt-calloc	Use optimized `calloc` call.
-opt-subscript-in-range	Assume that subscript (index) computations in loops never overflow.
-opt-matmul	Identify matrix multiplication loop nests (if any) and replace them with a matmul library call for improved performance.

Interesting features

-sox

Record compiler version number and command-line options in the generated objects.

To see them, use the following commands

    objdump -sj .comment a.out
    strings -a a.out |grep comment:

-fstack-protector
-fstack-protector-all

Enable protection against buffer overflows such as stack smashing attacks.

Environmental variables

The following environmental variables will affect compilation:

`GXX_ROOT` `GXX_INCLUDE`	Specify the location of the gcc binaries/headers.
`IA32ROOT` `IA64ROOT`	Specify the location of the headers/libraries for a non-standard installation structure.
`ICCCFG` `ICPCCFG` `IFORTCFG`	Specify the compiler-default-options files.

OpenMP environmental variables

The following environmental variables will affect OpenMP programs.

KMP-prefix ones are Intel specific environmental variables (K=KAI, Kuck & Associates, Inc)

OMP-prefix ones (not shown here) are standard OpenMP environmental variables. See the previous link for the complete list.

Also see here for Intel extension routines to OpenMP.

`KMP_SETTINGS`	[Version 12] Set to 1 to display OpenMP run-time library environmental variables.
`KMP_AFFINITY`	Set the threads' CPU affinity. See here for details.
`KMP_ALL_THREADS`	Maximum number of simultaneously executing threads.
`KMP_BLOCKTIME`	The time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping. The default is 200. Can be set to `infinity`.
`KMP_DYNAMIC_MODE`	How to choose the number of threads when `OMP_DYNAMIC` environmental variable is set to true. If the value is `load_balance` (default), then it tries to avoid using more threads than the number of available processors. If the value is `thread_limit`, then it tries to avoid using more threads than the total number of processors. If the value is `asat`, then it is based on parallel start time.
`KMP_LIBRARY`	OpenMP run-time library execution mode. If the value is `throughput` (default), then it is optimized for sharing resources with other programs. If the value is `turnaround`, then it is for dedicated use of resources, as in HPC. If the value is `serial`, then it enforces serial execution.
`KMP_STACKSIZE`	The number of bytes (e.g. 2m, 4k) allocated for each OpenMP thread to use as the private stack for the thread.
`KMP_STATSFILE`	The file name for OpenMP profiling option -openmp-profile. The default name is `guide.gvs`.
`KMP_CPUINFO_FILE`	Specify an alternate file name for the file containing the machine topology description. The default is `/proc/cpuinfo`.

Math Kernel Library environmental variables

The following environmental variables will affect programs using Intel Math Kernel Library. Most of them are thread control, i.e. they will only affect programs linked to multi-threaded version of MKL.

`MKL_NUM_THREADS` `OMP_NUM_THREADS`	Set the number of threads.
`MKL_DYNAMIC`	Set to FALSE to disable automatic selection of number of threads. The default is TRUE. Note that MKL will by default ignore the logical cores created by the HyperThreading technology.
`MKL_ALL` `MKL_BLAS` `MKL_FFT` `MKL_VML`	Set the number of threads for each function domain. Alternatively, one can also use environmental variable `MKL_DOMAIN_NUM_THREADS`
`MKL_SERIAL`	For older versions of MKL, set it to YES to disable multithreading.
`MKL_DISABLE_FAST_MM` `MKL_MM_DISABLE`	Set to any value to disable fast memory management; it will cause memory to be allocated and freed from call to call, which will negatively impact performance of routines such as the level 3 BLAS functions (`*gemm`), especially for small problem sizes.
`I_MPI_NUMBER_OF_MPI_PROCESSES_PER_NODE` `I_MPI_THREAD_LEVEL`	Although these are Intel-MPI environmental variables, MKL will read their values to determine the optimal number of threads to be used.
`MKL_DEBUG_CPU_TYPE`	(Undocumented) Set to an integer between 0 and 4, inclusive (0 and 5 for 32-bit), to choose the MKL math functions optimized for a specific SSE instruction set. This setting overrides MKL's Processor Dispatch. For 64-bit, 0 to choose the default SSE, 1 for SSE 2, 2 for Supplemental SSE 3, 3 for SSE 4.2, and 4 for AVX. For 32-bit, 0 for the default SSE, 1 for SSE 2, 2 for SSE 3, 3 for Supplemental SSE 3, 4 for SSE 4.2, and 5 for AVX. This setting has substantial performance impact on computationally intensive BLAS functions (e.g. `*gemm`) for large problem sizes. Also, if your CPU does not support the SSE instruction set you specified, you could get an "illegal instruction" error.
`MKL_DEBUG_CPU_MA`	(Undocumented) (Tested in version 10.2.5 and 10.3) Set to an integer to choose the MKL math functions optimized for a specific MicroArchitecture: 32 for Merom, 33 for Penryn, 64 for Nehalem, 66 for Westmere, 128 for Sandy Bridge, and 0 for everything else (including AMD Barcelona). Currently this flag is only used to distinguish between Nehalem and Westmere, that is, when `MKL_DEBUG_CPU_TYPE` is set for SSE 4.2, and only BLAS functions `*gemm` use it (and only on very large problem sizes). The performance difference is not that great though (1-2%). For all other cases, Intel MKL infers the MicroArchitecture from the SSE instruction set and ignores the `MKL_DEBUG_CPU_MA` setting. For example, AVX instructions mean the MicroArchitecture is Sandy Bridge, SSE4.1 means Penryn (both Merom and Penryn support Supplemental SSE 3), etc.

Intel C/C++ Compiler built-in macros

`__cplusplus`	Is defined if C++ compiler is in use.
`__FILE__` `__BASE_FILE__`	Name of the current input file (as a C string constant). `__FILE__` is an ANSI C standard macro.
`__LINE__`	Current input line number (as an integer constant). This is an ANSI C standard macro.
`__DATE__` `__TIME__`	Date & time on which the preprocessor is run. (as C string constants) These are ANSI C standard macros.
`__TIMESTAMP__`	Last modification time of the input file (as a C string constant)
`__STDC__` `__STDC_VERSION__`	`__STDC__` evaluates to 1 to mean the compiler is ISO standard conformant. `__STDC_VERSION__` evaluates to a long integer constant of the form yyyymmL. `__STDC__` is an ANSI C standard macro.
`__PIC__` `__pic__`	Evaluate to 1 if the program is compiled with `-fPIC` flag, i.e. position-independent code.
`__GNUC__` `__GNUC_MINOR__` `__GNUC_PATCHLEVEL__`	Evaluate to integer constants representing the GNU (C/C++/Fortran) compiler version numbers (major/minor/patch level).
`__ICC` `__VERSION__` `__ECC`	Evaluate to integer constants representing the Intel C/C++ compiler version numbers. `__ECC` is for Itanium only.
`__INTEL_COMPILER_BUILD_DATE`	Evaluate to the Intel C compiler build date (yyyymmdd format).
`_OPENMP`	Is defined if OpenMP is in effect.
`KMP_VERSION_MAJOR` `KMP_VERSION_MINOR` `KMP_VERSION_BUILD`	Evaluate to integer constants representing the Intel OpenMP version numbers and build date (yyyymmdd format).
`__itanium__` `__ia64__`	Defined for Itanium.
`__LP64__`	Defined for Linux x86_64 and Itanium.
`__x86_64__`	Defined for x86_64.
`__SSE__` `__SSE2__` `__SSE3__` `__SSSE3__`	Defined for processors that support SSE/SSE2... instructions.
`__OPTIMIZE__` `__OPTIMIZE_SIZE__`	Is defined if any optimization flag is used. Furthermore, `__OPTIMIZE_SIZE__` is defined if the optimization is for size, not speed.
`__GNUG__`	Evaluate to the major version number of the GNU C++ compiler.

Intel C/C++ Compiler #pragma directives

See here for a list of Intel specific pragmas.

#pragma prefetch var1,var2...	Prefetch data in variables var1,var2...
#pragma loop_count n1,n2...	Specify the possible numbers n1,n2... of iterations for the loop.
#pragma unroll n	Specify how many times a loop is to be unrolled.
#pragma unroll_and_jam n	Unroll one or more loops higher in the nest than the innermost loop, then fuse/jam the resulting loops back together.
#pragma swp	Specify the loop to be software-pipelined.
#pragma poison symbol1 symbol2 ...	symbol1 symbol2 ... are (unquoted) identifiers that must no longer appear in the program; any later use of a poisoned identifier is a compile-time error.
#pragma message string	Display string (C string constant) during compilation.
#pragma weak symbol	Declare symbol as a weak symbol. A better way to achieve this is through function attribute "weak".
#pragma weak symbol1=symbol2	Declare symbol1 as a weak alias of symbol2. A better way to achieve this is through function attributes "weak, alias".

Intel supplied libraries

i*libFNP	FlexNet Publisher License server manager for Intel C, C++, and Fortran compilers
libcilkrts	Intel Cilk Plus run-time library
libclompc libclusterguide	Intel Cluster OpenMP run-time library
libcxaguard	For guarded initialization of static variables in C++ code. For details, see the comments in `libstdc++-v3/libsupc++/guard.cc` in GCC source tree and here
libdecimal	IEEE 754-2008 Decimal floating-point arithmetic.
libguide	Intel's legacy OpenMP run-time library. It also implements Intel's extensions (i.e. `KMP_`-prefix environmental variables). Why is it called "Guide"? Because the OpenMP-enabled KAI compiler is called Guide Compiler (e.g. `guidec`).
libguide_stats	libguide for the parallelizer tool with performance statistics and profile information
libimf	Intel's equivalent of `libm.a`
libiomp5 libiompprof5 libiompstubs5 libompstub	Intel's new OpenMP run-time library. Programs compiled with non-Intel compilers can be linked to it. The stub libraries are used when -openmp-stubs option is used (compile an OpenMP program into serial code)
libipgo	Support library for profile-guided optimization.
libirc	Intel optimized run-time library. It contains the infamous Processor Dispatch code.
libirc_s	Like above, but contains SSE-specific code.
libirml	Intel Resource Management Layer library. It is a work dispatcher used by Threading Building Blocks (TBB).
libpdbx	Parallel debugger extension runtime library.
libsvml	Short vector math library.
libifcore libifcoremt libifcore_pic libifcoremt_pic	Fortran run-time library. The mt versions are for multi-threaded programs. The _pic versions allow creation of shared libraries linked to the static version of libifcore instead of the dynamic one.
libifport	Portability & POSIX support (Fortran).
libmkl_blacs_intelmpi20_*	BLACS routines using Intel MPI 2.x
libmkl_blacs_intelmpi_*	BLACS routines using Intel MPI 1.x or MPICH1/2
libmkl_blacs_*lp64	BLACS routines
libmkl_blacs_openmpi_*	BLACS routines using Open MPI
libmkl_blacs_sgimpt_*	BLACS routines using SGI Message Passing Toolkit
libmkl_cdft_core	Cluster discrete Fourier transform routines
libmkl_core	Math Kernel Library core functions
libmkl_gf libmkl_gf_*lp	MKL interface library for GNU Fortran compiler
libmkl_gnu_thread	MKL interface library for GNU OpenMP
libmkl_intel libmkl_intel_*lp	MKL interface library for Intel compiler
libmkl_intel_sp2dp	MKL interface library supporting Cray-style naming in user programs targeted for the Intel 64 or IA-64 architecture and using the ILP64 convention. SP2DP interface provides a mapping between single-precision names (for both real and complex types) in the user program and double-precision names in Intel MKL BLAS and LAPACK. Function names are mapped as shown in the following example for BLAS functions `*gemm`: sgemm -> dgemm, dgemm -> dgemm, cgemm -> zgemm, zgemm -> zgemm.
libmkl_intel_thread	MKL interface library for Intel OpenMP
libmkl_lapack	LAPACK routines
libmkl_def	Default, untuned MKL, for SSE capable processors.
libmkl_p4n	MKL for SSE 2 capable processors, i.e. Pentium 4 or better.
libmkl_mc	MKL for Supplemental SSE 3 capable processors, i.e. Core or better.
libmkl_mc3	MKL for SSE4.2 capable processors, i.e. Nehalem or better.
libmkl_avx	MKL for AVX capable processors, i.e. Sandy Bridge or better.
libmkl_pgi_thread	MKL interface library for PGI OpenMP
libmkl_scalapack	ScaLAPACK routines
libmkl_sequential	Sequential version of MKL
libmkl_solver	Iterative sparse solver, trust region solver, and GNU Multiple Precision (GMP) Arithmetic Library routines
libmkl_vml_*	Vector math library (*=avx is for AVX, mc3/p4m3 for SSE 4.2, mc2/p4m2 for SSE 4.1, mc/p4m for Supplemental SSE3, p4p for SSE3, p4/p4n for Pentium 4)
libmpi_mt libmpi_dbg_mt	Thread-safe MPI library (dbg=with extra debugging information)
libmpigc3 libmpigc4	MPI interface library for GCC 3.x/4.x compilers
libmpigf	MPI interface library for GNU Fortran compiler
libmpi_lustre libmpi_panfs libmpi_pvfs2	MPI IO library tuned for Lustre, Panasas, and PVFS parallel file systems
libtmi libtmip_mx libtmip_psm	Tag Matching Interface support for Qlogic PSM and Myricom MX interconnects
libtvmpi	MPI debugging interface library for TotalView debugger