A good reference on this topic is Agner Fog's Optimizing software in C++, Chapter 13 (available here), and his blog.
Also see a recent article here.
You are eligible for Intel compiler reimbursement if you meet some criteria.
-x switch | Processor Dispatch Routine |
---|---|
(none) | __intel_new_proc_init |
-xsse2 | __intel_new_proc_init |
-xsse3 -xP | __intel_new_proc_init_P |
-xssse3 -xsse3_atom -xT | __intel_new_proc_init_T |
-xsse4.1 -xS | __intel_new_proc_init_S |
-xsse4.2 -xH | __intel_new_proc_init_H |
-xavx -xG | __intel_new_proc_init_G |
-xcore-avx-i | __intel_new_proc_init_I |
-xcore-avx2 | __intel_new_proc_init_E |
All of the __intel_new_proc_init_* routines call __intel_cpu_indicator_init to determine the capability of the CPU. They then call their respective routines to display an error message (via the irc__print function) if the CPU is not capable enough, such as
Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) processors with Intel(R) AVX instructions support.
and to enable the DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) flags in the MXCSR (MMX Extension Control/Status Register). Therefore, it is enough for us to analyze __intel_cpu_indicator_init in detail.
Anatomy of __intel_cpu_indicator_init
As the name suggests, __intel_cpu_indicator_init initializes a global variable called __intel_cpu_indicator. This variable is consulted in many places to determine the optimal execution path, e.g. by Intel-optimized math functions such as sin and cos, and by C runtime functions such as memcpy and strchr (in the latter case, one can find functions with names like __intel_ssse3_memcpy, __intel_ssse3_strchr, and __intel_sse4_strchr). Initially, __intel_cpu_indicator is 0. __intel_cpu_indicator_init relies on the cpuid instruction, which takes the value in the EAX register as input and puts its output in the EAX, EBX, ECX, and EDX registers. It only calls cpuid with EAX=0 (to get the vendor ID string and the maximum standard cpuid level) and EAX=1 (to get the feature flags). A detailed explanation of cpuid can be found here.
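As a point of reference, here is a minimal sketch (my illustration, not Intel's code) of reading the same information from C using GCC/Clang's <cpuid.h> helpers; the bit positions are the standard CPUID leaf 1 feature flags.

```c
#include <cpuid.h>   /* GCC/Clang helpers for the cpuid instruction */
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    /* cpuid leaf 0: maximum standard leaf (EAX) and vendor ID string (EBX,EDX,ECX) */
    unsigned int max_leaf = __get_cpuid_max(0, NULL);
    __cpuid(0, eax, ebx, ecx, edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    printf("vendor = %s, max standard leaf = %u\n", vendor, max_leaf);

    if (max_leaf < 1)
        return 0;

    /* cpuid leaf 1: feature flags in ECX/EDX */
    __cpuid(1, eax, ebx, ecx, edx);
    printf("SSE3   : %u\n", (ecx >> 0)  & 1);
    printf("SSSE3  : %u\n", (ecx >> 9)  & 1);
    printf("SSE4.1 : %u\n", (ecx >> 19) & 1);
    printf("SSE4.2 : %u\n", (ecx >> 20) & 1);
    printf("POPCNT : %u\n", (ecx >> 23) & 1);
    printf("AES    : %u\n", (ecx >> 25) & 1);
    printf("OSXSAVE: %u\n", (ecx >> 27) & 1);
    printf("AVX    : %u\n", (ecx >> 28) & 1);
    return 0;
}
```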
The pseudo code of __intel_cpu_indicator_init is as follows.
```
Call cpuid with EAX=0 and store the result in local variables.
Call cpuid with EAX=1 and store the result in local variables.

if (vendor ID string is NOT "GenuineIntel") {
    set "maximum standard cpuid level" to 0;
}
if (maximum standard cpuid level is 0) {
    /* The maximum standard cpuid level is the largest value EAX can take
       when calling cpuid. If it is greater than 0, calling cpuid with EAX=1
       returns the feature flags. Otherwise, there is no need to continue. */
    return;
}
if (CPU family is 15) {   /* i.e. Pentium 4 and derivatives */
    if (is SSE3 capable) {
        __intel_cpu_indicator = SSE3;
    }
    return;
}
if (CPU family is not 6) {
    /* Pentium Pro and everything that comes after Pentium 4 (Core 2,
       Nehalem, Sandy Bridge, etc.) are of family 6. */
    return;
}
__intel_cpu_indicator = SSE3;

if (is SSSE3 capable) __intel_cpu_indicator = SSSE3; else return;

if (has MOVBE instruction) __intel_cpu_indicator = MOVBE; else return;
/* MOVBE (move data after swapping bytes) was introduced with the Intel Atom.
   It swaps the byte order of a value during a move operation and is currently
   only available on Intel Atom processors. */

if (is SSE4.1 capable) __intel_cpu_indicator = SSE4.1; else return;

if (has POPCNT instruction and is SSE4.2 capable) __intel_cpu_indicator = SSE4.2; else return;
/* POPCNT was introduced with SSE4.2 (Intel) and SSE4a (AMD).
   It counts the number of 1 bits in a word. */

if (has PCLMULQDQ instruction and is AES capable) __intel_cpu_indicator = PCLMULQDQ; else return;
/* PCLMULQDQ is part of the AES New Instructions (AES-NI). It performs
   carry-less multiplication. See here for an application in cryptography. */

if (the XGETBV instruction is enabled by the OS) {
    Call XGETBV with ECX=0 and get the result in EDX:EAX.
    if (is AVX capable)
        if (EAX shows that both XMM and YMM state are enabled by the OS) {
            __intel_cpu_indicator = AVX;
        }
}
/* XGETBV reads the value of an Extended Control Register (here XCR0). */

if (has F16C instructions) __intel_cpu_indicator = F16C;
/* F16C provides 16-bit floating-point conversion instructions. */

(more checks... for AVX2)
return;
```
Note that __intel_cpu_indicator_init checks whether AVX can actually be used via the XGETBV instruction. This is also the recommended approach in the Intel Advanced Vector Extensions Programming Reference.
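A minimal sketch of that check in plain C (again my illustration, not Intel's code) could look as follows; it uses GCC/Clang's <cpuid.h> and a small inline-assembly wrapper for XGETBV.

```c
#include <cpuid.h>
#include <stdio.h>

/* Read an extended control register (XCR); here we only need XCR0. */
static unsigned long long read_xcr0(void)
{
    unsigned int eax, edx;
    /* xgetbv with ECX=0 returns XCR0 in EDX:EAX */
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((unsigned long long)edx << 32) | eax;
}

static int os_supports_avx(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* The CPU must report AVX, and OSXSAVE must be set, i.e. the OS uses
       XSAVE/XRSTOR and exposes XGETBV to user code. */
    if (!(ecx & (1u << 28)) || !(ecx & (1u << 27)))
        return 0;

    /* XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be enabled. */
    return (read_xcr0() & 0x6) == 0x6;
}

int main(void)
{
    printf("AVX usable: %s\n", os_supports_avx() ? "yes" : "no");
    return 0;
}
```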
Processor Dispatch for memory operation functions
As mentioned earlier, the Intel Compiler suite has optimized versions of certain C library functions. Most of them consult __intel_cpu_indicator only. However, the memory operation functions (e.g. memset, memcpy) use additional Processor Dispatch code to determine the cache sizes: __intel_init_mem_ops_method initializes the global variables __intel_memcpy_mem_ops_method (2 for SSE2-capable CPUs, 1 for MMX/SSE), _data_cache_size (L1 data cache size in bytes), _data_cache_size_half, __intel_memcpy_largest_cache_size (last-level cache, usually L3, size in bytes), _largest_cache_size_half, and __intel_memcpy_largest_cachelinesize (usually 64). One example is __intel_ssse3_memcpy, which makes use of _data_cache_size_half and _largest_cache_size_half.
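To illustrate why those cache-size thresholds matter, here is a hedged, simplified sketch (not Intel's actual __intel_ssse3_memcpy; the variable names and sizes are made up for illustration) of how a copy routine typically switches to non-temporal stores once the copy exceeds a cache-related threshold.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Stand-ins for the globals set up by __intel_init_mem_ops_method. */
static size_t data_cache_size_half    = 16 * 1024;       /* e.g. half of a 32 KB L1d */
static size_t largest_cache_size_half = 4 * 1024 * 1024; /* e.g. half of an 8 MB LLC */

/* Simplified copy: assumes dst/src are 16-byte aligned and n is a multiple of 16. */
static void copy_aligned(void *dst, const void *src, size_t n)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i, blocks = n / 16;

    if (n <= largest_cache_size_half) {
        /* Small/medium copy: regular stores keep the data in cache,
           which is usually what the caller wants. */
        for (i = 0; i < blocks; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
    } else {
        /* Huge copy: non-temporal stores bypass the cache so the copy
           does not evict the working set. */
        for (i = 0; i < blocks; i++)
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        _mm_sfence();   /* make the streaming stores globally visible */
    }
    (void)data_cache_size_half;  /* a real implementation also tunes on the L1 size */
}
```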
How to override the Processor Dispatch?
If you want to override the Processor Dispatch setting, you can add the following code snippet to your program (either include it in a source file, or compile it as an independent object and link it into your program):

```c
/* "__attribute__ ((constructor))" means this routine
   ("my_intel_cpu_indicator_override" below, but one can pick any name)
   is executed before main().

   Agner Fog's "Optimizing software in C++" guide, Chapter 13, uses a similar
   approach (providing your own __intel_cpu_indicator_init), but on Linux you
   will then see an error from the linker ld:
       "multiple definitions of __intel_cpu_indicator_init"
   This error can be fixed with the "-z muldefs" command-line option, which
   instructs the linker to accept multiple definitions. If you do that, make
   sure "__intel_cpu_indicator_init" is in the same source file as main(). */
#ifdef __INTEL_COMPILER
void __attribute__ ((constructor)) my_intel_cpu_indicator_override()
{
    extern unsigned int __intel_cpu_indicator;
    __intel_cpu_indicator = 1<<11;
}
#endif
```

Here:
- 1<<11 means the CPU should be recognized as SSE3 capable,
- 1<<12 as SSSE3 capable,
- 1<<14 as MOVBE capable,
- 1<<13 as SSE4.1 capable,
- 1<<15 as SSE4.2 & POPCNT capable,
- 1<<16 as PCLMULQDQ & AES capable,
- 1<<17 as AVX capable,
- 1<<18 as F16C and RDRAND capable,
- 1<<22 as AVX2, BMI (bit manipulation instructions), LZCNT, and FMA capable.
If your code is already compiled with the Intel compiler version 12.0 or 13.0, you can still override it using this Perl script.
Why would anyone want to override it? One use case is AMD processors: on them, __intel_cpu_indicator will be 1 (the baseline case) even if they are SSE3 capable. Setting it to 1<<11 enables the optimal execution path for SSE3-capable processors.
If one wants to use the Intel-optimized memory operation functions on AMD processors, one needs to further set the following variables manually, because Intel's runtime routine _irc_init_cache_tbl checks the CPU vendor string and is not able to obtain cache size information through the cpuid instruction if the CPU is not GenuineIntel(TM):
```c
#ifdef __INTEL_COMPILER
void __intel_init_mem_ops_method() __attribute__ ((weak));

void __attribute__ ((constructor)) my_intel_init_mem_ops_method()
{
    if (__intel_init_mem_ops_method) {
        extern unsigned int __intel_memcpy_mem_ops_method,
                            _data_cache_size,
                            _data_cache_size_half,
                            __intel_memcpy_largest_cache_size,
                            _largest_cache_size_half,
                            __intel_memcpy_largest_cachelinesize;

        /* initialize _irc_cache_tbl */
        __intel_init_mem_ops_method();

        /* override the cache parameters */
        __intel_memcpy_mem_ops_method = 2;   /* 2 = SSE2 capable */

        /* for AMD Shanghai processors, 64 KB L1 data cache */
        _data_cache_size = 65536;
        _data_cache_size_half = _data_cache_size/2;

        /* for AMD Shanghai processors, 6 MB L3 cache */
        __intel_memcpy_largest_cache_size = 6291456;
        _largest_cache_size_half = __intel_memcpy_largest_cache_size/2;

        __intel_memcpy_largest_cachelinesize = 64;
    }
}
#endif
```

Make sure you know what you are doing. As mentioned earlier, __intel_cpu_indicator is used by the Intel-optimized functions, and your program could end up with an "illegal instruction" error if the CPU does not support the instructions you specified. Even if your program does not call the tuned versions of the math functions, the -x switch can generate code that uses instructions your CPU does not support.
What are DAZ, FTZ, and MXCSR?
(More info can be found here.) DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) are not part of the IEEE-754 standard, but they can speed up floating-point arithmetic:
- DAZ: Treats denormal values (used as input to floating-point instructions) as 0.
- FTZ: Sets denormal results (from floating-point calculations) to 0.
DAZ and FTZ only affect SSE instructions, not the traditional x87 instructions.
If any optimization option is used, the Intel Compiler will insert code to enable DAZ/FTZ, unless the -ftz- switch is also used.
Depending on the capability of the CPU, there are different ways to detect and enable DAZ/FTZ; they are all just different ways to manipulate the DAZ and FTZ bits in the MXCSR (MMX Extension Control/Status Register). For details, see Chapter 7 and Code Example 9.4 in Intel Processor Identification and the CPUID Instruction, or this link.
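As an illustration (not the exact code the compiler inserts), the same flags can be toggled from C with the SSE control/status intrinsics; bit 15 of MXCSR is FTZ and bit 6 is DAZ, and xmmintrin.h/pmmintrin.h provide convenience macros for both. Note that the DAZ bit is only honored on CPUs that support it.

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr/_mm_setcsr */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <stdio.h>

int main(void)
{
    /* Convenience macros: set FTZ (bit 15) and DAZ (bit 6) in MXCSR. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* Equivalent manual manipulation of the MXCSR register. */
    unsigned int mxcsr = _mm_getcsr();
    mxcsr |= (1u << 15) | (1u << 6);   /* FTZ | DAZ */
    _mm_setcsr(mxcsr);

    printf("MXCSR = 0x%08x\n", _mm_getcsr());
    return 0;
}
```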
On a side note, gcc can enable DAZ/FTZ by the -ffast-math switch.
Intel MKL (Math Kernel Library) Processor Dispatch Code
There is no doubt that Intel MKL uses a similar, infamous Processor Dispatch (TM). For example, if you use static linking, you will find that your executable contains code from libmkl_core.a, which has different versions of the math functions for different SSE instruction sets. Take the BLAS function daxpy (double-precision a*x + y, where x and y are vectors; a small usage example is given after the tables below). The 64-bit libmkl_core.a contains the following functions:
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4n_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_mc_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_mc3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
mkl_blas_avx2_xdaxpy | AVX2 version (Haswell processors or better). [Since MKL version 11.0] |
And the 32-bit libmkl_core.a:
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_p4p_xdaxpy | SSE3 version (Pentium 4 Prescott processors or better). |
mkl_blas_p4m_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_p4m3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
If you instead choose dynamic linking, then these functions are always called mkl_blas_xdaxpy, and you can find them in libmkl_def.so, libmkl_p4n.so, libmkl_mc.so, libmkl_mc3.so, libmkl_avx.so, libmkl_avx2.so, etc.
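For completeness, here is a small example of calling daxpy through MKL's CBLAS interface; which mkl_blas_*_xdaxpy kernel actually runs is decided at run time by the dispatch code described here. The call itself is standard CBLAS, so nothing in this snippet is MKL-specific apart from the header name.

```c
#include <mkl_cblas.h>   /* or simply <mkl.h> */
#include <stdio.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};
    double a = 2.0;

    /* y := a*x + y; MKL dispatches to the best daxpy kernel for this CPU. */
    cblas_daxpy(4, a, x, 1, y, 1);

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* prints 12 24 36 48 */
    return 0;
}
```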
Anyway, in the x86_64 MKL versions 10.2.5, 10.3, and 11.0, one can find the following Processor Dispatch (TM) code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | xxxMKL_CPUhasNHMWST | Check for AES (WST=Westmere) |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel (or mkl_vml_serv_getCPUisintel) | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | mkl_vml_serv_CPUisHSW | Check for AVX2 (HSW=Haswell) |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE (or mkl_vml_serv_CPUisGSSE) | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 (or mkl_vml_serv_CPUisSSE42) | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 (or mkl_vml_serv_CPUisSSE41) | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 (or mkl_vml_serv_CPUisSSE4) | Check for Supplemental SSE 3 |
In x86_64 MKL version 10.2.2, one can find the following code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | MKL_CPUhasNHMx | Check for SSE 4.2 (NHMx=Nehalem) |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | MKL_CPUhasMNI | Check for Supplemental SSE 3 (MNI=Merom New Instructions) |
libmkl_core.so | MKL_CPUhasSSE3 | Check for SSE 3 |
libmkl_core.so | MKL_CPUhasAVX | Check for AVX |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gnu|pgi}_thread.so | GetAPIC_ID | Get the APIC ID, which is used to determine the processor/core topology and enumeration. See here or here for more info. Why does MKL need to know this? Because the multi-threaded version of MKL will, by default, ignore the "extra" logical cores created by Hyper-Threading. |
libmkl_{intel|gnu|pgi}_thread.so | MaxCorePerPhysicalProc | Get number of cores per physical processor. This can help optimize cache usage. |
libmkl_{intel|gnu|pgi}_thread.so | MaxLogicalProcPerPhysicalProc | Get number of logical cores per physical processor. |
libmkl_{intel|gnu|pgi}_thread.so | GetCpuIdInfo | Check for Intel processor |
libmkl_{intel|gnu|pgi}_thread.so | CountProcNum_omp | Check for Intel processor and count the number of processors |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 | Check for Supplemental SSE 3 |
How to override MKL's Processor Dispatch?
There is an undocumented environmental variable called MKL_DEBUG_CPU_TYPE which allows users to select the SSE instruction set at run time: setting it to 0 chooses the "def" version, 1 the "p4n" version, 2 the "mc" version, and so on.
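The variable is normally set in the shell (e.g. MKL_DEBUG_CPU_TYPE=2 ./a.out). As a hedged illustration, it can also be set from inside the program, provided this happens before the first MKL call triggers the dispatcher; whether MKL honors it this way may depend on the MKL version.

```c
#include <stdlib.h>      /* setenv */
#include <stdio.h>
#include <mkl_cblas.h>

int main(void)
{
    /* Must be done before the first MKL routine runs, because the
       dispatcher reads the variable when it initializes. */
    setenv("MKL_DEBUG_CPU_TYPE", "2", 1);   /* 2 = the "mc" code path (per the mapping above) */

    double x[3] = {1, 2, 3}, y[3] = {0, 0, 0};
    cblas_daxpy(3, 1.0, x, 1, y, 1);        /* now runs the selected kernel */

    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}
```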
Are there other MKL internal parameters?
(This has been verified in MKL versions 10.2.2, 10.2.5, and 10.3.) MKL has internal parameters that determine the optimal execution path, including the number of threads, block sizes (as used in certain BLAS functions), etc. MKL_DEBUG_CPU_TYPE is only one of them, and most of them are not exposed to the user's program. Here is a list of them:
MKL internal parameter | Purpose |
---|---|
disable_fast_mm | Enable/disable fast memory management. One should instead use the environmental variable MKL_DISABLE_FAST_MM. Fast memory management is used only when MKL allocates certain sizes of memory chunks in certain BLAS functions (e.g. dgemm) |
__MKL_CPU_MicroArchitecture | The CPU microarchitecture. This is not as useful as MKL_DEBUG_CPU_TYPE. See the MKL_DEBUG_CPU_MA section here |
itisBarcelona | When __MKL_CPU_MicroArchitecture is 0, this parameter indicates whether the CPU is AMD Barcelona or not. |
mkl_cpu_type | SSE instruction set level. It has the same value as the environmental variable MKL_DEBUG_CPU_TYPE |
__HT | Intel Hyper-Threading technology is present or not |
__N_Logical_Cores __N_Physical_Cores __N_CPU_Packages __N_Cores_per_Packages | Processor topology. __N_Physical_Cores is used to determine the number of threads to be used. One should instead use the environmental variables OMP_NUM_THREADS or MKL_NUM_THREADS. |
MKL_cache_sizes | The levels of on-chip cache and their sizes in bytes. |
If one really needs to modify these internal parameters in the program, use this code snippet.
Intel OpenMP Processor Dispatch Code
Intel OpenMP is used by MKL for multi-threading support. Intel OpenMP also implements the GOMP (GNU OpenMP) interface, so GCC-compiled OpenMP programs can use Intel's OpenMP runtime library. As such, Intel OpenMP (libiomp5.so) also contains its own Processor Dispatch code. The code is the same as the Intel Compiler's Processor Dispatch code, except that all the functions and variables now carry the __kmp_external_ prefix, e.g. __kmp_external___intel_new_proc_init, __kmp_external___intel_cpu_indicator, etc.
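Assuming the prefixed symbols behave exactly like their compiler counterparts (I have not verified this against every libiomp5.so build), the override trick shown earlier should carry over by simply renaming the variable; treat the following as a sketch, not a supported interface.

```c
/* Hedged sketch: override the CPU indicator used inside libiomp5.so,
   analogous to the __intel_cpu_indicator override shown earlier.
   Assumes the __kmp_external_-prefixed symbol is visible and writable. */
#ifdef __INTEL_COMPILER
void __attribute__ ((constructor)) my_kmp_cpu_indicator_override()
{
    extern unsigned int __kmp_external___intel_cpu_indicator;
    __kmp_external___intel_cpu_indicator = 1<<11;   /* report SSE3 capable */
}
#endif
```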
Intel MPI Processor Dispatch Code
Intel MPI contains the same Processor Dispatch code as mentioned above, except that all the functions and variables now carry the __I_MPI_ prefix, e.g. __I_MPI___intel_new_proc_init_L, __I_MPI___intel_cpu_indicator, etc. As of versions 4.0.0 and 4.0.1, Intel MPI also has its own additional Processor Dispatch code to determine the algorithms for collective operations. First, Intel MPI's MPD (multi-purpose daemon) script (mpd or mpd.py) contains a function called pin_Topology, which executes the cpuinfo utility in the same directory with a single command-line argument p. The MPD script then reads the result of this command and sets I_MPI_INFO_-prefixed environmental variables (e.g. I_MPI_INFO_STATE, I_MPI_INFO_C_NAME, I_MPI_INFO_CACHE1, etc.), which are then read by the Intel MPI run-time code, e.g. libmpi.so. Based on the values of these environmental variables, the Intel MPI run-time code sets an internal variable called I_MPI_Platform, which is used to determine the algorithms for collective operations (I_MPI_COLL_DEFAULT, I_MPI_COLL_DEFAULT_HTN, I_MPI_COLL_DEFAULT_NHM, I_MPI_COLL_DEFAULT_WSM).
There is an undocumented environmental variable called I_MPI_PLATFORM which allows users to override the default value for I_MPI_Platform. See here for more info.