Analysis of Intel compiler's and Intel MKL's processor dispatch code

Whenever optimization is enabled, Intel Compiler (version 9 and later) inserts a call to a routine that determines the capability of the CPU on which the program is running. (Intel Compiler uses -O2 by default, but a debugging-related flag such as -g turns optimization off.) This routine is called before anything else in your program's main. For C/C++ programs, the name of this routine depends on the compiler option -x and can be found in the following table. For Fortran programs, __intel_new_proc_init is always called first in main, and then the -x dependent routine (as in the table below) is called in MAIN__ (which is the real main function of a Fortran program). This has been verified in Intel Compiler versions 11.x through 13.0.

A good reference on this topic is Agner Fog's Optimizing software in C++, Chapter 13 (available here), and his blog.

Also see a recent article here.

You are eligible for Intel compiler reimbursement if you meet some criteria.

-x switch                   Processor Dispatch Routine
(none)                      __intel_new_proc_init
-xsse2                      __intel_new_proc_init
-xsse3, -xP                 __intel_new_proc_init_P
-xssse3, -xsse3_atom, -xT   __intel_new_proc_init_T
-xsse4.1, -xS               __intel_new_proc_init_S
-xsse4.2, -xH               __intel_new_proc_init_H
-xavx, -xG                  __intel_new_proc_init_G
-xcore-avx-i                __intel_new_proc_init_I
-xcore-avx2                 __intel_new_proc_init_E

All of the __intel_new_proc_init_* routines call __intel_cpu_indicator_init to determine the capability of the CPU. If the CPU is not capable enough, they call their respective routines to display an error message (via the irc__print function) such as

  Fatal Error: This program was not built to run on the processor in your system.
  The allowed processors are: Intel(R) processors with Intel(R) AVX instructions support.

They also enable the DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) flags in the MXCSR (MMX Extension Control/Status Register). Therefore, it is enough for us to analyze __intel_cpu_indicator_init in detail.

Anatomy of __intel_cpu_indicator_init

As the name suggests, __intel_cpu_indicator_init initializes a global variable called __intel_cpu_indicator. This variable is consulted in many places to determine the optimal execution path, e.g. in Intel optimized math functions such as sin and cos, or in C runtime functions such as memcpy and strchr. In the latter case, one can find functions with names like __intel_ssse3_memcpy, __intel_ssse3_strchr, and __intel_sse4_strchr. Initially, __intel_cpu_indicator is 0.

__intel_new_proc_init relies on the cpuid instruction, which takes the value in the EAX register as input and puts its output in the EAX, EBX, ECX, and EDX registers. __intel_cpu_indicator_init only calls cpuid with EAX=0 (to get the vendor ID string and the maximum standard cpuid level) and EAX=1 (to get the feature flags). A detailed explanation of cpuid can be found here.
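
The two cpuid queries described above can be sketched in C as follows. This is only an illustration, not Intel's code: it assumes GCC or Clang on x86, whose <cpuid.h> header provides __get_cpuid(), and the helper names (vendor_from_regs, base_family, query_cpu, print_cpu_summary) are made up for this example. The decoding helpers are kept separate from the actual cpuid call so that they can be checked against known register values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>               /* GCC/Clang: __get_cpuid() */
#endif

/* Feature bits reported in ECX by cpuid with EAX=1. */
#define CPUID1_ECX_SSE3   (1u << 0)
#define CPUID1_ECX_SSSE3  (1u << 9)
#define CPUID1_ECX_SSE41  (1u << 19)
#define CPUID1_ECX_SSE42  (1u << 20)

/* cpuid with EAX=0 returns the vendor string in EBX, EDX, ECX (in that
   order), four ASCII characters per register, lowest byte first. */
void vendor_from_regs(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    assert(out != NULL);
    for (int i = 0; i < 12; i++)
        out[i] = (char)((regs[i / 4] >> (8 * (i % 4))) & 0xFF);
    out[12] = '\0';
}

/* The base family field is bits 11:8 of EAX returned by cpuid with EAX=1. */
unsigned base_family(uint32_t leaf1_eax)
{
    return (leaf1_eax >> 8) & 0xF;
}

/* Runs both queries. Returns 1 on success, 0 if cpuid is unavailable. */
int query_cpu(char vendor[13], uint32_t *leaf1_eax, uint32_t *leaf1_ecx)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, b, c, d;
    if (!__get_cpuid(0, &a, &b, &c, &d))
        return 0;
    vendor_from_regs(b, c, d, vendor);
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    *leaf1_eax = a;
    *leaf1_ecx = c;
    return 1;
#else
    (void)vendor; (void)leaf1_eax; (void)leaf1_ecx;
    return 0;
#endif
}

/* Prints a short summary of what the two queries return. */
void print_cpu_summary(void)
{
    char vendor[13];
    uint32_t eax, ecx;
    if (!query_cpu(vendor, &eax, &ecx)) {
        puts("cpuid not available");
        return;
    }
    printf("vendor=%s family=%u sse3=%d ssse3=%d sse4.1=%d sse4.2=%d\n",
           vendor, base_family(eax),
           !!(ecx & CPUID1_ECX_SSE3), !!(ecx & CPUID1_ECX_SSSE3),
           !!(ecx & CPUID1_ECX_SSE41), !!(ecx & CPUID1_ECX_SSE42));
}
```

Calling print_cpu_summary() from main() shows the vendor string and the flags that the dispatcher bases its decision on.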

The pseudo code of __intel_cpu_indicator_init is as follows.

   Call cpuid with EAX=0 and store the result in local variables.
   Call cpuid with EAX=1 and store the result in local variables.

   if (vendor ID string is NOT "GenuineIntel") {
      set "maximum standard cpuid level" to be 0
   }

   if (maximum standard cpuid level is 0) {
      /*
       Maximum standard cpuid level is the max value EAX can have when calling cpuid.
       If this is greater than 0, then calling cpuid with EAX=1 will return
       the feature flags. Otherwise, there is no need to continue.
      */
      return;
   }

   if (CPU family is 15) {
      /* i.e. Pentium 4 and derivatives */
      if (is SSE3 capable) {
        __intel_cpu_indicator = SSE3;
      }
      return;
   }

   if (CPU family is not 6) {
      /* Pentium Pro and everything that comes after Pentium 4,
         i.e. Core 2, Nehalem, Sandy Bridge, etc.,
         is of Family 6.
      */
      return;
   }

   __intel_cpu_indicator = SSE3;

   if (is SSSE3 capable) __intel_cpu_indicator = SSSE3;
   else return;

   if (has MOVBE instruction) __intel_cpu_indicator = MOVBE;
   else return;
   /* MOVBE (move data after swapping bytes) is a new instruction
      introduced in Intel Atom processors. It reverses the byte
      order of a value during a move operation, i.e. it converts
      between big-endian and little-endian representations.

      At the time of writing, MOVBE is only available on Intel
      Atom processors.
   */

   if (is SSE4.1 capable) __intel_cpu_indicator = SSE4.1;
   else return;

   if (has POPCNT instruction and is SSE4.2 capable) __intel_cpu_indicator = SSE4.2;
   else return;
   /* POPCNT is a new instruction introduced in SSE4.2 (Intel)
      and SSE4a (AMD). It counts the number of bit 1's in
      a 64-bit word.
   */

   if (has PCLMULQDQ instruction and is AES capable) __intel_cpu_indicator = PCLMULQDQ;
   else return;
   /* PCLMULQDQ is part of the AES New Instructions (AES-NI).
      It performs carry-less multiplications.
      See here for an application in cryptography.
   */

   if (has XGETBV instruction enabled by the OS) {
      Call XGETBV with ECX=0 and get results in EAX and EDX.
      if (is AVX capable) {
         if (EAX shows that both XMM and YMM states are enabled) {
            __intel_cpu_indicator = AVX;
         }
      }
   }
   /* The XGETBV instruction reads the value of an Extended
      Control Register (XCR0 when ECX=0).
   */

   if (has F16C instructions) __intel_cpu_indicator = F16C;
   /* F16C are 16-bit floating-point conversion instructions */

   (more checks... for AVX2)

   return;

Note that __intel_new_proc_init checks for availability of AVX using XGETBV instruction. This is also the recommended approach in Intel Advanced Vector Extensions Programming Reference.
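
The AVX check can be sketched in C along the lines of the pseudo code. This is an illustration under assumptions, not the compiler's actual code: it assumes GCC or Clang on x86 (for <cpuid.h> and inline assembly), and the names avx_usable, read_xcr0_low and avx_usable_here are hypothetical. The bit positions are the architectural ones: OSXSAVE is bit 27 and AVX is bit 28 of ECX from cpuid with EAX=1, and the XMM and YMM state bits are bits 1 and 2 of XCR0.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE (1u << 27)  /* OS has enabled XSAVE/XGETBV  */
#define CPUID1_ECX_AVX     (1u << 28)  /* CPU implements AVX           */
#define XCR0_XMM_STATE     (1u << 1)   /* OS saves/restores XMM state  */
#define XCR0_YMM_STATE     (1u << 2)   /* OS saves/restores YMM state  */

/* Pure decision logic: takes ECX from cpuid(EAX=1) and the low 32 bits
   of XCR0, and returns 1 only if AVX is safe to use. */
int avx_usable(uint32_t leaf1_ecx, uint32_t xcr0_low)
{
    const uint32_t need = XCR0_XMM_STATE | XCR0_YMM_STATE;
    if (!(leaf1_ecx & CPUID1_ECX_OSXSAVE))
        return 0;
    if (!(leaf1_ecx & CPUID1_ECX_AVX))
        return 0;
    return (xcr0_low & need) == need;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* Executes XGETBV with ECX=0. Only call this after OSXSAVE has been
   confirmed, otherwise the instruction raises an invalid-opcode fault. */
static uint32_t read_xcr0_low(void)
{
    uint32_t eax, edx;
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    (void)edx;                 /* upper half of XCR0, not needed here */
    return eax;
}

/* Performs the whole check on the CPU the program is running on. */
int avx_usable_here(void)
{
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    if (!(c & CPUID1_ECX_OSXSAVE))
        return 0;
    return avx_usable(c, read_xcr0_low());
}
#else
int avx_usable_here(void) { return 0; }
#endif
```

Checking the AVX feature flag alone is not enough: if the operating system does not save the YMM registers on a context switch, AVX code would silently corrupt data, which is why the XCR0 test is part of the check.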

Processor Dispatch for memory operations functions

As mentioned earlier, the Intel Compiler suite has optimized versions of certain C library functions. Most of them use __intel_cpu_indicator only. However, memory operations (e.g. memset, memcpy) use additional Processor Dispatch code to determine the cache sizes: __intel_init_mem_ops_method initializes the following global variables:

  • __intel_memcpy_mem_ops_method (2 for SSE2 capable, 1 for MMX/SSE)
  • _data_cache_size (L1 cache size in bytes)
  • _data_cache_size_half
  • __intel_memcpy_largest_cache_size (last-level cache, usually L3 cache, size in bytes)
  • _largest_cache_size_half
  • __intel_memcpy_largest_cachelinesize (usually 64)

One example is __intel_ssse3_memcpy, which makes use of _data_cache_size_half and _largest_cache_size_half.

How to override the Processor Dispatch?

If you want to override the Processor Dispatch setting, you can add the following code snippet to your program (either include it in a source file, or compile it as an independent object and link it to your program):
     /* In the following, "__attribute__ ((constructor))" means
        this routine ("my_intel_cpu_indicator_override" below, but
        you can pick any name you like) is executed before main().

        Agner Fog's "Optimizing software in C++" guide, Chapter 13,
        uses a similar approach, but on Linux you will see errors
        from the linker ld: "multiple definitions of
        __intel_cpu_indicator_init"

        This error can be fixed by using the "-z muldefs"
        command-line option, which instructs the linker
        to accept multiple definitions. When you do this,
        make sure "__intel_cpu_indicator_init" is in the
        same source file as the main() function.
      */

     #ifdef __INTEL_COMPILER
     void  __attribute__ ((constructor)) my_intel_cpu_indicator_override() {
         extern unsigned int __intel_cpu_indicator;
         __intel_cpu_indicator = 1<<11;
     }
     #endif
where the value assigned to __intel_cpu_indicator tells the runtime which capability the CPU should be recognized as having:

  • 1<<11: SSE3
  • 1<<12: SSSE3
  • 1<<13: SSE4.1
  • 1<<14: MOVBE
  • 1<<15: SSE4.2 and POPCNT
  • 1<<16: PCLMULQDQ and AES
  • 1<<17: AVX
  • 1<<18: F16C and RDRAND
  • 1<<22: AVX2, BMI (bit manipulation instructions), LZCNT, and FMA

If your code is already compiled with the Intel compiler version 12.0 or 13.0, you can still override it using this Perl script.

Why would anyone want to override it? One use case is AMD processors: on them, __intel_cpu_indicator will be 1, the baseline case, even if they are SSE3 capable. Setting it to 1<<11 enables the optimal execution path on SSE3 capable processors.

If one wants to use the Intel optimized memory operation functions on AMD processors, one further needs to set the following variables manually, because Intel's runtime routine _irc_init_cache_tbl checks the CPU vendor string and does not obtain the cache size info through the cpuid instruction if the CPU is not GenuineIntel (TM):

     #ifdef __INTEL_COMPILER
     void __intel_init_mem_ops_method() __attribute__ ((weak));

     void  __attribute__ ((constructor)) my_intel_init_mem_ops_method() {
        if (__intel_init_mem_ops_method) {

            extern unsigned int __intel_memcpy_mem_ops_method,
                                _data_cache_size,
                                _data_cache_size_half,
                                __intel_memcpy_largest_cache_size,
                                _largest_cache_size_half,
                                __intel_memcpy_largest_cachelinesize;

            /* initialize _irc_cache_tbl */
            __intel_init_mem_ops_method();

            /* override the cache parameters */

            __intel_memcpy_mem_ops_method = 2; /* 2 = SSE2 capable */

            /* for AMD Shanghai processors, 64 KB L1 data cache */
            _data_cache_size = 65536;
            _data_cache_size_half = _data_cache_size/2;

            /* for AMD Shanghai processors, 6 MB L3 cache */
            __intel_memcpy_largest_cache_size = 6291456;
            _largest_cache_size_half = __intel_memcpy_largest_cache_size/2;

            __intel_memcpy_largest_cachelinesize = 64;
        }
     }
     #endif
Make sure you know what you are doing. As mentioned earlier, __intel_cpu_indicator is used by Intel optimized functions, and your program could end up with an "illegal instruction" error if the CPU does not support the instructions you specified. Even if your program does not call the tuned versions of the math functions, the -x switch can generate code which uses instructions your CPU does not support.
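
Instead of hard-coding the cache sizes for one AMD model, they can be read from AMD's extended cpuid leaves. The following is a sketch under assumptions, not part of Intel's runtime: it assumes GCC or Clang on x86 (for <cpuid.h>), an AMD processor that implements leaves 0x80000005 and 0x80000006, and the function names are hypothetical. The decoded values could then be assigned to _data_cache_size and __intel_memcpy_largest_cache_size in the constructor shown earlier.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>               /* GCC/Clang: __get_cpuid() */
#endif

/* cpuid 0x80000005 (AMD): ECX[31:24] is the L1 data cache size in KB. */
uint64_t amd_l1d_bytes(uint32_t ecx_80000005)
{
    return (uint64_t)((ecx_80000005 >> 24) & 0xFF) * 1024u;
}

/* cpuid 0x80000005 (AMD): ECX[7:0] is the L1 data cache line size in bytes. */
unsigned amd_l1d_line_bytes(uint32_t ecx_80000005)
{
    return ecx_80000005 & 0xFF;
}

/* cpuid 0x80000006 (AMD): EDX[31:18] is the L3 size in units of 512 KB.
   The field is a lower bound; the real size may be up to 512 KB larger. */
uint64_t amd_l3_bytes(uint32_t edx_80000006)
{
    return (uint64_t)((edx_80000006 >> 18) & 0x3FFF) * 512u * 1024u;
}

/* Queries the running CPU. Returns 1 on success, 0 if the leaves are not
   available. An L3 size of 0 means the processor reports no L3 cache. */
int query_amd_caches(uint64_t *l1d, unsigned *line, uint64_t *l3)
{
    assert(l1d != 0 && line != 0 && l3 != 0);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, b, c, d;
    if (!__get_cpuid(0x80000005u, &a, &b, &c, &d))
        return 0;
    *l1d = amd_l1d_bytes(c);
    *line = amd_l1d_line_bytes(c);
    if (!__get_cpuid(0x80000006u, &a, &b, &c, &d))
        return 0;
    *l3 = amd_l3_bytes(d);
    return 1;
#else
    return 0;
#endif
}
```

For the Shanghai values used above (64 KB L1 data cache, 6 MB L3 cache), the decoders return 65536 and 6291456 when fed the corresponding register fields.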

What are DAZ, FTZ, and MXCSR?

(More info can be found here)

DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) are not part of the IEEE-754 standard, but they can speed up floating-point arithmetic:

  • DAZ: Treats denormal values (used as input to floating-point instructions) as 0.
  • FTZ: Sets denormal results (from floating-point calculations) to 0.

DAZ and FTZ only affect SSE instructions but not the traditional x87 instructions.

If any optimization option is used, Intel Compiler will insert code to enable DAZ/FTZ, unless the -ftz- switch is also used.

Depending on the capability of the CPU, there are different ways to detect & enable DAZ/FTZ. They are just different ways to manipulate the DAZ/FTZ bits in the MXCSR (MMX Extension Control/Status Register). For details, see Chapter 7 and Code Example 9.4 in Intel Processor Identification and the CPUID Instruction or this link.

On a side note, gcc can enable DAZ/FTZ by the -ffast-math switch.
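
Setting the two bits by hand can be sketched in C as follows. This is a minimal illustration, not the code the Intel Compiler inserts: it uses the _mm_getcsr() and _mm_setcsr() intrinsics from <xmmintrin.h>, the names mxcsr_with_daz_ftz and enable_daz_ftz are hypothetical, and it assumes that the CPU supports the DAZ bit (true for all x86-64 processors, but not for the earliest SSE processors, where setting it causes a fault). FTZ is bit 15 and DAZ is bit 6 of the MXCSR.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__)
#include <xmmintrin.h>           /* _mm_getcsr(), _mm_setcsr() */
#define HAVE_MXCSR 1
#endif

#define MXCSR_DAZ (1u << 6)      /* treat denormal inputs as zero  */
#define MXCSR_FTZ (1u << 15)     /* flush denormal results to zero */

/* Returns the given MXCSR value with the DAZ and FTZ bits set and all
   other bits (rounding mode, exception masks) left untouched. */
uint32_t mxcsr_with_daz_ftz(uint32_t mxcsr)
{
    return mxcsr | MXCSR_DAZ | MXCSR_FTZ;
}

/* Enables DAZ and FTZ for the calling thread. Returns 1 on success,
   0 when the MXCSR intrinsics are not available on this target. */
int enable_daz_ftz(void)
{
#ifdef HAVE_MXCSR
    _mm_setcsr(mxcsr_with_daz_ftz(_mm_getcsr()));
    return 1;
#else
    return 0;
#endif
}
```

The MXCSR is per-thread state, so in a multi-threaded program each thread that should run with DAZ/FTZ has to call enable_daz_ftz() itself.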

Intel MKL (Math Kernel Library) Processor Dispatch Code

There is no doubt that Intel MKL uses a similar infamous Processor Dispatch (TM). For example, if you use static linking, you will find that your executable binary contains code from libmkl_core.a, which has different versions of the math functions for different SSE instruction sets. Take the BLAS function daxpy (double-precision a*x+y, where x and y are vectors) as an example. The 64-bit libmkl_core.a contains the following functions:

daxpy implementation function   SSE instruction set
mkl_blas_def_xdaxpy             The default, untuned version, which works for SSE capable processors.
mkl_blas_p4n_xdaxpy             SSE2 version (Pentium 4 processors or better).
mkl_blas_mc_xdaxpy              Supplemental SSE 3 version (Core/Merom processors or better).
mkl_blas_mc3_xdaxpy             SSE4.2 version (Nehalem processors or better).
mkl_blas_avx_xdaxpy             AVX version (Sandy Bridge processors or better).
mkl_blas_avx2_xdaxpy            AVX2 version (Haswell processors or better). [Since MKL version 11.0]

And the 32-bit libmkl_core.a:

daxpy implementation function   SSE instruction set
mkl_blas_def_xdaxpy             The default, untuned version, which works for SSE capable processors.
mkl_blas_p4_xdaxpy              SSE2 version (Pentium 4 processors or better).
mkl_blas_p4p_xdaxpy             SSE3 version (Pentium 4 Prescott processors or better).
mkl_blas_p4m_xdaxpy             Supplemental SSE 3 version (Core/Merom processors or better).
mkl_blas_p4m3_xdaxpy            SSE4.2 version (Nehalem processors or better).
mkl_blas_avx_xdaxpy             AVX version (Sandy Bridge processors or better).

If you instead choose dynamic linking, then these functions are always called mkl_blas_xdaxpy, and you can find them in libmkl_def.so, libmkl_p4n.so, libmkl_mc.so, libmkl_mc3.so, libmkl_avx.so, libmkl_avx2.so, etc.

Anyway, in x86_64 MKL version 10.2.5/10.3/11.0, one can find the following Processor Dispatch (TM) code:

MKL dynamic link library file     Function                    Purpose
libmkl_core.so                    mkl_serv_intel_cpu          Check for Intel processor
libmkl_core.so                    MKL_CPUisINTEL              Check for Intel processor
libmkl_core.so                    mkl_serv_cpuhasnhm          Check for SSE 4.2 (nhm=Nehalem)
libmkl_core.so                    mkl_serv_cpuhaspnr          Check for SSE 4.1 (pnr=Penryn)
libmkl_core.so                    xxxMKL_CPUhasNHMWST         Check for AES (WST=Westmere)
libmkl_core.so                    mkl_serv_cpuisitbarcelona   Check for AMD Barcelona processor
libmkl_{intel|gf}_{lp|ilp}64.so   _vmlserv_getCPUisintel      Check for Intel processor (or mkl_vml_serv_getCPUisintel)
libmkl_{intel|gf}_{lp|ilp}64.so   mkl_vml_serv_CPUisHSW       Check for AVX2 (HSW=Haswell)
libmkl_{intel|gf}_{lp|ilp}64.so   _vmlserv_CPUisGSSE          Check for AVX (or mkl_vml_serv_CPUisGSSE)
libmkl_{intel|gf}_{lp|ilp}64.so   _vmlserv_CPUisSSE42         Check for SSE 4.2 (or mkl_vml_serv_CPUisSSE42)
libmkl_{intel|gf}_{lp|ilp}64.so   _vmlserv_CPUisSSE41         Check for SSE 4.1 (or mkl_vml_serv_CPUisSSE41)
libmkl_{intel|gf}_{lp|ilp}64.so   _vmlserv_CPUisSSE4          Check for Supplemental SSE 3 (or mkl_vml_serv_CPUisSSE4)

In x86_64 MKL version 10.2.2, one can find the following code:

MKL dynamic link library file      Function                        Purpose
libmkl_core.so                     mkl_serv_intel_cpu              Check for Intel processor
libmkl_core.so                     MKL_CPUisINTEL                  Check for Intel processor
libmkl_core.so                     MKL_CPUhasNHMx                  Check for SSE 4.2 (NHMx=Nehalem)
libmkl_core.so                     mkl_serv_cpuhasnhm              Check for SSE 4.2 (nhm=Nehalem)
libmkl_core.so                     mkl_serv_cpuhaspnr              Check for SSE 4.1 (pnr=Penryn)
libmkl_core.so                     MKL_CPUhasMNI                   Check for Supplemental SSE 3 (MNI=Merom New Instructions)
libmkl_core.so                     MKL_CPUhasSSE3                  Check for SSE 3
libmkl_core.so                     MKL_CPUhasAVX                   Check for AVX
libmkl_core.so                     mkl_serv_cpuisitbarcelona       Check for AMD Barcelona processor
libmkl_{intel|gnu|pgi}_thread.so   GetAPIC_ID                      Get APIC ID. This is to determine the processor/core topology/enumeration. See here or here for more info. (*)
libmkl_{intel|gnu|pgi}_thread.so   MaxCorePerPhysicalProc          Get number of cores per physical processor. This can help optimize cache usage.
libmkl_{intel|gnu|pgi}_thread.so   MaxLogicalProcPerPhysicalProc   Get number of logical cores per physical processor.
libmkl_{intel|gnu|pgi}_thread.so   GetCpuIdInfo                    Check for Intel processor
libmkl_{intel|gnu|pgi}_thread.so   CountProcNum_omp                Check for Intel processor and count the number of processors
libmkl_{intel|gf}_{lp|ilp}64.so    _vmlserv_getCPUisintel          Check for Intel processor
libmkl_{intel|gf}_{lp|ilp}64.so    _vmlserv_CPUisGSSE              Check for AVX
libmkl_{intel|gf}_{lp|ilp}64.so    _vmlserv_CPUisSSE42             Check for SSE 4.2
libmkl_{intel|gf}_{lp|ilp}64.so    _vmlserv_CPUisSSE41             Check for SSE 4.1
libmkl_{intel|gf}_{lp|ilp}64.so    _vmlserv_CPUisSSE4              Check for Supplemental SSE 3

(*) Why does MKL need to know this? Because when running the multi-threaded version of MKL, MKL will by default ignore the "extra" logical cores created by the HyperThreading technology.
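
The kind of information that routines such as GetAPIC_ID and MaxLogicalProcPerPhysicalProc are named after can be read from cpuid with EAX=1. The sketch below only shows the classic (legacy) decoding of that leaf and is not MKL's implementation: it assumes GCC or Clang on x86 (for <cpuid.h>), all function names are hypothetical, and newer processors provide more precise topology information through other cpuid leaves.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>               /* GCC/Clang: __get_cpuid() */
#endif

#define CPUID1_EDX_HTT (1u << 28)   /* package may hold >1 logical CPU */

/* cpuid(EAX=1): EBX[31:24] is the initial APIC ID of the logical
   processor that executed the instruction. */
unsigned initial_apic_id(uint32_t leaf1_ebx)
{
    return leaf1_ebx >> 24;
}

/* cpuid(EAX=1): EBX[23:16] is the maximum number of addressable logical
   processor IDs per physical package. The field is only meaningful when
   the HTT flag in EDX is set; otherwise there is one logical processor. */
unsigned max_logical_per_package(uint32_t leaf1_ebx, uint32_t leaf1_edx)
{
    if (!(leaf1_edx & CPUID1_EDX_HTT))
        return 1;
    return (leaf1_ebx >> 16) & 0xFF;
}

/* Queries the logical processor the calling thread is running on.
   Returns 1 on success, 0 if cpuid is unavailable. */
int query_topology(unsigned *apic_id, unsigned *logical_per_package)
{
    assert(apic_id != 0 && logical_per_package != 0);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    *apic_id = initial_apic_id(b);
    *logical_per_package = max_logical_per_package(b, d);
    return 1;
#else
    return 0;
#endif
}
```

Because the APIC ID belongs to the logical processor that executes cpuid, a thread has to be pinned to each logical processor in turn to enumerate the whole machine.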

How to override MKL's Processor Dispatch?

There is an undocumented environment variable called MKL_DEBUG_CPU_TYPE which allows users to select the SSE instruction set at runtime. Setting this environment variable to 0 chooses the "def" version, 1 the "p4n" version, 2 the "mc" version, and so on.
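
The variable is normally set in the shell before the program starts, but it can also be set from inside the program. The sketch below is an illustration under an assumption that has not been verified here: that MKL reads MKL_DEBUG_CPU_TYPE the first time one of its routines is called, so the variable must be set before any MKL call. The function name select_mkl_code_path is hypothetical; setenv() is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sets MKL_DEBUG_CPU_TYPE to the given level (0 = "def", 1 = "p4n",
   2 = "mc", ...). Returns 0 on success, -1 on failure, like setenv().
   Assumption: call this before the first MKL routine is invoked. */
int select_mkl_code_path(int type)
{
    char value[16];
    assert(type >= 0);
    snprintf(value, sizeof value, "%d", type);
    return setenv("MKL_DEBUG_CPU_TYPE", value, 1);
}
```

For example, calling select_mkl_code_path(2) at the very beginning of main() requests the "mc" version. As with __intel_cpu_indicator, selecting a level that the CPU does not support can lead to an "illegal instruction" error.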

Are there other MKL internal parameters?

(This has been verified in MKL versions 10.2.2, 10.2.5, and 10.3.) MKL has internal parameters that determine the optimal execution path, including the number of threads, block sizes (as in certain BLAS functions), etc. MKL_DEBUG_CPU_TYPE is only one of them, and most of them are not exposed to the user's program. Here is a list of them:

  • disable_fast_mm: Enable/disable fast memory management. One should instead use the environment variable MKL_DISABLE_FAST_MM. Fast memory management is used only when MKL allocates certain sizes of memory chunks in certain BLAS functions (e.g. dgemm).
  • __MKL_CPU_MicroArchitecture: The CPU microarchitecture. This is not as useful as MKL_DEBUG_CPU_TYPE. See the MKL_DEBUG_CPU_MA section here.
  • itisBarcelona: When __MKL_CPU_MicroArchitecture is 0, this parameter indicates whether the CPU is AMD Barcelona or not.
  • mkl_cpu_type: SSE instruction set level. It has the same value as the environment variable MKL_DEBUG_CPU_TYPE.
  • __HT: Whether Intel Hyper-Threading technology is present or not.
  • __N_Logical_Cores, __N_Physical_Cores, __N_CPU_Packages, __N_Cores_per_Packages: Processor topology. __N_Physical_Cores is used to determine the number of threads to be used. One should instead use the environment variables OMP_NUM_THREADS or MKL_NUM_THREADS.
  • MKL_cache_sizes: The levels of on-chip cache and their sizes in bytes.

If one really needs to modify these internal parameters in the program, use this code snippet.

Intel OpenMP Processor Dispatch Code

Intel OpenMP is used by MKL for multi-threading support. Intel OpenMP also implements a GOMP (GNU OpenMP) interface, so GCC-compiled OpenMP programs can use Intel's OpenMP runtime library. As such, Intel OpenMP (libiomp5.so) also contains its own Processor Dispatch code. The code is the same as Intel Compiler's Processor Dispatch Code, except all the functions and variables now have __kmp_external_ prefix, e.g. __kmp_external___intel_new_proc_init, __kmp_external___intel_cpu_indicator, etc.

Intel MPI Processor Dispatch Code

Intel MPI contains the same Processor Dispatch code as mentioned above, except all the functions and variables now have __I_MPI_ prefix, e.g. __I_MPI___intel_new_proc_init_L, __I_MPI___intel_cpu_indicator, etc.

As of versions 4.0.0 and 4.0.1, Intel MPI has its own additional Processor Dispatch code to determine the algorithms for collective operations. First, Intel MPI's MPD (multi-purpose daemon) script (mpd or mpd.py) contains a function called pin_Topology, which executes the cpuinfo utility in the same directory with a single command-line argument p. The MPD script then reads the result of this command and sets I_MPI_INFO_-prefixed environment variables (e.g. I_MPI_INFO_STATE, I_MPI_INFO_C_NAME, I_MPI_INFO_CACHE1, etc.), which are then read by the Intel MPI run-time code, e.g. libmpi.so. Based on the values of these environment variables, the Intel MPI run-time code sets an internal variable called I_MPI_Platform, which is used to determine the algorithms for collective operations (I_MPI_COLL_DEFAULT, I_MPI_COLL_DEFAULT_HTN, I_MPI_COLL_DEFAULT_NHM, I_MPI_COLL_DEFAULT_WSM).

There is an undocumented environment variable called I_MPI_PLATFORM which allows users to override the default value of I_MPI_Platform. See here for more info.