A good reference on this topic is Agner Fog's Optimizing software in C++, Chapter 13 (available here), and his blog.
Also see a recent article here.
You are eligible for Intel compiler reimbursement if you meet some criteria.
-x switch | Processor Dispatch Routine |
---|---|
(none) | __intel_new_proc_init |
-xsse2 | __intel_new_proc_init |
-xsse3 -xP | __intel_new_proc_init_P |
-xssse3 -xsse3_atom -xT | __intel_new_proc_init_T |
-xsse4.1 -xS | __intel_new_proc_init_S |
-xsse4.2 -xH | __intel_new_proc_init_H |
-xavx -xG | __intel_new_proc_init_G |
-xcore-avx-i | __intel_new_proc_init_I |
-xcore-avx2 | __intel_new_proc_init_E |
All of the __intel_new_proc_init_* routines call __intel_cpu_indicator_init to determine the capability of the CPU. They then call their respective routines to display an error message (via the irc__print function) if the CPU is not capable enough, such as
Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) processors with Intel(R) AVX instructions support.
and to enable the DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) flags in the MXCSR (MMX Extension Control/Status Register). Therefore, it is enough for us to analyze __intel_cpu_indicator_init in detail.
Anatomy of __intel_cpu_indicator_init
As the name suggests, __intel_cpu_indicator_init initializes a global variable called __intel_cpu_indicator. This variable is consulted in many places to determine the optimal execution path, e.g. by Intel-optimized math functions such as sin and cos, and by C runtime functions such as memcpy and strchr (in the latter case, one can find functions with names like __intel_ssse3_memcpy, __intel_ssse3_strchr, and __intel_sse4_strchr). Initially, __intel_cpu_indicator is 0. __intel_cpu_indicator_init relies on the cpuid instruction, which takes the value in the EAX register as input and puts its output in the EAX, EBX, ECX, and EDX registers. It only calls cpuid with EAX=0 (to get the vendor ID string and the maximum standard cpuid level) and EAX=1 (to get the feature flags). A detailed explanation of cpuid can be found here.
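As a point of reference, here is a minimal sketch (my illustration, not Intel's code) of reading the same information from C using GCC/Clang's <cpuid.h> helpers; the bit positions are the standard CPUID leaf 1 feature flags.

```c
#include <cpuid.h>   /* GCC/Clang helpers for the cpuid instruction */
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    /* cpuid leaf 0: maximum standard leaf (EAX) and vendor ID string (EBX,EDX,ECX) */
    unsigned int max_leaf = __get_cpuid_max(0, NULL);
    __cpuid(0, eax, ebx, ecx, edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    printf("vendor = %s, max standard leaf = %u\n", vendor, max_leaf);

    if (max_leaf < 1)
        return 0;

    /* cpuid leaf 1: feature flags in ECX/EDX */
    __cpuid(1, eax, ebx, ecx, edx);
    printf("SSE3   : %u\n", (ecx >> 0)  & 1);
    printf("SSSE3  : %u\n", (ecx >> 9)  & 1);
    printf("SSE4.1 : %u\n", (ecx >> 19) & 1);
    printf("SSE4.2 : %u\n", (ecx >> 20) & 1);
    printf("POPCNT : %u\n", (ecx >> 23) & 1);
    printf("AES    : %u\n", (ecx >> 25) & 1);
    printf("OSXSAVE: %u\n", (ecx >> 27) & 1);
    printf("AVX    : %u\n", (ecx >> 28) & 1);
    return 0;
}
```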
The pseudo code of __intel_cpu_indicator_init is as follows.
```
Call cpuid with EAX=0 and store the result in local variables.
Call cpuid with EAX=1 and store the result in local variables.

if (vendor ID string is NOT "GenuineIntel") {
    set "maximum standard cpuid level" to 0;
}
if (maximum standard cpuid level is 0) {
    /* The maximum standard cpuid level is the largest value EAX can take
       when calling cpuid. If it is greater than 0, calling cpuid with EAX=1
       returns the feature flags. Otherwise, there is no need to continue. */
    return;
}
if (CPU family is 15) {   /* i.e. Pentium 4 and derivatives */
    if (is SSE3 capable) {
        __intel_cpu_indicator = SSE3;
    }
    return;
}
if (CPU family is not 6) {
    /* Pentium Pro and everything that comes after Pentium 4 (Core 2,
       Nehalem, Sandy Bridge, etc.) are of family 6. */
    return;
}
__intel_cpu_indicator = SSE3;

if (is SSSE3 capable) __intel_cpu_indicator = SSSE3; else return;

if (has MOVBE instruction) __intel_cpu_indicator = MOVBE; else return;
/* MOVBE (move data after swapping bytes) was introduced with the Intel Atom.
   It swaps the byte order of a value during a move operation and is currently
   only available on Intel Atom processors. */

if (is SSE4.1 capable) __intel_cpu_indicator = SSE4.1; else return;

if (has POPCNT instruction and is SSE4.2 capable) __intel_cpu_indicator = SSE4.2; else return;
/* POPCNT was introduced with SSE4.2 (Intel) and SSE4a (AMD).
   It counts the number of 1 bits in a word. */

if (has PCLMULQDQ instruction and is AES capable) __intel_cpu_indicator = PCLMULQDQ; else return;
/* PCLMULQDQ is part of the AES New Instructions (AES-NI). It performs
   carry-less multiplication. See here for an application in cryptography. */

if (the XGETBV instruction is enabled by the OS) {
    Call XGETBV with ECX=0 and get the result in EDX:EAX.
    if (is AVX capable)
        if (EAX shows that both XMM and YMM state are enabled by the OS) {
            __intel_cpu_indicator = AVX;
        }
}
/* XGETBV reads the value of an Extended Control Register (here XCR0). */

if (has F16C instructions) __intel_cpu_indicator = F16C;
/* F16C provides 16-bit floating-point conversion instructions. */

(more checks... for AVX2)
return;
```
Note that __intel_cpu_indicator_init checks whether AVX can actually be used via the XGETBV instruction. This is also the recommended approach in the Intel Advanced Vector Extensions Programming Reference.
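A minimal sketch of that check in plain C (again my illustration, not Intel's code) could look as follows; it uses GCC/Clang's <cpuid.h> and a small inline-assembly wrapper for XGETBV.

```c
#include <cpuid.h>
#include <stdio.h>

/* Read an extended control register (XCR); here we only need XCR0. */
static unsigned long long read_xcr0(void)
{
    unsigned int eax, edx;
    /* xgetbv with ECX=0 returns XCR0 in EDX:EAX */
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((unsigned long long)edx << 32) | eax;
}

static int os_supports_avx(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* The CPU must report AVX, and OSXSAVE must be set, i.e. the OS uses
       XSAVE/XRSTOR and exposes XGETBV to user code. */
    if (!(ecx & (1u << 28)) || !(ecx & (1u << 27)))
        return 0;

    /* XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be enabled. */
    return (read_xcr0() & 0x6) == 0x6;
}

int main(void)
{
    printf("AVX usable: %s\n", os_supports_avx() ? "yes" : "no");
    return 0;
}
```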
Processor Dispatch for memory operation functions
As mentioned earlier, the Intel Compiler suite has optimized versions of certain C library functions. Most of them consult __intel_cpu_indicator only. However, the memory operation functions (e.g. memset, memcpy) use additional Processor Dispatch code to determine the cache sizes: __intel_init_mem_ops_method initializes the global variables __intel_memcpy_mem_ops_method (2 for SSE2-capable CPUs, 1 for MMX/SSE), _data_cache_size (L1 data cache size in bytes), _data_cache_size_half, __intel_memcpy_largest_cache_size (last-level cache, usually L3, size in bytes), _largest_cache_size_half, and __intel_memcpy_largest_cachelinesize (usually 64). One example is __intel_ssse3_memcpy, which makes use of _data_cache_size_half and _largest_cache_size_half.
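To illustrate why those cache-size thresholds matter, here is a hedged, simplified sketch (not Intel's actual __intel_ssse3_memcpy; the variable names and sizes are made up for illustration) of how a copy routine typically switches to non-temporal stores once the copy exceeds a cache-related threshold.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Stand-ins for the globals set up by __intel_init_mem_ops_method. */
static size_t data_cache_size_half    = 16 * 1024;       /* e.g. half of a 32 KB L1d */
static size_t largest_cache_size_half = 4 * 1024 * 1024; /* e.g. half of an 8 MB LLC */

/* Simplified copy: assumes dst/src are 16-byte aligned and n is a multiple of 16. */
static void copy_aligned(void *dst, const void *src, size_t n)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i, blocks = n / 16;

    if (n <= largest_cache_size_half) {
        /* Small/medium copy: regular stores keep the data in cache,
           which is usually what the caller wants. */
        for (i = 0; i < blocks; i++)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
    } else {
        /* Huge copy: non-temporal stores bypass the cache so the copy
           does not evict the working set. */
        for (i = 0; i < blocks; i++)
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        _mm_sfence();   /* make the streaming stores globally visible */
    }
    (void)data_cache_size_half;  /* a real implementation also tunes on the L1 size */
}
```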
How to override the Processor Dispatch?
If you want to override the Processor Dispatch setting, you can add the following code snippet to your program (either include it in a source file, or compile it as an independent object and link it into your program):

```c
/* "__attribute__ ((constructor))" means this routine
   ("my_intel_cpu_indicator_override" below, but one can pick any name)
   is executed before main().

   Agner Fog's "Optimizing software in C++" guide, Chapter 13, uses a similar
   approach (providing your own __intel_cpu_indicator_init), but on Linux you
   will then see an error from the linker ld:
       "multiple definitions of __intel_cpu_indicator_init"
   This error can be fixed with the "-z muldefs" command-line option, which
   instructs the linker to accept multiple definitions. If you do that, make
   sure "__intel_cpu_indicator_init" is in the same source file as main(). */
#ifdef __INTEL_COMPILER
void __attribute__ ((constructor)) my_intel_cpu_indicator_override()
{
    extern unsigned int __intel_cpu_indicator;
    __intel_cpu_indicator = 1<<11;
}
#endif
```

Here:
- 1<<11 means the CPU should be recognized as SSE3 capable,
- 1<<12 as SSSE3 capable,
- 1<<14 as MOVBE capable,
- 1<<13 as SSE4.1 capable,
- 1<<15 as SSE4.2 & POPCNT capable,
- 1<<16 as PCLMULQDQ & AES capable,
- 1<<17 as AVX capable,
- 1<<18 as F16C and RDRAND capable,
- 1<<22 as AVX2, BMI (bit manipulation instructions), LZCNT, and FMA capable.
If your code is already compiled with the Intel compiler version 12.0 or 13.0, you can still override it using this Perl script.
Why would anyone want to override it? One use case is AMD processors: on them, __intel_cpu_indicator will be 1 (the baseline case) even if they are SSE3 capable. Setting it to 1<<11 enables the optimal execution path for SSE3-capable processors.
If one wants to use the Intel-optimized memory operation functions on AMD processors, one needs to further set the following variables manually, because Intel's runtime routine _irc_init_cache_tbl checks the CPU vendor string and is not able to obtain cache size information through the cpuid instruction if the CPU is not GenuineIntel(TM):
```c
#ifdef __INTEL_COMPILER
void __intel_init_mem_ops_method() __attribute__ ((weak));

void __attribute__ ((constructor)) my_intel_init_mem_ops_method()
{
    if (__intel_init_mem_ops_method) {
        extern unsigned int __intel_memcpy_mem_ops_method,
                            _data_cache_size,
                            _data_cache_size_half,
                            __intel_memcpy_largest_cache_size,
                            _largest_cache_size_half,
                            __intel_memcpy_largest_cachelinesize;

        /* initialize _irc_cache_tbl */
        __intel_init_mem_ops_method();

        /* override the cache parameters */
        __intel_memcpy_mem_ops_method = 2;   /* 2 = SSE2 capable */

        /* for AMD Shanghai processors, 64 KB L1 data cache */
        _data_cache_size = 65536;
        _data_cache_size_half = _data_cache_size/2;

        /* for AMD Shanghai processors, 6 MB L3 cache */
        __intel_memcpy_largest_cache_size = 6291456;
        _largest_cache_size_half = __intel_memcpy_largest_cache_size/2;

        __intel_memcpy_largest_cachelinesize = 64;
    }
}
#endif
```

Make sure you know what you are doing. As mentioned earlier, __intel_cpu_indicator is used by the Intel-optimized functions, and your program could end up with an "illegal instruction" error if the CPU does not support the instructions you specified. Even if your program does not call the tuned versions of the math functions, the -x switch can generate code that uses instructions your CPU does not support.
What are DAZ, FTZ, and MXCSR?
(More info can be found here.) DAZ (Denormals Are Zero) and FTZ/FZ (Flush To Zero) are not part of the IEEE-754 standard, but they can speed up floating-point arithmetic:
- DAZ: Treats denormal values (used as input to floating-point instructions) as 0.
- FTZ: Sets denormal results (from floating-point calculations) to 0.
DAZ and FTZ only affect SSE instructions, not the traditional x87 instructions.
If any optimization option is used, the Intel Compiler will insert code to enable DAZ/FTZ, unless the -ftz- switch is also used.
Depending on the capability of the CPU, there are different ways to detect and enable DAZ/FTZ; they are all just different ways to manipulate the DAZ and FTZ bits in the MXCSR (MMX Extension Control/Status Register). For details, see Chapter 7 and Code Example 9.4 in Intel Processor Identification and the CPUID Instruction, or this link.
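As an illustration (not the exact code the compiler inserts), the same flags can be toggled from C with the SSE control/status intrinsics; bit 15 of MXCSR is FTZ and bit 6 is DAZ, and xmmintrin.h/pmmintrin.h provide convenience macros for both. Note that the DAZ bit is only honored on CPUs that support it.

```c
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr/_mm_setcsr */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <stdio.h>

int main(void)
{
    /* Convenience macros: set FTZ (bit 15) and DAZ (bit 6) in MXCSR. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* Equivalent manual manipulation of the MXCSR register. */
    unsigned int mxcsr = _mm_getcsr();
    mxcsr |= (1u << 15) | (1u << 6);   /* FTZ | DAZ */
    _mm_setcsr(mxcsr);

    printf("MXCSR = 0x%08x\n", _mm_getcsr());
    return 0;
}
```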
On a side note, gcc can enable DAZ/FTZ by the -ffast-math switch.
Intel MKL (Math Kernel Library) Processor Dispatch Code
There is no doubt that Intel MKL uses a similar, infamous Processor Dispatch (TM). For example, if you use static linking, you will find that your executable contains code from libmkl_core.a, which has different versions of the math functions for different SSE instruction sets. Take the BLAS function daxpy (double-precision a*x + y, where x and y are vectors; a small usage example is given after the tables below). The 64-bit libmkl_core.a contains the following functions:
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4n_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_mc_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_mc3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
mkl_blas_avx2_xdaxpy | AVX2 version (Haswell processors or better). [Since MKL version 11.0] |
And the 32-bit libmkl_core.a:
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_p4p_xdaxpy | SSE3 version (Pentium 4 Prescott processors or better). |
mkl_blas_p4m_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_p4m3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
If you instead choose dynamic linking, then these functions are always called mkl_blas_xdaxpy, and you can find them in libmkl_def.so, libmkl_p4n.so, libmkl_mc.so, libmkl_mc3.so, libmkl_avx.so, libmkl_avx2.so, etc.
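For completeness, here is a small example of calling daxpy through MKL's CBLAS interface; which mkl_blas_*_xdaxpy kernel actually runs is decided at run time by the dispatch code described here. The call itself is standard CBLAS, so nothing in this snippet is MKL-specific apart from the header name.

```c
#include <mkl_cblas.h>   /* or simply <mkl.h> */
#include <stdio.h>

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};
    double a = 2.0;

    /* y := a*x + y; MKL dispatches to the best daxpy kernel for this CPU. */
    cblas_daxpy(4, a, x, 1, y, 1);

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* prints 12 24 36 48 */
    return 0;
}
```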
Anyway, in the x86_64 MKL versions 10.2.5, 10.3, and 11.0, one can find the following Processor Dispatch (TM) code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | xxxMKL_CPUhasNHMWST | Check for AES (WST=Westmere) |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel (or mkl_vml_serv_getCPUisintel) | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | mkl_vml_serv_CPUisHSW | Check for AVX2 (HSW=Haswell) |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE (or mkl_vml_serv_CPUisGSSE) | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 (or mkl_vml_serv_CPUisSSE42) | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 (or mkl_vml_serv_CPUisSSE41) | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 (or mkl_vml_serv_CPUisSSE4) | Check for Supplemental SSE 3 |
In x86_64 MKL version 10.2.2, one can find the following code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | MKL_CPUhasNHMx | Check for SSE 4.2 (NHMx=Nehalem) |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | MKL_CPUhasMNI | Check for Supplemental SSE 3 (MNI=Merom New Instructions) |
libmkl_core.so | MKL_CPUhasSSE3 | Check for SSE 3 |
libmkl_core.so | MKL_CPUhasAVX | Check for AVX |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gnu|pgi}_thread.so | GetAPIC_ID | Get the APIC ID, which is used to determine the processor/core topology and enumeration. See here or here for more info. Why does MKL need to know this? Because the multi-threaded version of MKL will, by default, ignore the "extra" logical cores created by Hyper-Threading. |
libmkl_{intel|gnu|pgi}_thread.so | MaxCorePerPhysicalProc | Get number of cores per physical processor. This can help optimize cache usage. |
libmkl_{intel|gnu|pgi}_thread.so | MaxLogicalProcPerPhysicalProc | Get number of logical cores per physical processor. |
libmkl_{intel|gnu|pgi}_thread.so | GetCpuIdInfo | Check for Intel processor |
libmkl_{intel|gnu|pgi}_thread.so | CountProcNum_omp | Check for Intel processor and count the number of processors |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 | Check for Supplemental SSE 3 |
How to override MKL's Processor Dispatch?
There is an undocumented environmental variable called MKL_DEBUG_CPU_TYPE which allows users to select the SSE instruction set at run time: setting it to 0 chooses the "def" version, 1 the "p4n" version, 2 the "mc" version, and so on.
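The variable is normally set in the shell (e.g. MKL_DEBUG_CPU_TYPE=2 ./a.out). As a hedged illustration, it can also be set from inside the program, provided this happens before the first MKL call triggers the dispatcher; whether MKL honors it this way may depend on the MKL version.

```c
#include <stdlib.h>      /* setenv */
#include <stdio.h>
#include <mkl_cblas.h>

int main(void)
{
    /* Must be done before the first MKL routine runs, because the
       dispatcher reads the variable when it initializes. */
    setenv("MKL_DEBUG_CPU_TYPE", "2", 1);   /* 2 = the "mc" code path (per the mapping above) */

    double x[3] = {1, 2, 3}, y[3] = {0, 0, 0};
    cblas_daxpy(3, 1.0, x, 1, y, 1);        /* now runs the selected kernel */

    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}
```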
Are there other MKL internal parameters?
(This has been verified in MKL versions 10.2.2, 10.2.5, and 10.3.) MKL has internal parameters that determine the optimal execution path, including the number of threads, block sizes (as used in certain BLAS functions), etc. MKL_DEBUG_CPU_TYPE is only one of them, and most of them are not exposed to the user's program. Here is a list of them:
MKL internal parameter | Purpose |
---|---|
disable_fast_mm | Enable/disable fast memory management. One should instead use the environmental variable MKL_DISABLE_FAST_MM. Fast memory management is used only when MKL allocates certain sizes of memory chunks in certain BLAS functions (e.g. dgemm) |
__MKL_CPU_MicroArchitecture | The CPU microarchitecture. This is not as useful as MKL_DEBUG_CPU_TYPE. See the MKL_DEBUG_CPU_MA section here |
itisBarcelona | When __MKL_CPU_MicroArchitecture is 0, this parameter indicates whether the CPU is AMD Barcelona or not. |
mkl_cpu_type | SSE instruction set level. It has the same value as the environmental variable MKL_DEBUG_CPU_TYPE |
__HT | Intel Hyper-Threading technology is present or not |
__N_Logical_Cores __N_Physical_Cores __N_CPU_Packages __N_Cores_per_Packages | Processor topology. __N_Physical_Cores is used to determine the number of threads to be used. One should instead use the environmental variables OMP_NUM_THREADS or MKL_NUM_THREADS. |
MKL_cache_sizes | The levels of on-chip cache and their sizes in bytes. |
If one really needs to modify these internal parameters in the program, use this code snippet.
Intel OpenMP Processor Dispatch Code
Intel OpenMP is used by MKL for multi-threading support. Intel OpenMP also implements the GOMP (GNU OpenMP) interface, so GCC-compiled OpenMP programs can use Intel's OpenMP runtime library. As such, Intel OpenMP (libiomp5.so) also contains its own Processor Dispatch code. The code is the same as the Intel Compiler's Processor Dispatch code, except that all the functions and variables now carry the __kmp_external_ prefix, e.g. __kmp_external___intel_new_proc_init, __kmp_external___intel_cpu_indicator, etc.
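Assuming the prefixed symbols behave exactly like their compiler counterparts (I have not verified this against every libiomp5.so build), the override trick shown earlier should carry over by simply renaming the variable; treat the following as a sketch, not a supported interface.

```c
/* Hedged sketch: override the CPU indicator used inside libiomp5.so,
   analogous to the __intel_cpu_indicator override shown earlier.
   Assumes the __kmp_external_-prefixed symbol is visible and writable. */
#ifdef __INTEL_COMPILER
void __attribute__ ((constructor)) my_kmp_cpu_indicator_override()
{
    extern unsigned int __kmp_external___intel_cpu_indicator;
    __kmp_external___intel_cpu_indicator = 1<<11;   /* report SSE3 capable */
}
#endif
```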
Intel MPI Processor Dispatch Code
Intel MPI contains the same Processor Dispatch code as mentioned above, except that all the functions and variables now carry the __I_MPI_ prefix, e.g. __I_MPI___intel_new_proc_init_L, __I_MPI___intel_cpu_indicator, etc. As of versions 4.0.0 and 4.0.1, Intel MPI also has its own additional Processor Dispatch code to determine the algorithms for collective operations. First, Intel MPI's MPD (multi-purpose daemon) script (mpd or mpd.py) contains a function called pin_Topology, which executes the cpuinfo utility in the same directory with a single command-line argument p. The MPD script then reads the result of this command and sets I_MPI_INFO_-prefixed environmental variables (e.g. I_MPI_INFO_STATE, I_MPI_INFO_C_NAME, I_MPI_INFO_CACHE1, etc.), which are then read by the Intel MPI run-time code, e.g. libmpi.so. Based on the values of these environmental variables, the Intel MPI run-time code sets an internal variable called I_MPI_Platform, which is used to determine the algorithms for collective operations (I_MPI_COLL_DEFAULT, I_MPI_COLL_DEFAULT_HTN, I_MPI_COLL_DEFAULT_NHM, I_MPI_COLL_DEFAULT_WSM).
There is an undocumented environmental variable called I_MPI_PLATFORM which allows users to override the default value for I_MPI_Platform. See here for more info.