CPU/hardware performance counters on Linux

Acronyms

CCCR: Counter configuration control register (Intel specific)
CPL: Current privilege level
DEAR: Data event address register (Intel specific). This register records the program counter of the instruction which causes the most recent data cache (or data TLB cache) miss.
DS: Debug storage (Intel specific)
ESCR: Event selection control register (Intel specific)
FMA: Fused multiply-add
FMAC: Fused multiply-accumulate
IBS: Instruction-based sampling (AMD specific), an idea very similar to PEBS
IEAR: Instruction event address register (Intel specific). This register records the program counter of the instruction which causes the most recent instruction cache (or instruction TLB cache) miss.
IEBS: Imprecise event based sampling (Intel specific)
ISR: Interrupt service routine
MSR: Model specific register
PEBS: Precise event based sampling (Intel specific)
PMC: Performance monitoring counter
PMI: Performance monitoring interrupt (Intel specific). This interrupt is generated when a counter overflows and has been programmed to generate an interrupt, or when the PEBS interrupt threshold has been reached.
PMU: Performance monitoring unit (Intel specific)
QEAR: QuickPath Interconnect event address register (Intel specific)
SSE: Streaming SIMD extension
SMM: System management mode (Intel specific)
TSC: Time-stamp counter
VMX: Virtual machine extension (Intel specific)

Useful resources for Intel x86 chips: the Intel 64 & IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2, and Appendix B of the Intel 64 & IA-32 Architectures Optimization Reference Manual.

Useful resources for AMD chips: the BIOS & Kernel Developer's Guide for AMD Athlon 64 & AMD Opteron Processors, and the AMD64 Architecture Programmer's Manual, Volume 2: System Programming.


Linux PerfCtr

PerfCtr adds support to the Linux kernel (2.6.x) for hardware performance-monitoring counters on x86, x86-64, PowerPC, and certain ARM processors. It consists of a kernel patch (a driver) and a user library. Version 2.6 creates a device file called /dev/perfctr, and version 2.7 creates a SysFs directory at /sys/class/perfctr/*. The user library accesses the counters by manipulating these special files through open, close, read, and ioctl calls.

Linux PerfMon (PFM)

PerfMon is very similar to PerfCtr. It was originally designed specifically for Itanium (IA64) and now supports x86 and other architectures. It also consists of a kernel patch and a user library. The 3.x version creates a SysFs directory at /sys/kernel/perfmon/*. The user library accesses hardware performance-monitoring counters through system calls such as
syscall(PFM_pfm_create, ...)

Since Linux 2.6.32, the kernel provides the system call perf_event_open, which PerfMon version 4.x uses. It returns a file descriptor, which can be controlled and accessed using ioctl and read.

For example, to enable/disable counters, use

ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0)
where fd is the file descriptor.

PerfMon has some environment variables to control its runtime behavior: LIBPFM_VERBOSE, LIBPFM_DEBUG, LIBPFM_DEBUG_STDOUT, LIBPFM_FORCE_PMU, LIBPFM_ENCODE_INACTIVE

Performance Counters for Linux (PCL)

PCL is intended to replace PerfMon as the preferred performance monitoring framework for Linux, since it can also monitor events that are not CPU-related (e.g. software events such as page faults and context switches). It has been part of the kernel source since Linux version 2.6.31. Like PerfMon it uses system calls, but it introduces only one new system call, perf_counter_open (renamed perf_event_open in Linux 2.6.32), which sets up an event and returns a file descriptor. Users can then manipulate the event through read (to read counts), close (to terminate the monitoring), ioctl (additional controls such as enable, disable, reset, and refresh), mmap (kernel event buffer mapping), and fcntl (sample notification) on this file descriptor.

Virtual Machines

The performance monitoring unit inside the CPU is not necessarily virtualized by every virtual machine monitor. For example, VirtualBox does not expose it. There are several ways to check its availability:
  • lsmod |grep perfctr
    (to test for PerfCtr)

  • ls /dev/perfctr
    (to test for PerfCtr)

  • ls /sys/class/perfctr
    (to test for PerfCtr)

  • ls /sys/kernel/perfmon
    (to test for PerfMon [PFM] version 3)

  • ls /proc/perfmon
    (to test for PerfMon [PFM] version 3)

  • cat /proc/sys/kernel/perf_event_paranoid
    (to test for PerfMon [PFM] version 4)
    (-1: not paranoid at all, 0: disallow raw tracepoint access for normal users, 1: disallow CPU events for normal users, 2: disallow kernel profiling for normal users)

  • dmesg|grep PMU
    (to test for PCL)

  • ls /sys/devices/system/cpu/perf_events/*
    (to test for PCL)

  • grep arch_mon /proc/cpuinfo
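The file-based checks above can be bundled into one small script. The following is a minimal sketch rather than a definitive probe: it only tests for the files named in the list, and the function name detect_counter_interfaces and its optional root-prefix argument are invented here for illustration.

```sh
#!/bin/sh
# List which counter interfaces leave visible traces in the file system.
# $1 (optional, hypothetical): a root prefix, so the function can be tried
# on a copied directory tree; by default the live system is inspected.
detect_counter_interfaces() {
    root="${1:-}"
    # Each line of the here-document is "<path> <label>".
    while read -r path label; do
        if [ -e "$root$path" ]; then
            echo "$label: $path"
        fi
    done <<EOF
/dev/perfctr PerfCtr
/sys/class/perfctr PerfCtr
/sys/kernel/perfmon PerfMon v3
/proc/perfmon PerfMon v3
/proc/sys/kernel/perf_event_paranoid PerfMon v4 (perf_event_open)
EOF
}

detect_counter_interfaces   # inspect the running system
```

An empty result inside a virtual machine is only a hint that the PMU is not exposed to the guest, since a missing file can also mean that the corresponding patch is not installed. The lsmod, dmesg, and /proc/cpuinfo checks from the list still have to be run separately.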

PAPI 3.x on x86 Linux

PAPI can use PerfCtr, PerfMon (PFM), or PCL. By default, the automagic configure script uses PerfCtr and creates a Makefile that best suits the CPU. In this Makefile, look at the variable MAKEVER, which is specific to both the processor and the library (PerfCtr, PerfMon, or PCL). For example, if
MAKEVER = linux-perfctr-em64t
then Makefile.linux-perfctr-em64t will be used. Makefile.linux-perfctr-em64t dictates what events are available for monitoring.
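To find out which rule file a configured tree will use, one can pull MAKEVER out of the generated Makefile. This is a minimal sketch, assuming the assignment sits on a line of its own in the form shown above; the function name makever is hypothetical and not part of PAPI.

```sh
#!/bin/sh
# Print the value assigned to MAKEVER.
# $1: path to the generated Makefile (defaults to ./Makefile).
makever() {
    sed -n 's/^MAKEVER[[:space:]]*=[[:space:]]*//p' "${1:-Makefile}"
}

# Demonstration on a one-line stand-in for the generated Makefile.
demo=$(mktemp)
echo 'MAKEVER = linux-perfctr-em64t' > "$demo"
echo "Makefile.$(makever "$demo")"   # prints Makefile.linux-perfctr-em64t
rm -f "$demo"
```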

In this case, the main entry function that sets up the initial monitoring environment is setup_p4_presets. It calls _papi_pfm_setup_presets, which scans the file perfmon_events.csv (this file maps preset events to native events) for the line CPU,Intel Pentium4 and loads all the preset events listed under it. It then calls _papi_hwd_fixup_fp and _papi_hwd_fixup_vec to load additional preset events (from the same perfmon_events.csv file) related to floating point operations. The latter two functions depend on the EVENTFLAGS line in Makefile.linux-perfctr-em64t. In PAPI 3.7.2, by default all preset events which are filed under

CPU,Intel Pentium4
and
CPU,Intel Pentium4 FPU X87 SSE_DP
are loaded.

PAPI 4.2 on x86 Linux

PAPI 4.x determines which library (PerfCtr or PerfMon) to use in a different way. There are two variables, MAKEVER and FILENAME, in the automagic configure script.

MAKEVER will be set to one of $OS-pe, $OS-perfmon2, or $OS-pfm-$CPU, where $OS is either CLE (Cray Linux Environment) or linux, and $CPU can be p3, p4, core, core2, atom, i7, opteron, or athlon.

Depending on the value of MAKEVER, FILENAME is set to Rules.$LIB, where $LIB is one of (on x86) perfctr-pfm, pfm_pe (which uses Linux PerfMon [PFM] version 3), or pfm4_pe (which uses Linux PerfMon [PFM] version 4).

The generated Makefile will load the Rules.$LIB specified by the FILENAME variable.

In addition to the Makefile, the configure script also generates papi_events_table.h, which is essentially the content of papi_events.csv.

To see what is loaded from papi_events_table.h, compile PAPI with -DDEBUG, then at runtime set the environment variable PAPI_DEBUG to SUBSTRATE and run any PAPI program, e.g. papi_avail. One should then see output like:

....
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1254:9447 -1
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1258:9447     51 perf perf_events generic PMU 3
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1258:9447     53 wsm_dp Intel Westmere DP 1
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1267:9447       wsm_dp is default
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1254:9447 -1
....
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:250:9447 CPU token found on line 8
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:268:9447 Examining CPU (AMD64 (K7)) vs. (wsm_dp)
....
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:268:9447 Examining CPU (wsm_dp) vs. (wsm_dp)
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:274:9447 Found CPU wsm_dp at line 341 of builtin papi_events_table.
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:280:9447 No additional qualifier found, matching on string.
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:312:13785 Examining preset PAPI_TOT_CYC
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:321:13785 Found 0x8000003b for PAPI_TOT_CYC
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:331:13785 Examining derived NOT_DERIVED
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:340:13785 Found 0 for NOT_DERIVED
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:342:13785 Adding 0x8000003b,0 to preset search table.
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:373:13785 Adding term (0) UNHALTED_CORE_CYCLES to preset event 0x8000003b.
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:405:13785 # events inserted: --1--
....
Here, wsm_dp means Intel Westmere DP (dual socket/processor), and 0x8000003b is the bitwise OR of PAPI_PRESET_MASK and PAPI_TOT_CYC_idx (both defined in papiStdEventDefs.h).
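The arithmetic behind 0x8000003b can be checked in the shell. This sketch assumes that PAPI_PRESET_MASK is 0x80000000 and that PAPI_TOT_CYC_idx is 0x3b (59). The index is inferred from the log line above rather than read from papiStdEventDefs.h, and the helper names are invented for illustration.

```sh
#!/bin/sh
# Assumed value of PAPI_PRESET_MASK; confirm against papiStdEventDefs.h.
PRESET_MASK=0x80000000

# preset_code INDEX: OR the preset index into the mask, print as hex.
preset_code() {
    printf '0x%08x\n' $(( $PRESET_MASK | $1 ))
}

# preset_index CODE: clear the mask bit again, print as decimal.
preset_index() {
    printf '%d\n' $(( $1 & 0x7fffffff ))
}

preset_code 0x3b          # prints 0x8000003b
preset_index 0x8000003b   # prints 59
```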

Cross-checking the papi_events.csv file, one can see

....
CPU,Intel Nehalem
CPU,Intel Westmere
CPU,nhm
CPU,nhm_ex
CPU,wsm
CPU,wsm_dp
#
PRESET,PAPI_TOT_CYC,NOT_DERIVED,UNHALTED_CORE_CYCLES
PRESET,PAPI_TOT_INS,NOT_DERIVED,INSTRUCTION_RETIRED
PRESET,PAPI_L1_ICM,NOT_DERIVED,L1I:MISSES
...
Here UNHALTED_CORE_CYCLES is an event name defined by the Linux PerfMon [PFM] library (in this case, it can be found in libpfm4/lib/events/intel_wsm_events.h).
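To list every preset that applies to a given CPU name, the csv file can be scanned section by section. The sketch below assumes the layout visible in the excerpt: a run of CPU lines opens a section, and the PRESET lines after it belong to all of those names. The real file may contain constructs this does not handle, and presets_for_cpu is a hypothetical helper, not part of PAPI.

```sh
#!/bin/sh
# Print the PRESET lines that apply to one CPU name.
# $1: CPU name as reported by libpfm (e.g. wsm_dp)
# $2: path to the csv file
presets_for_cpu() {
    awk -F, -v cpu="$1" '
        $1 == "CPU" {
            if (!in_header) wanted = 0   # first CPU line of a new section
            in_header = 1
            if ($2 == cpu) wanted = 1
            next
        }
        { in_header = 0 }                # any other line ends the CPU run
        $1 == "PRESET" && wanted { print }
    ' "$2"
}

# Demonstration on a fragment of the excerpt quoted above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
CPU,wsm
CPU,wsm_dp
#
PRESET,PAPI_TOT_CYC,NOT_DERIVED,UNHALTED_CORE_CYCLES
PRESET,PAPI_TOT_INS,NOT_DERIVED,INSTRUCTION_RETIRED
EOF
presets_for_cpu wsm_dp "$demo"
rm -f "$demo"
```

The demonstration prints the two PRESET lines, because wsm_dp is one of the CPU names heading that section.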

"Error in PAPI_library_init: PAPI_ESBSTR" ?

According to the manual, PAPI_ESBSTR means "Substrate returned an error, usually the result of an unimplemented feature." One possibility is that the CPU is not supported by the PAPI version you are using. So the question is: what CPU does PAPI think you are using?

To see how PAPI identifies your CPU, compile PAPI with -DDEBUG, then at runtime set the environment variable PAPI_DEBUG to SUBSTRATE and run any PAPI program, e.g. papi_avail. One should then see output like:

....
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1258:1164     8 netburst_p Pentium4 (Prescott) 1
SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1267:1164       netburst_p is default
....
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:268:1164 Examining CPU (netburst) vs. (netburst_p)
SUBSTRATE:papi_libpfm_presets.c:load_preset_table:250:1164 CPU token found on line 756
...
So PAPI uses PerfMon (PFM) version 4 to recognize that your CPU is netburst_p, but the papi_events.csv file does not have any match for it.
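Both halves of this diagnosis can be scripted: which PMU name libpfm4 picked, and whether papi_events.csv knows that name. This is a rough sketch under stated assumptions: the log keeps the "<name> is default" wording shown above, PAPI compares CPU names as exact strings, and the helper names default_pmu and csv_has_cpu are invented here.

```sh
#!/bin/sh
# Read a PAPI debug log on stdin and print the PMU name marked as default.
default_pmu() {
    sed -n 's/.*[[:space:]]\([^[:space:]]\{1,\}\) is default[[:space:]]*$/\1/p'
}

# Succeed if the csv file ($2) contains a line that is exactly "CPU,<name>" ($1).
# Exact matching is an assumption about how PAPI compares the names.
csv_has_cpu() {
    grep -qxF "CPU,$1" "$2"
}

# Demonstration on the log line quoted above; prints netburst_p.
echo 'SUBSTRATE:papi_libpfm4_events.c:_papi_libpfm_init:1267:1164       netburst_p is default' | default_pmu
```

With a PAPI build compiled with -DDEBUG, the log could be fed in with PAPI_DEBUG=SUBSTRATE papi_avail 2>&1 | default_pmu, and the result passed to csv_has_cpu together with the path to papi_events.csv.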

The fix is simple: add netburst_p as a CPU entry in papi_events.csv.
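One way to make that change without editing the file by hand is sketched below. Placing the new entry directly after the existing netburst entry is an assumption (the text only says to add it), and add_cpu_alias is a hypothetical helper, not a PAPI tool.

```sh
#!/bin/sh
# Write a copy of the csv file to stdout with "CPU,<alias>" added directly
# after the existing "CPU,<name>" line.
# $1: existing CPU name, $2: alias to add, $3: path to the csv file
add_cpu_alias() {
    awk -v old="CPU,$1" -v new="CPU,$2" '
        { print }
        $0 == old { print new }
    ' "$3"
}

# Demonstration on a two-line stand-in for papi_events.csv.
demo=$(mktemp)
printf 'CPU,netburst\n#\n' > "$demo"
add_cpu_alias netburst netburst_p "$demo"
rm -f "$demo"
```

On a real tree this would be used as add_cpu_alias netburst netburst_p papi_events.csv > papi_events.csv.new, followed by a review of the result. Since the configure script turns papi_events.csv into papi_events_table.h, PAPI presumably has to be reconfigured and rebuilt afterwards.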


How many performance counters are available?

In Linux 2.6.36 and later, if the system is SMP (i.e. has more than one socket/processor), the kernel can seize one hardware performance counter for lockup detection. The details can be found in Documentation/nmi_watchdog.txt in the Linux kernel source tree.

Which counter is seized depends on the architecture. For example, for the Pentium 4/Netburst, it is IQ_COUNTER0:

/*
 * Set up IQ_COUNTER0 to behave like a clock, by having IQ_CCCR0 filter
 * CRU_ESCR0 (with any non-null event selector) through a complemented
 * max threshold. [IA32-Vol3, Section 14.9.9]
 */
static int setup_p4_watchdog(unsigned nmi_hz)
{
    ....
}
(full source code here)

Newer Linux (e.g. version 3.2) has a function to handle this:

 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 static int watchdog_nmi_enable(int cpu)
 {
   ...
   /* Try to register using hardware perf events */
   event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
   ...
 }

To see whether a hardware performance counter has been seized for this purpose, one can:

  • Check whether CONFIG_HARDLOCKUP_DETECTOR was enabled when the kernel was compiled.
  • Run the dmesg command and look for the NMI message, e.g.:
      Performance Events: PEBS fmt1+, Nehalem events, Intel PMU driver.
      ... version:                3
      ... bit width:              48
      ... generic registers:      4
      ... value mask:             0000ffffffffffff
      ... max period:             000000007fffffff
      ... fixed-purpose events:   3
      ... event mask:             000000070000000f
      NMI watchdog enabled, takes one hw-pmu counter.
    
  • Check whether /proc/sys/kernel/nmi_watchdog contains the value 1.

To disable this lockup watchdog, execute

echo 0 > /proc/sys/kernel/nmi_watchdog
as root.
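The /proc check can be wrapped in a small function, for example to run before starting a measurement. This is a minimal sketch: the function name nmi_watchdog_state and its optional file argument are invented for illustration, and it only interprets the values 0 and 1 mentioned above.

```sh
#!/bin/sh
# Report the NMI watchdog setting as "enabled", "disabled" or "unknown".
# $1 (optional): file to read instead of /proc/sys/kernel/nmi_watchdog.
nmi_watchdog_state() {
    file="${1:-/proc/sys/kernel/nmi_watchdog}"
    if [ ! -r "$file" ]; then
        echo "unknown"
        return
    fi
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;   # one hardware counter may be taken
        *) echo "unknown" ;;
    esac
}

nmi_watchdog_state   # query the running system
```

If the result is "enabled", the dmesg output can be searched for the "NMI watchdog enabled, takes one hw-pmu counter." line shown above to confirm that a counter is actually in use.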

If you see "PAPI Error in PAPI_library_init: PAPI_ENOSUPP" or "Pentium 4 not supported on kernels before 2.6.35", the watchdog described above is the likely cause. (One can also run the PAPI example code ctests/nmi_watchdog and check its output.)


FLOP measurement on x86

Counting floating-point operations (FLOP) on x86 is complex: there are old-school x87 instructions as well as modern SSE (Streaming SIMD Extensions) instructions, and they must be counted using separate counters. To make things worse, SSE instructions fall into four categories:
  • Packed double-precision
  • Packed single-precision
  • Scalar double-precision
  • Scalar single-precision
On older architectures (e.g. Pentium 4/Netburst), each category must be counted separately, and only four hardware performance counters are available for floating-point related metrics.
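Once the categories have been measured separately, they still have to be combined into one figure. The following is only an illustrative sketch, not how PAPI derives its events: it assumes that each count is a number of executed operations of that kind (not micro-ops), and that SSE registers are 128 bits wide, so a packed operation works on 4 single-precision or 2 double-precision values. The function name flop_estimate is hypothetical.

```sh
#!/bin/sh
# Combine five separately measured counts into one FLOP estimate.
# Arguments: x87, scalar single, scalar double, packed single, packed double.
# Packed counts are weighted by the number of values per 128-bit register.
flop_estimate() {
    echo $(( $1 + $2 + $3 + 4 * $4 + 2 * $5 ))
}

flop_estimate 10 20 30 5 7   # 10 + 20 + 30 + 4*5 + 2*7 = 94
```

Whether a given hardware event counts instructions, micro-ops, or something else is architecture-dependent, so the weights should be checked against the event documentation before such a figure is trusted.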

In PAPI, there are two seemingly identical events: PAPI_FP_INS and PAPI_FP_OPS. The exact meanings of these events are architecture-dependent (see PAPI FAQ), but in general:

  • PAPI_FP_INS: Retired floating point operations (could include load/store of floating point numbers).
  • PAPI_FP_OPS: All floating point operations, including those that are speculatively executed.

In PAPI 4.2, they are defined as

Architecture: Pentium 4, Netburst
  PAPI_FP_INS: x87_FP_uop
  PAPI_FP_OPS: scalar_DP_uop (default, but can be configured at compile time to scalar_SP_uop. See PAPI FAQ)

Architecture: Nehalem, Westmere
  PAPI_FP_INS: FP_COMP_OPS_EXE:SSE_FP
  PAPI_FP_OPS: FP_COMP_OPS_EXE:SSE_FP + FP_COMP_OPS_EXE:X87

Architecture: Sandy Bridge
  PAPI_FP_INS: FP_COMP_OPS_EXE:X87 + FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE + FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE + FP_COMP_OPS_EXE:SSE_PACKED_SINGLE + FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE
  PAPI_FP_OPS: (same as PAPI_FP_INS)