clu2's notes: How GCC generates code: a case study

Useful tools

Matt Godbolt's GCC explorer, a web-based tool for exploring how code tweaks change the machine code emitted by the compiler (x86).

Debugging GCC's code generation

Consider the following simple C program, saved as foo.c:

        #include <stdlib.h>
        void main() {
           float f = rand();
        }

When compiled on POWER (Linux) with GCC 4.5.1, without any optimization flags and in the default -m32 mode, GCC generates the following assembly code:

        stwu 1,-48(1)
        mflr 0
        stw 0,52(1)
        stw 31,44(1)
        mr 31,1
        bl rand
        mr 9,3
        lis 0,0x4330
        lis 11,.LC0@ha
        lfd 0,.LC0@l(11)
        xoris 9,9,0x8000
        stw 9,28(31)
        stw 0,24(31)
        lfd 13,24(31)
        fsub 0,13,0
        frsp 0,0
        stfs 0,8(31)
        addi 11,31,48
        lwz 0,4(11)
        mtlr 0
        lwz 31,-4(11)
        mr 1,11
        blr

There are two floating-point operations: fsub and frsp. frsp is the Floating Round to Single-Precision instruction, which makes sense because the result is stored in a float.

The question is: where is the integer to floating-point conversion code? It seems the subtraction instruction fsub has something to do with it. What is GCC's logic in generating such code?

GCC internals

To understand this, we need some knowledge of GCC internals (also see slides here). GCC has two intermediate stages: GIMPLE (i.e. the tree form) and RTL (Register Transfer Language, which resembles an assembly language with infinitely many registers). In each stage, dozens of "passes" (see the GCC source file gcc/passes.c) perform code transformations and optimizations.

GIMPLE looks much like C and is platform independent. To see it, compile the above code with the -fdump-tree-gimple option, and a file foo.c.004t.gimple will be created. Its content looks like this:

        main ()
        {
          int D.3135;
          volatile float f.0;
          volatile float f;

          D.3135 = rand ();
          f.0 = (volatile float) D.3135;
          f = f.0;
        }

RTL is platform dependent; this stage performs code generation, which includes, for example, register allocation, instruction scheduling, and intra-procedural optimizations. So to answer our question, we need to look at the various RTL passes.

Compile the above code with the -S -dP -fdump-rtl-all options, and a bunch of files named foo.c.XXXr.YYY will be created. XXX is the number of the pass in the RTL stage, and YYY is the name of that pass.

RTL

First, look at the assembly code in foo.s. With the -dP option, the assembly code is annotated with RTL. The snippet we are interested in is:

    ...
    #(insn 19 18 11 1234.c:8 (set (reg:DF 32 0 [121])
    #        (minus:DF (reg:DF 45 13 [125])
    #            (reg:DF 32 0 [123]))) 258 {*subdf3_fpr} (nil))
           fsub 0,13,0      # 19   *subdf3_fpr     [length = 4]
    ...

The keyword here is subdf3_fpr. Next, run

    grep subdf3_fpr foo.c.???r.*

The first RTL dump that contains subdf3_fpr is foo.c.188r.ira, and the pass immediately before it is foo.c.185r.asmcons. These two dump files still do not show why a subtraction instruction is generated. However, both files also contain another keyword: movdf_hardfloat32.

    grep movdf_hardfloat32 foo.c.???r.*

The first RTL dump that contains movdf_hardfloat32 is foo.c.146r.vregs. This file provides a clue:

    (insn 10 9 11 3 1234.c:8 (parallel [
                (set (reg:DF 121)
                    (float:DF (reg:SI 119 [ D.3135 ])))
                (use (reg:SI 122))
                (use (reg:DF 123))
                (clobber (mem/c:DF (plus:SI (reg/f:SI 113 sfp)
                            (const_int 24 [0x18])) [0 S8 A64]))
                (clobber (reg:DF 125))
                (clobber (reg:SI 126))
            ]) 271 {*floatsidf2_internal} (expr_list:REG_EQUAL (float:DF (reg:SI 119 [ D.3135 ]))
            (nil)))

floatsidf2_internal sounds like the integer to floating-point conversion code we are looking for, and it probably holds the key to why a subtraction instruction is generated.

The platform-dependent RTL rules/patterns live under gcc/config/<arch> in the GCC source tree. In our case, <arch> is rs6000. In the file gcc/config/rs6000/rs6000.md (md = machine description), we find the following rewriting rules:

    (define_expand "floatsidf2"
      [(parallel [(set (match_operand:DF 0 "gpc_reg_operand" "")
               (float:DF (match_operand:SI 1 "gpc_reg_operand" "")))
              (use (match_dup 2))
              (use (match_dup 3))
              (clobber (match_dup 4))
              (clobber (match_dup 5))
              (clobber (match_dup 6))])]
      "TARGET_HARD_FLOAT
       && ((TARGET_FPRS && TARGET_DOUBLE_FLOAT) || TARGET_E500_DOUBLE)"
      "
    {
      if (TARGET_E500_DOUBLE)
        {
          emit_insn (gen_spe_floatsidf2 (operands[0], operands[1]));
          DONE;
        }
      if (TARGET_POWERPC64)
        {
          rtx x = convert_to_mode (DImode, operands[1], 0);
          emit_insn (gen_floatdidf2 (operands[0], x));
          DONE;
        }

      operands[2] = force_reg (SImode, GEN_INT (0x43300000));
      operands[3] = force_reg (DFmode, CONST_DOUBLE_ATOF (\"4503601774854144\", DFmode));
      operands[4] = assign_stack_temp (DFmode, GET_MODE_SIZE (DFmode), 0);
      operands[5] = gen_reg_rtx (DFmode);
      operands[6] = gen_reg_rtx (SImode);
    }")

    (define_insn_and_split "*floatsidf2_internal"
      [(set (match_operand:DF 0 "gpc_reg_operand" "=&f")
        (float:DF (match_operand:SI 1 "gpc_reg_operand" "r")))
       (use (match_operand:SI 2 "gpc_reg_operand" "r"))
       (use (match_operand:DF 3 "gpc_reg_operand" "f"))
       (clobber (match_operand:DF 4 "offsettable_mem_operand" "=o"))
       (clobber (match_operand:DF 5 "gpc_reg_operand" "=&f"))
       (clobber (match_operand:SI 6 "gpc_reg_operand" "=&r"))]
      "! TARGET_POWERPC64 && TARGET_HARD_FLOAT && TARGET_FPRS && TARGET_DOUBLE_FLOAT"
      "#"
      "&& (can_create_pseudo_p () || offsettable_nonstrict_memref_p (operands[4]))"
      [(pc)]
      "
    {
      rtx lowword, highword;
      gcc_assert (MEM_P (operands[4]));
      highword = adjust_address (operands[4], SImode, 0);
      lowword = adjust_address (operands[4], SImode, 4);
      if (! WORDS_BIG_ENDIAN)
        {
          rtx tmp;
          tmp = highword; highword = lowword; lowword = tmp;
        }

      emit_insn (gen_xorsi3 (operands[6], operands[1],
                 GEN_INT (~ (HOST_WIDE_INT) 0x7fffffff)));
      emit_move_insn (lowword, operands[6]);
      emit_move_insn (highword, operands[2]);
      emit_move_insn (operands[5], operands[4]);
      emit_insn (gen_subdf3 (operands[0], operands[5], operands[3]));
      DONE;
    }"
      [(set_attr "length" "24")])

floatsidf2_internal is really just an identifier (the pattern name).

The [(set (match_operand:DF 0 ...) ... (clobber (match_operand:SI 6 "gpc_reg_operand" "=&r"))] part is the RTL matching template.

The "! TARGET_POWERPC64 && TARGET_HARD_FLOAT && TARGET_FPRS && TARGET_DOUBLE_FLOAT" string is an additional condition that must hold for the pattern to apply. Its terms are:

! TARGET_POWERPC64: Not PowerPC64.
TARGET_HARD_FLOAT: Has a hardware FPU.
TARGET_FPRS: Has dedicated floating-point registers (some PowerPC variants use general-purpose registers for floating-point operations).
TARGET_DOUBLE_FLOAT: Has double-precision floating-point support.

The "#" output template means the pattern emits no assembly directly and must be split into other instructions. The { rtx lowword, highword; ... } block is the C code that generates those replacement instructions when the split is performed.

It is now clear that the generated code and its RTL counterparts are:

    xoris 9,9,0x8000     emit_insn (gen_xorsi3 (operands[6], operands[1], GEN_INT (~ (HOST_WIDE_INT) 0x7fffffff)));
    stw 9,28(31)         emit_move_insn (lowword, operands[6]);
    stw 0,24(31)         emit_move_insn (highword, operands[2]);
    lfd 13,24(31)        emit_move_insn (operands[5], operands[4]);
    fsub 0,13,0          emit_insn (gen_subdf3 (operands[0], operands[5], operands[3]));
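
As a sanity check on the constants set up by the floatsidf2 expander, here is a minimal C sketch (it is not part of GCC). It assumes IEEE 754 doubles and models HOST_WIDE_INT as int64_t; the function names double_bits, magic_double and sign_flip_mask are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 bit pattern of a double. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* The value placed in operands[3], parsed from "4503601774854144". */
double magic_double(void)
{
    return 4503601774854144.0;   /* 2^52 + 2^31, exactly representable */
}

/* The XOR mask passed to gen_xorsi3, truncated to 32 bits.
   Assumption: HOST_WIDE_INT is modeled here as int64_t. */
uint32_t sign_flip_mask(void)
{
    return (uint32_t)~(int64_t)0x7fffffff;
}
```

Under these assumptions, double_bits(magic_double()) is 0x4330000080000000, whose high-order word is the 0x43300000 placed in operands[2], and sign_flip_mask() is 0x80000000, the sign bit that xoris flips.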

Integer to floating-point conversion in PowerPC

Integer to floating-point conversion in 32-bit PowerPC mode is quite convoluted. According to Section 3.3.8 of the PowerPC Compiler Writer's Guide, the steps are:

1. Flip the integer sign bit and place the result in the low-order part of a doubleword (8 bytes) in memory.
2. Create the high-order part with sign and exponent fields such that the resulting doubleword value interpreted as a hexadecimal floating-point value is 0x1.00000dddddddd*10^D, where 0xdddddddd is the hexadecimal sign-flipped integer value.
3. Load the doubleword as a floating-point value.
4. Subtract the hexadecimal floating-point value 0x1.0000080000000*10^D from the previous value to generate the result.

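Basically, the first three steps build a double-precision floating-point number out of the integer, and step 4 subtracts a "magic double" (0x4330000080000000) from it. In the notation above, both 10 and D are hexadecimal, so 10^D is 16^13 = 2^52. For an input integer n, the doubleword built in the first three steps has the value 2^52 + 2^31 + n, the magic double is 2^52 + 2^31, and the subtraction leaves exactly n. This is exactly what our original assembly code does: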

        bl rand
        mr 9,3              // result of rand() is in $r9
        lis 0,0x4330        // $r0 = 0x43300000, the high-order word of the "magic double"
        lis 11,.LC0@ha
        lfd 0,.LC0@l(11)    // load the magic double from .LC0 into $f0
        xoris 9,9,0x8000    // flip the sign bit of $r9 (Step 1)
        stw 9,28(31)        // store $r9 as the low-order word of a doubleword in memory (Step 1)
        stw 0,24(31)        // store 0x43300000 from $r0 as its high-order word (Step 2)
        lfd 13,24(31)       // load the doubleword into $f13 (Step 3)
        fsub 0,13,0         // $f0 = $f13 - $f0 (Step 4)
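
The same four steps can be sketched in portable C. This is only an illustration of the technique, not what GCC emits: it assumes IEEE 754 doubles and that a uint64_t and a double share the same byte order, and the function name int_to_double_magic is made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Convert a 32-bit signed integer to double with the "magic double"
   trick, without using an integer-to-double cast. */
double int_to_double_magic(int32_t x)
{
    /* Step 1: flip the sign bit; this is the low-order word (xoris). */
    uint32_t low = (uint32_t)x ^ 0x80000000u;

    /* Step 2: the high-order word is 0x43300000 (sign 0, exponent 2^52). */
    uint64_t bits = ((uint64_t)0x43300000u << 32) | low;

    /* Step 3: reinterpret the 64 bits as a double (stw, stw, lfd). */
    double d;
    memcpy(&d, &bits, sizeof d);

    /* The magic double 0x4330000080000000, i.e. 2^52 + 2^31. */
    const uint64_t magic_bits = 0x4330000080000000ull;
    double magic;
    memcpy(&magic, &magic_bits, sizeof magic);

    /* Step 4: (2^52 + 2^31 + x) - (2^52 + 2^31) = x, with no rounding (fsub). */
    return d - magic;
}
```

Casting the result to float, as in (float)int_to_double_magic(n), then plays the role of the frsp instruction.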

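In 64-bit PowerPC mode, the code is much shorter because the fcfid instruction can be used. This corresponds to the TARGET_POWERPC64 branch of floatsidf2 above, which hands the conversion to floatdidf2.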