clu2's notes: Notes on x86 assembly

Control flow

Some common control flow patterns in x86 assembly (Intel syntax):

`test arg1,arg2`	Bitwise AND. Unlike `and`, neither `arg1` nor `arg2` is changed. Zero flag (ZF) is 0 if the result is nonzero.
`test arg,arg`	Zero flag (ZF) is 0 if `arg` is nonzero.
`je` `jz`	Jump if arg is 0, or arg1&arg2 is 0 (Zero flag is 1)	`jne` `jnz`	Jump if arg is nonzero, or arg1&arg2 is nonzero (Zero flag is 0)

`xor` arg1,arg2	Bitwise XOR
`je` `jz`	Jump if arg1==arg2	`jne` `jnz`	Jump if arg1!=arg2

`cmp arg1,arg2`	Subtraction: `arg1-arg2` (Intel syntax). Unlike `sub`, neither `arg1` nor `arg2` is changed.
`je` `jz`	Jump if arg1==arg2 (Zero flag is 1)	`jne` `jnz`	Jump if arg1!=arg2
`jg` `jnle`	Jump if signed arg1>arg2 (Zero flag is 0 and Sign flag=Overflow flag)	`jl` `jnge`	Jump if signed arg1<arg2
`jge` `jnl`	Jump if signed arg1>=arg2 (Sign flag=Overflow flag)	`jle` `jng`	Jump if signed arg1<=arg2
`ja` `jnbe`	Jump if unsigned arg1>arg2 (Both Carry flag and Zero flag are 0)	`jb` `jnae` `jc`	Jump if unsigned arg1<arg2 (Carry flag is 1)
`jae` `jnb` `jnc`	Jump if unsigned arg1>=arg2 (Carry flag is 0)	`jbe` `jna`	Jump if unsigned arg1<=arg2 (Carry flag or Zero flag is 1)
`js`	Jump if Sign flag is 1	`jns`	Jump if Sign flag is 0

If the above rules are confusing, consider the following scenarios (16-bit operands):

`arg1`	`arg2`	`arg1-arg2`	Sign flag	Overflow flag
0xFFFF (-1)	0xFFFE (-2)	0x0001	0	0
0x8000 (-32768)	0x0001	0x7FFF (32767)	0	1
0xFFFE (-2)	0xFFFF (-1)	0xFFFF (-1)	1	0
0x7FFF (32767)	0xFFFF (-1)	0x8000 (-32768)	1	1
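The scenarios above can be reproduced with a small model of a 16-bit `cmp`. This is only a sketch: the names `cmp16`, `jg_taken`, `jl_taken`, `ja_taken` and `jb_taken` are made up for illustration, and the flag formulas are the usual textbook ones for subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags produced by a 16-bit `cmp arg1, arg2`, i.e. arg1 - arg2. */
struct flags {
    bool zf; /* Zero flag: result is 0 */
    bool sf; /* Sign flag: top bit of the result */
    bool of; /* Overflow flag: signed result does not fit in 16 bits */
    bool cf; /* Carry flag: unsigned borrow, i.e. arg1 < arg2 as unsigned */
};

struct flags cmp16(uint16_t arg1, uint16_t arg2)
{
    uint16_t result = (uint16_t)(arg1 - arg2);
    struct flags f;

    f.zf = (result == 0);
    f.sf = (result & 0x8000u) != 0;
    /* Signed overflow on subtraction: the operands have different signs
       and the sign of the result differs from the sign of arg1. */
    f.of = (((arg1 ^ arg2) & (arg1 ^ result)) & 0x8000u) != 0;
    f.cf = arg1 < arg2;
    return f;
}

/* Conditions from the table above. */
bool jg_taken(struct flags f) { return !f.zf && f.sf == f.of; }
bool jl_taken(struct flags f) { return f.sf != f.of; }
bool ja_taken(struct flags f) { return !f.cf && !f.zf; }
bool jb_taken(struct flags f) { return f.cf; }
```

For instance, `cmp16(0x8000, 0x0001)` yields SF=0 and OF=1, so `jl_taken` is true, matching -32768 < 1, while `ja_taken` is also true because 0x8000 > 0x0001 as unsigned numbers.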

The `j` in the tables above can be replaced by `cmov` or `set` for Conditional Move or Conditional Set (e.g. `cmovg`, `setne`).

`loop`	Decrement the CX register (ECX or RCX, depending on the address size), and if it is nonzero, continue the loop (i.e. branch to the target location)
`loope` `loopz`	Decrement the CX register, and if it is nonzero and the Zero flag is set, continue the loop (i.e. branch to the target location). It is usually used in conjunction with `cmp`, i.e. if arg1==arg2, loop back.

Opcodes for conditional jumps

Opcodes for x86_64 conditional jumps have two formats:

7x xx                 RIP=RIP+8 bit displacement (sign-extended to 64 bits)
0F 8x xx xx xx xx     RIP=RIP+32 bit displacement (sign-extended to 64 bits)

`je` `jz`	74 0F 84	`jne` `jnz`	75 0F 85
`jg` `jnle`	7F 0F 8F	`jl` `jnge`	7C 0F 8C
`jge` `jnl`	7D 0F 8D	`jle` `jng`	7E 0F 8E
`ja` `jnbe`	77 0F 87	`jb` `jnae` `jc`	72 0F 82
`jae` `jnb` `jnc`	73 0F 83	`jbe` `jna`	76 0F 86
`js`	78 0F 88	`jns`	79 0F 89
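To make the two conditional-jump encodings concrete, here is a rough sketch of how a disassembler might compute the branch target. The function name `jcc_target` and its parameters are invented for this example. The one fact it relies on is that the displacement is added to the address of the next instruction, which is the value RIP already holds when the jump executes.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the target of a conditional jump located at insn_addr.
   Returns 0 and stores the target on success, or -1 if the bytes are
   neither a "7x xx" nor a "0F 8x xx xx xx xx" conditional jump. */
int jcc_target(const uint8_t *code, size_t len, uint64_t insn_addr,
               uint64_t *target)
{
    if (len >= 2 && (code[0] & 0xF0) == 0x70) {
        /* Short form: 8-bit displacement, sign-extended to 64 bits. */
        int8_t disp = (int8_t)code[1];
        *target = insn_addr + 2 + (uint64_t)(int64_t)disp;
        return 0;
    }
    if (len >= 6 && code[0] == 0x0F && (code[1] & 0xF0) == 0x80) {
        /* Near form: 32-bit little-endian displacement, sign-extended. */
        uint32_t raw = (uint32_t)code[2]
                     | (uint32_t)code[3] << 8
                     | (uint32_t)code[4] << 16
                     | (uint32_t)code[5] << 24;
        int32_t disp = (int32_t)raw;
        *target = insn_addr + 6 + (uint64_t)(int64_t)disp;
        return 0;
    }
    return -1;
}
```

With this model, the bytes `74 FE` at address 0x1000 (a `je` with displacement -2) jump back to 0x1000 itself, i.e. a tight loop while the Zero flag is set.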

Opcodes for unconditional jumps

Opcodes for x86_64 unconditional jumps have the following formats (this list is incomplete):

EB xx                 RIP=RIP+8 bit displacement (sign-extended to 64 bits)
E9 xx xx xx xx        RIP=RIP+32 bit displacement (sign-extended to 64 bits)
FF E0                 RIP=RAX
FF E3                 RIP=RBX
FF E1                 RIP=RCX
FF E2                 RIP=RDX
FF E7                 RIP=RDI
FF E6                 RIP=RSI
FF E4                 RIP=RSP
FF E5                 RIP=RBP
FF 20                 RIP=[RAX]
FF 23                 RIP=[RBX]
FF 21                 RIP=[RCX]
FF 22                 RIP=[RDX]
FF 27                 RIP=[RDI]
FF 26                 RIP=[RSI]
FF 24 24              RIP=[RSP]
FF 65 00              RIP=[RBP]
FF 25 ?? ?? ?? ??     RIP=[RIP+0x????????]

Synchronization support

`cmpxchg` arg1,arg2

cmpxchg automatically chooses AL, AX, EAX, or RAX and compares with arg1:

if (EAX == arg1) then
   Zero flag = 1;
   arg1 = arg2;
else
   Zero flag = 0;
   EAX = arg1;
endif

Here is an example of how to implement a spin lock using cmpxchg:

      mov  EDX, 1
spin: mov  EAX, my_lock
      test EAX, EAX
      jnz  spin
      lock cmpxchg my_lock, EDX
      test EAX, EAX
      jnz  spin

When lock cmpxchg my_lock, EDX executes, EAX is guaranteed to be 0, but my_lock is stored in memory, so another thread may have made it nonzero in the meantime.

If my_lock==0, then my_lock is set to 1 (EDX's value), and EAX is still 0, so jnz is not taken and the lock is acquired.
If my_lock!=0, then my_lock is left unchanged, and EAX is set to my_lock's (nonzero) value, so jnz is taken and the code keeps spinning.
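For comparison, the same test-then-cmpxchg pattern can be sketched in C11 with `<stdatomic.h>`. The function names below are hypothetical and the memory orderings are one reasonable choice, not the only one. On x86 a compiler typically turns the compare-exchange into `lock cmpxchg`, but that is up to the compiler.

```c
#include <stdatomic.h>

/* The lock word: 0 = free, 1 = held (same convention as my_lock above). */

void spin_acquire(atomic_int *lock)
{
    for (;;) {
        /* Read-only spin, like the first "test EAX, EAX / jnz spin". */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        int expected = 0;
        /* Like lock cmpxchg: store 1 only if the lock is still 0. */
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
    }
}

/* Single attempt: returns 1 if the lock was taken, 0 otherwise. */
int spin_try_acquire(atomic_int *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

void spin_release(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The inner read-only loop keeps waiting threads from issuing locked operations while the lock is held, which is the same reason the assembly tests EAX before attempting the cmpxchg.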

`xchg` arg1,arg2

Atomic exchange. It can also be used to implement a spin lock, as follows:

      mov  EAX, 1
spin: xchg EAX, my_lock
      test EAX, EAX
      jnz  spin

`xadd` arg1,arg2

Exchange and Add:

tmp = arg1+arg2
arg2 = arg1
arg1 = tmp

With a lock prefix and a memory operand as arg1, this gives the atomic version of

arg1 += arg2

which returns the original value of arg1 (in arg2) while arg1 is incremented by arg2.
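In C11 the closest portable counterpart is `atomic_fetch_add`, which also returns the value from before the addition. The ticket counter below is a made-up illustration, not something from a particular library.

```c
#include <stdatomic.h>

/* Hypothetical ticket dispenser: every caller gets a distinct number,
   even when several threads call it at the same time. */
int take_ticket(atomic_int *next_ticket)
{
    /* Adds 1 to *next_ticket and returns the old value, which is the
       value that xadd leaves in arg2. */
    return atomic_fetch_add(next_ticket, 1);
}
```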

x86 calling conventions (Assembly view)

`push` arg

Place arg operand onto the top of the hardware supported stack in memory:

ESP -= 4
[ESP] = arg

`pop` arg

Remove data from top of the stack into arg:

arg = [ESP]
ESP += 4

`call` func

Push the address of the next instruction after itself onto the stack, and jump to func:

ESP -= 4
[ESP] = address of next instruction
EIP = address of func

`ret`

Pop off the hardware supported in-memory stack into EIP:

EIP = [ESP]
ESP += 4

`ret` n

Same as ret but will adjust ESP by n bytes as well:

EIP = [ESP]
ESP += (4+n)

`enter` n,0

Allocate a stack frame for a procedure and reserve n bytes from the stack for local variables:

push  EBP
mov   EBP, ESP
sub   ESP, n

`leave`

Deallocate the stack frame set up by an earlier enter instruction:

mov   ESP, EBP
pop   EBP

32-bit x86 calling conventions (C programmer's view)

(For more information, see The gen on function calling conventions)

cdecl

All parameters are on stack. EAX register holds the return value. This is the standard C calling convention.

; foo(1, 2, 3, 4);   // caller calls foo
push   4
push   3
push   2
push   1
call   foo
add    ESP, 16     ; adjust ESP to its former value (32-bit architecture)

GCC uses the following code snippet instead, which avoids the final add above:

; foo(1, 2, 3, 4);   // caller calls foo
sub    ESP, 16
mov    [ESP+12], 4
mov    [ESP+8],  3
mov    [ESP+4],  2
mov    [ESP],    1
call   foo

The callee code would be:

; foo(int x, int y, int z, int w)
push   EBP
mov    EBP, ESP
;   [EBP+8 ] is x
;   [EBP+12] is y
;   [EBP+16] is z
;   [EBP+20] is w
...
;   [EBP-4] would be the first local variable
pop    EBP     ; this line could be replaced by "leave"
ret

stdcall

All parameters are on stack. EAX register holds the return value. GCC generates the following code snippet:

; foo(1, 2, 3, 4);   // caller calls foo
mov    [ESP+12], 4
mov    [ESP+8],  3
mov    [ESP+4],  2
mov    [ESP],    1
call   foo
sub    ESP, 16     ; the callee's "ret 16" already popped the arguments; reserve the argument area again

And the callee code would be:

; __attribute__((stdcall)) foo(int x, int y, int z, int w)
push   EBP
mov    EBP, ESP
;   [EBP+8 ] is x
;   [EBP+12] is y
;   [EBP+16] is z
;   [EBP+20] is w
...
;   [EBP-4] would be the first local variable
pop    EBP     ; this line could be replaced by "leave"
ret    16

fastcall

ECX and EDX have the first two parameters, and the rest would be on stack and are treated like stdcall. EAX register holds the return value. GCC generates the following code snippet:

; foo(1, 2, 3, 4);   // caller calls foo
mov    [ESP+4], 4
mov    [ESP],   3
mov    EDX,     2
mov    ECX,     1
call   foo
sub    ESP, 8      ; the callee's "ret 8" already popped the stack arguments; reserve the argument area again

And the callee code would be:

; __attribute__((fastcall)) foo(int x, int y, int z, int w)
push   EBP
mov    EBP, ESP
sub    ESP, 8
mov    [EBP-4], ECX
mov    [EBP-8], EDX
;   [EBP-4 ] is x
;   [EBP-8 ] is y
;   [EBP+8 ] is z
;   [EBP+12] is w
...
leave
ret    8

64-bit x86 calling conventions (C programmer's view)

Linux

RDI, RSI, RDX, RCX, R8, R9 have the first six parameters (XMM0, .., XMM7 are used for floating point parameters), and the rest would be on stack. RAX register holds the return value.

RAX, RCX, RDX, RSI, RDI, R8 - R11 are scratch registers and their contents should be considered (from caller's perspective) clobbered after a function call. Callee is only responsible for the preservation of RBX, RSP, RBP, R12 - R15.

GCC generates the following code snippet:

; foo(1, 2, 3, 4);   // caller calls foo
mov    ECX, 4
mov    EDX, 3
mov    ESI, 2
mov    EDI, 1
call   foo

The callee code would be:

; foo(int x, int y, int z, int w)
push   RBP
mov    RBP, RSP
;   EDI is x
;   ESI is y
;   EDX is z
;   ECX is w
...
;   [RBP-4] would be the first local 32-bit integer variable
leave
ret

See here for details.

Windows

RCX, RDX, R8, R9 have the first four parameters (XMM0, .., XMM3 are used for floating point parameters), and the rest would be on stack. RAX register holds the return value. If the return data type is float, XMM0 register is used instead.

RAX, RCX, RDX, R8 - R11 are scratch registers and their contents should be considered (from caller's perspective) destroyed after a function call.

See here and here for details.

`-fomit-frame-pointer`, GDB, `alloca`, and all that

On x86 the frame pointer is the EBP register. As mentioned above, the callee normally has the following prolog:

push   EBP
mov    EBP, ESP

and the following epilog

leave          ; recall leave is a combination of 'mov ESP, EBP' & 'pop EBP'
ret

What's the purpose of the frame pointer and the above prolog/epilog? It allows the debugger (or us) to "walk through" the stack frames; the saved EBPs are like the "next" pointers in a C linked-list struct, as in the following pseudo code:

while (0 != EBP) {
  printf("Return address is %p\n", [EBP+4]);
  EBP = [EBP];
}

(assuming EBP is initialized to be 0)
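The pseudo code above can be turned into a small portable model in which each frame is an explicit struct instead of raw stack memory. The struct layout mirrors what `push EBP` / `mov EBP, ESP` leaves at `[EBP]` and `[EBP+4]`; the names `struct frame` and `walk_frames` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* What sits at the frame pointer in a frame built by the prolog. */
struct frame {
    const struct frame *saved_ebp; /* [EBP]  : caller's EBP, the "next" pointer */
    uintptr_t return_address;      /* [EBP+4]: pushed by the call instruction */
};

/* Follow the saved-EBP chain until it reaches NULL (EBP initialized to 0).
   Stores at most max return addresses in out and returns how many. */
size_t walk_frames(const struct frame *ebp, uintptr_t *out, size_t max)
{
    size_t n = 0;

    while (ebp != NULL && n < max) {
        out[n++] = ebp->return_address;
        ebp = ebp->saved_ebp;
    }
    return n;
}
```

A debugger does essentially this on the real stack. When a function skips the prolog, its frame is simply absent from the chain, so the walk jumps straight past it.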

A common compiler command-line switch for release/optimization builds is -fomit-frame-pointer. What it does is to not save previous EBP onto the stack, thus freeing EBP for other uses. Since x86 already has a meager set of registers, one extra register will not hurt.

The side effect, as mentioned in the GCC manual (here), is that it makes debugging impossible on some machines, since we lose the ability to walk through the stack frames. Here is an example. Compile the following code:

int foo(int x, int y, int z, int w) {
   return x+y+z+w;
}
int main() {
   foo(1,2,3,4);
}

With -fomit-frame-pointer, one can see from the assembly code that the prolog and the leave in the epilog are no longer generated. To see how this thwarts the debugging, strip the generated binaries (if you do not strip them, then newer version of GDB, e.g. 6.0, can gain stack frame knowledge through the embedded DWARF debugging information and display the correct back trace, with or without -fomit-frame-pointer) and run GDB. First, let's run nm to get the address of foo, then strip the binary, launch GDB and put a breakpoint on the address of foo. Without -fomit-frame-pointer, on 32-bit Linux, the backtrace would be like:

(gdb) break *0x8048317
(gdb) run
(gdb) bt
#0  0x08048317 in ?? ()
#1  0x08048365 in ?? ()
#2  0x55597e93 in __libc_start_main () from /lib/tls/libc.so.6
#3  0x08048291 in ?? ()

so we can see some function at 0x8048365 called foo. Do the same with -fomit-frame-pointer, and we have

(gdb) break *0x8048314
(gdb) run
(gdb) bt
#0  0x08048314 in ?? ()
#1  0x55597e93 in __libc_start_main () from /lib/tls/libc.so.6
#2  0x08048291 in ?? ()

In this case the backtrace isn't functioning correctly: one stack frame is missing, and we no longer know what happened before the call to foo. Also, the x86 ABI mandates that the value of EBP must be preserved across calls (that's why we have leave instructions at the end of callees to restore the original value of EBP), so with -fomit-frame-pointer on, EBP's value will always be 0.

Interestingly, if one tries above example with 64-bit compilation, all the stack frames can still be recovered with -fomit-frame-pointer. The secret lies in the .eh_frame section in 64-bit x86 binaries. (see here) To see the content of .eh_frame section, use

readelf --debug-dump=frames-interp a.out

If this section is removed (e.g. with objcopy -R .eh_frame a.out command), then the back trace would be a total mess (not just missing stack frames, as in 32-bit case, but showing garbage values/addresses).

The conclusion is: if you want to deter hackers, compile your program with -fomit-frame-pointer, strip it, AND remove or tamper with its .eh_frame section (the latter could crash GDB too).

A closely related topic to stack frames is the alloca function, which allocates memory from the stack and needs no explicit deallocation (actually, it cannot be explicitly deallocated). For x86-64, GCC generates the following code (shown in AT&T syntax) for, say, alloca(126):

  sub $144, %rsp
  mov %rsp, %rax
  add $15, %rax
  shr $4, %rax
  sal $4, %rax

What GCC does here is allocate 16 bytes more than requested and then round the size up to a multiple of 16, which gives 144. It then aligns the resulting address to a 16-byte boundary: RAX = (RAX+15)/16*16 (see GCC source tree: expand_builtin_alloca in gcc/builtins.c, allocate_dynamic_stack_space in gcc/explow.c and BIGGEST_ALIGNMENT macro in gcc/config/i386/i386.h, which is 128-bit, and this is where the 16-byte alignment comes from.)
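The arithmetic can be checked with a few lines of C. This is a sketch of the rule as described here (16 bytes of slack, then round up to 16), verified only against the alloca(126) case; other GCC versions may size the allocation differently. The function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_ALIGN 16u

/* Bytes subtracted from the stack pointer for alloca(n), following the
   description above: add 16 bytes, then round up to a multiple of 16. */
size_t alloca_reserved_bytes(size_t n)
{
    return (n + STACK_ALIGN + (STACK_ALIGN - 1)) / STACK_ALIGN * STACK_ALIGN;
}

/* The add $15 / shr $4 / sal $4 sequence: round an address up to the
   next 16-byte boundary. */
uintptr_t align_up_16(uintptr_t addr)
{
    return ((addr + 15) >> 4) << 4;
}
```

`alloca_reserved_bytes(126)` gives 144, the constant in the `sub` instruction, and `align_up_16` leaves an already aligned address unchanged.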

Recall that ESP is modified inside a function body ONLY based on the total size of local variables, and this information is known at compile time. This is important because at the end of the function, ESP must be restored to its original value so that ret can function correctly, i.e. pop the return address.

However, with alloca, ESP can be decremented arbitrarily, so how can ESP be restored? Moreover, how does automatic deallocation occur? The answer to both questions is to use the frame pointer EBP!

Recall EBP, if it is saved, is always kept constant through the function body. Put another way, if alloca is ever used inside a function, then that function will always explicitly preserve the frame pointer EBP with the aforementioned prolog/epilog pairs, and the -fomit-frame-pointer optimization will be ignored for that function.

System calls

System calls can be done by either calling them directly, e.g. sched_yield(), or via syscall interface, e.g. syscall(SYS_sched_yield). To use syscall interface, one must include header files unistd.h and syscall.h

The SYS_sched_yield is the system call number and is mapped to a number, which differs from system to system. For example, on 64-bit Linux SYS_sched_yield is mapped to 24 (as __NR_sched_yield in /usr/include/asm/unistd_64.h or /usr/include/asm-x86_64/unistd.h) while on 32-bit Linux, 158 (as __NR_sched_yield in /usr/include/asm/unistd_32.h or /usr/include/asm-i386/unistd.h)

In Glibc, system calls on 64-bit x86 Linux are handled using code in sysdeps/unix/sysv/linux/x86_64/sysdep.h In sched_yield()'s case, the code would be

 mov    EAX, 24
 syscall
 cmp    RAX, -4095
 jae    SYSCALL_ERROR_LABEL
END:
 ret
SYSCALL_ERROR_LABEL:
 mov    RCX, QWORD PTR errno@GOTTPOFF[RIP]
 xor    EDX, EDX
 sub    RDX, RAX
 mov    DWORD PTR FS:[RCX], EDX
 or     RAX, -1
 jmp    END

On 64-bit x86 Linux the parameters of a system call are passed via registers. The system call number goes into RAX, and RDI, RSI, RDX, R10, R8, R9 have the six parameters (system calls are limited to six parameters)

After a system call, the contents of the RCX and R11 registers should be considered clobbered, and the RAX register holds the return value. A value in the range between -4095 and -1 indicates an error; it is -errno.
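The `cmp RAX, -4095` / `jae` pair and the error path can be mirrored in C as follows. This is an illustrative sketch, not Glibc's actual macro; the function name `syscall_result` is made up.

```c
#include <errno.h>

/* Convert a raw kernel return value into the C library convention:
   on error set errno and return -1, otherwise return the value. */
long syscall_result(long raw)
{
    /* Unsigned comparison, like jae: the values -4095..-1 are exactly
       the ones that are >= (unsigned long)-4095. */
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw; /* the kernel returned -errno */
        return -1;
    }
    return raw;
}
```

The unsigned comparison is what lets a single branch cover the whole error range: viewed as unsigned numbers, -4095 through -1 are the 4095 largest values.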

GOTTPOFF is the thread local storage for global variables (in this case, errno); GOT is global offset table, TP is thread pointer. On 64-bit x86 Linux, the segment register FS is used as the thread register whose content is the TP. (On 32-bit x86 Linux, the segment register GS is used instead)

On 32-bit x86 Linux, system calls are handled using code in sysdeps/unix/sysv/linux/i386/sysdep.h. In sched_yield's case, the code is (if sysenter instruction is supported):

 mov    EAX, 158
 call   DWORD PTR GS:SYSINFO_OFFSET
 cmp    EAX, -4095
 jae    SYSCALL_ERROR_LABEL
END:
 ret
SYSCALL_ERROR_LABEL:
 call   __i686.get_pc_thunk.cx
 add    ECX, _GLOBAL_OFFSET_TABLE_
 mov    ECX, DWORD PTR [errno + ECX]
 xor    EDX, EDX
 sub    EDX, EAX
 mov    DWORD PTR GS:[ECX], EDX
 or     EAX, -1
 jmp    END

If sysenter instruction is not available, replace the call DWORD.. by int 0x80.

As in 64-bit mode, on 32-bit x86 Linux the parameters of a system call are passed via registers. The system call number goes into EAX, and EBX, ECX, EDX, ESI, EDI, EBP have the six parameters. The EAX register holds the return value. A value in the range between -4095 and -1 indicates an error; it is -errno.

__i686.get_pc_thunk.cx is used to get program counter (IP) into register ECX since 32-bit x86 has no RIP relative addressing mode as in 64-bit mode. Its code is very straightforward (see SETUP_PIC_REG macro here):

__i686.get_pc_thunk.cx:
  mov    ECX, DWORD PTR [ESP]
  ret

There are also similar functions such as __i686.get_pc_thunk.bx and __i686.get_pc_thunk.dx

For FreeBSD, see here

What are cancelable system calls and how does Glibc handle them?

sched_yield is a non-cancellation-point system call, but there are also system calls which are cancellation points as in POSIX thread specification, and their handling in Glibc is slightly complicated.

POSIX specifies a thread cancellation mechanism which allows a thread to terminate any other thread in a controlled manner. Each thread has the following cancellation information associated with it:

	Meaning	Bitmask in Glibc's `nptl/descr.h`	POSIX thread flag
Cancelability enabled ?	If disabled, cancellation requests against the thread are held pending. By default, cancelability is enabled.	`CANCELSTATE_BITMASK` set means Disabled	`PTHREAD_CANCEL_ENABLE` `PTHREAD_CANCEL_DISABLE`
Cancelability type ?	When cancelability is enabled and the cancelability type is Asynchronous, new or pending cancellation requests may be acted upon at any time. When cancelability is enabled and the cancelability type is Deferred, cancellation requests against the thread are held pending until a cancellation point is reached. If cancelability is disabled, the setting of the cancelability type has no immediate effect until cancelability is enabled. By default, cancelability type is Deferred.	`CANCELTYPE_BITMASK` true means Asynchronous	`PTHREAD_CANCEL_DEFERRED` `PTHREAD_CANCEL_ASYNCHRONOUS`
Any pending cancellation request ?		`CANCELING_BITMASK`
Thread is cancelled ?		`CANCELED_BITMASK`
Thread is being cancelled ?		`EXITING_BITMASK`

So what is a cancellation point? According to POSIX, a call to pthread_testcancel is a cancellation point. In addition, any call to functions (not just system calls!) which would cause blocking is a cancellation point. This link gives a list of such functions. Examples include system calls such as read, write, sleep, poll, wait; APIs which depend on the aforementioned system calls such as fgetc, fputc; thread APIs such as pthread_cond_wait, pthread_cond_timedwait, pthread_join, pthread_rwlock_*; and POSIX message queue and semaphore APIs such as mq_receive, mq_send, sem_wait, etc.

It's now clear that Glibc must do something to conform to Pthread's cancellation mechanism, and this leads to the following multi-entry function implementation (as in Glibc's nptl/sysdeps/unix/sysv/linux/x86_64/sysdep-cancel.h). Let's take open as an example. It now has TWO entry points: __open and __open_nocancel:

__open:
    cmp     __libc_multiple_threads, 0
    jne    PSEUDO_CANCEL

__open_nocancel:
    mov    EAX, 2
    syscall
    cmp    RAX, -4095
    jae    SYSCALL_ERROR_LABEL
    ret
PSEUDO_CANCEL:
    sub    RSP, 8
    call   __libc_enable_asynccancel
    mov    QWORD PTR [RSP], RAX   ; RAX holds the original Cancelability type
    mov    EAX, 2
    syscall
    mov    RDI, QWORD PTR [RSP]   ; now RDI holds the original Cancelability type
    mov    RDX, RAX
    call   __libc_disable_asynccancel
    mov    RAX, RDX
    add    RSP, 8
    cmp    RAX, -4095
    jae    SYSCALL_ERROR_LABEL
END:
    ret
SYSCALL_ERROR_LABEL:
    mov    RCX, QWORD PTR errno@GOTTPOFF[RIP]
    xor    EDX, EDX
    sub    RDX, RAX
    mov    DWORD PTR FS:[RCX], EDX
    or     RAX, -1
    jmp    END

So the entry function __open_nocancel does not check for any thread cancellation and the more generic entry point __open will see if the program is multi-threaded, and decide what to do next. It's also worth noting that __open_nocancel does not set up any stack frame either, but __open does.

Here is __libc_enable_asynccancel's code (as in nptl/sysdeps/unix/sysv/linux/x86_64/cancellation.S)

__libc_enable_asynccancel:
    mov    EAX, DWORD PTR FS:CANCELHANDLING
RETRY:
    mov    R11D, EAX
    or     R11D, CANCELTYPE_BITMASK
    cmp    R11D, EAX
    je     END

    lock cmpxchg DWORD PTR FS:CANCELHANDLING, R11D
    jne    RETRY

    and    R11D, CANCELSTATE_BITMASK|CANCELTYPE_BITMASK|CANCELED_BITMASK|EXITING_BITMASK|CANCEL_RESTMASK|TERMINATED_BITMASK
    cmp    R11D, CANCELTYPE_BITMASK|CANCELED_BITMASK
    je     UNWIND
END:
    ret
UNWIND:
    mov    QWORD PTR FS:RESULT, PTHREAD_CANCELED
    lock or DWORD PTR FS:CANCELHANDLING, EXITING_BITMASK
    mov    RDI, QWORD PTR FS:CLEANUP_JMP_BUF
    call   __pthread_unwind
    hlt

First, it loads the thread's cancellation information (CANCELHANDLING is a number: the offset into the thread's local data segment. This offset is calculated when Glibc is compiled, using the awk script at Glibc's scripts/gen-as-const.awk and the symbol file nptl/sysdeps/x86_64/tcb-offsets.sym) and checks whether the Cancelability type is already Asynchronous. If so, it returns immediately; if not, it sets the type to Asynchronous temporarily with a lock cmpxchg retry loop. (The original Cancelability type value is in the EAX register.) Then it checks whether the thread has Cancelability enabled, has Asynchronous cancelability type, has been cancelled, and is not yet exiting or terminated. If so, it sets up the relevant flags and calls __pthread_unwind.
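The retry loop is easier to follow in C11 atomics. The sketch below copies only the shape of the assembly. The bit values are placeholders for illustration (the authoritative ones are in nptl/descr.h), CANCEL_RESTMASK is left out of the mask, and the function name is hypothetical. Instead of calling __pthread_unwind it only reports whether the assembly would branch to UNWIND.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Placeholder bit values for this sketch only. */
enum {
    CANCELSTATE_BITMASK = 0x01,
    CANCELTYPE_BITMASK  = 0x02,
    CANCELED_BITMASK    = 0x08,
    EXITING_BITMASK     = 0x10,
    TERMINATED_BITMASK  = 0x20
};

/* Returns the flags from before the call (what the assembly leaves in
   EAX) and sets *must_unwind if the UNWIND branch would be taken. */
int enable_asynccancel_sketch(atomic_int *cancelhandling, bool *must_unwind)
{
    int oldval = atomic_load(cancelhandling);
    int newval;

    *must_unwind = false;
    for (;;) {
        newval = oldval | CANCELTYPE_BITMASK;
        if (newval == oldval)
            return oldval; /* already asynchronous: "je END" */
        /* Like lock cmpxchg: on failure oldval is refreshed with the
           current value and the loop retries ("jne RETRY"). */
        if (atomic_compare_exchange_weak(cancelhandling, &oldval, newval))
            break;
    }

    int mask = CANCELSTATE_BITMASK | CANCELTYPE_BITMASK | CANCELED_BITMASK
             | EXITING_BITMASK | TERMINATED_BITMASK;
    *must_unwind =
        (newval & mask) == (CANCELTYPE_BITMASK | CANCELED_BITMASK);
    return oldval;
}
```

The compare-exchange is needed because another thread may modify the flags between the load and the store; a plain read-modify-write could lose that update.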

Note that __pthread_enable_asynccancel in libpthread and __librt_enable_asynccancel in librt both have the same code as __libc_enable_asynccancel. In fact, the system call __open and __open_nocancel have the same implementation in these two libraries as well.

Here is __libc_disable_asynccancel's code (also in nptl/sysdeps/unix/sysv/linux/x86_64/cancellation.S)

__libc_disable_asynccancel:
    test   EDI, CANCELTYPE_BITMASK
    jne    END

    mov    EAX, DWORD PTR FS:CANCELHANDLING
RETRY:
    mov    R11D, EAX
    and    R11D, ~CANCELTYPE_BITMASK
    lock cmpxchg DWORD PTR FS:CANCELHANDLING, R11D
    jne    RETRY

    mov    EAX, R11D
RETRY2:
    and    EAX, CANCELING_BITMASK|CANCELED_BITMASK
    cmp    EAX, CANCELING_BITMASK
    je     FUTEX
END:
    ret

FUTEX:
    mov    RDI, QWORD PTR FS:0
    mov    EAX, SYS_futex
    xor    R10, R10
    add    RDI, CANCELHANDLING
    mov    ESI, DWORD PTR FS:PRIVATE_FUTEX
    syscall
    mov    EAX, DWORD PTR FS:CANCELHANDLING
    jmp    RETRY2

First, it checks whether the original Cancelability type (now in the EDI register; see the code between the calls to __libc_enable_asynccancel and __libc_disable_asynccancel above) was Asynchronous. If so, there is nothing to undo, so it just returns. If not, it sets the type back to Deferred. It then checks whether the thread is being cancelled (a cancellation request is pending but the thread is not yet marked as cancelled). If so, it invokes the futex system call. __libc_disable_asynccancel will not return until the thread is no longer in the "being cancelled" state.

Like __libc_enable_asynccancel, __libc_disable_asynccancel is renamed to __pthread_disable_asynccancel in libpthread and __librt_disable_asynccancel in librt; the code is, of course, identical.

Finally, a platform-neutral high-level C language implementation of __libc_enable_asynccancel and __libc_disable_asynccancel can be found in Glibc's nptl/cancellation.c

What are vsyscalls and VDSO ?

A vsyscall is a system call that avoids crossing the userspace-kernel boundary. To see why a vsyscall is implemented in Linux, see here

For dynamic executable binaries, there is also a memory segment called [vdso], as shown below:

$ LD_SHOW_AUXV=true cat /proc/self/maps

AT_SYSINFO_EHDR: 0x7fffad451000
AT_HWCAP:        78afbfd
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x400040
AT_PHENT:        56
AT_PHNUM:        9
AT_BASE:         0x7f6dd34b7000
AT_FLAGS:        0x0
AT_ENTRY:        0x401850
AT_UID:          254374
AT_EUID:         254374
AT_GID:          16038
AT_EGID:         16038
AT_SECURE:       0
AT_RANDOM:       0x7fffad42b999
AT_EXECFN:       /bin/cat
AT_PLATFORM:     x86_64
00400000-0040d000 r-xp 00000000 fb:00 2097154            /bin/cat
0060d000-0060e000 r--p 0000d000 fb:00 2097154            /bin/cat
0060e000-0060f000 rw-p 0000e000 fb:00 2097154            /bin/cat
01e04000-01e25000 rw-p 00000000 00:00 0                  [heap]

   ..... (useless entries omitted)

7f6dd36d7000-7f6dd36d8000 r--p 00020000 fb:00 1576914    /lib/ld-2.11.1.so
7f6dd36d8000-7f6dd36d9000 rw-p 00021000 fb:00 1576914    /lib/ld-2.11.1.so
7f6dd36d9000-7f6dd36da000 rw-p 00000000 00:00 0
7fffad418000-7fffad42d000 rw-p 00000000 00:00 0          [stack]
7fffad451000-7fffad452000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

VDSO

VDSO (Virtual Dynamically-linked Shared Object) in Linux is a kernel-provided shared library that helps userspace perform a few kernel actions without the overhead of a system call, as well as automatically choosing the most efficient system call mechanism.

In the above cat /proc/self/maps output, note the address 7fffad451000 following AT_SYSINFO_EHDR and the starting address of the [vdso] segment. They are the same, and this is not a coincidence; this is how the kernel passes the address of the VDSO to ld.so at runtime: through the auxiliary vector (this is why we set LD_SHOW_AUXV=true when running cat).
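A program can read the same auxiliary vector entry itself. The sketch below assumes Linux with a Glibc recent enough to provide `getauxval` in `<sys/auxv.h>` (2.16 or later); the helper name `vdso_looks_like_elf` is made up.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns 1 if the address stored under AT_SYSINFO_EHDR points at
   something that starts with the ELF magic bytes, 0 otherwise
   (including when the kernel supplied no such entry). */
int vdso_looks_like_elf(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return 0;
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

On a typical Linux system this returns 1, which is a quick way to confirm that the address in the auxiliary vector is the start of an ELF image.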

This VDSO is loaded into userspace by the kernel regardless of whether the user executable is dynamically or statically linked. So how does the kernel load it into userspace? To understand this, one can see the original VDSO kernel patch for x86_64 here. Simply put, in the load_elf_binary function in Linux kernel source file fs/binfmt_elf.c there is a call to the arch_setup_additional_pages function. For x86_64, arch_setup_additional_pages is defined in arch/x86_64/vdso/vma.c. In the same source file, one can find that the variables vdso_start and vdso_end (defined in arch/x86/vdso/vdso.S) point to the start and end of vdso.so, which is indeed compiled as an independent object file during kernel build and is included using the .incbin directive in arch/x86/vdso/vdso.S.

There are other VDSO's created during kernel build, such as vdso32-int80.so, vdso32-sysenter.so, or vdso32-syscall.so corresponding to different system call approaches in 32-bit x86 Linux.

How does ld.so capture and make use of this AT_SYSINFO_EHDR info ? First, in _dl_aux_init function (see Glibc's source file elf/dl-support.c) one can find code snippet like this:

    case AT_SYSINFO_EHDR:
      GL(dl_sysinfo_dso) = (void *) av->a_un.a_val;
      break;

so the value corresponding to AT_SYSINFO_EHDR tag is saved to a global variable GL(dl_sysinfo_dso). Later on, in dl_main (see Glibc's source file elf/rtld.c) look at the code block if (GLRO(dl_sysinfo_dso) != NULL) { .. or the code near the comment "Initialize l_local_scope to contain just this map", one can find GLRO(dl_sysinfo_dso) is copied into the first entry of l_local_scope. This allows the _dl_vdso_vsym function (see Glibc's source file sysdeps/unix/sysv/linux/dl-vdso.c) to resolve any symbolic references using the symbols in [vdso] first.

In newer versions (e.g. 2.6.32) of the Linux kernel, the address pointed to by AT_SYSINFO_EHDR changes on every run because of the Address Space Layout Randomization (ASLR) feature. There are many levels of ASLR in Linux, e.g. stack, brk, .so/mmap, VDSO, etc. To see if ASLR is enabled or not, run the command

$ sysctl kernel.randomize_va_space

If the value is not 0 (e.g. 1 or 2, depending on CONFIG_COMPAT_BRK; see Linux kernel source file mm/memory.c and search for the keyword randomize_va_space in Linux kernel source tree), then it is enabled. To run a command without ASLR, use the setarch command, e.g.

$ setarch `uname -m` -R cat /proc/self/maps

or simply execute setarch `uname -m` -R which will open a shell with ASLR disabled. Newer GDB also disables ASLR by default (the set disable-randomization setting defaults to on); GDB achieves this by calling personality(orig_personality|ADDR_NO_RANDOMIZE) after fork and before exec.

To see that [vdso] is indeed an ELF shared object, we can use GDB to dump it to a file, say /tmp/vdso.so as follows (there is another approach, based on the uncompressed Linux kernel; see the next paragraph):

$ gdb --quiet /bin/ls
Reading symbols from /bin/ls...(no debugging symbols found)...done.
(gdb) tbreak __open
Function "__open" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (__open) pending.
(gdb) run
Starting program: /bin/ls
[Thread debugging using libthread_db enabled]

Breakpoint 1, 0x00007ffff7503110 in open64 () from /lib/libc.so.6
(gdb) info inferiors
  Num  Description       Executable
* 1    process 19213     /bin/ls
(gdb) shell cat /proc/19213/maps|grep vdso
7ffff7ffb000-7ffff7ffc000 r-xp 00000000 00:00 0  [vdso]
(gdb) dump memory /tmp/vdso.so 0x7ffff7ffb000 0x7ffff7ffc000

One can check that /tmp/vdso.so is indeed an ELF shared object:

$ readelf -h /tmp/vdso.so

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0xffffffffff700600
  Start of program headers:          64 (bytes into file)
  Start of section headers:          2656 (bytes into file)
  Flags:                             0x0
  ....

Its SONAME is linux-vdso.so.1:

$ readelf -d /tmp/vdso.so

Dynamic section at offset 0x460 contains 10 entries:
  Tag        Type                         Name/Value
 0x000000000000000e (SONAME)             Library soname: [linux-vdso.so.1]
 0x0000000000000004 (HASH)               0xffffffffff700120
 0x0000000000000005 (STRTAB)             0xffffffffff700230
 0x0000000000000006 (SYMTAB)             0xffffffffff700158
 0x000000000000000a (STRSZ)              82 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x000000006ffffffc (VERDEF)             0xffffffffff700298
 0x000000006ffffffd (VERDEFNUM)          2
 0x000000006ffffff0 (VERSYM)             0xffffffffff700282
 0x0000000000000000 (NULL)               0x0

The most interesting part is the symbols it exports:

$ readelf -s /tmp/vdso.so

Symbol table '.dynsym' contains 9 entries:
Num:    Value          Size Type    Bind   Vis      Ndx Name
  0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
  1: ffffffffff70030c     0 SECTION LOCAL  DEFAULT    7
  2: ffffffffff7008d0   156 FUNC    WEAK   DEFAULT   12 clock_gettime@@LINUX_2.6
  3: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.6
  4: ffffffffff700790   138 FUNC    GLOBAL DEFAULT   12 __vdso_gettimeofday@@LINUX_2.6
  5: ffffffffff700970    61 FUNC    GLOBAL DEFAULT   12 __vdso_getcpu@@LINUX_2.6
  6: ffffffffff700790   138 FUNC    WEAK   DEFAULT   12 gettimeofday@@LINUX_2.6
  7: ffffffffff700970    61 FUNC    WEAK   DEFAULT   12 getcpu@@LINUX_2.6
  8: ffffffffff7008d0   156 FUNC    GLOBAL DEFAULT   12 __vdso_clock_gettime@@LINUX_2.6

The layout of vdso.so is determined by the linker scripts arch/x86/vdso/vdso.lds.S and arch/x86/vdso/vdso-layout.lds.S in Linux kernel source.

Three weak symbols are exported and they correspond to three system calls: clock_gettime, gettimeofday, and getcpu. Of the three, only clock_gettime and gettimeofday will be used by Glibc. See below for a comprehensive example on how they actually work.

Vsyscall

In the cat /proc/self/maps command output, there is a memory segment called [vsyscall] with starting address 0xffff ffff ff60 0000. Recall that there are four code models in x86_64/AMD64 and the "kernel" code model has its 2 GB address space spanning from 0xffff ffff 8000 0000 to 0xffff ffff ff00 0000, and this [vsyscall] segment is not within this range. To verify this, look at the last few lines of the System.map file at either /boot/System.map-`uname -r` or /lib/modules/`uname -r`/build/System.map (/proc/kallsyms contains the same information, but /proc/kallsyms also contains kernel module information):

$ tail -20  /boot/System.map-`uname -r`

ffffffff81a4ca90 b rfkill_master_switch_op
ffffffff81a4ca94 b rfkill_op
ffffffff81a4ca98 b rfkill_last_scheduled
ffffffff81a4caa0 b klist_remove_lock
ffffffff81a4caa4 B __bss_stop
ffffffff81a4d000 B __brk_base
ffffffff81a5d000 b .brk.dmi_alloc
ffffffff81a6d000 B __brk_limit
ffffffffff600000 T vgettimeofday
ffffffffff600140 t vread_tsc
ffffffffff600170 t vread_hpet
ffffffffff600180 D __vsyscall_gtod_data
ffffffffff600400 T vtime
ffffffffff600800 T vgetcpu
ffffffffff600880 D __vgetcpu_mode
ffffffffff6008c0 D __jiffies
ffffffffff700000 A VDSO64_PRELINK
ffffffffff700550 A VDSO64_jiffies
ffffffffff700558 A VDSO64_vgetcpu_mode
ffffffffff700560 A VDSO64_vsyscall_gtod_data
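
The range check described above can be written down as a tiny helper. This is only a sketch: the names KERNEL_MODEL_START, KERNEL_MODEL_END and in_kernel_code_model are made up for illustration, and treating the range as half-open is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds of the "kernel" code model range as quoted in the text.
 * The macro names are hypothetical, not taken from any kernel header. */
#define KERNEL_MODEL_START UINT64_C(0xffffffff80000000)
#define KERNEL_MODEL_END   UINT64_C(0xffffffffff000000)

/* Returns true when addr lies in [KERNEL_MODEL_START, KERNEL_MODEL_END).
 * Assumption: the upper bound is exclusive. */
bool in_kernel_code_model(uint64_t addr)
{
    return addr >= KERNEL_MODEL_START && addr < KERNEL_MODEL_END;
}
```

With the addresses from the System.map listing, rfkill_master_switch_op (0xffffffff81a4ca90) falls inside the range, while vgettimeofday (0xffffffffff600000) and VDSO64_PRELINK (0xffffffffff700000) fall outside it.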

How is 0xffff ffff ff60 0000 determined ? In /usr/include/asm/vsyscall.h one can see

    enum vsyscall_num {
       __NR_vgettimeofday,
       __NR_vtime,
       __NR_vgetcpu,
    };
    #define VSYSCALL_START (-10UL << 20)
    #define VSYSCALL_SIZE 1024
    #define VSYSCALL_END (-2UL << 20)
    #define VSYSCALL_MAPPED_PAGES 1
    #define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

so VSYSCALL_START is 0xffff ffff ff60 0000
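
To double-check the arithmetic, here is a minimal sketch that redoes the computation with a fixed-width 64-bit type. The DEMO_ names and demo_vsyscall_addr are hypothetical stand-ins for the header's macros, so the example does not depend on the header being installed.

```c
#include <stdint.h>

/* Same arithmetic as VSYSCALL_START and VSYSCALL_SIZE in the header,
 * spelled with uint64_t so the result does not depend on sizeof(long). */
#define DEMO_VSYSCALL_START ((uint64_t)-10 << 20)
#define DEMO_VSYSCALL_SIZE  UINT64_C(1024)

/* Counterpart of VSYSCALL_ADDR(vsyscall_nr): each vsyscall gets its own
 * 1024-byte slot starting at DEMO_VSYSCALL_START. */
uint64_t demo_vsyscall_addr(unsigned nr)
{
    return DEMO_VSYSCALL_START + DEMO_VSYSCALL_SIZE * nr;
}
```

Slot 0 (__NR_vgettimeofday) evaluates to 0xffffffffff600000, slot 1 (__NR_vtime) to 0xffffffffff600400 and slot 2 (__NR_vgetcpu) to 0xffffffffff600800, which matches the System.map listing.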

In /usr/include/asm/vsyscall.h there are three vsyscalls defined: gettimeofday, time, and getcpu. Note that getcpu is a system call in 32-bit Linux (its number is 318), but not so in 64-bit Linux, so in 64-bit Linux it is implemented as a vsyscall. To use getcpu, use the wrapper function sched_getcpu in Glibc.

The source code of these three vsyscalls is in Linux kernel source file arch/x86/kernel/vsyscall_64.c. In this file, look at functions:

   int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz) {
      ...
   }

   time_t __vsyscall(1) vtime(time_t *t) {
      ...
   }

   long __vsyscall(2)
   vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache) {
      ...
   }

The macro __vsyscall ties vgettimeofday to ELF section .vsyscall_0, vtime to section .vsyscall_1, and vgetcpu to section .vsyscall_2. How, then, are the addresses of the .vsyscall_X sections determined? In Linux kernel source file arch/x86/kernel/vmlinux.lds.S one can find:

    #ifdef CONFIG_X86_64

    #define VSYSCALL_ADDR (-10*1024*1024)

    #define VLOAD_OFFSET (VSYSCALL_ADDR - __vsyscall_0 + LOAD_OFFSET)
    #define VLOAD(x) (ADDR(x) - VLOAD_OFFSET)

    ....

    . = ALIGN(4096);
    __vsyscall_0 = .;

    . = VSYSCALL_ADDR;
    .vsyscall_0 : AT(VLOAD(.vsyscall_0)) {
        *(.vsyscall_0)
    } :user

     ....

    .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
        *(.vsyscall_1)
    }
    .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) {
        *(.vsyscall_2)
    }

Thus, vgettimeofday (i.e. section .vsyscall_0) can be found at address 0xffff ffff ff60 0000 (-10*1024*1024, or -10UL << 20 as in /usr/include/asm/vsyscall.h), vtime (i.e. section .vsyscall_1) can be found at 0xffff ffff ff60 0000 + 1024 = 0xffff ffff ff60 0400, and vgetcpu (i.e. section .vsyscall_2) can be found at address 0xffff ffff ff60 0000 + 2048 = 0xffff ffff ff60 0800.
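
The first half of this mechanism, tying a function to a named section, can be tried in user space. The sketch below assumes GCC or Clang on an ELF target; DEMO_SECTION, demo_first and demo_second are hypothetical names, and the macro is only modelled on how __vsyscall is used above, not copied from the kernel.

```c
/* Hypothetical helper: place a function in section ".demo_<nr>".
 * Relies on the GCC/Clang "section" attribute (ELF targets assumed). */
#define DEMO_SECTION(nr) __attribute__((section(".demo_" #nr)))

/* Same declaration shape as "int __vsyscall(0) vgettimeofday(...)". */
int DEMO_SECTION(0) demo_first(void)
{
    return 0;
}

int DEMO_SECTION(1) demo_second(void)
{
    return 1;
}
```

After compiling this to an object file, readelf -S should list .demo_0 and .demo_1 as separate sections. Assigning them fixed addresses, as the kernel does for .vsyscall_X, would additionally require a linker script.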

To see them in the Linux kernel, use the same trick as the extract-ikconfig script to unpack the compressed Linux kernel image vmlinuz. Here we assume the kernel is compressed using gzip (it can also be compressed with other algorithms such as bzip2, lzma, or lzo):

$ HDR=`binoffset /boot/vmlinuz-$(uname -r) 0x1f 0x8b 0x08 0x0`
$ dd if=/boot/vmlinuz-$(uname -r) bs=1 skip=$HDR | zcat - > /tmp/vmlinux

binoffset is a utility whose source file can usually be found under /usr/src/linux-XXX/scripts/.

0x1f 0x8b 0x08 0x0 marks the beginning of a gzipped stream: 0x1f 0x8b is the gzip magic number, 0x08 selects the deflate compression method, and 0x0 is an empty flags byte.
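
The search that binoffset performs here boils down to finding a byte pattern in a buffer. A minimal sketch of that idea follows; find_bytes is a hypothetical helper written for this note, not the actual binoffset source.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first occurrence of pat[0..patlen) inside
 * buf[0..len), or -1 when the pattern is empty or does not occur. */
long find_bytes(const unsigned char *buf, size_t len,
                const unsigned char *pat, size_t patlen)
{
    if (patlen == 0 || patlen > len)
        return -1;
    for (size_t i = 0; i + patlen <= len; i++) {
        if (memcmp(buf + i, pat, patlen) == 0)
            return (long)i;
    }
    return -1;
}
```

Reading vmlinuz into memory and calling find_bytes with the four bytes 0x1f 0x8b 0x08 0x00 would give the value to pass to dd as skip.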

In newer versions (e.g. 2.6.32) of Linux kernel, /tmp/vmlinux is an ELF object:

$ readelf -S /tmp/vmlinux

 ....

  [19] .vsyscall_0       PROGBITS         ffffffffff600000  00c00000
       0000000000000116  0000000000000000  AX       0     0     16
  [20] .vsyscall_fn      PROGBITS         ffffffffff600140  00c00140
       000000000000003f  0000000000000000  AX       0     0     16
  [21] .vsyscall_gtod_da PROGBITS         ffffffffff600180  00c00180
       0000000000000060  0000000000000000  WA       0     0     16
  [22] .vsyscall_1       PROGBITS         ffffffffff600400  00c00400
       000000000000003d  0000000000000000  AX       0     0     16
  [23] .vsyscall_2       PROGBITS         ffffffffff600800  00c00800
       0000000000000075  0000000000000000  AX       0     0     16
 ...

With the knowledge of System.map, we can also extract vdso.so from the uncompressed kernel image /tmp/vmlinux. As mentioned in the previous paragraph, the variables vdso_start and vdso_end defined in arch/x86/vdso/vdso.S point to the start and end of vdso.so, and they are indeed in the System.map:

$ grep vdso /boot/System.map-`uname -r`

ffffffff81860080 D vdso_enabled
ffffffff81896932 t vdso_setup
ffffffff81896949 t init_vdso_vars
ffffffff81896b51 t vdso_setup
ffffffff81896b7f t relocate_vdso
ffffffff818cd024 T vdso_start
ffffffff818cde84 T vdso32_int80_start
ffffffff818cde84 T vdso_end
ffffffff818ce520 T vdso32_int80_end
ffffffff818ce520 T vdso32_syscall_start
ffffffff818cebc8 T vdso32_syscall_end
ffffffff818cebc8 T vdso32_sysenter_start
ffffffff818cf274 T vdso32_sysenter_end
ffffffff818dde8b t __setup_str_vdso_setup
ffffffff818dde91 t __setup_str_vdso_setup
ffffffff81911ac0 t __setup_vdso_setup
ffffffff81911ad8 t __setup_vdso_setup
ffffffff819129b8 t __initcall_init_vdso_vars6
ffffffff81973b20 b vdso_pages
ffffffff81973b28 b vdso_size
ffffffff81973b30 b vdso32_pages

One can check that the address ffffffff818cd024 is in the .init.data section. We can use objdump to verify the content is indeed an ELF object:

$ objdump -sj .init.data --start-address=0xffffffff818cd024 --stop-address=0xffffffff818cde84 /tmp/vmlinux

/tmp/vmlinux:     file format elf64-x86-64

Contents of section .init.data:
 ffffffff818cd024 7f454c46 02010100 00000000 00000000  .ELF............
 ffffffff818cd034 03003e00 01000000 000670ff ffffffff  ..>.......p.....
 ffffffff818cd044 40000000 00000000 600a0000 00000000  @.......`.......
 ffffffff818cd054 00000000 40003800 04004000 10000f00  ....@.8...@.....
  ....
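
The first five bytes of the dump already tell us what we are looking at: 7f 45 4c 46 is the ELF magic, and the byte after it is the class (01 for ELF32, 02 for ELF64). A small sketch of that check, with elf_class_bits being a hypothetical helper name:

```c
#include <stddef.h>
#include <string.h>

/* Inspects the start of an ELF identification block.
 * Returns 32 or 64 according to the class byte at index 4, or 0 when the
 * buffer is too short, lacks the ELF magic, or has an unknown class. */
int elf_class_bits(const unsigned char *p, size_t len)
{
    static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};

    if (len < 5 || memcmp(p, magic, sizeof magic) != 0)
        return 0;
    switch (p[4]) {
    case 1:  return 32;   /* ELFCLASS32 */
    case 2:  return 64;   /* ELFCLASS64 */
    default: return 0;
    }
}
```

For the bytes 7f 45 4c 46 02 shown in the objdump output the helper reports a 64-bit ELF object.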

Now we need to create a binary file from the above hexdump. To achieve this, first run the above command and redirect the result to a file, say foo. Next, use:

$ awk ' /ffffffff8/ { print $2,$3,$4,$5 }' foo | xxd -r -p > /tmp/vdso.so

and one can now verify that this /tmp/vdso.so is the same as the one created using the GDB approach in the previous paragraph. (xxd is distributed with Vim, in case it is not available on your system.)
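
For readers without xxd, the conversion step (plain hex digits back to raw bytes) is easy to sketch. The helper below is hypothetical and stricter than xxd -r -p: it accepts only hex digits and whitespace and rejects everything else.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Decodes the hex digits in text into out[0..cap), skipping whitespace.
 * Returns the number of bytes written, or -1 on a non-hex character,
 * an odd number of digits, or an output buffer that is too small. */
long hex_to_bytes(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;
    int hi = -1;                      /* pending high nibble, -1 if none */

    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        if (isspace(c))
            continue;
        int v = hex_value(c);
        if (v < 0)
            return -1;
        if (hi < 0) {
            hi = v;
        } else {
            if (n == cap)
                return -1;
            out[n++] = (unsigned char)((hi << 4) | v);
            hi = -1;
        }
    }
    return hi < 0 ? (long)n : -1;
}
```

Feeding it the four hex columns that the awk command prints for each line, and writing the decoded bytes to a file, should reproduce the effect of the xxd step.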

Note that in older versions (e.g. 2.6.9) of the Linux kernel, /tmp/vmlinux is a raw binary image created by a command like the following:

objcopy -O binary -R .note -R .comment -S [inputELFobj] vmlinux

(To wit, search for -O binary -R .note -R .comment -S or OBJCOPYFLAGS among the Makefiles in the Linux kernel source tree.) The above command strips all symbolic and debugging information (-S option), strips the .note and .comment sections, and creates a raw binary image (-O binary option), which can be loaded into memory as it is. The raw binary image creation is implemented in the GNU binutils source file bfd/binary.c (in particular, look at the function binary_set_section_contents).

A simple example shows how this raw binary is created from an ELF executable binary. Suppose we have an ELF executable binary a.out which has the following sections:

$ readelf -t a.out

There are 24 section headers, starting at offset 0x71898:

Section Headers:
  [Nr] Name
       Type              Address          Offset            Link
       Size              EntSize          Info              Align
       Flags
  [ 0]
       NULL             0000000000000000  0000000000000000  0
       0000000000000000 0000000000000000  0                 0
       [0000000000000000]:
  [ 1] .note.ABI-tag
       NOTE             0000000000400120  0000000000000120  0
       0000000000000020 0000000000000000  0                 4
       [0000000000000002]: ALLOC
  [ 2] .init
       PROGBITS         0000000000400140  0000000000000140  0
       0000000000000018 0000000000000000  0                 4
       [0000000000000006]: ALLOC, EXEC
  [ 3] .text
       PROGBITS         0000000000400160  0000000000000160  0
       0000000000052e48 0000000000000000  0                 16
       [0000000000000006]: ALLOC, EXEC

  .....

  [18] .got.plt
       PROGBITS         00000000006700d8  00000000000700d8  0
       0000000000000018 0000000000000008  0                 8
       [0000000000000003]: WRITE, ALLOC
  [19] .data
       PROGBITS         0000000000670100  0000000000070100  0
       00000000000015e8 0000000000000000  0                 32
       [0000000000000003]: WRITE, ALLOC
  [20] .bss
       NOBITS           0000000000671700  00000000000716e8  0
       0000000000002440 0000000000000000  0                 32
       [0000000000000003]: WRITE, ALLOC
  [21] __libc_freeres_ptrs
       NOBITS           0000000000673b40  00000000000716e8  0
       0000000000000030 0000000000000000  0                 8
       [0000000000000003]: WRITE, ALLOC
  [22] .comment
       PROGBITS         0000000000000000  00000000000716e8  0
       00000000000000bf 0000000000000000  0                 1
       [0000000000000000]:
  [23] .shstrtab
       STRTAB           0000000000000000  00000000000717a7  0
       00000000000000ec 0000000000000000  0                 1
       [0000000000000000]:

What binary_set_section_contents does is the following. First, it keeps only those sections that have the ALLOC flag and are not of NOBITS type (NOBITS means the section takes no space in the ELF binary file), so sections such as .bss or .comment are left out. Next, it finds the lowest Load Memory Address (LMA) among the remaining sections, which is 0x400120 (the .note.ABI-tag section) in the above example. 0x400120 becomes the new base address, and binary_set_section_contents dumps the remaining sections one by one by doing an lseek to offset = section LMA - 0x400120 and then writing the section content. In the above example, the remaining section with the highest LMA is .data, and its size is 0x15e8, so the resulting raw binary image has file size 0x670100 - 0x400120 + 0x15e8 = 2561480:

$ objcopy -O binary a.out raw
$ du -B1 --apparent-size raw
2561480 raw
$ du -B1 -s raw
475136 raw

(Files created by objcopy -O binary can be "sparse", i.e. contain "holes", as the two du commands above indicate.)
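
The size calculation can be replayed in code. The sketch below is a simplified model of the selection rule described above, not the binutils implementation; struct demo_section and raw_image_size are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a section header: only the fields the rule needs. */
struct demo_section {
    const char *name;
    uint64_t lma;      /* load memory address */
    uint64_t size;
    int alloc;         /* nonzero if the ALLOC flag is set */
    int nobits;        /* nonzero if the type is NOBITS */
};

/* Size of the raw image: distance from the lowest LMA to the highest end
 * address among sections that are ALLOC and not NOBITS. Returns 0 when
 * no section qualifies. */
uint64_t raw_image_size(const struct demo_section *s, size_t n)
{
    uint64_t lo = UINT64_MAX, hi = 0;

    for (size_t i = 0; i < n; i++) {
        if (!s[i].alloc || s[i].nobits)
            continue;
        if (s[i].lma < lo)
            lo = s[i].lma;
        if (s[i].lma + s[i].size > hi)
            hi = s[i].lma + s[i].size;
    }
    return hi > lo ? hi - lo : 0;
}
```

Filling an array with the sections listed by readelf -t (.note.ABI-tag, .init, .text, .got.plt, .data, .bss, .comment) gives 2561480, the apparent size reported by du.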

If /tmp/vmlinux is a raw binary image, we can force objdump to disassemble it as follows (using Linux kernel version 2.6.9 as an example):

$ objdump -b binary -m i386:x86-64 -D /tmp/vmlinux

/tmp/vmlinux:     file format binary

Disassembly of section .data:

0000000000000000 <.data>:
  0: 89 dd                   mov    %ebx,%ebp
  2: b8 18 00 00 00          mov    $0x18,%eax
  7: 8e d8                   mov    %eax,%ds
  9: b8 00 00 00 80          mov    $0x80000000,%eax
  e: 0f a2                   cpuid
 10: 3d 00 00 00 80          cmp    $0x80000000,%eax
 15: 0f 86 35 01 00 00       jbe    0x150
 1b: b8 01 00 00 80          mov    $0x80000001,%eax
 20: 0f a2                   cpuid
 22: 0f ba e2 1d             bt     $0x1d,%edx
 26: 0f 83 24 01 00 00       jae    0x150
....

To make sure /tmp/vmlinux indeed has no special format or structure and objdump disassembles the file from the beginning, compare it with output of hexdump:

$ hexdump -v -C /tmp/vmlinux

00000000  89 dd b8 18 00 00 00 8e  d8 b8 00 00 00 80 0f a2  |................|
00000010  3d 00 00 00 80 0f 86 35  01 00 00 b8 01 00 00 80  |=......5........|
00000020  0f a2 0f ba e2 1d 0f 83  24 01 00 00 89 d7 31 c0  |........$.....1.|
00000030  0f ba e8 05 0f ba e8 07  0f 22 e0 b8 00 10 10 00  |........."......|
00000040  0f 22 d8 b9 80 00 00 c0  0f 32 0f ba e8 08 0f ba  |.".......2......|
...

Cross-check with the corresponding System.map:

$ head /boot/System.map-`uname -r`

ffffffff80100000 A _text
ffffffff80100000 t startup_32
ffffffff80100081 t reach_compatibility_mode
ffffffff8010008e t second
ffffffff80100100 t reach_long64
ffffffff8010013d T initial_code
ffffffff80100145 T init_rsp
ffffffff80100150 T no_long_mode
ffffffff80100f00 T pGDT32
ffffffff80100f10 t ljumpvector

we can learn that the code starts at address 0xffff ffff 8010 0000. However, we still do not know whether the beginning of the raw binary image /tmp/vmlinux is also the beginning of the code; we have to dig up the source code of startup_32 and verify. startup_32 is in head.S. The code snippet is like this:

 startup_32:
    /*
     * At this point the CPU runs in 32bit protected mode (CS.D = 1) with
     * paging disabled and the point of this file is to switch to 64bit
     * long mode with a kernel mapping for kerneland to jump into the
     * kernel virtual addresses.
     * There is no stack until we set one up.
     */

    movl %ebx,%ebp  /* Save trampoline flag */

    movl $__KERNEL_DS,%eax
    movl %eax,%ds

    /* If the CPU doesn't support CPUID this will double fault.
     * Unfortunately it is hard to check for CPUID without a stack.
     */

    /* Check if extended functions are implemented */
    movl    $0x80000000, %eax
    cpuid
    cmpl    $0x80000000, %eax
    jbe     no_long_mode
    /* Check if long mode is implemented */
    mov     $0x80000001, %eax
    cpuid
     ...

Comparing the above with the objdump output, we can now confirm that 0xffff ffff 8010 0000 is the base address of the raw binary image /tmp/vmlinux.

Extracting VDSOs from a raw binary image is an easier task. The VDSOs in kernel version 2.6.9 are delimited by the symbols syscall32_syscall/syscall32_syscall_end and syscall32_sysenter/syscall32_sysenter_end:

$ grep syscall32_sys /boot/System.map-`uname -r`

ffffffff8053d5a0 t syscall32_syscall
ffffffff8053e048 t syscall32_syscall_end
ffffffff8053e048 t syscall32_sysenter
ffffffff8053eb00 t syscall32_sysenter_end

Since the base address is 0xffff ffff 8010 0000, the offset of syscall32_syscall in /tmp/vmlinux is 0xffffffff8053d5a0 - 0xffffffff80100000 = 4445600 and the size of this VDSO is 0xffffffff8053e048 - 0xffffffff8053d5a0 = 2728. Now we can extract the VDSO with:

$ dd if=/tmp/vmlinux of=/tmp/syscall_vdso.so bs=1 skip=4445600 count=2728
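
The skip and count values come straight from the System.map addresses. A small sketch of the arithmetic, where struct dd_args and dd_args_for are hypothetical names:

```c
#include <stdint.h>

/* Arguments for "dd bs=1 skip=... count=...". */
struct dd_args {
    uint64_t skip;    /* file offset of the blob inside the raw image */
    uint64_t count;   /* size of the blob in bytes */
};

/* base is the virtual address that corresponds to file offset 0 of the
 * raw image; start and end are the symbol addresses that delimit the
 * blob. Assumes base <= start <= end. */
struct dd_args dd_args_for(uint64_t base, uint64_t start, uint64_t end)
{
    struct dd_args a;

    a.skip = start - base;
    a.count = end - start;
    return a;
}
```

For syscall32_syscall this reproduces skip=4445600 and count=2728. Applied to syscall32_sysenter and syscall32_sysenter_end, the same helper gives skip=4448328 and count=2744.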

Verify that this is an ELF binary object. Note that it is 32-bit and has the SONAME linux-gate.so.1:

$ readelf -a /tmp/syscall_vdso.so

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0xffffe400

  ...

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .hash             HASH            ffffe094 000094 000044 04   A  2   0  4
  [ 2] .dynsym           DYNSYM          ffffe0d8 0000d8 0000c0 10   A  3   8  4
  [ 3] .dynstr           STRTAB          ffffe198 000198 000056 00   A  0   0  1
  [ 4] .gnu.version      VERSYM          ffffe1ee 0001ee 000018 02   A  2   0  2
  [ 5] .gnu.version_d    VERDEF          ffffe208 000208 000038 00   A  3   2  4
  [ 6] .text.vsyscall    PROGBITS        ffffe400 000400 000010 00  AX  0   0  1
  [ 7] .text.sigreturn   PROGBITS        ffffe500 000500 000008 00  AX  0   0 32
  [ 8] .text.rtsigreturn PROGBITS        ffffe600 000600 000007 00  AX  0   0 32
  [ 9] .eh_frame_hdr     PROGBITS        ffffe608 000608 000024 00   A  0   0  4
  [10] .eh_frame         PROGBITS        ffffe62c 00062c 000100 00   A  0   0  4
  [11] .dynamic          DYNAMIC         ffffe72c 00072c 000078 08  WA  3   0  4
  [12] .useless          PROGBITS        ffffe7a4 0007a4 00000c 04  WA  0   0  4
  [13] .text             PROGBITS        ffffe7b0 0007b0 000000 00  AX  0   0  4
  [14] .shstrtab         STRTAB          00000000 0007b0 00009e 00      0   0  1

....

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0xffffe000 0xffffe000 0x007b0 0x007b0 R E 0x1000
  DYNAMIC        0x00072c 0xffffe72c 0xffffe72c 0x00078 0x00078 R   0x4
  GNU_EH_FRAME   0x000608 0xffffe608 0xffffe608 0x00024 0x00024 R   0x4

 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_d .text.vsyscall
          .text.sigreturn .text.rtsigreturn .eh_frame_hdr .eh_frame .dynamic
          .useless
   01     .dynamic
   02     .eh_frame_hdr

Dynamic section at offset 0x72c contains 10 entries:
  Tag        Type                         Name/Value
 0x0000000e (SONAME)                     Library soname: [linux-gate.so.1]
 0x00000004 (HASH)                       0xffffe094
 0x00000005 (STRTAB)                     0xffffe198
 0x00000006 (SYMTAB)                     0xffffe0d8
 0x0000000a (STRSZ)                      86 (bytes)
 0x0000000b (SYMENT)                     16 (bytes)
 0x6ffffffc (VERDEF)                     0xffffe208
 0x6ffffffd (VERDEFNUM)                  2
 0x6ffffff0 (VERSYM)                     0xffffe1ee
 0x00000000 (NULL)                       0x0

It has the following exported symbols (Note the addresses of these symbols):

$ readelf -s /tmp/syscall_vdso.so

Symbol table '.dynsym' contains 12 entries:
Num:    Value  Size Type    Bind   Vis      Ndx Name
  0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
  1: ffffe400     0 SECTION LOCAL  DEFAULT    6
  2: ffffe500     0 SECTION LOCAL  DEFAULT    7
  3: ffffe600     0 SECTION LOCAL  DEFAULT    8
  4: ffffe608     0 SECTION LOCAL  DEFAULT    9
  5: ffffe62c     0 SECTION LOCAL  DEFAULT   10
  6: ffffe7a4     0 SECTION LOCAL  DEFAULT   12
  7: ffffe7b0     0 SECTION LOCAL  DEFAULT   13
  8: ffffe400    16 FUNC    GLOBAL DEFAULT    6 __kernel_vsyscall@@LINUX_2.5
  9: 00000000     0 OBJECT  GLOBAL DEFAULT  ABS LINUX_2.5
 10: ffffe600     7 FUNC    GLOBAL DEFAULT    8 __kernel_rt_sigreturn@@LINUX_2.5
 11: ffffe500     8 FUNC    GLOBAL DEFAULT    7 __kernel_sigreturn@@LINUX_2.5

To reconstruct the Linux kernel binary (as an ELF executable binary) from the raw binary image, use the dressUp tool.

How do vsyscalls and VDSO work ?

To show how vsyscalls and VDSO work, we use the following code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

int main() {

    {
      struct timeval tim;
      gettimeofday(&tim, NULL);
      printf("%.6lf seconds\n", tim.tv_sec+tim.tv_usec/1000000.0);
    }

    {
      time_t t1=time(NULL);
      printf("%ld seconds\n", (long)t1);
    }

    {
      struct timespec tp;
      clock_gettime(CLOCK_REALTIME, &tp);
      printf("%.6lf seconds\n", tp.tv_sec + tp.tv_nsec/1000000000.0);
    }
}

Compile the above code and link it with the -lrt option (older Glibc versions keep clock_gettime in librt).

Next, launch GDB and set a breakpoint at __gettimeofday, of which gettimeofday is a weak alias. If one sets a breakpoint at gettimeofday instead, then, as mentioned earlier, GDB's symbol resolution will find the gettimeofday symbol inside the VDSO, not Glibc, since the VDSO has precedence.

(gdb) tbreak __gettimeofday
Function "__gettimeofday" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (__gettimeofday) pending.
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]

Breakpoint 1, 0x00007ffff790bb10 in gettimeofday () from /lib/libc.so.6
(gdb) disas
Dump of assembler code for function gettimeofday:
=> 0x00007ffff790bb10 <+0>:     sub    $0x8,%rsp
   0x00007ffff790bb14 <+4>:     mov    0x2ca9e5(%rip),%rax        # 0x7ffff7bd6500 <__vdso_gettimeofday>
   0x00007ffff790bb1b <+11>:    ror    $0x11,%rax
   0x00007ffff790bb1f <+15>:    xor    %fs:0x30,%rax
   0x00007ffff790bb28 <+24>:    callq  *%rax
   0x00007ffff790bb2a <+26>:    cmp    $0xfffff001,%eax
   0x00007ffff790bb2f <+31>:    jae    0x7ffff790bb36 <gettimeofday+38>
   0x00007ffff790bb31 <+33>:    add    $0x8,%rsp
   0x00007ffff790bb35 <+37>:    retq
   0x00007ffff790bb36 <+38>:    mov    0x2c5463(%rip),%rcx        # 0x7ffff7bd0fa0
   0x00007ffff790bb3d <+45>:    xor    %edx,%edx
   0x00007ffff790bb3f <+47>:    sub    %rax,%rdx
   0x00007ffff790bb42 <+50>:    mov    %edx,%fs:(%rcx)
   0x00007ffff790bb45 <+53>:    or     $0xffffffffffffffff,%rax
   0x00007ffff790bb49 <+57>:    jmp    0x7ffff790bb31 <gettimeofday+33>
End of assembler dump.

At this point we can see what __gettimeofday inside Glibc is doing: it loads a value from a variable called __vdso_gettimeofday into a register, applies a rotation and an XOR to it (pointer demangling, see below), and calls the demangled address (all of this can be found in sysdeps/unix/sysv/linux/x86_64/gettimeofday.S).
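
The tail of the listing, from cmp $0xfffff001,%eax onwards, is the usual error check: a raw result whose low 32 bits are at least 0xfffff001 (that is, -4095 to -1) is treated as a negated errno value. A sketch of the same logic in C, reconstructed from the disassembly rather than from the Glibc source, with demo_finish_call as a hypothetical name:

```c
#include <errno.h>
#include <stdint.h>

/* Mirrors the cmp/jae tail of the listing. The comparison looks only at
 * the low 32 bits, like "cmp $0xfffff001,%eax". On the error path errno
 * receives the negated raw value and -1 is returned; otherwise the raw
 * value is passed through unchanged. */
long demo_finish_call(long raw)
{
    if ((uint32_t)raw >= UINT32_C(0xfffff001)) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

A raw result of -22, for instance, becomes a return value of -1 with errno set to 22, while 0 is returned as is.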

So how is the variable __vdso_gettimeofday set up ? Let's set a watch point and run the program again:

(gdb) watch *0x7ffff7bd6500
Hardware watchpoint 2: *0x7ffff7bd6500
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: a.out
Watchpoint 2: *0x7ffff7bd6500

Old value = <unreadable>
New value = 0
0x00007ffff7df48ba in mmap64 () from /lib/ld-linux-x86-64.so.2
(gdb) cont
Continuing.
[Thread debugging using libthread_db enabled]
Hardware watchpoint 2: *0x7ffff7bd6500

Old value = 0
New value = 9205590
0x00007ffff789996f in _init () from /lib/libc.so.6
(gdb) bt
#0  0x00007ffff789996f in _init () from /lib/libc.so.6
#1  0x00007ffff7dec559 in call_init () from /lib/ld-linux-x86-64.so.2
#2  0x00007ffff7dec697 in _dl_init_internal () from /lib/ld-linux-x86-64.so.2
#3  0x00007ffff7ddfaca in _dl_start_user () from /lib/ld-linux-x86-64.so.2
#4  0x0000000000000001 in ?? ()
#5  0x00007fffffffe9a1 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) x/10i $pc - 0x1f
   0x7ffff7899950 <_init+224>:  mov    $0xffffffffff600000,%rdx
   0x7ffff7899957 <_init+231>:  cmovne %rax,%rdx
   0x7ffff789995b <_init+235>:  xor    %fs:0x30,%rdx
   0x7ffff7899964 <_init+244>:  rol    $0x11,%rdx
   0x7ffff7899968 <_init+248>:  mov    %rdx,0x33cb91(%rip)        # 0x7ffff7bd6500 <__vdso_gettimeofday>
=> 0x7ffff789996f <_init+255>:  callq  0x7ffff797ff40 <_dl_vdso_vsym>
   0x7ffff7899974 <_init+260>:  mov    %r12,%rdx
   0x7ffff7899977 <_init+263>:  mov    %rbp,%rsi
   0x7ffff789997a <_init+266>:  mov    %ebx,%edi
   0x7ffff789997c <_init+268>:  xor    %fs:0x30,%rax

The first time GDB stops at our watchpoint is when ld.so loads libc.so and sets up its .bss section, so we need to continue (cont). The second stop is the place where the variable __vdso_gettimeofday is set. The source code is in the _libc_vdso_platform_setup function in sysdeps/unix/sysv/linux/x86_64/init-first.c:

    static inline void
    _libc_vdso_platform_setup (void)
    {
      PREPARE_VERSION (linux26, "LINUX_2.6", 61765110);

      void *p = _dl_vdso_vsym ("gettimeofday", &linux26);
      /* If the vDSO is not available we fall back on the old vsyscall.  */
    #define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000ul
      if (p == NULL)
        p = (void *) VSYSCALL_ADDR_vgettimeofday;
      PTR_MANGLE (p);
      __vdso_gettimeofday = p;

      p = _dl_vdso_vsym ("clock_gettime", &linux26);
      PTR_MANGLE (p);
      __GI___vdso_clock_gettime = p;
    }

So __vdso_gettimeofday stores the mangled address of gettimeofday in VDSO.

Now let's set a breakpoint at gettimeofday and see what's going on inside VDSO:

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) tbreak gettimeofday
Breakpoint 1 at 0x00007ffff7ffb8c0
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]

Breakpoint 1, 0x00007ffff7ffb8c0 in gettimeofday ()
(gdb) bt
#0  0x00007ffff7ffb8c0 in gettimeofday ()
#1  0x00007ffff790bb2a in gettimeofday () from /lib/libc.so.6
#2  0x000000000040062d in main ()
(gdb) info inferiors
  Num  Description       Executable
* 1    process 1089      a.out
(gdb) shell cat /proc/1089/maps|grep vdso
7ffff7ffb000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]

The memory map shows the address 7ffff7ffb8c0 is indeed within VDSO.

The Linux kernel (2.6.35) source for VDSO's gettimeofday is __vdso_gettimeofday function in arch/x86/vdso/vclock_gettime.c. What it does is:

If sysctl_enabled is true (it defaults to 1; kernels built with CONFIG_SYSCTL expose it as the kernel.vsyscall64 sysctl) and clock.vread is set, then clock.vread is called inside vgetns to get the time.
Otherwise, it falls back to the ordinary system call, i.e. it invokes syscall with the system call number __NR_gettimeofday.

So what is clock.vread ?

clock.vread is a member of struct vsyscall_gtod_data (Virtual system call GetTimeOfDay) in Linux source arch/x86/include/asm/vgtod.h.

In the debugging session above we do not know in advance which path is taken, i.e. whether clock.vread or syscall gets called. To find out, we can do the following in GDB after the above breakpoint:

(gdb) catch syscall
Catchpoint 2 (any syscall)
(gdb) while ($pc < 0xffffffffff600000)
 > si
 > end

Here we assumed if clock.vread ever gets called, its code is in vsyscall's region, i.e. at address 0xffffffffff600000 or after it.

If clock.vread is called, one should see the following GDB output:

0x00007ffff7ffb8c1 in gettimeofday ()
0x00007ffff7ffb8cb in gettimeofday ()
....
0xffffffffff600140 in ?? ()
(gdb) x/20i $pc
=> 0xffffffffff600140:  push   %rbp
   0xffffffffff600141:  mov    %rsp,%rbp
   0xffffffffff600144:  mfence
   0xffffffffff600147:  data32 xchg %ax,%ax
   0xffffffffff60014a:  rdtsc
   0xffffffffff60014c:  mfence
   0xffffffffff60014f:  data32 xchg %ax,%ax
   0xffffffffff600152:  shl    $0x20,%rdx
   0xffffffffff600156:  mov    %eax,%eax
   0xffffffffff600158:  or     %rax,%rdx
   0xffffffffff60015b:  mov    0x46(%rip),%rax        # 0xffffffffff6001a8
   0xffffffffff600162:  leaveq
   0xffffffffff600163:  cmp    %rax,%rdx
   0xffffffffff600166:  cmovae %rdx,%rax
   0xffffffffff60016a:  retq
   ....

So this is what clock.vread is doing! The address 0xffffffffff600140 is vread_tsc in the System.map listing shown earlier, and it just uses rdtsc to read the Time Stamp Counter. This is what a userspace system call means!
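
The instructions after rdtsc are easy to restate in C. rdtsc returns the counter split across EDX (high half) and EAX (low half); the code merges the halves and then returns the larger of that value and a 64-bit value loaded from 0xffffffffff6001a8, which lies inside __vsyscall_gtod_data and is presumably the last recorded cycle count. The helper names below are hypothetical, and the halves are passed in as plain arguments so that the sketch does not need inline assembly.

```c
#include <stdint.h>

/* shl $0x20,%rdx ; mov %eax,%eax ; or %rax,%rdx
 * Merge the two 32-bit halves returned by rdtsc into one 64-bit count. */
uint64_t demo_combine_tsc(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* cmp %rax,%rdx ; cmovae %rdx,%rax
 * Return now unless it is below last, so the result never goes backwards
 * relative to the stored value. */
uint64_t demo_clamp_cycles(uint64_t now, uint64_t last)
{
    return now >= last ? now : last;
}
```

The mov %eax,%eax in the listing is not a no-op: in 64-bit mode it clears the upper 32 bits of RAX, which the cast-free OR in demo_combine_tsc gets for free from the uint32_t parameter type.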

In a nutshell, for dynamic binaries, gettimeofday is done as follows:

__gettimeofday in Glibc calls gettimeofday in VDSO.
gettimeofday in VDSO could either jump to clock.vread inside vsyscall, or make an ordinary syscall.
For the former case, clock.vread uses rdtsc to get the time.

For statically linked binaries, gettimeofday is done as follows:

__gettimeofday in Glibc calls vgettimeofday in vsyscall.
vgettimeofday in vsyscall could either invoke clock.vread, or make an ordinary syscall.
For the former case, clock.vread uses rdtsc to get the time.

Pointer Encryption

This security mechanism in Glibc has many different names: pointer encryption, pointer guard (see the LD_POINTER_GUARD part in ld.so's man page), or pointer obfuscation. The purpose is to protect function pointers inside a writable memory region, e.g. .bss section, because the memory region cannot be made Read-Only.

This is done through a pair of macros: PTR_MANGLE and PTR_DEMANGLE. For x86_64, the macros are defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h

An example usage of PTR_MANGLE is in _libc_vdso_platform_setup of sysdeps/unix/sysv/linux/x86_64/init-first.c, where the function pointer stored in __vdso_gettimeofday is encrypted. In __gettimeofday of sysdeps/unix/sysv/linux/x86_64/gettimeofday.S, PTR_DEMANGLE is used to decrypt the mangled value.
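
Judging from the two disassembly listings above, mangling on x86_64 is an XOR with the value at %fs:0x30 followed by a rotate left by 0x11 bits, and demangling is the rotate right followed by the same XOR. The sketch below models that with plain 64-bit integers; the function names are hypothetical and the guard is an ordinary parameter here, whereas Glibc reads it from thread-local storage.

```c
#include <stdint.h>

/* Rotate helpers; n must be between 1 and 63. */
static uint64_t rol64(uint64_t v, unsigned n)
{
    return (v << n) | (v >> (64 - n));
}

static uint64_t ror64(uint64_t v, unsigned n)
{
    return (v >> n) | (v << (64 - n));
}

/* xor %fs:0x30,%rdx ; rol $0x11,%rdx */
uint64_t demo_ptr_mangle(uint64_t ptr, uint64_t guard)
{
    return rol64(ptr ^ guard, 17);
}

/* ror $0x11,%rax ; xor %fs:0x30,%rax */
uint64_t demo_ptr_demangle(uint64_t mangled, uint64_t guard)
{
    return ror64(mangled, 17) ^ guard;
}
```

Demangling with the same guard always returns the original pointer, while an attacker who overwrites the stored value without knowing the guard cannot easily choose where the demangled pointer will land.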
