Control flow
Some common control flow patterns in x86 assembly (Intel syntax):
After test:

Instruction | Meaning | Flag condition |
---|---|---|
test arg1,arg2 | Bitwise AND. Unlike and, neither arg1 nor arg2 is changed. | ZF=0 if the result is nonzero |
test arg,arg | Checks whether arg is zero. | ZF=0 if arg is nonzero |
je jz | Jump if arg is 0, or arg1&arg2 is 0 | ZF=1 |
jne jnz | Jump if arg is nonzero, or arg1&arg2 is nonzero | ZF=0 |

After xor:

Instruction | Meaning | Flag condition |
---|---|---|
xor arg1,arg2 | Bitwise XOR. The result is stored in arg1. | |
je jz | Jump if arg1==arg2 | ZF=1 |
jne jnz | Jump if arg1!=arg2 | ZF=0 |

After cmp:

Instruction | Meaning | Flag condition |
---|---|---|
cmp arg1,arg2 | Subtraction: arg1-arg2 (Intel syntax). Unlike sub, neither arg1 nor arg2 is changed. | |
je jz | Jump if arg1==arg2 | ZF=1 |
jne jnz | Jump if arg1!=arg2 | ZF=0 |
jg jnle | Jump if signed arg1>arg2 | ZF=0 and SF=OF |
jl jnge | Jump if signed arg1<arg2 | SF!=OF |
jge jnl | Jump if signed arg1>=arg2 | SF=OF |
jle jng | Jump if signed arg1<=arg2 | ZF=1 or SF!=OF |
ja jnbe | Jump if unsigned arg1>arg2 | CF=0 and ZF=0 |
jb jnae jc | Jump if unsigned arg1<arg2 | CF=1 |
jae jnb jnc | Jump if unsigned arg1>=arg2 | CF=0 |
jbe jna | Jump if unsigned arg1<=arg2 | CF=1 or ZF=1 |
js | Jump if Sign flag is 1 | SF=1 |
jns | Jump if Sign flag is 0 | SF=0 |

If the rules above are confusing, consider the following scenarios (16-bit operands):
arg1 | arg2 | arg1-arg2 | Sign flag | Overflow flag |
---|---|---|---|---|
0xFFFF (-1) | 0xFFFE (-2) | 0x0001 | 0 | 0 |
0x8000 (-32768) | 0x0001 | 0x7FFF (32767) | 0 | 1 |
0xFFFE (-2) | 0xFFFF (-1) | 0xFFFF (-1) | 1 | 0 |
0x7FFF (32767) | 0xFFFF (-1) | 0x8000 (-32768) | 1 | 1 |
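
The scenarios in the table can be reproduced with a small C model of a 16-bit cmp. This is only a sketch: the names cmp16, cmp_flags and cond_* are invented for illustration, and the overflow formula is the usual "operands have different signs and the result's sign differs from arg1" rule for subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags produced by a 16-bit "cmp a, b" (illustrative model, not CPU code). */
struct cmp_flags { bool zf, sf, of, cf; };

static struct cmp_flags cmp16(uint16_t a, uint16_t b) {
    uint16_t r = (uint16_t)(a - b);              /* result of a - b, truncated to 16 bits */
    struct cmp_flags f;
    f.zf = (r == 0);                             /* Zero flag */
    f.sf = ((r >> 15) & 1) != 0;                 /* Sign flag: top bit of the result */
    f.cf = a < b;                                /* Carry flag: unsigned borrow */
    f.of = ((((a ^ b) & (a ^ r)) >> 15) & 1) != 0; /* Overflow flag: signed overflow */
    return f;
}

/* Conditions tested by the jumps in the cmp table. */
static bool cond_jl(struct cmp_flags f) { return f.sf != f.of; }
static bool cond_jg(struct cmp_flags f) { return !f.zf && f.sf == f.of; }
static bool cond_jb(struct cmp_flags f) { return f.cf; }
static bool cond_ja(struct cmp_flags f) { return !f.cf && !f.zf; }
```

For instance, cmp16(0x8000, 0x0001) yields SF=0 and OF=1, so the jl condition (SF!=OF) holds, matching -32768 < 1.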
The j in the tables above can be replaced by cmov or set to get the corresponding Conditional Move or Conditional Set instruction.

Instruction | Meaning |
---|---|
loop | Decrement the CX register (ECX or RCX, depending on the address size), and if it is nonzero, continue the loop (i.e. branch to the target location) |
loope loopz | Decrement the CX register, and if it is nonzero and the Zero flag is set, continue the loop (i.e. branch to the target location). It is usually used in conjunction with cmp, i.e. if arg1==arg2, loop back. |

Opcodes for conditional jumps
Opcodes for x86_64 conditional jumps have two formats:

- 7x xx: RIP=RIP+8 bit displacement (sign-extended to 64 bits)
- 0F 8x xx xx xx xx: RIP=RIP+32 bit displacement (sign-extended to 64 bits)

Mnemonic | Opcode (8-bit displacement) | Opcode (32-bit displacement) |
---|---|---|
je jz | 74 | 0F 84 |
jne jnz | 75 | 0F 85 |
jg jnle | 7F | 0F 8F |
jl jnge | 7C | 0F 8C |
jge jnl | 7D | 0F 8D |
jle jng | 7E | 0F 8E |
ja jnbe | 77 | 0F 87 |
jb jnae jc | 72 | 0F 82 |
jae jnb jnc | 73 | 0F 83 |
jbe jna | 76 | 0F 86 |
js | 78 | 0F 88 |
jns | 79 | 0F 89 |
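
To make the two encodings concrete, here is a minimal C sketch that recognizes a conditional jump and computes its target. The function name jcc_target is hypothetical, and the sketch assumes the displacement is relative to the address of the next instruction, which is how RIP is defined when the jump executes.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a conditional jump located at address `addr`.
 * Returns the instruction length (2 or 6) and stores the branch target,
 * or returns 0 if the bytes are not a 7x / 0F 8x conditional jump. */
static size_t jcc_target(const uint8_t *code, uint64_t addr, uint64_t *target) {
    if ((code[0] & 0xF0) == 0x70) {
        /* Short form: 7x xx, signed 8-bit displacement. */
        int8_t disp = (int8_t)code[1];
        *target = addr + 2 + (uint64_t)(int64_t)disp;
        return 2;
    }
    if (code[0] == 0x0F && (code[1] & 0xF0) == 0x80) {
        /* Near form: 0F 8x xx xx xx xx, signed 32-bit little-endian displacement. */
        uint32_t raw = (uint32_t)code[2] | ((uint32_t)code[3] << 8) |
                       ((uint32_t)code[4] << 16) | ((uint32_t)code[5] << 24);
        int32_t disp = (int32_t)raw;
        *target = addr + 6 + (uint64_t)(int64_t)disp;
        return 6;
    }
    return 0;
}
```

With this model, the two bytes 74 FE at any address decode to a je that jumps back to itself, because the displacement -2 cancels the instruction length.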
Opcodes for unconditional jumps
Opcodes for x86_64 unconditional jumps have the following formats (this list is incomplete):

    EB xx              RIP=RIP+8 bit displacement (sign-extended to 64 bits)
    E9 xx xx xx xx     RIP=RIP+32 bit displacement (sign-extended to 64 bits)
    FF E0              RIP=RAX
    FF E3              RIP=RBX
    FF E1              RIP=RCX
    FF E2              RIP=RDX
    FF E7              RIP=RDI
    FF E6              RIP=RSI
    FF E4              RIP=RSP
    FF E5              RIP=RBP
    FF 20              RIP=[RAX]
    FF 23              RIP=[RBX]
    FF 21              RIP=[RCX]
    FF 22              RIP=[RDX]
    FF 27              RIP=[RDI]
    FF 26              RIP=[RSI]
    FF 24 24           RIP=[RSP]
    FF 65 00           RIP=[RBP]
    FF 25 ?? ?? ?? ??  RIP=[RIP+0x????????]

Synchronization support
cmpxchg arg1,arg2
cmpxchg implicitly uses the accumulator (AL, AX, EAX, or RAX, depending on the operand size) and compares it with arg1:

    if (EAX == arg1) then
        Zero flag = 1; arg1 = arg2;
    else
        Zero flag = 0; EAX = arg1;
    endif

Here is an example of how to implement a spin lock using cmpxchg:

    mov EDX, 1
    spin:
        mov EAX, my_lock
        test EAX, EAX
        jnz spin
        lock cmpxchg my_lock, EDX
        test EAX, EAX
        jnz spin

When lock cmpxchg my_lock, EDX executes, EAX is guaranteed to be 0, but my_lock is stored in memory, so another thread may have made it nonzero in the meantime.
- If my_lock==0, then my_lock is set to 1 (EDX's value), and EAX is still 0, so test EAX, EAX sets the Zero flag, jnz is not taken, and the lock is acquired.
- If my_lock!=0, then my_lock is still nonzero, but EAX will be set to nonzero (my_lock's value), so jnz is taken and the thread keeps spinning.
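
The same test-then-compare-and-swap pattern can be sketched in C11 atomics. This is an illustration under assumptions, not the code a particular library uses: the names spin_trylock, spin_lock and spin_unlock are invented here, and on x86 a compiler would typically lower atomic_compare_exchange_strong to lock cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One acquisition attempt: succeeds only if the lock changes from 0 to 1. */
static bool spin_trylock(atomic_int *lock) {
    int expected = 0;   /* plays the role of EAX == 0 */
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

static void spin_lock(atomic_int *lock) {
    for (;;) {
        /* Plain read first ("mov EAX, my_lock; test; jnz spin") to avoid
         * hammering the cache line with locked operations. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        if (spin_trylock(lock))
            return;     /* compare-and-swap won: lock acquired */
    }
}

static void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

A caller would declare `atomic_int my_lock = 0;` and bracket the critical section with spin_lock(&my_lock) and spin_unlock(&my_lock).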
xchg arg1,arg2
Atomic exchange. It can also be used to implement a spin lock, as follows:

    mov EAX, 1
    spin:
        xchg EAX, my_lock
        test EAX, EAX
        jnz spin

xadd arg1,arg2
Exchange and Add:

    tmp = arg1+arg2
    arg2 = arg1
    arg1 = tmp

When combined with the lock prefix on a memory operand, this gives the atomic version of

    arg1 += arg2

which returns the original value of arg1 (in arg2) while arg1 is incremented by arg2.
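
In C11 the same fetch-and-add semantics are available through atomic_fetch_add, which compilers for x86 commonly implement with lock xadd. The wrapper name fetch_and_add below is only illustrative.

```c
#include <stdatomic.h>

/* Atomically performs *counter += delta and returns the value
 * *counter held before the addition (the value xadd leaves in arg2). */
static int fetch_and_add(atomic_int *counter, int delta) {
    return atomic_fetch_add(counter, delta);
}
```

A typical use is handing out unique ticket numbers: each caller of fetch_and_add(&next_ticket, 1) receives a distinct previous value.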
x86 calling conventions (Assembly view)
push arg
Place arg operand onto the top of the hardware supported stack in memory:

    ESP -= 4
    [ESP] = arg

pop arg
Remove data from top of the stack into arg:

    arg = [ESP]
    ESP += 4

call func
Push the address of the next instruction after itself onto the stack, and jump to func:

    ESP -= 4
    [ESP] = address of next instruction
    EIP = address of func

ret
Pop off the hardware supported in-memory stack into EIP:

    EIP = [ESP]
    ESP += 4

ret n
Same as ret but will adjust ESP by n bytes as well:

    EIP = [ESP]
    ESP += (4+n)

enter n,0
Allocate a stack frame for a procedure and reserve n bytes from the stack for local variables:

    push EBP
    mov EBP, ESP
    sub ESP, n

leave
Deallocate the stack frame set up by an earlier enter instruction:

    mov ESP, EBP
    pop EBP

32-bit x86 calling conventions (C programmer's view)
(For more information, see The gen on function calling conventions)
cdecl
All parameters are on stack. EAX register holds the return value. This is the standard C calling convention.

    ; foo(1, 2, 3, 4); // caller calls foo
    push 4
    push 3
    push 2
    push 1
    call foo
    add ESP, 16 ; adjust ESP to its former value (32-bit architecture)

GCC uses the following code snippet instead, which saves the last add above:

    ; foo(1, 2, 3, 4); // caller calls foo
    sub ESP, 16
    mov [ESP+12], 4
    mov [ESP+8], 3
    mov [ESP+4], 2
    mov [ESP], 1
    call foo

The callee code would be:

    ; foo(int x, int y, int z, int w)
    push EBP
    mov EBP, ESP
    ; [EBP+8 ] is x
    ; [EBP+12] is y
    ; [EBP+16] is z
    ; [EBP+20] is w
    ...
    ; [EBP-4] would be the first local variable
    pop EBP ; this line could be replaced by "leave"
    ret

stdcall
All parameters are on stack. EAX register holds the return value. GCC generates the following code snippet:

    ; foo(1, 2, 3, 4); // caller calls foo
    mov [ESP+12], 4
    mov [ESP+8], 3
    mov [ESP+4], 2
    mov [ESP], 1
    call foo
    sub ESP, 16

And the callee code would be:

    ; __attribute__((stdcall)) foo(int x, int y, int z, int w)
    push EBP
    mov EBP, ESP
    ; [EBP+8 ] is x
    ; [EBP+12] is y
    ; [EBP+16] is z
    ; [EBP+20] is w
    ...
    ; [EBP-4] would be the first local variable
    pop EBP ; this line could be replaced by "leave"
    ret 16

fastcall
ECX and EDX have the first two parameters, and the rest would be on stack and are treated like stdcall. EAX register holds the return value. GCC generates the following code snippet:

    ; foo(1, 2, 3, 4); // caller calls foo
    mov [ESP+4], 4
    mov [ESP], 3
    mov EDX, 2
    mov ECX, 1
    call foo
    sub ESP, 8

And the callee code would be:

    ; __attribute__((fastcall)) foo(int x, int y, int z, int w)
    push EBP
    mov EBP, ESP
    sub ESP, 8
    mov [EBP-4], ECX
    mov [EBP-8], EDX
    ; [EBP-4 ] is x
    ; [EBP-8 ] is y
    ; [EBP+8 ] is z
    ; [EBP+12] is w
    ...
    leave
    ret 8

64-bit x86 calling conventions (C programmer's view)
Linux
RDI, RSI, RDX, RCX, R8, R9 have the first six parameters (XMM0, .., XMM7 are used for floating point parameters), and the rest would be on stack. RAX register holds the return value.

RAX, RCX, RDX, RSI, RDI, R8 - R11 are scratch registers and their contents should be considered (from caller's perspective) clobbered after a function call. Callee is only responsible for the preservation of RBX, RSP, RBP, R12 - R15.
GCC generates the following code snippet:

    ; foo(1, 2, 3, 4); // caller calls foo
    mov ECX, 4
    mov EDX, 3
    mov ESI, 2
    mov EDI, 1
    call foo

The callee code would be:

    ; foo(int x, int y, int z, int w)
    push RBP
    mov RBP, RSP
    ; EDI is x
    ; ESI is y
    ; EDX is z
    ; ECX is w
    ...
    ; [RBP-4] would be the first local 32-bit integer variable
    leave
    ret

See here for details.
Windows
RCX, RDX, R8, R9 have the first four parameters (XMM0, .., XMM3 are used for floating point parameters), and the rest would be on stack. RAX register holds the return value. If the return data type is float, XMM0 register is used instead.

RAX, RCX, RDX, R8 - R11 are scratch registers and their contents should be considered (from caller's perspective) destroyed after a function call.
See here and here for details.
-fomit-frame-pointer, GDB, alloca, and all that
On x86 the frame pointer is the EBP register. As mentioned above, the callee normally has the following prolog:

    push EBP
    mov EBP, ESP

and the following epilog:

    leave    ; recall leave is a combination of 'mov ESP, EBP' & 'pop EBP'
    ret

What is the purpose of the frame pointer and the prolog/epilog above? They allow the debugger (or us) to "walk through" the stack frames; the saved EBP's are like "next" pointers in a C linked list struct, as in the following pseudo code:

    while (0 != EBP) {
        printf("Return address is %p\n", [EBP+4]);
        EBP = [EBP];
    }

(assuming EBP is initialized to be 0)
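
The linked-list analogy can be made literal in C. The sketch below does not read the real stack; it walks a chain of simulated frames, and the struct frame layout and the function name walk_frames are assumptions chosen to mirror [EBP] and [EBP+4] on 32-bit x86.

```c
#include <stddef.h>
#include <stdint.h>

/* What a frame-pointer based prolog leaves on the stack at EBP. */
struct frame {
    struct frame *saved_ebp;    /* [EBP]:   caller's frame pointer ("next") */
    uintptr_t return_address;   /* [EBP+4]: address to return to */
};

/* Follow the saved-EBP chain until it reaches 0 (NULL), collecting at most
 * `max` return addresses into `out`. Returns how many were stored. */
static size_t walk_frames(const struct frame *fp, uintptr_t *out, size_t max) {
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->return_address;
        fp = fp->saved_ebp;
    }
    return n;
}
```

A debugger performs essentially this loop starting from the current EBP; once a function stops saving EBP, the chain skips that function's caller and the walk produces an incomplete backtrace.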
A common compiler command-line switch for release/optimization builds is -fomit-frame-pointer. It tells the compiler not to save the previous EBP onto the stack, thus freeing EBP for other uses. Since x86 has a meager set of registers, one extra register is welcome.
The side effect, as mentioned in the GCC manual here, is that debugging becomes impossible on some machines, since we lose the ability to walk through the stack frames. Here is an example. Compile the following code:

    int foo(int x, int y, int z, int w) { return x+y+z+w; }
    int main() { foo(1,2,3,4); }

With -fomit-frame-pointer, one can see from the assembly code that the prolog and the leave in the epilog are no longer generated. To see how this thwarts debugging, strip the generated binaries (if you do not strip them, then newer versions of GDB, e.g. 6.0, can gain stack frame knowledge through the embedded DWARF debugging information and display the correct backtrace, with or without -fomit-frame-pointer) and run GDB. First, run nm to get the address of foo, then strip the binary, launch GDB and put a breakpoint on the address of foo. Without -fomit-frame-pointer, on 32-bit Linux, the backtrace looks like:

    (gdb) break *0x8048317
    (gdb) run
    (gdb) bt
    #0 0x08048317 in ?? ()
    #1 0x08048365 in ?? ()
    #2 0x55597e93 in __libc_start_main () from /lib/tls/libc.so.6
    #3 0x08048291 in ?? ()

so we can see that some function at 0x8048365 called foo. Do the same with -fomit-frame-pointer, and we have

    (gdb) break *0x8048314
    (gdb) run
    (gdb) bt
    #0 0x08048314 in ?? ()
    #1 0x55597e93 in __libc_start_main () from /lib/tls/libc.so.6
    #2 0x08048291 in ?? ()

In this case the backtrace does not work correctly: one stack frame is missing, and we no longer know what happened before the call to foo. Also, the x86 ABI mandates that the value of EBP must be preserved across calls (that is why we have leave instructions at the end of callees to restore the original value of EBP), so in this example, where no function touches EBP once -fomit-frame-pointer is on, EBP keeps its initial value of 0.
Interestingly, if one tries the above example with 64-bit compilation, all the stack frames can still be recovered with -fomit-frame-pointer. The secret lies in the .eh_frame section in 64-bit x86 binaries. (see here) To see the content of the .eh_frame section, use

    readelf --debug-dump=frames-interp a.out

If this section is removed (e.g. with the objcopy -R .eh_frame a.out command), then the backtrace would be a total mess (not just missing stack frames, as in the 32-bit case, but showing garbage values/addresses).
The conclusion is: If you want to deter hackers, compile your program with -fomit-frame-pointer, strip it, AND remove/tamper with its .eh_frame section (the latter could crash GDB too)
A closely related topic to stack frames is the alloca function, which allocates memory from the stack and needs no explicit deallocation (actually, cannot). For x86, GCC generates the following code (shown here in AT&T syntax) for, say, alloca(126):

    sub $144, %rsp
    mov %rsp, %rax
    add $15, %rax
    shr $4, %rax
    sal $4, %rax

What GCC does here is allocate 16 bytes more than requested, then round that up to a multiple of 16 bytes to get 144. It then aligns the result to a 16-byte address boundary: RAX = (RAX+15)/16*16 (see GCC source tree: expand_builtin_alloca in gcc/builtins.c, allocate_dynamic_stack_space in gcc/explow.c and BIGGEST_ALIGNMENT macro in gcc/config/i386/i386.h, which is 128-bit, and this is where the 16-byte alignment comes from.)
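
The arithmetic described above can be checked with a few lines of C. This is a sketch of the rounding rules as the text states them, not GCC's actual implementation; the helper names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_ALIGN 16u   /* BIGGEST_ALIGNMENT of 128 bits, expressed in bytes */

/* Round `value` up to the next multiple of STACK_ALIGN. */
static size_t align_up(size_t value) {
    return (value + STACK_ALIGN - 1) / STACK_ALIGN * STACK_ALIGN;
}

/* Bytes subtracted from the stack pointer for alloca(requested):
 * 16 extra bytes, then rounded up to a multiple of 16. */
static size_t alloca_reserved_bytes(size_t requested) {
    return align_up(requested + STACK_ALIGN);
}

/* Address returned to the caller: the "add 15; shr 4; sal 4" sequence. */
static uintptr_t alloca_result(uintptr_t rsp_after_sub) {
    return ((rsp_after_sub + 15) >> 4) << 4;
}
```

For alloca(126) this gives 126 + 16 = 142, rounded up to 144, which matches the sub $144, %rsp in the generated code.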
Recall that ESP is modified inside a function body ONLY based on the total size of local variables, and this information is known at compile time. This is important because at the end of the function call, ESP must be restored to its original value, so that ret can pop the return address correctly.
However, with alloca, ESP can be decremented arbitrarily, so how can ESP be restored? Moreover, how does automatic deallocation occur? The answer to both questions is to use the frame pointer EBP!
Recall EBP, if it is saved, is always kept constant through the function body. Put another way, if alloca is ever used inside a function, then that function will always explicitly preserve the frame pointer EBP with the aforementioned prolog/epilog pairs, and the -fomit-frame-pointer optimization will be ignored for that function.
System calls
System calls can be made either by calling their wrappers directly, e.g. sched_yield(), or via the syscall interface, e.g. syscall(SYS_sched_yield). To use the syscall interface, one must include the header files unistd.h and syscall.h.

SYS_sched_yield is the system call number, and its value differs from system to system. For example, on 64-bit Linux SYS_sched_yield is mapped to 24 (as __NR_sched_yield in /usr/include/asm/unistd_64.h or /usr/include/asm-x86_64/unistd.h) while on 32-bit Linux it is 158 (as __NR_sched_yield in /usr/include/asm/unistd_32.h or /usr/include/asm-i386/unistd.h)
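
The two ways of issuing the call look like this in C. The sketch assumes Linux with Glibc; it includes sys/syscall.h (which syscall.h forwards to on Glibc), and the function names yield_direct and yield_via_syscall are made up for the example.

```c
#define _GNU_SOURCE
#include <sched.h>        /* sched_yield() */
#include <sys/syscall.h>  /* SYS_sched_yield */
#include <unistd.h>       /* syscall() */

/* Call the libc wrapper directly. */
static long yield_direct(void) {
    return sched_yield();
}

/* Go through the generic syscall interface with the system call number. */
static long yield_via_syscall(void) {
    return syscall(SYS_sched_yield);
}
```

Both functions should return 0 on success; the second form is mainly useful for system calls that have no dedicated wrapper.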
In Glibc, system calls on 64-bit x86 Linux are handled using code in sysdeps/unix/sysv/linux/x86_64/sysdep.h. In sched_yield()'s case, the code would be

    mov EAX, 24
    syscall
    cmp RAX, -4095
    jae SYSCALL_ERROR_LABEL
    END:
    ret
    SYSCALL_ERROR_LABEL:
    mov RCX, QWORD PTR errno@GOTTPOFF[RIP]
    xor EDX, EDX
    sub RDX, RAX
    mov DWORD PTR FS:[RCX], EDX
    or RAX, -1
    jmp END

On 64-bit x86 Linux the parameters of a system call are passed via registers. The system call number goes into RAX, and RDI, RSI, RDX, R10, R8, R9 hold the six parameters (system calls are limited to six parameters)
After a system call, the contents of the RCX and R11 registers should be considered clobbered, and the RAX register holds the return value. A value in the range between -4095 and -1 indicates an error; it is -errno.
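
The error convention can be written as a short C function. This is a sketch of what the cmp RAX, -4095 / jae sequence achieves, with the hypothetical name syscall_result; the unsigned comparison mirrors jae, which treats -4095..-1 as the largest unsigned values.

```c
#include <errno.h>

/* Translate a raw kernel return value into the libc convention:
 * values in [-4095, -1] become -1 with errno set, anything else is passed through. */
static long syscall_result(long raw) {
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;   /* the kernel returned -errno */
        return -1;
    }
    return raw;
}
```

For example, a raw value of -2 becomes a return of -1 with errno set to 2, while -4096 is outside the error range and is returned unchanged.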
GOTTPOFF is the thread local storage for global variables (in this case, errno); GOT is global offset table, TP is thread pointer. On 64-bit x86 Linux, the segment register FS is used as the thread register whose content is the TP. (On 32-bit x86 Linux, the segment register GS is used instead)
On 32-bit x86 Linux, system calls are handled using code in sysdeps/unix/sysv/linux/i386/sysdep.h. In sched_yield's case, the code is (if the sysenter instruction is supported):

    mov EAX, 158
    call DWORD PTR GS:SYSINFO_OFFSET
    cmp EAX, -4095
    jae SYSCALL_ERROR_LABEL
    END:
    ret
    SYSCALL_ERROR_LABEL:
    call __i686.get_pc_thunk.cx
    add ECX, _GLOBAL_OFFSET_TABLE_
    mov ECX, DWORD PTR [errno + ECX]
    xor EDX, EDX
    sub EDX, EAX
    mov DWORD PTR GS:[ECX], EDX
    or EAX, -1
    jmp END

If the sysenter instruction is not available, replace the call DWORD.. by int 0x80.
As in 64-bit mode, on 32-bit x86 Linux the parameters of a system call are passed via registers. The system call number goes into EAX, and EBX, ECX, EDX, ESI, EDI, EBP hold the six parameters. The EAX register holds the return value. A value in the range between -4095 and -1 indicates an error; it is -errno.
__i686.get_pc_thunk.cx is used to get the program counter (IP) into register ECX, since 32-bit x86 has no RIP-relative addressing mode as in 64-bit mode. Its code is very straightforward (see SETUP_PIC_REG macro here):

    __i686.get_pc_thunk.cx:
    mov ECX, DWORD PTR [ESP]
    ret

There are also similar functions such as __i686.get_pc_thunk.bx and __i686.get_pc_thunk.dx
For FreeBSD, see here
What are cancelable system calls and how does Glibc handle them?
sched_yield is a non-cancellation-point system call, but there are also system calls which are cancellation points as in the POSIX thread specification, and their handling in Glibc is slightly complicated.

POSIX specifies a thread cancellation mechanism which allows a thread to terminate any other thread in a controlled manner. Each thread has the following cancellation information associated with it:

Meaning | Bitmask in Glibc's nptl/descr.h | POSIX thread flag |
---|---|---|
Cancelability enabled? If disabled, cancellation requests against the thread are held pending. By default, cancelability is enabled. | CANCELSTATE_BITMASK (set means Disabled) | PTHREAD_CANCEL_ENABLE PTHREAD_CANCEL_DISABLE |
Cancelability type? When cancelability is enabled and the cancelability type is Asynchronous, new or pending cancellation requests may be acted upon at any time. When cancelability is enabled and the cancelability type is Deferred, cancellation requests against the thread are held pending until a cancellation point is reached. If cancelability is disabled, the setting of the cancelability type has no immediate effect until cancelability is enabled. By default, cancelability type is Deferred. | CANCELTYPE_BITMASK (set means Asynchronous) | PTHREAD_CANCEL_DEFERRED PTHREAD_CANCEL_ASYNCHRONOUS |
Any pending cancellation request? | CANCELING_BITMASK | |
Thread is cancelled? | CANCELED_BITMASK | |
Thread is being cancelled? | EXITING_BITMASK | |

So what is a cancellation point? According to POSIX, a call to pthread_testcancel is a cancellation point. In addition, any call to functions (not just system calls!) which would cause blocking is a cancellation point. This link gives a list of such functions. Examples include system calls such as read, write, sleep, poll, wait; APIs which depend on the aforementioned system calls, such as fgetc and fputc; thread APIs such as pthread_cond_wait, pthread_cond_timedwait, pthread_join, pthread_rwlock_*; and POSIX IPC functions such as mq_receive, mq_send, sem_wait, etc.
It's now clear that Glibc must do something to conform to Pthread's cancellation mechanism, and this leads to the following multi-entry function implementation (as in Glibc's nptl/sysdeps/unix/sysv/linux/x86_64/sysdep-cancel.h). Let's take open for example. It now has TWO entry points: __open and __open_nocancel:

    __open:
        cmp __libc_multiple_threads, 0
        jne PSEUDO_CANCEL
    __open_nocancel:
        mov EAX, 2
        syscall
        cmp RAX, -4095
        jae SYSCALL_ERROR_LABEL
        ret
    PSEUDO_CANCEL:
        sub RSP, 8
        call __libc_enable_asynccancel
        mov QWORD PTR [RSP], RAX ; RAX holds the original Cancelability type
        mov EAX, 2
        syscall
        mov RDI, QWORD PTR [RSP] ; now RDI holds the original Cancelability type
        mov RDX, RAX
        call __libc_disable_asynccancel
        mov RAX, RDX
        add RSP, 8
        cmp RAX, -4095
        jae SYSCALL_ERROR_LABEL
    END:
        ret
    SYSCALL_ERROR_LABEL:
        mov RCX, QWORD PTR errno@GOTTPOFF[RIP]
        xor EDX, EDX
        sub RDX, RAX
        mov DWORD PTR FS:[RCX], EDX
        or RAX, -1
        jmp END

So the entry point __open_nocancel does not check for any thread cancellation, while the more generic entry point __open checks whether the program is multi-threaded and decides what to do next. It's also worth noting that __open_nocancel does not reserve any stack space, but __open does.
Here is __libc_enable_asynccancel's code (as in nptl/sysdeps/unix/sysv/linux/x86_64/cancellation.S)

    __libc_enable_asynccancel:
        mov EAX, DWORD PTR FS:CANCELHANDLING
    RETRY:
        mov R11D, EAX
        or R11D, CANCELTYPE_BITMASK
        cmp R11D, EAX
        je END
        lock cmpxchg DWORD PTR FS:CANCELHANDLING, R11D
        jne RETRY
        and R11D, CANCELSTATE_BITMASK|CANCELTYPE_BITMASK|CANCELED_BITMASK|EXITING_BITMASK|CANCEL_RESTMASK|TERMINATED_BITMASK
        cmp R11D, CANCELTYPE_BITMASK|CANCELED_BITMASK
        je UNWIND
    END:
        ret
    UNWIND:
        mov QWORD PTR FS:RESULT, PTHREAD_CANCELED
        lock or DWORD PTR FS:CANCELHANDLING, EXITING_BITMASK
        mov RDI, QWORD PTR FS:CLEANUP_JMP_BUF
        call __pthread_unwind
        hlt

First, it loads the thread's cancellation information (CANCELHANDLING is a number, which is the offset into the thread's local data segment. This offset is calculated when Glibc is compiled, using the awk script at Glibc's scripts/gen-as-const.awk and the symbol file nptl/sysdeps/x86_64/tcb-offsets.sym) and checks whether the Cancelability type is Asynchronous. If not, it sets the type to Asynchronous temporarily. (The original Cancelability type value is in the EAX register.) Then it checks whether the thread has Cancelability enabled, has Asynchronous cancelability type, has been cancelled (CANCELED_BITMASK set), and is not already exiting or terminated. If so, it sets up the relevant flags and calls __pthread_unwind
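
The and/cmp pair before je UNWIND is easier to read as a C predicate. The sketch below is illustrative only: the numeric bit values are assumptions made for the example (the authoritative definitions are in Glibc's nptl/descr.h), and must_unwind is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only; see nptl/descr.h for the real ones. */
#define CANCELSTATE_BITMASK 0x01u        /* set: cancelability disabled */
#define CANCELTYPE_BITMASK  0x02u        /* set: asynchronous type */
#define CANCELING_BITMASK   0x04u
#define CANCELED_BITMASK    0x08u
#define EXITING_BITMASK     0x10u
#define TERMINATED_BITMASK  0x20u
#define CANCEL_RESTMASK     0xffffff80u

/* Mirrors: and R11D, <mask>; cmp R11D, CANCELTYPE_BITMASK|CANCELED_BITMASK. */
static bool must_unwind(uint32_t cancelhandling) {
    const uint32_t mask = CANCELSTATE_BITMASK | CANCELTYPE_BITMASK |
                          CANCELED_BITMASK | EXITING_BITMASK |
                          CANCEL_RESTMASK | TERMINATED_BITMASK;
    return (cancelhandling & mask) == (CANCELTYPE_BITMASK | CANCELED_BITMASK);
}
```

Note that CANCELING_BITMASK is not part of the mask, so a pending request alone neither triggers nor prevents the unwind; what matters is that the thread is enabled, asynchronous, cancelled, and not yet exiting or terminated.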
Note that __pthread_enable_asynccancel in libpthread and __librt_enable_asynccancel in librt both have the same code as __libc_enable_asynccancel. In fact, the system call __open and __open_nocancel have the same implementation in these two libraries as well.
Here is __libc_disable_asynccancel's code (also in nptl/sysdeps/unix/sysv/linux/x86_64/cancellation.S)

    __libc_disable_asynccancel:
        test EDI, CANCELTYPE_BITMASK
        jne END
        mov EAX, DWORD PTR FS:CANCELHANDLING
    RETRY:
        mov R11D, EAX
        and R11D, ~CANCELTYPE_BITMASK
        lock cmpxchg DWORD PTR FS:CANCELHANDLING, R11D
        jne RETRY
        mov EAX, R11D
    RETRY2:
        and EAX, CANCELING_BITMASK|CANCELED_BITMASK
        cmp EAX, CANCELING_BITMASK
        je FUTEX
    END:
        ret
    FUTEX:
        mov RDI, QWORD PTR FS:0
        mov EAX, SYS_futex
        xor R10, R10
        add RDI, CANCELHANDLING
        mov ESI, DWORD PTR FS:PRIVATE_FUTEX
        syscall
        mov EAX, DWORD PTR FS:CANCELHANDLING
        jmp RETRY2

First, it checks whether the original Cancelability type (now in the EDI register... see the code between __libc_enable_asynccancel and __libc_disable_asynccancel above) is Asynchronous. If so, there is nothing to restore, so it just returns. If not, it sets the type back to Deferred by clearing CANCELTYPE_BITMASK. It then checks whether the thread is being cancelled (CANCELING_BITMASK set but CANCELED_BITMASK not yet set). If so, it invokes the futex system call. __libc_disable_asynccancel will not return until the thread is no longer in the "being cancelled" state.
Like __libc_enable_asynccancel, __libc_disable_asynccancel is renamed to __pthread_disable_asynccancel in libpthread and __librt_disable_asynccancel in librt; the code is, of course, identical.
Finally, a platform-neutral high-level C language implementation of __libc_enable_asynccancel and __libc_disable_asynccancel can be found in Glibc's nptl/cancellation.c
What are vsyscalls and VDSO ?
A vsyscall is a system call that avoids crossing the userspace-kernel boundary. To see why a vsyscall is implemented in Linux, see here.

For dynamic executable binaries, there is also a memory segment called [vdso], as shown below:

    $ LD_SHOW_AUXV=true cat /proc/self/maps
    AT_SYSINFO_EHDR: 0x7fffad451000
    AT_HWCAP: 78afbfd
    AT_PAGESZ: 4096
    AT_CLKTCK: 100
    AT_PHDR: 0x400040
    AT_PHENT: 56
    AT_PHNUM: 9
    AT_BASE: 0x7f6dd34b7000
    AT_FLAGS: 0x0
    AT_ENTRY: 0x401850
    AT_UID: 254374
    AT_EUID: 254374
    AT_GID: 16038
    AT_EGID: 16038
    AT_SECURE: 0
    AT_RANDOM: 0x7fffad42b999
    AT_EXECFN: /bin/cat
    AT_PLATFORM: x86_64
    00400000-0040d000 r-xp 00000000 fb:00 2097154 /bin/cat
    0060d000-0060e000 r--p 0000d000 fb:00 2097154 /bin/cat
    0060e000-0060f000 rw-p 0000e000 fb:00 2097154 /bin/cat
    01e04000-01e25000 rw-p 00000000 00:00 0 [heap]
    ..... (useless entries omitted)
    7f6dd36d7000-7f6dd36d8000 r--p 00020000 fb:00 1576914 /lib/ld-2.11.1.so
    7f6dd36d8000-7f6dd36d9000 rw-p 00021000 fb:00 1576914 /lib/ld-2.11.1.so
    7f6dd36d9000-7f6dd36da000 rw-p 00000000 00:00 0
    7fffad418000-7fffad42d000 rw-p 00000000 00:00 0 [stack]
    7fffad451000-7fffad452000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

VDSO
VDSO (Virtual Dynamically-linked Shared Object) in Linux is a kernel-provided shared library that helps userspace perform a few kernel actions without the overhead of a system call, as well as automatically choosing the most efficient system call mechanism.

In the above cat /proc/self/maps command output, note the address 7fffad451000 following AT_SYSINFO_EHDR and the starting address of the [vdso] segment. They are the same, and this is not a coincidence; this is how the kernel passes the address of the VDSO to ld.so at runtime: through the auxiliary vector (this is why we set LD_SHOW_AUXV=true when running cat).
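
A program can read the same auxiliary vector entry itself. The following sketch assumes Linux with a C library that provides getauxval in sys/auxv.h (Glibc 2.16 or later does); the helper names vdso_base and looks_like_elf are invented for the example.

```c
#include <elf.h>        /* AT_SYSINFO_EHDR */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval() */

/* Address of the VDSO as passed by the kernel in the auxiliary vector,
 * or NULL if the entry is absent. */
static const void *vdso_base(void) {
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Check for the 4-byte ELF magic "\x7fELF" at the given address. */
static bool looks_like_elf(const void *p) {
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    return p != NULL && memcmp(p, magic, sizeof magic) == 0;
}
```

If vdso_base() returns a non-NULL pointer, looks_like_elf(vdso_base()) is expected to be true, since the mapped segment starts with an ordinary ELF header.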
This VDSO is loaded into userspace by the kernel regardless of whether the user executable is dynamically or statically linked. So how does the kernel load it into userspace? To understand this, one can see the original VDSO kernel patch for x86_64 here. Simply put, in the load_elf_binary function in Linux kernel source file fs/binfmt_elf.c there is a call to the arch_setup_additional_pages function. For x86_64, arch_setup_additional_pages is defined in arch/x86_64/vdso/vma.c. In the same source file, one can find that the variables vdso_start and vdso_end (defined in arch/x86/vdso/vdso.S) point to the start and end of vdso.so, which is indeed compiled as an independent object file during kernel build and is included using the .incbin directive in arch/x86/vdso/vdso.S.
There are other VDSO's created during kernel build, such as vdso32-int80.so, vdso32-sysenter.so, or vdso32-syscall.so corresponding to different system call approaches in 32-bit x86 Linux.
How does ld.so capture and make use of this AT_SYSINFO_EHDR info? First, in the _dl_aux_init function (see Glibc's source file elf/dl-support.c) one can find a code snippet like this:

    case AT_SYSINFO_EHDR:
        GL(dl_sysinfo_dso) = (void *) av->a_un.a_val;
        break;

so the value corresponding to the AT_SYSINFO_EHDR tag is saved to a global variable GL(dl_sysinfo_dso). Later on, in dl_main (see Glibc's source file elf/rtld.c), look at the code block if (GLRO(dl_sysinfo_dso) != NULL) { .. or the code near the comment "Initialize l_local_scope to contain just this map": one can find that GLRO(dl_sysinfo_dso) is copied into the first entry of l_local_scope. This allows the _dl_vdso_vsym function (see Glibc's source file sysdeps/unix/sysv/linux/dl-vdso.c) to resolve any symbolic references using the symbols in [vdso] first.
In newer versions (e.g. 2.6.32) of the Linux kernel, the address pointed to by AT_SYSINFO_EHDR changes on every run because of the Virtual Address Randomization (a.k.a. ASLR) feature. There are many levels of ASLR in Linux, e.g. stack, brk, .so/mmap, VDSO, etc. To see if ASLR is enabled or not, run the command

    $ sysctl kernel.randomize_va_space

If the value is not 0 (e.g. 1 or 2, depending on CONFIG_COMPAT_BRK; see Linux kernel source file mm/memory.c and search for the keyword randomize_va_space in the Linux kernel source tree), then it is enabled. To run a command without ASLR, use the setarch command, e.g.

    $ setarch `uname -m` -R cat /proc/self/maps

or simply execute setarch `uname -m` -R, which will open a shell with ASLR disabled. Newer GDB also disables ASLR by default (i.e. set disable-randomization on is the default setting), and GDB achieves this by calling personality(orig_personality|ADDR_NO_RANDOMIZE) after fork and before exec.
To see that [vdso] is indeed an ELF shared object, we can use GDB to dump it to a file, say /tmp/vdso.so as follows (there is another approach, based on the uncompressed Linux kernel; see the next paragraph):

    $ gdb --quiet /bin/ls
    Reading symbols from /bin/ls...(no debugging symbols found)...done.
    (gdb) tbreak __open
    Function "__open" not defined.
    Make breakpoint pending on future shared library load? (y or [n]) y
    Breakpoint 1 (__open) pending.
    (gdb) run
    Starting program: /bin/ls
    [Thread debugging using libthread_db enabled]
    Breakpoint 1, 0x00007ffff7503110 in open64 () from /lib/libc.so.6
    (gdb) info inferiors
    Num Description Executable
    * 1 process 19213 /bin/ls
    (gdb) shell cat /proc/19213/maps|grep vdso
    7ffff7ffb000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
    (gdb) dump memory /tmp/vdso.so 0x7ffff7ffb000 0x7ffff7ffc000

One can check that /tmp/vdso.so is indeed an ELF shared object:

    $ readelf -h /tmp/vdso.so
    ELF Header:
    Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
    Class: ELF64
    Data: 2's complement, little endian
    Version: 1 (current)
    OS/ABI: UNIX - System V
    ABI Version: 0
    Type: DYN (Shared object file)
    Machine: Advanced Micro Devices X86-64
    Version: 0x1
    Entry point address: 0xffffffffff700600
    Start of program headers: 64 (bytes into file)
    Start of section headers: 2656 (bytes into file)
    Flags: 0x0
    ....

Its SONAME is linux-vdso.so.1:

    $ readelf -d /tmp/vdso.so
    Dynamic section at offset 0x460 contains 10 entries:
    Tag Type Name/Value
    0x000000000000000e (SONAME) Library soname: [linux-vdso.so.1]
    0x0000000000000004 (HASH) 0xffffffffff700120
    0x0000000000000005 (STRTAB) 0xffffffffff700230
    0x0000000000000006 (SYMTAB) 0xffffffffff700158
    0x000000000000000a (STRSZ) 82 (bytes)
    0x000000000000000b (SYMENT) 24 (bytes)
    0x000000006ffffffc (VERDEF) 0xffffffffff700298
    0x000000006ffffffd (VERDEFNUM) 2
    0x000000006ffffff0 (VERSYM) 0xffffffffff700282
    0x0000000000000000 (NULL) 0x0

The most interesting part is the symbols it exports:

    $ readelf -s /tmp/vdso.so
    Symbol table '.dynsym' contains 9 entries:
    Num: Value Size Type Bind Vis Ndx Name
    0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
    1: ffffffffff70030c 0 SECTION LOCAL DEFAULT 7
    2: ffffffffff7008d0 156 FUNC WEAK DEFAULT 12 clock_gettime@@LINUX_2.6
    3: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6
    4: ffffffffff700790 138 FUNC GLOBAL DEFAULT 12 __vdso_gettimeofday@@LINUX_2.6
    5: ffffffffff700970 61 FUNC GLOBAL DEFAULT 12 __vdso_getcpu@@LINUX_2.6
    6: ffffffffff700790 138 FUNC WEAK DEFAULT 12 gettimeofday@@LINUX_2.6
    7: ffffffffff700970 61 FUNC WEAK DEFAULT 12 getcpu@@LINUX_2.6
    8: ffffffffff7008d0 156 FUNC GLOBAL DEFAULT 12 __vdso_clock_gettime@@LINUX_2.6

The layout of vdso.so is determined by the linker scripts arch/x86/vdso/vdso.lds.S and arch/x86/vdso/vdso-layout.lds.S in the Linux kernel source.
Three weak symbols are exported and they correspond to three system calls: clock_gettime, gettimeofday, and getcpu. Of the three, only clock_gettime and gettimeofday will be used by Glibc. See below for a comprehensive example on how they actually work.
Vsyscall
In the cat /proc/self/maps command output, there is a memory segment called [vsyscall] with starting address 0xffff ffff ff60 0000. Recall that there are four code models in x86_64/AMD64 and the "kernel" code model has its 2 GB address space spanning from 0xffff ffff 8000 0000 to 0xffff ffff ff00 0000, and this [vsyscall] segment is not within this range. To verify this, look at the last few lines of the System.map file at either /boot/System.map-`uname -r` or /lib/modules/`uname -r`/build/System.map (/proc/kallsyms contains the same information, but /proc/kallsyms also contains kernel modules information):

    $ tail -20 /boot/System.map-`uname -r`
    ffffffff81a4ca90 b rfkill_master_switch_op
    ffffffff81a4ca94 b rfkill_op
    ffffffff81a4ca98 b rfkill_last_scheduled
    ffffffff81a4caa0 b klist_remove_lock
    ffffffff81a4caa4 B __bss_stop
    ffffffff81a4d000 B __brk_base
    ffffffff81a5d000 b .brk.dmi_alloc
    ffffffff81a6d000 B __brk_limit
    ffffffffff600000 T vgettimeofday
    ffffffffff600140 t vread_tsc
    ffffffffff600170 t vread_hpet
    ffffffffff600180 D __vsyscall_gtod_data
    ffffffffff600400 T vtime
    ffffffffff600800 T vgetcpu
    ffffffffff600880 D __vgetcpu_mode
    ffffffffff6008c0 D __jiffies
    ffffffffff700000 A VDSO64_PRELINK
    ffffffffff700550 A VDSO64_jiffies
    ffffffffff700558 A VDSO64_vgetcpu_mode
    ffffffffff700560 A VDSO64_vsyscall_gtod_data

How is 0xffff ffff ff60 0000 determined? In /usr/include/asm/vsyscall.h one can see

    enum vsyscall_num {
        __NR_vgettimeofday,
        __NR_vtime,
        __NR_vgetcpu,
    };
    #define VSYSCALL_START (-10UL << 20)
    #define VSYSCALL_SIZE 1024
    #define VSYSCALL_END (-2UL << 20)
    #define VSYSCALL_MAPPED_PAGES 1
    #define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

so VSYSCALL_START is 0xffff ffff ff60 0000
In /usr/include/asm/vsyscall.h there are three vsyscalls defined: gettimeofday , time, and getcpu. Note that getcpu is a system call in 32-bit Linux (its number is 318), but not so in 64-bit Linux, so in 64-bit Linux it is implemented as a vsyscall. To use getcpu, use the wrapper function sched_getcpu in Glibc.
The source code of these three vsyscalls is in Linux kernel source file arch/x86/kernel/vsyscall_64.c. In this file, look at functions:
int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz) { ... } time_t __vsyscall(1) vtime(time_t *t) { ... } long __vsyscall(2) vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache) { ... }The macro __vsyscall ties vgettimeofday to ELF section .vsyscall_0, vtime to section .vsyscall_1, and vgetcpu to section .vsyscall_2. Then how does the addresses of .vsyscall_X sections determined ? In Linux kernel source file arch/x86/kernel/vmlinux.lds.S one can find:
#ifdef CONFIG_X86_64
#define VSYSCALL_ADDR (-10*1024*1024)
#define VLOAD_OFFSET (VSYSCALL_ADDR - __vsyscall_0 + LOAD_OFFSET)
#define VLOAD(x) (ADDR(x) - VLOAD_OFFSET)
....
. = ALIGN(4096);
__vsyscall_0 = .;
. = VSYSCALL_ADDR;
.vsyscall_0 : AT(VLOAD(.vsyscall_0)) { *(.vsyscall_0) } :user
....
.vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) }
.vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) }
Thus, vgettimeofday (i.e. section .vsyscall_0) can be found at address 0xffff ffff ff60 0000 (-10*1024*1024, or -10UL << 20 as in /usr/include/asm/vsyscall.h), vtime (i.e. section .vsyscall_1) can be found at 0xffff ffff ff60 0000 + 1024 = 0xffff ffff ff60 0400, and vgetcpu (i.e. section .vsyscall_2) can be found at address 0xffff ffff ff60 0000 + 2048 = 0xffff ffff ff60 0800.
To see them in the Linux kernel, use the same trick as the extract-ikconfig script to unpack the compressed Linux kernel image vmlinuz. Here we assume vmlinux was compressed using gzip (it can also be compressed with other algorithms such as bzip2, lzma, or lzo):
$ HDR=`binoffset /boot/vmlinuz-$(uname -r) 0x1f 0x8b 0x08 0x0`
$ dd if=/boot/vmlinuz-$(uname -r) bs=1 skip=$HDR | zcat - > /tmp/vmlinux
binoffset is a utility and its source file can usually be found under /usr/src/linux-XXX/scripts/ or here.
0x1f 0x8b 0x08 0x0 is the magic word to identify the beginning of a gzipped file.
In newer versions (e.g. 2.6.32) of Linux kernel, /tmp/vmlinux is an ELF object:
$ readelf -S /tmp/vmlinux
....
  [19] .vsyscall_0 PROGBITS ffffffffff600000 00c00000
       0000000000000116 0000000000000000 AX 0 0 16
  [20] .vsyscall_fn PROGBITS ffffffffff600140 00c00140
       000000000000003f 0000000000000000 AX 0 0 16
  [21] .vsyscall_gtod_da PROGBITS ffffffffff600180 00c00180
       0000000000000060 0000000000000000 WA 0 0 16
  [22] .vsyscall_1 PROGBITS ffffffffff600400 00c00400
       000000000000003d 0000000000000000 AX 0 0 16
  [23] .vsyscall_2 PROGBITS ffffffffff600800 00c00800
       0000000000000075 0000000000000000 AX 0 0 16
...
With the knowledge of System.map, we can also extract vdso.so from the uncompressed kernel image /tmp/vmlinux. As mentioned in the previous paragraph, the variables vdso_start and vdso_end defined in arch/x86/vdso/vdso.S point to the start and end of vdso.so, and they are indeed in the System.map:
$ grep vdso /boot/System.map-`uname -r`
ffffffff81860080 D vdso_enabled
ffffffff81896932 t vdso_setup
ffffffff81896949 t init_vdso_vars
ffffffff81896b51 t vdso_setup
ffffffff81896b7f t relocate_vdso
ffffffff818cd024 T vdso_start
ffffffff818cde84 T vdso32_int80_start
ffffffff818cde84 T vdso_end
ffffffff818ce520 T vdso32_int80_end
ffffffff818ce520 T vdso32_syscall_start
ffffffff818cebc8 T vdso32_syscall_end
ffffffff818cebc8 T vdso32_sysenter_start
ffffffff818cf274 T vdso32_sysenter_end
ffffffff818dde8b t __setup_str_vdso_setup
ffffffff818dde91 t __setup_str_vdso_setup
ffffffff81911ac0 t __setup_vdso_setup
ffffffff81911ad8 t __setup_vdso_setup
ffffffff819129b8 t __initcall_init_vdso_vars6
ffffffff81973b20 b vdso_pages
ffffffff81973b28 b vdso_size
ffffffff81973b30 b vdso32_pages
One can check that the address ffffffff818cd024 is in the .init.data section. We can use objdump to verify the content is indeed an ELF object:
$ objdump -sj .init.data --start-address=0xffffffff818cd024 --stop-address=0xffffffff818cde84 /tmp/vmlinux
/tmp/vmlinux: file format elf64-x86-64
Contents of section .init.data:
 ffffffff818cd024 7f454c46 02010100 00000000 00000000 .ELF............
 ffffffff818cd034 03003e00 01000000 000670ff ffffffff ..>.......p.....
 ffffffff818cd044 40000000 00000000 600a0000 00000000 @.......`.......
 ffffffff818cd054 00000000 40003800 04004000 10000f00 ....@.8...@.....
....
Now we need to create a binary file from the above hexdump. To achieve this, first run the above command and redirect the result to a file, say foo. Next, use:
$ awk ' /ffffffff8/ { print $2,$3,$4,$5 }' foo | xxd -r -p > /tmp/vdso.so
and one can now verify that this /tmp/vdso.so is the same as the one created using the GDB approach in the previous paragraph. (If xxd is not available on your system, get its source code here)
Note that in older versions (e.g. 2.6.9) of Linux kernel, /tmp/vmlinux is a raw binary image created from a command like the following:
objcopy -O binary -R .note -R .comment -S [inputELFobj] vmlinux
(To see this, search for -O binary -R .note -R .comment -S or OBJCOPYFLAGS among the Makefiles in the Linux kernel source tree.) The above command strips all symbolic and debugging information (-S option), strips the .note and .comment sections, and creates a raw binary image (-O binary option), which can be loaded into memory as it is. The raw binary image creation is implemented in the GNU binutils source file bfd/binary.c (in particular, look at the function binary_set_section_contents).
A simple example shows how this raw binary is created from an ELF executable binary. Suppose we have an ELF executable binary a.out which has the following sections:
$ readelf -t a.out
There are 24 section headers, starting at offset 0x71898:
Section Headers:
  [Nr] Name
       Type Address Offset Link
       Size EntSize Info Align
       Flags
  [ 0]
       NULL 0000000000000000 0000000000000000 0
       0000000000000000 0000000000000000 0 0
       [0000000000000000]:
  [ 1] .note.ABI-tag
       NOTE 0000000000400120 0000000000000120 0
       0000000000000020 0000000000000000 0 4
       [0000000000000002]: ALLOC
  [ 2] .init
       PROGBITS 0000000000400140 0000000000000140 0
       0000000000000018 0000000000000000 0 4
       [0000000000000006]: ALLOC, EXEC
  [ 3] .text
       PROGBITS 0000000000400160 0000000000000160 0
       0000000000052e48 0000000000000000 0 16
       [0000000000000006]: ALLOC, EXEC
.....
  [18] .got.plt
       PROGBITS 00000000006700d8 00000000000700d8 0
       0000000000000018 0000000000000008 0 8
       [0000000000000003]: WRITE, ALLOC
  [19] .data
       PROGBITS 0000000000670100 0000000000070100 0
       00000000000015e8 0000000000000000 0 32
       [0000000000000003]: WRITE, ALLOC
  [20] .bss
       NOBITS 0000000000671700 00000000000716e8 0
       0000000000002440 0000000000000000 0 32
       [0000000000000003]: WRITE, ALLOC
  [21] __libc_freeres_ptrs
       NOBITS 0000000000673b40 00000000000716e8 0
       0000000000000030 0000000000000000 0 8
       [0000000000000003]: WRITE, ALLOC
  [22] .comment
       PROGBITS 0000000000000000 00000000000716e8 0
       00000000000000bf 0000000000000000 0 1
       [0000000000000000]:
  [23] .shstrtab
       STRTAB 0000000000000000 00000000000717a7 0
       00000000000000ec 0000000000000000 0 1
       [0000000000000000]:
What binary_set_section_contents does is the following. First, it keeps only those sections which have the ALLOC flag and are not of NOBITS type (NOBITS means the section takes no space in the ELF binary file), so sections such as .bss or .comment are left out. Next, it finds the lowest Load Memory Address (LMA) of all remaining sections, which is 0x400120 (the .note.ABI-tag section) in the above example. 0x400120 becomes the new base address, and binary_set_section_contents dumps the remaining sections one by one by doing an lseek with offset = section LMA - 0x400120 and then writing the section content.
In above example, the section with the highest LMA is .data section, and its size is 0x15e8, so the resulting raw binary image will have file size 0x670100 - 0x400120 + 0x15e8 = 2561480:
$ objcopy -O binary a.out raw
$ du -B1 --apparent-size raw
2561480 raw
$ du -B1 -s raw
475136 raw
(Files created from objcopy -O binary could be "sparse", i.e. containing "holes", as the above two du commands indicate.)
If /tmp/vmlinux is a raw binary image, we can force objdump to disassemble it as follows (use Linux kernel version 2.6.9 as example):
$ objdump -b binary -m i386:x86-64 -D /tmp/vmlinux
/tmp/vmlinux: file format binary
Disassembly of section .data:
0000000000000000 <.data>:
   0: 89 dd                mov %ebx,%ebp
   2: b8 18 00 00 00       mov $0x18,%eax
   7: 8e d8                mov %eax,%ds
   9: b8 00 00 00 80       mov $0x80000000,%eax
   e: 0f a2                cpuid
  10: 3d 00 00 00 80       cmp $0x80000000,%eax
  15: 0f 86 35 01 00 00    jbe 0x150
  1b: b8 01 00 00 80       mov $0x80000001,%eax
  20: 0f a2                cpuid
  22: 0f ba e2 1d          bt $0x1d,%edx
  26: 0f 83 24 01 00 00    jae 0x150
....
To make sure /tmp/vmlinux indeed has no special format or structure and objdump disassembles the file from the beginning, compare it with the output of hexdump:
$ hexdump -v -C /tmp/vmlinux
00000000 89 dd b8 18 00 00 00 8e d8 b8 00 00 00 80 0f a2 |................|
00000010 3d 00 00 00 80 0f 86 35 01 00 00 b8 01 00 00 80 |=......5........|
00000020 0f a2 0f ba e2 1d 0f 83 24 01 00 00 89 d7 31 c0 |........$.....1.|
00000030 0f ba e8 05 0f ba e8 07 0f 22 e0 b8 00 10 10 00 |........."......|
00000040 0f 22 d8 b9 80 00 00 c0 0f 32 0f ba e8 08 0f ba |.".......2......|
...
Cross-check with the corresponding System.map:
$ head /boot/System.map-`uname -r`
ffffffff80100000 A _text
ffffffff80100000 t startup_32
ffffffff80100081 t reach_compatibility_mode
ffffffff8010008e t second
ffffffff80100100 t reach_long64
ffffffff8010013d T initial_code
ffffffff80100145 T init_rsp
ffffffff80100150 T no_long_mode
ffffffff80100f00 T pGDT32
ffffffff80100f10 t ljumpvector
From this we learn that the code starts at address 0xffff ffff 8010 0000. However, we still do not know if the beginning of the raw binary image /tmp/vmlinux is also the beginning of code; we have to dig up the source code of startup_32 and verify. startup_32 is in head.S and a complete copy can be found here. The code snippet is like this:
startup_32:
	/*
	 * At this point the CPU runs in 32bit protected mode (CS.D = 1) with
	 * paging disabled and the point of this file is to switch to 64bit
	 * long mode with a kernel mapping for kerneland to jump into the
	 * kernel virtual addresses.
	 * There is no stack until we set one up.
	 */
	movl %ebx,%ebp	/* Save trampoline flag */
	movl $__KERNEL_DS,%eax
	movl %eax,%ds
	/* If the CPU doesn't support CPUID this will double fault.
	 * Unfortunately it is hard to check for CPUID without a stack.
	 */
	/* Check if extended functions are implemented */
	movl $0x80000000, %eax
	cpuid
	cmpl $0x80000000, %eax
	jbe no_long_mode
	/* Check if long mode is implemented */
	mov $0x80000001, %eax
	cpuid
	...
Comparing the above with the objdump output, we can now confirm that 0xffff ffff 8010 0000 is the base address of the raw binary image /tmp/vmlinux.
Extracting VDSOs from a raw binary image is an easier task. The VDSOs in kernel version 2.6.9 are indicated by the global variables syscall32_syscall/syscall32_syscall_end and syscall32_sysenter/syscall32_sysenter_end:
$ grep syscall32_sys /boot/System.map-`uname -r`
ffffffff8053d5a0 t syscall32_syscall
ffffffff8053e048 t syscall32_syscall_end
ffffffff8053e048 t syscall32_sysenter
ffffffff8053eb00 t syscall32_sysenter_end
Since the base address is 0xffff ffff 8010 0000, the offset of syscall32_syscall in /tmp/vmlinux is 0xffffffff8053d5a0 - 0xffffffff80100000 = 4445600 and the size of this VDSO is 0xffffffff8053e048 - 0xffffffff8053d5a0 = 2728. Now we can extract the VDSO by
$ dd if=/tmp/vmlinux of=/tmp/syscall_vdso.so bs=1 skip=4445600 count=2728
Verify this is an ELF binary object. Note that this is 32-bit and has SONAME being linux-gate.so.1:
$ readelf -a /tmp/syscall_vdso.so
ELF Header:
  Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class: ELF32
  Data: 2's complement, little endian
  Version: 1 (current)
  OS/ABI: UNIX - System V
  ABI Version: 0
  Type: DYN (Shared object file)
  Machine: Intel 80386
  Version: 0x1
  Entry point address: 0xffffe400
...
Section Headers:
  [Nr] Name Type Addr Off Size ES Flg Lk Inf Al
  [ 0] NULL 00000000 000000 000000 00 0 0 0
  [ 1] .hash HASH ffffe094 000094 000044 04 A 2 0 4
  [ 2] .dynsym DYNSYM ffffe0d8 0000d8 0000c0 10 A 3 8 4
  [ 3] .dynstr STRTAB ffffe198 000198 000056 00 A 0 0 1
  [ 4] .gnu.version VERSYM ffffe1ee 0001ee 000018 02 A 2 0 2
  [ 5] .gnu.version_d VERDEF ffffe208 000208 000038 00 A 3 2 4
  [ 6] .text.vsyscall PROGBITS ffffe400 000400 000010 00 AX 0 0 1
  [ 7] .text.sigreturn PROGBITS ffffe500 000500 000008 00 AX 0 0 32
  [ 8] .text.rtsigreturn PROGBITS ffffe600 000600 000007 00 AX 0 0 32
  [ 9] .eh_frame_hdr PROGBITS ffffe608 000608 000024 00 A 0 0 4
  [10] .eh_frame PROGBITS ffffe62c 00062c 000100 00 A 0 0 4
  [11] .dynamic DYNAMIC ffffe72c 00072c 000078 08 WA 3 0 4
  [12] .useless PROGBITS ffffe7a4 0007a4 00000c 04 WA 0 0 4
  [13] .text PROGBITS ffffe7b0 0007b0 000000 00 AX 0 0 4
  [14] .shstrtab STRTAB 00000000 0007b0 00009e 00 0 0 1
....
Program Headers:
  Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
  LOAD 0x000000 0xffffe000 0xffffe000 0x007b0 0x007b0 R E 0x1000
  DYNAMIC 0x00072c 0xffffe72c 0xffffe72c 0x00078 0x00078 R 0x4
  GNU_EH_FRAME 0x000608 0xffffe608 0xffffe608 0x00024 0x00024 R 0x4
 Section to Segment mapping:
  Segment Sections...
   00 .hash .dynsym .dynstr .gnu.version .gnu.version_d .text.vsyscall .text.sigreturn .text.rtsigreturn .eh_frame_hdr .eh_frame .dynamic .useless
   01 .dynamic
   02 .eh_frame_hdr
Dynamic section at offset 0x72c contains 10 entries:
  Tag Type Name/Value
  0x0000000e (SONAME) Library soname: [linux-gate.so.1]
  0x00000004 (HASH) 0xffffe094
  0x00000005 (STRTAB) 0xffffe198
  0x00000006 (SYMTAB) 0xffffe0d8
  0x0000000a (STRSZ) 86 (bytes)
  0x0000000b (SYMENT) 16 (bytes)
  0x6ffffffc (VERDEF) 0xffffe208
  0x6ffffffd (VERDEFNUM) 2
  0x6ffffff0 (VERSYM) 0xffffe1ee
  0x00000000 (NULL) 0x0
It has the following exported symbols (note the addresses of these symbols):
$ readelf -s /tmp/syscall_vdso.so
Symbol table '.dynsym' contains 12 entries:
   Num: Value Size Type Bind Vis Ndx Name
     0: 00000000 0 NOTYPE LOCAL DEFAULT UND
     1: ffffe400 0 SECTION LOCAL DEFAULT 6
     2: ffffe500 0 SECTION LOCAL DEFAULT 7
     3: ffffe600 0 SECTION LOCAL DEFAULT 8
     4: ffffe608 0 SECTION LOCAL DEFAULT 9
     5: ffffe62c 0 SECTION LOCAL DEFAULT 10
     6: ffffe7a4 0 SECTION LOCAL DEFAULT 12
     7: ffffe7b0 0 SECTION LOCAL DEFAULT 13
     8: ffffe400 16 FUNC GLOBAL DEFAULT 6 __kernel_vsyscall@@LINUX_2.5
     9: 00000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.5
    10: ffffe600 7 FUNC GLOBAL DEFAULT 8 __kernel_rt_sigreturn@@LINUX_2.5
    11: ffffe500 8 FUNC GLOBAL DEFAULT 7 __kernel_sigreturn@@LINUX_2.5
To reconstruct the Linux kernel binary (as an ELF executable binary) from the raw binary image, use this dressUp tool.
How do vsyscalls and VDSO work ?
To show how vsyscalls and VDSO work, we use the following code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

int main()
{
    {
        /* gettimeofday: microsecond resolution */
        struct timeval tim;
        gettimeofday(&tim, NULL);
        printf("%.6lf seconds\n", tim.tv_sec+tim.tv_usec/1000000.0);
    }
    {
        /* time: second resolution */
        time_t t1=time(NULL);
        printf("%lu seconds\n", (unsigned long)t1);
    }
    {
        /* clock_gettime: nanosecond resolution */
        struct timespec tp;
        clock_gettime(CLOCK_REALTIME, &tp);
        printf("%.6lf seconds\n", tp.tv_sec + tp.tv_nsec/1000000000.0);
    }
    return 0;
}
Compile the above code with the -lrt command-line option (in older Glibc versions clock_gettime lives in librt).
Next, launch GDB and set a breakpoint at __gettimeofday, for which gettimeofday is a weak alias. If one sets a breakpoint at gettimeofday instead, then as mentioned earlier, the symbol resolution of GDB will find the gettimeofday symbol inside VDSO, not Glibc, since VDSO has precedence.
(gdb) tbreak __gettimeofday
Function "__gettimeofday" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (__gettimeofday) pending.
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]
Breakpoint 1, 0x00007ffff790bb10 in gettimeofday () from /lib/libc.so.6
(gdb) disas
Dump of assembler code for function gettimeofday:
=> 0x00007ffff790bb10 <+0>: sub $0x8,%rsp
   0x00007ffff790bb14 <+4>: mov 0x2ca9e5(%rip),%rax # 0x7ffff7bd6500 <__vdso_gettimeofday>
   0x00007ffff790bb1b <+11>: ror $0x11,%rax
   0x00007ffff790bb1f <+15>: xor %fs:0x30,%rax
   0x00007ffff790bb28 <+24>: callq *%rax
   0x00007ffff790bb2a <+26>: cmp $0xfffff001,%eax
   0x00007ffff790bb2f <+31>: jae 0x7ffff790bb36 <gettimeofday+38>
   0x00007ffff790bb31 <+33>: add $0x8,%rsp
   0x00007ffff790bb35 <+37>: retq
   0x00007ffff790bb36 <+38>: mov 0x2c5463(%rip),%rcx # 0x7ffff7bd0fa0
   0x00007ffff790bb3d <+45>: xor %edx,%edx
   0x00007ffff790bb3f <+47>: sub %rax,%rdx
   0x00007ffff790bb42 <+50>: mov %edx,%fs:(%rcx)
   0x00007ffff790bb45 <+53>: or $0xffffffffffffffff,%rax
   0x00007ffff790bb49 <+57>: jmp 0x7ffff790bb31 <gettimeofday+33>
End of assembler dump.
At this point we can see what __gettimeofday inside Glibc is doing: it loads a value from a variable called __vdso_gettimeofday into a register, applies a rotation and an XOR (pointer demangling, see below), and calls the demangled address (all of this can be found in sysdeps/unix/sysv/linux/x86_64/gettimeofday.S).
So how is the variable __vdso_gettimeofday set up ? Let's set a watch point and run the program again:
(gdb) watch *0x7ffff7bd6500
Hardware watchpoint 2: *0x7ffff7bd6500
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: a.out
Watchpoint 2: *0x7ffff7bd6500
Old value = <unreadable>
New value = 0
0x00007ffff7df48ba in mmap64 () from /lib/ld-linux-x86-64.so.2
(gdb) cont
Continuing.
[Thread debugging using libthread_db enabled]
Hardware watchpoint 2: *0x7ffff7bd6500
Old value = 0
New value = 9205590
0x00007ffff789996f in _init () from /lib/libc.so.6
(gdb) bt
#0 0x00007ffff789996f in _init () from /lib/libc.so.6
#1 0x00007ffff7dec559 in call_init () from /lib/ld-linux-x86-64.so.2
#2 0x00007ffff7dec697 in _dl_init_internal () from /lib/ld-linux-x86-64.so.2
#3 0x00007ffff7ddfaca in _dl_start_user () from /lib/ld-linux-x86-64.so.2
#4 0x0000000000000001 in ?? ()
#5 0x00007fffffffe9a1 in ?? ()
#6 0x0000000000000000 in ?? ()
(gdb) x/10i $pc - 0x1f
   0x7ffff7899950 <_init+224>: mov $0xffffffffff600000,%rdx
   0x7ffff7899957 <_init+231>: cmovne %rax,%rdx
   0x7ffff789995b <_init+235>: xor %fs:0x30,%rdx
   0x7ffff7899964 <_init+244>: rol $0x11,%rdx
   0x7ffff7899968 <_init+248>: mov %rdx,0x33cb91(%rip) # 0x7ffff7bd6500 <__vdso_gettimeofday>
=> 0x7ffff789996f <_init+255>: callq 0x7ffff797ff40 <_dl_vdso_vsym>
   0x7ffff7899974 <_init+260>: mov %r12,%rdx
   0x7ffff7899977 <_init+263>: mov %rbp,%rsi
   0x7ffff789997a <_init+266>: mov %ebx,%edi
   0x7ffff789997c <_init+268>: xor %fs:0x30,%rax
The first time GDB stops at our watch point is when ld.so loads libc.so and sets up its .bss section, so we need to continue (cont). The second stop is the place where the variable __vdso_gettimeofday is set. The source code is in the _libc_vdso_platform_setup function in sysdeps/unix/sysv/linux/x86_64/init-first.c:
static inline void
_libc_vdso_platform_setup (void)
{
  PREPARE_VERSION (linux26, "LINUX_2.6", 61765110);

  void *p = _dl_vdso_vsym ("gettimeofday", &linux26);
  /* If the vDSO is not available we fall back on the old vsyscall. */
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000ul
  if (p == NULL)
    p = (void *) VSYSCALL_ADDR_vgettimeofday;
  PTR_MANGLE (p);
  __vdso_gettimeofday = p;

  p = _dl_vdso_vsym ("clock_gettime", &linux26);
  PTR_MANGLE (p);
  __GI___vdso_clock_gettime = p;
}
So __vdso_gettimeofday stores the mangled address of gettimeofday in VDSO.
Now let's set a breakpoint at gettimeofday and see what's going on inside VDSO:
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) tbreak gettimeofday
Breakpoint 1 at 0x00007ffff7ffb8c0
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]
Breakpoint 1, 0x00007ffff7ffb8c0 in gettimeofday ()
(gdb) bt
#0 0x00007ffff7ffb8c0 in gettimeofday ()
#1 0x00007ffff790bb2a in gettimeofday () from /lib/libc.so.6
#2 0x000000000040062d in main ()
(gdb) info inferiors
  Num Description Executable
* 1 process 1089 a.out
(gdb) shell cat /proc/1089/maps|grep vdso
7ffff7ffb000-7ffff7ffc000 r-xp 00000000 00:00 0 [vdso]
The memory map shows the address 7ffff7ffb8c0 is indeed within VDSO.
The Linux kernel (2.6.35) source for VDSO's gettimeofday is __vdso_gettimeofday function in arch/x86/vdso/vclock_gettime.c. What it does is:
- If sysctl_enabled is true (i.e. the kernel is compiled with CONFIG_SYSCTL) and clock.vread is set, then inside vgetns clock.vread is called to get the time.
- Otherwise, it uses the ordinary system call, i.e. it calls syscall function with the system call number __NR_gettimeofday
clock.vread is a member of struct vsyscall_gtod_data (Virtual system call GetTimeOfDay) in Linux source arch/x86/include/asm/vgtod.h.
In our debugging session above, we do not know in advance which path is taken, i.e. whether clock.vread or syscall gets called. To find out, we could do this in GDB after the above breakpoint:
(gdb) catch syscall
Catchpoint 2 (any syscall)
(gdb) while ($pc < 0xffffffffff600000)
 > si
 > end
Here we assumed that if clock.vread ever gets called, its code is in the vsyscall region, i.e. at address 0xffffffffff600000 or after it.
If clock.vread is called, one should see the following GDB output:
0x00007ffff7ffb8c1 in gettimeofday ()
0x00007ffff7ffb8cb in gettimeofday ()
....
0xffffffffff600140 in ?? ()
(gdb) x/20i $pc
=> 0xffffffffff600140: push %rbp
   0xffffffffff600141: mov %rsp,%rbp
   0xffffffffff600144: mfence
   0xffffffffff600147: data32 xchg %ax,%ax
   0xffffffffff60014a: rdtsc
   0xffffffffff60014c: mfence
   0xffffffffff60014f: data32 xchg %ax,%ax
   0xffffffffff600152: shl $0x20,%rdx
   0xffffffffff600156: mov %eax,%eax
   0xffffffffff600158: or %rax,%rdx
   0xffffffffff60015b: mov 0x46(%rip),%rax # 0xffffffffff6001a8
   0xffffffffff600162: leaveq
   0xffffffffff600163: cmp %rax,%rdx
   0xffffffffff600166: cmovae %rdx,%rax
   0xffffffffff60016a: retq
....
So this is what clock.vread is doing! It just uses rdtsc to read the TimeStamp Counter. This is what a userspace system call means!
In a nutshell, for dynamic binaries, gettimeofday is done as follows:
- __gettimeofday in Glibc calls gettimeofday in VDSO.
- gettimeofday in VDSO could either jump to clock.vread inside vsyscall, or make an ordinary syscall
- For the former case, clock.vread uses rdtsc to get the time.
For statically linked binaries, gettimeofday is done as follows:
- __gettimeofday in Glibc calls vgettimeofday in vsyscall.
- vgettimeofday in vsyscall could either invoke clock.vread, or make an ordinary syscall
- For the former case, clock.vread uses rdtsc to get the time.
Pointer Encryption
This security mechanism in Glibc has many different names: pointer encryption, pointer guard (see the LD_POINTER_GUARD part in ld.so's man page), or pointer obfuscation. The purpose is to protect function pointers inside a writable memory region, e.g. the .bss section, because the memory region cannot be made read-only. This is done through a pair of macros: PTR_MANGLE and PTR_DEMANGLE. For x86_64, the macros are defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h
An example usage of PTR_MANGLE is in _libc_vdso_platform_setup of sysdeps/unix/sysv/linux/x86_64/init-first.c, where the function pointer stored in __vdso_gettimeofday is encrypted. In __gettimeofday of sysdeps/unix/sysv/linux/x86_64/gettimeofday.S, PTR_DEMANGLE is used to decrypt the mangled value.