Linux CVE-2017-5123 exploit

Vulnerability description

A bug in the waitid syscall (waitid is a more general variant of waitpid):

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;
    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
    }

    if (!err) {
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

The patch that fixed it:

commit 96ca579a1ecc943b75beba58bebb0356f6cc4b51
Author: Kees Cook <keescook@chromium.org>
Date:   Mon Oct 9 11:36:52 2017 -0700

    waitid(): Add missing access_ok() checks
    
    Adds missing access_ok() checks.
    
    CVE-2017-5123
    
    Reported-by: Chris Salls <chrissalls5@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Al Viro <viro@zeniv.linux.org.uk>
    Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
    Cc: stable@kernel.org # 4.13
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/kernel/exit.c b/kernel/exit.c
index f2cd53e92147..cf28528842bc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1610,6 +1610,9 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        if (!infop)
                return err;
 
+       if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
+               goto Efault;
+
        user_access_begin();
        unsafe_put_user(signo, &infop->si_signo, Efault);
        unsafe_put_user(0, &infop->si_errno, Efault);

The code uses unsafe_put_user to fill a struct siginfo in user space, but omits the preceding call to access_ok. That function verifies that a pointer, together with the range it covers, actually lies in user space.

Because the check is missing, passing a kernel-space pointer as infop makes waitid write to kernel memory. Our goal is to turn this write into a privilege escalation.
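To make the missing check concrete, here is a simplified user-space model of the kind of range test access_ok performs. The name model_access_ok and the constant MODEL_USER_LIMIT are illustrative assumptions; the real kernel compares against the task's address limit rather than a fixed constant.

```c
#include <stdint.h>

/* Assumption: on x86_64 with 4-level paging, user space ends just below
 * this address. The real check uses the task's addr_limit instead. */
#define MODEL_USER_LIMIT 0x00007ffffffff000ULL

/* Returns 1 when [addr, addr + size) lies entirely below the user limit. */
static int model_access_ok(uint64_t addr, uint64_t size)
{
    if (size > MODEL_USER_LIMIT)
        return 0;
    /* written this way so that addr + size cannot overflow */
    return addr <= MODEL_USER_LIMIT - size;
}
```

Under this model, a pointer into the kernel's .data section is rejected, while an ordinary user-space buffer is accepted.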

Setup

We have a qemu virtual machine that runs the vulnerable kernel.

The machine has 2 users: user:user and root:root. We want to log in as user and use the exploit to become root.

For debugging purposes we pass the -s parameter to qemu. This enables a gdbserver interface inside qemu, listening on port 1234. We then use the usual gdb to connect to this remote, thus being able to debug the entire system, including the kernel.

$ gdb -nx -x gdbinit
GNU gdb (Debian 8.3-1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) target remote:1234
Remote debugging using :1234
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0xffffffff81959138 in ?? ()
=> 0xffffffff81959138:  65 44 8b 25 e8 0f 6b 7e mov    r12d,DWORD PTR gs:[rip+0x7e6b0fe8]        # 0xa128
(gdb) c
Continuing.

We can easily see that gdb has stopped somewhere inside the kernel, since addresses starting with ffffffff belong to kernel space.

In order to transfer files between the qemu vm and the host machine we’ll use the 9p filesystem. Qemu can take a directory from the host file system and use it as a 9p share that can be mounted from the guest vm.

In our case, we’ll use the vm/share directory, which will be mounted on /mnt/share inside the vm.

Host machine

adrians@snowgoose:~/cns/e2/vm/share$ touch file
adrians@snowgoose:~/cns/e2/vm/share$ ls -l
total 0
-rw-r--r-- 1 adrians adrians 0 Jan  1 21:46 file
adrians@snowgoose:~/cns/e2/vm/share$

QEMU machine

$ mount
/dev/root on / type ext4 (rw,relatime,data=ordered)
devtmpfs on /dev type devtmpfs (rw,relatime,size=119792k,nr_inodes=29948,mode=755)
proc on /proc type proc (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
tmpfs on /tmp type tmpfs (rw,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
sysfs on /sys type sysfs (rw,relatime)
share on /mnt/share type 9p (rw,sync,dirsync,relatime,access=client,trans=virtio)
$ ls -l /mnt/share
total 0
-rw-r--r--    1 user     user             0 Jan  1 19:46 file
$

We can use the /proc/kallsyms entry to quickly find out the address of a kernel symbol, to aid us in the reverse engineering process.

$ grep sys_waitid /proc/kallsyms
ffffffff8105a970 T sys_waitid
ffffffff8105abb0 T compat_sys_waitid
$
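As a small illustration of working with this output, the sketch below parses one "address type name" line in the format shown above. The helper name symbol_address is hypothetical and not part of the exploit sources.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parses one /proc/kallsyms line. Returns 1 and stores the address when the
 * line describes the symbol `wanted`, 0 otherwise. */
static int symbol_address(const char *line, const char *wanted, uint64_t *addr)
{
    uint64_t value;
    char type;
    char name[128];

    if (sscanf(line, "%" SCNx64 " %c %127s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *addr = value;
    return 1;
}
```

An address of 0 in the result means the kernel is hiding the real value from the caller.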

Bug analysis

Let’s take a closer look at what exactly we can write using this vulnerability.

The waitid syscall is used by a parent process that wants to wait for a child process to change state, typically to terminate. It is more versatile than wait and waitpid, combining the functionality of both. It allows waiting for any child process, for a particular process identified by pid, or for any child in a process group identified by pgid. Also, while wait and waitpid return the exit status of the child process, waitid fills a struct siginfo structure, which contains the exit code along with some other fields.
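For reference, this is how waitid is used legitimately from user space. The sketch is a minimal example with a hypothetical helper name (wait_for_exit_code): it forks a child that exits immediately and collects the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with exit_code, then collects it with waitid().
 * Returns 0 and fills *info on success, -1 on failure. */
static int wait_for_exit_code(int exit_code, siginfo_t *info)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(exit_code);           /* child: terminate immediately */

    /* parent: P_PID selects exactly this child, WEXITED waits for exit */
    if (waitid(P_PID, (id_t)child, info, WEXITED) != 0)
        return -1;
    return info->si_pid == child ? 0 : -1;
}
```

After a successful call, si_signo is SIGCHLD, si_code is CLD_EXITED, and si_status holds the low 8 bits of the child's exit code.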

First, the siginfo structure, defined in include/uapi/asm-generic/siginfo.h:

typedef struct siginfo {
    int si_signo;
    int si_errno;
    int si_code;

    union {
        int _pad[SI_PAD_SIZE];

        /* kill() */
        struct {
            __kernel_pid_t _pid;    /* sender's pid */
            __ARCH_SI_UID_T _uid;   /* sender's uid */
        } _kill;

        /* POSIX.1b timers */
        struct {
            __kernel_timer_t _tid;  /* timer id */
            int _overrun;       /* overrun count */
            char _pad[sizeof( __ARCH_SI_UID_T) - sizeof(int)];
            sigval_t _sigval;   /* same as below */
            int _sys_private;       /* not to be passed to user */
        } _timer;

        /* POSIX.1b signals */
        struct {
            __kernel_pid_t _pid;    /* sender's pid */
            __ARCH_SI_UID_T _uid;   /* sender's uid */
            sigval_t _sigval;
        } _rt;

        /* SIGCHLD */
        struct {
            __kernel_pid_t _pid;    /* which child */
            __ARCH_SI_UID_T _uid;   /* sender's uid */
            int _status;        /* exit code */
            __ARCH_SI_CLOCK_T _utime;
            __ARCH_SI_CLOCK_T _stime;
        } _sigchld;

        /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
        struct {
            void __user *_addr; /* faulting insn/memory ref. */
#ifdef __ARCH_SI_TRAPNO
            int _trapno;    /* TRAP # which caused the signal */
#endif
            short _addr_lsb; /* LSB of the reported address */
            union {
                /* used when si_code=SEGV_BNDERR */
                struct {
                    void __user *_lower;
                    void __user *_upper;
                } _addr_bnd;
                /* used when si_code=SEGV_PKUERR */
                __u32 _pkey;
            };
        } _sigfault;

        /* SIGPOLL */
        struct {
            __ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
            int _fd;
        } _sigpoll;

        /* SIGSYS */
        struct {
            void __user *_call_addr; /* calling user insn */
            int _syscall;   /* triggering system call number */
            unsigned int _arch; /* AUDIT_ARCH_* of syscall */
        } _sigsys;
    } _sifields;
} __ARCH_SI_ATTRIBUTES siginfo_t;

This looks quite complicated because of the nested unions, but most of the fields aren’t used in our case. We can figure out the actual layout by inspecting the sys_waitid source code and by disassembling/decompiling the kernel binary.

    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);

and

__int64 __fastcall sys_waitid(__int64 a1, __int64 a2, __int64 a3, __int64 a4, __int64 a5)
{
  ...

  if ( v5 )
  {
    *(_DWORD *)v5 = v8;
    *(_DWORD *)(v5 + 4) = 0;
    *(_DWORD *)(v5 + 8) = HIDWORD(v10);
    *(_QWORD *)(v5 + 16) = v9;
    *(_DWORD *)(v5 + 24) = v10;
  }
  return result;
}

In summary, the layout looks something like this:

struct siginfo {
    int si_signo;    /* offset 0 */
    int si_errno;    /* offset 4 */
    int si_code;     /* offset 8 */
    int _pad;        /* offset 12 */
    int pid;         /* offset 16 */
    int uid;         /* offset 20 */
    int status;      /* offset 24 */
}

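By inspecting the code and running some simple test cases, we can distinguish 2 cases:

Case 1: no child has exited (for example, the process has no children). Every field listed above is written as 0.

Case 2: a child has exited. si_signo is SIGCHLD (17), si_errno is 0, si_code is CLD_EXITED (1), pid and uid are those of the child, and status is its exit code.

We cannot control most of these values. We can control pid, but its value stays below 32768 (0x8000, the default pid_max). We can also control status, but the exit code of a process is limited to 8 bits (at most 255).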

Exploit 1 - overwrite a kernel function pointer with zero

We know that the kernel programming style uses a lot of function pointers. There are many structures that have function pointers inside: file_operations, inode_operations, etc.

If we can overwrite a function pointer with 0, we can make the kernel jump to address 0. Address 0 is not normally mapped, but we can use mmap to allocate memory at address 0 and fill it with the shellcode. Thus, we can make the kernel execute code that we control.

The problem is that many of these *_operations structures are declared const and therefore live in read-only data (.rodata).

const struct file_operations ext4_file_operations = {
    .llseek     = ext4_llseek,
    .read_iter  = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    ...

Also, most of the kernel data structures are allocated on the heap, which makes it hard to find out their address.

We need to find a kernel function pointer which is in the .data section. A good candidate is:

int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
{
    if (sk->sk_family == AF_INET)
        return ip_recv_error(sk, msg, len, addr_len);
#if IS_ENABLED(CONFIG_IPV6)
    if (sk->sk_family == AF_INET6)
        return pingv6_ops.ipv6_recv_error(sk, msg, len, addr_len);
#endif
    return -EINVAL;
}

where pingv6_ops is declared as

struct pingv6_ops pingv6_ops;

and has the type

struct pingv6_ops {
    int (*ipv6_recv_error)(struct sock *sk, struct msghdr *msg, int len,
                   int *addr_len);
    void (*ip6_datagram_recv_common_ctl)(struct sock *sk,
                         struct msghdr *msg,
                         struct sk_buff *skb);
    void (*ip6_datagram_recv_specific_ctl)(struct sock *sk,
                           struct msghdr *msg,
                           struct sk_buff *skb);
    int (*icmpv6_err_convert)(u8 type, u8 code, int *err);
    void (*ipv6_icmp_error)(struct sock *sk, struct sk_buff *skb, int err,
                __be16 port, u32 info, u8 *payload);
    int (*ipv6_chk_addr)(struct net *net, const struct in6_addr *addr,
                 const struct net_device *dev, int strict);
};

So ipv6_recv_error is a function pointer inside the pingv6_ops structure, which is a global variable (in .data). We can double check by looking inside the kernel binary:

__int64 __fastcall inet_recv_error(__int64 a1)
{
  __int16 v1; // r8
  __int64 result; // rax

  v1 = *(_WORD *)(a1 + 16);
  if ( v1 == 2 )
    return sub_FFFFFFFF817BA5D0();
  result = 0xFFFFFFEALL;
  if ( v1 == 10 )
    result = qword_FFFFFFFF8212CC40();
  return result;
}

We see that there is an indirect call from the qword at address 0xFFFFFFFF8212CC40, which is indeed in the .data section of the kernel.

Let’s see how inet_recv_error is called:

int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
        int flags, int *addr_len)
{
    ...
    if (unlikely(flags & MSG_ERRQUEUE))
        return inet_recv_error(sk, msg, len, addr_len);

tcp_recvmsg is ultimately reached from the recv syscall. So, if we call recv on an IPv6 TCP socket with the MSG_ERRQUEUE flag, that code will be reached.

A simple exploit that crashes the kernel is in exploit_crash. Another exploit that makes the kernel execute an int 3 instruction is in exploit_int3. For both, run the VM with run.sh.
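The call path described above can be exercised with a few lines of ordinary socket code. This is a minimal sketch with a hypothetical helper name (probe_ipv6_errqueue). On an unmodified kernel the call is harmless and would be expected to fail with an error such as EAGAIN, because the error queue is empty.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Calls recv() with MSG_ERRQUEUE on a fresh IPv6 TCP socket.
 * Returns 0 if recv() succeeded, otherwise the errno of the failing call. */
static int probe_ipv6_errqueue(void)
{
    char buf[64];
    int saved = 0;
    int fd = socket(AF_INET6, SOCK_STREAM, 0);

    if (fd < 0)
        return errno;               /* e.g. IPv6 is not available */

    /* MSG_DONTWAIT keeps the call from blocking on an empty error queue */
    if (recv(fd, buf, sizeof(buf), MSG_ERRQUEUE | MSG_DONTWAIT) < 0)
        saved = errno;
    close(fd);
    return saved;
}
```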

The shellcode

We need shellcode that changes the uid of the current task to 0 (root). Normally, the uid is in a cred structure which is referenced from task_struct.

struct task_struct {
       ...
    /* Objective and real subjective task credentials (COW): */
    const struct cred __rcu     *real_cred;

    /* Effective (overridable) subjective task credentials (COW): */
    const struct cred __rcu     *cred;
    ...

and

struct cred {
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC  0x43736564
#define CRED_MAGIC_DEAD 0x44656144
#endif
    kuid_t      uid;        /* real UID of the task */
    kgid_t      gid;        /* real GID of the task */
    kuid_t      suid;       /* saved UID of the task */
    kgid_t      sgid;       /* saved GID of the task */
    kuid_t      euid;       /* effective UID of the task */
    kgid_t      egid;       /* effective GID of the task */
    kuid_t      fsuid;      /* UID for VFS ops */
    kgid_t      fsgid;      /* GID for VFS ops */
...

But the easiest way to change the uid of the current task to 0 is to call commit_creds(prepare_kernel_cred(NULL)).

An exploit with the complete shellcode is in exploit_mmap_zero. (run the vm with run.sh)

Exploit 2 - mmap_min_addr

mmap_min_addr is a sysctl exposed at /proc/sys/vm/mmap_min_addr. It specifies the minimum address that can be allocated via mmap. If this value is greater than 0, our previous exploit won’t work, since we can’t allocate memory at address 0 anymore.

Run the VM with ./run_mmap_min.sh and try the previous exploit:
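The current value of the setting can be read like any other file. A minimal sketch, with a hypothetical helper name (read_mmap_min_addr):

```c
#include <stdio.h>

/* Returns the current vm.mmap_min_addr value, or -1 if it cannot be read. */
static long read_mmap_min_addr(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```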

$ /mnt/share/exploit_mmap_zero
mmap failed

To bypass this, we remember that waitid can write some non-zero values as well, provided that our process has a child process that has exited.

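We’ll use the fact that si_code will be set to CLD_EXITED (1) in order to overwrite pingv6_ops.ipv6_recv_error with a non-zero value.

si_errno and si_code are adjacent 4-byte values at offsets 4 and 8. Read together as a little-endian 8-byte value, they give 0x100000000: si_errno (0) is the low half and si_code (1) is the high half. If infop points 4 bytes before the function pointer, the pointer becomes 0x100000000, an address that lies far above typical mmap_min_addr values and that we can map from user space.

The exploit is in exploit_mmap_non_zero. (run the vm with run_mmap_min.sh)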
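The arithmetic can be verified with a short sketch. It assumes a little-endian machine such as x86_64; the function name errno_code_as_qword is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets the two adjacent 4-byte fields as one 8-byte value, in the
 * order they appear in memory (si_errno first, then si_code). */
static uint64_t errno_code_as_qword(int32_t si_errno, int32_t si_code)
{
    int32_t fields[2] = { si_errno, si_code };
    uint64_t value;

    memcpy(&value, fields, sizeof(value));  /* avoids aliasing problems */
    return value;
}
```

With si_errno = 0 and si_code = 1 (CLD_EXITED), the result is 0x100000000 on a little-endian machine.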

Exploit 3 - KASLR

KASLR randomizes the kernel base address at every boot. This prevents us from knowing the address of pingv6_ops.ipv6_recv_error in order to overwrite it.

Besides KASLR, there is another setting called kptr_restrict (/proc/sys/kernel/kptr_restrict). When enabled, it hides from regular users the kernel pointers that the kernel prints with the %pK format specifier, including those in /proc/kallsyms. Otherwise it would be trivial to break KASLR by looking in /proc/kallsyms.

When kptr_restrict is enabled, all pointers will be seen as 0:

$ grep sys_waitid /proc/kallsyms 
0000000000000000 T sys_waitid
0000000000000000 T compat_sys_waitid

To bypass KASLR we notice that the waitid syscall has a side channel. If we provide an address that cannot be written, the syscall fails with EFAULT. Therefore, we can figure out the kernel base address by starting with 0xffffffff81000000 and incrementing the address until waitid no longer returns EFAULT.

There is a simple test program in test_leak. Run the vm with ./run_kaslr.sh.

Welcome to Buildroot
buildroot login: root
Password: 
# grep _stext /proc/kallsyms 
ffffffff90200000 T _stext
# 
Welcome to Buildroot
buildroot login: user
Password: 
$ /mnt/share/test_leak 
kbase = ffffffff91000000

We obtain 0xffffffff91000000, while the real value is 0xffffffff90200000, a difference of 0xe00000. This can be explained by looking at the output of readelf on the kernel binary.

$ readelf -SW vmlinux 
There are 30 section headers, starting at offset 0x1493140:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        ffffffff81000000 200000 95d9f7 00  AX  0   0 4096
  [ 2] .notes            NOTE            ffffffff8195d9f8 b5d9f8 000024 00   A  0   0  4
  [ 3] __ex_table        PROGBITS        ffffffff8195da20 b5da20 003390 00   A  0   0  4
  [ 4] .rodata           PROGBITS        ffffffff81a00000 c00000 2a0016 00  WA  0   0 4096
  [ 5] .pci_fixup        PROGBITS        ffffffff81ca0018 ea0018 003c30 00   A  0   0  8
  [ 6] .tracedata        PROGBITS        ffffffff81ca3c48 ea3c48 000078 00   A  0   0  1
  [ 7] __ksymtab         PROGBITS        ffffffff81ca3cc0 ea3cc0 015760 00   A  0   0 16
  [ 8] __ksymtab_gpl     PROGBITS        ffffffff81cb9420 eb9420 011540 00   A  0   0 16
  [ 9] __ksymtab_strings PROGBITS        ffffffff81cca960 eca960 02fefb 00   A  0   0  1
  [10] __param           PROGBITS        ffffffff81cfa860 efa860 0047e0 00   A  0   0  8
  [11] __modver          PROGBITS        ffffffff81cff040 eff040 000fc0 00   A  0   0  8
  [12] .data             PROGBITS        ffffffff81e00000 1000000 14b6c0 00  WA  0   0 4096
  [13] __bug_table       PROGBITS        ffffffff81f4b6c0 114b6c0 015480 00  WA  0   0  1
  [14] .vvar             PROGBITS        ffffffff81f61000 1161000 001000 00  WA  0   0 16

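We see that .data starts at ffffffff81e00000, which is 0xe00000 above the start of .text (ffffffff81000000), exactly the difference observed above. All the sections before .data are read-only at runtime, so when we try to write to an address inside them, waitid fails with EFAULT. The first address that doesn’t fail is the start of .data, and subtracting 0xe00000 from it gives the kernel base.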
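The correction is a single subtraction. The sketch below uses the section addresses from the readelf output above; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define TEXT_LINK_ADDR 0xffffffff81000000ULL  /* .text address from readelf */
#define DATA_LINK_ADDR 0xffffffff81e00000ULL  /* .data address from readelf */

/* Converts the first writable address found by probing (the start of .data)
 * into the randomized kernel base (the start of .text). */
static uint64_t kernel_base_from_probe(uint64_t first_writable)
{
    return first_writable - (DATA_LINK_ADDR - TEXT_LINK_ADDR);
}
```

For the run shown above, the probed value ffffffff91000000 maps back to ffffffff90200000, which matches the _stext address that root sees.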

The full exploit is in exploit_kaslr. Run the vm with ./run_kaslr.sh.

Exploit 4 - SMEP/SMAP

SMEP (supervisor mode execution prevention) prevents the kernel from executing code from userspace pages.

SMAP (supervisor mode access prevention) prevents the kernel from reading/writing data from/to userspace pages.

So, our previous exploit doesn’t work anymore, since we are executing shellcode from userspace.

./run_smep.sh

$ /mnt/share/exploit_kaslr 
kbase = 0xffffffff9ea00000
[   15.348793] unable to execute userspace code (SMEP?) (uid: 1000)
[   15.349062] BUG: unable to handle kernel paging request at 0000000100000000
[   15.349751] IP: 0x100000000

We’ll have to use a so-called “data-oriented attack”.

There is a string called modprobe_path in kernel/kmod.c:

char modprobe_path[KMOD_PATH_LEN] = "/sbin/modprobe";

This is used by the function call_modprobe:

static int call_modprobe(char *module_name, int wait)
{
    struct subprocess_info *info;
    static char *envp[] = {
        "HOME=/",
        "TERM=linux",
        "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
        NULL
    };

    char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
    if (!argv)
        goto out;

    module_name = kstrdup(module_name, GFP_KERNEL);
    if (!module_name)
        goto free_argv;

    argv[0] = modprobe_path;
    argv[1] = "-q";
    argv[2] = "--";
    argv[3] = module_name;  /* check free_modprobe_argv() */
    argv[4] = NULL;

    info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL,
                     NULL, free_modprobe_argv, NULL);
    if (!info)
        goto free_module_name;

    return call_usermodehelper_exec(info, wait | UMH_KILLABLE);

free_module_name:
    kfree(module_name);
free_argv:
    kfree(argv);
out:
    return -ENOMEM;
}

call_modprobe, in turn, is called from __request_module. This mechanism allows kernel code to request the loading of a kernel module. It does this by executing the binary specified in the modprobe_path string.

One place where __request_module is used is in search_binary_handler in fs/exec.c:

int search_binary_handler(struct linux_binprm *bprm)
{
    ...
    if (need_retry) {
        if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
            printable(bprm->buf[2]) && printable(bprm->buf[3]))
            return retval;
        if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
            return retval;
        need_retry = false;
        goto retry;
    }

    return retval;
}

This is a feature that allows custom executable formats. If at least one of the first 4 bytes of an executable file is not printable, the kernel reads bytes 2 and 3 as an unsigned short and tries to load a module called “binfmt-XXXX”, where XXXX is that value as four hex digits.

To exploit this, we will change the string inside modprobe_path to the path of a file that we control. Then we create a dummy executable file that starts with 4 bytes whose values are greater than 0x80. Upon executing this file, the kernel will eventually reach __request_module and will execute our file with root privileges.
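The decision made by search_binary_handler can be modelled in user space. In this sketch, is_printable mirrors the kernel's printable() macro as recalled from fs/exec.c (an assumption), the helper names are hypothetical, and the byte order assumes a little-endian machine.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed equivalent of printable() in fs/exec.c: tab, newline, or a
 * character in the range 0x20..0x7e. */
static int is_printable(unsigned char c)
{
    return c == '\t' || c == '\n' || (c >= 0x20 && c <= 0x7e);
}

/* Returns the 16-bit id used in the module name, or -1 when all four
 * header bytes are printable and no module is requested. */
static int binfmt_module_id(const unsigned char hdr[4])
{
    if (is_printable(hdr[0]) && is_printable(hdr[1]) &&
        is_printable(hdr[2]) && is_printable(hdr[3]))
        return -1;
    /* *(ushort *)(buf + 2) on a little-endian machine */
    return hdr[2] | (hdr[3] << 8);
}

/* Formats the module name into out; returns 1 if a module would be
 * requested, 0 otherwise. */
static int binfmt_module_name(const unsigned char hdr[4], char *out, size_t n)
{
    int id = binfmt_module_id(hdr);

    if (id < 0)
        return 0;
    snprintf(out, n, "binfmt-%04x", (unsigned int)id);
    return 1;
}
```

A file starting with four 0xff bytes yields the name binfmt-ffff, while a script starting with "#!/b" yields no module request.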

Overwriting modprobe_path

We’ll have to use multiple calls to waitid to overwrite modprobe_path byte by byte. However, there is a small problem.

Let’s say we want to use the status field, which we can control (it’s the exit code of the child process).

Because most of the bytes in the structure are 0, a subsequent use of waitid will overwrite the previous bytes, thus making it hard to achieve a multi-byte write.

However, if we take a closer look, we have the fields _pad and pid which are close to each other. _pad is unused and will be left as it is, and pid is the pid of the child process, of which we can control 2 bytes.

In total, we can achieve a 6-byte write.

We’ll thus be able to overwrite modprobe_path with a value like tmp/AA, which we can control.
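Since two bytes of each write come from a child's pid, it helps to know which pid encodes a given pair of bytes. The sketch below assumes a little-endian machine and the default pid_max of 32768; the names are hypothetical.

```c
/* Assumption: the target uses the default pid_max. */
#define ASSUMED_PID_MAX 32768

/* Returns the pid whose two low bytes are (first, second) in memory order
 * on a little-endian machine, or -1 when no valid pid has that encoding. */
static int pid_for_two_bytes(unsigned char first, unsigned char second)
{
    int pid = first | (second << 8);

    if (pid <= 0 || pid >= ASSUMED_PID_MAX)
        return -1;
    return pid;
}
```

For example, the bytes "AA" correspond to pid 0x4141 (16705). Any pair whose second byte is 0x80 or higher cannot be produced, because the pid would exceed the limit.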

The exploit is in ./exploit_smep.