[lxc-devel] [lxc/master] Pass zero to clone

Thu May 9 18:24:11 UTC 2019

A non-text attachment was scrubbed...
Name: not available
Type: text/x-mailbox
Size: 386 bytes
Desc: not available
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20190509/574fe5fd/attachment.bin>
-------------- next part --------------
From ae15df583f47177215b03005d62c85e7732ac9bd Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho at tycho.ws>
Date: Thu, 9 May 2019 13:52:30 -0400
Subject: [PATCH 1/3] lxc_clone: pass 0 as stack and have the kernel allocate
 it

The kernel allows us to pass a NULL stack and have it allocate one, so
let's just do that instead of doing it manually, since there are two
problems with this code:

1. The math is wrong. We allocate a char *foo[__LXC_STACK_SIZE]; which
   means it's really sizeof(char *) * __LXC_STACK_SIZE, instead of just
   __LXC_STACK SIZE.

2. We can't actually allocate it on our stack. When we use CLONE_VM (which
   we do in the shared ns case) that means that the new thread is just
   running one page lower on the stack, but anything that allocates a page
   on the stack may clobber data. This is a pretty short race window since
   we just do the shared ns stuff and then do a clone without CLONE_VM.

However, it does point out an interesting possible privilege escalation if
things aren't configured correctly: do_share_ns() sets up namespaces while
it shares the address space of the task that spawned it; once it enters the
pid ns of the thing it's sharing with, the thing it's sharing with can
ptrace it and write stuff into the host's address space. Since the function
that does the clone() is lxc_spawn(), it has a struct cgroup_ops* on the
stack, which itself has function pointers called later in the function, so
it's possible to allocate shellcode in the address space of the host and
run it fairly easily.

ASLR doesn't mitigate this since we know exactly the stack offsets; however
this patch has the kernel allocate a new stack, which will help. Of course,
the attacker could just check /proc/pid/maps to find the location of the
stack, but they'd still have to guess where to write stuff in.

The thing that does prevent this is the default configuration of apparmor.
Since the apparmor profile is set in the second clone, and apparmor
prevents ptracing things under a different profile, attackers confined by
apparmor can't do this. However, if users are using a custom configuration
with shared namespaces, care must be taken to avoid this race.

Shared namespaces aren't widely used now, so perhaps this isn't a problem,
but with the advent of crio-lxc for k8s, this functionality will be used
more.

Signed-off-by: Tycho Andersen <tycho at tycho.ws>
---
 src/lxc/namespace.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/src/lxc/namespace.c b/src/lxc/namespace.c
index e22d9a4bf0..04b7dd3d2d 100644
--- a/src/lxc/namespace.c
+++ b/src/lxc/namespace.c
@@ -53,22 +53,18 @@ static int do_clone(void *arg)
 	return clone_arg->fn(clone_arg->arg);
 }
 
-#define __LXC_STACK_SIZE 4096
 pid_t lxc_clone(int (*fn)(void *), void *arg, int flags)
 {
-	size_t stack_size;
 	pid_t ret;
 	struct clone_arg clone_arg = {
 	    .fn = fn,
 	    .arg = arg,
 	};
-	char *stack[__LXC_STACK_SIZE] = {0};
-	stack_size = __LXC_STACK_SIZE;
 
 #ifdef __ia64__
-	ret = __clone2(do_clone, stack, stack_size, flags | SIGCHLD, &clone_arg);
+	ret = __clone2(do_clone, NULL, 0, flags | SIGCHLD, &clone_arg);
 #else
-	ret = clone(do_clone, stack + stack_size, flags | SIGCHLD, &clone_arg);
+	ret = clone(do_clone, 0, flags | SIGCHLD, &clone_arg);
 #endif
 	if (ret < 0)
 		SYSERROR("Failed to clone (%#x)", flags);

From b827d5ed02ce861f52f5703cadc6fbf50129f041 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho at tycho.ws>
Date: Thu, 9 May 2019 14:13:40 -0400
Subject: [PATCH 2/3] doc: add a little note about shared ns + LSMs

We should add a little not about the race in the previous patch.

Signed-off-by: Tycho Andersen <tycho at tycho.ws>
---
 doc/lxc.container.conf.sgml.in | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/lxc.container.conf.sgml.in b/doc/lxc.container.conf.sgml.in
index ee78e49a3d..8247e03487 100644
--- a/doc/lxc.container.conf.sgml.in
+++ b/doc/lxc.container.conf.sgml.in
@@ -1657,6 +1657,12 @@ dev/null proc/kcore none bind,relative 0 0
             process wants to inherit the other's network namespace it usually
             needs to inherit the user namespace as well.
             </para>
+
+            <para>
+            Note that without careful additional configuration of an LSM,
+            sharing user+pid namespaces with a task may allow that task to
+            escalate privileges to that of the task calling liblxc.
+            </para>
           </listitem>
         </varlistentry>
       </variablelist>

From 0f407cd4367e220bf89c41c18c6995fe13ed50bf Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho at tycho.ws>
Date: Thu, 9 May 2019 14:18:10 -0400
Subject: [PATCH 3/3] lxc_clone: get rid of some indirection

We have a do_clone(), which just calls a void f(void *) that it gets
passed. We build up a struct consisting of two args that are just the
actual arg and actual function. Let's just have the syscall do this for us.

Signed-off-by: Tycho Andersen <tycho at tycho.ws>
---
 src/lxc/namespace.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/src/lxc/namespace.c b/src/lxc/namespace.c
index 04b7dd3d2d..e5d6836893 100644
--- a/src/lxc/namespace.c
+++ b/src/lxc/namespace.c
@@ -42,29 +42,14 @@
 
 lxc_log_define(namespace, lxc);
 
-struct clone_arg {
-	int (*fn)(void *);
-	void *arg;
-};
-
-static int do_clone(void *arg)
-{
-	struct clone_arg *clone_arg = arg;
-	return clone_arg->fn(clone_arg->arg);
-}
-
 pid_t lxc_clone(int (*fn)(void *), void *arg, int flags)
 {
 	pid_t ret;
-	struct clone_arg clone_arg = {
-	    .fn = fn,
-	    .arg = arg,
-	};
 
 #ifdef __ia64__
-	ret = __clone2(do_clone, NULL, 0, flags | SIGCHLD, &clone_arg);
+	ret = __clone2(fn, NULL, 0, flags | SIGCHLD, arg);
 #else
-	ret = clone(do_clone, 0, flags | SIGCHLD, &clone_arg);
+	ret = clone(fn, 0, flags | SIGCHLD, arg);
 #endif
 	if (ret < 0)
 		SYSERROR("Failed to clone (%#x)", flags);