[lxc-users] Containers on linux-4.8-rc1 sometimes(?) requiring "cgmanager -m name=systemd" (bisected, but is it a bug?)

Tue Sep 13 08:11:50 UTC 2016

On Linux 4.8-rc1 through 4.8-rc6 (the latest rc), lxc fails to start
Ubuntu 16.04 and CentOS 7 containers [1] unless I first run
"cgmanager -m name=systemd &" on the host, which, unlike the
containers, was not running systemd or cgmanager.
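
For anyone who wants to script that workaround, here is a minimal sketch.
It assumes a cgroup v1 host where /proc/self/cgroup lists a named hierarchy
as "name=systemd" in the controller field. The function names
has_named_hierarchy and ensure_systemd_hierarchy are made up for
illustration; only the "cgmanager -m name=systemd" invocation comes from
the report above.

```shell
#!/bin/sh
# Sketch only: detect a named cgroup v1 hierarchy before starting a container.

# has_named_hierarchy NAME [FILE]
# Succeeds if FILE (default: /proc/self/cgroup) lists a hierarchy whose
# controller field contains "name=NAME".
has_named_hierarchy() {
    name=$1
    file=${2:-/proc/self/cgroup}
    # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    awk -F: -v want="name=$name" '
        {
            n = split($2, c, ",")
            for (i = 1; i <= n; i++)
                if (c[i] == want)
                    found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# ensure_systemd_hierarchy
# Hypothetical helper: starts cgmanager with the option quoted in the
# report if no name=systemd hierarchy is visible yet. Must run as root.
ensure_systemd_hierarchy() {
    if ! has_named_hierarchy systemd; then
        cgmanager -m name=systemd &
    fi
}
```

Calling ensure_systemd_hierarchy on the host before lxc-start would then be
a no-op on hosts that already have the hierarchy, such as those running
systemd.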

Git bisect revealed that this behavior began with a commit entitled
"cgroupns: Only allow creation of hierarchies in the initial cgroup
namespace" [2], which appears to be an attempt to protect against a
possible denial of service attack.  Reverting the commit restores
successful container startup without the need to run that cgmanager
process.  [Eric and Tejun, I have bcc'ed you so you can be aware of
this discussion thread, as you apparently respectively wrote and
approved the commit.]

Running that cgmanager invocation is pretty simple, and seems to me to
be well worth closing a denial of service vulnerability, much as I
dislike adding something systemd-specific to a non-systemd environment
and adding a new dependency (lxc requires cgmanager on the host to
run, I guess, any container that runs systemd).  However, I am posting
this message because I don't fully understand the problem, and, most
importantly, I am wondering if I have stumbled on an unintended
consequence of this commit that might indicate other
potential breakage.

If this new lxc behavior is completely acceptable, then I apologize
for consuming people's time with it and hope that this message will
help others experiencing the same problem find an answer when
they search the web.

Adam Richter


[1] Here is an example of failing to start one of these containers.
$ sudo lxc-start --name ubuntu16.04_amd64 --foreground
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
[!!!!!!] Failed to mount API filesystems, freezing.
Freezing execution.


[2] Here is the commit diff that triggers the new misbehavior.
commit 726a4994b05ff5b6f83d64b5b43c3251217366ce
Author: Eric W. Biederman <ebiederm at xmission.com>
Date:   Fri Jul 15 06:36:44 2016 -0500

    cgroupns: Only allow creation of hierarchies in the initial cgroup namespace

    Unprivileged users can't use hierarchies if they create them as they do not
    have privilieges to the root directory.

    Which means the only thing a hiearchy created by an unprivileged user
    is good for is expanding the number of cgroup links in every css_set,
    which is a DOS attack.

    We could allow hierarchies to be created in namespaces in the initial
    user namespace.  Unfortunately there is only a single namespace for
    the names of heirarchies, so that is likely to create more confusion
    than not.

    So do the simple thing and restrict hiearchy creation to the initial
    cgroup namespace.

    Cc: stable at vger.kernel.org
    Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
    Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
    Signed-off-by: Tejun Heo <tj at kernel.org>

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e75efa8..e0be49f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2215,12 +2215,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
                goto out_unlock;
        }

-       /*
-        * We know this subsystem has not yet been bound.  Users in a non-init
-        * user namespace may only mount hierarchies with no bound subsystems,
-        * i.e. 'none,name=user1'
-        */
-       if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+       /* Hierarchies may only be created in the initial cgroup namespace. */
+       if (ns != &init_cgroup_ns) {
                ret = -EPERM;
                goto out_unlock;
        }
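
The new check rejects the mount whenever the caller is outside the initial
cgroup namespace. One way to see which side of that check a process falls
on is to compare its /proc/<pid>/ns/cgroup link with that of the host's
PID 1. This is a rough sketch with hypothetical helper names. It assumes it
is run as root on the host (inside a container, /proc/1 is the container's
own init, so the comparison would be meaningless) and that the host's PID 1
is in the initial cgroup namespace.

```shell
#!/bin/sh
# Sketch only: compare namespace links under /proc.

# same_ns LINK_A LINK_B
# Succeeds if both symlinks point at the same namespace identifier,
# e.g. "cgroup:[4026531835]". Returns 2 if a link cannot be read.
same_ns() {
    a=$(readlink "$1") || return 2
    b=$(readlink "$2") || return 2
    [ "$a" = "$b" ]
}

# in_init_cgroup_ns [PID]
# Succeeds if PID (default: the calling process) shares a cgroup
# namespace with the host's PID 1.
in_init_cgroup_ns() {
    pid=${1:-self}
    same_ns "/proc/$pid/ns/cgroup" /proc/1/ns/cgroup
}
```

Passing the container's init PID, as seen from the host, to
in_init_cgroup_ns should fail for a container that has its own cgroup
namespace. That is the case in which the patched cgroup_mount() returns
-EPERM when asked to create a new hierarchy.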