[lxc-users] Containers on linux-4.8-rc1 sometimes(?) requiring "cgmanager -m name=systemd" (bisected, but is it a bug?)

Tue Sep 13 16:06:51 UTC 2016

Quoting Eric W. Biederman (ebiederm at xmission.com):
> Adam Richter <adamrichter4 at gmail.com> writes:
> 
> > On Linux 4.8-rc1 through 4.8-rc6 (latest rc), lxc fails to start
> > Ubuntu 16.04 and CentOS 7 containers [1], unless I first run
> > "cgmanager -m name=systemd &" on the host, which, unlike the
> > containers, was not running systemd or cgmanager.
> 
> Yes, that appears correct.  Given the current flat namespace of
> hierarchies you fundamentally must coordinate with the host if you want
> to use a new hierarchy.  So running cgmanager on the host seems like
> the minimum way to do that.
> 
> If we truly need something more (which does not appear to be the case
> here) the names of hierarchies need to be moved into a namespace.
> 
> > Git bisect revealed that this behavior began with a commit entitled
> > "cgroupns: Only allow creation of hierarchies in the initial cgroup
> > namespace" [2], which appears to be an attempt to protect against a
> > possible denial of service attack.  Reverting the commit restores
> > successful startup without the need to run that cgmanager process.  [Eric and
> > Tejun, I have bcc'ed you so you can be aware of this discussion
> > thread, as you apparently respectively wrote and approved the commit.]
> 
> As far as I can tell you were getting lucky and not having problems
> before.
> 
> > Running that cgmanager invocation is pretty simple, and seems to me to
> > be well worth closing a denial of service vulnerability, much as I
> > dislike adding something systemd-specific to a non-systemd environment
> > and adding a new dependency (lxc requires cgmanager on the host to
> > run, I guess, any container that runs systemd).  However, I am posting
> > this message because I don't fully understand the problem, and, most
> > importantly, I am wondering if I have stumbled on an unintended
> > consequence of this commit that might indicate other
> > potential breakage.
> 
> I am surprised that your case worked but I don't think it amounts to an
> unintended consequence.
> 
> > If this new lxc behavior is completely acceptable, then I apologize
> > for consuming people's time with it and hope that this message will
> > help others experiencing the same problem find an answer for it when
> > they search the web.
> 
> I will let the lxc-developers judge.
> 
> I don't think you hit a case that was expected to work.  Furthermore

fwiw indeed this was never expected to work.

> either your containers were overprivileged or they would not have been
> able to create subdirectories in the cgroup hierarchy.  So I expect this
> change transformed a subtle breakage (aka one you had not noticed yet)
> into an explicit breakage.
> 
> I am not subscribed to lxc-users so I don't know if anyone else has
> replied to your post.  Cc's would have been better than Bcc's for
> getting feedback in a situation like this.
> 
> Eric
> 
> 
> > Adam Richter
> >
> >
> > [1] Here is an example of failing to start one of these containers.
> > $ sudo lxc-start --name ubuntu16.04_amd64 --foreground
> > Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
> > [!!!!!!] Failed to mount API filesystems, freezing.
> > Freezing execution.
> >
> >
> > [2] Here is the commit diff that triggers the new misbehavior.
> > commit 726a4994b05ff5b6f83d64b5b43c3251217366ce
> > Author: Eric W. Biederman <ebiederm at xmission.com>
> > Date:   Fri Jul 15 06:36:44 2016 -0500
> >
> >     cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
> >
> >     Unprivileged users can't use hierarchies if they create them as they do not
> >     have privilieges to the root directory.
> >
> >     Which means the only thing a hiearchy created by an unprivileged user
> >     is good for is expanding the number of cgroup links in every css_set,
> >     which is a DOS attack.
> >
> >     We could allow hierarchies to be created in namespaces in the initial
> >     user namespace.  Unfortunately there is only a single namespace for
> >     the names of heirarchies, so that is likely to create more confusion
> >     than not.
> >
> >     So do the simple thing and restrict hiearchy creation to the initial
> >     cgroup namespace.
> >
> >     Cc: stable at vger.kernel.org
> >     Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
> >     Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> >     Signed-off-by: Tejun Heo <tj at kernel.org>
> >
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index e75efa8..e0be49f 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -2215,12 +2215,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> >                 goto out_unlock;
> >         }
> >
> > -       /*
> > -        * We know this subsystem has not yet been bound.  Users in a non-init
> > -        * user namespace may only mount hierarchies with no bound subsystems,
> > -        * i.e. 'none,name=user1'
> > -        */
> > -       if (!opts.none && !capable(CAP_SYS_ADMIN)) {
> > +       /* Hierarchies may only be created in the initial cgroup namespace. */
> > +       if (ns != &init_cgroup_ns) {
> >                 ret = -EPERM;
> >                 goto out_unlock;
> >         }
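
For anyone hitting the same "Failed to mount cgroup at /sys/fs/cgroup/systemd" error, it may help to check on the host whether the name=systemd hierarchy already exists before starting the container. Since the commit above restricts hierarchy creation to the initial cgroup namespace, the hierarchy has to be created on the host side first. Below is a minimal sketch, not part of lxc or cgmanager, that parses the cgroup v1 lines of /proc/self/cgroup (format "hierarchy-ID:controller-list:cgroup-path") and reports named hierarchies. The function names are made up for illustration, and it assumes the script is run on the host rather than inside the container.

```python
def named_hierarchies(proc_cgroup_text):
    """Return the set of named cgroup v1 hierarchies found in the given text.

    The text is expected to look like the contents of /proc/<pid>/cgroup,
    one "hierarchy-ID:controller-list:cgroup-path" entry per line.
    """
    names = set()
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        # The controller list is comma separated; a named hierarchy
        # shows up as an entry of the form "name=<something>".
        for controller in parts[1].split(","):
            if controller.startswith("name="):
                names.add(controller[len("name="):])
    return names


def host_has_hierarchy(name, path="/proc/self/cgroup"):
    """Return True if a hierarchy called `name` is listed in `path`."""
    try:
        with open(path) as f:
            return name in named_hierarchies(f.read())
    except OSError:
        # No readable /proc entry (for example, not running on Linux).
        return False


if __name__ == "__main__":
    print("name=systemd present:", host_has_hierarchy("systemd"))
```

If this reports that name=systemd is absent on a non-systemd host, that would be consistent with the failure in [1], and running "cgmanager -m name=systemd" on the host as described at the top of the thread is the workaround discussed here.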