[lxc-devel] Containers on linux-4.8-rc1 sometimes(?) requiring "cgmanager -m name=systemd" (bisected, but is it a bug?)

Tue Sep 13 16:17:09 UTC 2016

"Serge E. Hallyn" <serge at hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm at xmission.com):
>> Adam Richter <adamrichter4 at gmail.com> writes:
>> 
>> > On Linux 4.8-rc1 through 4.8-rc6 (latest rc), lxc fails to start
>> > Ubuntu 16.04 and CentOS 7 containers [1], unless I first run
>> > "cgmanager -m name=systemd &" on the host, which, unlike the
>> > containers, was not running systemd or cgmanager.
>> 
>> Yes, that appears correct.  Given the current flat namespace of
>> hierarchies you fundamentally must coordinate with the host if you want
>> to use a new hierarchy.  So running cgmanager on the host seems like
>> the minimum way to do that.
>> 
>> If we truly need something more (which does not appear to be the case
>> here) the names of hierarchies need to be moved into a namespace.
>> 
>> > Git bisect revealed that this behavior began with a commit entitled
>> > "cgroupns: Only allow creation of hierarchies in the initial cgroup
>> > namespace" [2], which appears to be an attempt to protect against a
>> > possible denial of service attack.  Reverting the commit also restores
>> > successful startup without the need to run that cgmanager process.  [Eric and
>> > Tejun, I have bcc'ed you so you can be aware of this discussion
>> > thread, as you apparently respectively wrote and approved the commit.]
>> 
>> As far as I can tell you were getting lucky and not having problems
>> before.
>> 
>> > Running that cgmanager invocation is pretty simple, and seems to me to
>> > be well worth closing a denial of service vulnerability, much as I
>> > dislike adding something systemd-specific to a non-systemd environment
>> > and adding a new dependency (lxc requires cgmanager on the host to
>> > run, I guess, any container that runs systemd).  However, I am posting
>> > this message because I don't fully understand the problem, and, most
>> > importantly, I am wondering if I have stumbled on an unintended
>> > consequence of this commit that might indicate other
>> > potential breakage.
>> 
>> I am surprised that your case worked but I don't think it amounts to an
>> unintended consequence.
>> 
>> > If this new lxc behavior is completely acceptable, then I apologize
>> > for consuming people's time with it and hope that this message will
>> > allow others experiencing the same problem to find an answer for it when
>> > they search the web.
>> 
>> I will let the lxc-developers judge.
>> 
>> I don't think you hit a case that was expected to work.  Furthermore
>
> fwiw indeed this was never expected to work.
>

As just creating the hierarchy before starting the container fixes this,
I agree this does appear to be just a documentation issue.
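
For anyone finding this thread by searching, here is a minimal sketch of
creating the hierarchy by hand on a host that runs neither systemd nor
cgmanager.  It assumes cgroup v1, a writable tmpfs at /sys/fs/cgroup, and
root in the initial cgroup namespace.  The DRY_RUN and MOUNTINFO variables
and the function names are made up for illustration, and the sketch is
untested against the configuration reported above.

```shell
#!/bin/sh
# Sketch: make sure the named cgroup v1 hierarchy "systemd" exists on the
# host before a systemd-based container is started.
# Assumptions (not from the thread): cgroup v1, /sys/fs/cgroup is a writable
# tmpfs, and this runs as root in the initial cgroup namespace.
# DRY_RUN and MOUNTINFO are hypothetical knobs used only for illustration.

HIER_NAME=systemd
HIER_DIR=/sys/fs/cgroup/$HIER_NAME
MOUNTINFO=${MOUNTINFO:-/proc/self/mountinfo}
DRY_RUN=${DRY_RUN:-1}   # default: only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Succeeds if a hierarchy with this name already shows up in the mount table.
hierarchy_mounted() {
    grep -q "name=$HIER_NAME" "$MOUNTINFO" 2>/dev/null
}

# Create the mount point and mount the named hierarchy with no controllers
# attached, unless it is already mounted.
ensure_hierarchy() {
    if hierarchy_mounted; then
        return 0
    fi
    run mkdir -p "$HIER_DIR" &&
    run mount -t cgroup -o "none,name=$HIER_NAME" cgroup "$HIER_DIR"
}
```

Calling ensure_hierarchy with DRY_RUN=0 before lxc-start would perform the
mount; with the default it only prints the two commands it would run.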

>> either your containers were overprivileged or they would not have been
>> able to create subdirectories in the cgroup hierarchy.  So I expect this
>> change transformed a subtle breakage (aka one you had not noticed yet)
>> into an explicit breakage.
>>
>> I am not subscribed to lxc-users so I don't know if anyone else has
>> replied to your post.  Cc's would have been better than Bcc's for
>> getting feedback in a situation like this.
>> 
>> Eric
>> 
>> 
>> > Adam Richter
>> >
>> >
>> > [1] Here is an example of failing to start one of these containers.
>> > $ sudo lxc-start --name ubuntu16.04_amd64 --foreground
>> > Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
>> > [!!!!!!] Failed to mount API filesystems, freezing.
>> > Freezing execution.
>> >
>> >
>> > [2] Here is the commit diff that triggers the new misbehavior.
>> > commit 726a4994b05ff5b6f83d64b5b43c3251217366ce
>> > Author: Eric W. Biederman <ebiederm at xmission.com>
>> > Date:   Fri Jul 15 06:36:44 2016 -0500
>> >
>> >     cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
>> >
>> >     Unprivileged users can't use hierarchies if they create them as they do not
>> >     have privilieges to the root directory.
>> >
>> >     Which means the only thing a hiearchy created by an unprivileged user
>> >     is good for is expanding the number of cgroup links in every css_set,
>> >     which is a DOS attack.
>> >
>> >     We could allow hierarchies to be created in namespaces in the initial
>> >     user namespace.  Unfortunately there is only a single namespace for
>> >     the names of heirarchies, so that is likely to create more confusion
>> >     than not.
>> >
>> >     So do the simple thing and restrict hiearchy creation to the initial
>> >     cgroup namespace.
>> >
>> >     Cc: stable at vger.kernel.org
>> >     Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
>> >     Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>> >     Signed-off-by: Tejun Heo <tj at kernel.org>
>> >
>> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> > index e75efa8..e0be49f 100644
>> > --- a/kernel/cgroup.c
>> > +++ b/kernel/cgroup.c
>> > @@ -2215,12 +2215,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>> >                 goto out_unlock;
>> >         }
>> >
>> > -       /*
>> > -        * We know this subsystem has not yet been bound.  Users in a non-init
>> > -        * user namespace may only mount hierarchies with no bound subsystems,
>> > -        * i.e. 'none,name=user1'
>> > -        */
>> > -       if (!opts.none && !capable(CAP_SYS_ADMIN)) {
>> > +       /* Hierarchies may only be created in the initial cgroup namespace. */
>> > +       if (ns != &init_cgroup_ns) {
>> >                 ret = -EPERM;
>> >                 goto out_unlock;
>> >         }