[lxc-users] Containers on linux-4.8-rc1 sometimes(?) requiring "cgmanager -m name=systemd" (bisected, but is it a bug?)

Adam Richter adamrichter4 at gmail.com
Wed Sep 14 00:54:42 UTC 2016


Hello, Eric.

Thank you for your prompt response to my posting.

If you think that the new lxc behavior is acceptable, I am OK with it
too. I just wanted to let you know because I thought that there was
perhaps a ~30% chance that you might see it as indicating a more
consequential problem.

I emailed lxc-users instead of lxc-devel to see if I could determine
that this was not a bug without needing to escalate to lxc-devel.
Also, to err on the side of minimizing annoyance, I bcc'ed you and
Tejun instead of cc'ing you because I did not want to effectively
subscribe you, involuntarily, to what could become an ongoing thread.
However, based on your expressed preference, I will cc you if I
respond further to this thread on lxc-users, unless you request
otherwise.

Perhaps in the future, if I look into control groups and containers
more, I might investigate your alternative idea of moving the names of
hierarchies into a namespace.  That might let a non-systemd host run
systemd guests without doing anything specific to systemd.  It would
make the real host environment less special, which might slightly
increase the set of configurations that can be tested with nested
containers instead of VMs, and it would lower the barrier for other
init systems to do whatever systemd does that currently needs that bit
of host configuration.
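
For anyone who finds this thread while that host-side step is still
needed, here is a minimal sketch of how one might check whether the
host already has a cgroup v1 hierarchy named "systemd" mounted before
starting a systemd container.  The function name has_named_hierarchy
and its optional second argument are my own invention for
illustration; the only assumption about the system is the usual
/proc/mounts layout (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# has_named_hierarchy NAME [MOUNTS_FILE]
# Hypothetical helper: succeeds (exit 0) if MOUNTS_FILE, /proc/mounts
# by default, lists a cgroup v1 mount whose options include name=NAME.
has_named_hierarchy() {
    name=$1
    mounts=${2:-/proc/mounts}
    # Fields: device mountpoint fstype options dump pass
    awk -v want="name=$name" '
        $3 == "cgroup" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == want) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$mounts"
}
```

On the host, "has_named_hierarchy systemd || echo missing" would then
report whether something (cgmanager in my case) still has to create
the hierarchy before lxc-start is run.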

Adam

On Tue, Sep 13, 2016 at 8:33 AM, Eric W. Biederman
<ebiederm at xmission.com> wrote:
> Adam Richter <adamrichter4 at gmail.com> writes:
>
>> On Linux 4.8-rc1 through 4.8-rc6 (latest rc), lxc fails to start
>> Ubuntu 16.04 and CentOS 7 containers [1], unless I first run
>> "cgmanager -m name=systemd &" on the host, which, unlike the
>> containers, was not running systemd or cgmanager.
>
> Yes, that appears correct.  Given the current flat namespace of
> hierarchies you fundamentally must coordinate with the host if you want
> to use a new hierarchy.  So running cgmanager on the host seems like
> the minimum way to do that.
>
> If we truly need something more (which does not appear to be the case
> here) the names of hierarchies need to be moved into a namespace.
>
>> Git bisect revealed that this behavior began with a commit entitled
>> "cgroupns: Only allow creation of hierarchies in the initial cgroup
>> namespace" [2], which appears to be an attempt to protect against a
>> possible denial of service attack.  Reverting the commit restores
>> successful container startup without the need to run that cgmanager
>> process.  [Eric and Tejun, I have bcc'ed you so you can be aware of
>> this discussion thread, as you apparently respectively wrote and
>> approved the commit.]
>
> As far as I can tell you were getting lucky and not having problems
> before.
>
>> Running that cgmanager invocation is pretty simple, and seems to me to
>> be well worth closing a denial of service vulnerability, much as I
>> dislike adding something systemd-specific to a non-systemd environment
>> and adding a new dependency (lxc requires cgmanager on the host to
>> run, I guess, any container that runs systemd).  However, I am posting
>> this message because I don't fully understand the problem, and, most
>> importantly, I am wondering if I have stumbled on an unintended
>> consequence of this commit that might indicate other potential
>> breakage.
>
> I am surprised that your case worked but I don't think it amounts to an
> unintended consequence.
>
>> If this new lxc behavior is completely acceptable, then I apologize
>> for consuming people's time with it and hope that this message will
>> help others experiencing the same problem find an answer for it when
>> they search the web.
>
> I will let the lxc-developers judge.
>
> I don't think you hit a case that was expected to work.  Furthermore
> either your containers were overprivileged or they would not have been
> able to create subdirectories in the cgroup hierarchy.  So I expect this
> change transformed a subtle breakage (aka one you had not noticed yet)
> into an explicit breakage.
>
> I am not subscribed to lxc-users so I don't know if anyone else has
> replied to your post.  Cc's would have been better than Bcc's for
> getting feedback in a situation like this.
>
> Eric
>
>
>> Adam Richter
>>
>>
>> [1] Here is an example of failing to start one of these containers.
>> $ sudo lxc-start --name ubuntu16.04_amd64 --foreground
>> Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
>> [!!!!!!] Failed to mount API filesystems, freezing.
>> Freezing execution.
>>
>>
>> [2] Here is the commit diff that triggers the new misbehavior.
>> commit 726a4994b05ff5b6f83d64b5b43c3251217366ce
>> Author: Eric W. Biederman <ebiederm at xmission.com>
>> Date:   Fri Jul 15 06:36:44 2016 -0500
>>
>>     cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
>>
>>     Unprivileged users can't use hierarchies if they create them as they do not
>>     have privilieges to the root directory.
>>
>>     Which means the only thing a hiearchy created by an unprivileged user
>>     is good for is expanding the number of cgroup links in every css_set,
>>     which is a DOS attack.
>>
>>     We could allow hierarchies to be created in namespaces in the initial
>>     user namespace.  Unfortunately there is only a single namespace for
>>     the names of heirarchies, so that is likely to create more confusion
>>     than not.
>>
>>     So do the simple thing and restrict hiearchy creation to the initial
>>     cgroup namespace.
>>
>>     Cc: stable at vger.kernel.org
>>     Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
>>     Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
>>     Signed-off-by: Tejun Heo <tj at kernel.org>
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index e75efa8..e0be49f 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2215,12 +2215,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>                 goto out_unlock;
>>         }
>>
>> -       /*
>> -        * We know this subsystem has not yet been bound.  Users in a non-init
>> -        * user namespace may only mount hierarchies with no bound subsystems,
>> -        * i.e. 'none,name=user1'
>> -        */
>> -       if (!opts.none && !capable(CAP_SYS_ADMIN)) {
>> +       /* Hierarchies may only be created in the initial cgroup namespace. */
>> +       if (ns != &init_cgroup_ns) {
>>                 ret = -EPERM;
>>                 goto out_unlock;
>>         }

