[lxc-devel] [PATCH RFC] Accommodate stricter devices cgroup rules

Mon Jul 8 17:16:23 UTC 2013

Quoting Serge Hallyn (serge.hallyn at ubuntu.com):
> 3.10 kernel comes with proper hierarchical enforcement of devices
> cgroup.  To keep that code somewhat sane, certain things are not
> allowed.  Switching from default-allow to default-deny and vice versa
> are not allowed when there are children cgroups.  (This *could* be
> simplified in the kernel by checking that all child cgroups are
> unpopulated, but that has not yet been done and may be rejected)
> 
> The mountcgroup hook causes lxc-start to break with 3.10 kernels, because
> you cannot write 'a' to devices.deny once you have a child cgroup.  With
> this patch, (a) lxcpath is passed to hooks, (b) the cgroup mount hook sets
> the container's devices cgroup, and (c) setup_cgroup() during lxc startup
> ignores failures to write to devices subsystem if we are already in a
> child of the container's new cgroup.
> 
> ((a) is not really related to this bug, but is definitely needed.
> The followup work of making the other hooks use the passed-in lxcpath
> is still to be done)

Note, one alternative I'm considering is to have lxc always create
and enter into /sys/fs/cgroup/devices/lxc/$c/$c.real right before
starting init.  A big downside to this would be that people who
have devices and blkio or memory controllers composed in one hierarchy
would get a performance hit from the extra directory level.
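
To make the extra level concrete, here is a rough sketch in Python
(not lxc's C, and not the patch itself).  The helper names are made
up for illustration; the only things assumed from the kernel are the
usual v1 mount point and the 'tasks' file.  The root is a parameter
so this can be pointed at a scratch directory.

```python
import os

def real_cgroup_path(name, root="/sys/fs/cgroup/devices"):
    """Return <root>/lxc/<name>/<name>.real (hypothetical helper)."""
    return os.path.join(root, "lxc", name, name + ".real")

def enter_real_cgroup(name, pid, root="/sys/fs/cgroup/devices"):
    """Create the $c.real level and move pid into it.

    On a real cgroup v1 mount, writing a pid to 'tasks' moves that
    task into the cgroup.  Nothing here is lxc API.
    """
    path = real_cgroup_path(name, root)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
    return path
```

Running this right before init starts would leave /lxc/$c as a pure
parent with the container's tasks one level down, which is where the
extra directory level mentioned above comes from.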

The other option is to stop having the mountcgroups hook do the
$c.real thing.  The reason for doing it is that otherwise the admin
in a container which is in /lxc/s2 can re-add any whitelist entries
which are allowed to /lxc.  Giving /lxc/s2 the same restrictions as
/lxc/s2/s2.real and restricting the container admin to
/lxc/s2/s2.real means the admin can only add whitelist entries
which are already in /lxc/s2.  However:

	1. if the admin is not otherwise sufficiently contained,
	   he can find other ways to circumvent the restriction
	2. we can use apparmor (/selinux) to prevent the container
	   from writing to /lxc/s2/devices.*
	3. we can use user namespaces in place of (2)
	4. we don't pretend containers are secure at the moment,
	   so we could simply concede that this particular
	   safeguard is too onerous for the protection it provides.
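
For anyone who wants the $c.real reasoning spelled out, here is a toy
model in Python.  It is a simplification and an assumption on my
part, not kernel code: entries are compared as exact strings, whereas
the real devices cgroup matches on type, major:minor and access with
wildcards.  The class name is hypothetical.

```python
class DevCgroup:
    """Toy devices cgroup: a child may only allow what its parent allows."""

    def __init__(self, name, parent=None, allowed=()):
        self.name = name
        self.parent = parent
        self.allowed = set()
        for entry in allowed:
            self.allow(entry)

    def allow(self, entry):
        # Hierarchical rule: refuse anything the parent does not allow.
        if self.parent is not None and entry not in self.parent.allowed:
            raise PermissionError(
                "%s: %r is not allowed in %s"
                % (self.name, entry, self.parent.name))
        self.allowed.add(entry)


# /lxc allows /dev/null (c 1:3) and a disk (b 8:0); the container
# should only ever get /dev/null.
lxc = DevCgroup("/lxc", allowed=["c 1:3 rwm", "b 8:0 rwm"])
s2 = DevCgroup("/lxc/s2", parent=lxc, allowed=["c 1:3 rwm"])
s2_real = DevCgroup("/lxc/s2/s2.real", parent=s2, allowed=["c 1:3 rwm"])
```

In this model an admin sitting in /lxc/s2/s2.real cannot add
"b 8:0 rwm", because /lxc/s2 does not have it, while an admin sitting
directly in /lxc/s2 could, because /lxc does.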

The patch I sent isn't too bad yet.  But it will probably prevent
nested containers inside nested containers, and other issues may
come up to further complicate things.

What do others think?

-serge