[lxc-devel] cgroup V2 and LXC

Tue Feb 23 10:23:48 UTC 2016

On Mon, Feb 15, 2016 at 07:48:05PM +0000, Serge Hallyn wrote:
> Quoting Christian Brauner (christian.brauner at mailbox.org):
> > On Wed, Feb 10, 2016 at 05:45:48PM +0000, Serge Hallyn wrote:
> > > Quoting Christian Brauner (christian.brauner at mailbox.org):
> > > > On Mon, Feb 01, 2016 at 04:56:08AM +0000, Serge Hallyn wrote:
> > > > > Quoting Kevin Wilson (wkevils at gmail.com):
> > > > > > Hi, LXC developers,
> > > > > > 
> > > > > > The latest kernel release (4.4) includes initial support for cgroup v2
> > > > > > with 2 controllers (memory and io). Also it seems that the PIDs
> > > > > > controller works in cgroup v2, but I do not know if it is officially
> > > > > > supported in v2.
> > > > > > 
> > > > > > Is there any intention to replace the existing cgroup v1 usage in LXC
> > > > > > with cgroup v2, or at least to enable working with both of them?
> > > > > > 
> > > > > > Regards,
> > > > > > Kevin
> > > > > 
> > > > > Replace, no, support, yes.  I've added support for it to cgmanager, and have
> > > > > used lxc with the unified hierarchy through cgmanager.  Without cgmanager
> > > > > it will currently definitely not work.  It's worth discussing how we should
> > > > > handle it - and how init wants us to handle it.   With cgmanager I actually
> > > > > built in the support so that you could treat it as a legacy hierarchy, and
> > > > > upstart was happy with that since it used cgmanager.  Systemd will not be
> > > > > happy with that, and it will be a problem.  The only exception to the "no
> > > > > tasks in a non-leaf node" rule is for the / cgroup.  So lxc would need to
> > > > > place init in say /lxc/c1/.leaf, and systemd would have to accept that
> > > > > /lxc/c1 is the container's cgroup.  A few possibilities:
> > > > > 
> > > > > 1. maybe if we place systemd in /lxc/c1/init.scope it will be happy
> > > > Well, here is how I thought it could go (sticking to systemd specifics here):
> > > >         - create a slice for all lxc "lxc.slice" (similar to "machine.slice" of
> > > >           systemd-nspawn backed containers)
> > > >         - "lxc.slice" contains a scope for each container (e.g. "c1.scope")
> > > >         - "c1.scope" contains an "init.scope"
> > > >         - "init.scope" only contains the PID of "/sbin/init" as seen from the
> > > >           host (obviously)
> > > 
> > > So if we are creating container c1, are you talking about
> > > 
> > > /lxc/c1/lxc.slice/c1.scope/init.scope
> > > 
> > > or are you talking about a host-global
> > > 
> > > /lxc.slice
> > Yes, you have lxc.slice then you have all your machines under this. This is what
> > systemd-nspawn does if I'm not mistaken.
> > > with container-specific
> > > 
> > > /lxc.slice/c1.scope
> > > 
> > > per container?
> > > 
> > > ?
> > Yes.
> 
> This doesn't seem to address the problem.  Where we put these on the host doesn't
> matter.  The question is, we create container c1, in which cgroup do we put the
> init process?
> 
> Assume we create /lxc/c1 on the host as we do now.  This becomes / in the container's
> cgroup namespace.  Where do we put init?  If we put it into (namespaced) /, then
> systemd will not be able to create any cgroups.  So we should probably put it into
> /init.scope.  This is fine with cgroup namespaces since it can see it is in '/init.scope'
> (or '/' if an unprivileged container couldn't create a cgroup for some controllers).
> But if we do not have cgroup namespaces, systemd sees it is running in perhaps
> /user.slice/user-1000.slice/session-c6.scope/lxc/lxdvm1/lxc/c1/init.scope.  In that
> case we want systemd to recognize init.scope and create services under
> /user.slice/user-1000.slice/session-c6.scope/lxc/lxdvm1/lxc/c1.
> 
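To make the case without cgroup namespaces concrete, here is a minimal sketch of the recognition step being asked for. It is not systemd code; the function name container_cgroup_root is hypothetical, and the only convention assumed is the one discussed above: if init finds itself in a leaf called "init.scope", the parent cgroup is treated as the root under which services get created.

```python
from pathlib import PurePosixPath

# Leaf name taken from the discussion above; treating it as a marker is an
# assumption of this sketch, not documented systemd behaviour.
INIT_LEAF = "init.scope"


def container_cgroup_root(own_cgroup: str) -> str:
    """Return the cgroup that init should treat as its root.

    own_cgroup is the path init sees for itself, e.g. as read from
    /proc/self/cgroup. If the last component is the init leaf, the
    parent is the root; otherwise the path itself is the root.
    """
    path = PurePosixPath(own_cgroup)
    if path.name == INIT_LEAF:
        return str(path.parent)
    return str(path)
```

With cgroup namespaces the input is "/init.scope" (or "/") and the result is "/". Without them the same rule maps the long host path ending in .../lxc/c1/init.scope to .../lxc/c1.
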
> > > >         - All other processes are put in another slice "c1-something.slice"
> > > 
> > > Which other processes?
> > Well, all processes systemd starts are put in either system.slice or
> > user.slice. All other things we start in the container (e.g. vim) are
> > put in a session.slice (e.g. session-0.slice, session-1000.slice).
> 
> wc -l /sys/fs/cgroup/memory/tasks
> 548
This is output from a legacy cgroup. (The tasks file is removed in cgroup
unified hierarchy, no?) I was talking about unified cgroups.
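
For comparing such counts across both setups, a small sketch follows. It assumes that a legacy cgroup directory has a "tasks" file (one thread ID per line) while a unified one only has "cgroup.procs" (one process ID per line). The helper name count_cgroup_members is made up for illustration.

```python
from pathlib import Path


def count_cgroup_members(cgroup_dir):
    """Return (file_used, number_of_ids) for a cgroup directory.

    Uses the legacy "tasks" file when it exists (counts threads),
    otherwise falls back to "cgroup.procs" (counts processes).
    """
    base = Path(cgroup_dir)
    for name in ("tasks", "cgroup.procs"):
        candidate = base / name
        if candidate.is_file():
            # One ID per line; split() ignores the trailing newline.
            return name, len(candidate.read_text().split())
    raise FileNotFoundError(f"no tasks or cgroup.procs file in {base}")
```

Note that the two numbers are not directly comparable, since "tasks" counts threads and "cgroup.procs" counts processes.
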

A typical layout for a container BB that runs a unified cgroup system inside, on
a host that also runs a unified cgroup system with systemd-nspawn:

/sys/fs/cgroup/machine.slice/:
        - non-leaf node --> cgroup.procs empty

/sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/:
        - non-leaf node --> cgroup.procs empty

The following are on the same level: (/sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/)

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/init.scope/:
        - leaf node --> cgroup.procs contains PID of init

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/system.slice/:
        - non-leaf node --> cgroup.procs empty
        - contains leaf nodes for system setup stuff (journald, logind etc.)

- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/user.slice/user-0.slice/session-c1.scope
and
- /sys/fs/cgroup/machine.slice/machine-BB\x2dtree.scope/user.slice/user-0.slice/user@0.service:
        - filled with leaf-nodes for e.g. processes started by the user
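
A layout like this can be checked mechanically. The sketch below walks a cgroup tree and lists every non-leaf node whose cgroup.procs is not empty. It is a simplification: it treats "has child cgroups" as "non-leaf" and exempts only the top of the tree, whereas the kernel, as far as I know, only enforces the rule where controllers are enabled for the children. The function names are made up for illustration.

```python
import os


def classify_cgroups(root):
    """Walk a cgroup tree; return {relative_path: (is_leaf, pid_count)}."""
    result = {}
    for dirpath, dirnames, _filenames in os.walk(root):
        pids = 0
        procs = os.path.join(dirpath, "cgroup.procs")
        if os.path.isfile(procs):
            with open(procs) as fh:
                pids = len(fh.read().split())
        # A node with no child directories is a leaf.
        result[os.path.relpath(dirpath, root)] = (not dirnames, pids)
    return result


def non_leaf_with_procs(root):
    """Non-leaf cgroups that still hold processes (the root is exempt)."""
    return sorted(
        rel
        for rel, (is_leaf, pids) in classify_cgroups(root).items()
        if not is_leaf and pids and rel != "."
    )
```

Run against /sys/fs/cgroup on a unified host, an empty list from non_leaf_with_procs() would match the layout above: only init.scope and the other leaves contain PIDs.
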

> 
> > > AFAIK all other processes will be created by systemd.  The question is what
> > > it will do.  If we put systemd in /lxc.slice/c1.scope/init.scope, will it take that
> > > as its cgroup root and try to create and move itself into
> > > /lxc.slice/c1.scope/init.scope ?  If so it will fail since it cannot create a
> > > cgroup while it is in it.
> > I don't think so but I need to test that again. Time to boot unified.
> > 
> > > 
> > > So I think I've convinced myself that we need to collaborate with systemd
> > > on this.  Perhaps we can agree with it on a default cgroup in which it should
> > > be started to tell it "this is the leaf cgroup for your init".  So if it sees
> > > it is in /a/b/c/.cg_leaf, then it will know that /a/b/c is its root.
> > I thought the same, which is why I started to read some of the code.
> > fwiw, systemd-nspawn already works with the unified cgroup hierarchy and I think
> > nesting works as well. But I'm not completely sure how nspawn handles nesting.
> 
> Looks like it puts systemd into '/supervisor' and the container into '/payload'?
> (nspawn-cgroup.c)
I don't think so. This seems to be a special case when systemd-nspawn is run
from a service unit. Otherwise the layout seems to be as I sketched above.