[lxc-devel] cgroup management daemon

Serge E. Hallyn serge at hallyn.com
Tue Nov 26 20:58:19 UTC 2013


Quoting Tim Hockin (thockin at google.com):
> On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn <serge at hallyn.com> wrote:
> > Quoting Tim Hockin (thockin at google.com):
...
> >> >   . A client (requestor 'r') can make cgroup requests over
> >> >     /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
> >> >     requirements for r are listed below.
> >> >   . The client request will pertain to an existing or new cgroup A.  r's
> >> >     privilege over the cgroup must be checked.  r is said to have
> >> >     privilege over A if A is owned by r's uid, or if A's owner is mapped
> >> >     into r's user namespace, and r is root in that user namespace.
> >>
> >> Problem with this definition.  Being owned-by is not the same as
> >> has-root-in.  Specifically, I may choose to give you root in your own
> >> namespace, but you sure as heck can not increase your own memory
> >> limit.
> >
> > 1. If you don't want me to change the value at all, then just don't map
> > A's owner into the namespace.  I'm uid 100000 which is root in my namespace,
> > but I only have privilege over other uids mapped into my namespace.
> 
> I think I understand this, but it is subtle.  Maybe some examples would help?

When you create a user namespace, at first it is empty, and you are 'nobody'
(-1).  Then magically some uids from the host, say 100000-101999, are mapped
into your namespace, to uids 0-1999.

Now assume you're uid 0 inside that namespace.  You have privilege over your
uids, 0-1999, which are 100000-101999 on the host.

If cgroup file A is owned by host uid 0, then the owner is not mapped into
the user namespace.  uid 0 inside the namespace only gets the world access
rights to that file.

If cgroup file A is owned by host uid 100100, then uid 0 in the
namespace has access to that file by virtue of being root, and uid 100
in the namespace (100100 on the host) has access to the file by virtue
of being the owner.
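
As a rough illustration of the three cases above, here is a minimal Python
sketch.  The names UID_MAP, host_to_ns_uid and access_basis are made up for
this example and are not part of any proposed API; the map uses the same
three columns as /proc/<pid>/uid_map, and group permissions are ignored for
simplicity.

```python
from typing import Optional

# Each entry is (first uid inside the namespace, first uid on the host, count),
# i.e. the three columns of /proc/<pid>/uid_map.
# Assumed example mapping: host uids 100000-101999 -> namespace uids 0-1999.
UID_MAP = [(0, 100000, 2000)]


def host_to_ns_uid(host_uid: int, uid_map=UID_MAP) -> Optional[int]:
    """Return the uid as seen inside the namespace, or None if it is unmapped."""
    for ns_start, host_start, count in uid_map:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def access_basis(requestor_ns_uid: int, owner_host_uid: int, uid_map=UID_MAP) -> str:
    """Say on what basis a uid inside the namespace gets access to a file.

    Hypothetical helper: returns "owner", "root" or "world".
    """
    owner_ns_uid = host_to_ns_uid(owner_host_uid, uid_map)
    if owner_ns_uid is None:
        # Owner is not mapped: even namespace root only gets world rights.
        return "world"
    if requestor_ns_uid == owner_ns_uid:
        return "owner"
    if requestor_ns_uid == 0:
        # Namespace root has privilege over every uid mapped into the namespace.
        return "root"
    return "world"


print(access_basis(0, 0))         # file owned by host uid 0
print(access_basis(0, 100100))    # namespace root, file owned by host uid 100100
print(access_basis(100, 100100))  # namespace uid 100 is host uid 100100
```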

> > 2. I've considered never allowing changes to your own cgroup.  So if you're
> > in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
> > b's.  OTOH, that isn't strictly necessary - if we did allow it, then you
> > could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
> > so I can't escape the memory limit you wanted.
> 
> This is different from what we do internally, but it's an interesting
> semantic.  I'm wary of how much we want to make this API about
> enforcement of policy vs simple enactment.  In other words, semantics
> that diverge from UNIX ownership might be more complicated to
> understand than they are worth.

The semantics I gave are exactly the user namespace semantics.  If you're
not using a user namespace then they simply do not apply, and you are back
to the strict UNIX ownership semantics that you want.  But allowing 'root' in
a user namespace to have privilege over uids, without having any privilege
outside its own namespace, must be honored for this to be usable by lxc.

Like I said, on the bright side, if you don't want to care about user
namespaces, then everything falls back to strict unix semantics - so if
you don't want to care, you don't have to care.
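
To make the fallback concrete, here is a small Python sketch of the
"privilege over cgroup A" rule quoted earlier in the thread.  It is only an
illustration under assumptions: the function names are invented here, uids
are host (global) uids, and the uid_map text is what one would expect to
read from /proc/PID(r)/uid_map.  With the identity map of the initial
namespace the check collapses to plain ownership, plus the usual host root.

```python
def parse_uid_map(text):
    """Parse uid_map text into (ns_start, host_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_ns_uid(host_uid, uid_map):
    """Translate a host uid into the namespace, or return None if unmapped."""
    for ns_start, host_start, count in uid_map:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def has_privilege_over_cgroup(r_host_uid, owner_host_uid, r_uid_map):
    """r has privilege over A if r owns A, or if r is root in its user
    namespace and A's owner is mapped into that namespace."""
    if r_host_uid == owner_host_uid:
        return True
    r_is_ns_root = to_ns_uid(r_host_uid, r_uid_map) == 0
    owner_mapped = to_ns_uid(owner_host_uid, r_uid_map) is not None
    return r_is_ns_root and owner_mapped


# Assumed example maps: a container with 2000 uids, and the initial namespace.
CONTAINER_MAP = parse_uid_map("0 100000 2000\n")
INITIAL_MAP = parse_uid_map("0 0 4294967295\n")

# Container root over a cgroup owned by a mapped uid: allowed.
print(has_privilege_over_cgroup(100000, 100100, CONTAINER_MAP))
# Container root over a cgroup owned by host uid 0 (the 100 MB case): denied.
print(has_privilege_over_cgroup(100000, 0, CONTAINER_MAP))
# No user namespace in play: ordinary users only control what they own.
print(has_privilege_over_cgroup(1000, 1001, INITIAL_MAP))
```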

> > 3. I've not considered having the daemon track resource limits - i.e. creating
> > a cgroup and saying "give it 100M swap, and if it asks, let it increase that
> > to 200M."  I'd prefer that be done incidentally through (1) and (2).  Do you
> > feel that would be insufficient?
> 
> I think this is a higher-level issue that should not be addressed here.
> 
> > Or maybe your question is something different and I'm missing it?
> 
> My point was that I, as machine admin, create a memory cgroup of 100
> MB for you and put you in it.   I also give you root-in-namespace.
> You must not be able to change 100 MB to 200 MB.  From your (1) you
> are saying that system UID 0 owns the cgroup and is NOT mapped into
> your namespace.  Therefore your definition holds.  I think I can buy
> that.
> 
> >> >   . The client request may pertain to a victim task v, which may be moved
> >> >     to a new cgroup.  In that case r's privilege over both the cgroup
> >> >     and v must be checked.  r is said to have privilege over v if v
> >> >     is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> >> >     and r is root in its userns.  Or if r and v have the same uid
> >> >     and v is mapped in r's pid namespace.
> >> >   . r's credentials will be taken from socket's peercred, ensuring that
> >> >     pid and uid are translated.
> >> >   . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> >> >     translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
> >> >     which is the global uid, and check /proc/PID(r)/uid_map to see whether
> >> >     UID is mapped there.
> >> >   . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> >> >     the kernel translate it for the reader.  Only 'move task v to cgroup
> >> >     A' will require a SCM_CREDENTIAL to be sent.
> >> >
> >> > Privilege requirements by action:
> >> >     * Requestor of an action (r) over a socket may only make
> >> >       changes to cgroups over which it has privilege.
> >> >     * Requestors may be limited to a certain #/depth of cgroups
> >> >       (to limit memory usage) - DEFER?
> >> >     * Cgroup hierarchy is responsible for resource limits
> >> >     * A requestor must either be uid 0 in its userns with victim mapped
> >> >       into its userns, or the same uid and in same/ancestor pidns as the
> >> >       victim
> >> >     * If r requests creation of cgroup '/x', /x will be interpreted
> >> >       as relative to r's cgroup.  r cannot make changes to cgroups not
> >> >       under its own current cgroup.
> >>
> >> Does this imply that r in a lower-level (farther from root) of the
> >> hierarchy can not make requests of higher levels of the hierarchy
> >> (closer to root), even though they have permissions as per the
> >> definition of privilege?
> >
> > Right.
> 
> Is this really a required semantic?  We have use cases where
> read-access is required to parent cgroups, which means this agent
> could never handle reads.  It's not clear that we have use cases for
> write-access to parents, though we have talked about eventfd - is that
> read or write access?  Does this daemon want to handle event fd?

Denying read access to parent cgroups is not strictly necessary to meet
any of my requirements.  Eventfd only requires an open read handle to
the file, so that should be ok.

So to support that, I guess I'd want to add a 'get-my-cgroup'
command with a controller argument, which returns the absolute
path.  Cgroups which start with a '/' are taken as absolute
cgroup paths, as opposed to the usual, relative-to-my-own.
It sounds like you also might want to just use '../' ?

I'd refuse write access for now altogether.  We can talk later, if
someone finds a need, about a way to support conditional write
access, but that's pretty much completely bypassing the hierarchical
constraints :)
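
A rough Python sketch of the path handling just described, under the
assumption that the requestor's own absolute cgroup (what 'get-my-cgroup'
would return) is already known.  The helper names resolve_cgroup, is_within
and may_access are hypothetical, and the ownership/privilege checks are left
out; this only shows the absolute vs. relative rule and the "reads anywhere,
writes only at or below my own cgroup" restriction.

```python
import posixpath


def resolve_cgroup(own_cgroup, requested):
    """Paths starting with '/' are absolute; anything else (including
    '../') is taken relative to the requestor's own cgroup."""
    if requested.startswith("/"):
        return posixpath.normpath(requested)
    return posixpath.normpath(posixpath.join(own_cgroup, requested))


def is_within(own_cgroup, target):
    """True if target is own_cgroup itself or one of its descendants."""
    own = posixpath.normpath(own_cgroup)
    return target == own or target.startswith(own.rstrip("/") + "/")


def may_access(own_cgroup, requested, write):
    """Reads may reach parent cgroups; writes outside the requestor's own
    subtree are refused."""
    target = resolve_cgroup(own_cgroup, requested)
    if write:
        return is_within(own_cgroup, target)
    return True


print(resolve_cgroup("/a/b", "c"))    # relative to my own cgroup
print(resolve_cgroup("/a/b", "../"))  # parent, via '../'
print(resolve_cgroup("/a/b", "/a"))   # absolute path
print(may_access("/a/b", "..", write=False))
print(may_access("/a/b", "..", write=True))
```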

-serge



