[lxc-devel] cgroup management daemon

Tue Nov 26 02:18:04 UTC 2013

Serge...

You have no idea how much I dread mentioning this (well, after
LinuxPlumbers, maybe you can) but...  You do realize that some of this
is EXACTLY what the systemd crowd was talking about there in NOLA back
then.  I sat in those sessions grinding my teeth and listening to
comments from some others around me about when systemd might subsume
bash or even vi or quake.

Somehow, you and others have tagged me as a "systemd expert" but I am
far from it and even you noted that Lennart and I were on the edge of a
physical discussion when I made some "off the cuff" remarks there about
systemd design during my talk.  I personally rank systemd in the same
category as NetworkMangler (err, NetworkManager) in its propensity for
committing inexplicable random acts of terrorism and changing its
behavior from release to release to release.  I'm not a fan and I'm not
an expert, but I have to be involved with it and watch the damned thing
like a trapped rat, like it or not.

Like it or not, we cannot go off on divergent designs.  As much as they
have delusions of taking over the Linux world, they are still going to
be a major factor and this sort of thing needs to be coordinated.  We
are going to need exactly what you are proposing whether we have systemd
in play or not.  IF we CAN kick it to the curb, when we need to, we
still need to know how to without tearing shit up and breaking shit that
thinks it's there.  Ideally, it shouldn't matter if systemd were in
play or not.

All I ask is that we not get so far off track that we have a major
architectural divergence here.  The risk is there.

Mike

On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote: 
> Hi,
> 
> As I've mentioned several times, I want to write a standalone cgroup
> management daemon.  Basic requirements are that it be a standalone
> program; that a single instance running on the host be usable from
> containers nested at any depth; that it not allow escaping one's
> assigned limits; that it not allow subjugating tasks which do not
> belong to you; and that, within your limits, you be able to parcel
> those limits to your tasks as you like.  
> 
> Additionally, Tejun has specified that we do not want users to be
> too closely tied to the cgroupfs implementation.  Therefore
> commands will be just a hair more general than specifying cgroupfs
> filenames and values.  I may go so far as to avoid specifying
> specific controllers, as AFAIK there should be no redundancy in
> features.  On the other hand, I don't want to get too general.
> So I'm basing the API loosely on the lmctfy command line API.
> 
> One of the driving goals is to enable nested lxc as simply and safely as
> possible.  If this project is a success, then a large chunk of code can
> be removed from lxc.  I'm considering this project a part of the larger
> lxc project, but given how central it is to systems management that
> doesn't mean that I'll consider anyone else's needs as less important
> than our own.
> 
> This document consists of two parts.  The first describes how I
> intend the daemon (cgmanager) to be structured and how it will
> enforce the safety requirements.  The second describes the commands 
> which clients will be able to send to the manager.  The list of
> controller keys which can be set is very incomplete at this point,
> serving mainly to show the approach I was thinking of taking.
> 
> Summary
> 
> Each 'host' (identified by a separate instance of the linux kernel) will
> have exactly one running daemon to manage control groups.  This daemon
> will answer cgroup management requests over a dbus socket, located at
> /sys/fs/cgroup/manager.  This socket can be bind-mounted into various
> containers, so that one daemon can support the whole system.
> 
> Programs will be able to make cgroup requests using dbus calls, or
> indirectly by linking against lmctfy which will be modified to use the
> dbus calls if available.
> 
> Outline:
>   . A single manager, cgmanager, is started on the host, very early
>     during boot.  It has very few dependencies, and requires only
>     /proc, /run, and /sys to be mounted, with /etc ro.  It will mount
>     the cgroup hierarchies in a private namespace and set defaults
>     (clone_children, use_hierarchy, sane_behavior, release_agent?) It
>     will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>   . A client (requestor 'r') can make cgroup requests over
>     /sys/fs/cgroup/manager using dbus calls.  Detailed privilege
>     requirements for r are listed below.
>   . The client request will pertain to an existing or new cgroup A.  r's
>     privilege over the cgroup must be checked.  r is said to have
>     privilege over A if A is owned by r's uid, or if A's owner is mapped
>     into r's user namespace, and r is root in that user namespace.
>   . The client request may pertain to a victim task v, which may be moved
>     to a new cgroup.  In that case r's privilege over both the cgroup
>     and v must be checked.  r is said to have privilege over v if v
>     is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>     and r is root in its userns.  Or if r and v have the same uid
>     and v is mapped in r's pid namespace.
>   . r's credentials will be taken from socket's peercred, ensuring that
>     pid and uid are translated.
>   . r passes PID(v) as a SCM_CREDENTIALS, so that cgmanager receives the
>     translated global pid.  It will then read UID(v) from /proc/PID(v)/status,
>     which is the global uid, and check /proc/PID(r)/uid_map to see whether
>     UID is mapped there.
>   . dbus-send can be enhanced to send a pid as SCM_CREDENTIALS to have
>     the kernel translate it for the reader.  Only 'move task v to cgroup
>     A' will require a SCM_CREDENTIALS to be sent.
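
The credential lookup in the outline above can be sketched roughly as
follows.  This is only a minimal Python sketch under stated assumptions,
not cgmanager code: peer_credentials() and status_uid() are hypothetical
helper names, and it assumes a Linux host where SO_PEERCRED returns a
struct ucred laid out as pid_t, uid_t, gid_t.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid;
_UCRED = struct.Struct("iII")


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket.

    The kernel reports these values as seen from the caller's namespaces.
    """
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, _UCRED.size)
    return _UCRED.unpack(raw)


def status_uid(status_text):
    """Extract the real uid from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            # Fields after the label: real, effective, saved, filesystem
            return int(line.split()[1])
    raise ValueError("no Uid: line found")


if __name__ == "__main__":
    # Both ends of a socketpair belong to this process, so the peer is us.
    a, b = socket.socketpair()
    pid, uid, gid = peer_credentials(a)
    print(pid == os.getpid(), uid == os.getuid())
    a.close()
    b.close()
```

A daemon along these lines would call peer_credentials() on each
accepted connection to learn PID(r) and UID(r), and feed the contents of
/proc/PID(v)/status to status_uid() to learn UID(v).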
> 
> Privilege requirements by action:
>     * Requestor of an action (r) over a socket may only make
>       changes to cgroups over which it has privilege.
>     * Requestors may be limited to a certain #/depth of cgroups
>       (to limit memory usage) - DEFER?
>     * Cgroup hierarchy is responsible for resource limits
>     * A requestor must either be uid 0 in its userns with victim mapped
>       into its userns, or the same uid and in same/ancestor pidns as the
>       victim
>     * If r requests creation of cgroup '/x', /x will be interpreted
>       as relative to r's cgroup.  r cannot make changes to cgroups not
>       under its own current cgroup.
>     * If r is not in the initial user_ns, then it may not change settings
>       in its own cgroup, only descendants.  (Not strictly necessary -
>       we could require the use of extra cgroups when wanted, as lxc does
>       currently)
>     * If r requests creation of cgroup '/x', it must have write access
>       to its own cgroup  (not strictly necessary)
>     * If r requests chown of cgroup /x to uid Y, Y is passed in a
>       ucred over the unix socket, and therefore translated to init
>       userns.
>     * if r requests setting a limit under /x, then
>       . either r must be root in its own userns, and UID(/x) be mapped
>         into its userns, or else UID(r) == UID(/x)
>       . /x must not be / (not strictly necessary, all users know to
>         ensure an extra cgroup layer above '/')
>       . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>         which won't be satisfied.  Therefore we'll need to do privilege
>         checks ourselves, then perform the write as the host root user.
>         (see devices.allow/deny).  Further we need to support older kernels
>         which don't support setns for pid.
>     * If r requests action on victim V, it passes V's pid in a ucred,
>       so that gets translated.
>       Daemon will verify that V's uid is mapped into r's userns.  Since
>       r is either root or the same uid as V, it is allowed to classify.
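
The uid-mapping rule that runs through the requirements above can be
sketched as a small predicate.  This is a hedged Python sketch, not the
daemon's code: parse_uid_map(), host_to_ns_uid() and has_privilege() are
hypothetical names.  It assumes all uids are host (init userns) uids as
the daemon sees them, and that the uid_map text was read by a process in
the initial user namespace, so the second column holds host uids.

```python
def parse_uid_map(text):
    """Parse /proc/<pid>/uid_map text into (ns_start, host_start, count) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ns_start, host_start, count = (int(f) for f in fields)
            ranges.append((ns_start, host_start, count))
    return ranges


def host_to_ns_uid(ranges, host_uid):
    """Translate a host uid into the namespace; None if it is not mapped."""
    for ns_start, host_start, count in ranges:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def has_privilege(r_uid, r_uid_map, target_uid):
    """Check r against a target uid (a cgroup owner or a victim task's uid).

    Allowed if r and the target share a uid, or if r is uid 0 inside its
    own user namespace and the target uid is mapped into that namespace.
    """
    if r_uid == target_uid:
        return True
    r_is_ns_root = host_to_ns_uid(r_uid_map, r_uid) == 0
    target_mapped = host_to_ns_uid(r_uid_map, target_uid) is not None
    return r_is_ns_root and target_mapped


if __name__ == "__main__":
    # Example map: container uids 0..65535 backed by host uids 100000..165535
    uid_map = parse_uid_map("         0     100000      65536\n")
    print(has_privilege(100000, uid_map, 100500))
    print(has_privilege(100000, uid_map, 0))
```

The pid namespace half of the victim check is not covered here; the
requirements rely on the kernel translating the pid sent in the ucred.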
> 
> The above addresses
>     * creating cgroups
>     * chowning cgroups
>     * setting cgroup limits
>     * moving tasks into cgroups
>   . but does not address a 'cgexec <group> -- command' type of behavior.
>     * To handle that (specifically for upstart), recommend that r do:
>       pid = fork();
>       if (!pid) {
>         /* child: move itself into the cgroup, then exec the command */
>         request_reclassify(cgroup, getpid());
>         do_execve();
>       }
>   . alternatively, the daemon could, if kernel is new enough, setns to
>     the requestor's namespaces to execute a command in a new cgroup.
>     The new command would be daemonized to that pid namespace's pid 1.
> 
> Types of requests:
>   * r requests creating cgroup A'/A
>     . lmctfy/cli/commands/create.cc
>     . Verify that UID(r) mapped to 0 in r's userns
>     . R=cgroup_of(r)
>     . Verify that UID(R) is mapped into r's userns
>     . Create R/A'/A
>     . chown R/A'/A to UID(r)
>   * r requests to move task x to cgroup A.
>     . lmctfy/cli/commands/enter.cc
>     . r must send PID(x) as ancillary message
>     . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>       that userns
>       (is it safe to allow if UID(x) == UID(r))?
>     . R=cgroup_of(r)
>     . Verify that R/A is owned by UID(r) or UID(x)?  (not sure that's needed)
>     . echo PID(x) >> /R/A/tasks
>   * r requests chown of cgroup A to uid X
>     . X is passed in ancillary message
>       * ensures it is valid in r's userns
>       * maps the userid to host for us
>     . Verify that UID(r) mapped to 0 in r's userns
>     . R=cgroup_of(r)
>     . Chown R/A to X
>   * r requests cgroup A's 'property=value'
>     . Verify that either
>       * A != ''
>       * UID(r) == 0 on host
>       In other words, r in a userns may not set root cgroup settings.
>     . Verify that UID(r) mapped to 0 in r's userns
>     . R=cgroup_of(r)
>     . Set property=value for R/A
>       * Expect kernel to guarantee hierarchical constraints
>   * r requests deletion of cgroup A
>     . lmctfy/cli/commands/destroy.cc (without -f)
>     . same requirements as setting 'property=value'
>   * r requests purge of cgroup A
>     . lmctfy/cli/commands/destroy.cc (with -f)
>     . same requirements as setting 'property=value'
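
The "R=cgroup_of(r), then operate on R/A" step shared by the requests
above might look something like this.  It is a hedged Python sketch with
hypothetical names (resolve_cgroup(), may_change_settings()), working on
path strings only; rejecting ".." escapes is my assumption about how
"r cannot make changes to cgroups not under its own current cgroup"
would be enforced.

```python
import posixpath


def resolve_cgroup(requestor_cgroup, requested):
    """Interpret `requested` relative to R = cgroup_of(r).

    A leading '/' in the request is treated as relative to R, and any
    result that lands outside R is refused.
    """
    base = posixpath.normpath("/" + requestor_cgroup.strip("/"))
    target = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    inside = target == base or target.startswith(base.rstrip("/") + "/")
    if not inside:
        raise PermissionError("request escapes requestor's cgroup: %r" % requested)
    return target


def may_change_settings(requested, r_host_uid):
    """Either A != '' or UID(r) == 0 on the host.

    In other words, only host root may change settings on R itself.
    """
    names_own_cgroup = posixpath.normpath("/" + requested.strip("/")) == "/"
    return (not names_own_cgroup) or r_host_uid == 0


if __name__ == "__main__":
    print(resolve_cgroup("/lxc/c1", "/x"))
    print(may_change_settings("", 100000))
```

Normalizing before the prefix check is what keeps a request such as
"x/../../other" from slipping past a naive string comparison.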
> 
> Long-term we will want the cgroup manager to become more intelligent -
> to place its own limits on clients, to address cpu and device hotplug,
> etc.  Since we will not be doing that in the first prototype, the daemon
> will not keep any state about the clients.
> 
> Client DBus Message API
> 
> <name>: a-zA-Z0-9
> <name>: "a-zA-Z0-9 "
> <controllerlist>: <controller1>[:controllerlist]
> <valueentry>: key:value
> <valueentry>: frozen
> <valueentry>: thawed
> <values>: valueentry[:values]
> keys:
> 	{memory,swap}.{limit,soft_limit}
> 	cpus_allowed  # set of allowed cpus
> 	cpus_fraction # % of allowed cpus
> 	cpus_number   # number of allowed cpus
> 	cpu_share_percent   # percent of cpushare
> 	devices_whitelist
> 	devices_blacklist
> 	net_prio_index
> 	net_prio_interface_map
> 	net_classid
> 	hugetlb_limit
> 	blkio_weight
> 	blkio_weight_device
> 	blkio_throttle_{read,write}
> readkeys:
> 	devices_list
> 	{memory,swap}.{failcnt,max_use,limitnuma_stat}
> 	hugetlb_max_usage
> 	hugetlb_usage
> 	hugetlb_failcnt
> 	cpuacct_stat
> 	<etc>
> Commands:
> 	ListControllers
> 	Create <name> <controllerlist> <values>
> 	Setvalue <name> <values>
> 	Getvalue <name> <readkeys>
> 	ListChildren <name>
> 	ListTasks <name>
> 	ListControllers <name>
> 	Chown <name> <uid>
> 	Chown <name> <uid>:<gid>
> 	Move <pid> <name>  [[ pid is sent as a SCM_CREDENTIALS ]]
> 	Delete <name>
> 	Delete-force <name>
> 	Kill <name>
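
One possible reading of the <values> grammar above, as a hedged Python
sketch: parse_values() is a hypothetical helper, and it assumes ':'
separates both key from value and one entry from the next, with the bare
words "frozen" and "thawed" standing alone.  Under that reading a value
that itself contains ':' could not be expressed, which the proposal does
not address.

```python
STATE_ENTRIES = {"frozen", "thawed"}


def parse_values(values):
    """Split a <values> string into (key, value) pairs.

    Bare state entries ("frozen", "thawed") are returned with value None.
    """
    tokens = values.split(":")
    entries = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in STATE_ENTRIES:
            entries.append((token, None))
            i += 1
        elif token and i + 1 < len(tokens):
            # An ordinary key consumes the following token as its value.
            entries.append((token, tokens[i + 1]))
            i += 2
        else:
            raise ValueError("malformed <values> string: %r" % values)
    return entries


if __name__ == "__main__":
    print(parse_values("memory.limit:1G:frozen:cpus_number:2"))
```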
> 

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
