[lxc-devel] cgroup management daemon
Marian Marinov
mm at yuhu.biz
Tue Nov 26 01:35:22 UTC 2013
On 11/26/2013 02:11 AM, Stéphane Graber wrote:
> On Tue, Nov 26, 2013 at 02:03:16AM +0200, Marian Marinov wrote:
>> On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
>>> Hi,
>>>
>>> As I've mentioned several times, I want to write a standalone cgroup
>>> management daemon. Basic requirements are that it be a standalone
>>> program; that a single instance running on the host be usable from
>>> containers nested at any depth; that it not allow escaping ones
>>> assigned limits; that it not allow subjugating tasks which do not
>>> belong to you; and that, within your limits, you be able to parcel
>>> those limits to your tasks as you like.
>>>
>>> Additionally, Tejun has specified that we do not want users to be
>>> too closely tied to the cgroupfs implementation. Therefore
>>> commands will be just a hair more general than specifying cgroupfs
>>> filenames and values. I may go so far as to avoid specifying
>>> specific controllers, as AFAIK there should be no redundancy in
>>> features. On the other hand, I don't want to get too general.
>>> So I'm basing the API loosely on the lmctfy command line API.
>>>
>>> One of the driving goals is to enable nested lxc as simply and safely as
>>> possible. If this project is a success, then a large chunk of code can
>>> be removed from lxc. I'm considering this project a part of the larger
>>> lxc project, but given how central it is to systems management that
>>> doesn't mean that I'll consider anyone else's needs as less important
>>> than our own.
>>>
>>> This document consists of two parts. The first describes how I
>>> intend the daemon (cgmanager) to be structured and how it will
>>> enforce the safety requirements. The second describes the commands
>>> which clients will be able to send to the manager. The list of
>>> controller keys which can be set is very incomplete at this point,
>>> serving mainly to show the approach I was thinking of taking.
>>>
>>> Summary
>>>
>>> Each 'host' (identified by a separate instance of the Linux kernel) will
>>> have exactly one running daemon to manage control groups. This daemon
>>> will answer cgroup management requests over a dbus socket, located at
>>> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>>> containers, so that one daemon can support the whole system.
>>>
>>> Programs will be able to make cgroup requests using dbus calls, or
>>> indirectly by linking against lmctfy which will be modified to use the
>>> dbus calls if available.
>>>
>>> Outline:
>>> . A single manager, cgmanager, is started on the host, very early
>>> during boot. It has very few dependencies, and requires only
>>> /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>>> the cgroup hierarchies in a private namespace and set defaults
>>> (clone_children, use_hierarchy, sane_behavior, release_agent?). It
>>> will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>> . A client (requestor 'r') can make cgroup requests over
>>> /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>>> requirements for r are listed below.
>>> . The client request will pertain to an existing or new cgroup A. r's
>>> privilege over the cgroup must be checked. r is said to have
>>> privilege over A if A is owned by r's uid, or if A's owner is mapped
>>> into r's user namespace, and r is root in that user namespace.
>>> . The client request may pertain to a victim task v, which may be moved
>>> to a new cgroup. In that case r's privilege over both the cgroup
>>> and v must be checked. r is said to have privilege over v if v
>>> is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>>> and r is root in its userns. Or if r and v have the same uid
>>> and v is mapped in r's pid namespace.
>>> . r's credentials will be taken from the socket's peercred, ensuring that
>>> pid and uid are translated.
>>> . r passes PID(v) as an SCM_CREDENTIAL, so that cgmanager receives the
>>> translated global pid (see the sketch after this outline). It will then
>>> read UID(v) from /proc/PID(v)/status, which is the global uid, and
>>> check /proc/PID(r)/uid_map to see whether UID is mapped there.
>>> . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>>> the kernel translate it for the reader. Only 'move task v to cgroup
>>> A' will require a SCM_CREDENTIAL to be sent.
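>>>
>>> A minimal sketch of that sending side (a hypothetical helper, not
>>> final code); note that per unix(7) the kernel requires CAP_SYS_ADMIN
>>> to send a pid other than the sender's own:
>>>
>>>     #define _GNU_SOURCE         /* for struct ucred, SCM_CREDENTIALS */
>>>     #include <string.h>
>>>     #include <sys/socket.h>
>>>     #include <unistd.h>
>>>
>>>     int send_victim_pid(int sock, pid_t v)
>>>     {
>>>         struct ucred cred = { .pid = v, .uid = getuid(), .gid = getgid() };
>>>         char dummy = '\0', buf[CMSG_SPACE(sizeof(cred))];
>>>         struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
>>>         struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
>>>                               .msg_control = buf,
>>>                               .msg_controllen = sizeof(buf) };
>>>         struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
>>>
>>>         cmsg->cmsg_level = SOL_SOCKET;
>>>         cmsg->cmsg_type = SCM_CREDENTIALS;
>>>         cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
>>>         memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));
>>>         return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
>>>     }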
>>>
>>> Privilege requirements by action:
>>> * Requestor of an action (r) over a socket may only make
>>> changes to cgroups over which it has privilege.
>>> * Requestors may be limited to a certain number/depth of cgroups
>>> (to limit memory usage) - DEFER?
>>> * Cgroup hierarchy is responsible for resource limits
>>> * A requestor must either be uid 0 in its userns with the victim mapped
>>> into its userns, or have the same uid and be in the same/ancestor pidns
>>> as the victim
>>> * If r requests creation of cgroup '/x', /x will be interpreted
>>> as relative to r's cgroup. r cannot make changes to cgroups not
>>> under its own current cgroup.
>>> * If r is not in the initial user_ns, then it may not change settings
>>> in its own cgroup, only descendants. (Not strictly necessary -
>>> we could require the use of extra cgroups when wanted, as lxc does
>>> currently)
>>> * If r requests creation of cgroup '/x', it must have write access
>>> to its own cgroup (not strictly necessary)
>>> * If r requests chown of cgroup /x to uid Y, Y is passed in a
>>> ucred over the unix socket, and therefore translated to init
>>> userns.
>>> * if r requests setting a limit under /x, then
>>> . either r must be root in its own userns, and UID(/x) be mapped
>>> into its userns, or else UID(r) == UID(/x)
>>> . /x must not be / (not strictly necessary, all users know to
>>> ensure an extra cgroup layer above '/')
>>> . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>>> which won't be satisfied. Therefore we'll need to do privilege
>>> checks ourselves, then perform the write as the host root user.
>>> (see devices.allow/deny). Further we need to support older kernels
>>> which don't support setns for pid.
>>> * If r requests action on victim V, it passes V's pid in a ucred,
>>> so that it gets translated.
>>> The daemon will verify that V's uid is mapped into r's userns (see
>>> the sketch after this list). Since r is either root or has the same
>>> uid as V, it is allowed to classify.
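>>>
>>> A sketch of that mapping check (hypothetical helper): parse
>>> /proc/PID(r)/uid_map, whose lines read "<id-in-ns> <id-seen-by-reader>
>>> <range>"; since the daemon runs in the init userns, the second column
>>> is a global uid:
>>>
>>>     #include <stdbool.h>
>>>     #include <stdio.h>
>>>     #include <sys/types.h>
>>>
>>>     bool uid_mapped_into_userns(pid_t r, uid_t uid)
>>>     {
>>>         char path[64];
>>>         unsigned long ns, host, range;
>>>         bool found = false;
>>>         FILE *f;
>>>
>>>         snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)r);
>>>         if (!(f = fopen(path, "r")))
>>>             return false;
>>>         while (fscanf(f, "%lu %lu %lu", &ns, &host, &range) == 3)
>>>             if (uid >= host && uid < host + range) {
>>>                 found = true;
>>>                 break;
>>>             }
>>>         fclose(f);
>>>         return found;
>>>     }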
>>>
>>> The above addresses
>>> * creating cgroups
>>> * chowning cgroups
>>> * setting cgroup limits
>>> * moving tasks into cgroups
>>> . but does not address a 'cgexec <group> -- command' type of behavior.
>>> * To handle that (specifically for upstart), we recommend that r do:
>>>       pid = fork();
>>>       if (!pid) {                               /* in the child */
>>>           request_reclassify(cgroup, getpid()); /* ask daemon to move us */
>>>           do_execve();                          /* then exec the command */
>>>       }
>>> . alternatively, if the kernel is new enough, the daemon could setns to
>>> the requestor's namespaces to execute a command in a new cgroup (see
>>> the sketch below). The new command would be daemonized to that pid
>>> namespace's pid 1.
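>>>
>>> A sketch of that alternative (hypothetical helper, recent kernels
>>> only); note that for a pid namespace, setns only affects children
>>> forked afterwards:
>>>
>>>     #define _GNU_SOURCE         /* for setns() */
>>>     #include <fcntl.h>
>>>     #include <sched.h>
>>>     #include <stdio.h>
>>>     #include <unistd.h>
>>>
>>>     int enter_ns(pid_t pid, const char *name) /* name: "pid", "user", ... */
>>>     {
>>>         char path[64];
>>>         int fd, ret;
>>>
>>>         snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, name);
>>>         fd = open(path, O_RDONLY);
>>>         if (fd < 0)
>>>             return -1;
>>>         ret = setns(fd, 0);     /* 0 = don't restrict the ns type */
>>>         close(fd);
>>>         return ret;
>>>     }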
>>>
>>> Types of requests:
>>> * r requests creating cgroup A'/A
>>> . lmctfy/cli/commands/create.cc
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Verify that UID(R) is mapped into r's userns
>>> . Create R/A'/A
>>> . chown R/A'/A to UID(r)
>>> * r requests to move task x to cgroup A.
>>> . lmctfy/cli/commands/enter.cc
>>> . r must send PID(x) as ancillary message
>>> . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>>> that userns
>>> (is it safe to allow this if UID(x) == UID(r)?)
>>> . R=cgroup_of(r)
>>> . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
>>> . echo PID(x) >> /R/A/tasks (see the sketch after this list)
>>> * r requests chown of cgroup A to uid X
>>> . X is passed in ancillary message
>>> * ensures it is valid in r's userns
>>> * maps the userid to host for us
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Chown R/A to X
>>> * r requests cgroup A's 'property=value'
>>> . Verify that either
>>> * A != ''
>>> * UID(r) == 0 on host
>>> In other words, r in a userns may not set root cgroup settings.
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Set property=value for R/A
>>> * Expect kernel to guarantee hierarchical constraints
>>> * r requests deletion of cgroup A
>>> . lmctfy/cli/commands/destroy.cc (without -f)
>>> . same requirements as setting 'property=value'
>>> * r requests purge of cgroup A
>>> . lmctfy/cli/commands/destroy.cc (with -f)
>>> . same requirements as setting 'property=value'
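>>>
>>> For illustration, the 'echo PID(x) >> /R/A/tasks' step of the move
>>> request could look like this (hypothetical helper, checks omitted):
>>>
>>>     #include <stdio.h>
>>>     #include <sys/types.h>
>>>
>>>     int move_task(const char *cgroup_dir, pid_t x) /* cgroup_dir = "R/A" */
>>>     {
>>>         char path[4096];
>>>         FILE *f;
>>>
>>>         snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
>>>         f = fopen(path, "we");  /* "e" = O_CLOEXEC (glibc) */
>>>         if (!f)
>>>             return -1;
>>>         fprintf(f, "%d\n", (int)x);
>>>         return fclose(f) ? -1 : 0;
>>>     }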
>>>
>>> Long-term we will want the cgroup manager to become more intelligent -
>>> to place its own limits on clients, to address cpu and device hotplug,
>>> etc. Since we will not be doing that in the first prototype, the daemon
>>> will not keep any state about the clients.
>>>
>>> Client DBus Message API
>>>
>>> <name>: a-zA-Z0-9
>>> <name>: "a-zA-Z0-9 "
>>> <controllerlist>: <controller1>[:controllerlist]
>>> <valueentry>: key:value
>>> <valueentry>: frozen
>>> <valueentry>: thawed
>>> <values>: valueentry[:values]
>>> keys:
>>> {memory,swap}.{limit,soft_limit}
>>> cpus_allowed # set of allowed cpus
>>> cpus_fraction # % of allowed cpus
>>> cpus_number # number of allowed cpus
>>> cpu_share_percent # percent of cpushare
>>> devices_whitelist
>>> devices_blacklist
>>> net_prio_index
>>> net_prio_interface_map
>>> net_classid
>>> hugetlb_limit
>>> blkio_weight
>>> blkio_weight_device
>>> blkio_throttle_{read,write}
>>> readkeys:
>>> devices_list
>>> {memory,swap}.{failcnt,max_use,limit,numa_stat}
>>> hugetlb_max_usage
>>> hugetlb_usage
>>> hugetlb_failcnt
>>> cpuacct_stat
>>> <etc>
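>>>
>>> Since keys are deliberately a hair more general than cgroupfs
>>> filenames, the daemon will presumably keep an internal key-to-filename
>>> mapping along these lines (illustrative guesses only, not a committed
>>> mapping):
>>>
>>>     static const struct { const char *key, *file; } keymap[] = {
>>>         { "memory.limit",      "memory.limit_in_bytes"      },
>>>         { "memory.soft_limit", "memory.soft_limit_in_bytes" },
>>>         { "blkio_weight",      "blkio.weight"               },
>>>         { "net_classid",       "net_cls.classid"            },
>>>     };
>>>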
>>> Commands:
>>> ListControllers
>>> Create <name> <controllerlist> <values>
>>> Setvalue <name> <values>
>>> Getvalue <name> <readkeys>
>>> ListChildren <name>
>>> ListTasks <name>
>>> ListControllers <name>
>>> Chown <name> <uid>
>>> Chown <name> <uid>:<gid>
>>> Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
>>> Delete <name>
>>> Delete-force <name>
>>> Kill <name>
>>>
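>>> As a rough sketch of what a client call could look like with libdbus
>>> (the service, object path, and interface names below are placeholders,
>>> nothing is decided yet):
>>>
>>>     #include <dbus/dbus.h>
>>>
>>>     DBusMessage *new_create_call(const char *name)
>>>     {
>>>         DBusMessage *m = dbus_message_new_method_call(
>>>             "org.linuxcontainers.cgmanager",  /* destination (placeholder) */
>>>             "/org/linuxcontainers/cgmanager", /* object path (placeholder) */
>>>             "org.linuxcontainers.cgmanager",  /* interface (placeholder) */
>>>             "Create");
>>>         if (m)
>>>             dbus_message_append_args(m, DBUS_TYPE_STRING, &name,
>>>                                      DBUS_TYPE_INVALID);
>>>         return m;
>>>     }
>>>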
>>
>> I really like the idea, but I have a few comments.
>> I'm not familiar with dbus, but how will you identify a request made over dbus?
>> I mean, will you get its pid? And if the container has its own PID namespace, how will this be handled?
>
> DBus is essentially just an IPC protocol that can be used over a
> variety of transports.
>
> In the case of this cgroup manager, we'll be using the DBus protocol on
> top of a standard UNIX socket. One of the properties of unix sockets is
> that you can get the uid, gid and pid of your peer. As this information
> is provided by the kernel, it'll automatically be translated to match
> your vision of the pid and user tree.
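>
> A minimal sketch of reading those translated peer credentials
> (hypothetical helper):
>
>     #define _GNU_SOURCE         /* for struct ucred */
>     #include <sys/socket.h>
>
>     /* pid/uid/gid in 'out' are as seen from the reader's namespaces */
>     int get_peer_cred(int sock, struct ucred *out)
>     {
>         socklen_t len = sizeof(*out);
>         return getsockopt(sock, SOL_SOCKET, SO_PEERCRED, out, &len);
>     }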
>
> That's why we're also planning on abusing SCM_CRED a tiny bit so that
> when a container or sub-container is asking for a pid to be moved into a
> cgroup, instead of passing that pid as a standard integer over dbus,
> it'll instead use the SCM_CRED mechanism, sending a ucred structure
> instead which will then get magically mapped to the right namespace when
> accessed by the manager and saving us a whole lot of pid/uid mapping
> logic in the process.
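>
> On the receiving side the manager just has to enable credential
> passing; the ucred that then accompanies each message has already been
> translated by the kernel (hypothetical helper):
>
>     #include <sys/socket.h>
>
>     int enable_cred_passing(int sock)
>     {
>         int one = 1;
>         return setsockopt(sock, SOL_SOCKET, SO_PASSCRED,
>                           &one, sizeof(one));
>     }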
>
>>
>> I know that this may sound a bit radical, but I propose that the daemon use simple unix sockets.
>> The daemon should have an easy way of adding more sockets for newly started containers, and each newly created socket
>> should know the base cgroup to which it belongs. This way the daemon can clearly identify which request is limited to
>> which cgroup without many lookups, and it will be easier to enforce the above-mentioned restrictions.
>
> So it looks like our current design already follows your recommendation
> since we're indeed using a standard unix socket, it's just that instead
> of re-inventing the wheel, we use a standard IPC protocol on top of it.
Thanks, I was thinking about exactly that, SCM_CRED :)
I was unaware that it could be combined with the dbus protocol, which is why I commented.
Is there any particular language you want this project started in? I know that most of LXC is C, but I
don't see any guidelines for or against using other languages.
Marian
>
>>
>> Marian
>>