<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <span dir="ltr">&lt;<a href="mailto:gregkh@linuxfoundation.org" target="_blank">gregkh@linuxfoundation.org</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div>On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:<br>


> So the big issues for a device namespace to solve are filtering which<br>

> devices a container has access to and being able to dynamically change<br>

> which devices those are at run time (aka hotplug).<br>

<br>

</div>As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG<br>

anymore, because it was redundant), I think you need to really think<br>

this through better (pci, memory, cpus, etc.) before you do anything in<br>

the kernel.<br>

<div><br>

> After having thought about this for a bit I don't know if a pure<br>

> userspace solution is sufficient or actually a good idea.<br>

><br>

> - We can manually manage a tmpfs with device nodes in userspace.<br>

>   (But that is deprecated functionality in the mainstream kernel).<br>

<br>

</div>Yes, but I'm not going to namespace devtmpfs, as that is going to be an<br>

impossible task, right?<br></blockquote><div><br></div><div>That sounds like a challenge ;-)</div><div>Seriously, as Serge correctly noted, it would not be that different from devpts</div><div>if you start from an empty devtmpfs and populate it with devices that are "added</div>

<div>in the context of that namespace".</div><div>The semantics in which devices are "added in the context of a namespace"</div><div>is the missing piece of the puzzle.</div><div><br></div><div>What we would really like to see is a setns()-style API that can be used to</div>

<div>add a device in the context of a namespace in either a "shared" or "private"</div><div>mode.</div><div>This kind of API is a required building block for us to write device drivers</div><div>that are namespace-aware in a way that gives userspace enough flexibility</div>

<div>for dynamic configuration.</div><div><br></div><div>We are trying to come up with a proposal for that sort of API.</div><div>When we have something decent, we shall post it.</div><div><br></div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

And remember, udev doesn't create device nodes anymore...<br>

<div><br>

> - We can manually export a subset of sysfs with bind mounts.<br>

>   (But that feels hacky, and is essentially incompatible with hotplug).<br>

<br>

</div>True.<br>

<div><br>

> - We can relay a call of /sbin/hotplug from outside of a container<br>

>   to inside of a container based on policy.<br>

>   (But no one uses /sbin/hotplug anymore).<br>

<br>

</div>That's right, they should be listening to libudev events, so why can't<br>

your daemon shuffle them off to the proper container, all in userspace?<br>

<div><br>

> - There is no way to fake netlink uevents for a container to see them.<br>

>   (The best we could do is replace udev everywhere with something that<br>

>    listens on a unix domain socket).<br>

<br>

</div>You shouldn't need to do this.<br>

<div><br>

> - It would be nice to replace the device cgroup with a comprehensive<br>

>   solution that really works. (Among other things the device cgroup<br>

>   does not work in terms of struct device, the underlying kernel<br>

>   abstraction for devices).<br>

<br>

</div>I didn't even know there was a device cgroup.<br>

<br>

Which means that if there is one, odds are it's useless.<br>

<div><br>

> We must manage sysfs entries as well as device nodes because:<br>

> - Seeing more than we should has the real potential to confuse<br>

>   userspace, especially a userspace that replays uevents.<br>

<br>

</div>You should never replay uevents.  If you don't do that, why can't you<br>

see all of sysfs?<br>

<div><br>

> - Some device control must happen through writing to sysfs files and<br>

>   if we don't remove all root privileges from a container only by<br>

>   exporting a subset of sysfs to that container can we limit which<br>

>   sysfs nodes can be written to.<br>

<br>

</div>But you have the issue of controlling devices in a "shared" way, which<br>

isn't going to be usable for almost all devices.<br>

<div><br>

> The current kernel tagged sysfs entry support does not look like a good<br>

> match for implementing device filtering.   The common case will<br>

> be allowing devices like /dev/zero, and /dev/null that live in<br>

> /sys/devices/virtual and are the devices we are most likely to care<br>

> about.  Those devices need to live in multiple device namespaces so<br>

> everyone can use them.  Perhaps exclusive assignment will be the more<br>

> common paradigm for device namespaces like it is for network devices in<br>

> the network namespace but from what little I can see of this problem right now I<br>

> don't think so.<br>

><br>

> I definitely think we should hold off on a kernel level implementation<br>

> until we really understand the issues and are ready to implement device<br>

> namespaces correctly.<br>

<br>

</div>I agree, especially as I don't think this will ever work.<br>

<div><br>

> A userspace implementation looks like it can only do about 95% of what<br>

> is really needed, but at the same time looks like an easy way to<br>

> experiment until the problem is sufficiently well understood.<br>

<br>

</div>95% is probably way better than what you have today, and will fit the<br>

needs of almost everyone today, so why not do it?<br>

<br>

I'd argue that those last 5% either are custom solutions that never get<br>

merged, or candidates for true virtualization.<br>

<div><br>

> In summary the situation with device hotplug and containers sucks today,<br>

> and we need to do something.  Running a linux desktop in a container is<br>

> a reasonably good example use case.<br>

<br>

</div>No it isn't.  I'd argue that this is a horrible use case, one that you<br>

shouldn't do.  Why not just use multi-head machines like people do who<br>

really want to do this, relying on user separation?  That's a workable<br>

solution that is quite common and works very well today.<br>

<div><br>

> Having one standard common maintainable implementation would be very<br>

> useful and the most logical place for that would be in the kernel.<br>

> For now we should focus on simple device filtering and hotplug.<br>

<br>

</div>Just listen for libudev stuff, don't try to filter them, or ever<br>

"replay" them, that way lies madness, and lots of nasty race conditions<br>

that are guaranteed to break things.<br>

<br>

good luck,<br>

<br>

greg k-h<br>

</blockquote></div><br></div></div>