<div dir="ltr">Hi Serge,<div><br></div><div>Sorry for the (long delay)^2</div><div><br></div><div>First, I am glad you fired up this discussion.</div><div>I think it's important that we are all familiar with each other's goals,</div><div>so we can aim at better collaboration.</div><div><br></div><div>It was very encouraging for us at Cellrox to learn at LPC about the </div><div>progress of the work on udevd by Michael Coss and the sysfs/procfs FUSE work.</div><div>Those are issues for which we are always on the lookout for a better solution</div><div>and we are looking forward to seeing these welcome contributions.</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Nov 4, 2014 at 11:50 AM, Serge Hallyn <span dir="ltr"><<a href="mailto:serge.hallyn@ubuntu.com" target="_blank">serge.hallyn@ubuntu.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi,<br>
<br>
Over the last few weeks several people have talked to me about<br>
various issues relating to devices in containers. So I thought<br>
I'd send this out as a general survey of the work that I know of<br>
being done relating in one way or another to devices in containers.<br>
Different people have different goals, and several people are doing<br>
their own thing to achieve their goals. I wanted to get started by<br>
having everyone being aware of what others are doing, in the hopes<br>
that, over the next few years, we can work toward a comprehensive<br>
solution.<br>
<br>
So here goes.<br>
<br>
Some people (mwarfield, Michael Coss) want to send uevents into<br>
specific containers, i.e. consoles or X displays. Michael Warfield<br>
(AIUI) does this by moving devices into /dev/lxc/$container/.<br>
At the containers track at plumbers a few weeks ago, Michael Coss<br>
presented a solution developed at Lucent where uevents were sent only<br>
to the initial netns, and a userspace daemon checks a database and<br>
forwards uevents directly into containers so that containers can hotplug<br>
as needed: <a href="http://www.linuxplumbersconf.org/2014/ocw/sessions/2157" target="_blank">http://www.linuxplumbersconf.org/2014/ocw/sessions/2157</a><br>
<br>
Several people have wanted to use iscsi in containers. AIUI Containers<br>
(at least non-userns) can use iscsi devices if they are moved into<br>
the container's namespace; however, Clint was wanting to go further<br>
and actually be able to create iscsi devices inside containers.<br>
My memory may fail me, but I believe that to solve that we'd need<br>
to extend the current netlink backend, which (IIRC) only accepts<br>
connections from the initial netns. More realistically, I'd envision<br>
an answer to this being a userspace daemon on the host which the<br>
containers can talk to to make requests. OTOH it also feels similar to<br>
the loop device namespacing issues which would be far more elegantly<br>
solved in the kernel (imo). Does anyone know of existing work to this<br>
end (either way) for iscsi?<br>
<br>
A few people have worked at the device driver level to actually<br>
namespace the devices themselves. For instance, Cellrox supports<br>
switching the active display between multiple containers. When c2<br>
is the active container, its writes go to the real display. When<br>
it is not the active container, its writes are buffered. This allows<br>
the devices to be namespaced without any actual general "device namespace"<br>
support in the kernel. </blockquote><div><br></div><div>That's a mostly accurate description of what we do.</div><div>Our main goal is "interactive" containers (for mobile handsets). </div><div>In the most common use case,</div><div>that translates to a foreground container, with access to the input/output devices,</div><div>and background containers, with access to multiplexed devices.</div><div><br></div><div>That means device driver behavior may differ depending on the "context" in which the device was opened, that is, on which container opened it and whether that container is currently in the foreground or the background.</div><div><div>In our implementation, we created devns to hold that "context", but we are considering</div><div>alternatives such as userns, cgroupns or even the device cgroup for holding it. </div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">ISTR there was another company doing something<br>
similar (I don't recall for which devices), but can't remember who/what<br>
at the moment.<br></blockquote><div><br></div><div>I am not sure who/what you are referring to. </div><div>The closest thing that comes to mind,</div><div>WRT foreground/background container management, </div><div>is the way that systemd-logind handles session switching.</div><div>The systemd approach requires more collaboration with the graphics server running</div><div>inside the container, but it also issues revoke calls on open input devices</div><div>and the DRM_IOCTL_DROP_MASTER ioctl on open display devices in the background context,</div><div>to cope with rogue (or unresponsive) graphics servers.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
As I alluded to earlier, Seth had previously done a bit of work on<br>
namespacing loop so that containers could create and use their own<br>
loop devices. For the moment that's been shelved and he's focused on<br>
fuse inside containers instead. However, at the kernel summit this<br>
year Ted Ts'o said that at least mounting ext4 inside a container<br>
"should" work, meaning that any issues (i.e. ability to corrupt the<br>
superblock reader with trash data by a malicious container admin)<br>
would be considered a bug (rather than "don't do that" misuse). I<br>
hope to follow up on that with some simple tests, and of course<br>
loopdev in containers will become far more compelling if we can<br>
actually mount ext4 in a container (which we can't right now).<br>
<br>
There's probably more and I apologize to anyone whose work I neglected<br>
to mention here. I think we're at a point where collaboration would<br>
be useful.<br>
<br>
thanks,<br>
-serge<br>
<br>
PS - I certainly get some details wrong. I'm gonna lie and claim that<br>
I did so on purpose to encourage responses/discussion :)<br>
</blockquote></div><br></div></div>