[lxc-devel] Device Namespaces

Tue Oct 29 00:30:01 UTC 2013

On 10/28/2013 7:31 PM, Andrey Wagin wrote:
> I have experience implementing this functionality in the OpenVZ
> kernel. I was required not to modify user-space tools, so that
> implementation looks like a dirty hack, but even device hotplug
> works there.
Same here.  I want the container to run unmodified code as much as
possible, and in my case that means an unmodified udev as well.
Ideally, I want the uevents destined for a container, and only those
uevents, to go there.  This requires a few kernel modifications, but
nothing massive.  I'm working on getting permission from my management
to post the patch set that implements this here.

> I would prefer to think a bit more about a userspace solution. We can
> try to expand udev functionality.
My changes require a userspace daemon that runs on the host system and
forwards uevents to containers after applying whatever policy the admin
wants or needs.  In my case, I need mouse, keyboard and display events
to go to the appropriate container.  Others might want serial device
plug-in events, or whatever else.
The daemon listens to the same netlink socket as the host udev, and
then writes to a simple device that forwards the event to the
appropriate container's netlink socket.  The events are then read, or
not, by whatever is running in the container (udev, systemd, etc.),
which does whatever is needed.  In our case, the daemon also handles
creating the base devices in a host directory that is bind-mounted
onto the container's /dev directory.
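
As a rough illustration of the host-side policy step described above,
here is a minimal Python sketch.  It is not the daemon from the patch
set: the POLICY table and the container names are made up for the
example, and the final forwarding step (writing to the forwarding
device) is left as a print because that interface has not been posted.
It only assumes the standard kernel uevent datagram layout, an
"action@devpath" header followed by NUL-separated KEY=VALUE fields.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # protocol number from <linux/netlink.h>

# Hypothetical admin policy: uevent SUBSYSTEM -> containers that get it.
POLICY = {
    "input": ["desktop-ct"],   # mouse and keyboard
    "drm": ["desktop-ct"],     # display
    "tty": ["serial-ct"],      # serial devices
}


def parse_uevent(raw):
    """Parse a kernel uevent datagram into a dict of its KEY=VALUE fields.

    The first NUL-terminated field is the 'action@devpath' header and is
    skipped; the same data is repeated as ACTION= and DEVPATH= fields.
    """
    fields = [f for f in raw.split(b"\0") if f]
    event = {}
    for field in fields[1:]:
        key, sep, value = field.decode("utf-8", "replace").partition("=")
        if sep:
            event[key] = value
    return event


def route(event, policy=POLICY):
    """Return the containers that should receive this event (may be empty)."""
    return policy.get(event.get("SUBSYSTEM", ""), [])


def listen():
    """Receive kernel uevents on the host and report where they would go.

    Linux only.  Group 1 is the kernel's uevent multicast group.
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))  # pid 0: let the kernel pick; groups bitmask 1
    while True:
        event = parse_uevent(sock.recv(8192))
        for container in route(event):
            # Placeholder for the real forwarding step.
            print(container, event.get("ACTION"), event.get("DEVPATH"))
```

Calling listen() on the host prints one line per (container, event)
pair that the policy selects; events with no policy entry are dropped.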

> Or we can teach udev to listen on a unix domain socket. The host udev
> listens on netlink. When it gets an event about a new device, it
> decides which containers the device must be available to, does all
> required actions, and sends events into the containers. Probably the
> notification protocol must be unified for all udev-like services.
These minimal kernel changes plus a host daemon fix "most" of the
problems.  There are a few warts, notably sysfs and devtmpfs.

> Sorry if the following idea sounds crazy. Can we use FUSE
> filesystems for filtering sysfs and devtmpfs? When a CT mounts sysfs,
> it will mount fuse-sysfs, which is implemented by a userspace program
> on the host system.
> * This allows emulating the behavior of uevent files in containers,
>   if we use unix sockets between udev services.
> * A userspace daemon will probably be more flexible and customizable
>   than something in the kernel.
> Do we have a use case where sysfs performance is critical?
I started working on a devtmpfs FUSE, and the issues are many: there's
the performance penalty, security, etc.  It looks possible and might
be doable, but in the short term it's easier for me to have a
directory that the host can modify, bind mount that directory onto the
container's /dev, and simply not use devtmpfs in the container.  I do
still need a way to block the mounting of "bad" kernel filesystems, so
that an adversarial container can't harm the host.
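
For concreteness, the bind-mount approach amounts to something like the
following Python sketch.  The directory paths are hypothetical
examples, and this is only an illustration of the idea, not the tooling
we actually use; it just shells out to the standard "mount --bind".

```python
import posixpath
import subprocess


def bind_dev_cmd(host_dev_dir, container_rootfs):
    """Build the mount command that exposes host_dev_dir as the
    container's /dev."""
    target = posixpath.join(container_rootfs, "dev")
    return ["mount", "--bind", host_dev_dir, target]


def bind_dev(host_dev_dir, container_rootfs):
    """Run the bind mount on the host (needs root)."""
    subprocess.run(bind_dev_cmd(host_dev_dir, container_rootfs), check=True)
```

For example, bind_dev("/srv/ct-dev/ct1", "/var/lib/lxc/ct1/rootfs")
would make whatever device nodes the host creates under /srv/ct-dev/ct1
appear in that container's /dev.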

FUSE might be a better match for sysfs, but you'd need a filter that
manages the massive directed graph and prunes unnecessary things from
the graph on a per-container basis.  My group is working on this
right now to see how bad it will be.
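
To give a feel for the per-container pruning, here is a deliberately
simplified Python sketch.  It treats sysfs as a plain tree of paths and
ignores the symlinks (class/, bus/, and so on) that make the real thing
a graph, so it understates the problem.  The function names and the
"allowed" set are invented for the example; a FUSE filter would make a
check like this in its readdir and getattr handlers.

```python
from pathlib import PurePosixPath


def is_visible(path, allowed):
    """Decide whether a sysfs path should be shown to a container.

    A path is visible if it is an allowed device, lies beneath one, or
    is an ancestor directory needed to reach one.
    """
    p = PurePosixPath(path)
    for entry in allowed:
        a = PurePosixPath(entry)
        if p == a or a in p.parents or p in a.parents:
            return True
    return False


def prune(paths, allowed):
    """Filter a directory listing down to what the container may see."""
    return [p for p in paths if is_visible(p, allowed)]
```

With allowed = {"/sys/devices/pci0/usb1/input0"}, the container would
still see /sys/devices/pci0 and everything under input0, but not a
sibling such as /sys/devices/pci0/usb2/ttyUSB0.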

Ultimately, I'd really rather have a containerized sysfs and devtmpfs, 
but I suspect that there's going to be a lot of push back on doing that 
in the kernel.

-- 
---Michael J Coss