[lxc-devel] [RFC PATCH 0/2] Loop device pseudo filesystem

Wed May 28 16:10:10 UTC 2014

On Wed, May 28, 2014 at 12:32 AM, Seth Forshee
<seth.forshee at canonical.com> wrote:
> On Tue, May 27, 2014 at 03:19:15PM -0700, Andy Lutomirski wrote:
>> On Tue, May 27, 2014 at 2:58 PM, Seth Forshee
>> <seth.forshee at canonical.com> wrote:
>> > I'm posting these patches in response to the ongoing discussion of loop
>> > devices in containers at [1].
>> >
>> > The patches implement a pseudo filesystem for loop devices, which will
>> > allow use of loop devices in containers using standard utilities. Under
>> > normal use a loopfs mount will initially contain a single device node
>> > for loop-control which can be used to request and release loop devices.
>> > Any devices allocated via this node will automatically appear in that
>> > loopfs mount (and in devtmpfs) but not in any other loopfs mounts.
>> > CAP_SYS_ADMIN in the userns of the process which performed the mount is
>> > allowed to perform privileged loop ioctls on these devices.
>> >
>> > Alternately loopfs can be mounted with the hostmount option, intended
>> > for mounting /dev/loop in the host. This is the default mount for any
>> > devices not created via loop-control in a loopfs mount (e.g. devices
>> > created during driver init, devices created via /dev/loop-control, etc).
>> > This is only available to system-wide CAP_SYS_ADMIN.
>> >
>> > I still have some testing to do on these patches, but they work at
>> > minimum for simple use cases. It's possible to use an unmodified losetup
>> > if it's new enough to know about loop-control, with a couple of caveats:
>> >
>> >  * /dev/loop-control must be symlinked to /dev/loop/loop-control
>> >  * In some cases losetup attempts to use /dev/loopN when the device node
>> >    is at /dev/loop/N. For example, 'losetup -f disk.img' fails.
>> >
>> > Device nodes for loop partitions are not created in loopfs. These
>> > devices are created by the generic block layer, and the loop driver has
>> > no way of knowing when they are created, so some kind of hook into the
>> > driver will be needed to support this.
>>
>> This is entertaining and a bit terrifying :)
>>
>> ISTM that what you've done is to create a way for per-userns devices
>> to live in a special filesystem and for userns containers to
>> instantiate those devices by offloading all the hard work to the
>> kernel.
>>
>> What if we generalized this?
>>
>> For example, we could add a concept of ephemeral devices.  An
>> ephemeral device is a device that can be referenced by an inode with a
>> guarantee that the inode will *never* accidentally point to a
>> different device [1].  Then we add a concept of the userns that owns a
>> struct device.
>>
>> To make this safe, we'll need to make sure that old host udev will not
>> see non-init-userns devices, ever.  This is easy enough to do, but
>> doing it elegantly might take some design work.
>
> To do this wouldn't we need a generic way to know which namespace a
> device goes with? Greg has clearly stated that he doesn't want to do
> this.

This is IMO silly.  If Greg doesn't want any kind of namespaces in the
device core, then sticking considerably more complicated namespaces
into the *loop* driver is just absurd.

>
>> To make this useful, we'll need a way for things inside user
>> namespaces to create the device nodes.  I can imagine at least three
>> ways to make this work.
>>
>> a) Allow mknod on a tmpfs created by a particular userns to succeed if
>> the targeted struct device is owned by that userns or a child and if
>> the caller is ns_capable(CAP_MKNOD).
>> b) Create a new filesystem that has some special ioctl or whatever to do it.
>> c) Have real per-user-ns devtmpfs.
>>
>> Now, to get loop working in a userns, we need a way for the userns (or
>> the host!) to create a new loop-control device owned by that userns
>> and we need to tweak the loop driver to make the created loop devices
>> be owned by the userns.
>
> The patches I posted previously more or less did this using per-ns
> devtmpfs, aside from the ephemeral part. The feedback was "just do it in
> loop," so I sent these to facilitate discussing this option with
> something concrete. I personally still like the per-ns devtmpfs
> approach, but that's been nacked.

The ephemeral part might not be needed using devtmpfs if devtmpfs can
guarantee that the device nodes go away if the device goes away.  I
don't know whether it can make that guarantee.

>
> (a) might be interesting, but I'd expect the same objections to be
> raised as for (c). And it seems to me that (b) is just an alternate
> interface for (a).
>

True.
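
To make the rule in (a) concrete, here is a rough user-space model of the
check, not kernel code. All names (model_userns, model_device,
model_may_mknod) are hypothetical, "a child" is assumed to mean any
descendant namespace, and the ns_capable(CAP_MKNOD) test is reduced to a
boolean passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only the parent link matters. */
struct model_userns {
    const struct model_userns *parent;  /* NULL for the initial namespace */
};

/* Hypothetical model of a struct device tagged with its owning userns. */
struct model_device {
    const struct model_userns *owner;
};

/* Return true if ns is target itself or a descendant of target. */
static bool ns_is_same_or_descendant(const struct model_userns *ns,
                                     const struct model_userns *target)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == target)
            return true;
    return false;
}

/*
 * Option (a): mknod on a tmpfs created by tmpfs_ns succeeds only if the
 * device is owned by tmpfs_ns or one of its descendants, and the caller
 * holds CAP_MKNOD (assumed to be checked elsewhere and passed in here).
 */
static bool model_may_mknod(const struct model_userns *tmpfs_ns,
                            const struct model_device *dev,
                            bool caller_has_cap_mknod)
{
    assert(tmpfs_ns != NULL && dev != NULL);

    if (!caller_has_cap_mknod)
        return false;
    return ns_is_same_or_descendant(dev->owner, tmpfs_ns);
}
```

Under this model a container can create nodes for its own devices and for
devices of namespaces nested below it, but never for a host-owned device,
whatever capabilities it holds inside its own namespace.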

>> (Note: I'm deliberately ignoring the fact that just doing this for
>> loop seems to be almost entirely useless right now: you still can't
>> mount the things.)
>
> You could also argue that it's useless to be able to mount things if you
> have no block device on which to mount them. We have to start somewhere.
>

True.

But if we take this particular route, then I can imagine a real mess
when someone wants to mount a non-loop device, and we get stuck on how
to expose the device node.  Sigh.

--Andy