[lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

Sun May 25 08:12:10 UTC 2014

On Sat, 2014-05-24 at 22:25 +0000, Serge Hallyn wrote:
> Quoting James Bottomley (James.Bottomley at HansenPartnership.com):
> > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > Quoting Andy Lutomirski (luto at amacapital.net):
> > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <serge at hallyn.com> wrote:
> > > >>> 
> > > >>> Quoting Richard Weinberger (richard at nod.at):
> > > >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > >>>>> Quoting Richard Weinberger (richard.weinberger at gmail.com):
> > > >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman <gregkh at linuxfoundation.org> wrote:
> > > >>>>>>> Then don't use a container to build such a thing, or fix the build scripts to not do that :)
> > > >>>>>> 
> > > >>>>>> I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
> > > >>>>>> would fit much better. Please don't put more complexity into containers. They are already horribly
> > > >>>>>> complex and error prone.
> > > >>>>> 
> > > >>>>> I, naturally, disagree :)  The only use case which is inherently not valid for containers is running a
> > > >>>>> kernel.  Practically speaking there are other things which likely will never be possible, but if someone 
> > > >>>>> offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
> > > >>>>> 
> > > >>>>> "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
> > > >>>>> resulting in the development of pid namespaces.  "We have to work out (x) first" can be valid (and I can
> > > >>>>> think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
> > > >>>>> 
> > > >>>>> Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
> > > >>>>> code and many kernel features which support them.  Being more precise would, if the argument is valid, lend
> > > >>>>> it a lot more weight.
> > > >>>> 
> > > >>>> We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc. To understand
> > > >>>> the internals better I also wrote my own userspace to create/start containers. There are so many things which
> > > >>>> can hurt you badly. With user namespaces we expose a really big attack surface to regular users. For example,
> > > >>>> a user is suddenly allowed to mount filesystems.
> > > >>> 
> > > >>> That is currently not the case.  They can mount some virtual filesystems and do bind mounts, but cannot mount
> > > >>> most real filesystems.  This keeps us protected (for now) from potentially unsafe superblock readers in the 
> > > >>> kernel.
> > > >>> 
> > > >>>> Ask Andy, he has already found lots of nasty things...
> > > >> 
> > > >> I don't think I have anything brilliant to add to this discussion right now, except possibly:
> > > >> 
> > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
> > > >> untrusted user can cause a block device to appear.  That user doesn't need permission to mount it
> > > > 
> > > > Interesting point.  This would further suggest that we absolutely must ensure that a loop device which shows up in
> > > > the container does not also show up in the host.
> > > 
> > > Can I suggest the usage of the devices cgroup to achieve that?
> > 
> > Not really ... cgroups impose resource limits, it's namespaces that
> > impose visibility separations.  In theory this can be done with the
> > device namespace that's been proposed; however, a simpler way is simply
> > to rm the device node in the host and mknod it in the guest.  I don't
> > really see host visibility as a huge problem: in a shared OS
> > virtualisation it's not really possible securely to separate the guest
> > from the host (only vice versa).
> > 
> > But I really don't think we want to do it this way.  Giving a container
> > the ability to do a mount is too dangerous.  What we want to do is
> > intercept the mount in the host and perform it on behalf of the guest as
> > host root in the guest's mount namespace.  If you do it that way, it
> 
> That doesn't help the problem of guests being able to provide bad input
> for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> suffering a failure of the imagination - what problem exactly does it solve?

Well, there are two types of fuzzing. One is on sys_mount, which this
would help with because the host filters the mount, including all
parameters, and may even redo the mount (from direct to bind, etc.).

If you're thinking the system can be compromised by fuzzing within the
filesystem, then yes, I agree, but it's the same vulnerability an
unvirtualised host would have, so I don't necessarily see it as our
problem.

The problem vectored mount solves is that we don't want root in the
container to have unfettered access to sys_mount: vectoring the call
allows the host to vet every request and execute the ones it likes in
the context of real root (possibly after modifying the parameters).
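
As a rough illustration of that vetting step only (not code from this
patch series), here is a minimal Python sketch. Everything in it is
hypothetical: the MountRequest type, the vet_mount() function, the
policy tables and the paths are made-up names, and the sketch performs
no real mount. It only shows the host refusing, rewriting (direct to
bind) or adjusting a request before real root would act on it.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical policy tables; a real host would load these from configuration.
ALLOWED_FSTYPES = {"tmpfs", "proc", "sysfs"}
# Guest-visible devices the host will satisfy, mapped to a host-prepared
# tree that is bind-mounted instead of mounting the device directly.
BIND_SUBSTITUTES = {"/dev/loop0": "/srv/guest-images/loop0"}
# Options the host always adds, whatever the guest asked for.
FORCED_OPTIONS = ("nosuid", "nodev")


@dataclass(frozen=True)
class MountRequest:
    source: str
    target: str
    fstype: str
    options: tuple = ()
    bind: bool = False


def vet_mount(req: MountRequest) -> Optional[MountRequest]:
    """Return the request the host would execute, or None to refuse it."""
    # Refuse targets that are relative or contain a ".." component.
    if not req.target.startswith("/") or ".." in req.target.split("/"):
        return None
    if req.source in BIND_SUBSTITUTES:
        # Redo a direct device mount as a bind mount of a host-prepared tree.
        req = replace(req, source=BIND_SUBSTITUTES[req.source],
                      fstype="none", bind=True)
    elif req.fstype not in ALLOWED_FSTYPES:
        # Anything else must be a filesystem type the host has allowed.
        return None
    # Modify the parameters: append the forced options, dropping duplicates
    # while keeping the guest's original ordering.
    options = tuple(dict.fromkeys(req.options + FORCED_OPTIONS))
    return replace(req, options=options)
```

With these made-up tables, a tmpfs request comes back with nosuid and
nodev appended, an ext4 mount of /dev/sda1 is refused, and an ext4 mount
of /dev/loop0 comes back as a bind mount of the substitute directory.
Note this only covers the sys_mount side; it does nothing about fuzzing
of the filesystem code itself.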

James