[lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

Thu May 15 20:26:28 UTC 2014

Quoting Richard Weinberger (richard at nod.at):
> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > Quoting Richard Weinberger (richard.weinberger at gmail.com):
> >> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
> >> <gregkh at linuxfoundation.org> wrote:
> >>> Then don't use a container to build such a thing, or fix the build
> >>> scripts to not do that :)
> >>
> >> I second this.
> >> To me it looks like some folks try to (ab)use Linux containers
> >> for purposes where KVM would much better fit in.
> >> Please don't put more complexity into containers. They are already
> >> horribly complex
> >> and error prone.
> > 
> > I, naturally, disagree :)  The only use case which is inherently not
> > valid for containers is running a kernel.  Practically speaking there
> > are other things which likely will never be possible, but if someone
> > offers a way to do something in containers, "you can't do that in
> > containers" is not an apropos response.
> > 
> > "That abstraction is wrong" is certainly valid, as when vpids were
> > originally proposed and rejected, resulting in the development of
> > pid namespaces.  "We have to work out (x) first" can be valid (and
> > I can think of examples here), assuming it's not just trying to hide
> > behind a catch-22/chicken-egg problem.
> > 
> > Finally, saying "containers are complex and error prone" is conflating
> > several large suites of userspace code and many kernel features which
> > support them.  Being more precise would, if the argument is valid,
> > lend it a lot more weight.
> 
> We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc.
> To understand the internals better I also wrote my own userspace to create/start
> containers. There are so many things which can hurt you badly.
> With user namespaces we expose a really big attack surface to regular users.
> For example, suddenly a user is allowed to mount filesystems.

That is currently not the case.  They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems.  This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.

> Ask Andy, he has already found lots of nasty things...

Yes, of course, and there may be more to come...

> I agree that user namespaces are the way to go, all the papering with LSM
> over security issues is much worse.
> But we have to make sure that we don't add too many features too fast.

Agreed.  Like I said, "we have to work (x) out first" could be valid,
including "we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints."

On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).

> That said, I like containers a lot because they are cheap, but because they are lightweight,
> their isolation level is lightweight as well.
> IMHO containers are not a cheap replacement for KVM.

The building blocks for containers can also be used for entirely
new, simpler use cases - e.g. perhaps a new fakeroot alternative based
on user namespace mappings.  Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.
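
To make the fakeroot idea a bit more concrete, here is a rough sketch of
the mapping such a tool might set up.  It only builds and interprets a
line in the "inside-id outside-id count" format that /proc/<pid>/uid_map
accepts; it does not create a namespace.  The helper names and the uid
1000 are assumptions made up for illustration, not part of any existing
tool.

```python
# Sketch only: these helper names are hypothetical, not from any library.

def format_id_map(inside_start, outside_start, count=1):
    """Return one line in the format accepted by /proc/<pid>/uid_map."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"{inside_start} {outside_start} {count}\n"


def to_outside_id(inside_id, inside_start, outside_start, count=1):
    """Translate an ID seen inside the namespace to the ID outside it.

    Returns None when the ID falls outside the mapped range.
    """
    offset = inside_id - inside_start
    if 0 <= offset < count:
        return outside_start + offset
    return None


# Fakeroot-style mapping: uid 0 inside the namespace is the unprivileged
# uid 1000 (an assumed example value) outside of it.
fakeroot_map = format_id_map(0, 1000, 1)
```

With a single-entry mapping like this, a process appears to be root
inside the namespace while still acting as an ordinary user on the
host; every other uid remains unmapped.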

-serge