[lxc-devel] mounts...

Sat Nov 14 16:06:07 UTC 2009

Andrian Nord wrote:
> On Sat, Nov 14, 2009 at 02:54:20PM +0300, Michael Tokarev wrote:
>> Hello!
>>
>> Several questions here if I can... ;)
>>
>> Why mountpoints in the per-container fstab can't be relative
>> to the container's rootfs?  It's trivial to implement by
>> allowing non-absolute pathnames in there and chdir'ing into
>> the rootfs prior to mounting.  That already works when running
>> lxc-start from within the container's rootfs.
> 
> That's not so trivial, imo. I'm working on 'config variables' with this
> problem in mind. This will allow you to specify something like
> ${lxc.rootfs}/... inside your fstab.

What's wrong with chdir(rootfs)?  This way the mountpoints
might be relative.  It works already if I cd to the rootfs
before lxc-start, and it works with a tiny 3-liner patch
for conf.c to do that internally.
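
To illustrate the relative-mountpoint idea, here is a minimal sketch in
shell, not the actual conf.c patch.  The function name and the example
paths are made up for illustration.  It only computes where a mountpoint
would land if the mount were done after a chdir into the rootfs:

```shell
#!/bin/sh
# Hypothetical helper: resolve an fstab mountpoint against a container rootfs.
# Absolute mountpoints are returned unchanged; relative ones are taken from
# the rootfs, which is the effect a chdir(rootfs) before mounting would have.
resolve_mountpoint() {
    rootfs=$1
    mntpoint=$2
    case "$mntpoint" in
        /*) printf '%s\n' "$mntpoint" ;;                 # absolute: keep as is
        *)  printf '%s/%s\n' "${rootfs%/}" "$mntpoint" ;; # relative: under rootfs
    esac
}

# Example (paths are assumptions):
#   resolve_mountpoint /srv/ct proc       prints /srv/ct/proc
#   resolve_mountpoint /srv/ct /mnt/data  prints /mnt/data
```

With relative entries the fstab never mentions the rootfs path, so moving
the rootfs does not require touching it.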

>> I think it's the way to go really - to _require_ non-absolute
>> mountpoints in the container's mount file.  Partly because it's
>> not a good idea to mount to a directory which is not visible from
>> within a container (but it might be useful still to grant access
>> to only part of the filesystem to a given container).  And partly
>> because it's just somewhat ugly.
> 
> There could be many possible situations, when you may need to mount
> something inside hn when container is started. It could be, for example,
> container's rootfs on some network file system, or some other resources.

I understand that sometimes it might be useful to mount something else
when starting a container.  I _think_ it's more useful to have that same
mount in a host still.  Maybe some template-root is not very good with
that mounting on the host node, I dunno.  But this is not the point --
the point is that most of the time, in some 99% of cases, the mount point
will be inside the container's root, and it's best to keep the whole
thing "transportable" by default (so that it is possible to move
the rootfs and the whole config still works).

> Anyway - this is a more flexible approach, and, imo, there is no great
> need for making it less flexible. Variables should be just ok for lazy
> maintaining =) (before that, you may write a trivial script that utilizes
> 'sed' for changing values from some template file, whatever you need).

A shell script is THE most flexible approach.  Because you may extend it
any weird way without trying to program it all beforehand in the support
tools.  The shell language already has well-defined variables and variable
substitutions, control structures and many other things, and it's commonly
understood by almost everyone.  There's no need to invent something of
this sort again, only much, much more limited.

>> Second question is about the "other" mountpoints that exist on
>> the host system when starting a container.  Is it a good idea to
>> umount "unrelated" filesystems that are not used in a container
>> but are still shown in /proc/mounts?  I mean, is there a way to
>> access these from within a container somehow, bypassing the
>> "container barrier"?
> 
> No. This is a conceptual matter of 'isolation'. Why should you need it? If
> you need some partition mounted inside the container - mount it when
> starting. If you need some part of hn's filesystem - mount-bind it when
> starting.

You misunderstand.

I'm concerned about the "other" filesystems which are visible in
/proc/mounts inside the container.  At first glance one can't access
these, but I'd say better safe than sorry: just umount at least
the ones which can be umounted.

For that, again, a shell script that will set up the namespace is a
way to go, at least initially (very easy to test some approach).
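
A rough sketch of what such a script could start from.  It assumes it runs
in the container's new mount namespace while the paths in /proc/mounts are
still the host's, and the function name is made up.  It only lists
candidates and does not umount anything itself:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mounts-formatted lines on stdin and print
# every mountpoint that is neither ROOTFS itself nor below it.
# Usage: list_foreign_mounts ROOTFS < /proc/mounts
list_foreign_mounts() {
    rootfs=${1%/}
    awk -v root="$rootfs" '
        {
            mp = $2   # second field is the mountpoint
            if (mp != root && index(mp, root "/") != 1)
                list[n++] = mp
        }
        END {
            # later mounts usually sit on top of earlier ones, so print in
            # reverse to get an order in which umount has a chance to work
            for (i = n - 1; i >= 0; i--)
                print list[i]
        }
    '
}

# One possible use (errors ignored, busy filesystems simply stay mounted):
#   list_foreign_mounts "$rootfs" < /proc/mounts |
#       while read -r mp; do umount "$mp" 2>/dev/null; done
```

Note that /proc/mounts escapes spaces in paths as \040, which this sketch
does not decode.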

I see it is doing a simple chroot.  Which can be broken out of if you're
root, right? :)  So the thing _is_ of some concern, I think.

>> That's just... nonsense ;)
> 
> Yes, yes. This is a kernel issue, I suppose, something about not fully
> virtualized procfs. But I don't know any userspace utility that uses
> /proc/mounts directly. They all are using /etc/mtab. Just clear it of
> bad entries and all would be ok.

Well, /etc/mtab is an obsolete relic.  It should not be used, exactly because
it does not reflect reality: a process might be in a different namespace
and the like.  Also, I prefer a read-only root filesystem, so that /etc is
non-writable too (sure I can symlink /etc/mtab to /var/run/mtab or the
like, but this is getting uglier and uglier).

Yes, /etc/mtab is a possible work-around, at least till the right solution
can be found.  It won't work in all cases (and won't work in several of my use
cases already - I plan to move a running system from a virtual machine to
a container, and on that system we already utilize different mount namespaces,
chroots with other distros and the like, where /proc/mounts is quite important).

But it is still a workaround, and one that brings us back to legacy stuff.

But umounting some stuff before running the container definitely helps.

I wonder if it is possible to go even further than that, with all the funny
bind mounts and mount --move and pivot_root and stuff like that, and actually
umount everything inside the container, everything that is not needed.

>> Where that /tmp/lxc-rC7sKKP come from?  What's the reason to
>> create a separate mount to start with, why not use rootfs directly?
>> I _think_ I don't understand something here and a separate mount is
>> actually required to be a rootfs for a container, in a way similar
>> to somewhat-fake (in a sense that on normal system it contains nothing)
>> rootfs on real host system.
> 
> I don't know why it's done exactly, but if you try to ls -l inside this
> dir, you may find that it's empty - it seems to be some kind of
> side-effect of unsharing the mount namespace. Probably this approach is
> used to keep access to the container's rootfs from hn (still there is another
> way of accessing - through proc-fs, but it's much more complicated, imo).

I know it's empty.

What I don't know is _why_ it is _used_ to begin with.  Why the rootfs
_directory_ can't be used directly.

>> But maybe /var/lib/lxc/rootfs is better suited for that instead of a random
>> name in /tmp ?  And maybe it's a good idea to actually show whole mount tree
>> (at least as long as it's not modified in a container) on a host system?
> 
> Mounts are done after 'cloning', i.e. in other namespace. They are
> inaccessible from outside of container, so, following your previous
> words - this is nonsense =)

They're inaccessible, but they're shown by tools like lsof (that is,
/proc/$pid/fd/ points to non-existing ugly-named stuff).

Here I asked why /tmp/randomname is better than /var/lib/lxc/containername.
The latter looks at least less ugly, and it becomes obvious which container
it is.  My eye hurts when I see some deep directory hierarchies referenced
from /tmp/ - it immediately looks like some break-in attempt.  With
/var/lib/lxc/containername (or any other recognizable name in non-world-writable
dir) it is better from this point of view.

>> And finally, isn't it simpler to run a script (or an external command) to
>> prepare the container's namespace (and do other necessary things) than to try
>> to do everything from within the conffile?  I mean, instead of stuff like
>> the mounting (processing mounts file or conffile entries), setting up
>> cgroups(*), hostname, mounting consoles etc, there might be a place to call
>> a specified shell script that does all that and other things.
> 
> For me a conffile is much simpler to maintain than writing custom scripts.

But a script does not hurt anyway, does it?  And it lets you do anything that is
not covered by the supporting tools.  Like, for example, umounting extra things
before actually running the container, setting up selinux policies or security
attributes on the newly created dirs, making the /dev/mqueue filesystem optional for
systems that do not support it (I had to recompile my kernel because lxc_init
was failing) and so on.

[]
>> (*) for cgroups, especially for devices, it's quite ugly to specify things
>> by device numbers, having in mind the dynamic nature of devices nowadays.
> 
> Device numbers change really rarely. Actually, I don't think that the device
> number for /dev/null, zero or tty* has ever changed =)

Yes, a few well-known device nodes never changed.  But heck, /dev/null is
far easier to read than "c 1:3".  That's about simplicity of maintenance
too.
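
As a sketch of how names could be used instead of numbers: the helper
below (a hypothetical name) looks up the numbers from the device node and
prints them in the "type major:minor access" form used for the devices
cgroup.  It assumes a stat that supports -c with %t/%T (GNU coreutils or
busybox) and a default access of "rwm":

```shell
#!/bin/sh
# Hypothetical helper: turn a device node path into a devices-cgroup rule.
# Usage: dev_to_cgroup_rule /dev/null [access]
dev_to_cgroup_rule() {
    node=$1
    access=${2:-rwm}
    if [ -c "$node" ]; then
        type=c
    elif [ -b "$node" ]; then
        type=b
    else
        echo "$node: not a device node" >&2
        return 1
    fi
    # %t and %T are the major and minor numbers in hex; -L follows symlinks
    set -- $(stat -L -c '%t %T' "$node")
    printf '%s %d:%d %s\n' "$type" "0x$1" "0x$2" "$access"
}

# Example: on Linux, dev_to_cgroup_rule /dev/null  prints  c 1:3 rwm
```

The config (or script) then says /dev/null and the numbers are looked up
on the host at container start.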

> Yeah. And from what source would the shell script take the values that it
> should apply? You will need some config-structure anyway. For me, shell

A shell script _is_ a config file.  No need to write some advanced shell
script that parses some config file, just write a bunch of inline commands
with direct config parameters.  For example, write a few 'mount foo bar'
lines instead of the same number of lines in fstab format.  Some can
be made conditional, some may depend on the host node and so on.
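
Something along these lines, as a sketch only.  The paths, the function
names and the DRY_RUN switch are assumptions made up for the example;
nothing here is an existing lxc interface:

```shell
#!/bin/sh
# Hypothetical per-container setup script: the "config" is the list of
# commands itself.  With DRY_RUN set, commands are printed instead of run.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}

setup_mounts() {
    rootfs=$1
    # plain entries, one command per would-be fstab line
    run mount -t proc proc "$rootfs/proc"
    run mount -t sysfs sysfs "$rootfs/sys"
    # a conditional entry: only bind the shared tree on hosts that have it
    if [ -d /srv/shared ]; then
        run mount --bind /srv/shared "$rootfs/srv/shared"
    fi
}

# Example: DRY_RUN=1; setup_mounts /var/lib/lxc/demo/rootfs
```

The conditional is the part an fstab-style file cannot express without
growing its own little language.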

> script is something redundant. If you have some examples of why it's
> better than a conffile - post them, please. (Actually, most of the work that
> lxc does could be done from a shell script, but how fast will it be? =))

See above for quite a few examples.  And especially please tell me why
reinventing a bad wheel is better than using an already existing good one... :)

As for speed.  If you want to create a container for every http request,
you're lost already.  It isn't a big deal to run a few commands at system
startup really.  Especially since it's not a real bottleneck.  A few ms
to run a shell and spawn a few mount commands from there is not a big deal.

If you're concerned about speed, don't try to collect all lxc.mount.entry
lines into a temp file, run lxc.mount handler on it and remove it later,
but process each lxc.mount.entry directly instead ;)
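
The one-pass idea, sketched in shell rather than in lxc's C code.  The
function name is hypothetical, and it assumes each lxc.mount.entry value
is an fstab-style line.  It prints the mount commands instead of running
them, and needs no temp file:

```shell
#!/bin/sh
# Hypothetical helper: read an lxc-style config on stdin and handle every
# lxc.mount.entry line as it is seen.
each_mount_entry() {
    while IFS= read -r line; do
        case "$line" in
            lxc.mount.entry*=*) ;;   # an entry line, handle it below
            *) continue ;;           # anything else, including comments
        esac
        entry=${line#*=}
        # split the fstab-style value into its fields
        read -r src dst fstype opts _rest <<EOF
$entry
EOF
        [ -n "$opts" ] || continue   # skip malformed entries
        printf 'mount -t %s -o %s %s %s\n' "$fstype" "$opts" "$src" "$dst"
    done
}

# Example: each_mount_entry < /path/to/container/config
```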

Thanks!

/mjt