[lxc-devel] mounts...

Sat Nov 14 19:25:33 UTC 2009

On Sat, Nov 14, 2009 at 07:06:07PM +0300, Michael Tokarev wrote:
> What's wrong with chdir(rootfs)?  This way the mountpoints
> might be relative.  It works already if I cd to the rootfs
> before lxc-start, and it works with a tiny 3-liner patch
> for conf.c to do that internally.

Nothing is wrong with it, actually. To me it just looks redundant (it is
already done in setup_rootfs, which is called after setup_mounts) and
_maybe_ it has some problems when binding to the tmp dir. That's a
question for someone more expert; the best way to find out is to post
the patch you mentioned and ask upstream to review it.

> > There could be many possible situations, when you may need to mount
> > something inside hn when container is started. It could be, for example,
> > container's rootfs on some network file system, or some other resources.
> 
> I understand that sometimes it might be useful to mount something else
> when starting a container.  I _think_ it's more useful to have that same
> mount in a host still.  Maybe some template-root is not very good with
> that mounting on the host node, I dunno.  But this is not the point --
> the point is that most of the time, in some 99% cases, the mount point
> will be inside the container's root, and it's best to keep the whole
> thing "transportable" by default (so that it will be possible to move
> the rootfs and whole config still works).

Yes, you are right, but your first suggestion was to "_require_"
relative paths ;). Why might you want to keep, for example, the rootfs
mount out of the hn's /etc/fstab? Because, assuming it is located
somewhere on the network, it may not be available when the hn starts (it
may even be located inside one of the containers, I don't know why it
would be, but anyway =)) and so should be mounted on demand.
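
To make the "allow, but don't require" point concrete, here is a minimal
sketch of resolving a mount point against the container's rootfs. It is
not taken from conf.c; the function name and the example paths are
hypothetical. Relative mount points are taken under rootfs (so the
config stays transportable), while absolute ones are left alone.

```python
import posixpath


def resolve_mountpoint(rootfs, mountpoint):
    """Return the absolute mount target for one fstab-style entry.

    Hypothetical helper: a relative mount point is interpreted under
    rootfs, an absolute one is used as given.
    """
    if posixpath.isabs(mountpoint):
        # Absolute paths are still allowed, e.g. for on-demand network mounts.
        return posixpath.normpath(mountpoint)
    # Relative paths follow the rootfs, so moving the rootfs keeps the config valid.
    return posixpath.normpath(posixpath.join(rootfs, mountpoint))
```

With this, moving the rootfs only changes the first argument; entries
such as "proc" or "dev/pts" need no edits.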

> Shell script is THE most flexible approach.  Because you may extend it
> any weird way without trying to program it all beforehand in the support
> tools.  Shell language already has well-defined variables and variable
> substitutions, control structures and many other things, and it's commonly
> understood by almost everyone.  There's no need to invent something of
> this sort again but much, much more limited.

Eh? That was about the fstab file and forcing relative paths.

> You misunderstand.
> 
> I'm concerned about the "other" filesystems which are visible in the
> /proc/mounts inside the container.  At a first glance one can't access
> these, but I'd say better be safe than sorry and just umount at least
> the ones which can be umounted.
> 
> For that, again, a shell script that will set up the namespace is a
> way to go, at least initially (very easy to test some approach).
> 
> I see it is doing a simple chroot.  Which can be broken out if you're
> root, right? :)  So the thing _is_ of some concern, I think.

Yes, this makes sense, but you should be _very_ careful when unmounting:
/dev/ and some other mounts on the hn may be vital for the container
even though they are not accessible from it. Unmounting them may
succeed, but later initialization steps will then fail. In general,
though, this is a good idea, if you can find some universal pattern for
detecting "useless" mounts.
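
One possible pattern, as a rough sketch only: treat a mount as a
candidate for unmounting if it is neither inside the rootfs, nor an
ancestor of it, nor under a keep-list of host mounts assumed to be
vital. The function names and the default keep-list are assumptions for
illustration, not anything lxc does, and the list would need testing per
setup.

```python
def is_under(path, parent):
    """True if path equals parent or lies below it (pure string check)."""
    parent = parent.rstrip("/") or "/"
    if parent == "/":
        return True
    return path == parent or path.startswith(parent + "/")


def unmount_candidates(mountpoints, rootfs, keep=("/dev", "/proc", "/sys")):
    """Return mount points that look unneeded, deepest first.

    keep is an assumed list of host mounts that must survive.
    """
    rootfs = rootfs.rstrip("/") or "/"
    candidates = []
    for mp in mountpoints:
        if is_under(rootfs, mp):
            continue  # the rootfs itself, or a filesystem that holds it
        if is_under(mp, rootfs):
            continue  # mounted inside the container's root
        if any(is_under(mp, k) for k in keep):
            continue  # assumed vital for initialization
        candidates.append(mp)
    # Nested mounts have to go before their parents.
    return sorted(candidates, key=lambda p: p.count("/"), reverse=True)
```

The result is only a list of candidates; whether each one is really safe
to unmount still has to be checked by hand.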

> Well, /etc/mtab is an obsolete relic.  It should not be used, exactly because
> it does not reflect reality, because a process might be in different namespace
> and the like.  Also, I prefer read-only root filesystem, so that /etc is
> non-writable too (sure I can symlink /etc/mtab to /var/run/mtab or the
> like, but this is getting uglier and uglier).

Yeah? Could you tell that to mount's upstream? =) Also, /proc/mounts
does not show mount --bind mounts correctly, etc. /etc/mtab shows what
_was mounted_ by userspace, while /proc/mounts shows what is mounted,
whatever mounted it, without exact information on _how_ it was mounted,
only the result.
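
Both files use the same whitespace-separated layout (source, target,
type, options, then two numeric fields), so one small parser can read
either. The sketch below is illustrative; the names MountEntry and
parse_mount_table are made up here. It decodes the octal escapes (such
as \040 for a space) used in these tables.

```python
import re
from collections import namedtuple

MountEntry = namedtuple("MountEntry", "source target fstype options")


def _unescape(field):
    # Space, tab, newline and backslash are stored as 3-digit octal escapes.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mount_table(text):
    """Parse the text of /proc/mounts or /etc/mtab into MountEntry tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip blank, short or comment lines
        source, target, fstype, options = fields[:4]
        entries.append(
            MountEntry(_unescape(source), _unescape(target), fstype, options.split(","))
        )
    return entries
```

Running it over both files and comparing the entries for the same target
is a quick way to see where the two views disagree.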

Em. A read-only root file system for a container? OK, but to me it is
better to have a single, small read-write rootfs for each container
than 5-6 file systems for each with a read-only rootfs. The container
could then be easily resurrected after a disk failure =).

> Yes, /etc/mtab is a possible work-around, at least till the right solution
> can be found.  It won't work in all cases (and won't work in several of my use
> cases already - I plan to move a running system from a virtual machine to
> a container, and on that system we already utilize different mount namespaces,
> chroots with other distros and the like, where /proc/mounts is quite important).

Yes, yes, but it is used by some utilities that don't know what 'ugly'
means, so if you are speaking about your own experience with df or
something like that, the solution is simple: clean up mtab =)

> 
> But it is still a workaround, and such one that brings us back to legacy stuff.
> 
> But umounting some stuff before running the container definitely helps.
> 
> I wonder if it is possible to go even further than that, with all the funny
> bind mounts and mount --move and pivot_root and stuff like that, and actually
> umount everything inside the container, everything that is not needed.

Maybe. You may try. But it's definitely not a major issue that needs
urgent fixing, for me at least =)

> I know it's empty.
> 
> What I don't know is _why_ it is _used_ to begin with.  Why the rootfs
> _directory_ can't be used directly.

I've given my opinion about that. You may comment out those few lines
and test. You would probably end up with an empty rootfs directory. In
particular, this would prevent you from mounting directories from one
container inside other containers. But, again, you should test it.

> Here I asked why /tmp/randomname is better than /var/lib/lxc/containername.
> The latter looks at least less ugly, and it becomes obvious which container
> it is.  My eye hurts when I see some deep directory hierarchies referenced
> from /tmp/ - it immediately looks like some break-in attempt.  With
> /var/lib/lxc/containername (or any other recognizable name in non-world-writable
> dir) it is better from this point of view.

That's not a security issue, I suppose, as it actually bind-mounts the
specified rootfs inside that dir.

> 
> But a script does not hurt anyway, does it?  And it lets you do anything that is
> not covered by the supporting tools.  Like, for example, umounting extra things
> before actually running the container, setting up selinux policies or security
> attributes on the newly created dirs, making /dev/mqueue filesystem optional for
> systems that do not support it (I had to recompile my kernel because lxc_init
> was failing) and so on.

Of course it does not =) But, still, it should not _replace_ the config
file and should probably somehow report back to lxc all the changes that
lxc can understand, for logging or further reporting to the user, for
instance.

> Yes, a few well-known device nodes never changed.  But heck, /dev/null is
> far easier to read than "c 1:3".  That's about simplicity of maintenance
> too.

If you can find some easy way of mapping node->number without
specifying a giant database... And also, what about regular expressions?
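
When the node already exists on the host, one way to get the numbers
without any database is to ask stat() for them. The sketch below assumes
exactly that (a node to look at); the function names are hypothetical,
and it does not address nodes that are missing on the host or
pattern-style rules.

```python
import os
import stat


def describe_device(mode, rdev):
    """Format a device as 'c MAJOR:MINOR' or 'b MAJOR:MINOR'."""
    if stat.S_ISCHR(mode):
        kind = "c"
    elif stat.S_ISBLK(mode):
        kind = "b"
    else:
        raise ValueError("not a device node")
    return "%s %d:%d" % (kind, os.major(rdev), os.minor(rdev))


def describe_path(path):
    """Look up an existing node such as /dev/null on the host."""
    st = os.stat(path)
    return describe_device(st.st_mode, st.st_rdev)
```

A config could then accept a path and translate it at load time, at the
cost of depending on what the host's /dev happens to contain.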

> A shell script _is_ a config file.  No need to write some advanced shell
> script that parses some config file, just write a bunch of inline commands
> with direct config parameters.  For example, write a few 'mount foo bar'
> lines instead of the same amount of lines in fstab format.  Some can
> be made conditional some may depend on the host node and so on.

If it does no work beyond what a config file does, then yes, it is. But
now think: what if you have thousands of containers?

A shell script will always require more knowledge than a config file, so
it should not be the only option, imo. At least with a config file,
upstream can be sure where a user's problems come from.

> See above for quite a few examples.  And especially please tell me why
> reinventing a bad wheel is better than using an already existing good one... :)

Yes, yes. With exactly those words, some maintainers use lua scripts
for config, or python, or perl...

An option to specify a callback script for all the work that is more
complicated than the config can handle is, of course, a good one, but,
again, it should not replace the config.

Config is simpler, faster, and works in most setups.

Any work with policies, acls etc. I prefer to do inside wrapper scripts
that use the lxc-* commands, for instance inside some kind of rc-system
startup script.

Don't forget that the lxc tools are low-level utilities, so they are
probably intended to be embedded rather than to embed. But if you are up
for implementing a callback script, it might be useful, of course.

> As for speed.  If you want to create a container for every http request,
> you're lost already.  It isn't a big deal to run a few commands at system
> startup really.  Especially since it's not a real bottleneck.  A few ms
> to run shell and spawn a few mount commands from there is not a big deal.

Yeah. I remember my old iptables script, which was implemented in bash.
Now it is in perl; it still lags, but not so much, at least =)

Really, you never know how complicated it will be tomorrow, and how
laggy it will be then, especially if you start not one but several
containers at once to speed up the startup process (on multicore).

> If you're concerned about speed, don't try to collect all lxc.mount.entry
> lines into a temp file, run lxc.mount handler on it and remove it later,
> but process each lxc.mount.entry directly instead ;)

Hm, I haven't seen how lxc.mount.entry works, so I don't know what you
mean,