[lxc-devel] read-only container root

Tue Feb 16 15:37:16 UTC 2010

Daniel Lezcano wrote:
> Michael Tokarev wrote:
>> lxc-start: No such file or directory - failed to mount a new instance 
>> of '/dev/pts'
>> I'm experimenting with a read-only root fs in the container.
>> So far it does not work.
>>
>> First of all, when trying to start a container in a read-only root
>> lxc-start complains:
>>   lxc-start: Read-only file system - can't make temporary mountpoint
>>
>> This is in conf.c:setup_rootfs_pivot_root() function.  That function
>> uses optional parameter "lxc.pivotdir", or creates (and later removes)
>> a temporary directory for pivot_root.  Obviously there's no way to
>> create a directory in a read-only filesystem.
>>
> Why do you need to use a read-only root fs ?

There's no _need_, but it's an extension of the principle of least
privilege.  It also helps keep things tidier, and it guarantees
that nothing bad happens to the system itself in case of an
unexpected power failure and the like (in that case, say, /var
might still be badly damaged, but the system will boot far enough
that some repair tools are available).

>> But lxc.pivotdir does not work either. In the function mentioned above
>> it is used with leading dot (eg. if I specify "lxc.pivotdir=pivot" in
>> the config file the pivot_root() syscall will be made to ".pivot" with
>> leading dot, not to "pivot"), but later on it is used without that dot,
>> and fails...
[]
> It's a bug introduced with the pivot_root feature. Investigation on the 
> way.

I tried to debug it too, but realized that the last git repo I have
locally is from 22nd Jan, which is almost a month ago, and I've
seen quite a few changes mentioned on the list.  So either the
changes haven't been committed, or the git repository has
moved somewhere else.  Actually, that was going to be my 3rd email:
asking what's up with the git repo... ;)

[]
> Ok, so your need is to call a script between:
> 
> lxc.mount.entry = /dev dev tmpfs noexec,nosuid,mode=0755
> 
> ...
> lxc.tty = 4
> 
> where the script will populate /dev, right ?
> 
> mmh, not obvious.

Or maybe just call it _instead_ of specifying all the
above (lxc.mount.entry and lxc.tty), leaving only things
such as network device setup (which can't easily be done
from shell) to lxc-start.

[]
> What about the lxc.script configuration line which calls a script at the 
> point it is in the configuration file ?

That's not possible.  The configuration is an _unordered_ set
of key=value pairs.  lxc-start currently calls its setup functions
in a predefined (hard-coded) order, regardless of the
order in which the config file is written.
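
A minimal sketch of that point, not lxc's actual code: the parser
and the SETUP_ORDER list below are hypothetical, and only the two
keys quoted earlier in this thread are used.  It shows why the
position of a line in the config file cannot select the moment at
which it takes effect.

```python
# Hypothetical hard-coded order of setup steps; lxc-start's real order
# and function names are not shown in this thread.
SETUP_ORDER = ["lxc.mount.entry", "lxc.tty"]


def parse_config(text):
    """Collect 'key = value' lines into a dict of lists.

    The position of a key in the file is discarded here, which is the
    point being illustrated.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only: values such as mode=0755 contain '='.
        key, value = line.split("=", 1)
        config.setdefault(key.strip(), []).append(value.strip())
    return config


def setup_sequence(config):
    """Return (key, value) pairs in the hard-coded order, not file order."""
    return [(key, value)
            for key in SETUP_ORDER
            for value in config.get(key, [])]
```

Feeding it the same two lines in either order gives the same
sequence, so a script hook would need its own fixed slot in that
sequence rather than a position in the file.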

The specified script (lxc.script) would also have to be called at
some ("random") predetermined point in the container setup procedure.
In that case the script can _replace_ some things from the
config file if they're in the "wrong" order or are getting in
the way.  But it's still not obvious where that "random"
place should be: for example, is it before lxc-start (implicitly!)
mounts /dev/pts or after?  For my example the script should
run before /dev/pts is mounted, but maybe someone will want
to run some other program that uses pseudo-terminals, which
obviously should be done after /dev/pts is mounted (granted,
I can't think of such a situation/program for now).

>> The whole mess started when I realized that bind-mounting host's /dev
>> works perfectly _except_ the syslogging, -- /dev/log does not work with
>> multiple containers, only the container where syslogd (re)started last
>> works, all the rest gives "ECONNREFUSED" when trying to send any message
>> to /dev/log.
>>   
> /dev/log is an af_unix socket, the network is isolated, the af_unix 
> belongs to the network namespace.
> It's probable /dev/log is unlinked, created again and bound by syslogd.
> So as /dev/ is shared between the containers, the last one gets the socket.
> Any process outside of that container trying to access this socket won't
> be able to.

That's what I figured, and it's quite an obvious thing to do really.

Actually it might be a good idea not to start syslogd in containers
and to inherit the real /dev from the host; this way all logging will
automatically be sent to the central syslog (hopefully :).  But that
works only until the host syslogd is restarted, and at that point
we're back at ECONNREFUSED.
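
The failure mode can be reproduced without containers.  The sketch
below is only an illustration under assumptions: a temporary path
stands in for /dev/log, and a socket file left behind by a closed
socket stands in for the old inode that a bind mount would keep
pointing at.  The function names are made up for the example.

```python
import errno
import os
import socket
import tempfile


def try_connect(path):
    """Connect a datagram client to `path`; return 0, or the errno on failure."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        client.connect(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        client.close()


def restart_demo():
    """Return (errno seen on the stale socket file, errno after a re-bind)."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "log")  # stand-in for /dev/log

    # The "old syslogd" binds its socket, then goes away.  The socket file
    # stays on disk, but nothing is bound to that inode any more.
    old = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    old.bind(path)
    old.close()
    stale = try_connect(path)

    # The "restarted syslogd" unlinks the path and binds again, which
    # creates a new socket file at the same name.
    os.unlink(path)
    new = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    new.bind(path)
    fresh = try_connect(path)

    new.close()
    os.unlink(path)
    os.rmdir(workdir)
    return stale, fresh
```

A client that resolves the path again after the restart connects
fine; one that still reaches the old socket file, as through an
earlier bind mount, gets ECONNREFUSED.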

Note my other email about mounting new filesystems within containers.
In this context: after restarting syslogd on the host, is it possible
to bind-mount the host's /dev/log onto the container's /dev/log again
(provided they were bind-mounted before)?

Thanks!

/mjt