[lxc-devel] read-only container root

Tue Feb 16 20:48:59 UTC 2010

Michael Tokarev wrote:
> Daniel Lezcano wrote:
>> Michael Tokarev wrote:
>>> lxc-start: No such file or directory - failed to mount a new 
>>> instance of '/dev/pts'
>>> I'm experimenting with a read-only root fs in the container.
>>> So far it does not work.
>>>
>>> First of all, when trying to start a container in a read-only root
>>> lxc-start complains:
>>>   lxc-start: Read-only file system - can't make temporary mountpoint
>>>
>>> This is in conf.c:setup_rootfs_pivot_root() function.  That function
>>> uses optional parameter "lxc.pivotdir", or creates (and later removes)
>>> a temporary directory for pivot_root.  Obviously there's no way to
>>> create a directory in a read-only filesystem.
>>>
>> Why do you need to use a read-only root fs ?
>
> There's no _need_, but it's an extension of the principle of least
> privilege. It also helps keep things tidy, and it guarantees that
> nothing bad will happen to the system in case of an unexpected
> power failure and things like that (in that case, say, /var might
> still be badly damaged, but the system will actually boot to the
> point where some repair tools are available).

Agree, that makes sense.

>>> But lxc.pivotdir does not work either. In the function mentioned above
>>> it is used with leading dot (eg. if I specify "lxc.pivotdir=pivot" in
>>> the config file the pivot_root() syscall will be made to ".pivot" with
>>> leading dot, not to "pivot"), but later on it is used without that dot,
>>> and fails...
> []
>> It's a bug introduced with the pivot_root feature. Investigation is 
>> under way.
>
> I tried to debug it too, but realized that the last git repo I have
> locally is from 22nd Jan, which is almost a month ago, and I've
> seen quite some changes mentioned on the list.  So either the
> changes haven't been committed, or the git repository has been
> moved somewhere else.  It actually was the 3rd email I planned to
> write, asking what's up with the git repo... ;)
No, the repo is still there, but at the moment I am very busy, so I 
haven't pushed the patches to the git tree.
There is a pending patch to fix a readlink which I didn't take because the 
patchset I am about to commit removes this code.

The pending patchset contains the console fix I mentioned some weeks 
ago, when I said the console sucks, plus the shutdown / reboot support.
I sent the patchset, but as nobody commented on it, I supposed there was 
no hurry for it.

> []
>> Ok, so your need is to call a script between:
>>
>> lxc.mount.entry = /dev dev tmpfs noexec,nosuid,mode=0755
>>
>> ...
>> lxc.tty = 4
>>
>> where the script will populate /dev, right ?
>>
>> mmh, not obvious.
>
> Or maybe just call it _instead_ of specifying all the
> above (lxc.mount.entry and lxc.tty), leaving only things
> such as network device setup (which can't easily be done
> from shell) to lxc-start.
The console ttys are created outside of the container and the /dev/pts/X 
is bind mounted on $rootfs/dev/ttyY
 - if this option is not set you won't have the tty
 - if this option is set but /dev is not yet populated (no /dev/ttyY) 
the tty creation will fail.

It is for this reason that I proposed this feature, which is tough to 
implement, but maybe it's overkill.
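
The second failure case above can be checked before the container is 
started. A minimal Python sketch, assuming the ttys are named tty1..ttyN 
under $rootfs/dev; the function name and that numbering are assumptions 
made for illustration, not lxc code:

```python
import os


def missing_tty_targets(rootfs, tty_count):
    """Return the bind-mount targets that do not exist yet under rootfs/dev."""
    # Assumption: lxc.tty = N maps to dev/tty1 .. dev/ttyN inside the rootfs.
    targets = [os.path.join(rootfs, "dev", "tty%d" % n)
               for n in range(1, tty_count + 1)]
    return [path for path in targets if not os.path.exists(path)]
```

An empty list means every /dev/ttyY target is present; anything else 
lists the nodes a populate script would have to create first.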

Otherwise, nothing prevents you from creating a configuration file with 
only the network and calling a script that does all the configuration, 
for example:

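#!/bin/sh
# Configure the network and the mounts, then hand over to init.
ip addr add ...
route add ...
mount ....
# chroot and exec in one step, so that init runs inside $rootfs
exec chroot "$rootfs" /sbin/init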

And then call lxc-start -n mycontainer mylauncher.sh

no ?

> []
>> What about the lxc.script configuration line which calls a script at 
>> the point it is in the configuration file ?
>
> That's not possible.  The configuration is an _unordered_ set
> of key=value pairs.  lxc-start now calls different functions
> in a pre-defined (hard-coded) order, regardless of the
> order in which the config file is written.
>
> The specified script (lxc.script) should also be called at some
> (random) pre-determined point in the container setup procedure.
> In that case the script can _replace_ some things from the
> config file if they're in the "wrong" order or are getting in
> the way.  But it's still not obvious where that "random"
> place is: for example, is it before lxc-start (implicitly!)
> mounts /dev/pts or after?  For my example the script should
> run before /dev/pts is mounted, but maybe someone will want
> to run some other program that uses pseudo-terminals, which
> obviously should be done after /dev/pts is mounted (granted
> I can't think of such a situation/program for now).

Yes, absolutely. The configuration will need some rework. This is why I 
said "not obvious" :)
Maybe I misunderstood, but do you have an alternative solution ?
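
To make the ordering problem concrete, here is a small Python sketch. 
It assumes a simplified model: the config is parsed into an unordered 
mapping (so repeated keys such as several lxc.mount.entry lines are not 
handled), and SETUP_ORDER is a made-up sequence, not the real order used 
by lxc-start. The lxc.script key is the hypothetical option discussed 
above, and hook_before is an illustrative parameter naming the step the 
script should run before.

```python
# Illustrative fixed order; NOT the actual sequence inside lxc-start.
SETUP_ORDER = ["lxc.rootfs", "lxc.mount.entry", "lxc.tty"]


def parse_config(text):
    """Parse key = value lines into a dict, ignoring blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def setup_sequence(config, hook_before=None):
    """Return the setup steps in execution order, with the script hook placed."""
    steps = []
    for key in SETUP_ORDER:
        # The hypothetical lxc.script runs just before the named step.
        if key == hook_before and "lxc.script" in config:
            steps.append("lxc.script")
        if key in config:
            steps.append(key)
    return steps
```

Whatever order the lines have in the file, setup_sequence() returns the 
same steps, so the position of the script has to be fixed by something 
other than its place in the config file.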

>>> The whole mess started when I realized that bind-mounting host's /dev
>>> works perfectly _except_ the syslogging, -- /dev/log does not work with
>>> multiple containers, only the container where syslogd (re)started last
>>> works, all the rest gives "ECONNREFUSED" when trying to send any message
>>> to /dev/log.
>>>   
>> /dev/log is an af_unix socket, the network is isolated, and the af_unix 
>> socket belongs to the network namespace.
>> Most likely /dev/log is unlinked, created again and bound by 
>> syslogd. So, as /dev/ is shared between the containers, the last one 
>> gets the socket.
>> Any process outside of that container trying to access this socket 
>> won't be able to.
>
> That's what I figured, and it's quite an obvious thing to do really.
>
> Actually it might be a good idea to not start syslogd in containers
> and inherit the real /dev from the host, -- this way all logging will be
> automatically sent to the central syslog (hopefully :).  But that works
> only until the host syslogd is restarted, and at that point we're
> back at ECONNREFUSED.
That won't work either, because the check is also done when connecting, so 
all the programs belonging to a container with a private network will 
fail to log anything. That should be easy to check with "logger".
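
Besides "logger", the connect step can be probed directly. A minimal 
Python sketch, assuming /dev/log is a datagram socket (on systems where 
it is a stream socket the probe would report a different error); the 
function name is made up for this example:

```python
import errno
import socket


def probe_log_socket(path="/dev/log"):
    """Try to connect to the log socket; return None or the errno name."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(path)
        return None
    except OSError as exc:
        # For example ECONNREFUSED when nobody is bound to the socket file.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
```

Run inside a container, a return value of "ECONNREFUSED" would match the 
failure described above.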

> Note my other email about mounting new filesystems within containers.
> In this context, like, after restarting syslogd on the host, is it possible
> to bind-mount host's /dev/log to container's /dev/log (provided they were
> bind-mounted before)?
Yes, IMO it is possible.

I did something similar by proxying the syslog with a process which 
inherited a fd connected to /dev/log in a container. This process then 
creates a new af_unix socket in a temporary location and bind mounts it 
to /dev/log. After that, it accepts data from this newly created socket 
and forwards it to the inherited fd.
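
A rough Python sketch of such a proxy, not the code I actually used. It 
assumes datagram sockets, and the function names and the file name "log" 
are chosen only for illustration. The bind mount of the temporary socket 
onto /dev/log needs root and is left out here.

```python
import os
import socket


def make_proxy_socket(directory):
    """Create a datagram af_unix socket bound inside a temporary directory."""
    path = os.path.join(directory, "log")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock, path


def forward_once(proxy_sock, upstream_sock, bufsize=8192):
    """Read one message from the proxy socket and relay it upstream."""
    data = proxy_sock.recv(bufsize)
    # upstream_sock stands for the inherited fd, already connected to /dev/log.
    upstream_sock.send(data)
    return data


def serve(proxy_sock, upstream_sock):
    """Relay messages until the process is killed."""
    while True:
        forward_once(proxy_sock, upstream_sock)
```

The path returned by make_proxy_socket() is the one that would be bind 
mounted onto the container's /dev/log before calling serve().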