[Lxc-users] [systemd-devel] Unable to run systemd in an LXC / cgroup container.

Thu Oct 25 15:59:10 UTC 2012

Sorry for taking a few days to get back on this.  I was delivering a
guest lecture up at Fordham University last Tuesday so I was out of
pocket a couple of days or I would have responded sooner...

On Mon, 2012-10-22 at 16:59 -0400, Michael H. Warfield wrote:
> On Mon, 2012-10-22 at 22:50 +0200, Lennart Poettering wrote:
> > On Mon, 22.10.12 11:48, Michael H. Warfield (mhw at WittsEnd.com) wrote:
> > 
> > > > > To summarize the problem...  The LXC startup binary sets up various
> > > > > things for /dev and /dev/pts for the container to run properly and this
> > > > > works perfectly fine for SystemV start-up scripts and/or Upstart.
> > > > > Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
> > > > > on /dev/pts which then break things horribly.  This is because the
> > > > > kernel currently lacks namespaces for devices and won't for some time to
> > > > > come (in design).  When devtmpfs gets mounted over top of /dev in the
> > > > > container, it then hijacks the host's console tty and several other
> > > > > devices which had been set up through bind mounts by LXC and should have
> > > > > been LEFT ALONE.
> > > 
> > > > Please initialize a minimal tmpfs on /dev. systemd will then work fine.
> > > 
> > > > My containers have a reasonable /dev that works with Upstart just fine
> > > but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
> > > that minimal /dev required?

> > Well, it can be any kind of mount really. Just needs to be a mount. And
> > the idea is to use tmpfs for this.

> > What /dev are you currently using? It's probably not a good idea to
> > reuse the host's /dev, since it contains so many device nodes that
> > should not be accessible/visible to the container.

> Got it.  And that explains the problems we're seeing but also what I'm
> seeing in some libvirt-lxc related pages, which is a separate and
> distinct project in spite of the similarities in the name...

> http://wiki.1tux.org/wiki/Lxc/Installation#Additional_notes

> Unfortunately, in our case, merely getting a mount in there is a
> complication in that it also has to be populated but, at least, we
> understand the problem set now.

Ok...  Serge and I were corresponding on the lxc-users list and he had a
suggestion that worked, but which I consider a sub-optimal workaround.
Ironically, it was to mount devtmpfs on /dev.  We don't (currently)
have a method to auto-populate a tmpfs mount with the needed devices,
and devtmpfs provides them.  What makes me uncomfortable is that the
container now has visibility into the host's /dev.  I'm a security
expert and I'm not comfortable with that "solution" even with the
controls we have.  We can control access, but I'm still not happy with
it.
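
For reference, populating a tmpfs /dev by hand would amount to something
like the sketch below.  This is only an illustration, not what LXC does
today: the name populate_minimal_dev and the node list are my
assumptions, the major/minor numbers are the conventional Linux ones,
and the tmpfs is assumed to be mounted on the target directory already.
Creating real nodes needs root (CAP_MKNOD).

```python
import os
import stat

# Conventional Linux character devices: name -> (major, minor, permissions).
# The choice of which nodes count as "minimal" is an assumption.
MINIMAL_DEV_NODES = {
    "null": (1, 3, 0o666),
    "zero": (1, 5, 0o666),
    "full": (1, 7, 0o666),
    "random": (1, 8, 0o666),
    "urandom": (1, 9, 0o666),
    "tty": (5, 0, 0o666),
}


def populate_minimal_dev(dev_dir, nodes=MINIMAL_DEV_NODES, mknod=os.mknod):
    """Create the listed character devices under dev_dir.

    Returns the paths that were created.  Anything that already exists
    (for example a console or pts entry set up by bind mount) is skipped.
    The mknod parameter is injectable so the plan can be dry-run.
    """
    created = []
    for name, (major, minor, perms) in sorted(nodes.items()):
        path = os.path.join(dev_dir, name)
        if os.path.lexists(path):
            continue  # leave pre-existing entries alone
        # The final mode is still subject to the process umask.
        mknod(path, stat.S_IFCHR | perms, os.makedev(major, minor))
        created.append(path)
    return created
```

Passing a recording function as mknod gives a dry run that shows what
would be created without touching the filesystem.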

I now have a container that starts with systemd running more or less
properly.  We do have some problems with the convention that has been
set up, however.

When running in this mode, you run on the console and you don't spawn
gettys on the ttys.  There seem to be some problems with this.

In this mode, if I manually start the container in a terminal window,
that eventually results in a login prompt there.  Under sysvinit and
upstart I don't get that and can detach.

If I run lxc-console (which attaches to one of the vtys) it gives me
nothing.  Under sysvinit and upstart I get vty login prompts because
they have started getty on those vtys.  This is important in case
network access has not started for one reason or another and the
container was started detached in the background.

If I start lxc-start in detached mode (-d -o {logfile}), lxc-start
redirects the system console to the log file and daemonizes.  In this
case, the systemd container hangs and never starts.

I SUSPECT the hang has something to do with systemd trying to start an
interactive console on /dev/console, which sysvinit and upstart do not
do.  Maybe we have to do something different with the redirects in this
case, but it's not behaving consistently with the other init systems.
We should also start appropriate gettys on those vtys if they are
configured.  Maybe start a getty on each tty? that exists, up to a
configured limit (and don't restart one that immediately fails), and
obviously don't start one where the device doesn't exist.  It then
gives up control over that process.  Also don't start a login on
/dev/console if you DO start a getty?  That would make your behavior
congruent with that of the other two systems.
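
To make that suggestion concrete, here is a rough sketch of the decision
I have in mind.  It is not systemd code; the name plan_logins and the
default limit of 4 are made up for illustration, and it only covers
which devices to use, not the "don't restart on immediate failure" part.

```python
import os


def plan_logins(dev_dir="/dev", max_ttys=4):
    """Decide where logins should run under the policy described above.

    Returns (tty_paths, console_login): a getty is wanted on each
    tty1..tty<max_ttys> that actually exists, and a login on the
    console is wanted only if no getty would be started at all.
    """
    ttys = []
    for n in range(1, max_ttys + 1):
        path = os.path.join(dev_dir, "tty%d" % n)
        if os.path.exists(path):
            ttys.append(path)
    console_login = not ttys
    return ttys, console_login
```

With no tty devices present this falls back to the current console-only
behavior, so containers without vtys would see no change.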

I've got some more problems relating to shutting down containers, some
of which may be related to mounting tmpfs on /run, to which /var/run is
symlinked.  We're doing halt / restart detection by monitoring utmp in
that directory, but it looks like utmp isn't even in that directory
anymore, and mounting tmpfs on it was always problematic.  We may need
a more generic method to detect when a container has shut down or is
restarting in that case.  I'm also finding we end up with dangling
resources where we can't remove the cgroup directories after a halt,
and that creates a serious problem I have to investigate further.  Not
sure if it's a host problem running on F17 or if it's something to do
with running systemd in a container, but I cannot shut down this
particular container and subsequently restart it without restarting the
entire host.  "Not good" is an understatement.
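
For the dangling cgroup directories, a first diagnostic step might look
like the sketch below.  It relies on the cgroup "tasks" file and on the
fact that a cgroup directory cannot be removed while it still has tasks
or child groups.  The name cgroup_blockers is hypothetical, and where
the container's cgroup lives on a given host is left to the caller.

```python
import os


def cgroup_blockers(cgroup_dir):
    """Report what would stop rmdir() of a cgroup directory.

    Returns (pids, children): the PIDs still listed in the cgroup's
    'tasks' file and the names of any child cgroup directories.
    Both lists empty means nothing obvious is holding the cgroup.
    """
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        pids = [int(tok) for tok in f.read().split()]
    children = sorted(
        entry.name
        for entry in os.scandir(cgroup_dir)
        if entry.is_dir(follow_symlinks=False)
    )
    return pids, children
```

An empty result for a halted container could also serve as a shutdown
signal that doesn't depend on utmp, though I haven't tried that.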

Regards,
Mike

> > > > systemd will make use of pre-existing mounts if they exist, and only
> > > > mount something new if they don't exist.
> > > 
> > > So you're saying that, if we have something mounted on /dev, that's what
> > > prevents systemd from mounting devtmpfs on /dev?  
> 
> > Yes.
> 
> > > But, I have systemd running on my host system (F17) and containers with
> > > sysvinit or upstart inits are all starting just fine.  That sounds like
> > > it should impact all containers as pivot_root() is issued before systemd
> > > in the container is started.  Or am I missing something here?  That
> > > sounds like a problem for Serge and others to investigate further.  I'll
> > > see about trying that workaround though.
> 
> > The "shared" issue is F18, and it's about running LXC on a systemd
> > system, not about running systemd inside of LXC.
> 
> Whew!  I'll deal with F18 when I need to deal with F18.  That explains
> why my F17 hosts are running and gives Serge and others a chance to
> address this, forewarned.  Thanks for that info.
> 
> > Lennart
> 
> > -- 
> > Lennart Poettering - Red Hat, Inc.
> 
> Regards,
> Mike

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!