[lxc-devel] Shutting down containers properly

Fri May 25 11:56:01 UTC 2012

Hi,

Currently, lxc-stop sends SIGKILL to the init process of the container,
which causes all the other processes in the container to also receive
a SIGKILL. I don't think that is a good course of action, since sending
SIGKILL to, for example, a database server can lead to data loss.

A much better way of stopping containers would, in my opinion, be to
first send the container a shutdown signal - and then wait for a
specified amount of time before really killing the container with
SIGKILL.

Unfortunately, none of the usual init systems for system containers
(sysvinit, upstart, systemd) will shut down the container in response
to SIGTERM, so it is not quite as easy. I've looked a bit at different
init systems to see how to properly shut them down:

  - lxc application containers (lxc-execute): lxc-init will do a
    kill(-1, SIGTERM) if it receives a SIGTERM itself, so sending
    it a SIGTERM is sufficient to initiate a proper shutdown

  - sysvinit: open /run/initctl (newer Debian) or /dev/initctl (older
    Debian and other distros) and write a binary request to it that
    switches to runlevel 0

  - upstart: connect to DBus and tell it to switch to runlevel 0

  - systemd: either connect to DBus and tell it to switch runlevel or
    send it SIGRTMIN + 4, which also causes a shutdown

  - sysvinit + upstart + systemd also all provide a 'telinit' binary,
    where calling 'telinit 0' will initiate a shutdown
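
To illustrate the sysvinit case, here is a rough sketch of what the
binary request could look like. The layout (four ints followed by a
368-byte data area), the magic number and the command value are taken
from my reading of sysvinit's initreq.h and should be double-checked
against the header; the function name and the sleeptime default are
just assumptions for the example.

```python
import struct

INIT_MAGIC = 0x03091969   # magic number as defined in sysvinit's initreq.h
INIT_CMD_RUNLVL = 1       # request type: change runlevel


def build_runlevel_request(runlevel: str, sleeptime: int = 5) -> bytes:
    """Pack a 'switch runlevel' request laid out like struct init_request.

    Hypothetical helper: the sleeptime default is an assumption.
    """
    # magic, cmd, runlevel (as a character code), sleeptime, then a
    # 368-byte data area that stays zeroed for a runlevel change.
    return struct.pack("=iiii368s", INIT_MAGIC, INIT_CMD_RUNLVL,
                       ord(runlevel), sleeptime, b"")
```

The resulting bytes would then be written to the container's initctl
FIFO; build_runlevel_request("0") corresponds to 'telinit 0'.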

My proposal would be the following:

lxc-stop first sends a new SHUTDOWN command (instead of the current
STOP command), which initiates the shutdown and returns immediately.
The command handler in lxc-start will then initiate a shutdown of the
container (see below). lxc-stop will wait for a given number of seconds
and if the container is not stopped by then, it will send the current
STOP command to actually kill the container with SIGKILL.

In addition, a new --force option would let lxc-stop still kill all
processes immediately.
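
The control flow I have in mind for lxc-stop is roughly the following.
This is only a sketch: the three callbacks stand in for the SHUTDOWN
command, a state query and the existing STOP command, and the names,
the 30 second default and the polling interval are all made up for
illustration.

```python
import time


def stop_container(request_shutdown, is_running, kill, timeout=30.0,
                   force=False, poll_interval=0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Return 'shutdown' if the container stopped on its own, else 'killed'.

    All parameters are hypothetical stand-ins for the real commands.
    """
    if not force:
        request_shutdown()            # proposed SHUTDOWN command, returns at once
        deadline = clock() + timeout
        while clock() < deadline:
            if not is_running():
                return "shutdown"
            sleep(poll_interval)
        if not is_running():          # one last check after the deadline
            return "shutdown"
    kill()                            # existing STOP command (SIGKILL)
    return "killed"
```

With force=True the shutdown request is skipped entirely, which matches
the current behaviour of lxc-stop.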

Now how to shut down the container? In lxc.conf there should be a new
configuration option, lxc.shutdown_method, which can carry the
following values: "application", "sysvinit", "systemd" and "exec".
For application containers started with lxc-execute, it will default
to "application", for system containers started with lxc-start, it will
default to "sysvinit".
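
As a sketch of the defaulting rule, assuming the parsed configuration
is available as a simple key/value mapping (that representation and the
function name are assumptions for the example, not how lxc stores its
configuration internally):

```python
VALID_METHODS = ("application", "sysvinit", "systemd", "exec")


def shutdown_method(config: dict, started_with: str) -> str:
    """Pick the shutdown method, falling back to the proposed defaults.

    'started_with' is either "lxc-execute" or "lxc-start" (hypothetical).
    """
    default = "application" if started_with == "lxc-execute" else "sysvinit"
    method = config.get("lxc.shutdown_method", default)
    if method not in VALID_METHODS:
        raise ValueError("unknown lxc.shutdown_method: %r" % method)
    return method
```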

The following actions will be performed:

"application": send SIGTERM to init process of the container

"sysvinit": fork(), child process does setns() for mount namespace,
             tries to write the runlevel request to /run/initctl or
             /dev/initctl (whichever exists), but first checks that
             the st_dev and st_ino entries do NOT match those of the
             host's files, so we don't accidentally shut down the
             host (if the container shares a filesystem with the host)

"systemd": send SIGRTMIN + 4 to init process of the container

"exec": run lxc-attach for the container with the contents of the
         new option lxc.shutdown_command as parameter
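
The signal-based methods and the exec method could be sketched like
this. The helper names are made up for the example, and I'm assuming
lxc-attach is invoked as 'lxc-attach -n NAME -- COMMAND'; the sysvinit
method needs the namespace handling described above and is left out
here.

```python
import shlex
import signal


def shutdown_signal(method):
    """Signal to send to the container's init process, or None.

    Hypothetical helper covering only the signal-based methods.
    """
    if method == "application":
        return int(signal.SIGTERM)
    if method == "systemd":
        return signal.SIGRTMIN + 4
    return None                       # sysvinit and exec do not use a signal


def attach_argv(container, shutdown_command):
    """Command line for the exec method (assumed lxc-attach invocation)."""
    return ["lxc-attach", "-n", container, "--"] + shlex.split(shutdown_command)
```

For the first two methods, the handler in lxc-start would then simply
call os.kill(init_pid, shutdown_signal(method)).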

I haven't included any explicit method for shutting down upstart, so
containers running upstart inside (assuming that's even possible, I
don't know much about upstart) should probably use method exec and
execute telinit 0 inside the container. Sending simple signals to the
init process as in application / systemd or opening a FIFO and writing
some bytes for sysvinit is still quite trivial, but implementing DBus
(esp. across container boundaries) - which would be required for native
upstart shutdown support - seems like overkill to me.

On the other hand, I do want to explicitly implement the sysvinit way,
since there we can check that we're definitely not going to shut down
the host accidentally (by checking the device/inode numbers of the
initctl FIFOs), which we can't be 100% sure of with exec.
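
The device/inode check could look roughly like this. It is only a
sketch: the function names are made up, and it assumes the host's
FIFOs are stat()ed before the child enters the container's mount
namespace, since afterwards the same paths resolve inside the
container.

```python
import os


def stat_existing(paths):
    """os.stat() every path that exists; missing paths are skipped."""
    found = []
    for path in paths:
        try:
            found.append(os.stat(path))
        except FileNotFoundError:
            pass
    return found


def is_host_fifo(candidate, host_fifos):
    """True if candidate is the same inode as one of the host's FIFOs."""
    return any((candidate.st_dev, candidate.st_ino) == (h.st_dev, h.st_ino)
               for h in host_fifos)
```

The child would call stat_existing(["/run/initctl", "/dev/initctl"])
on the host, do the setns(), stat the same paths again and refuse to
write to any of them for which is_host_fifo() returns True.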

Caveats:

1. application / systemd methods should always work, since we just send
a signal to the init process; sysvinit will only work if attaching
mount namespaces is implemented in the kernel and exec only if full
lxc-attach works (so all namespaces). But the worst case scenario here
is that we still kill all processes in the container with lxc-stop if
the kernel doesn't support attach, so there is no loss for current
users.

2. If the container is frozen, the current logic first sends the KILL
signal and then unfreezes it, so the container immediately goes away.
However, how should we react if we just want to shut it down? Unfreeze
it and send the shutdown signal? Or just kill it immediately? Or do
nothing and report an error?

Thoughts?

(Note: I'd be willing to implement this feature, once a consensus is
reached on how to proceed.)

Regards,
Christian