[lxc-users] How to recover from ERROR state

Mon Jan 14 09:05:46 UTC 2019

On 11-09-18 15:13, Kees Bakker wrote:
> Hey,
>
> Every now and then we have one or more containers in state ERROR.
> Is there a clever method to recover from that, other than
> rebooting the LXD server?
>
> Killing the monitor and the forkstart does help. And also a kworker
> process (kworker/u16:0) is eating up one of the CPUs with 100% load.
> lxc info gives "error: Monitor is hung"
>
> I'm running Ubuntu 16.04 with BTRFS. The kernel is 4.15.0-33-generic

Today it happened again, this time on an Ubuntu 18.04
system with the lvm storage backend and kernel 4.15.0-34-generic.
We don't usually stop and start containers. While they are running,
everything is fine. But when we stop and start a container there
is a good chance of triggering this ERROR state. This time I needed to
change the profile to get a bigger root volume in the container.

An lxc monitor process is hanging, and a kworker is at 100% CPU
load. The "lxc start" command hangs, and "lxc list" now shows the
container in ERROR state. "lxc info" shows
    Error: Monitor is hung
Killing the monitor does not help to recover from this situation. The only
thing I can do is reboot the LXD host. As you can imagine, this is
horrible, since there are several other containers running.
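
For anyone who wants to check for the same symptoms, here is a minimal
sketch of how the suspicious processes could be listed in one go. It only
parses the output of "ps -eo pid,stat,pcpu,comm". The function names
(parse_ps, suspects) and the CPU threshold are made up for illustration.
It also assumes that a hung process shows up in uninterruptible sleep
(state D), which I have not confirmed for this case. Note that ps reports
CPU usage averaged over the lifetime of the process, so a kworker that
only recently started spinning may need a lower threshold.

```python
import os
import subprocess


def parse_ps(text):
    """Parse the output of `ps -eo pid,stat,pcpu,comm` into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, stat, pcpu, comm = parts
        rows.append({
            "pid": int(pid),
            "stat": stat,
            "pcpu": float(pcpu),
            "comm": comm.strip(),
        })
    return rows


def suspects(rows, cpu_threshold=90.0):
    """Return processes in state D, plus kworkers at or above cpu_threshold.

    Both criteria are assumptions about what a "hung" system looks like.
    """
    found = []
    for row in rows:
        blocked = row["stat"].startswith("D")
        busy_kworker = (row["comm"].startswith("kworker")
                        and row["pcpu"] >= cpu_threshold)
        if blocked or busy_kworker:
            found.append(row)
    return found


if __name__ == "__main__":
    # LC_ALL=C keeps the decimal separator in %CPU a dot.
    env = dict(os.environ, LC_ALL="C")
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,pcpu,comm"],
        capture_output=True, text=True, check=True, env=env,
    ).stdout
    for row in suspects(parse_ps(out)):
        print(row["pid"], row["stat"], row["pcpu"], row["comm"])
```

Run on the LXD host while a container is stuck, this prints one line per
suspicious process. The PIDs could then be included in a bug report.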

Christian told us that this is probably a kernel problem.
  "If it is a kernel bug you're hitting there's nothing that LXD can do to help you."

What I would like to know:
* Are there more people who see this problem?
* If not, why are we hitting it so often?
* Which kernel problem are we talking about?

LXD is great, but this problem is becoming a nightmare, snif.
-- 
Kees