[lxc-devel] Potential deadlock with lxcfs and lxc-freeze

Wed Feb 17 15:21:36 UTC 2016

> Fabian Grünbichler <f.gruenbichler at proxmox.com> wrote on 12 February 2016 at
> 13:53:
> 
> Summary so far: uptime, ps and any other process accessing /proc/uptime within
> a container using lxcfs can pretty reliably make the whole container
> unfreezable and cause the uptime-accessing process itself and the associated
> lxcfs process to wait forever. Both states persist until either the container
> is shut down or the waiting lxcfs process is forcibly killed. The latter will
> also allow an ongoing, hanging freeze to finish.

Finally got to the bottom of this issue. It is somewhat mitigated in recent
lxcfs versions by the recently introduced init PID caching mechanism: a double
fork no longer happens on every read of /proc/uptime, but only each time init's
PID is cached (again).

The root cause is described in a glibc bug from 2013[1], which also describes
the only possible workaround for this at the moment: after using setns() to
change the PID namespace (which only applies to the caller's children), the
parent must not use fork(), but must create the child via clone(). Otherwise,
the forked child and the parent might have identical PIDs, which trips an
assertion in glibc's fork.c (the colliding PIDs are in different namespaces,
but glibc does not care about PID namespaces in this code path). Using clone()
avoids the issue because there is no such assertion in the clone code path ;)
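
To illustrate the workaround, here is a minimal sketch of creating and reaping
a child with clone() instead of fork(). The helper run_in_clone_child() and the
callback example_child() are hypothetical names, not existing lxcfs functions.
In lxcfs the parent would first call setns() on the target's PID namespace;
that step is left out here because it needs privileges, so the sketch only
shows the clone()-based child creation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/*
 * Hypothetical helper (not part of lxcfs): run fn(arg) in a child process
 * created with clone() rather than fork(). In the setns() scenario, the
 * caller would have joined the target PID namespace before calling this.
 * Returns the child's exit status (0-255), or -1 on error.
 */
static int run_in_clone_child(int (*fn)(void *), void *arg)
{
    assert(fn != NULL);

    char *stack = malloc(CHILD_STACK_SIZE);
    if (!stack)
        return -1;

    /*
     * No CLONE_VM: the child gets its own copy of the address space, like
     * fork(). SIGCHLD makes the child reapable with waitpid(). The stack
     * grows downwards on common architectures, so pass its top.
     */
    pid_t pid = clone(fn, stack + CHILD_STACK_SIZE, SIGCHLD, arg);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t ret;
    do {
        ret = waitpid(pid, &status, 0);
    } while (ret < 0 && errno == EINTR);

    free(stack);

    if (ret != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Example child body: exits with the integer passed in via arg. */
static int example_child(void *arg)
{
    return *(int *)arg;
}
```

Calling run_in_clone_child(example_child, &value) with value set to 7 should
return 7, since the return value of the clone() callback becomes the child's
exit status. Whether the lxcfs patches should wrap clone() in a helper like
this or call it inline at each site is a design choice I have not settled on.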

Limiting the range of free PIDs on the host and in the container so that both
share the same small range might still trigger the fork assertion in current
lxcfs (and thus the freeze bug, if lxc-freeze is called in parallel), although I
haven't tried this so far.

There are three occurrences of this (anti-)pattern ("fork() - setns() - fork()")
in the current lxcfs code base, which should probably be patched:
- write_task_init_pid_exit(), which was the culprit in the described uptime bug
- pid_to_ns_wrapper() and pid_from_ns_wrapper(), which are used for reading and
  writing the tasks and cgroup.procs files.

I'd be willing to write the patches, if desired.

Sidenote: freezing a container while running "cat
/sys/fs/cgroup/freezer/lxc/108/tasks
/sys/fs/cgroup/freezer/lxc/108/cgroup.procs" sometimes causes error messages in
the container: "cat: /sys/fs/cgroup/freezer/lxc/108/cgroup.procs: Interrupted
system call" and less frequently, error messages in lxcfs' journal output: "Feb
17 16:11:54 host lxcfs[15643]: send_creds: Error getting reply from server over
socketpair". Whenever the error message in the journal appears, the lxc-freeze
process which is running at the same time hangs for a second or two (my guess is
that lxc-freeze and send_creds are racy). So far, I could not trigger indefinite
hangs of lxc-freeze or the container, like with the old uptime code.

Regards,
Fabian

1: https://sourceware.org/bugzilla/show_bug.cgi?id=15392