[lxc-devel] setns/fork(glibc) weird interaction, lxc-attach may hang sporadically

Christian Seiler christian at iwakd.de
Tue Apr 23 14:03:49 UTC 2013


Hi there,

I've found a problem that comes from the interaction between setns()
and glibc's implementation of fork() (at least on x86_64). It is rather
complicated to describe, and forgive me if my mail is very long. I
know of several ways how to address the problem but wanted to have a
discussion first before actually changing something.

For starters, let's read some glibc source code, just because it's
so much fun. ;-)

 From nptl/sysdeps/unix/sysv/linux/fork.c (glibc source code, [1])

-------------------------------------------------------------------
   _IO_list_lock ();

#ifndef NDEBUG
   pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid);
#endif

   /* ... stuff left out, including syscall to fork itself ... */

   if (pid == 0)
     {
       struct pthread *self = THREAD_SELF;

       assert (THREAD_GETMEM (self, tid) != ppid);

       /* ... stuff left out ... */

       /* Reset locks in the I/O code.  */
       _IO_list_resetlock ();
-------------------------------------------------------------------

The problem here is the following: Before the actual kernel syscall
fork, glibc's fork() function (which is called from within lxc-attach)
first locks all stdio stuff (for thread safety, because forking with
multiple threads can lead to trouble otherwise), then retrieves the
current thread's thread id (which is equivalent to the process id in
single-threaded applications) and then does the actual fork().

Then, if the return value of the kernel is 0 (i.e. we are the child
process of the fork), it retrieves the current thread id again and then
does an assert() that seems trivial, namely that the current thread id
(i.e. process id of the child process) is not equal to the process id
of the parent process it gathered earlier.

This is when things fall apart with setns(): The current lxc-attach
logic is to do one fork(), then do a setns() in the child and do
another fork() and exec() in the grandchild, the parent and grandparent
both waiting for the child to exit. The first fork() is irrelevant for
this discussion (it is only there for user namespaces anyway, which
don't change anything here), but the main problem is the following
basic logic:

1. setns()
2. fork()

Because you can have the situation where the initial process has a
thread id of 42. The setns() call for the pid namespace doesn't
actually change the process/thread id (this is by design), it only
makes sure that all children of the current process are in the new pid
namespace (and then have a PPID of 0 inside the pidns).

Now it gets interesting: fork() creates a said child. The pid of that
child in the root namespace (in which the pid of the parent resides)
is obviously not going to be 42, but most likely 43 or so. However, in
the child namespace, where the child will call the kernel and ask for
its own pid/tid, it may be the case that the child receives the pid 42!
This means that the tid/pid of the child will be 42, and also the ppid
variable in the glibc code will be 42 and then that assert() will fail!

This means that even though attach worked perfectly, fork() may fail
spuriously if there is a pid collision. (And it's only glibc's fork(),
not the kernel's, that one still works fine!)

Now it gets even worse: if assert() fails, it wants to do a printf().
The problem is that the _IO_list_lock() call before the fork(), which
needs to be there for thread safety, locks all I/O, so printf()
deadlocks. That means that the child process of lxc-attach hangs
because the assert can never cause the process to exit and the parent
process waits forever (it doesn't know the process had some problem,
becaues one would expect fork() to either work or return with an error,
not deadlock).

The following example shows how the problem manifests itself in real
life:

root at host:~# ps ax | grep attach
16225 pts/12   S+     0:00 lxc-attach -n container -- test -d /
16226 pts/12   S+     0:00 lxc-attach -n container -- test -d /
16227 pts/12   S+     0:00 lxc-attach -n container -- test -d /
root at host:~# pstree -alp 16225
lxc-attach,16225 -n container -- test -d /
   └─lxc-attach,16226 -n container -- test -d /
       └─lxc-attach,16227 -n container -- test -d /

Separate session:

root at container:~# ps ax | grep attach
16226 ?        S+     0:00 lxc-attach -n container -- test -d /

The process 16226 inside the container corresponds to the process 16227
outside the container. As you can see, the pid of the process inside is
the same as the parent's pid outside, which, according to my
explanation above, causes the deadlock.

If I run lxc-attach again with the same parameters on the host, it will
easily succeed, unless I run into the same problem again by chance.

To reproduce the problem in order to see what was actually going on and
to be able to debug this and come up with the above explanation, I ran
a script in an endless loop trying to do the above command again and
again until it finally deadlocked.

Killing the process will terminate lxc-attach successfully, but make it
fail.

The last bit with the deadlock I see as a clear problem in glibc(),
since the assert() will deadlock either way, regardless of the reason.
I will report that bug later. (Also note that if using -DNDEBUG the
code will fail to compile due to a missing variable declaration, there
should be additional #ifdefs around the asserts(), although this leads
me to believe that nobody has compiled glibc with -DNDEBUG in at least
six years, otherwise this would have surfaced much, much earlier.)

The question is: What do we do to resolve the other problem? I don't
think the assert in glibc() in an by itself should necessarily go away,
because in the vast, vast, vast majority of cases the check is very
reasonable.

The long-term solution for glibc could be to also wrap setns() to
remember if it was called to enter a new pidns and in that case not to
do the check, but otherwise still do it.

But we don't know against what version of glibc lxc-attach is going to
be build, so we need a better solution in the short term:

  - Call the fork syscall directly via syscall(__NR_fork)
    Since lxc-attach is single-threaded anyway, the things glibc does
    aren't strictly needed, although the cache for getpid() is not
    updated afterwards.

  - Use clone(), where apparently, if I read the glibc source correctly,
    the wrapper doesn't use that logic. Makes life a bit more
    complicated because we need an extra function and can't just check
    the return value of a syscall. Advantages: glibc's clone() wrapper
    does seem to update the pid cache according to the source [2] and
    we could use CLONE_PARENT to keep only one process around (make the
    middle one exit immediately) and make the synchronization logic a
    bit simpler.

  - Do something else?

The only possible complication could arrive from the fact that the
pthread_atfork() handlers are not called with either the fork() syscall
or the clone() function. This is relevant in cases where the host uses
e.g. nss-ldap. That being said, we do exec() away pretty fast (and any
open fds or such nss implementations will have O_CLOEXEC set), but we
will do that only after invoking getpwuid() to get the shell if no
command was specified. If a socket or the such is still open then, nss
could produce some weird effects since it's the same socket as that of
the parent process. On the other hand, lxc-attach doesn't do nss
lookups at all before fork()ing away, so this is probably a non-issue.

My suggestion would be to use clone() and to use that occasion to
simplify the synchronization logic between all of those processes by
using CLONE_PARENT. If I can get an agreement on this, I'll implement
this, but I wanted to hear some thoughts in advance.

-- Christian

[1] 
http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/unix/sysv/linux/fork.c?rev=1.16&content-type=text/x-cvsweb-markup&cvsroot=glibc
[2] 
http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/sysdeps/unix/sysv/linux/x86_64/clone.S?rev=1.7&content-type=text/x-cvsweb-markup&cvsroot=glibc





More information about the lxc-devel mailing list