[lxc-devel] setns/fork(glibc) weird interaction, lxc-attach may hang sporadically

Tue Apr 23 16:55:49 UTC 2013

On Tue, 23 Apr 2013 15:03:49 +0100
Christian Seiler <christian at iwakd.de> wrote:

> Hi there,
> 
> I've found a problem that comes from the interaction between setns()
> and glibc's implementation of fork() (at least on x86_64). It is
> rather complicated to describe, and forgive me if my mail is very
> long. I know of several ways how to address the problem but wanted to
> have a discussion first before actually changing something.
> 
> For starters, let's read some glibc source code, just because it's
> so much fun. ;-)
> 
>  From nptl/sysdeps/unix/sysv/linux/fork.c (glibc source code, [1])
> 
> -------------------------------------------------------------------
>    _IO_list_lock ();
> 
> #ifndef NDEBUG
>    pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid);
> #endif
> 
>    /* ... stuff left out, including syscall to fork itself ... */
> 
>    if (pid == 0)
>      {
>        struct pthread *self = THREAD_SELF;
> 
>        assert (THREAD_GETMEM (self, tid) != ppid);
> 
>        /* ... stuff left out ... */
> 
>        /* Reset locks in the I/O code.  */
>        _IO_list_resetlock ();
> -------------------------------------------------------------------
> 
> The problem here is the following: Before the actual kernel syscall
> fork, glibc's fork() function (which is called from within lxc-attach)
> first locks all stdio stuff (for thread safety, because forking with
> multiple threads can lead to trouble otherwise), then retrieves the
> current thread's thread id (which is equivalent to the process id in
> single-threaded applications) and then does the actual fork().
> 
> Then, if the return value of the kernel is 0 (i.e. we are the child
> process of the fork), it retrieves the current thread id again and
> then does an assert() that seems trivial, namely that the current
> thread id (i.e. process id of the child process) is not equal to the
> process id of the parent process it gathered earlier.
> 
> This is when things fall apart with setns(): The current lxc-attach
> logic is to do one fork(), then do a setns() in the child and do
> another fork() and exec() in the grandchild, the parent and
> grandparent both waiting for the child to exit. The first fork() is
> irrelevant for this discussion (it is only there for user namespaces
> anyway, which don't change anything here), but the main problem is
> the following basic logic:
> 
> 1. setns()
> 2. fork()
> 
> Because you can have the situation where the initial process has a
> thread id of 42. The setns() call for the pid namespace doesn't
> actually change the process/thread id (this is by design), it only
> makes sure that all children of the current process are in the new pid
> namespace (and then have a PPID of 0 inside the pidns).
> 
> Now it gets interesting: fork() creates a said child. The pid of that
> child in the root namespace (in which the pid of the parent resides)
> is obviously not going to be 42, but most likely 43 or so. However, in
> the child namespace, where the child will call the kernel and ask for
> its own pid/tid, it may be the case that the child receives the pid
> 42! This means that the tid/pid of the child will be 42, and also the
> ppid variable in the glibc code will be 42 and then that assert()
> will fail!
> 
> This means that even though attach worked perfectly, fork() may fail
> spuriously if there is a pid collision. (And it's only glibc's fork(),
> not the kernel's, that one still works fine!)
> 
> Now it gets even worse: if assert() fails, it wants to do a printf().
> The problem is that the _IO_list_lock() call before the fork(), which
> needs to be there for thread safety, locks all I/O, so printf()
> deadlocks. That means that the child process of lxc-attach hangs
> because the assert can never cause the process to exit and the parent
> process waits forever (it doesn't know the process had some problem,
> becaues one would expect fork() to either work or return with an
> error, not deadlock).
> 
> The following example shows how the problem manifests itself in real
> life:
> 
> root at host:~# ps ax | grep attach
> 16225 pts/12   S+     0:00 lxc-attach -n container -- test -d /
> 16226 pts/12   S+     0:00 lxc-attach -n container -- test -d /
> 16227 pts/12   S+     0:00 lxc-attach -n container -- test -d /
> root at host:~# pstree -alp 16225
> lxc-attach,16225 -n container -- test -d /
>    └─lxc-attach,16226 -n container -- test -d /
>        └─lxc-attach,16227 -n container -- test -d /
> 
> Separate session:
> 
> root at container:~# ps ax | grep attach
> 16226 ?        S+     0:00 lxc-attach -n container -- test -d /
> 
> The process 16226 inside the container corresponds to the process
> 16227 outside the container. As you can see, the pid of the process
> inside is the same as the parent's pid outside, which, according to my
> explanation above, causes the deadlock.
> 
> If I run lxc-attach again with the same parameters on the host, it
> will easily succeed, unless I run into the same problem again by
> chance.
> 
> To reproduce the problem in order to see what was actually going on
> and to be able to debug this and come up with the above explanation,
> I ran a script in an endless loop trying to do the above command
> again and again until it finally deadlocked.
> 
> Killing the process will terminate lxc-attach successfully, but make
> it fail.
> 
> The last bit with the deadlock I see as a clear problem in glibc(),
> since the assert() will deadlock either way, regardless of the reason.
> I will report that bug later. (Also note that if using -DNDEBUG the
> code will fail to compile due to a missing variable declaration, there
> should be additional #ifdefs around the asserts(), although this leads
> me to believe that nobody has compiled glibc with -DNDEBUG in at least
> six years, otherwise this would have surfaced much, much earlier.)
> 
> The question is: What do we do to resolve the other problem? I don't
> think the assert in glibc() in an by itself should necessarily go
> away, because in the vast, vast, vast majority of cases the check is
> very reasonable.
> 
> The long-term solution for glibc could be to also wrap setns() to
> remember if it was called to enter a new pidns and in that case not to
> do the check, but otherwise still do it.
> 
> But we don't know against what version of glibc lxc-attach is going to
> be build, so we need a better solution in the short term:

Hi Christian, I agree we need a short term solution and one that works
with existing glibcs. My question is if you think that glibc will
eventually fix this at some point? I pulled down the most recent
core-utils to see if/how they had worked around it but it looks to me
like nsenter could possibly run into the same thing unless -F is given.
I can try to reproduce the hang with nsenter, which might help
encourage glibc to fix it sooner :)

>   - Call the fork syscall directly via syscall(__NR_fork)
>     Since lxc-attach is single-threaded anyway, the things glibc does
>     aren't strictly needed, although the cache for getpid() is not
>     updated afterwards.
> 
>   - Use clone(), where apparently, if I read the glibc source
> correctly, the wrapper doesn't use that logic. Makes life a bit more
>     complicated because we need an extra function and can't just check
>     the return value of a syscall. Advantages: glibc's clone() wrapper
>     does seem to update the pid cache according to the source [2] and
>     we could use CLONE_PARENT to keep only one process around (make
> the middle one exit immediately) and make the synchronization logic a
>     bit simpler.
> 
>   - Do something else?
> 
> The only possible complication could arrive from the fact that the
> pthread_atfork() handlers are not called with either the fork()
> syscall or the clone() function. This is relevant in cases where the
> host uses e.g. nss-ldap. That being said, we do exec() away pretty
> fast (and any open fds or such nss implementations will have
> O_CLOEXEC set), but we will do that only after invoking getpwuid() to
> get the shell if no command was specified. If a socket or the such is
> still open then, nss could produce some weird effects since it's the
> same socket as that of the parent process. On the other hand,
> lxc-attach doesn't do nss lookups at all before fork()ing away, so
> this is probably a non-issue.
> 
> My suggestion would be to use clone() and to use that occasion to
> simplify the synchronization logic between all of those processes by
> using CLONE_PARENT. If I can get an agreement on this, I'll implement
> this, but I wanted to hear some thoughts in advance.
> 
> -- Christian
> 
> [1] 
> http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/unix/sysv/linux/fork.c?rev=1.16&content-type=text/x-cvsweb-markup&cvsroot=glibc
> [2] 
> http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/sysdeps/unix/sysv/linux/x86_64/clone.S?rev=1.7&content-type=text/x-cvsweb-markup&cvsroot=glibc
> 
> 
> ------------------------------------------------------------------------------
> Try New Relic Now & We'll Send You this Cool Shirt
> New Relic is the only SaaS-based application performance monitoring
> service that delivers powerful full stack analytics. Optimize and
> monitor your browser, app, & servers with just a few lines of code.
> Try New Relic and get this awesome Nerd Life shirt!
> http://p.sf.net/sfu/newrelic_d2d_apr
> _______________________________________________ Lxc-devel mailing list
> Lxc-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lxc-devel