[lxc-users] lxc_monitor exiting, but not cleaning monitor-fifo?

Wed Apr 2 14:42:41 UTC 2014

On Tue, 01 Apr 2014 22:15:25 +0200
Florian Klink <flokli at flokli.de> wrote:

> On 01.04.2014 01:49, Dwight Engen wrote:
> > On Mon, 31 Mar 2014 23:18:13 +0200
> > Florian Klink <flokli at flokli.de> wrote:
> > 
> >> On 31.03.2014 21:13, Dwight Engen wrote:
> >>> On Mon, 31 Mar 2014 20:34:15 +0200
> >>> Florian Klink <flokli at flokli.de> wrote:
> >>>
> >>>> On 31.03.2014 20:10, Dwight Engen wrote:
> >>>>> On Sat, 29 Mar 2014 23:39:33 +0100
> >>>>> Florian Klink <flokli at flokli.de> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> when running multiple lxc actions in a row using the command line
> >>>>>> tools, I sometimes observe the following state:
> >>>>>>
> >>>>>>
> >>>>>> - lxc-monitord is not running anymore
> >>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
> >>>>>> "refusing connection"
> >>>>>>
> >>>>>> In the logs, I then see the following:
> >>>>>>
> >>>>>>
> >>>>>> lxc-start 1395671045.703 ERROR    lxc_monitor - connect : backing off 10
> >>>>>> lxc-start 1395671045.713 ERROR    lxc_monitor - connect : backing off 50
> >>>>>> lxc-start 1395671045.763 ERROR    lxc_monitor - connect : backing off 100
> >>>>>> lxc-start 1395671045.864 ERROR    lxc_monitor - connect : Connection refused
> >>>>>>
> >>>>>>
> >>>>>> ... and the command fails.
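
The "backing off 10/50/100" lines in the log are consistent with a client that retries a refused connect after short, growing delays and then gives up. A minimal Python sketch of that pattern follows. It is not the actual lxc code: the function name and the use of a filesystem UNIX socket path are assumptions made for illustration.

```python
import errno
import socket
import time

# Delays in milliseconds, mirroring the "backing off" values in the log.
BACKOFF_MS = (10, 50, 100)


def connect_with_backoff(path, delays_ms=BACKOFF_MS, sleep=time.sleep):
    """Connect to a UNIX stream socket, retrying while nobody is listening.

    Hypothetical helper: one attempt is made before each delay and one final
    attempt after the last delay; the last error is re-raised on failure.
    """
    last_error = None
    for delay in (*delays_ms, None):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except OSError as exc:
            sock.close()
            # Only "nobody listening" / "no such path" are worth retrying.
            if exc.errno not in (errno.ECONNREFUSED, errno.ENOENT):
                raise
            last_error = exc
        if delay is not None:
            sleep(delay / 1000.0)
    raise last_error
```

A socket path left behind by a dead server fails every attempt with ECONNREFUSED, which would match the final log line.
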
> >>>>>  
> >>>>> The only time I've seen this happen is if lxc-monitord is hard
> >>>>> killed so it doesn't have a chance to clean up and remove the
> >>>>> socket.
> >>>>
> >>>> Here, it's happening quite frequently. However, the script never
> >>>> kills lxc-monitord on its own, it just tries to detect and fix
> >>>> this state by removing the socket file...
> >>>
> >>> Right, removing the socket file makes it so another lxc-monitord
> >>> will start, but the question is why is the first one exiting
> >>> without cleaning up? Can you reliably reproduce it at will? If so
> >>> then maybe you could attach an strace to lxc-monitord and see why
> >>> it is exiting.
> >>
> >> So far I have not managed to reproduce the bug while an strace
> >> was running. :-( But I'll keep trying!
> 
> Success :-) I managed to get an strace while trying to reproduce the
> bug. I gzipped and attached it to this mail.
> 
> It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
> /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
> 
> I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
> script and waited for lxc-monitord (and strace too) to stop.
> 
> Then I started my script again and had the "leftover monitor-fifo
> state".

Unfortunately, I don't think that strace shows the problem. It looks to
me like a normal exit with a successful
unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.

You can't really run monitord by hand like that since it is expecting a
pipe fd as argv[2]. That's why I was suggesting attaching to it. So
something like:

lxc-start <your ct>
lxc-monitor -n '.*'

in another terminal:
ps aux |grep monitord  -> find the pid of lxc-monitord
strace -v -t -o straceout.txt -p <pid of monitord>

and then do whatever you do to make things fail :)
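
If the attach step needs to be scripted, the pid lookup and the strace command line could be wrapped roughly as below. This is only a sketch: both helper names are made up, the lookup relies on the Linux /proc layout, and the strace options are the ones from the command shown earlier in this mail.

```python
import os


def find_pids_by_name(name, proc="/proc"):
    """Return pids whose /proc/<pid>/comm equals `name` (Linux only)."""
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc, entry, "comm")) as fh:
                if fh.read().rstrip("\n") == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def strace_attach_argv(pid, outfile="straceout.txt"):
    """Build the strace attach command line for one pid."""
    return ["strace", "-v", "-t", "-o", outfile, "-p", str(pid)]
```

The resulting list can be handed to subprocess.run() once find_pids_by_name("lxc-monitord") has returned a pid.
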

> >>>
> >>>>>
> >>>>>>
> >>>>>> A possible workaround would be to check whether monitor-fifo
> >>>>>> exists while no lxc-monitord process is running, and if so
> >>>>>> remove the fifo before running the next lxc command, but
> >>>>>> that's ugly ;-)
> >>>>>
> >>>>> Is there a good non-racy way to do this? I guess monitord could
> >>>>> write its pid in $LXCPATH and we could kill(pid, 0) it. 
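
The pid-file idea could look roughly like the Python sketch below. It assumes a pid file that lxc-monitord does not currently write, and both paths are placeholders supplied by the caller. Sending signal 0 only checks that the process exists; nothing is delivered.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists.

    Signal 0 performs the existence and permission checks only; nothing is
    actually delivered to the target process.
    """
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one pid
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_fifo(pidfile, fifo):
    """Unlink `fifo` when the pid recorded in `pidfile` is no longer running.

    Both paths are hypothetical; returns True only if the fifo was removed.
    """
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        pid = None  # no usable pid file: treat the monitor as gone
    if pid is not None and pid_is_alive(pid):
        return False
    try:
        os.unlink(fifo)
    except FileNotFoundError:
        return False
    return True
```

This is still not race-free: the pid may have been recycled by an unrelated process, and a new monitor could start between the check and the unlink.
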
> >>
> >> I also think that lxc should be able to recover from this problem
> >> automatically.
> > 
> > I agree, though I would like to understand the root cause. Can you
> > try out the attached patch? I think it will cure your issues.
> > 
> 
> Thanks for the patch! Just tell me if you need more information for
> the strace above. If not, I'll happily apply the patch :-)

You can try the patch to see if it solves your issue, though I'd still
like to understand why it's happening in the first place. I may rework
the patch based on Serge's suggestion, but it'd be nice to know if the
one I sent does fix what you are seeing. It worked for all the
hard-kill cases I tried.

> >>>>>  
> >>>>>> Is this behaviour known? Is there some missing "cleanup code"
> >>>>>> in lxc(_monitord) or why is it failing like this?
> >>>>>  
> >>>>> Currently it catches SIGILL, SIGSEGV, SIGBUS, and SIGTERM and
> >>>>> cleans up. Other than hard kill I'm not sure what else might
> >>>>> cause it to exit without cleaning up.
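
The cleanup-on-signal approach, and why a hard kill defeats it, can be illustrated with a small Python sketch. It is not the lxc-monitord source: the helper names are invented, and the signal set is chosen for the sketch rather than copied from the list above, since fault signals such as SIGSEGV are not meaningfully handled from Python.

```python
import os
import signal
import sys


def make_cleanup_handler(path):
    """Build a signal handler that unlinks `path` and then exits."""
    def handler(signum, frame):
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # already gone, nothing to clean up
        sys.exit(128 + signum)  # conventional "killed by signal" exit status
    return handler


def install_cleanup(path, signals=(signal.SIGTERM, signal.SIGINT, signal.SIGHUP)):
    """Install the cleanup handler for each catchable signal in `signals`."""
    handler = make_cleanup_handler(path)
    for sig in signals:
        signal.signal(sig, handler)
    return handler
```

SIGKILL cannot be caught at all, so no handler of this kind runs after a kill -9 and the file stays behind.
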
> >>>>
> >>>> I shut down containers with `lxc-stop -n container-name`
> >>>> (lxc.stopsignal=30 (SIGPWR)), however this signal should never go
> >>>> to lxc_monitord, right?
> >>>
> >>> Right, that goes to the init process of the container. 
>