[lxc-users] lxc_monitor exiting, but not cleaning monitor-fifo?

Tue Apr 8 14:30:02 UTC 2014

On Fri, 04 Apr 2014 22:22:05 +0200
Florian Klink <flokli at flokli.de> wrote:

> On 02.04.2014 16:42, Dwight Engen wrote:
> > On Tue, 01 Apr 2014 22:15:25 +0200
> > Florian Klink <flokli at flokli.de> wrote:
> > 
> >> On 01.04.2014 01:49, Dwight Engen wrote:
> >>> On Mon, 31 Mar 2014 23:18:13 +0200
> >>> Florian Klink <flokli at flokli.de> wrote:
> >>>
> >>>> On 31.03.2014 21:13, Dwight Engen wrote:
> >>>>> On Mon, 31 Mar 2014 20:34:15 +0200
> >>>>> Florian Klink <flokli at flokli.de> wrote:
> >>>>>
> >>>>>> On 31.03.2014 20:10, Dwight Engen wrote:
> >>>>>>> On Sat, 29 Mar 2014 23:39:33 +0100
> >>>>>>> Florian Klink <flokli at flokli.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> when running multiple lxc actions in a row using the command
> >>>>>>>> line tools, I sometimes observe the following state:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> - lxc-monitord is not running anymore
> >>>>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
> >>>>>>>> "refusing connection"
> >>>>>>>>
> >>>>>>>> In the logs, I then see the following:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> lxc-start 1395671045.703 ERROR    lxc_monitor - connect : backing off 10
> >>>>>>>> lxc-start 1395671045.713 ERROR    lxc_monitor - connect : backing off 50
> >>>>>>>> lxc-start 1395671045.763 ERROR    lxc_monitor - connect : backing off 100
> >>>>>>>> lxc-start 1395671045.864 ERROR    lxc_monitor - connect : Connection refused
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ... and the command fails.
> >>>>>>>  
> >>>>>>> The only time I've seen this happen is if lxc-monitord is hard
> >>>>>>> killed so it doesn't have a chance to clean up and remove the
> >>>>>>> socket.
> >>>>>>
> >>>>>> Here, it's happening quite frequently. However, the script
> >>>>>> never kills lxc-monitord on its own, it just tries to detect
> >>>>>> and fix this state by removing the socket file...
> >>>>>
> >>>>> Right, removing the socket file makes it so another lxc-monitord
> >>>>> will start, but the question is why is the first one exiting
> >>>>> without cleaning up? Can you reliably reproduce it at will? If
> >>>>> so then maybe you could attach an strace to lxc-monitord and
> >>>>> see why it is exiting.
> >>>>
> >>>> I was so far not successful in reproducing the bug while having
> >>>> an strace running. :-( But I'll continue to try!
> >>
> >> Success :-) I managed to get an strace while trying to reproduce
> >> the bug. I gzipped and attached it to this mail.
> >>
> >> It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
> >> /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
> >>
> >> I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
> >> script and waited for lxc-monitord (and strace too) to stop.
> >>
> >> Then I started my script again and had the "leftover monitor-fifo
> >> state".
> > 
> > Unfortunately, I don't think that strace shows the problem. It
> > looks to me like a normal exit with a successful
> > unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.
> > 
> > You can't really run monitord by hand like that since it is
> > expecting a pipe fd as argv[2]. That's why I was suggesting
> > attaching to it. So something like:
> > 
> > lxc-start <your ct>
> > lxc-monitor -n '.*'
> > 
> > in another terminal:
> > ps aux |grep monitord  -> find the pid of lxc-monitord
> > strace -v -t -o straceout.txt -p <pid of monitord>
> > 
> > and then do whatever you do to make things fail :)
> 
> I was not able to get an strace of the bug. I think it is only
> triggered by a lot of lxc-monitord start/stop traffic ;-)
> 
> > 
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> A possible workaround would be checking for non-running
> >>>>>>>> lxc-monitord process but existing monitor-fifo file then
> >>>>>>>> removing the fifo if it exists before running the next lxc
> >>>>>>>> command, but that's ugly ;-)
> >>>>>>>
> >>>>>>> Is there a good non-racy way to do this? I guess monitord
> >>>>>>> could write its pid in $LXCPATH and we could kill(pid, 0) it. 
> >>>>
> >>>> I also think that lxc should be able to recover from this problem
> >>>> automatically.
> >>>
> >>> I agree, though I would like to understand the root cause. Can you
> >>> try out the attached patch? I think it will cure your issues.
> >>>
> >>
> >> Thanks for the patch! Just tell me if you need more information for
> >> the strace above. If not, I'll happily apply the patch :-)
> > 
> > You can try the patch to see if it solves your issue, though I'd
> > still like to understand why it's happening in the first place. I
> > may rework the patch based on Serge's suggestion, but it'd be nice
> > to know if the one I sent does fix what you are seeing. It worked
> > for all the hard-kill cases I tried.
> 
> Both patches, the pidfile version and the reworked version, fixed my
> problem. So I'm very happy with it :-)
> 
> 
> Will this patch also go to the stable-1.0 branch?
> I'd really like to see this fixed in the 1.0.3 release ;-)

Looks like Stéphane did pull it onto stable so you should be good.
Thanks for trying to debug/strace it. I still don't know why this is
happening in the first place but at least this should work around the
problem when it does happen.
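
For reference, the kill(pid, 0) liveness check discussed earlier in the
thread could be sketched roughly as below. This is only an illustration in
Python, not the patch that went onto stable: the pidfile is an assumption
(lxc-monitord is not claimed to write one), and the helper names pid_alive,
read_pid and remove_stale_fifo are made up for the example.

```python
import os


def pid_alive(pid):
    """Return True if a process with this pid exists.

    Signal 0 delivers nothing; the kernel only performs the existence
    and permission checks, which is all we need here.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True   # EPERM: it exists but belongs to someone else
    return True


def read_pid(pidfile):
    """Return the pid stored in pidfile, or None if missing or unparsable."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def remove_stale_fifo(fifo_path, pidfile):
    """Unlink fifo_path when pidfile names no live process.

    Returns True only if the file was actually removed.
    """
    pid = read_pid(pidfile)
    if pid is not None and pid_alive(pid):
        return False  # the daemon looks alive, leave its file alone
    try:
        os.unlink(fifo_path)
    except FileNotFoundError:
        return False  # nothing left over to clean up
    return True
```

A caller would run remove_stale_fifo() with the monitor-fifo path and the
hypothetical pidfile before issuing the next lxc command. Note that this is
still racy: the pid can be reused by an unrelated process, and a new
lxc-monitord can start between the check and the unlink.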

> Florian
>