[lxc-users] lxc_monitor exiting, but not cleaning monitor-fifo?

Florian Klink flokli at flokli.de
Fri Apr 4 20:22:05 UTC 2014


On 02.04.2014 16:42, Dwight Engen wrote:
> On Tue, 01 Apr 2014 22:15:25 +0200
> Florian Klink <flokli at flokli.de> wrote:
> 
>> On 01.04.2014 01:49, Dwight Engen wrote:
>>> On Mon, 31 Mar 2014 23:18:13 +0200
>>> Florian Klink <flokli at flokli.de> wrote:
>>>
>>>> On 31.03.2014 21:13, Dwight Engen wrote:
>>>>> On Mon, 31 Mar 2014 20:34:15 +0200
>>>>> Florian Klink <flokli at flokli.de> wrote:
>>>>>
>>>>>> On 31.03.2014 20:10, Dwight Engen wrote:
>>>>>>> On Sat, 29 Mar 2014 23:39:33 +0100
>>>>>>> Florian Klink <flokli at flokli.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> when running multiple lxc actions in a row using the command line
>>>>>>>> tools, I sometimes observe the following state:
>>>>>>>>
>>>>>>>>
>>>>>>>> - lxc-monitord is not running anymore
>>>>>>>> - /run/lxc/var/lib/lxc/monitor-fifo still exists, but is
>>>>>>>> "refusing connection"
>>>>>>>>
>>>>>>>> In the logs, I then see the following:
>>>>>>>>
>>>>>>>>
>>>>>>>> lxc-start 1395671045.703 ERROR    lxc_monitor - connect : backing off 10
>>>>>>>> lxc-start 1395671045.713 ERROR    lxc_monitor - connect : backing off 50
>>>>>>>> lxc-start 1395671045.763 ERROR    lxc_monitor - connect : backing off 100
>>>>>>>> lxc-start 1395671045.864 ERROR    lxc_monitor - connect : Connection refused
>>>>>>>>
>>>>>>>>
>>>>>>>> ... and the command fails.
>>>>>>>  
>>>>>>> The only time I've seen this happen is if lxc-monitord is hard
>>>>>>> killed so it doesn't have a chance to clean up and remove the
>>>>>>> socket.
>>>>>>
>>>>>> Here, it's happening quite frequently. However, the script never
>>>>>> kills lxc-monitord on its own, it just tries to detect and fix
>>>>>> this state by removing the socket file...
>>>>>
>>>>> Right, removing the socket file makes it so another lxc-monitord
>>>>> will start, but the question is why is the first one exiting
>>>>> without cleaning up? Can you reliably reproduce it at will? If so
>>>>> then maybe you could attach an strace to lxc-monitord and see why
>>>>> it is exiting.
>>>>
>>>> I was so far not successful in reproducing the bug while having an
>>>> strace running. :-( But I'll continue to try!
>>
>> Success :-) I managed to get an strace while trying to reproduce the
>> bug. I gzipped and attached it to this mail.
>>
>> It's the output of strace -f -s 200 /usr/lib/lxc/lxc-monitord
>> /var/lib/lxc /run/lxc/var/lib/lxc/monitor-fifo &> strace_output.txt
>>
>> I fired a bunch of lxc-starts and lxc-stops in a row, then stopped my
>> script and waited for lxc-monitord (and strace too) to stop.
>>
>> Then I started my script again and had the "leftover monitor-fifo
>> state".
> 
> Unfortunately, I don't think that strace shows the problem. It looks to
> me like a normal exit with a successful
> unlink("/run/lxc//var/lib/lxc/monitor-fifo") = 0 right near the end.
> 
> You can't really run monitord by hand like that since it is expecting a
> pipe fd as argv[2]. That's why I was suggesting attaching to it. So
> something like:
> 
> lxc-start <your ct>
> lxc-monitor -n '.*'
> 
> in another terminal:
> ps aux |grep monitord  -> find the pid of lxc-monitord
> strace -v -t -o straceout.txt -p <pid of monitord>
> 
> and then do whatever you do to make things fail :)

I was not able to get an strace of the bug. I think it is only
triggered by a lot of lxc-monitord start/stop traffic ;-)
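
Since lxc-monitord comes and goes quickly under that kind of traffic, attaching by hand is hard to time. A rough sketch of automating the "find the pid, then strace -p" steps quoted above follows. The helper names, the polling interval, and the per-pid output file name are assumptions for illustration only, and polling can still miss a very short-lived monitord.

```python
import os
import subprocess
import time


def find_pids(name, proc_root="/proc"):
    """Return the pids whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def strace_command(pid, out_dir="."):
    """Build the strace invocation from the mail, one output file per pid."""
    out = os.path.join(out_dir, "straceout.%d.txt" % pid)
    return ["strace", "-v", "-t", "-o", out, "-p", str(pid)]


def watch(name="lxc-monitord", interval=0.05):
    """Attach strace to every new process called `name` (runs until interrupted)."""
    traced = {}
    while True:
        for pid in find_pids(name):
            if pid not in traced:
                traced[pid] = subprocess.Popen(strace_command(pid))
        time.sleep(interval)
```

Calling watch() as root (or as the user that owns lxc-monitord) in one terminal and then running the failing script in another should leave one straceout.<pid>.txt per monitord instance.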

> 
>>>>>>>> A possible workaround would be checking for non-running
>>>>>>>> lxc-monitord process but existing monitor-fifo file then
>>>>>>>> removing the fifo if it exists before running the next lxc
>>>>>>>> command, but that's ugly ;-)
>>>>>>>
>>>>>>> Is there a good non-racy way to do this? I guess monitord could
>>>>>>> write its pid in $LXCPATH and we could kill(pid, 0) it. 
>>>>
>>>> I also think that lxc should be able to recover from this problem
>>>> automatically.
>>>
>>> I agree, though I would like to understand the root cause. Can you
>>> try out the attached patch? I think it will cure your issues.
>>>
>>
>> Thanks for the patch! Just tell me if you need more information for
>> the strace above. If not, I'll happily apply the patch :-)
> 
> You can try the patch to see if it solves your issue, though I'd still
> like to understand why it's happening in the first place. I may rework
> the patch based on Serge's suggestion, but it'd be nice to know if the
> one I sent does fix what you are seeing. It worked for all the
> hard-kill cases I tried.

Both patches, the pidfile version and the reworked version, fixed my
problem. So I'm very happy with it :-)
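
For anyone stuck on an unpatched version, the pidfile idea from earlier in the thread (monitord writes its pid, callers probe it with kill(pid, 0)) could be sketched as below. This is not the code from either patch. The pidfile location and the function names are hypothetical, and the check is still racy: a new monitord may start between the probe and the unlink, and a recycled pid can make a dead monitord look alive.

```python
import os


def pid_is_alive(pid):
    """Probe a process with signal 0: nothing is delivered, only existence is checked."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid(pidfile):
    """Return the pid stored in `pidfile`, or None if it is missing or malformed."""
    try:
        with open(pidfile) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def remove_stale_fifo(pidfile, fifo):
    """Unlink `fifo` when no live process owns it. Returns True if it was removed."""
    pid = read_pid(pidfile)
    if pid is not None and pid_is_alive(pid):
        return False  # a monitor is still running, leave its fifo alone
    try:
        os.unlink(fifo)
    except FileNotFoundError:
        return False
    return True
```

A wrapper script would call remove_stale_fifo() with its own pidfile path and /run/lxc/var/lib/lxc/monitor-fifo before each lxc command.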


Will this patch also go to the stable-1.0 branch?
I'd really like to see this fixed in the 1.0.3 release ;-)

Florian
