[Lxc-users] Containers slow to start after 1600

Mon Mar 11 17:41:33 UTC 2013

Quoting Benoit Lourdelet (blourdel at juniper.net):
> Hello,
> 
> I am running LXC 0.8.0 on kernel 3.7.9 and trying to start more than 1000 small containers, each using around 10MB of RAM.
> 
> Starting the first 1600 or so goes smoothly (I have a machine with 32 virtual cores), but then everything gets very slow:
> 
> up to a minute per container creation.  Ultimately the server CPU goes to 100%.
> 
> I get this error multiple times in the syslog:
> 
> 
> [ 2402.961711] INFO: task lxc-start:128486 blocked for more than 120 seconds.
> [ 2402.961717] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2402.961724] lxc-start D ffffffff8180cc60 0 128486 1 0x00000000
> [ 2402.961727] ffff883c30359cb0 0000000000000086 ffff883c2ea3c800 ffff883c2f550600
> [ 2402.961734] ffff883c2d955c00 ffff883c30359fd8 ffff883c30359fd8 ffff883c30359fd8
> [ 2402.961741] ffff881fd35e5c00 ffff883c2d955c00 ffff883c3533ec10 ffffffff81cac4e0
> [ 2402.961747] Call Trace:
> [ 2402.961753] [<ffffffff816dbfc9>] schedule+0x29/0x70
> [ 2402.961758] [<ffffffff816dc27e>] schedule_preempt_disabled+0xe/0x10
> [ 2402.961763] [<ffffffff816dadd7>] __mutex_lock_slowpath+0xd7/0x150
> [ 2402.961768] [<ffffffff8158b911>] ? net_alloc_generic+0x21/0x30
> [ 2402.961772] [<ffffffff816da9ea>] mutex_lock+0x2a/0x50
> [ 2402.961777] [<ffffffff8158c044>] copy_net_ns+0x84/0x110
> [ 2402.961782] [<ffffffff81081f4b>] create_new_namespaces+0xdb/0x180
> [ 2402.961787] [<ffffffff8108210c>] copy_namespaces+0x8c/0xd0
> [ 2402.961792] [<ffffffff81055ea0>] copy_process+0x970/0x1550
> [ 2402.961796] [<ffffffff8119e542>] ? do_filp_open+0x42/0xa0
> [ 2402.961801] [<ffffffff81056bc9>] do_fork+0xf9/0x340
> [ 2402.961806] [<ffffffff81199de6>] ? final_putname+0x26/0x50
> [ 2402.961811] [<ffffffff81199ff9>] ? putname+0x29/0x40
> [ 2402.961816] [<ffffffff8101d498>] sys_clone+0x28/0x30
> [ 2402.961819] [<ffffffff816e5c23>] stub_clone+0x13/0x20
> [ 2402.961823] [<ffffffff816e5919>] ? system_call_fastpath+0x16/0x1b

Interesting.  It could of course be some funky cache or hash issue, but
what does /proc/meminfo show?  10MB of RAM per container may be true in
userspace, but the network stacks etc. are also taking up kernel memory.
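
For anyone who wants to watch the kernel side of this while containers
start, here is a minimal Python sketch that pulls a few kernel-memory
fields out of /proc/meminfo.  The field names (Slab, SReclaimable,
SUnreclaim, KernelStack, PageTables) are standard meminfo entries, but
the choice of which ones to look at, and the helper names
parse_meminfo and kernel_memory_summary, are my own assumptions and
not something from the report above.

```python
import os

# Fields chosen for illustration; which ones matter here is an assumption.
KERNEL_FIELDS = ("MemFree", "Slab", "SReclaimable", "SUnreclaim",
                 "KernelStack", "PageTables")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values are reported in kB; a few (e.g. HugePages_Total) are
    plain counts and carry no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def kernel_memory_summary(fields):
    """Return only the kernel-related fields that are present."""
    return {k: fields[k] for k in KERNEL_FIELDS if k in fields}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        summary = kernel_memory_summary(parse_meminfo(f.read()))
    for name, value in summary.items():
        print(f"{name:>14}: {value} kB")
```

Running it once before starting containers and again after the
slowdown sets in would show whether the slab or other kernel-side
numbers have grown out of proportion to the 10MB per container seen
from userspace.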

I assume the above trace is one container waiting on another to finish
its netns alloc.  If you could get dmesg output from echo t >
/proc/sysrq-trigger during one of these slow starts it could show where
the other is hung.
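
The task dump can get long with this many containers, so here is a
rough Python sketch for picking the blocked tasks out of it.  It
assumes the task header lines look like the "lxc-start D ..." line in
the trace quoted above (name, state, address, free stack, pid, parent
pid, flags); that layout differs between kernel versions, and task
names containing spaces will not match.  The helper names
trigger_task_dump, read_dmesg and blocked_tasks are made up for this
example.

```python
import re
import subprocess

# Matches a task header line such as:
#   [ 2402.961724] lxc-start D ffffffff8180cc60 0 128486 1 0x00000000
# The layout is taken from the trace in this thread and is an assumption
# for any other kernel version.
TASK_LINE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s+)?"      # optional dmesg timestamp
    r"(?P<comm>\S+)\s+"              # task name
    r"(?P<state>[A-Z])\s+"           # scheduler state, D = uninterruptible
    r"[0-9a-f]{8,16}\s+"             # address column
    r"\d+\s+"                        # free stack
    r"(?P<pid>\d+)\s+"               # pid
    r"\d+\s+"                        # parent pid
    r"0x[0-9a-f]+"                   # flags
)


def trigger_task_dump(path="/proc/sysrq-trigger"):
    """Ask the kernel to dump all task states (needs root and sysrq enabled)."""
    with open(path, "w") as f:
        f.write("t")


def read_dmesg():
    """Return the current kernel log as text."""
    return subprocess.run(["dmesg"], capture_output=True, text=True,
                          check=True).stdout


def blocked_tasks(dmesg_text):
    """Return (name, pid) for every task reported in state D."""
    found = []
    for line in dmesg_text.splitlines():
        m = TASK_LINE.match(line.strip())
        if m and m.group("state") == "D":
            found.append((m.group("comm"), int(m.group("pid"))))
    return found
```

As root, during a slow start, calling trigger_task_dump() and then
blocked_tasks(read_dmesg()) should list the tasks sitting in D state;
the call traces that follow each of those header lines in the raw
dmesg output are what would show which one holds the lock the others
are waiting on.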

-serge