[Lxc-users] Containers slow to start after 1600
Benoit Lourdelet
blourdel at juniper.net
Tue Mar 12 17:31:32 UTC 2013
Hello Serge,
I am running on a 256GB RAM host, with plenty of free memory.
I issued echo t > /proc/sysrq-trigger while containers were taking 30s to start; it gave the following. Nothing caught my attention.
This block is repeated for each running container:
[46825.718046] rt_rq[31]:/lxc/lwb2002
[46825.718048] .rt_nr_running : 0
[46825.718050] .rt_throttled : 0
[46825.718052] .rt_time : 0.000000
[46825.718053] .rt_runtime : 0.000000
then :
[46825.718056]
[46825.718056] rt_rq[31]:/lxc
[46825.718059] .rt_nr_running : 0
[46825.718060] .rt_throttled : 0
[46825.718062] .rt_time : 0.000000
[46825.718064] .rt_runtime : 0.000000
[46825.718069]
[46825.718069] rt_rq[31]:/libvirt/lxc
[46825.718071] .rt_nr_running : 0
[46825.718073] .rt_throttled : 0
[46825.718075] .rt_time : 0.000000
[46825.718077] .rt_runtime : 0.000000
[46825.718080]
[46825.718080] rt_rq[31]:/libvirt/qemu
[46825.718083] .rt_nr_running : 0
[46825.718084] .rt_throttled : 0
[46825.718086] .rt_time : 0.000000
[46825.718088] .rt_runtime : 0.000000
[46825.718091]
[46825.718091] rt_rq[31]:/libvirt
[46825.718093] .rt_nr_running : 0
[46825.718095] .rt_throttled : 0
[46825.718097] .rt_time : 0.000000
[46825.718099] .rt_runtime : 0.000000
[46825.718105]
[46825.718105] rt_rq[31]:/
[46825.718107] .rt_nr_running : 0
[46825.718109] .rt_throttled : 0
[46825.718111] .rt_time : 0.000000
[46825.718113] .rt_runtime : 950.000000
[46825.718115]
[46825.718115] runnable tasks:
[46825.718115] task PID tree-key switches prio exec-runtime sum-exec sum-sleep
[46825.718115] ----------------------------------------------------------------------------------------------------------
[46825.727356]
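For reference, this is roughly how the dump above was captured (the log file name is just an example):

  # run as root while a slow lxc-start is in flight
  echo 1 > /proc/sys/kernel/sysrq    # make sure the sysrq interface is enabled
  dmesg -c > /dev/null               # clear the kernel ring buffer first
  echo t > /proc/sysrq-trigger       # ask the kernel to dump all task states
  dmesg > /tmp/sysrq-t.log           # save the dump for inspection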
regards
Benoit
root@ieng-serv06:/root/scripts# cat /proc/meminfo
MemTotal: 264124804 kB
MemFree: 234107144 kB
Buffers: 3429676 kB
Cached: 1650712 kB
SwapCached: 0 kB
Active: 10496560 kB
Inactive: 3224732 kB
Active(anon): 8695932 kB
Inactive(anon): 84348 kB
Active(file): 1800628 kB
Inactive(file): 3140384 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 136 kB
Writeback: 0 kB
AnonPages: 8640928 kB
Mapped: 17868 kB
Shmem: 139380 kB
Slab: 10287240 kB
SReclaimable: 5977640 kB
SUnreclaim: 4309600 kB
KernelStack: 312000 kB
PageTables: 1989464 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 132062400 kB
Committed_AS: 76627724 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1117304 kB
VmallocChunk: 34222330512 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 206416 kB
DirectMap2M: 5003264 kB
DirectMap1G: 263192576 kB
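Out of that, the kernel-side allocations stand out: Slab (~10GB), KernelStack (~300MB) and PageTables (~1.9GB). Assuming roughly 1600 containers were running at that point (the divisor below is my estimate), that works out to roughly 7-8MB of kernel memory per container on top of the ~10MB of userspace RAM:

  # rough estimate of kernel memory per container; 1600 is an assumed container count
  awk '/^(Slab|KernelStack|PageTables):/ {sum += $2}
       END {printf "%.1f MB per container\n", sum / 1600 / 1024}' /proc/meminfo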
On 11 Mar 2013, at 18:41, Serge Hallyn wrote:
> Quoting Benoit Lourdelet (blourdel at juniper.net):
>> Hello,
>>
>> I am running LXC 0.8.0 on kernel 3.7.9 and am trying to start more than 1000 small containers: around 10MB of RAM per container.
>>
>> Starting the first 1600 or so happens smoothly - I have a 32 virtual core machine - but then everything gets very slow:
>>
>> up to a minute per container creation. Ultimately the server CPU goes to 100%.
>>
>> I get this error multiple times in the syslog:
>>
>>
>> [ 2402.961711] INFO: task lxc-start:128486 blocked for more than 120 seconds.
>> [ 2402.961717] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2402.961724] lxc-start D ffffffff8180cc60 0 128486 1 0x00000000
>> [ 2402.961727] ffff883c30359cb0 0000000000000086 ffff883c2ea3c800 ffff883c2f550600
>> [ 2402.961734] ffff883c2d955c00 ffff883c30359fd8 ffff883c30359fd8 ffff883c30359fd8
>> [ 2402.961741] ffff881fd35e5c00 ffff883c2d955c00 ffff883c3533ec10 ffffffff81cac4e0
>> [ 2402.961747] Call Trace:
>> [ 2402.961753] [<ffffffff816dbfc9>] schedule+0x29/0x70
>> [ 2402.961758] [<ffffffff816dc27e>] schedule_preempt_disabled+0xe/0x10
>> [ 2402.961763] [<ffffffff816dadd7>] __mutex_lock_slowpath+0xd7/0x150
>> [ 2402.961768] [<ffffffff8158b911>] ? net_alloc_generic+0x21/0x30
>> [ 2402.961772] [<ffffffff816da9ea>] mutex_lock+0x2a/0x50
>> [ 2402.961777] [<ffffffff8158c044>] copy_net_ns+0x84/0x110
>> [ 2402.961782] [<ffffffff81081f4b>] create_new_namespaces+0xdb/0x180
>> [ 2402.961787] [<ffffffff8108210c>] copy_namespaces+0x8c/0xd0
>> [ 2402.961792] [<ffffffff81055ea0>] copy_process+0x970/0x1550
>> [ 2402.961796] [<ffffffff8119e542>] ? do_filp_open+0x42/0xa0
>> [ 2402.961801] [<ffffffff81056bc9>] do_fork+0xf9/0x340
>> [ 2402.961806] [<ffffffff81199de6>] ? final_putname+0x26/0x50
>> [ 2402.961811] [<ffffffff81199ff9>] ? putname+0x29/0x40
>> [ 2402.961816] [<ffffffff8101d498>] sys_clone+0x28/0x30
>> [ 2402.961819] [<ffffffff816e5c23>] stub_clone+0x13/0x20
>> [ 2402.961823] [<ffffffff816e5919>] ? system_call_fastpath+0x16/0x1b
>
> Interesting. It could of course be some funky cache or hash issue, but
> what does /proc/meminfo show? 10M of RAM per container may be true in
> userspace, but the network stacks etc. are also taking up kernel memory.
>
> I assume the above trace is one container waiting on another to finish
> its netns alloc. If you could get dmesg output from echo t >
> /proc/sysrq-trigger during one of these slow starts, it could show where
> the other is hung.
>
> -serge
>
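Since the trace above shows lxc-start blocked in copy_net_ns, i.e. waiting on network-namespace creation, here is a sketch of a micro-benchmark that would isolate that path from the rest of lxc-start (the bench-$i names and the count of 2000 are just placeholders):

  # hypothetical micro-benchmark: time bare network-namespace creation as the count grows
  for i in $(seq 1 2000); do
      /usr/bin/time -f "netns $i: %e s" ip netns add bench-$i 2>> /tmp/netns-timing.log
  done
  # tear the namespaces back down afterwards
  for i in $(seq 1 2000); do ip netns delete bench-$i; done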