<div>

                    Hi all,

                </div><div><br></div><div>We encountered a bug today: one of our systems went into a soft lockup when we tried to start a container. Unfortunately, at that point we had to power-cycle the machine because we could no longer access it. Here is the kernel.log:</div><div><br></div><div><div>[14164995.081770] BUG: soft lockup - CPU#3 stuck for 22s! [lxc-start:20066]</div><div>[14164995.081784] Modules linked in: overlayfs(F) veth(F) xt_CHECKSUM(F) quota_v2(F) quota_tree(F) bridge(F) stp(F) llc(F) ipt_MASQUERADE(F) xt_nat(F) xt_tcpudp(F) iptable_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) xt_LOG(F) iptable_filter(F) iptable_mangle(F) ip_tables(F) x_tables(F) intel_rapl(F) crct10dif_pclmul(F) crc32_pclmul(F) ghash_clmulni_intel(F) aesni_intel(F) ablk_helper(F) cryptd(F) lrw(F) gf128mul(F) glue_helper(F) aes_x86_64(F) microcode(F) isofs(F) xfs(F) libcrc32c(F) raid10(F) raid456(F) async_pq(F) async_xor(F) xor(F) async_memcpy(F) async_raid6_recov(F) raid6_pq(F) async_tx(F) raid1(F) raid0(F) multipath(F) linear(F)</div><div>[14164995.081820] CPU: 3 PID: 20066 Comm: lxc-start Tainted: GF   B        3.13.4 #1</div><div>[14164995.081823] task: ffff880107da9810 ti: ffff8800f494e000 task.ti: ffff8800f494e000</div><div>[14164995.081825] RIP: e030:[<ffffffff811e266b>]  [<ffffffff811e266b>] __lookup_mnt+0x5b/0x80</div><div>[14164995.081835] RSP: e02b:ffff8800f494fcd8  EFLAGS: 00000296</div><div>[14164995.081837] RAX: ffffffff81c6b7e0 RBX: 00000000011e7ab2 RCX: ffff8810a36890b0</div><div>[14164995.081838] RDX: 0000000000000997 RSI: ffff881005054f00 RDI: ffff881017f2fba0</div><div>[14164995.081840] RBP: ffff8800f494fce8 R08: 0035313638363436 R09: ffff881005054f00</div><div>[14164995.081841] R10: 0001010000000000 R11: ffffc90000000000 R12: ffff8810a29a3000</div><div>[14164995.081842] R13: ffff8800f494ff28 R14: ffff8800f494fdb8 R15: 0000000000000000</div><div>[14164995.081848] FS:  00007fabd0fec800(0000) GS:ffff88110e4c0000(0000) 
knlGS:0000000000000000</div><div>[14164995.081850] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b</div><div>[14164995.081851] CR2: 0000000001dce000 CR3: 00000000f515f000 CR4: 0000000000002660</div><div>[14164995.081853] Stack:</div><div>[14164995.081854]  ffff8800f494fd18 ffffffff81c6b7e0 ffff8800f494fd18 ffffffff811e2740</div><div>[14164995.081857]  ffff8800f494fdb8 ffff8800f494ff28 ffff8810a29a3000 ffff8800f494ff28</div><div>[14164995.081860]  ffff8800f494fd38 ffffffff811cd17e ffff8800f494fda8 ffff8810a29a3000</div><div>[14164995.081862] Call Trace:</div><div>[14164995.081868]  [<ffffffff811e2740>] lookup_mnt+0x30/0x70</div><div>[14164995.081872]  [<ffffffff811cd17e>] follow_mount+0x5e/0x70</div><div>[14164995.081875]  [<ffffffff811cffd2>] mountpoint_last+0xc2/0x1e0</div><div>[14164995.081877]  [<ffffffff811d01c7>] path_mountpoint+0xd7/0x450</div><div>[14164995.081883]  [<ffffffff817639e3>] ? _raw_spin_unlock_irqrestore+0x23/0x50</div><div>[14164995.081888]  [<ffffffff811a80a3>] ? kmem_cache_alloc+0x1d3/0x1f0</div><div>[14164995.081891]  [<ffffffff811d225a>] ? getname_flags+0x5a/0x190</div><div>[14164995.081893]  [<ffffffff811d225a>] ? getname_flags+0x5a/0x190</div><div>[14164995.081896]  [<ffffffff811d0574>] filename_mountpoint+0x34/0xc0</div><div>[14164995.081899]  [<ffffffff811d2f9a>] user_path_mountpoint_at+0x4a/0x70</div><div>[14164995.081902]  [<ffffffff811e317f>] SyS_umount+0x7f/0x3b0</div><div>[14164995.081907]  [<ffffffff8102253d>] ? 
syscall_trace_leave+0xdd/0x150</div><div>[14164995.081912]  [<ffffffff8176c87f>] tracesys+0xe1/0xe6</div><div>[14164995.081913] Code: 03 0d a2 56 b3 00 48 8b 01 48 89 45 f8 48 8b 55 f8 31 c0 48 39 ca 74 2b 48 89 d0 eb 13 0f 1f 00 48 8b 00 48 89 45 f8 48 8b 45 f8 <48> 39 c8 74 18 48 8b 50 10 48 83 c2 20 48 39 d7 75 e3 48 39 70</div><div><br></div><div>After this point, every lxc-start invocation seemed to fail, but the system continued to run until we power-cycled it.</div><div><br></div><div>When I inspected some of the containers that were started during that time, I saw that one of them still had an lxc_putold directory (which should be removed once the container has finished starting up, right?). However, I'm not sure whether that is related to the lockup above.</div><div><br></div><div>The host is a 12.04 EC2 server, with lxc 1.0.0 and kernel 3.13.0-12.32.</div><div><br></div><div>Cheers,</div><div>Daniel.</div></div>
