<div dir="ltr">We built the kernel based on ubuntu tree with some patches backported from 3.14 ( <a href="https://github.com/nitrous-io/linux/commits/stable-trusty">https://github.com/nitrous-io/linux/commits/stable-trusty</a> )<div>
I can try to run another stress test to see if we can replicate the bug with the debug kernel.<br></div><div><br></div><div>Hmm, moving to overlayfs wasnt actually considered ( we actually moved from overlayfs to aufs because we had some weird bugs with overlayfs, but i forgot what they were ). However, we can try to remove the bind mounts and see if that helps.</div>
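(For concreteness, a rough sketch of that test, assuming the bind mounts are declared as lxc.mount.entry lines in the container config and using a placeholder container name, not our exact setup:)

  # Comment out the bind-mount entries in the container config
  # ("web1" is a placeholder container name)
  sed -i.bak '/^lxc\.mount\.entry.*bind/s/^/# /' /var/lib/lxc/web1/config

  # Restart the container and re-run the stress test against it
  lxc-stop -n web1
  lxc-start -n web1 -d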

Daniel.

On Wed, Mar 12, 2014 at 10:46 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
Quoting Dao Quang Minh (dqminh89@gmail.com):
> Hi all,
>
> We encountered a bug today when one of our systems entered a soft lockup while we were trying to start a container. Unfortunately, at that point we had to power-cycle the machine because we could no longer access it. Here is the kernel.log:
>
> [14164995.081770] BUG: soft lockup - CPU#3 stuck for 22s! [lxc-start:20066]
> [14164995.081784] Modules linked in: overlayfs(F) veth(F) xt_CHECKSUM(F) quota_v2(F) quota_tree(F) bridge(F) stp(F) llc(F) ipt_MASQUERADE(F) xt_nat(F) xt_tcpudp(F) iptable_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) xt_LOG(F) iptable_filter(F) iptable_mangle(F) ip_tables(F) x_tables(F) intel_rapl(F) crct10dif_pclmul(F) crc32_pclmul(F) ghash_clmulni_intel(F) aesni_intel(F) ablk_helper(F) cryptd(F) lrw(F) gf128mul(F) glue_helper(F) aes_x86_64(F) microcode(F) isofs(F) xfs(F) libcrc32c(F) raid10(F) raid456(F) async_pq(F) async_xor(F) xor(F) async_memcpy(F) async_raid6_recov(F) raid6_pq(F) async_tx(F) raid1(F) raid0(F) multipath(F) linear(F)
> [14164995.081820] CPU: 3 PID: 20066 Comm: lxc-start Tainted: GF B 3.13.4 #1
> [14164995.081823] task: ffff880107da9810 ti: ffff8800f494e000 task.ti: ffff8800f494e000
> [14164995.081825] RIP: e030:[<ffffffff811e266b>] [<ffffffff811e266b>] __lookup_mnt+0x5b/0x80
> [14164995.081835] RSP: e02b:ffff8800f494fcd8 EFLAGS: 00000296
> [14164995.081837] RAX: ffffffff81c6b7e0 RBX: 00000000011e7ab2 RCX: ffff8810a36890b0
> [14164995.081838] RDX: 0000000000000997 RSI: ffff881005054f00 RDI: ffff881017f2fba0
> [14164995.081840] RBP: ffff8800f494fce8 R08: 0035313638363436 R09: ffff881005054f00
> [14164995.081841] R10: 0001010000000000 R11: ffffc90000000000 R12: ffff8810a29a3000
> [14164995.081842] R13: ffff8800f494ff28 R14: ffff8800f494fdb8 R15: 0000000000000000
> [14164995.081848] FS: 00007fabd0fec800(0000) GS:ffff88110e4c0000(0000) knlGS:0000000000000000
> [14164995.081850] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [14164995.081851] CR2: 0000000001dce000 CR3: 00000000f515f000 CR4: 0000000000002660
> [14164995.081853] Stack:
> [14164995.081854] ffff8800f494fd18 ffffffff81c6b7e0 ffff8800f494fd18 ffffffff811e2740
> [14164995.081857] ffff8800f494fdb8 ffff8800f494ff28 ffff8810a29a3000 ffff8800f494ff28
> [14164995.081860] ffff8800f494fd38 ffffffff811cd17e ffff8800f494fda8 ffff8810a29a3000
> [14164995.081862] Call Trace:
> [14164995.081868] [<ffffffff811e2740>] lookup_mnt+0x30/0x70
> [14164995.081872] [<ffffffff811cd17e>] follow_mount+0x5e/0x70
> [14164995.081875] [<ffffffff811cffd2>] mountpoint_last+0xc2/0x1e0
> [14164995.081877] [<ffffffff811d01c7>] path_mountpoint+0xd7/0x450
> [14164995.081883] [<ffffffff817639e3>] ? _raw_spin_unlock_irqrestore+0x23/0x50
> [14164995.081888] [<ffffffff811a80a3>] ? kmem_cache_alloc+0x1d3/0x1f0
> [14164995.081891] [<ffffffff811d225a>] ? getname_flags+0x5a/0x190
> [14164995.081893] [<ffffffff811d225a>] ? getname_flags+0x5a/0x190
> [14164995.081896] [<ffffffff811d0574>] filename_mountpoint+0x34/0xc0
> [14164995.081899] [<ffffffff811d2f9a>] user_path_mountpoint_at+0x4a/0x70
> [14164995.081902] [<ffffffff811e317f>] SyS_umount+0x7f/0x3b0
> [14164995.081907] [<ffffffff8102253d>] ? syscall_trace_leave+0xdd/0x150
> [14164995.081912] [<ffffffff8176c87f>] tracesys+0xe1/0xe6
> [14164995.081913] Code: 03 0d a2 56 b3 00 48 8b 01 48 89 45 f8 48 8b 55 f8 31 c0 48 39 ca 74 2b 48 89 d0 eb 13 0f 1f 00 48 8b 00 48 89 45 f8 48 8b 45 f8 <48> 39 c8 74 18 48 8b 50 10 48 83 c2 20 48 39 d7 75 e3 48 39 70
>
> After this point, it seems that all lxc-start invocations fail, but the system continued to run until we power-cycled it.
>
> When I inspected some of the containers that were started around that time, I saw that one of them had a leftover lxc_putold directory (which should be removed once the container finishes starting up, right?). However, I'm not sure whether that is related to the lockup above.
>
> The host is an Ubuntu 12.04 EC2 server running lxc 1.0.0 and kernel 3.13.0-12.32.

Hi,

where did you get your kernel? Is there an updated version you can
fetch or build?

You might want to grab or build the debug symbols and see if you
can track down what's actually happening in the kernel. The stack
trace doesn't really make sense to me - I see where getname_flags()
calls __getname(), which is kmem_cache_alloc(), but I don't see
how that gets us to path_mountpoint(). An interrupt?
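
(As a rough sketch of what I mean, assuming you kept the vmlinux with
debug info from your own build and the kernel wasn't relocated; the
address below is just the RIP from your trace:)

  # Map the faulting RIP from the soft-lockup report to a source line
  addr2line -e vmlinux -f -i ffffffff811e266b

  # Or look at the surrounding source with gdb
  gdb -batch vmlinux \
      -ex 'info line *(0xffffffff811e266b)' \
      -ex 'list *(0xffffffff811e266b)'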

Anyway, my *guess* is that this is a bug in aufs, which unfortunately is
not upstream. If you could try replacing aufs with overlayfs and see
whether you hit the same problem, that would be a helpful data point.
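
(One relatively cheap way to try that, sketched here assuming LXC 1.0's
overlayfs backing store and a placeholder container name:)

  # Snapshot-clone an existing container onto an overlayfs-backed rootfs
  lxc-clone -s -B overlayfs -o web1 -n web1-ovl

  # Start the clone and run the same workload against it
  lxc-start -n web1-ovl -d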

-serge
_______________________________________________
lxc-users mailing list
lxc-users@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users