[lxc-users] LXD containers lose outbound network

Mon Jun 27 10:15:08 UTC 2016

OK, this keeps happening!
In this state LXD is not usable in production. I cannot restart LXD every few days.

I'll answer Fajar's questions from below here:

By "inbound" I mean connections from the host/internet to the container. Those work and keep working. I have port forwarding enabled.
By "outbound" I mean connections from the container to the host/internet. Those to the internet start failing after some time (several days or so).

On the host:
I can ping the container just fine.

In the container:
I can ping lxdbr0:

root@taskd:~# ping 10.0.8.1
PING 10.0.8.1 (10.0.8.1) 56(84) bytes of data.
64 bytes from 10.0.8.1: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 10.0.8.1: icmp_seq=2 ttl=64 time=0.121 ms
64 bytes from 10.0.8.1: icmp_seq=3 ttl=64 time=0.144 ms

And "tcpdump -i lxdbr0" on the host shows:
11:29:23.570390 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 12901, seq 1, length 64
11:29:23.570459 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 12901, seq 1, length 64
11:29:24.569336 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 12901, seq 2, length 64
11:29:24.569386 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 12901, seq 2, length 64
11:29:25.568580 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 12901, seq 3, length 64
11:29:25.568630 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 12901, seq 3, length 64

However, I cannot ping an outside IP:
root@taskd:~# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

On the host I see:
11:30:14.343238 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 12902, seq 1, length 64
11:30:15.350848 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 12902, seq 2, length 64
11:30:16.352577 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 12902, seq 3, length 64
11:30:17.352640 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 12902, seq 4, length 64
11:30:18.352628 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 12902, seq 5, length 64

When trying to ping google.com I see:
11:30:52.847738 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 12903, seq 1, length 64
11:30:53.854716 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 12903, seq 2, length 64
11:30:54.862632 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 12903, seq 3, length 64
11:30:55.870632 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 12903, seq 4, length 64
11:30:56.878594 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 12903, seq 5, length 64

But at the same time I can ping google.com from the host!
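
The captures above show echo requests arriving on lxdbr0 with no replies coming back, so either the requests are not being forwarded/NATed out, or the replies are not being translated back to the container. To compare captures quickly (for example one taken on lxdbr0 and one taken on the host's external interface), a small helper along these lines could list the requests that never got a reply. This is only a sketch: the function name unanswered_pings is made up for illustration, and it assumes tcpdump's one-line ICMP text format exactly as shown above.

```shell
# Sketch only: "unanswered_pings" is an illustrative name, not an existing tool.
# Reads tcpdump text output on stdin and prints "id N, seq M" for every
# ICMP echo request that has no matching ICMP echo reply in the same capture.
unanswered_pings() {
    awk '
        # Extract the "id N, seq M" part of a tcpdump ICMP line.
        function key_of(line) {
            if (match(line, /id [0-9]+, seq [0-9]+/))
                return substr(line, RSTART, RLENGTH)
            return ""
        }
        /ICMP echo request/ {
            k = key_of($0)
            if (k != "" && !(k in seen)) { seen[k] = 1; order[++n] = k }
        }
        /ICMP echo reply/ {
            k = key_of($0)
            if (k != "") replied[k] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in replied)) print order[i]
        }
    '
}

# Possible usage (needs root; -l makes tcpdump line-buffered, -n skips DNS):
#   timeout 10 tcpdump -l -n -i lxdbr0 icmp | unanswered_pings
```

If the same requests show up unanswered on the external interface with the source still 10.0.8.54, that would point at the NAT rule rather than at the bridge itself.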

After running

service lxd stop
service lxd-bridge stop
service lxd start

on the host, everything works again.

Here is the same "tcpdump -i lxdbr0" capture as above, taken after the restart:
12:11:44.317375 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 13076, seq 1, length 64
12:11:44.317477 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 13076, seq 1, length 64
12:11:45.316620 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 13076, seq 2, length 64
12:11:45.316680 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 13076, seq 2, length 64
12:11:46.316587 IP 10.0.8.54 > 10.0.8.1: ICMP echo request, id 13076, seq 3, length 64
12:11:46.316645 IP 10.0.8.1 > 10.0.8.54: ICMP echo reply, id 13076, seq 3, length 64

12:11:55.044655 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 13077, seq 1, length 64
12:11:55.045254 IP google-public-dns-a.google.com > 10.0.8.54: ICMP echo reply, id 13077, seq 1, length 64
12:11:56.044626 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 13077, seq 2, length 64
12:11:56.045285 IP google-public-dns-a.google.com > 10.0.8.54: ICMP echo reply, id 13077, seq 2, length 64
12:11:57.044617 IP 10.0.8.54 > google-public-dns-a.google.com: ICMP echo request, id 13077, seq 3, length 64
12:11:57.045264 IP google-public-dns-a.google.com > 10.0.8.54: ICMP echo reply, id 13077, seq 3, length 64

12:12:15.553335 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 13078, seq 1, length 64
12:12:15.554093 IP zrh04s08-in-f14.1e100.net > 10.0.8.54: ICMP echo reply, id 13078, seq 1, length 64
12:12:16.554574 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 13078, seq 2, length 64
12:12:16.555275 IP zrh04s08-in-f14.1e100.net > 10.0.8.54: ICMP echo reply, id 13078, seq 2, length 64
12:12:17.553578 IP 10.0.8.54 > zrh04s08-in-f14.1e100.net: ICMP echo request, id 13078, seq 3, length 64
12:12:17.554337 IP zrh04s08-in-f14.1e100.net > 10.0.8.54: ICMP echo reply, id 13078, seq 3, length 64

I have removed the ARP requests because they are the same in both cases.

What could be happening to LXD after it's been running for a while?

Thanks

-----"lxc-users" <lxc-users-bounces at lists.linuxcontainers.org> wrote: -----
To: LXC users mailing-list <lxc-users at lists.linuxcontainers.org>
From: "Fajar A. Nugraha" 
Sent by: "lxc-users" 
Date: 05/30/2016 7:14
Subject: Re: [lxc-users] LXD containers lose outbound network

On Sun, May 29, 2016 at 1:30 PM,  <david.andel at bli.uzh.ch> wrote:
Hi

My LXD has the following network configuration:

root@qumind:~# egrep -v '(^#|^$)' /etc/default/lxd-bridge
USE_LXD_BRIDGE="true"
LXD_BRIDGE="lxdbr0"
UPDATE_PROFILE="true"
LXD_CONFILE=""
LXD_DOMAIN="lxd"
LXD_IPV4_ADDR="10.0.8.1"
LXD_IPV4_NETMASK="255.255.255.0"
LXD_IPV4_NETWORK="10.0.8.0/24"
LXD_IPV4_DHCP_RANGE="10.0.8.2,10.0.8.254"
LXD_IPV4_DHCP_MAX="253"
LXD_IPV4_NAT="true"
LXD_IPV6_ADDR=""
LXD_IPV6_MASK=""
LXD_IPV6_NETWORK=""
LXD_IPV6_NAT="false"
LXD_IPV6_PROXY="false"

And the network works fine at first. However, after some time outbound connections fail, while inbound connections continue working.
It affects all LXD containers.

What do you mean "outbound" and "inbound"?

From that setup, you have a NAT network. So other servers in your LAN shouldn't be able to access your containers unless you also set up port forwarding (which you didn't mention). So "inbound" can't mean "other servers in your LAN accessing your container" in your case.

If by "inbound" you mean "even the host can't access the container", then something is definitely wrong. I'd start with a simple "ping" test when that happens, coupled with "tcpdump" on both the host (lxdbr0 and veth*) and container (eth0) side.

And it is not enough to just run 

root@qumind:~# service lxd-bridge stop
Job for lxd-bridge.service canceled.
root@qumind:~# service lxd restart

while the containers are running. The behaviour stays the same.

Obviously. You can't delete a bridge that has interfaces attached (which is the case when containers are running).

I have to stop the containers first, then restart the LXD bridge and start the containers again.
Only then do the outbound connections work again, until I have to restart it all again.

What could be the culprit?

Start with the basics:
- test host <-> container networking first. Use "ping" and "tcpdump" to help
- look for errors or odd messages in syslog, e.g. from "iptables" or "conntrack"
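
As a rough illustration of those checks on the host: with LXD_IPV4_NAT="true" one would expect a MASQUERADE rule for 10.0.8.0/24 in the nat table and IPv4 forwarding to be enabled. The helpers below are a sketch under that assumption; the function names are made up, and the exact rule text lxd-bridge installs may differ from what is matched here.

```shell
# Sketch only: function names are illustrative, not part of LXD.

# Reads `iptables -t nat -S POSTROUTING` output on stdin and succeeds if some
# rule has the given source subnet and jumps to MASQUERADE.
has_masquerade_rule() {
    subnet=$1
    grep -F -e "-s $subnet " | grep -q -e '-j MASQUERADE'
}

# Succeeds if IPv4 forwarding is switched on. The optional argument overrides
# the file to read, which is only useful for testing.
forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}

# Possible usage when outbound traffic has stopped (needs root):
#   iptables -t nat -S POSTROUTING | has_masquerade_rule 10.0.8.0/24 \
#       || echo "no MASQUERADE rule for 10.0.8.0/24"
#   forwarding_enabled || echo "net.ipv4.ip_forward is off"
#   dmesg | grep -i conntrack   # e.g. "nf_conntrack: table full, dropping packet"
```

Running the same checks once right after a fresh "service lxd start" and once when the problem shows up makes it easier to see what changed in between.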

Thanks

PS:
To stop all running containers I am using
for i in $(lxc list | grep RUNNING | awk -F'|' '{print $2}' | tr -d '[:blank:]'); do lxc stop "$i"; done
I think it would be convenient to be able to just say
lxc stop all
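
A slightly stricter variant of that loop, as a sketch: it matches RUNNING only in the STATE column, so a stopped container whose name happens to contain "RUNNING" is not picked up. It assumes the default "lxc list" table layout with NAME and STATE as the first two columns; the function name running_containers is made up.

```shell
# Sketch only: "running_containers" is an illustrative name.
# Reads the default `lxc list` table on stdin and prints the name of every
# container whose STATE column is RUNNING, one per line.
running_containers() {
    awk -F'|' '$3 ~ /RUNNING/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Possible usage:
#   lxc list | running_containers | while read -r name; do lxc stop "$name"; done
```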

"service lxd stop" would stop all running containers before stopping lxd. And "service lxd start" after that will start containers that were previously started, as well as containers with boot.autostart: "true"

-- 
Fajar 
_______________________________________________
lxc-users mailing list
lxc-users at lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users