[lxc-users] lxc progress and a few questions (Trying criu from github)

Sun Apr 10 21:52:08 UTC 2016

Greetings,

OK, another try - simplified the test environment as much as possible,
in hopes of getting this working. Ubuntu 16.04, up to date.

Two changes:

I've reverted to the default lxc network configuration to eliminate
corner cases and focus on live migration issues.

I've uninstalled the criu package and built criu from git -
(https://github.com/tych0/criu/tree/cgroup-root-mount)

Steps to reproduce failure:

Create 1 new ubuntu 16.04 container on each of the 2 lxd hosts
Issue the lxc move command

Result:

root@ronnie:~# lxc move second lxd:
error: Error transferring container data: checkpoint failed:
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742659) Error (namespaces.c:910): Namespaces dumping finished with error 65280
(03.747443) Error (cr-dump.c:1600): Dumping FAILED.
root@ronnie:~#
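
A note on the "error 65280" value in the output above: if that number is a raw wait(2)-style status (an assumption; nothing in this thread confirms how CRIU produces it), it would decode to a normal exit with code 255 rather than a kill by signal. A minimal Python sketch of that decoding, where decode_wait_status is a hypothetical helper name:

```python
def decode_wait_status(status):
    """Split a raw Linux wait(2)-style status into (kind, value).

    Assumes the conventional Linux encoding: the low 7 bits hold the
    terminating signal (0 if the process exited normally) and bits 8-15
    hold the exit code.
    """
    signal_number = status & 0x7F
    if signal_number == 0:
        return ("exited", (status >> 8) & 0xFF)
    return ("signaled", signal_number)


print(decode_wait_status(65280))  # 65280 == 0xFF00 -> ('exited', 255)
```

Exit code 255 is what a child process returning -1 would produce, which would be consistent with the mount failure reported on the line before it.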

Tail of /var/log/lxd/second/migration_dump_2016-04-10T14\:39\:43-07\:00.log:

(03.722460) Process: 579(23837)
(03.722464) ----------------------------------------
(03.722477) Dumping 1(23057)'s namespaces
(03.723377) Dump CGROUP namespace info 14 via 23057
(03.724066) Dump UTS namespace 11 via 23057
(03.724662) Dump IPC namespace 10 via 23057
(03.724924) IPC shared memory segments: 0
(03.724934) IPC message queues: 0
(03.724941) IPC semaphore sets: 0
(03.725346) Dump NET namespace info 9 via 23057
(03.725525) Mount ns' sysfs in crtools-sys.JZ9n0X
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742659) Error (namespaces.c:910): Namespaces dumping finished with error 65280
(03.742876) Unlock network
(03.742883) Running network-unlock scripts
(03.747031) Unfreezing tasks into 1
(03.747049)     Unseizing 23057 into 1
(03.747061)     Unseizing 23147 into 1
(03.747069)     Unseizing 23148 into 1
(03.747077)     Unseizing 23182 into 1
(03.747084)     Unseizing 23272 into 1
(03.747091)     Unseizing 23279 into 1
(03.747125)     Unseizing 23282 into 1
(03.747140)     Unseizing 23297 into 1
(03.747162)     Unseizing 23301 into 1
(03.747181)     Unseizing 23306 into 1
(03.747223)     Unseizing 23309 into 1
(03.747232)     Unseizing 23330 into 1
(03.747247)     Unseizing 23345 into 1
(03.747287)     Unseizing 23474 into 1
(03.747299)     Unseizing 23577 into 1
(03.747310)     Unseizing 23675 into 1
(03.747319)     Unseizing 23688 into 1
(03.747328)     Unseizing 23691 into 1
(03.747339)     Unseizing 23835 into 1
(03.747348)     Unseizing 23836 into 1
(03.747357)     Unseizing 23837 into 1
(03.747443) Error (cr-dump.c:1600): Dumping FAILED.
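
Most of the log tail above is routine unfreeze output; only three lines are errors. A rough Python sketch for pulling just the error lines out of a pasted dump log follows. The regular expression is inferred from the lines shown here, not from any documented CRIU log format, and extract_errors and SAMPLE are illustrative names:

```python
import re

# Pattern inferred from the log lines in this thread (an assumption):
# "(seconds.micros) Error (file.c:line): message"
ERROR_RE = re.compile(
    r"^\((?P<ts>\d+\.\d+)\) Error "
    r"\((?P<source>[^:()]+):(?P<line>\d+)\): (?P<message>.*)$"
)


def extract_errors(log_text):
    """Return a list of dicts, one per 'Error (...)' line in log_text.

    A line without a leading "(timestamp)" that directly follows an error
    is treated as a continuation of that error, since mail clients may
    wrap long log lines.
    """
    errors = []
    previous_was_error = False
    for raw in log_text.splitlines():
        line = raw.rstrip()
        match = ERROR_RE.match(line)
        if match:
            errors.append({
                "timestamp": float(match.group("ts")),
                "source": match.group("source"),
                "line": int(match.group("line")),
                "message": match.group("message"),
            })
            previous_was_error = True
        elif previous_was_error and line and not line.startswith("("):
            # Wrapped remainder of the previous error message.
            errors[-1]["message"] += " " + line.strip()
        else:
            previous_was_error = False
    return errors


# A few lines copied from the log tail above, including a wrapped one.
SAMPLE = """(03.725525) Mount ns' sysfs in crtools-sys.JZ9n0X
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742659) Error (namespaces.c:910): Namespaces dumping finished with
error 65280
(03.742876) Unlock network
(03.747443) Error (cr-dump.c:1600): Dumping FAILED.
"""

for err in extract_errors(SAMPLE):
    print(f'{err["source"]}:{err["line"]}: {err["message"]}')
```

Run against the full file under /var/log/lxd/second/, this would list the first failing call (here the sysfs mount in net.c) ahead of the later errors that follow from it.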

I'm happy to supply additional info, or test patches -

Regards,

Jake

On Fri, Apr 8, 2016 at 4:40 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
> Ah, never mind - it doesn't appear to be solely a criu issue - even
> migration of stopped containers hangs forever now.
>
> Jake
>
> On Fri, Apr 8, 2016 at 4:23 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>> Ubuntu 16.04, up to date -
>>
>> After today's updates, including a kernel upgrade to 4.4.0-18, I tried
>> live migration again:
>>
>> root@raskolnikov:~# lxc move third lxd2:
>>
>> One hour later:
>>
>> root@raskolnikov:~# lxc move third lxd2:
>>
>> Still stuck, and the migration file in /var/log/lxd/third has not been created.
>>
>> Tycho said on Mar 30 that the situation should be sorted soon, but
>> mentioned the git repo:
>> https://github.com/tych0/criu/tree/cgroup-root-mount
>>
>> Should live migration work with criu from git?
>>
>> Feel free to advise me on what information I can supply, not only for
>> the ct migration issues, but also for the new dhcp issue
>>
>> Regards,
>>
>> Jake
>>
>>
>> On Thu, Apr 7, 2016 at 11:01 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>>> (Bump) -
>>>
>>> Any thoughts on what to try for the CT migration and dhcp issues?
>>> Running up to date ubuntu 16.04 beta -
>>>
>>> Regards,
>>>
>>> Jake
>>>
>>> On Wed, Apr 6, 2016 at 3:18 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>>>> Greetings -
>>>>
>>>> I've not yet been able to reproduce that one shining moment from Mar
>>>> 29 when live migration of privileged containers was working, under
>>>> kernel 4.4.0-15
>>>>
>>>> To recap: live container migration broke with 4.4.0-16, and is still
>>>> broken in 4.4.0-17 - but now, instead of producing an error message,
>>>> an attempt to live migrate a container merely hangs forever. Is that
>>>> expected, or should I be seeing something more? BTW - the migration
>>>> dump log for that container hasn't been touched for a week. I'll be
>>>> glad to supply more info if this is not a known issue.
>>>>
>>>> Recent updates seem to have created a new problem: the CTs which
>>>> configure their own network settings work (aside from migration) but
>>>> none of the CTs which depend on dhcp are getting IPs. BTW I'm using a
>>>> bridge connected to my local network and dhcp, not the default lxc
>>>> dhcp server. I see the packets on the host bridge, but they don't
>>>> reach the dhcp server. I'd be curious to know if there have been any
>>>> dhcp issues since recent updates. If not, I'll need to troubleshoot
>>>> other causes, but it's odd that dhcp simply stops working for all CTs
>>>> on both lxd hosts after updates.
>>>>
>>>> Jake
>>>>
>>>>
>>>> On Wed, Mar 30, 2016 at 6:27 AM, Tycho Andersen
>>>> <tycho.andersen at canonical.com> wrote:
>>>>> On Tue, Mar 29, 2016 at 11:17:26PM -0700, jjs - mainphrame wrote:
>>>>>> Well, I've found some interesting things here today. I created a couple of
>>>>>> privileged xenial containers, and sure enough, I was able to live migrate
>>>>>> them back and forth between the 2 lxd hosts.
>>>>>>
>>>>>> So far, so good.
>>>>>>
>>>>>> Then I did an apt upgrade - among the changes was a kernel change from
>>>>>> 4.4.0-15 to 4.4.0-16 - and live migration stopped working.
>>>>>>
>>>>>> Here are the failure messages that resulted from attempting the very same
>>>>>> live migrations that worked before the upgrade and reboot into 4.4.0-16:
>>>>>>
>>>>>> root@raskolnikov:~# lxc move akira lxd2:
>>>>>> error: Error transferring container data: checkpoint failed:
>>>>>> (00.092234) Error (mount.c:740): mnt: 83:./sys/fs/cgroup/devices doesn't have a proper root mount
>>>>>> (00.098187) Error (cr-dump.c:1600): Dumping FAILED.
>>>>>>
>>>>>>
>>>>>> root@ronnie:~# lxc move third lxd:
>>>>>> error: Error transferring container data: checkpoint failed:
>>>>>> (00.076107) Error (mount.c:740): mnt: 326:./sys/fs/cgroup/perf_event doesn't have a proper root mount
>>>>>> (00.080388) Error (cr-dump.c:1600): Dumping FAILED.
>>>>>
>>>>> Yep, this is a known issue with -16. We need both a kernel patch and a
>>>>> patch to CRIU before it will start working again. I have a branch at:
>>>>>
>>>>> https://github.com/tych0/criu/tree/cgroup-root-mount
>>>>>
>>>>> which should work if you want to keep playing with it, but hopefully
>>>>> we'll have the situation sorted out in the next few days.
>>>>>
>>>>> Tycho
>>>>>
>>>>>> Jake
>>>>>>
>>>>>> PS - Thanks for the html mail heads-up - I've been using google mail
>>>>>> services for this domain. I'll have to look into the config options, and
>>>>>> see if I can do the needful.
>>>>>
>>>>>>
>>>>>> On Tue, Mar 29, 2016 at 12:45 PM, Andrey Repin <anrdaemon at yandex.ru> wrote:
>>>>>>
>>>>>> > Greetings, jjs - mainphrame!
>>>>>> >
>>>>>> > >> On Mon, Mar 28, 2016 at 08:47:24PM -0700, jjs - mainphrame wrote:
>>>>>> > >>> I've looked at ct migration between 2 ubuntu 16.04 hosts today,
>>>>>> > >>> and had some interesting problems; I find that migration of
>>>>>> > >>> stopped containers works fairly reliably; but live migration,
>>>>>> > >>> well, it transfers a lot of data, then exits with a failure
>>>>>> > >>> message. I can then move the same container, stopped, with no
>>>>>> > >>> problem.
>>>>>> > >>>
>>>>>> > >>> The error is the same every time, a failure of "mkdtemp" -
>>>>>> > >>
>>>>>> > >> It looks like your host /tmp isn't writable by the uid map that the
>>>>>> > >> container is being restored as?
>>>>>> >
>>>>>> >
>>>>>> > > Which is odd, since /tmp has 1777 perms on both hosts, so I don't see how
>>>>>> > > it could be a permissions problem. Surely the default apparmor profile is
>>>>>> > > not the cause? You did give me a new idea though, and I'll set up a test
>>>>>> > > with privileged containers for comparison. Is there a switch to enable
>>>>>> > > verbose logging?
>>>>>> >
>>>>>> > I ran into the same issue once. I stumbled over it for nearly a month,
>>>>>> > falsely blaming LXC.
>>>>>> > Recreating the container's rootfs from scratch resolved the issue.
>>>>>> > I don't know what caused it to begin with; it must have been some kind
>>>>>> > of glitch.
>>>>>> >
>>>>>> > P.S.
>>>>>> > It would be great if you could configure your mail client not to use
>>>>>> > HTML format for lists.
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > With best regards,
>>>>>> > Andrey Repin
>>>>>> > Tuesday, March 29, 2016 22:43:04
>>>>>> >
>>>>>> > Sorry for my terrible english...
>>>>>> > _______________________________________________
>>>>>> > lxc-users mailing list
>>>>>> > lxc-users at lists.linuxcontainers.org
>>>>>> > http://lists.linuxcontainers.org/listinfo/lxc-users
>>>>>> >