[lxc-users] lxc progress and a few questions (Trying criu from github)
jjs - mainphrame
jjs at mainphrame.com
Sun Apr 10 21:52:08 UTC 2016
Greetings,
OK, another try - simplified the test environment as much as possible,
in hopes of getting this working. Ubuntu 16.04, up to date.
Two changes:
1. I've reverted to the default lxc network configuration to eliminate
   corner cases and focus on live migration issues.
2. I've uninstalled the criu package and built criu from git
   (https://github.com/tych0/criu/tree/cgroup-root-mount).
Steps to reproduce failure:
1. Create 1 new ubuntu 16.04 container on each of the 2 lxd hosts.
2. Issue the lxc move command.
Result:
root@ronnie:~# lxc move second lxd:
error: Error transferring container data: checkpoint failed:
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742659) Error (namespaces.c:910): Namespaces dumping finished with error 65280
(03.747443) Error (cr-dump.c:1600): Dumping FAILED.
root@ronnie:~#
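A note on the "error 65280" figure in the output above: assuming CRIU is printing a raw POSIX wait status there (an assumption, not confirmed against the CRIU source), the child's exit code sits in the high byte, so the number would mean "helper exited with 255" rather than a specific errno. A minimal sketch of that decoding follows; `decode_wait_status` is a hypothetical helper name, not part of CRIU or lxd.

```python
def decode_wait_status(status):
    """Split a Linux-style wait status into (exit_code, signal_number).

    Assumes the conventional encoding: exit code in bits 8-15,
    terminating signal in the low 7 bits (0 if the child exited normally).
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


# The value reported in the failed dump above.
print(decode_wait_status(65280))
```

If that reading is right, the useful information is in the earlier "mount failed: Device or resource busy" line, not in the number itself.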
Tail of /var/log/lxd/second/migration_dump_2016-04-10T14\:39\:43-07\:00.log:
(03.722460) Process: 579(23837)
(03.722464) ----------------------------------------
(03.722477) Dumping 1(23057)'s namespaces
(03.723377) Dump CGROUP namespace info 14 via 23057
(03.724066) Dump UTS namespace 11 via 23057
(03.724662) Dump IPC namespace 10 via 23057
(03.724924) IPC shared memory segments: 0
(03.724934) IPC message queues: 0
(03.724941) IPC semaphore sets: 0
(03.725346) Dump NET namespace info 9 via 23057
(03.725525) Mount ns' sysfs in crtools-sys.JZ9n0X
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742659) Error (namespaces.c:910): Namespaces dumping finished with error 65280
(03.742876) Unlock network
(03.742883) Running network-unlock scripts
(03.747031) Unfreezing tasks into 1
(03.747049) Unseizing 23057 into 1
(03.747061) Unseizing 23147 into 1
(03.747069) Unseizing 23148 into 1
(03.747077) Unseizing 23182 into 1
(03.747084) Unseizing 23272 into 1
(03.747091) Unseizing 23279 into 1
(03.747125) Unseizing 23282 into 1
(03.747140) Unseizing 23297 into 1
(03.747162) Unseizing 23301 into 1
(03.747181) Unseizing 23306 into 1
(03.747223) Unseizing 23309 into 1
(03.747232) Unseizing 23330 into 1
(03.747247) Unseizing 23345 into 1
(03.747287) Unseizing 23474 into 1
(03.747299) Unseizing 23577 into 1
(03.747310) Unseizing 23675 into 1
(03.747319) Unseizing 23688 into 1
(03.747328) Unseizing 23691 into 1
(03.747339) Unseizing 23835 into 1
(03.747348) Unseizing 23836 into 1
(03.747357) Unseizing 23837 into 1
(03.747443) Error (cr-dump.c:1600): Dumping FAILED.
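Since most of the log tail is unfreeze noise, here is a rough sketch for pulling out only the error lines from a dump log like the one above. It assumes the "(timestamp) Error (file:line): message" layout seen in this excerpt; `extract_errors` and `ERROR_LINE` are hypothetical names, and continuation lines from wrapped messages are not rejoined.

```python
import re

# Matches lines such as:
# (03.725544) Error (net.c:1048): mount failed: Device or resource busy
ERROR_LINE = re.compile(r"^\((\d+\.\d+)\) Error \(([^:()]+):(\d+)\): (.*)$")


def extract_errors(log_text):
    """Return (timestamp, source_file, line_number, message) per error line."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            ts, src, lineno, msg = match.groups()
            errors.append((float(ts), src, int(lineno), msg))
    return errors


# A few lines copied from the log tail above.
sample = """(03.725525) Mount ns' sysfs in crtools-sys.JZ9n0X
(03.725544) Error (net.c:1048): mount failed: Device or resource busy
(03.742876) Unlock network
(03.747443) Error (cr-dump.c:1600): Dumping FAILED."""

for entry in extract_errors(sample):
    print(entry)
```

To run it against a real log, read the file contents into a string and pass that to `extract_errors` in place of `sample`.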
I'm happy to supply additional info, or test patches -
Regards,
Jake
On Fri, Apr 8, 2016 at 4:40 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
> Ah, never mind - it doesn't appear to be solely a criu issue - even
> migration of stopped containers hangs forever now.
>
> Jake
>
> On Fri, Apr 8, 2016 at 4:23 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>> Ubuntu 16.04, up to date -
>>
>> After today's updates, including a kernel upgrade to 4.4.0-18, I tried
>> live migration again:
>>
>> root@raskolnikov:~# lxc move third lxd2:
>>
>> One hour later:
>>
>> root@raskolnikov:~# lxc move third lxd2:
>>
>> Still stuck, and the migration file in /var/log/lxd/third has not been created.
>>
>> Tycho said on Mar 30 that the situation should be sorted soon, but
>> mentioned the git repo:
>> https://github.com/tych0/criu/tree/cgroup-root-mount
>>
>> Should live migration work with criu from git?
>>
>> Feel free to advise me on what information I can supply, not only for
>> the ct migration issues, but also for the new dhcp issue
>>
>> Regards,
>>
>> Jake
>>
>>
>> On Thu, Apr 7, 2016 at 11:01 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>>> (Bump) -
>>>
>>> Any thoughts on what to try for the CT migration and dhcp issues?
>>> Running up to date ubuntu 16.04 beta -
>>>
>>> Regards,
>>>
>>> Jake
>>>
>>> On Wed, Apr 6, 2016 at 3:18 PM, jjs - mainphrame <jjs at mainphrame.com> wrote:
>>>> Greetings -
>>>>
>>>> I've not yet been able to reproduce that one shining moment from Mar
>>>> 29 when live migration of privileged containers was working, under
>>>> kernel 4.4.0-15.
>>>>
>>>> To recap, live container migration broke with 4.4.0-16, and is still
>>>> broken in 4.4.0-17 - but now, instead of producing an error message,
>>>> an attempt to live migrate a container merely hangs forever. Is that
>>>> expected, or should I be seeing something more? BTW - the migration
>>>> dump log for that container hasn't been touched for a week. I'll be
>>>> glad to supply more info if this is not a known issue.
>>>>
>>>> Recent updates seem to have created a new problem: the CTs which
>>>> configure their own network settings work (aside from migration) but
>>>> none of the CTs which depend on dhcp are getting IPs. BTW I'm using a
>>>> bridge connected to my local network and dhcp, not the default lxc
>>>> dhcp server. I see the packets on the host bridge, but they don't
>>>> reach the dhcp server. I'd be curious to know if there have been any
>>>> dhcp issues since recent updates. If not, I'll need to troubleshoot
>>>> other causes, but it's odd that dhcp simply stops working for all CTs
>>>> on both lxd hosts after updates.
>>>>
>>>> Jake
>>>>
>>>>
>>>> On Wed, Mar 30, 2016 at 6:27 AM, Tycho Andersen
>>>> <tycho.andersen at canonical.com> wrote:
>>>>> On Tue, Mar 29, 2016 at 11:17:26PM -0700, jjs - mainphrame wrote:
>>>>>> Well, I've found some interesting things here today. I created a couple of
>>>>>> privileged xenial containers, and sure enough, I was able to live migrate
>>>>>> them back and forth between the 2 lxd hosts.
>>>>>>
>>>>>> So far, so good.
>>>>>>
>>>>>> Then I did an apt upgrade - among the changes was a kernel change from
>>>>>> 4.4.0-15 to 4.4.0-16 - and live migration stopped working.
>>>>>>
>>>>>> Here are the failure messages that resulted from attempting the very same
>>>>>> live migrations that worked before the upgrade and reboot into 4.4.0-16:
>>>>>>
>>>>>> root@raskolnikov:~# lxc move akira lxd2:
>>>>>> error: Error transferring container data: checkpoint failed:
>>>>>> (00.092234) Error (mount.c:740): mnt: 83:./sys/fs/cgroup/devices doesn't have a proper root mount
>>>>>> (00.098187) Error (cr-dump.c:1600): Dumping FAILED.
>>>>>>
>>>>>>
>>>>>> root@ronnie:~# lxc move third lxd:
>>>>>> error: Error transferring container data: checkpoint failed:
>>>>>> (00.076107) Error (mount.c:740): mnt: 326:./sys/fs/cgroup/perf_event doesn't have a proper root mount
>>>>>> (00.080388) Error (cr-dump.c:1600): Dumping FAILED.
>>>>>
>>>>> Yep, this is a known issue with -16. We need both a kernel patch and a
>>>>> patch to CRIU before it will start working again. I have a branch at:
>>>>>
>>>>> https://github.com/tych0/criu/tree/cgroup-root-mount
>>>>>
>>>>> which should work if you want to keep playing with it, but hopefully
>>>>> we'll have the situation sorted out in the next few days.
>>>>>
>>>>> Tycho
>>>>>
>>>>>> Jake
>>>>>>
>>>>>> PS - Thanks for the html mail heads-up - I've been using google mail
>>>>>> services for this domain. I'll have to look into the config options, and
>>>>>> see if I can do the needful.
>>>>>
>>>>>>
>>>>>> On Tue, Mar 29, 2016 at 12:45 PM, Andrey Repin <anrdaemon at yandex.ru> wrote:
>>>>>>
>>>>>> > Greetings, jjs - mainphrame!
>>>>>> >
>>>>>> > >> On Mon, Mar 28, 2016 at 08:47:24PM -0700, jjs - mainphrame wrote:
>>>>>> > >>> I've looked at ct migration between 2 ubuntu 16.04 hosts today, and had
>>>>>> > >>> some interesting problems; I find that migration of stopped containers
>>>>>> > >>> works fairly reliably; but live migration, well, it transfers a lot of
>>>>>> > >>> data, then exits with a failure message. I can then move the same
>>>>>> > >>> container, stopped, with no problem.
>>>>>> > >>>
>>>>>> > >>> The error is the same every time, a failure of "mkdtemp" -
>>>>>> > >>
>>>>>> > >> It looks like your host /tmp isn't writable by the uid map that the
>>>>>> > >> container is being restored as?
>>>>>> >
>>>>>> >
>>>>>> > > Which is odd, since /tmp has 1777 perms on both hosts, so I don't see how
>>>>>> > > it could be a permissions problem. Surely the default apparmor profile is
>>>>>> > > not the cause? You did give me a new idea though, and I'll set up a test
>>>>>> > > with privileged containers for comparison. Is there a switch to enable
>>>>>> > > verbose logging?
>>>>>> >
>>>>>> > I've run into the same issue once. Stumbled over it for nearly a month,
>>>>>> > falsely blaming LXC.
>>>>>> > Recreating the container's rootfs from scratch resolved the issue.
>>>>>> > I don't know what caused it to begin with; must've been some kind of
>>>>>> > glitch.
>>>>>> >
>>>>>> > P.S.
>>>>>> > It would be great if you could configure your mail client to not use
>>>>>> > HTML format for lists.
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > With best regards,
>>>>>> > Andrey Repin
>>>>>> > Tuesday, March 29, 2016 22:43:04
>>>>>> >
>>>>>> > Sorry for my terrible english...
>>>>>> > _______________________________________________
>>>>>> > lxc-users mailing list
>>>>>> > lxc-users at lists.linuxcontainers.org
>>>>>> > http://lists.linuxcontainers.org/listinfo/lxc-users
>>>>>> >
>>>>>