[lxc-users] Does cpu cgroup has been enabled in lxc/lxd

Mon Nov 5 16:32:07 UTC 2018

On November 5, 2018 8:12:35 AM GMT+03:00, kemi <kemi.wang at intel.com> wrote:
>
>
>On 2018/11/2 8:05 PM, Fajar A. Nugraha wrote:
>> On Fri, Nov 2, 2018 at 8:44 AM, kemi <kemi.wang at intel.com> wrote:
>> 
>>>
>>> thx for your question.
>>> In our case, our customers want to run Android games within containers
>>> on the cloud.
>>>
>> 
>> It might be possible for you to adjust https://anbox.io/ to run on lxd
>> instead of lxc. YMMV.
>> 
>
>anbox provides a GUI interface to run android in container.
>We don't need that GUI which leads to extra overhead. Also,
>Anbox can't offer thousands of containers running in parallel.
>
>>> There are two problems we know of.
>>> The first one occurs during Android OS boot. Android's coldboot has to
>>> write to the uevent files in /sys, which triggers a uevent broadcast to
>>> all listeners (udev daemons) in user space (the uevent is sent from the
>>> kernel via netlink). As the number of containers increases (200+), we
>>> found that the boot latency reaches 1~2 minutes, and the latency would be
>>> intolerable once the number reaches 500.

That is no longer true from kernels 4.17 onwards. I should really write a
blogpost about my patchset it seems. This keeps popping up every now and then.
So, I'm going to explain this in a little more detail here.
Uevents were previously broadcast into all network namespaces. This was
obviously problematic:
- You could be smarter than you should be and trick the system into running a
second udev daemon in a non-initial network namespace that is owned by the
initial user namespace. That has the potential to wreck the system. However
this only affects privileged containers that would be dumb enough to mount /sys
read-write.
- You could see an insane performance hit when you ran large numbers of
containers that each ran a udev daemon since the kernel would broadcast these
events to all of them. This is made worse by the fact that in non-initial
network namespaces that are owned by non-initial user namespaces the kernel
would not fix up the uid and gid relative to the owning user namespace of the
network namespace. That meant user space would see those events with
INVALID_{G,U}ID which causes udev to ignore those events. Effectively, the
kernel was screaming uevents into the void for absolutely no good reason.
Moreover, the id permissions weren't even fixed up for namespaced devices such
as network devices that can be owned by different network namespaces (e.g.
moving a physical network device into an unprivileged container).
- You could technically spy on the host's device events from an unprivileged
container. It's probably not an attack vector but it is definitely an
information leak.
- You had no way of delegating a device to a container since uevents that were
received for it were unusable (cf. above) but you also had no way of
injecting/forwarding uevents to a container.

For all those reasons I wrote several patches that namespace uevents and allow
injecting uevents:

- 94e5e3087a67c765be98592b36d8d187566478d5
- 692ec06d7c92af8ca841a6367648b9b3045344fd
- 26045a7b14bc7a5455e411d820110f66557d6589        
- a3498436b3a0f8ec289e6847e1de40b4123e1639

So, the first two patches make it possible to forward/inject uevents into other
network namespaces if the caller has CAP_NET_ADMIN in the owning user namespace
of the target network namespace. This effectively allows for device namespaces.
Any forwarded/injected uevent should strip/not add a sequence number. The
kernel will append the correct sequence number to the buffer itself.
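
To illustrate the sequence number rule, here is a minimal sketch of how a
forwarder could prepare a received uevent buffer before injecting it. It
assumes the usual kernel uevent layout (an "action@devpath" header followed by
NUL-separated KEY=VALUE properties). The helper names parse_uevent and
strip_seqnum are hypothetical, and the actual netlink send into the target
network namespace is deliberately left out.

```python
from typing import List, Tuple


def parse_uevent(buf: bytes) -> Tuple[bytes, List[bytes]]:
    """Split a raw uevent into its header and its KEY=VALUE properties.

    Assumes the kernel layout: b"action@devpath\\0KEY=VALUE\\0KEY=VALUE\\0".
    """
    fields = [field for field in buf.split(b"\x00") if field]
    if not fields:
        raise ValueError("empty uevent buffer")
    return fields[0], fields[1:]


def strip_seqnum(buf: bytes) -> bytes:
    """Return the uevent without its SEQNUM= property.

    The kernel appends a fresh sequence number to injected uevents, so the
    forwarder must not carry over the old one.
    """
    header, props = parse_uevent(buf)
    kept = [prop for prop in props if not prop.startswith(b"SEQNUM=")]
    return b"\x00".join([header, *kept]) + b"\x00"


if __name__ == "__main__":
    # Example buffer made up for illustration, not captured from a real system.
    raw = (
        b"add@/devices/virtual/net/eth1\x00ACTION=add\x00"
        b"DEVPATH=/devices/virtual/net/eth1\x00SUBSYSTEM=net\x00SEQNUM=4711\x00"
    )
    print(strip_seqnum(raw))
```

The resulting buffer is what would be written to a uevent netlink socket opened
inside the target network namespace by a caller holding CAP_NET_ADMIN in its
owning user namespace.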

The following two patches are concerned with isolating uevents aka namespacing
them more cleanly. Because #legacybehavior we came up with the following logic:
uevents are restricted to all network namespaces that are owned by the initial
user namespace. This implies that all non-initial network namespaces that are
owned by non-initial user namespaces do not receive any uevents unless the
kobject (the in-kernel device representation, e.g. of a network device) carries
a namespace tag or a uevent is forwarded/injected. My patches ensure that network
namespace specific uevents and forwarded/injected uevents get their permissions
fixed-up according to the owning user namespace of the target network
namespace. This has the nice consequence that delegated network devices
(physical, virtual, SR-IOV) can now be seen by udev inside unprivileged
containers.
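
The delivery rules above can be written down as a small decision function.
This is only a simplified user-space model of the logic as described here, not
kernel code. The NetNs and Uevent types and the receives function are
hypothetical, and the model assumes that a tagged or injected uevent is
delivered only to the matching network namespace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetNs:
    name: str
    owned_by_initial_userns: bool


@dataclass(frozen=True)
class Uevent:
    device: str
    # Name of the network namespace the kobject is tagged with (e.g. a
    # network device living in that namespace), or None if untagged.
    netns_tag: Optional[str] = None
    # Name of the network namespace the event was forwarded/injected into
    # from user space, or None for events generated by the kernel.
    injected_into: Optional[str] = None


def receives(ns: NetNs, ev: Uevent) -> bool:
    """Model whether listeners in network namespace `ns` see uevent `ev`."""
    if ev.injected_into is not None:
        # Forwarded/injected uevents only reach their target namespace.
        return ev.injected_into == ns.name
    if ev.netns_tag is not None:
        # Assumption: tagged kobjects only notify their own namespace.
        return ev.netns_tag == ns.name
    # Untagged uevents only reach namespaces owned by the initial user namespace.
    return ns.owned_by_initial_userns


if __name__ == "__main__":
    host = NetNs("host", owned_by_initial_userns=True)
    unpriv = NetNs("c1", owned_by_initial_userns=False)
    disk = Uevent("sda")
    nic = Uevent("eth1", netns_tag="c1")
    for ns in (host, unpriv):
        for ev in (disk, nic):
            print(ns.name, ev.device, receives(ns, ev))
```

In this model a host-wide event such as a disk hotplug never reaches the
unprivileged container, which is why the broadcast cost no longer grows with
the number of such containers.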

So if uevents were a bottleneck for you then it shouldn't be the case anymore
for unprivileged containers at least. The in-kernel locking is also improved by
my patches and I have plans to further improve it. I just need to find the
time.
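
If you want to check whether a host is new enough, a rough sketch like the
following compares the running kernel release against the 4.17 threshold
mentioned above. The helper name kernel_at_least is hypothetical, and the
check only looks at the version string, so it cannot detect distribution
kernels that backported (or lack) the patches.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Return True if a release string like '4.17.0-rc1' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()
    print(release, kernel_at_least(release, 4, 17))
```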

If you're running privileged containers and uevents are still a bottleneck for
you we can think about a per-network-namespace sysctl that might allow you to
opt-in or out per network namespace. Although I doubt that's a clean enough
option.

>>>
>>>
>> I don't see udev running inside its lxc container, so perhaps they've
>> managed to solve that issue

Udevd will usually not run in privileged containers: /sys is mounted ro
there, so udevd won't start. In unprivileged containers, however, /sys can
safely be mounted rw and udev will start.
This also makes sense on kernels with my patches added. (cf. above).
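
To see which case a given container is in, you can look at how /sys is
mounted. The following is a minimal sketch that parses text in the
/proc/mounts format. The function name sys_mount_is_rw is hypothetical, and
the sketch assumes that the last matching entry is the mount that is actually
visible.

```python
from typing import Optional


def sys_mount_is_rw(mounts_text: str, mount_point: str = "/sys") -> Optional[bool]:
    """Return True/False if `mount_point` is mounted rw/ro, None if not mounted.

    `mounts_text` is expected in /proc/mounts format:
    "<source> <mount point> <fstype> <options> <dump> <pass>".
    """
    result = None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[1] == mount_point:
            # Later entries are stacked on top of earlier ones, so keep the last.
            result = "rw" in parts[3].split(",")
    return result


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        print(sys_mount_is_rw(mounts.read()))
```

Run inside the container, a result of False would explain why udevd did not
start there.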

>> 
>
>root at kemi-desktop:/home/kemi/git# lxc list
>+--------+---------+---------------------+------+------------+-----------+
>|  NAME  |  STATE  |        IPV4         | IPV6 |    TYPE    | SNAPSHOTS |
>+--------+---------+---------------------+------+------------+-----------+
>| first  | RUNNING | 10.70.45.163 (eth0) |      | PERSISTENT | 0         |
>+--------+---------+---------------------+------+------------+-----------+
>| second | STOPPED |                     |      | PERSISTENT | 0         |
>+--------+---------+---------------------+------+------------+-----------+
>root at kemi-desktop:/home/kemi/git#
>root at kemi-desktop:/home/kemi/git# lxc exec first -- bash
>root at first:~#
>root at first:~# ps -ef|grep udev
>root        61     1  0 Nov01 ?        00:00:00 /lib/systemd/systemd-udevd
>root      2252  2241  0 05:07 ?        00:00:00 grep --color=auto udev
>root at first:~#
>
>Seems udevd (I used ubuntu 18.04 image) is running in lxc container.
>Correct me if I misunderstood something, thx.
>
>> 
>>> The second one occurs when an app in a container begins to run: it reads
>>> the /sys/devices/system/cpu/online file to get the number of available
>>> CPUs before creating threads accordingly. The problem is that sysfs is
>>> currently shared with the host, so the app gets a CPU count equal to the
>>> host's thread count even if the container's CPU count is limited.
>>>
>>>
>> If it simply reads the file, you could simply mount a text file on it.
>> Similar to what lxcfs does, but simpler.
>> 
>
>Good suggestion. We are considering this workaround.
>But it may not be a general solution, because no one knows which files in
>/sys will be used by apps in userspace.
>
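
For the text file workaround discussed here, the replacement file has to use
the same CPU list format as the real /sys/devices/system/cpu/online (for
example "0-3" or "0,2-3"). A minimal sketch, assuming you already know which
CPUs the container is limited to. The helper names format_cpu_list and
parse_cpu_list are hypothetical, and how the file gets bind-mounted over the
sysfs path is left to the container configuration.

```python
from typing import Iterable, List


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Render CPU ids in the sysfs list format, e.g. [0, 1, 2, 3] -> '0-3\\n'."""
    ordered = sorted(set(cpus))
    ranges = []
    i = 0
    while i < len(ordered):
        start = ordered[i]
        # Extend the current range while the next id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == ordered[i] + 1:
            i += 1
        end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges) + "\n"


def parse_cpu_list(text: str) -> List[int]:
    """Expand a sysfs CPU list such as '0,2-3' into [0, 2, 3]."""
    cpus: List[int] = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            low, high = chunk.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(chunk))
    return cpus


if __name__ == "__main__":
    content = format_cpu_list([0, 1, 2, 3])
    print(repr(content), len(parse_cpu_list(content)))
```

An app that counts the entries in this file would then see four CPUs instead
of the host's full thread count.
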
>> 
>> _______________________________________________
>> lxc-users mailing list
>> lxc-users at lists.linuxcontainers.org
>> http://lists.linuxcontainers.org/listinfo/lxc-users
>> 