[lxc-users] Does cpu cgroup has been enabled in lxc/lxd

Tue Nov 6 09:15:31 UTC 2018

Hi, Christian
  Thank you for your detailed explanation here :)

On 2018/11/6 12:32 AM, Christian Brauner wrote:

> 
> That is no longer true from kernels 4.17 onwards. 

Yes, it should be.
I googled for a solution for this issue and luckily found your patch series.
The evaluation work is ongoing, and I plan to port your patch series to Ubuntu
16.04 with kernel version 4.4.98 in my case.

> I should really write a
> blogpost about my patchset it seems. This keeps popping up every now and then.
> So, I'm going to explain this in a little more detail here.
> Uevents were previously broadcast into all network namespaces. This was
> obviously problematic:
> - You could be smarter than you should be and trick the system into running a
> second udev daemon in a non-initial network namespace that is owned by the
> initial user namespace. That has the potential to wreck the system. However
> this only affects privileged containers that would be dumb enough to mount /sys
> read-write.
> - You could see an insane performance hit when you ran large numbers of
> containers that each ran a udev daemon since the kernel would broadcast these
> events to all of them. This is made worse by the fact that in non-initial
> network namespaces that are owned by non-initial user namespaces the kernel
> would not fix up the uid and gid relative to the owning user namespace of the
> network namespace. That meant user space would see those events with
> INVALID_{G,U}ID which causes udev to ignore those events.

Agreed.
But why does broadcasting a uevent to all listeners (ueventd in Android)
lead to a long response latency in ueventd?
The uevent is broadcast to all listeners in parallel, right?
If all listeners get the notification at the same time, we should not observe a long
response latency in ueventd. I have probably misunderstood something here; please correct me.

> Effectively, the
> kernel was screaming uevents into the void for absolutely no good reason.
> Moreover the id permissions weren't even fixed up for namespaced devices such
> as network devices that can be owned by different network namespaces (e.g.
> moving a physical network device into an unprivileged container)
> - You could technically spy on the hosts device events from an unprivileged
> container. It's probably not an attack vector but it is definitely an
> information leak.
> - You had no way of delegating a device to a container since uevents that were
> received for it were unusable (cf. above) but you also had no way of
> injecting/forwarding uevents to a container.
> 
> For all those reasons I wrote several patches that namespace uevents and allow
> injecting uevents:
> 
> - 94e5e3087a67c765be98592b36d8d187566478d5
> - 692ec06d7c92af8ca841a6367648b9b3045344fd
> - 26045a7b14bc7a5455e411d820110f66557d6589        
> - a3498436b3a0f8ec289e6847e1de40b4123e1639
> 
> So, the first two patches make it possible to forward/inject uevents into other
> network namespaces if the caller has CAP_NET_ADMIN in the owning user namespace
> of the target network namespace. This effectively allows for device namespaces.
> Any forwarded/injected uevent should strip/not add a sequence number. The
> kernel will append the correct sequence number to the buffer itself.
> 
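
To make the sequence-number point concrete for myself, here is a minimal
sketch of preparing a uevent buffer for forwarding. It assumes the usual
kernel uevent layout of an "action@devpath" header followed by NUL-separated
KEY=VALUE fields. The function names and the sample event are hypothetical,
and the sketch only handles the buffer; it does not open a netlink socket or
send anything.

```python
def parse_uevent(buf: bytes):
    """Split a kernel-style uevent buffer into its header and KEY=VALUE fields."""
    parts = [p for p in buf.split(b"\0") if p]
    header, fields = parts[0], parts[1:]
    env = {}
    for field in fields:
        key, _, value = field.partition(b"=")
        env[key.decode()] = value.decode()
    return header.decode(), env


def build_injectable_uevent(buf: bytes) -> bytes:
    """Rebuild the buffer without SEQNUM so the kernel can append its own."""
    header, env = parse_uevent(buf)
    env.pop("SEQNUM", None)  # the forwarder must not carry a sequence number
    out = [header.encode()] + [f"{k}={v}".encode() for k, v in env.items()]
    return b"\0".join(out) + b"\0"


# Made-up sample event, for illustration only.
raw = (b"add@/devices/virtual/net/eth0\0ACTION=add\0"
       b"SUBSYSTEM=net\0SEQNUM=4242\0")
stripped = build_injectable_uevent(raw)
```

The remaining fields keep their original order, so the forwarded buffer
differs from the received one only by the missing SEQNUM entry.
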
> The following two patches are concerned with isolating uevents aka namespacing
> them more cleanly. Because #legacybehavior we came up with the following logic:
> uevents are restricted to all network namespaces that are owned by the initial
> user namespace. This implies that all non-initial network namespaces that are
> owned by non-initial user namespaces do not receive any uevents unless the
> kobject (in-kernel device representation) (e.g. network devices) carries a
> namespace tag or a uevent is forwarded/injected. My patches ensure that network
> namespace specific uevents and forwarded/injected uevents get their permissions
> fixed-up according to the owning user namespace of the target network
> namespace. This has the nice consequence that delegated network devices
> (physical, virtual, SR-IOV) can now be seen by udev inside unprivileged
> containers.
> 
> So if uevents were a bottleneck for you then it shouldn't be the case anymore
> for unprivileged containers at least. 

Yes. ueventd should not be started in an unprivileged container.
We will try to use unprivileged containers in the future, but that will take time.
Currently, we are using privileged containers.
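
To check my reading of the delivery rule you describe, I wrote it down as a
toy model. This is only a sketch of the logic as I understand it from your
explanation, not the kernel implementation; the class names, fields, and the
receivers() function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NetNs:
    name: str
    owned_by_initial_userns: bool


@dataclass(frozen=True)
class Uevent:
    device: str
    tagged_netns: Optional[str] = None   # namespace tag carried by the kobject
    injected_into: Optional[str] = None  # target of a forwarded/injected uevent


def receivers(event: Uevent, namespaces: List[NetNs]) -> List[str]:
    """Return the names of the network namespaces that would see the event."""
    if event.injected_into is not None:
        # Forwarded/injected uevents go only to the target namespace.
        return [ns.name for ns in namespaces if ns.name == event.injected_into]
    if event.tagged_netns is not None:
        # Namespaced devices (e.g. network devices) go to their own namespace.
        return [ns.name for ns in namespaces if ns.name == event.tagged_netns]
    # Everything else reaches all namespaces owned by the initial user namespace.
    return [ns.name for ns in namespaces if ns.owned_by_initial_userns]


namespaces = [
    NetNs("host", True),
    NetNs("privileged-ct", True),
    NetNs("unprivileged-ct", False),
]
print(receivers(Uevent("sda"), namespaces))
print(receivers(Uevent("eth1", tagged_netns="unprivileged-ct"), namespaces))
```

In this model an untagged host event still reaches the privileged container,
because its network namespace is owned by the initial user namespace, while
the unprivileged container only sees tagged or injected events.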

> The in-kernel locking is also improved by
> my patches and I have plans to further improve it. I just need to find the
> time.
> 
> If you're running privileged containers and uevents are still a bottleneck for
> you we can think about a per-network-namespace sysctl that might allow you to
> opt-in or out per network namespace. Although I doubt that's a clean enough
> option.
> 

If I understand correctly, even in privileged containers, the uevent broadcast issue
will not be a problem with your patch series above, since a uevent will only be
forwarded to the particular listener that is in the same network namespace as that uevent.