[lxc-devel] Kernel bug? Setuid apps and user namespaces

Fri Oct 25 15:00:12 UTC 2013

Quoting Sean Pajot (sean.pajot at execulink.com):
> On 10/23/2013 12:54 AM, Eric W. Biederman wrote:
> > Sean Pajot <sean.pajot at execulink.com> writes:
> > 
> >> On 10/22/2013 03:50 PM, Eric W. Biederman wrote:
> >>> Serge Hallyn <serge.hallyn at ubuntu.com> writes:
> >>>
> >>>> Quoting Sean Pajot (sean.pajot at execulink.com):
> >>>>> I've been playing with User Namespaces somewhat extensively and I think I've
> >>>>> come across a bug in the handling of /proc/$PID/ entries.
> >>>>>
> >>>>> This is my example case on a 3.10.x kernel:
> >>>>>
> >>>>> -- /var/lib/lxc/test1/config
> >>>>>
> >>>>> lxc.rootfs = /lxc/c1
> >>>>> lxc.id_map = u 0 1000000 100000
> >>>>> lxc.id_map = g 0 1000000 100000
> >>>>> lxc.network.type = none
> >>>>>
> >>>>> lxc.tty = 6
> >>>>>
> >>>>> == END
> >>>>>
> >>>>> On one console log in as a non-root user and run "su", as an example of a
> >>>>> setuid root application. On another console log in as root and examine
> >>>>> /proc/$(pidof su). You'll find all the files are owned by the "nobody" user
> >>>>> and inaccessible. The reason is on the host you'll find these files are owned
> >>>>> by "root", uid 0, which is odd because in the container they should be uid
> >>>>> 1000000 from the mappings.
> >>>>>
> >>>>> I tracked down the cause to kernel source file /fs/proc/base.c function
> >>>>> pid_revalidate which contains static references to GLOBAL_ROOT_UID and
> >>>>> GLOBAL_ROOT_GID which are always UID 0 on the host. This little patch, which
> >>>>> might not be correct in terms of kernel standards, appears to mostly solve the
> >>>>> issue. It doesn't affect all entries in /proc/$PID but gets the majority of them.
> >>>>>
> >>>>> Thoughts or opinions?
> >>>>
> >>>> Awesome - I've seen this bug and so far not had time to dig.  
> >>>>
> >>>> The patch offhand looks good to me.  Do you mind sending it to
> >>>> lkml?
> 
> Given the discussion that this has started to create I'm going to hold off on
> that. Maybe someone else should take over since it sounds like this is going
> in other directions.

...

> >>
> >>
> >>>>
> >>>> Acked-by: Serge E. Hallyn <serge.hallyn at ubuntu.com>
> >>>>
> >>
> >> Well I wasn't expecting that... :)
> >>
> >>>
> >>> It is definitely worth looking at.  I punted on this when I did the
> >>> initial round of conversions.  Tasks that we don't consider dumpable are
> >>> weird.
> >>>
> >>> At first glance this is fine.  However __task_cred does not return NULL so
> >>> handling that case is nonsense and confusing.
> >>>
> >>> Eric
> >>>
> >>
> >> I thought so, but I wanted to have a failsafe since I'm running this code on
> >> the same machine I'm typing this message on.
> >> This is my first patch that had a chance of making it into the kernel so I'm
> >> honestly making things up as I go. I put that there so in the event a NULL
> >> cred showed up there would be known symptoms besides an Oops.
> >>
> >> On my system I still have the "ns" directory marked as owned by host's uid 0
> >> but since the permissions are 511 (?) and the namespace objects are owned by
> >> container's uid 0 it doesn't really impact much. That could probably use
> >> fixing but the use cases are generally usable now.
> >>
> >> That aside, you really think it's okay for inclusion in the kernel with
> >> cred!=NULL fixed?
> > 
> > Someone needs to read and think through all of the corner cases and see
> > if we can ever have a time when task_dumpable is false but root in the
> > container would not or should not be able to see everything.
> > 
> > In particular I am worried about the case of a setuid app calling setns,
> > and entering a lesser privileged user namespace.  In my foggy mind that
> > might be a security problem.  And there might be other similar crazy
> > cases.
> > 
> > But the code itself looks good, and the bug hunting seems solid.
> > 
> > If my concerns about a setuid app calling setns are valid what we can
> > likely do with dumpable is record the kuid of the userns root when the
> > task becomes non-dumpable, and use that for i_uid and i_gid.
> 
> I see calling setns as a process voluntarily putting itself at the mercy of
> said namespace. Also there are potential ways to protect yourself, such as not
> joining the PID namespace as well, so from my naive standpoint it's not that
> big of a concern.

I'd agree.  However, I guess an appropriate question is whether there is
a reasonable way for lxc to work around this.  Maybe clone, fork, and
clone(CLONE_NEWPID | CLONE_PARENT) or somesuch (since init has to be
pid 1, but we want it - iiuc - not to be the first task to have done
CLONE_NEWUSER)?
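
For readers following the thread, the symptom in the original report can be
made concrete with a minimal sketch of how the quoted "lxc.id_map = u 0
1000000 100000" line translates uids. This is not kernel code: the names
IdMap, to_host and from_host are hypothetical, and the overflow value 65534
is assumed to be the default of /proc/sys/kernel/overflowuid (commonly shown
as "nobody"). It only illustrates why a /proc entry owned by host uid 0 has
no name inside the container, while one owned by host uid 1000000 would
appear as container root.

```python
from typing import List, NamedTuple, Optional

# Assumed default of /proc/sys/kernel/overflowuid; usually displayed as "nobody".
OVERFLOW_UID = 65534


class IdMap(NamedTuple):
    """One map line: <first id in namespace> <first id on host> <range length>."""
    ns_start: int
    host_start: int
    count: int


def to_host(maps: List[IdMap], ns_id: int) -> Optional[int]:
    """Translate an id seen inside the namespace to the host id, or None if unmapped."""
    for m in maps:
        if m.ns_start <= ns_id < m.ns_start + m.count:
            return m.host_start + (ns_id - m.ns_start)
    return None


def from_host(maps: List[IdMap], host_id: int) -> int:
    """Translate a host id to what the namespace sees; unmapped ids show as the overflow id."""
    for m in maps:
        if m.host_start <= host_id < m.host_start + m.count:
            return m.ns_start + (host_id - m.host_start)
    return OVERFLOW_UID


# The uid mapping from the quoted config: u 0 1000000 100000
uid_map = [IdMap(ns_start=0, host_start=1000000, count=100000)]

if __name__ == "__main__":
    print("container root on host:", to_host(uid_map, 0))
    print("host uid 0 seen in container:", from_host(uid_map, 0))
    print("host uid 1000000 seen in container:", from_host(uid_map, 1000000))
```

Under these assumptions, files that pid_revalidate assigns to GLOBAL_ROOT_UID
(host uid 0) fall outside the mapped range, which matches the "nobody"
ownership described in the report.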