[lxc-devel] Kernel bug? Setuid apps and user namespaces

Fri Apr 4 19:28:11 UTC 2014

On Fri, Apr 4, 2014 at 12:10 PM, Serge Hallyn <serge.hallyn at ubuntu.com> wrote:
> Quoting Andy Lutomirski (luto at amacapital.net):
>> On Fri, Apr 4, 2014 at 11:30 AM, Serge Hallyn <serge.hallyn at ubuntu.com> wrote:
>> > Quoting Andy Lutomirski (luto at amacapital.net):
>> >> On 04/02/2014 10:32 AM, Serge E. Hallyn wrote:
>> >> > (Sorry - the lxc-devel list has moved, so replying to all with the
>> >> > correct list address; please reply to this rather than my previous
>> >> > email)
>> >> >
>> >> > Quoting Serge Hallyn (serge.hallyn at ubuntu.com):
>> >> >> Hi Eric,
>> >> >>
>> >> >> (sorry, I don't seem to have the email I actually wanted to reply
>> >> >> to in my mbox, but it is
>> >> >> https://lists.linuxcontainers.org/pipermail/lxc-devel/2013-October/005857.html)
>> >> >>
>> >> >> You'd said,
>> >> >>> Someone needs to read and think through all of the corner cases and see
>> >> >>> if we can ever have a time when task_dumpable is false but root in the
>> >> >>> container would not or should not be able to see everything.
>> >> >>>
>> >> >>> In particular I am worried about the case of a setuid app calling setns,
>> >> >>> and entering a lesser privileged user namespace.  In my foggy mind that
>> >> >>> might be a security problem.  And there might be other similar crazy
>> >> >>> cases.
>> >> >>
>> >> >> Can we make use of current->mm->exe_file->f_cred->user_ns?
>> >> >>
>> >> >> So either always use
>> >> >> make_kuid(current->mm->exe_file->f_cred->user_ns, 0)
>> >> >> instead of make_kuid(cred->user_ns, 0), or check that
>> >> >> (current->mm->exe_file->f_cred->user_ns == cred->user_ns)
>> >> >> and, if not, assume that the caller has done a setns?
>> >>
>> >> Do you have a summary of the issue?  I'm a little lost here.
>> >
>> > Sure - when running an unprivileged container, tasks which become
>> > !dumpable end up with /proc/$pid/fd/ being owned by the global
>> > root user, which inside the container is nobody:nogroup.  Examples
>> > are the user's sshd threads and apache, and in the past I think I've
>> > seen it with logind or getty too.
>>
>> Other than the aesthetics, why does this matter?  Things in the
>> container who are actually mapped to nobody still can't access those
>> files?
>
> Bc root cannot look at the fds.

Right.  I guess this is a problem.

>
>> The alternative (using the container's owner) sounds a bit scary.
>
> If the file being run belongs to the container, why would it be scary?
> Bc some fds may have been not closed when the task did execve, where
> the previous bprm file may have been on the host?

Meh.  I'm not worried about that case, and that one probably doesn't
cause !dumpable anyway.  The nasty cases are unshare and setns.

I'm starting to think that we need to extend dumpable to something
much more general like a list of struct creds that someone needs to be
able to ptrace, *in addition to current creds*, in order to access
sensitive /proc files, coredumps, etc.  If you get started as setuid,
then you start with two struct creds in the list (or maybe just your
euid and uid).  If you get started !setuid, then your initial creds
are in the list.  It's possible that few or no things will need to
change that list after execve.

If all of the entries and current->cred are in the same user_ns, then
we can dump as userns root.  If they're in different usernses, then we
dump as global root or maybe the common ancestor root.
setuid(getuid()) and other such nastiness may have to empty the list,
or maybe we can just use a prctl for that.
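
A rough user-space sketch of the rule above, for illustration only. It
is not kernel code: struct user_ns and struct cred here are toy
stand-ins that keep only the parent link and the owning namespace, and
common_ancestor() and dump_owner_ns() are hypothetical names. Of the two
options mentioned for the mixed case, the sketch assumes the "common
ancestor root" one; with a single level of nesting that is the same as
global root.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's user namespace: only the parent link. */
struct user_ns {
    const struct user_ns *parent;   /* NULL for the initial (global) ns */
};

/* Toy stand-in for struct cred: only the owning user namespace. */
struct cred {
    const struct user_ns *user_ns;
};

/* Number of parent links between ns and the initial namespace. */
int ns_depth(const struct user_ns *ns)
{
    int depth = 0;

    assert(ns != NULL);
    while (ns->parent) {
        ns = ns->parent;
        depth++;
    }
    return depth;
}

/* Closest namespace that is an ancestor of (or equal to) both a and b. */
const struct user_ns *common_ancestor(const struct user_ns *a,
                                      const struct user_ns *b)
{
    int da = ns_depth(a);
    int db = ns_depth(b);

    /* Bring both to the same depth, then walk up in lockstep. */
    while (da > db) { a = a->parent; da--; }
    while (db > da) { b = b->parent; db--; }
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/*
 * Namespace whose root would own the dump and the sensitive /proc files.
 * If current and every list entry share one user_ns, that user_ns is
 * returned (dump as userns root). Otherwise the result is their common
 * ancestor, which is the initial namespace in the worst case.
 */
const struct user_ns *dump_owner_ns(const struct cred *current,
                                    const struct cred *const *list,
                                    size_t n)
{
    const struct user_ns *ns = current->user_ns;
    size_t i;

    for (i = 0; i < n; i++)
        ns = common_ancestor(ns, list[i]->user_ns);
    return ns;
}
```

In this model a task whose creds all live in one container resolves to
that container's namespace, while a task that did setns from the host
into a container still carries a host cred in its list and so resolves
to the initial namespace.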

If this idea works and is straightforward to implement, it might
solve a number of problems.

--Andy