[lxc-devel] Kernel bug? Setuid apps and user namespaces

Fri Oct 25 00:13:11 UTC 2013

On 10/23/2013 12:54 AM, Eric W. Biederman wrote:
> Sean Pajot <sean.pajot at execulink.com> writes:
> 
>> On 10/22/2013 03:50 PM, Eric W. Biederman wrote:
>>> Serge Hallyn <serge.hallyn at ubuntu.com> writes:
>>>
>>>> Quoting Sean Pajot (sean.pajot at execulink.com):
>>>>> I've been playing with User Namespaces somewhat extensively and I think I've
>>>>> come across a bug in the handling of /proc/$PID/ entries.
>>>>>
>>>>> This is my example case on a 3.10.x kernel:
>>>>>
>>>>> -- /var/lib/lxc/test1/config
>>>>>
>>>>> lxc.rootfs = /lxc/c1
>>>>> lxc.id_map = u 0 1000000 100000
>>>>> lxc.id_map = g 0 1000000 100000
>>>>> lxc.network.type = none
>>>>>
>>>>> lxc.tty = 6
>>>>>
>>>>> == END
>>>>>
>>>>> On one console login as a non-root user and run "su", as an example of a
>>>>> setuid root application. On another console login as root and examine
>>>>> /proc/$(pidof su). You'll find all the files are owned by the "nobody" user
>>>>> and inaccessible. The reason is on the host you'll find these files are owned
>>>>> by "root", uid 0, which is odd because in the container they should be uid
>>>>> 1000000 from the mappings.
>>>>>
>>>>> I tracked down the cause to kernel source file /fs/proc/base.c function
>>>>> pid_revalidate which contains static references to GLOBAL_ROOT_UID and
>>>>> GLOBAL_ROOT_GID which are always UID 0 on the host. This little patch, which
>>>>> might not be correct in terms of kernel standards, appears to mostly solve the
>>>>> issue. It doesn't affect all entries in /proc/$PID but gets the majority of them.
>>>>>
>>>>> Thoughts or opinions?
>>>>
>>>> Awesome - I've seen this bug and so far not had time to dig.  
>>>>
>>>> The patch offhand looks good to me.  Do you mind sending it to
>>>> lkml?

Given the discussion that this has started to create I'm going to hold off on
that. Maybe someone else should take over since it sounds like this is going
in other directions.

>>
>>
>>>>
>>>> Acked-by: Serge E. Hallyn <serge.hallyn at ubuntu.com>
>>>>
>>
>> Well I wasn't expecting that... :)
>>
>>>
>>> It is definitely worth looking at.  I punted on this when I did the
>>> initial round of conversions.  Tasks that we don't consider dumpable are
>>> weird.
>>>
>>> At first glance this fine.  However __task_cred does not return NULL so
>>> handling that case is nonsense and confusing.
>>>
>>> Eric
>>>
>>
>> I thought so, but I wanted to have a failsafe since I'm running this code on
>> the same machine I'm typing this message on.
>> This is my first patch that had a chance of making it into the kernel so I'm
>> honestly making things up as I go. I put that there so in the event a NULL
>> cred showed up there would be known symptoms besides an Oops.
>>
>> On my system I still have the "ns" directory marked as owned by host's uid 0
>> but since the permissions are 511 (?) and the namespace objects are owned by
>> container's uid 0 it doesn't really impact much. That could probably use
>> fixing but the use cases are generally usable now.
>>
>> That aside, you really think it's okay for inclusion in the kernel with
>> cred!=NULL fixed?
> 
> Someone needs to read and think through all of the corner cases and see
> if we can ever have a time when task_dumpable is false but root in the
> container would not or should not be able to see everything.
> 
> In particular I am worried about the case of a setuid app calling setns,
> and entering a lesser privileged user namespace.  In my foggy mind that
> might be a security problem.  And there might be other similar crazy
> cases.
> 
> But the code itself looks good, and the bug hunting seems solid.
> 
> If my concerns about a setuid app calling setns are valid what we can
> likely do with dumpable is record the kuid of the userns root when the
> task becomes non-dumpable, and use that for i_uid and i_gid.

I see calling setns as a process voluntarily putting itself at a the mercy of
said namespace. Also there are potential ways to protect yourself, such as not
joining the PID namespace as well, so from my naive standpoint it's not that
big of a concern.

I can see there are means of mitigating attack surfaces but not completely
eliminating them. Taking lxc-attach as an example it joins all namespaces and
then calls clone/fork as a prerequisite to joining the PID namespace properly.
At that moment it's potentially vulnerable to an attack from within the container.

>From the above scenario I see two risks. The first - which can't be mitigated
completely at this time (assuming my patch stands) - is that the process
carries things like file descriptors from outside the namespace. lxc-attach's
exe isn't in the container and its memory mappings are not either, yet they're
carried in along with it. The second risk would involve this process not
securing itself properly, such as by failing to switch mount namespaces or not
closing these namespace handles. This would risk exposing the external
namespace to access from within the sub-namespace.

> 
> Eric
> 
>>>>> --- linux-3.10-clean/fs/proc/base.c	2013-06-30 18:13:29.000000000 -0400
>>>>> +++ linux-3.10-patched/fs/proc/base.c	2013-10-22 13:28:22.561262197 -0400
>>>>> @@ -1632,17 +1632,17 @@
>>>>>  	task = get_proc_task(inode);
>>>>>
>>>>>  	if (task) {
>>>>> +		rcu_read_lock();
>>>>> +	        cred = __task_cred(task);
>>>>>  		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
>>>>>  		    task_dumpable(task)) {
>>>>> -			rcu_read_lock();
>>>>> -			cred = __task_cred(task);
>>>>>  			inode->i_uid = cred->euid;
>>>>>  			inode->i_gid = cred->egid;
>>>>> -			rcu_read_unlock();
>>>>>  		} else {
>>>>> -			inode->i_uid = GLOBAL_ROOT_UID;
>>>>> -			inode->i_gid = GLOBAL_ROOT_GID;
>>>>> +			inode->i_uid = cred ? make_kuid(cred->user_ns, 0) : GLOBAL_ROOT_UID;
>>>>> +			inode->i_gid = cred ? make_kgid(cred->user_ns, 0) : GLOBAL_ROOT_GID;
>>>>>  		}
>>>>> +		rcu_read_unlock();
>>>>>  		inode->i_mode &= ~(S_ISUID | S_ISGID);
>>>>>  		security_task_to_inode(task, inode);
>>>>>  		put_task_struct(task);
>>>
>