[lxc-devel] Kernel bug? Setuid apps and user namespaces

Tue Oct 22 22:01:49 UTC 2013

On 10/22/2013 03:50 PM, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn at ubuntu.com> writes:
> 
>> Quoting Sean Pajot (sean.pajot at execulink.com):
>>> I've been playing with User Namespaces somewhat extensively and I think I've
>>> come across a bug in the handling of /proc/$PID/ entries.
>>>
>>> This is my example case on a 3.10.x kernel:
>>>
>>> -- /var/lib/lxc/test1/config
>>>
>>> lxc.rootfs = /lxc/c1
>>> lxc.id_map = u 0 1000000 100000
>>> lxc.id_map = g 0 1000000 100000
>>> lxc.network.type = none
>>>
>>> lxc.tty = 6
>>>
>>> == END
>>>
>>> On one console login as a non-root user and run "su", as an example of a
>>> setuid root application. On another console login as root and examine
>>> /proc/$(pidof su). You'll find all the files are owned by the "nobody" user
>>> and inaccessible. The reason is on the host you'll find these files are owned
>>> by "root", uid 0, which is odd because in the container they should be uid
>>> 1000000 from the mappings.
>>>
>>> I tracked down the cause to kernel source file /fs/proc/base.c function
>>> pid_revalidate which contains static references to GLOBAL_ROOT_UID and
>>> GLOBAL_ROOT_GID which are always UID 0 on the host. This little patch, which
>>> might not be correct in terms of kernel standards, appears to mostly solve the
>>> issue. It doesn't affect all entries in /proc/$PID but gets the majority of them.
>>>
>>> Thoughts or opinions?
>>
>> Awesome - I've seen this bug and so far not had time to dig.  
>>
>> The patch offhand looks good to me.  Do you mind sending it to
>> lkml?

>>
>> Acked-by: Serge E. Hallyn <serge.hallyn at ubuntu.com>
>>

Well I wasn't expecting that... :)

> 
> It is definitely worth looking at.  I punted on this when I did the
> initial round of conversions.  Tasks that we don't consider dumpable are
> weird.
> 
> At first glance this fine.  However __task_cred does not return NULL so
> handling that case is nonsense and confusing.
> 
> Eric
> 

I thought so, but I wanted to have a failsafe since I'm running this code on
the same machine I'm typing this message on.
This is my first patch that had a chance of making it into the kernel so I'm
honestly making things up as I go. I put that there so in the event a NULL
cred showed up there would be known symptoms besides an Oops.

On my system I still have the "ns" directory marked as owned by host's uid 0
but since the permissions are 511 (?) and the namespace objects are owned by
container's uid 0 it doesn't really impact much. That could probably use
fixing but the use cases are generally usable now.

That aside, you really think it's okay for inclusion in the kernel with
cred!=NULL fixed?

>>> --- linux-3.10-clean/fs/proc/base.c	2013-06-30 18:13:29.000000000 -0400
>>> +++ linux-3.10-patched/fs/proc/base.c	2013-10-22 13:28:22.561262197 -0400
>>> @@ -1632,17 +1632,17 @@
>>>  	task = get_proc_task(inode);
>>>
>>>  	if (task) {
>>> +		rcu_read_lock();
>>> +	        cred = __task_cred(task);
>>>  		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
>>>  		    task_dumpable(task)) {
>>> -			rcu_read_lock();
>>> -			cred = __task_cred(task);
>>>  			inode->i_uid = cred->euid;
>>>  			inode->i_gid = cred->egid;
>>> -			rcu_read_unlock();
>>>  		} else {
>>> -			inode->i_uid = GLOBAL_ROOT_UID;
>>> -			inode->i_gid = GLOBAL_ROOT_GID;
>>> +			inode->i_uid = cred ? make_kuid(cred->user_ns, 0) : GLOBAL_ROOT_UID;
>>> +			inode->i_gid = cred ? make_kgid(cred->user_ns, 0) : GLOBAL_ROOT_GID;
>>>  		}
>>> +		rcu_read_unlock();
>>>  		inode->i_mode &= ~(S_ISUID | S_ISGID);
>>>  		security_task_to_inode(task, inode);
>>>  		put_task_struct(task);
> 

-- 
execulink
TELECOM

Sean Pajot
System Administrator
1127 Ridgeway Rd, Woodstock
tel: 519.456.7249
email: sean.pajot at execulink.com
www.execulink.ca