[lxc-devel] Kernel bug? Setuid apps and user namespaces
Sean Pajot
sean.pajot at execulink.com
Tue Oct 22 22:01:49 UTC 2013
On 10/22/2013 03:50 PM, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn at ubuntu.com> writes:
>
>> Quoting Sean Pajot (sean.pajot at execulink.com):
>>> I've been playing with User Namespaces somewhat extensively and I think I've
>>> come across a bug in the handling of /proc/$PID/ entries.
>>>
>>> This is my example case on a 3.10.x kernel:
>>>
>>> -- /var/lib/lxc/test1/config
>>>
>>> lxc.rootfs = /lxc/c1
>>> lxc.id_map = u 0 1000000 100000
>>> lxc.id_map = g 0 1000000 100000
>>> lxc.network.type = none
>>>
>>> lxc.tty = 6
>>>
>>> == END
>>>
>>> On one console login as a non-root user and run "su", as an example of a
>>> setuid root application. On another console login as root and examine
>>> /proc/$(pidof su). You'll find all the files are owned by the "nobody" user
>>> and inaccessible. The reason is on the host you'll find these files are owned
>>> by "root", uid 0, which is odd because in the container they should be uid
>>> 1000000 from the mappings.
>>>
>>> I tracked down the cause to kernel source file /fs/proc/base.c function
>>> pid_revalidate which contains static references to GLOBAL_ROOT_UID and
>>> GLOBAL_ROOT_GID which are always UID 0 on the host. This little patch, which
>>> might not be correct in terms of kernel standards, appears to mostly solve the
>>> issue. It doesn't affect all entries in /proc/$PID but gets the majority of them.
>>>
>>> Thoughts or opinions?
>>
>> Awesome - I've seen this bug and so far not had time to dig.
>>
>> The patch offhand looks good to me. Do you mind sending it to
>> lkml?
>>
>> Acked-by: Serge E. Hallyn <serge.hallyn at ubuntu.com>
>>
Well I wasn't expecting that... :)
>
> It is definitely worth looking at. I punted on this when I did the
> initial round of conversions. Tasks that we don't consider dumpable are
> weird.
>
> At first glance this fine. However __task_cred does not return NULL so
> handling that case is nonsense and confusing.
>
> Eric
>
I thought so, but I wanted to have a failsafe since I'm running this code on
the same machine I'm typing this message on.
This is my first patch that had a chance of making it into the kernel so I'm
honestly making things up as I go. I put that there so in the event a NULL
cred showed up there would be known symptoms besides an Oops.
On my system I still have the "ns" directory marked as owned by host's uid 0
but since the permissions are 511 (?) and the namespace objects are owned by
container's uid 0 it doesn't really impact much. That could probably use
fixing but the use cases are generally usable now.
That aside, you really think it's okay for inclusion in the kernel with
cred!=NULL fixed?
>>> --- linux-3.10-clean/fs/proc/base.c 2013-06-30 18:13:29.000000000 -0400
>>> +++ linux-3.10-patched/fs/proc/base.c 2013-10-22 13:28:22.561262197 -0400
>>> @@ -1632,17 +1632,17 @@
>>> task = get_proc_task(inode);
>>>
>>> if (task) {
>>> + rcu_read_lock();
>>> + cred = __task_cred(task);
>>> if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
>>> task_dumpable(task)) {
>>> - rcu_read_lock();
>>> - cred = __task_cred(task);
>>> inode->i_uid = cred->euid;
>>> inode->i_gid = cred->egid;
>>> - rcu_read_unlock();
>>> } else {
>>> - inode->i_uid = GLOBAL_ROOT_UID;
>>> - inode->i_gid = GLOBAL_ROOT_GID;
>>> + inode->i_uid = cred ? make_kuid(cred->user_ns, 0) : GLOBAL_ROOT_UID;
>>> + inode->i_gid = cred ? make_kgid(cred->user_ns, 0) : GLOBAL_ROOT_GID;
>>> }
>>> + rcu_read_unlock();
>>> inode->i_mode &= ~(S_ISUID | S_ISGID);
>>> security_task_to_inode(task, inode);
>>> put_task_struct(task);
>
--
execulink
TELECOM
Sean Pajot
System Administrator
1127 Ridgeway Rd, Woodstock
tel: 519.456.7249
email: sean.pajot at execulink.com
www.execulink.ca
More information about the lxc-devel
mailing list