[Lxc-users] dropping capabilities

Mon Oct 4 22:09:02 UTC 2010

Quoting Daniel Lezcano (daniel.lezcano at free.fr):
> On 10/04/2010 10:54 PM, richard -rw- weinberger wrote:
> > Hi Daniel!
> >
> > On Mon, Oct 4, 2010 at 9:51 PM, Daniel Lezcano<daniel.lezcano at free.fr>  wrote:
> >    
> >> On 10/04/2010 06:18 PM, richard -rw- weinberger wrote:
> >>      
> >>> On Sun, Oct 3, 2010 at 9:01 PM, richard -rw- weinberger
> >>> <richard.weinberger at gmail.com>    wrote:
> >>>
> >>>        
> >>>> I'm using lxc to run a few virtual private servers.
> >>>> What capabilities are harmful and should be dropped using "lxc.cap.drop"?
> >>>>
> >>>>          
> >>> Is my question too trivial or too stupid? ;)
> >>>
> >>>        
> >> hum, not trivial at all :)
> >>
> >> I am not sure there is a default set of capabilities to be dropped.
> >> Certainly some should be dropped like CAP_SYS_MODULE but others will depend
> >> on what the user expects to do with the container and what scripts will be
> >> run inside the container.
> >>
> >> We certainly have to think about the root user inside a container: is it
> >> secure? IMO, until the user namespace is complete, it is not secure.
> >>      
> > I thought the user namespace is complete.
> > What is missing?
> >    
> 
> I am not sure, but something like "who did what": if you are root on
> the host and you mount a filesystem, then when you create a user namespace
> you will be root inside but not the same root as the host, and you won't be
> able to umount what the host's root has mounted before. I didn't
> follow the discussion about this very closely so I may be wrong.
> I prefer to let Serge explain what is missing, he will be much clearer
> than me :)
> (cc'ed Serge).

Right - really 'who owns what' is what we don't track correctly in the
context of user namespaces.  'who' should now be not uid, but
(user_namespace, uid).  In particular we don't currently have answers
for that with VFS or capable() requests.  A file (including /proc and
/cgroup files) should be owned by (user_ns, uid), and likewise a task
that you are trying to kill.  So a task with credentials
(init_user_ns,500),(child_user_ns1,0), in other words owned by uid 500
in the 'initial' user namespace, but root in the container, would be
denied CAP_KILL to a task, or write to a file, owned by
(init_user_ns,0).
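
To make that example concrete, here is a minimal sketch in Python of the
(user_namespace, uid) ownership idea.  It is only an illustration, not
kernel code: the names UserNamespace, Owner, Task and may_act_on are
hypothetical, and the rule used (root in the owning namespace, or the same
uid in it) is a deliberate simplification of what capable() and the VFS
would have to check.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UserNamespace:
    """Hypothetical stand-in for a kernel user namespace."""
    name: str


@dataclass(frozen=True)
class Owner:
    """An identity is the pair (user_namespace, uid), not a bare uid."""
    ns: UserNamespace
    uid: int


@dataclass(frozen=True)
class Task:
    """A task carries one identity per namespace it is known in."""
    creds: Tuple[Owner, ...]


def may_act_on(task: Task, target: Owner) -> bool:
    """Simplified check: allow only if, in the namespace that owns the
    target, the task is root (uid 0) or has the same uid as the target."""
    for cred in task.creds:
        if cred.ns == target.ns:
            return cred.uid == 0 or cred.uid == target.uid
    # The task has no identity in the target's namespace at all.
    return False


init_user_ns = UserNamespace("init_user_ns")
child_user_ns1 = UserNamespace("child_user_ns1")

# uid 500 in the initial namespace, but root inside the container.
container_root = Task(creds=(Owner(init_user_ns, 500),
                             Owner(child_user_ns1, 0)))

# Denied: the target is owned by root in the initial namespace.
denied = may_act_on(container_root, Owner(init_user_ns, 0))

# Allowed: the target is owned by an ordinary user inside the container.
allowed = may_act_on(container_root, Owner(child_user_ns1, 1000))
```

With a bare uid the container's root (uid 0) would be indistinguishable
from the host's root; keying the check on the pair is what lets the first
case be refused while the second is granted.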

The capabilities part of that is actually started by a patch by Eric
Biederman which is sitting in http://git.kernel.org/?p=linux/kernel/git/sergeh/linux-cr.git;a=shortlog;h=refs/heads/userns.feb16.1

(see patch http://git.kernel.org/?p=linux/kernel/git/sergeh/linux-cr.git;a=commit;h=58e3ce401f746f2865a6c9872d9205e202c2c5a2 in particular)

-serge