[lxc-devel] [PATCH] cgfs: don't mount /sys/fs/cgroup readonly

Fri May 2 21:04:15 UTC 2014

Hi Serge,

> /sys/fs/cgroup is just a size-limited tmpfs, and making it ro does
> nothing to affect our ability alter mount settings of its subdirs.
> OTOH making it ro can upset mountall in the container which tries
> to remount it rw, which may be refused.
> 
> So just don't do it.
> 
> Signed-off-by: Serge Hallyn <serge.hallyn at ubuntu.com>
> Cc: Christian Seiler <christian at iwakd.de>

NAK!

> ---
>  src/lxc/cgfs.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/src/lxc/cgfs.c b/src/lxc/cgfs.c
> index db2a973..ba7df89 100644
> --- a/src/lxc/cgfs.c
> +++ b/src/lxc/cgfs.c
> @@ -1413,14 +1413,6 @@ static bool cgroupfs_mount_cgroup(void *hdata, const char *root, int type)
>  				SYSERROR("error bind-mounting %s to %s", mp->mount_point, abs_path);
>  				goto out_error;
>  			}
> -			/* main cgroup path should be read-only */
> -			if (type == LXC_AUTO_CGROUP_FULL_RO || type == LXC_AUTO_CGROUP_FULL_MIXED) {
> -				r = mount(NULL, abs_path, NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL);
> -				if (r < 0) {
> -					SYSERROR("error re-mounting %s readonly", abs_path);
> -					goto out_error;
> -				}
> -			}

This is not the tmpfs! Also, this breaks logic if you want the whole
cgroup tree to be read-only. Without this cgroup-full:ro and
cgroup-full:mixed is identical to cgroup-full:rw!

The code here is for the mount points /sys/fs/cgroup/$subsystem.
abs_path contains the path to the mount point inside the container of
that. This can be triggered only by cgroup-full:{ro,mixed}. Let's
examine the cases:

cgroup-full:ro: In that case, the entire cgroup hierarchy should be
mounted read-only. No mount will be done for the container's own cgroup,
so NOT remounted read-only here will leave the hierarchy read-write and
therefore making the ro flag meaningless.

cgroup-full:mixed: In that case, the cgroup hierarchy should be mounted
read-only with the exception of the own cgroup (see documentation). For
this case, it is also needed that /sys/fs/cgroup/$subsystem inside the
container be remounted read-only. In the subsequent code, the
container's own cgroup will be bind-remounted onto itself read-write if
mixed is specified, which is meaningless if the parent mount is already
read-write.

(Note that the idea here is that this only works as a restriction for
containers without CAP_SYS_ADMIN; otherwise, the container can change
the mount options anyway.)

So if you remove this part where the local mount point is made readonly,
then it's equivalent to removing the :ro and :mixed flags of cgroup-full.

If this doesn't work for Ubuntu (mountall is upstart, therefore Ubuntu,
right?), then it's probably better just to update the manpage and say
that Ubuntu only works with cgroup-full:rw. Also, if you think that the
default for cgroup-full (which is currently cgroup-full:mixed) should
rather be cgroup-full:rw, then we could also change that (see
confile.c). But please don't remove this piece of code.

If you just want the tmpfs to be read-write, then the following does that:

> @@ -1487,14 +1479,6 @@ static bool cgroupfs_mount_cgroup(void *hdata, const char *root, int type)
>  		parts = NULL;
>  	}
>  
> -	/* try to remount the tmpfs readonly, since the container shouldn't
> -	 * change anything (this will also make sure that trying to create
> -	 * new cgroups outside the allowed area fails with an error instead
> -	 * of simply causing this to create directories in the tmpfs itself)
> -	 */
> -	if (type != LXC_AUTO_CGROUP_RW && type != LXC_AUTO_CGROUP_FULL_RW)
> -		mount(NULL, path, NULL, MS_REMOUNT|MS_RDONLY, NULL);
> -
>  	free(path);
>  
>  	return true;

Although here I also see a problem for a different use case:

Let's say you have a container which is mounted cgroup:mixed (not
cgroup-full). Let's assume for the sake of argument that we only have
the 'cpu' hierarchy. Then the following mountpoints will exist (for the
case cgroup:mixed):

/sys/fs/cgroup                  [tmpfs, ro]
/sys/fs/cgroup/cpu/lxc/c1       [bind-mount of host, rw]

If the container now has a script that says:
echo $$ > /sys/fs/cgroup/cpu/tasks
to put itself into the root cgroup, then in the current version of LXC,
this will fail with a "read-only filesystem" error, i.e. there is an
immediate feedback for the script that something went wrong and it
couldn't put itself into the root cgroup.

If you now make the tmpfs (even if you just do the tmpfs part, not the
other part of the code) read-write, then the above script will just
create a file "tasks" in the tmpfs and put the script's current PID in
it; the script will never know that this did not have the desired effect
of moving itself into the root cgroup. So the script now thinks it did
something which it actually didn't do. In order to avoid that, I
explicitly remounted the tmpfs readonly for that case.

Perhaps you could tell me what you did and what the problem was, before
modifying this part of the code? Then we could figure out a solution for
that together. I put quite a bit of thought into it to make sure the
logic behind it is sane, and I don't see the reason for your changes.

Regards,
Christian