[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...

Michael H. Warfield mhw at WittsEnd.com
Sat Nov 9 23:48:57 UTC 2013


On Thu, 2013-10-31 at 13:00 -0500, Serge Hallyn wrote: 
> Quoting Michael H. Warfield (mhw at WittsEnd.com):
> > I did incorporate your suggestion of using the hash of the rootfs path
> > as the subdirectory under the host's /dev/ for the container.  I also

> (Printed this out to look it over, just putting all my comments together
> here) :

> 1. I think if /dev is not devtmpfs, we should just bail on this.

I think what we can do is create a tmpfs instantiation here and roll
with it.  The challenge is in the unpriv-user case, but I think I can
address that as well now.
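
As a rough sketch of the check itself (this is not code from the patch):
one way to tell whether /dev is devtmpfs is to read the mount table with
getmntent(3).  The helper names mounted_fstype() and dev_is_devtmpfs()
are made up for illustration, and the mounts file is a parameter
(normally it would be /proc/self/mounts) only so the thing can be
exercised against a test file.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the filesystem type of the last entry in `mounts_file` whose mount
 * point is exactly `dir` into `out`.  Later entries win because they are
 * mounted over earlier ones.  Returns 0 if found, -1 otherwise.
 */
static int mounted_fstype(const char *mounts_file, const char *dir,
			  char *out, size_t len)
{
	struct mntent *m;
	int found = -1;
	FILE *f;

	assert(out != NULL && len > 0);

	f = setmntent(mounts_file, "r");
	if (!f)
		return -1;

	while ((m = getmntent(f)) != NULL) {
		if (strcmp(m->mnt_dir, dir) == 0) {
			snprintf(out, len, "%s", m->mnt_type);
			found = 0;
		}
	}
	endmntent(f);
	return found;
}

/* Return 1 if `mounts_file` says /dev is devtmpfs, 0 otherwise. */
static int dev_is_devtmpfs(const char *mounts_file)
{
	char type[64];

	if (mounted_fstype(mounts_file, "/dev", type, sizeof(type)) < 0)
		return 0;
	return strcmp(type, "devtmpfs") == 0;
}
```

The mount table is used rather than statfs(2) because, as far as I
know, a devtmpfs mount reports the same filesystem magic as tmpfs, so
the two can't be told apart that way.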

> 2. You say in comments that you're using the cgroup name, but it seems
>    you're actually just using the container name?

Yeah, that was incorrect, and, as it turns out, unnecessary.

> 3. The cgroup name used to be unique, but now each mounted cgroupfs
>    can actually have a different name for the same container (if some
>    of them didn't get cleaned out well).

> I'm just thinking out loud here, so this may not be better, but how
> about

> 1. create /dev/.lxc as you're doing

I'm going to create /dev/.lxc and, if /dev is not devtmpfs, mount a
tmpfs instantiation on it.  Then I'll also create a /dev/.lxc/user
directory under that with mode 1777 (like /tmp, with the sticky bit on).
That allows us to create unpriv container dirs under it, whether it's
tmpfs or devtmpfs.
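
For the /dev/.lxc/user piece, something along these lines would do.
This is a sketch only, not the patch code: make_sticky_subdir() is a
made-up name, and the base directory is a parameter so it can be tried
somewhere other than /dev/.lxc.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create `base`/`name` as a world-writable directory with the sticky bit
 * set (mode 1777, like /tmp) and store the full path in `out`.
 * Returns 0 on success, -1 on error.
 */
static int make_sticky_subdir(const char *base, const char *name,
			      char *out, size_t len)
{
	struct stat st;
	int n;

	assert(base != NULL && name != NULL && out != NULL);

	n = snprintf(out, len, "%s/%s", base, name);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (mkdir(out, 0755) < 0 && errno != EEXIST)
		return -1;

	/* Refuse anything that is not a real directory, e.g. a symlink. */
	if (lstat(out, &st) < 0 || !S_ISDIR(st.st_mode))
		return -1;

	/* mkdir() is subject to the umask, so set the final mode explicitly. */
	return chmod(out, 01777);
}
```

With the sticky bit on, an unpriv user can mkdir their own container
directory under there but can't remove or rename anyone else's.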

> 2. (if container is going to use this) create /dev/.lxc/$nonce.
>    We can use hash("$lxcpath/$lxcname"), or just mkstemp(), or
>    just an increasing integer.
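
For reference, the hash("$lxcpath/$lxcname") option could look roughly
like this.  It is a sketch, not the routine in monitor.c: it uses the
standard 64-bit FNV-1a offset basis and prime with a plain multiply
(which should be equivalent to the shift-and-add form), and dev_nonce()
is a name invented here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FNV1A_64_INIT  UINT64_C(0xcbf29ce484222325)
#define FNV1A_64_PRIME UINT64_C(0x100000001b3)

/* 64-bit FNV-1a: xor in each octet, then multiply by the FNV prime. */
static uint64_t fnv1a_64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = FNV1A_64_INIT;

	while (len--) {
		h ^= *p++;
		h *= FNV1A_64_PRIME;	/* wraps mod 2^64 */
	}
	return h;
}

/*
 * Write the 16 hex digit name for "lxcpath/lxcname" into `out`.
 * Returns 0 on success, -1 if either buffer is too small.
 */
static int dev_nonce(const char *lxcpath, const char *lxcname,
		     char *out, size_t len)
{
	char key[4096];
	int n;

	assert(lxcpath != NULL && lxcname != NULL && out != NULL);

	n = snprintf(key, sizeof(key), "%s/%s", lxcpath, lxcname);
	if (n < 0 || (size_t)n >= sizeof(key))
		return -1;

	n = snprintf(out, len, "%016" PRIx64, fnv1a_64(key, strlen(key)));
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

The same $lxcpath and $lxcname always give the same directory name, so
it survives container restarts, but it does change if the container is
moved to a different lxcpath.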

> 3. Create $lxcpath/$lxcname/.dev (if the container needs it) and
>    shared-bind-mount /dev/.lxc/$nonce onto it.  Now we can tell
>    which /dev/.lxc/* is mounted by looking at the mount table.

> 4. slave-bind-bind mount $lxcpath/$lxcname/.dev into the starting
>    container's /dev.

This one has a gotcha.  Where that code is at, the bind mounts are not
showing up in the host's mount table or fs.  I fell back to using
symlinks (which don't overload the mount table with yet more mounts).

> Not sure whether we should have lxc.autodev = 2 mean use this scheme,
> but I'd be fine with basically always doing this so long as /dev/ is
> devtmpfs and lxc.autodev is set for the container.  (So making
> the container's /dev a tmpfs would just be a fallback).

> Thoughts?

Got a new patch.

1) Autodetect of systemd is based on dereferencing symlinks and
detecting if systemd is in play.  This is sort of orthogonal to much of
this.  I found where I could find the exec command and had to add a
parameter to lxc_setup to pass the handler data blob in.  That could use
some more eyes on it to make sure I haven't missed anything that will
cause problems there.  I'm not sure of all the paths through that
handler ops structure.

2) Detects if devtmpfs is mounted on /dev or if /dev/.lxc has something
(ramfs) mounted on it.

3) Uses /dev/.lxc/user for unpriv user containers.

4) Creates symlinks from $lxcpath/$name/rootfs.dev back
to /dev/.lxc/{containerdev}.

5) Falls back to mounting tmpfs on container /dev/ if all else fails.

Now...  That all being said...  It has one corner case.

After a system is rebooted, /dev/.lxc does not exist.  If /dev is
devtmpfs, the first privileged user container start will populate it and
everything should be fine.  If an unpriv user is the first to start a
container, it will fall back to #5.

In my mind, the boot startup logic (Stéphane?) should really be where
this gets populated and we can eliminate that corner case entirely.
Even if no containers get fired up at boot, that could set
up /dev/.lxc and /dev/.lxc/user in advance and we should have all the
bases covered.

This time I'll submit this one as a patch.  :-)=)

Coming shortly...

> -serge

Regards,
Mike

> > added additional symlinks for the container name to the hash
> > subdirectory.  I'm not totally sure I'm happy with this, since the hash
> > changes if the rootfs is moved but the whole thing is thrown away if the
> > machine is rebooted anyways (shrug).  So it's a minor niggle.  Maybe a
> > persistent uuid for the container might be an idea worth considering.
> > Maybe it's too minor to worry too much about.
> > 
> > I thought about adding an additional directory under them for
> > "containers" and "by-hashes" and maybe "by-uuid" but that can all be
> > addressed in the future.  This is just down-and-dirty / prove it can
> > work, kind of code...
> > 
> > I did "steal" some code from monitor.c over into conf.c for the hash
> > routine.  Maybe that needs to be redone in a generalized function in case
> > we need to change it and, then, everything tags along.  A thought there.
> > 
> > -- 
> > 
> > This code will create a "/dev/.lxc" subdirectory on the host and
> > populate it with "/dev/.lxc/hash({rootfs})" directories with
> > "/dev/.lxc/{container}" symlinks to them.  Those directories will then
> > be bind mounted to the container /dev/ directories and populated when
> > autodev = 1.
> > 
> > They are NOT cleaned up upon container shutdown or reboot of the
> > container which can allow for some level of reboot persistence (as long
> > as it's not the host reboot).  The symlinks are unlinked and relinked
> > allowing for multiple containers having the same names (and extensions
> > in the cgroup names) even though I personally feel that's an abominable
> > (albeit rare) practice.  It should all work.
> > 
> > It does not currently incorporate any udev rules or hooks but is
> > eminently amenable to them.  I'm working on that.  That's the ultimate
> > goal, to facilitate more utilization of dynamic devices with symlinks
> > and device renumbering.
> > 
> > Right now, it simply replaces the use of tmpfs bind mounted to /dev in
> > the container with a subdirectory of devtmpfs.  With this, we can
> > address a few other problems wrt dynamic devices and udev rules,
> > including shared devices.  It includes some logic for automagically
> > switching on autodev which, I admit, is suboptimal but I would like to
> > see this mature where autodev=1 is the default and can be overridden in
> > the unusual case where it has to be disabled.
> > 
> > It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
> > host.  This is, potentially, a serious corner case if we have a host
> > with a static /dev on root and hosting a systemd based container.  That
> > is going to break.  Bad juju.  How common?  I don't know.  I feel that
> > check probably needs to go in but I felt like I wanted this code to get
> > out for discussion first.  What do we do if it is not the case?  I
> > suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
> > ourselves.  That code is also not there and the idea is open for
> > discussion.
> > 
> > Currently, I have this running with autodev = 1 for all my containers,
> > no matter if they are systemd based or upstart based or sysv init based.
> > All of them are working cleanly and this seems to have solved a few past
> > problems with bad modes when autodev was set to one for non-systemd
> > containers (early Fedora and CentOS).
> > 
> > This is a git diff against 1.0.0.alpha2.  After the jump...
> > 
> > Regards,
> > Mike
> > -- 
> > Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
> >    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
> >    NIC whois: MHW9          | An optimist believes we live in the best of all
> >  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
> > 
> > -- 
> > diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> > index 208c08b..21baf20 100644
> > --- a/src/lxc/conf.c
> > +++ b/src/lxc/conf.c
> > @@ -29,6 +29,7 @@
> >  #include <string.h>
> >  #include <dirent.h>
> >  #include <unistd.h>
> > +#include <inttypes.h>
> >  #include <sys/wait.h>
> >  #include <sys/syscall.h>
> >  #include <time.h>
> > @@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
> >  	return 0;
> >  }
> >  
> > +
> > +/*
> > + * Note: This is a verbatim copy of what is in monitor.c.  We're just
> > + * using it here to generate a safe subdirectory in /dev/ for the
> > + * container's /dev/
> > + */
> > +
> > +/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
> > + * FNV has good anti collision properties and we're not worried
> > + * about pre-image resistance or one-way-ness, we're just trying to make
> > + * the name unique in the 108 bytes of space we have.
> > + */
> > +#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
> > +static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
> > +{
> > +	unsigned char *bp;
> > +
> > +	for(bp = buf; bp < (unsigned char *)buf + len; bp++)
> > +	{
> > +		/* xor the bottom with the current octet */
> > +		hval ^= (uint64_t)*bp;
> > +
> > +		/* gcc optimised:
> > +		 * multiply by the 64 bit FNV magic prime mod 2^64
> > +		 */
> > +		hval += (hval << 1) + (hval << 4) + (hval << 5) +
> > +			(hval << 7) + (hval << 8) + (hval << 40);
> > +	}
> > +
> > +	return hval;
> > +}
> > +
> > +/*
> > + * Locate a devtmpfs mount (should be on /dev) and create a container
> > + * subdirectory on it which we can then bind mount to the container
> > + * /dev instead of mounting a tmpfs there.
> > + * If we fail, return NULL.
> > + * Else return the pointer to the name buffer with the string to
> > + * the devtmpfs subdirectory.
> > + */
> > +
> > +char *mk_devtmpfs(const char *name, char *path)
> > +{
> > +	int ret;
> > +	struct stat s;
> > +	char tmp_path[MAXPATHLEN];
> > +	char sym_path[MAXPATHLEN];
> > +	char *base_path = "/dev/.lxc"; 
> > +	uint64_t hash;
> > +
> > +	/*
> > +	 * We're starting off with a fixed path /dev and assuming this is
> > +	 * a devtmpfs mount.  These are poor assumptions.  We need to add
> > +	 * more robust code to find it and verify it then use it.
> > +	 */
> > +
> > +	if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		/* This is just making /dev/.lxc it better work or we're done */
> > +		ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* Now check and/or create our container subname */
> > +	/* Symlink path... */
> > +	ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	/* Actual directory path */
> > +	hash = fnv_64a_buf(path, ret, FNV1A_64_INIT);
> > +	ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
> > +	/* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* This sets up a symlink from the container name (cgroup name) in
> > +	 * /dev/.lxc over to the actual hash based device directory.  This
> > +	 * is strictly for a convenient cross reference and may be used
> > +	 * by udev rules to move devices in and out of the container spaces.
> > +	 */
> > +
> > +	unlink( sym_path );
> > +	ret = symlink(tmp_path, sym_path);
> > +
> > +	if ( ret < 0 ) {
> > +		SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
> > +	}
> > +
> > +	strcpy( path, tmp_path );
> > +	return path;
> > +}
> > +
> > +
> >  /*
> >   * Do we want to add options for max size of /dev and a file to
> >   * specify which devices to create?
> >   */
> > -static int mount_autodev(char *root)
> > +static int mount_autodev(const char *name, char *root)
> >  {
> >  	int ret;
> > +	struct stat s;
> >  	char path[MAXPATHLEN];
> > +	char devtmpfs_path[MAXPATHLEN];
> >  
> >  	INFO("Mounting /dev under %s\n", root);
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
> >  	if (ret < 0 || ret > MAXPATHLEN)
> >  		return -1;
> > -	ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	if (mk_devtmpfs( name, devtmpfs_path ) ) {
> > +		ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
> > +	} else {
> > +		ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	}
> >  	if (ret) {
> >  		SYSERROR("Failed to mount /dev at %s\n", root);
> >  		return -1;
> > @@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
> >  	if (ret < 0 || ret >= MAXPATHLEN)
> >  		return -1;
> > -	ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > -	if (ret) {
> > -		SYSERROR("Failed to create /dev/pts in container");
> > -		return -1;
> > +	/*
> > +	 * If we are running on a devtmpfs mapping, dev/pts may already exist.
> > +	 * If not, then create it and exit if that fails...
> > +	 */
> > +	if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if (ret) {
> > +			SYSERROR("Failed to create /dev/pts in container");
> > +			return -1;
> > +		}
> >  	}
> >  
> >  	INFO("Mounted /dev under %s\n", root);
> > @@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
> >  
> >  	new->loglevel = LXC_LOG_PRIORITY_NOTSET;
> >  	new->personality = -1;
> > +	new->autodev = -1;
> >  	new->console.log_path = NULL;
> >  	new->console.log_fd = -1;
> >  	new->console.path = NULL;
> > @@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * This routine is called when the configuration does not already specify a value
> > + * for autodev (mounting a file system on /dev and populating it in a container).
> > + * If a hard override value has not been specified, then we try to apply some
> > + * heuristics to determine if we should switch to autodev mode.
> > + *
> > + * For instance, if the container has an /etc/systemd/system directory then it
> > + * is probably running systemd as the init process and it needs the autodev
> > + * mount to prevent it from mounting devtmpfs on /dev on its own causing conflicts
> > + * in the host.
> > + *
> > + * We may also want to enable autodev if the host has devtmpfs mounted on its
> > + * /dev as this then enables us to use subdirectories under /dev for the container
> > + * /dev directories and we can fake udev devices.
> > + */
> > +int check_autodev( const char *rootfs )
> > +{
> > +	char absrootfs[MAXPATHLEN];
> > +	int ret;
> > +	struct stat s;
> > +	char path[MAXPATHLEN];
> > +
> > +	INFO("Testing for systemd in %s\n", rootfs);
> > +
> > +	if (rootfs == NULL || strlen(rootfs) == 0)
> > +		return -2;
> > +
> > +	if (!realpath(rootfs, absrootfs))
> > +		return -2;
> > +
> > +	/* Note here: we could instead check for /etc/systemd/system.conf  */
> > +	ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return -2;
> > +
> > +	if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
> > +		return 1;
> > +
> > +
> > +	/* Add future checks here.
> > +	 *	Return positive if we should go autodev
> > +	 *	Return 0 if we should NOT go autodev
> > +	 *	Return negative if we encounter an error or can not determine...
> > +	 */
> > +
> > +	/* All else fails, disable autodev */
> > +	return 0;
> > +}
> > +
> >  int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
> >  {
> >  	if (setup_utsname(lxc_conf->utsname)) {
> > @@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > -		if (mount_autodev(lxc_conf->rootfs.mount)) {
> > +	if (lxc_conf->autodev < 0) {
> > +		lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
> > +	}
> > +
> > +	if (lxc_conf->autodev > 0) {
> > +		if (mount_autodev(name, lxc_conf->rootfs.mount)) {
> >  			ERROR("failed to mount /dev in the container");
> >  			return -1;
> >  		}
> > @@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > +	if (lxc_conf->autodev > 0) {
> >  		if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
> >  			ERROR("failed to run autodev hooks for container '%s'.", name);
> >  			return -1;
> > 

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
