[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...

Michael H. Warfield mhw at WittsEnd.com
Fri Nov 1 16:49:44 UTC 2013


On Thu, 2013-10-31 at 13:00 -0500, Serge Hallyn wrote: 
> Quoting Michael H. Warfield (mhw at WittsEnd.com):
> > I did incorporate your suggestion of using the hash of the rootfs path
> > as the subdirectory under the host's /dev/ for the container.  I also
> 
> (Printed this out to look it over, just putting all my comments together
> here) :
> 
> 1. I think if /dev is not devtmpfs, we should just bail on this.
> 
> 2. You say in comments that you're using the cgroup name, but it seems
>    you're actually just using the container name?

Ok...  I'm going to experiment with this a bit but check me on this...
In the routine "lxc_setup()" the first parameter is "name".  Is that the
cgroup name or just the container name?  I take it, from your remark,
this is just the container name and the unique cgroup name may be
something different.  Is that something I should be pulling out of the
cgroup info structure?

> 3. The cgroup name used to be unique, but now each mounted cgroupfs
>    can actually have a different name for the same container (if some
>    of them didn't get cleaned out well).

Yeah, I'm trying to get my head wrapped around that code now and figure
out the name that I want to use there.

> I'm just thinking out loud here, so this may not be better, but how
> about
> 
> 1. create /dev/.lxc as you're doing
> 
> 2. (if container is going to use this) create /dev/.lxc/$nonce.
>    We can use hash("$lxcpath/$lxcname"), or just mkstemp(), or
>    just an increasing integer.
> 
> 3. Create $lxcpath/$lxcname/.dev (if the container needs it) and
>    shared-bind-mount /dev/.lxc/$nonce onto it.  Now we can tell
>    which /dev/.lxc/* is mounted by looking at the mount table.

Ok...  Looking at this now...  I see in the $lxcpath/$lxcname where we
are creating rootfsproc and rootfssys.  Why not continue that hallowed
tradition and add a rootfsdev as "$lxcpath/$lxcname/rootfsdev" and bind
mount them all together?
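
A minimal sketch of the naming in steps 2 and 3, assuming the nonce is
the FNV-1a 64-bit hash of "$lxcpath/$lxcname" (the same hash family
monitor.c uses).  The helper names fnv1a_64() and lxc_dev_paths() and
the buffer sizes are hypothetical; this only builds the two path
strings and does not create or mount anything.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a, 64 bit: xor in each octet, then multiply by the FNV prime. */
static uint64_t fnv1a_64(const char *s)
{
	uint64_t h = UINT64_C(0xcbf29ce484222325);	/* offset basis */

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= UINT64_C(0x100000001b3);		/* FNV prime */
	}
	return h;
}

/*
 * Hypothetical helper: fill in the host-side directory under /dev/.lxc
 * and the per-container bind target ($lxcpath/$lxcname/rootfsdev).
 * Returns 0 on success, -1 if any buffer is too small.
 */
static int lxc_dev_paths(const char *lxcpath, const char *lxcname,
			 char *hostdir, size_t hostlen,
			 char *target, size_t targetlen)
{
	char key[4096];
	int ret;

	ret = snprintf(key, sizeof(key), "%s/%s", lxcpath, lxcname);
	if (ret < 0 || (size_t)ret >= sizeof(key))
		return -1;

	ret = snprintf(hostdir, hostlen, "/dev/.lxc/%016" PRIx64, fnv1a_64(key));
	if (ret < 0 || (size_t)ret >= hostlen)
		return -1;

	ret = snprintf(target, targetlen, "%s/rootfsdev", key);
	if (ret < 0 || (size_t)ret >= targetlen)
		return -1;

	return 0;
}
```

The bind mounts themselves would then go from hostdir onto target, and
from target into the container's /dev.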

> 4. slave-bind mount $lxcpath/$lxcname/.dev into the starting
>    container's /dev.

> Not sure whether we should have lxc.autodev = 2 mean use this scheme,
> but I'd be fine with basically always doing this so long as /dev/ is
> devtmpfs and lxc.autodev is set for the container.  (So making
> the container's /dev a tmpfs would just be a fallback).

I like this idea.  Actually doesn't involve moving a lot of code,
either.
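
For the "only if /dev is devtmpfs" condition, here is a rough sketch of
the check, assuming we read a mounts table in /proc/self/mounts format
with glibc's getmntent().  The function name dev_is_devtmpfs() is
hypothetical and none of this is in the posted patch.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the last mount on "/dev" listed in mounts_file is a
 * devtmpfs, 0 if it is something else or absent, -1 if the file
 * cannot be opened.
 */
static int dev_is_devtmpfs(const char *mounts_file)
{
	struct mntent *m;
	FILE *f;
	int found = 0;

	f = setmntent(mounts_file, "r");
	if (!f)
		return -1;

	/* Later entries are stacked on top of earlier ones, so the last wins. */
	while ((m = getmntent(f)) != NULL) {
		if (strcmp(m->mnt_dir, "/dev") == 0)
			found = (strcmp(m->mnt_type, "devtmpfs") == 0);
	}

	endmntent(f);
	return found;
}
```

A caller would pass "/proc/self/mounts" and fall back to the existing
tmpfs mount whenever the result is not 1.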

> Thoughts?

> -serge

Regards,
Mike

> > added additional symlinks for the container name to the hash
> > subdirectory.  I'm not totally sure I'm happy with this, since the hash
> > changes if the rootfs is moved but the whole thing is thrown away if the
> > machine is rebooted anyways (shrug).  So it's a minor niggle.  Maybe a
> > persistent uuid for the container might be an idea worth considering.
> > Maybe it's too minor to worry too much about.
> > 
> > I thought about adding an additional directory under them for
> > "containers" and "by-hashes" and maybe "by-uuid" but that can all be
> > addressed in the future.  This is just down-and-dirty / prove it can
> > work, kind of code...
> > 
> > I did "steal" some code from monitor.c over into conf.c for the hash
> > routine.  Maybe that needs to be redone in a generalized function in case
> > we need to change it and, then, everything tags along.  A thought there.
> > 
> > -- 
> > 
> > This code will create a "/dev/.lxc" subdirectory on the host and
> > populate it with "/dev/.lxc/hash({rootfs})" directories with
> > "/dev/.lxc/{container}" symlinks to them.  Those directories will then
> > be bind mounted to the container /dev/ directories and populated when
> > autodev = 1.
> > 
> > They are NOT cleaned up upon container shutdown or reboot of the
> > container which can allow for some level of reboot persistence (as long
> > as it's not the host reboot).  The symlinks are unlinked and relinked
> > allowing for multiple containers having the same names (and extensions
> > in the cgroup names) even though I personally feel that's an abominable
> > (albeit rare) practice.  It should all work.
> > 
> > It does not currently incorporate any udev rules or hooks but is
> > eminently amenable to them.  I'm working on that.  That's the ultimate
> > goal, to facilitate more utilization of dynamic devices with symlinks
> > and device renumbering.
> > 
> > Right now, it simply replaces the use of tmpfs bind mounted to /dev in
> > the container with a subdirectory of devtmpfs.  With this, we can
> > address a few other problems wrt dynamic devices and udev rules,
> > including shared devices.  It includes some logic for automagically
> > switching on autodev which, I admit, is suboptimal but I would like to
> > see this mature where autodev=1 is the default and can be overridden in
> > the unusual case where it has to be disabled.
> > 
> > It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
> > host.  This is, potentially, a serious corner case if we have a host
> > with a static /dev on root and hosting a systemd based container.  That
> > is going to break.  Bad juju.  How common?  I don't know.  I feel that
> > check probably needs to go in but I felt like I wanted this code to get
> > out for discussion first.  What do we do if it is not the case?  I
> > suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
> > ourselves.  That code is also not there and the idea is open for
> > discussion.
> > 
> > Currently, I have this running with autodev = 1 for all my containers,
> > no matter if they are systemd based or upstart based or sysv init based.
> > All of them are working cleanly and this seems to have solved a few past
> > problems with bad modes when autodev was set to one for non-systemd
> > containers (early Fedora and CentOS).
> > 
> > This is a git diff against 1.0.0.alpha2.  After the jump...
> > 
> > Regards,
> > Mike
> > -- 
> > Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
> >    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
> >    NIC whois: MHW9          | An optimist believes we live in the best of all
> >  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
> > 
> > -- 
> > diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> > index 208c08b..21baf20 100644
> > --- a/src/lxc/conf.c
> > +++ b/src/lxc/conf.c
> > @@ -29,6 +29,7 @@
> >  #include <string.h>
> >  #include <dirent.h>
> >  #include <unistd.h>
> > +#include <inttypes.h>
> >  #include <sys/wait.h>
> >  #include <sys/syscall.h>
> >  #include <time.h>
> > @@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
> >  	return 0;
> >  }
> >  
> > +
> > +/*
> > + * Note: This is a verbatim copy of what is in monitor.c.  We're just
> > + * using it here to generate a safe subdirectory in /dev/ for the
> > + * container's /dev/
> > + */
> > +
> > +/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
> > + * FNV has good anti collision properties and we're not worried
> > + * about pre-image resistance or one-way-ness, we're just trying to make
> > + * the name unique in the 108 bytes of space we have.
> > + */
> > +#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
> > +static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
> > +{
> > +	unsigned char *bp;
> > +
> > +	for(bp = buf; bp < (unsigned char *)buf + len; bp++)
> > +	{
> > +		/* xor the bottom with the current octet */
> > +		hval ^= (uint64_t)*bp;
> > +
> > +		/* gcc optimised:
> > +		 * multiply by the 64 bit FNV magic prime mod 2^64
> > +		 */
> > +		hval += (hval << 1) + (hval << 4) + (hval << 5) +
> > +			(hval << 7) + (hval << 8) + (hval << 40);
> > +	}
> > +
> > +	return hval;
> > +}
> > +
> > +/*
> > + * Locate a devtmpfs mount (should be on /dev) and create a container
> > + * subdirectory on it which we can then bind mount to the container
> > + * /dev instead of mounting a tmpfs there.
> > + * If we fail, return NULL.
> > + * Else return the pointer to the name buffer with the string to
> > + * the devtmpfs subdirectory.
> > + */
> > +
> > +char *mk_devtmpfs(const char *name, char *path)
> > +{
> > +	int ret;
> > +	struct stat s;
> > +	char tmp_path[MAXPATHLEN];
> > +	char sym_path[MAXPATHLEN];
> > +	char *base_path = "/dev/.lxc"; 
> > +	uint64_t hash;
> > +
> > +	/*
> > +	 * We're starting off with a fixed path /dev and assuming this is
> > +	 * a devtmpfs mount.  These are poor assumptions.  We need to add
> > +	 * more robust code to find it and verify it then use it.
> > +	 */
> > +
> > +	if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		/* This is just making /dev/.lxc; it had better work or we're done */
> > +		ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* Now check and/or create our container subname */
> > +	/* Symlink path... */
> > +	ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	/* Actual directory path; hash sym_path since path is not yet filled in */
> > +	hash = fnv_64a_buf(sym_path, ret, FNV1A_64_INIT);
> > +	ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
> > +	/* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* This sets up a symlink from the container name (cgroup name) in
> > +	 * /dev/.lxc over to the actual hash based device directory.  This
> > +	 * is strictly for a convenient cross reference and may be used
> > +	 * by udev rules to move devices in and out of the container spaces.
> > +	 */
> > +
> > +	unlink( sym_path );
> > +	ret = symlink(tmp_path, sym_path);
> > +
> > +	if ( ret < 0 ) {
> > +		SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
> > +	}
> > +
> > +	strcpy( path, tmp_path );
> > +	return path;
> > +}
> > +
> > +
> >  /*
> >   * Do we want to add options for max size of /dev and a file to
> >   * specify which devices to create?
> >   */
> > -static int mount_autodev(char *root)
> > +static int mount_autodev(const char *name, char *root)
> >  {
> >  	int ret;
> > +	struct stat s;
> >  	char path[MAXPATHLEN];
> > +	char devtmpfs_path[MAXPATHLEN];
> >  
> >  	INFO("Mounting /dev under %s\n", root);
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
> >  	if (ret < 0 || ret > MAXPATHLEN)
> >  		return -1;
> > -	ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	if (mk_devtmpfs( name, devtmpfs_path ) ) {
> > +		ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
> > +	} else {
> > +		ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	}
> >  	if (ret) {
> >  		SYSERROR("Failed to mount /dev at %s\n", root);
> >  		return -1;
> > @@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
> >  	if (ret < 0 || ret >= MAXPATHLEN)
> >  		return -1;
> > -	ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > -	if (ret) {
> > -		SYSERROR("Failed to create /dev/pts in container");
> > -		return -1;
> > +	/*
> > +	 * If we are running on a devtmpfs mapping, dev/pts may already exist.
> > +	 * If not, then create it and exit if that fails...
> > +	 */
> > +	if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if (ret) {
> > +			SYSERROR("Failed to create /dev/pts in container");
> > +			return -1;
> > +		}
> >  	}
> >  
> >  	INFO("Mounted /dev under %s\n", root);
> > @@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
> >  
> >  	new->loglevel = LXC_LOG_PRIORITY_NOTSET;
> >  	new->personality = -1;
> > +	new->autodev = -1;
> >  	new->console.log_path = NULL;
> >  	new->console.log_fd = -1;
> >  	new->console.path = NULL;
> > @@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * This routine is called when the configuration does not already specify a value
> > + * for autodev (mounting a file system on /dev and populating it in a container).
> > + * If a hard override value has not been specified, then we try to apply some
> > + * heuristics to determine if we should switch to autodev mode.
> > + *
> > + * For instance, if the container has an /etc/systemd/system directory then it
> > + * is probably running systemd as the init process and it needs the autodev
> > + * mount to prevent it from mounting devtmpfs on /dev on its own causing conflicts
> > + * in the host.
> > + *
> > + * We may also want to enable autodev if the host has devtmpfs mounted on its
> > + * /dev as this then enables us to use subdirectories under /dev for the container
> > + * /dev directories and we can fake udev devices.
> > + */
> > +int check_autodev( const char *rootfs )
> > +{
> > +	char absrootfs[MAXPATHLEN];
> > +	int ret;
> > +	struct stat s;
> > +	char path[MAXPATHLEN];
> > +
> > +	INFO("Testing for systemd in %s\n", rootfs);
> > +
> > +	if (rootfs == NULL || strlen(rootfs) == 0)
> > +		return -2;
> > +
> > +	if (!realpath(rootfs, absrootfs))
> > +		return -2;
> > +
> > +	/* Note here: we could instead check for /etc/systemd/system.conf  */
> > +	ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return -2;
> > +
> > +	if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
> > +		return 1;
> > +
> > +
> > +	/* Add future checks here.
> > +	 *	Return positive if we should go autodev
> > +	 *	Return 0 if we should NOT go autodev
> > +	 *	Return negative if we encounter an error or can not determine...
> > +	 */
> > +
> > +	/* All else fails, disable autodev */
> > +	return 0;
> > +}
> > +
> >  int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
> >  {
> >  	if (setup_utsname(lxc_conf->utsname)) {
> > @@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > -		if (mount_autodev(lxc_conf->rootfs.mount)) {
> > +	if (lxc_conf->autodev < 0) {
> > +		lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
> > +	}
> > +
> > +	if (lxc_conf->autodev > 0) {
> > +		if (mount_autodev(name, lxc_conf->rootfs.mount)) {
> >  			ERROR("failed to mount /dev in the container");
> >  			return -1;
> >  		}
> > @@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > +	if (lxc_conf->autodev > 0) {
> >  		if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
> >  			ERROR("failed to run autodev hooks for container '%s'.", name);
> >  			return -1;
> > 
> 
> 
> 

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!