[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...

Michael H. Warfield mhw at WittsEnd.com
Fri Nov 1 21:36:33 UTC 2013


On Thu, 2013-10-31 at 13:00 -0500, Serge Hallyn wrote: 
> Quoting Michael H. Warfield (mhw at WittsEnd.com):
> > I did incorporate your suggestion of using the hash of the rootfs path
> > as the subdirectory under the hosts /dev/ for the container.  I also
> 
> (Printed this out to look it over, just putting all my comments together
> here) :
> 
> 1. I think if /dev is not devtmpfs, we should just bail on this.
> 
> 2. You say in comments that you're using the cgroup name, but it seems
>    you're actually just using the container name?
> 
> 3. The cgroup name used to be unique, but now each mounted cgroupfs
>    can actually have a different name for the same container (if some
>    of them didn't get cleaned out well).

Maybe I misunderstood something here but I thought you told me at
LinuxPlumbers that the container name no longer need to be unique.  I'm
finding that not to be the case...

I have a container named "Fedora19" already.  I set up and independent
container in another directory I'm also calling Fedora19.  When I start
it up I get this:

[root at hydra WildMachines]# lxc-start -n Fedora19 -f !$
lxc-start -n Fedora19 -f Fedora19/config
lxc-start: failed (98) to create the command service
point /var/lib/lxc/Fedora19/command
lxc-start: ##
lxc-start: # The container appears to be already running!
lxc-start: ##
lxc-start: failed to initialize the container

So it definitely does NOT like two containers with the same name.  So,
why can I not just simply use the container name?  Is that going to
change?  How do I start two containers with the same name?

Maybe you meant that the container name could be associated
independently with the cgroup name but these names all seem to need to
be unique.

Regards,
Mike

> I'm just thinking out loud here, so this may not be better, but how
> about
> 
> 1. create /dev/.lxc as you're doing
> 
> 2. (if container is going to use this) create /dev/.lxc/$nonce.
>    We can use hash("$lxcpath/$lxcname"), or just mkstemp(), or
>    just an increasing integer.
> 
> 3. Create $lxcpath/$lxcname/.dev (if the container needs it) and
>    shared-bind-mount /dev/.lxc/$nonce onto it.  Now we can tell
>    which /dev/.lxc/* is mounted by looking at the mount table.
> 
> 4. slave-bind-bind mount $lxcpath/$lxcname/.dev into the starting
>    container's /dev.
> 
> Not sure whether we should have lxc.autodev = 2 mean use this scheme,
> but I'd be fine with basically always doing this so long as /dev/ is
> devtmpfs and lxc.autodev is set for the container.  (So making
> the container's /dev a tmpfs would just be a fallback).
> 
> Thoughts?
> 
> -serge
> 
> > added additional symlinks for the container name to the hash
> > subdirectory.  I'm not totally sure I'm happy with this, since the hash
> > changes if the rootfs is moved but the whole thing is thrown away if the
> > machine is rebooted anyways (shrug).  So it's a minor niggle.  Maybe a
> > persistent uuid for the container might be an idea worth considering.
> > Maybe it's too minor to worry too much about.
> > 
> > I thought about adding an additional directory under them for
> > "containers" and "by-hashes" and maybe "by-uuid" but that can all be
> > addressed in the future.  This is just down-and-dirty / prove it can
> > work, kind of code...
> > 
> > I did "steal" some code from monitor.c over into conf.c for the hash
> > routine.  Maybe that needs to be redone in a generalize function in case
> > we need to change it and, then, everything tags along.  A thought there.
> > 
> > -- 
> > 
> > This code will create a "/dev/.lxc" subdirectory on the host and
> > populated it with "/dev/.lxc/hash({rootfs})" directories with
> > "/dev/.lxc/{container}" symlinks to them.  Those directories will then
> > be bind mounted to the container /dev/ directories and populated when
> > autodev = 1.
> > 
> > They are NOT cleaned up upon container shutdown or reboot of the
> > container which can allow for some level of reboot persistence (as long
> > as it's not the host reboot).  The symlinks are unlinked and relinked
> > allowing for multiple containers having the same names (and extensions
> > in the cgroup names) even though I personally feel that's an abonminal
> > (albeit rare) practice.  It should all work.
> > 
> > It does not currently incorporate any udev rules or hooks but is
> > eminately amenable to them.  I'm working on that.  That's the ultimate
> > goal, to facilitate more utilization of dynamic devices with symlinks
> > and device renumbering.
> > 
> > Right now, it simply replaces the use of tmpfs bind mounted to /dev in
> > the container with a subdirectory of devtmpfs.  With this, we can
> > address a few other problems wrt dynamic devices and udev rules,
> > including shared devices.  It includes some logic for automagically
> > switching on autodev which, I admit, is suboptimal but I would like to
> > see this mature where autodev=1 is the default and can be overridden in
> > the unusual case where it has to be disabled.
> > 
> > It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
> > host.  This is, potentially, a serious corner case if we have a host
> > with a static /dev on root and hosting a systemd based container.  That
> > is going to break.  Bad juju.  How common?  I don't know.  I feel that
> > check probably needs to go in but I felt like I wanted this code to get
> > out for discussion first.  What do we do it is not the case?  I
> > suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
> > ourselves.  That code is also not there and the idea is open for
> > discussion.
> > 
> > Currently, I have this running with autodev = 1 for all my containers,
> > no matter if they are systemd based or upstart based or sysv init based.
> > All of them are working cleanly and this seems to have solved a few past
> > problems with bad modes when autdev was set to one for non-systemd
> > containers (early Fedora and CentOS).
> > 
> > This is a git diff against 1.0.0.alpha2.  After the jump...
> > 
> > Regards,
> > Mike
> > -- 
> > Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
> >    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
> >    NIC whois: MHW9          | An optimist believes we live in the best of all
> >  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
> > 
> > -- 
> > diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> > index 208c08b..21baf20 100644
> > --- a/src/lxc/conf.c
> > +++ b/src/lxc/conf.c
> > @@ -29,6 +29,7 @@
> >  #include <string.h>
> >  #include <dirent.h>
> >  #include <unistd.h>
> > +#include <inttypes.h>
> >  #include <sys/wait.h>
> >  #include <sys/syscall.h>
> >  #include <time.h>
> > @@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
> >  	return 0;
> >  }
> >  
> > +
> > +/*
> > + * Note: This is a verbatum copy of what is in monitor.c.  We're just
> > + * usint it here to generate a safe subdirectory in /dev/ for the
> > + * containers /dev/
> > + */
> > +
> > +/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
> > + * FNV has good anti collision properties and we're not worried
> > + * about pre-image resistance or one-way-ness, we're just trying to make
> > + * the name unique in the 108 bytes of space we have.
> > + */
> > +#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
> > +static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
> > +{
> > +	unsigned char *bp;
> > +
> > +	for(bp = buf; bp < (unsigned char *)buf + len; bp++)
> > +	{
> > +		/* xor the bottom with the current octet */
> > +		hval ^= (uint64_t)*bp;
> > +
> > +		/* gcc optimised:
> > +		 * multiply by the 64 bit FNV magic prime mod 2^64
> > +		 */
> > +		hval += (hval << 1) + (hval << 4) + (hval << 5) +
> > +			(hval << 7) + (hval << 8) + (hval << 40);
> > +	}
> > +
> > +	return hval;
> > +}
> > +
> > +/*
> > + * Locate a devtmpfs mount (should be on /dev) and create a container
> > + * subdirectory on it which we can then bind mount to the container
> > + * /dev instead of mounting a tmpfs there.
> > + * If we fail, return NULL.
> > + * Else return the pointer to the name buffer with the string to
> > + * the devtmpfs subdirectory.
> > + */
> > +
> > +char *mk_devtmpfs(const char *name, char *path)
> > +{
> > +	int ret;
> > +	struct stat s;
> > +	char tmp_path[MAXPATHLEN];
> > +	char sym_path[MAXPATHLEN];
> > +	char *base_path = "/dev/.lxc"; 
> > +	uint64_t hash;
> > +
> > +	/*
> > +	 * We're starting off with a fixed path /dev and assuming this is
> > +	 * a devtmpfs mount.  These are poor assumptions.  We need to add
> > +	 * more robust code to find it and verify it then use it.
> > +	 */
> > +
> > +	if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		/* This is just making /dev/.lxc it better work or we're done */
> > +		ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* Now check and/or create our container subname */
> > +	/* Symlink path... */
> > +	ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	/* Actual directory path */
> > +	hash = fnv_64a_buf(path, ret, FNV1A_64_INIT);
> > +	ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
> > +	/* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* This sets up a symlink from the container name (cgroup name) in
> > +	 * /dev/.lxc over to the actual hash based device directory.  This
> > +	 * is strictly for a convenient cross reference and may be used
> > +	 * by udev rules to move devices in and out of the container spaces.
> > +	 */
> > +
> > +	unlink( sym_path );
> > +	ret = symlink(tmp_path, sym_path);
> > +
> > +	if ( ret < 0 ) {
> > +		SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
> > +	}
> > +
> > +	strcpy( path, tmp_path );
> > +	return path;
> > +}
> > +
> > +
> >  /*
> >   * Do we want to add options for max size of /dev and a file to
> >   * specify which devices to create?
> >   */
> > -static int mount_autodev(char *root)
> > +static int mount_autodev(const char *name, char *root)
> >  {
> >  	int ret;
> > +	struct stat s;
> >  	char path[MAXPATHLEN];
> > +	char devtmpfs_path[MAXPATHLEN];
> >  
> >  	INFO("Mounting /dev under %s\n", root);
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
> >  	if (ret < 0 || ret > MAXPATHLEN)
> >  		return -1;
> > -	ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	if (mk_devtmpfs( name, devtmpfs_path ) ) {
> > +		ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
> > +	} else {
> > +		ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	}
> >  	if (ret) {
> >  		SYSERROR("Failed to mount /dev at %s\n", root);
> >  		return -1;
> > @@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
> >  	if (ret < 0 || ret >= MAXPATHLEN)
> >  		return -1;
> > -	ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > -	if (ret) {
> > -		SYSERROR("Failed to create /dev/pts in container");
> > -		return -1;
> > +	/*
> > +	 * If we are running on a devtmpfs mapping, dev/pts may already exist.
> > +	 * If not, then create it and exit if that fails...
> > +	 */
> > +	if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if (ret) {
> > +			SYSERROR("Failed to create /dev/pts in container");
> > +			return -1;
> > +		}
> >  	}
> >  
> >  	INFO("Mounted /dev under %s\n", root);
> > @@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
> >  
> >  	new->loglevel = LXC_LOG_PRIORITY_NOTSET;
> >  	new->personality = -1;
> > +	new->autodev = -1;
> >  	new->console.log_path = NULL;
> >  	new->console.log_fd = -1;
> >  	new->console.path = NULL;
> > @@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * This routine is called when the configuration does not already specify a value
> > + * for autodev (mounting a file system on /dev and populating it in a container).
> > + * If a hard override value has not be specified, then we try to apply some
> > + * heuristics to determine if we should switch to autodev mode.
> > + *
> > + * For instance, if the container has an /etc/systemd/system directory then it
> > + * is probably running systemd as the init process and it needs the autodev
> > + * mount to prevent it from mounting devtmpfs on /dev on it's own causing conflicts
> > + * in the host.
> > + *
> > + * We may also want to enable autodev if the host has devtmpfs mounted on its
> > + * /dev as this then enable us to use subdirectories under /dev for the container
> > + * /dev directories and we can fake udev devices.
> > + */
> > +int check_autodev( const char *rootfs )
> > +{
> > +	char absrootfs[MAXPATHLEN];
> > +	int ret;
> > +	struct stat s;
> > +	char path[MAXPATHLEN];
> > +
> > +	INFO("Testing for systemd in %s\n", rootfs);
> > +
> > +	if (rootfs == NULL || strlen(rootfs) == 0)
> > +		return -2;
> > +
> > +	if (!realpath(rootfs, absrootfs))
> > +		return -2;
> > +
> > +	/* Note here: we could instead check for /etc/systemd/system.conf  */
> > +	ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
> > +	if (ret < 0 || ret > MAXPATHLEN)
> > +		return -2;
> > +
> > +	if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
> > +		return 1;
> > +
> > +
> > +	/* Add future checks here.
> > +	 *	Return positive if we should go autodev
> > +	 *	Return 0 if we should NOT go autodev
> > +	 *	Return negative if we encounter an error or can not determine...
> > +	 */
> > +
> > +	/* All else fails, disable autodev */
> > +	return 0;
> > +}
> > +
> >  int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
> >  {
> >  	if (setup_utsname(lxc_conf->utsname)) {
> > @@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > -		if (mount_autodev(lxc_conf->rootfs.mount)) {
> > +	if (lxc_conf->autodev < 0) {
> > +		lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
> > +	}
> > +
> > +	if (lxc_conf->autodev > 0) {
> > +		if (mount_autodev(name, lxc_conf->rootfs.mount)) {
> >  			ERROR("failed to mount /dev in the container");
> >  			return -1;
> >  		}
> > @@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > +	if (lxc_conf->autodev > 0) {
> >  		if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
> >  			ERROR("failed to run autodev hooks for container '%s'.", name);
> >  			return -1;
> > 
> 
> 
> 
> > ------------------------------------------------------------------------------
> > October Webinars: Code for Performance
> > Free Intel webinars can help you accelerate application performance.
> > Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
> > the latest Intel processors and coprocessors. See abstracts and register >
> > http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk
> 
> > _______________________________________________
> > Lxc-devel mailing list
> > Lxc-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/lxc-devel
> 
> 

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131101/ee044916/attachment.pgp>


More information about the lxc-devel mailing list