[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...

Michael H. Warfield mhw at WittsEnd.com
Thu Oct 31 22:25:45 UTC 2013


On Thu, 2013-10-31 at 13:00 -0500, Serge Hallyn wrote: 
> Quoting Michael H. Warfield (mhw at WittsEnd.com):
> > I did incorporate your suggestion of using the hash of the rootfs path
> > as the subdirectory under the hosts /dev/ for the container.  I also

> (Printed this out to look it over, just putting all my comments together
> here) :

> 1. I think if /dev is not devtmpfs, we should just bail on this.

Sort of concur and I think I even made a remark in some of the code
about checking for that.

I'm of two schools of thought here.

1) Mount our own instance of devtmpfs in a private area under our
control.

2) Bail entirely.  This would be a fall back, in any case, if we didn't
have devtmpfs available to us (is that possible with modern kernels?).

> 2. You say in comments that you're using the cgroup name, but it seems
>    you're actually just using the container name?

I thought I was.  Maybe I misunderstood...

> 3. The cgroup name used to be unique, but now each mounted cgroupfs
>    can actually have a different name for the same container (if some
>    of them didn't get cleaned out well).

Ok...  One of my problems in that particular area of code is knowing
where to get at some things that are not in the lxc_conf.  I thought the
"name" parameter was the cgroup name but apparently not.  I could use
some guidance there.

> I'm just thinking out loud here, so this may not be better, but how
> about

> 1. create /dev/.lxc as you're doing

> 2. (if container is going to use this) create /dev/.lxc/$nonce.
>    We can use hash("$lxcpath/$lxcname"), or just mkstemp(), or
>    just an increasing integer.

Well, I was ok with what you said about using the hash of the rootfs
real path, which is what corresponded to what you had in container.c for
the monitor socket.  All things being equal, I'd like to stick with
that.  As a convention, I also like sticking in a symlink for the
container name pointing at that hash name.  That has some advantages for
diagnostic purposes to poke around in the containers /dev without having
to go through headstands figuring out where it is.

> 3. Create $lxcpath/$lxcname/.dev (if the container needs it) and
>    shared-bind-mount /dev/.lxc/$nonce onto it.  Now we can tell
>    which /dev/.lxc/* is mounted by looking at the mount table.

Hmmm...  Ok...  I think I see where you're going with that.  I'll have
to think on that one.

> 4. slave-bind-bind mount $lxcpath/$lxcname/.dev into the starting
>    container's /dev.

> Not sure whether we should have lxc.autodev = 2 mean use this scheme,
> but I'd be fine with basically always doing this so long as /dev/ is
> devtmpfs and lxc.autodev is set for the container.  (So making
> the container's /dev a tmpfs would just be a fallback).

> Thoughts?

Definitely 3 and 4 are worth doing.  I'm not so sure about 2.  Since
we're already using the hash of the rootfs path for the monitor socket,
I don't see a problem keeping that here, at least for now.  But there is
the little details of having that hashing code in two source files now.
Should that be moved to a common source file?

I do have one other niggle, and I'm surprised you didn't ding me on that
(since you expressed concerns earlier).  The automatic autodev detection
is in there.  I did see in at least one other spot where we detect a
potentially hazardous condition and bail.  So there's some reasonable
precedence for some safety checking.

Someone in another threat suggested checking the symlinks for anything
with "systemd" and autosetting autodev if detected.  It's better than
what I have now but I'm concerned about just looking at /sbin/init,
especially in the case where a process is specified on the command line.
I'd like to implement it that way but I don't see where I have the
"command to execute" available to me.  Is it in the config structure
somewhere and I'm just blind?

I did see that my autodev detect logic was triggering on Ubuntu.  Seems
that Ubuntu has systemd available but not enabled?  IAC, all my Ubuntu
containers are perfectly happy with autodev enabled, even if upstart
doesn't need it.

I'll work in this over the next couple of days and do this additional
redirection.

> -serge

> > added additional symlinks for the container name to the hash
> > subdirectory.  I'm not totally sure I'm happy with this, since the hash
> > changes if the rootfs is moved but the whole thing is thrown away if the
> > machine is rebooted anyways (shrug).  So it's a minor niggle.  Maybe a
> > persistent uuid for the container might be an idea worth considering.
> > Maybe it's too minor to worry too much about.
> > 
> > I thought about adding an additional directory under them for
> > "containers" and "by-hashes" and maybe "by-uuid" but that can all be
> > addressed in the future.  This is just down-and-dirty / prove it can
> > work, kind of code...
> > 
> > I did "steal" some code from monitor.c over into conf.c for the hash
> > routine.  Maybe that needs to be redone in a generalize function in case
> > we need to change it and, then, everything tags along.  A thought there.
> > 
> > -- 
> > 
> > This code will create a "/dev/.lxc" subdirectory on the host and
> > populated it with "/dev/.lxc/hash({rootfs})" directories with
> > "/dev/.lxc/{container}" symlinks to them.  Those directories will then
> > be bind mounted to the container /dev/ directories and populated when
> > autodev = 1.
> > 
> > They are NOT cleaned up upon container shutdown or reboot of the
> > container which can allow for some level of reboot persistence (as long
> > as it's not the host reboot).  The symlinks are unlinked and relinked
> > allowing for multiple containers having the same names (and extensions
> > in the cgroup names) even though I personally feel that's an abonminal
> > (albeit rare) practice.  It should all work.
> > 
> > It does not currently incorporate any udev rules or hooks but is
> > eminately amenable to them.  I'm working on that.  That's the ultimate
> > goal, to facilitate more utilization of dynamic devices with symlinks
> > and device renumbering.
> > 
> > Right now, it simply replaces the use of tmpfs bind mounted to /dev in
> > the container with a subdirectory of devtmpfs.  With this, we can
> > address a few other problems wrt dynamic devices and udev rules,
> > including shared devices.  It includes some logic for automagically
> > switching on autodev which, I admit, is suboptimal but I would like to
> > see this mature where autodev=1 is the default and can be overridden in
> > the unusual case where it has to be disabled.
> > 
> > It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
> > host.  This is, potentially, a serious corner case if we have a host
> > with a static /dev on root and hosting a systemd based container.  That
> > is going to break.  Bad juju.  How common?  I don't know.  I feel that
> > check probably needs to go in but I felt like I wanted this code to get
> > out for discussion first.  What do we do it is not the case?  I
> > suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
> > ourselves.  That code is also not there and the idea is open for
> > discussion.
> > 
> > Currently, I have this running with autodev = 1 for all my containers,
> > no matter if they are systemd based or upstart based or sysv init based.
> > All of them are working cleanly and this seems to have solved a few past
> > problems with bad modes when autdev was set to one for non-systemd
> > containers (early Fedora and CentOS).
> > 
> > This is a git diff against 1.0.0.alpha2.  After the jump...
> > 
> > Regards,
> > Mike
> > -- 
> > Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
> >    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
> >    NIC whois: MHW9          | An optimist believes we live in the best of all
> >  PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
> > 
> > -- 
> > diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> > index 208c08b..21baf20 100644
> > --- a/src/lxc/conf.c
> > +++ b/src/lxc/conf.c
> > @@ -29,6 +29,7 @@
> >  #include <string.h>
> >  #include <dirent.h>
> >  #include <unistd.h>
> > +#include <inttypes.h>
> >  #include <sys/wait.h>
> >  #include <sys/syscall.h>
> >  #include <time.h>
> > @@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
> >  	return 0;
> >  }
> >  
> > +
> > +/*
> > + * Note: This is a verbatum copy of what is in monitor.c.  We're just
> > + * usint it here to generate a safe subdirectory in /dev/ for the
> > + * containers /dev/
> > + */
> > +
> > +/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
> > + * FNV has good anti collision properties and we're not worried
> > + * about pre-image resistance or one-way-ness, we're just trying to make
> > + * the name unique in the 108 bytes of space we have.
> > + */
> > +#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
> > +static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
> > +{
> > +	unsigned char *bp;
> > +
> > +	for(bp = buf; bp < (unsigned char *)buf + len; bp++)
> > +	{
> > +		/* xor the bottom with the current octet */
> > +		hval ^= (uint64_t)*bp;
> > +
> > +		/* gcc optimised:
> > +		 * multiply by the 64 bit FNV magic prime mod 2^64
> > +		 */
> > +		hval += (hval << 1) + (hval << 4) + (hval << 5) +
> > +			(hval << 7) + (hval << 8) + (hval << 40);
> > +	}
> > +
> > +	return hval;
> > +}
> > +
> > +/*
> > + * Locate a devtmpfs mount (should be on /dev) and create a container
> > + * subdirectory on it which we can then bind mount to the container
> > + * /dev instead of mounting a tmpfs there.
> > + * If we fail, return NULL.
> > + * Else return the pointer to the name buffer with the string to
> > + * the devtmpfs subdirectory.
> > + */
> > +
> > +char *mk_devtmpfs(const char *name, char *path)
> > +{
> > +	int ret;
> > +	struct stat s;
> > +	char tmp_path[MAXPATHLEN];
> > +	char sym_path[MAXPATHLEN];
> > +	char *base_path = "/dev/.lxc"; 
> > +	uint64_t hash;
> > +
> > +	/*
> > +	 * We're starting off with a fixed path /dev and assuming this is
> > +	 * a devtmpfs mount.  These are poor assumptions.  We need to add
> > +	 * more robust code to find it and verify it then use it.
> > +	 */
> > +
> > +	if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		/* This is just making /dev/.lxc it better work or we're done */
> > +		ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* Now check and/or create our container subname */
> > +	/* Symlink path... */
> > +	ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	/* Actual directory path */
> > +	hash = fnv_64a_buf(path, ret, FNV1A_64_INIT);
> > +	ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
> > +	/* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
> > +	if (ret < 0 || ret >= MAXPATHLEN)
> > +		return NULL;
> > +
> > +	if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if ( ret ) {
> > +			return NULL;
> > +		}
> > +	}
> > +
> > +	/* This sets up a symlink from the container name (cgroup name) in
> > +	 * /dev/.lxc over to the actual hash based device directory.  This
> > +	 * is strictly for a convenient cross reference and may be used
> > +	 * by udev rules to move devices in and out of the container spaces.
> > +	 */
> > +
> > +	unlink( sym_path );
> > +	ret = symlink(tmp_path, sym_path);
> > +
> > +	if ( ret < 0 ) {
> > +		SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
> > +	}
> > +
> > +	strcpy( path, tmp_path );
> > +	return path;
> > +}
> > +
> > +
> >  /*
> >   * Do we want to add options for max size of /dev and a file to
> >   * specify which devices to create?
> >   */
> > -static int mount_autodev(char *root)
> > +static int mount_autodev(const char *name, char *root)
> >  {
> >  	int ret;
> > +	struct stat s;
> >  	char path[MAXPATHLEN];
> > +	char devtmpfs_path[MAXPATHLEN];
> >  
> >  	INFO("Mounting /dev under %s\n", root);
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
> >  	if (ret < 0 || ret > MAXPATHLEN)
> >  		return -1;
> > -	ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	if (mk_devtmpfs( name, devtmpfs_path ) ) {
> > +		ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
> > +	} else {
> > +		ret = mount("none", path, "tmpfs", 0, "size=100000");
> > +	}
> >  	if (ret) {
> >  		SYSERROR("Failed to mount /dev at %s\n", root);
> >  		return -1;
> > @@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
> >  	ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
> >  	if (ret < 0 || ret >= MAXPATHLEN)
> >  		return -1;
> > -	ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > -	if (ret) {
> > -		SYSERROR("Failed to create /dev/pts in container");
> > -		return -1;
> > +	/*
> > +	 * If we are running on a devtmpfs mapping, dev/pts may already exist.
> > +	 * If not, then create it and exit if that fails...
> > +	 */
> > +	if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > +		ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > +		if (ret) {
> > +			SYSERROR("Failed to create /dev/pts in container");
> > +			return -1;
> > +		}
> >  	}
> >  
> >  	INFO("Mounted /dev under %s\n", root);
> > @@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
> >  
> >  	new->loglevel = LXC_LOG_PRIORITY_NOTSET;
> >  	new->personality = -1;
> > +	new->autodev = -1;
> >  	new->console.log_path = NULL;
> >  	new->console.log_fd = -1;
> >  	new->console.path = NULL;
> > @@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * This routine is called when the configuration does not already specify a value
> > + * for autodev (mounting a file system on /dev and populating it in a container).
> > + * If a hard override value has not be specified, then we try to apply some
> > + * heuristics to determine if we should switch to autodev mode.
> > + *
> > + * For instance, if the container has an /etc/systemd/system directory then it
> > + * is probably running systemd as the init process and it needs the autodev
> > + * mount to prevent it from mounting devtmpfs on /dev on it's own causing conflicts
> > + * in the host.
> > + *
> > + * We may also want to enable autodev if the host has devtmpfs mounted on its
> > + * /dev as this then enable us to use subdirectories under /dev for the container
> > + * /dev directories and we can fake udev devices.
> > + */
> > +int check_autodev( const char *rootfs )
> > +{
> > +	char absrootfs[MAXPATHLEN];
> > +	int ret;
> > +	struct stat s;
> > +	char path[MAXPATHLEN];
> > +
> > +	INFO("Testing for systemd in %s\n", rootfs);
> > +
> > +	if (rootfs == NULL || strlen(rootfs) == 0)
> > +		return -2;
> > +
> > +	if (!realpath(rootfs, absrootfs))
> > +		return -2;
> > +
> > +	/* Note here: we could instead check for /etc/systemd/system.conf  */
> > +	ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
> > +	if (ret < 0 || ret > MAXPATHLEN)
> > +		return -2;
> > +
> > +	if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
> > +		return 1;
> > +
> > +
> > +	/* Add future checks here.
> > +	 *	Return positive if we should go autodev
> > +	 *	Return 0 if we should NOT go autodev
> > +	 *	Return negative if we encounter an error or can not determine...
> > +	 */
> > +
> > +	/* All else fails, disable autodev */
> > +	return 0;
> > +}
> > +
> >  int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
> >  {
> >  	if (setup_utsname(lxc_conf->utsname)) {
> > @@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > -		if (mount_autodev(lxc_conf->rootfs.mount)) {
> > +	if (lxc_conf->autodev < 0) {
> > +		lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
> > +	}
> > +
> > +	if (lxc_conf->autodev > 0) {
> > +		if (mount_autodev(name, lxc_conf->rootfs.mount)) {
> >  			ERROR("failed to mount /dev in the container");
> >  			return -1;
> >  		}
> > @@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> >  		return -1;
> >  	}
> >  
> > -	if (lxc_conf->autodev) {
> > +	if (lxc_conf->autodev > 0) {
> >  		if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
> >  			ERROR("failed to run autodev hooks for container '%s'.", name);
> >  			return -1;
> > 
> 
> 
> 
> > ------------------------------------------------------------------------------
> > October Webinars: Code for Performance
> > Free Intel webinars can help you accelerate application performance.
> > Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
> > the latest Intel processors and coprocessors. See abstracts and register >
> > http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk
> 
> > _______________________________________________
> > Lxc-devel mailing list
> > Lxc-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/lxc-devel
> 
> 
> ------------------------------------------------------------------------------
> Android is increasing in popularity, but the open development platform that
> developers love is also attractive to malware creators. Download this white
> paper to learn more about secure code signing practices that can help keep
> Android apps secure.
> http://pubads.g.doubleclick.net/gampad/clk?id=65839951&iu=/4140/ostg.clktrk
> _______________________________________________
> Lxc-devel mailing list
> Lxc-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lxc-devel
> 

-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131031/2061f523/attachment.pgp>


More information about the lxc-devel mailing list