[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...

Michael H. Warfield mhw at WittsEnd.com
Sun Oct 20 02:44:42 UTC 2013


Ok all!

I'm late/slow with this.  What else is new.  I thought I could hammer
this out in a week or so after LinuxPlumbers but my day job just kept on
getting in my face...  Reality comes crashing in...

This is what Serge and I were discussing there in NOLA with regards to
using devtmpfs instead of tmpfs for /dev/ in containers where autodev is
set to 1.  My remark was "this should suck less".  That would be
required primarily for (but not limited to) the cases and distros running
systemd in the container (not just Fedora, but positively Fedora).  My
goal has been to make this functional to the point that anything
requiring a /dev (anything other than a simple app) should work with it.

This is not a PATCH (well, it is, it's just not being submitted for
inclusion yet).  This is a Proof Of Concept (POC).  It's a patch against
GIT 1.0.0.alpha2.  It's what I was suggesting and we were discussing at
LinuxPlumbers several weeks ago.  I want discussion and I want rocks
thrown at it.  It's merely my idea of how to solve some corner cases and
some unusual use cases (several of which I own) in a generic manner.
From what I see with what I have, I don't see any downside to doing
this but I don't own all the use cases (obviously).

-- 
Serge -

I did incorporate your suggestion of using the hash of the rootfs path
as the subdirectory under the host's /dev/ for the container.  I also
added additional symlinks for the container name to the hash
subdirectory.  I'm not totally sure I'm happy with this, since the hash
changes if the rootfs is moved but the whole thing is thrown away if the
machine is rebooted anyways (shrug).  So it's a minor niggle.  Maybe a
persistent uuid for the container might be an idea worth considering.
Maybe it's too minor to worry too much about.

I thought about adding an additional directory under them for
"containers" and "by-hashes" and maybe "by-uuid" but that can all be
addressed in the future.  This is just down-and-dirty / prove it can
work, kind of code...

I did "steal" some code from monitor.c over into conf.c for the hash
routine.  Maybe that needs to be redone in a generalized function in case
we need to change it and, then, everything tags along.  A thought there.

-- 

This code will create a "/dev/.lxc" subdirectory on the host and
populate it with "/dev/.lxc/hash({rootfs})" directories with
"/dev/.lxc/{container}" symlinks to them.  Those directories will then
be bind mounted to the container /dev/ directories and populated when
autodev = 1.
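
For illustration only, here is a minimal sketch of how such a directory
name can be derived from a rootfs path.  It is not the code in the patch:
fnv1a_64() and devdir_for_rootfs() are hypothetical names, and the hash is
written as a plain multiply by the 64 bit FNV prime, which is the same
operation as the shift-and-add form copied from monitor.c.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
#define FNV_64_PRIME  ((uint64_t)0x100000001b3ULL)

/* FNV-1a, 64 bit: xor each octet in, then multiply by the FNV prime. */
uint64_t fnv1a_64(const void *buf, size_t len)
{
	const unsigned char *bp = buf;
	uint64_t hval = FNV1A_64_INIT;

	while (len--) {
		hval ^= (uint64_t)*bp++;
		hval *= FNV_64_PRIME;	/* wraps mod 2^64 */
	}
	return hval;
}

/*
 * Hypothetical helper (not in the patch): write "/dev/.lxc/<hash>" for
 * the given rootfs path into out.
 * Returns 0 on success, -1 if the result would not fit in outlen bytes.
 */
int devdir_for_rootfs(const char *rootfs, char *out, size_t outlen)
{
	uint64_t hash = fnv1a_64(rootfs, strlen(rootfs));
	int ret = snprintf(out, outlen, "/dev/.lxc/%016" PRIx64, hash);

	return (ret < 0 || (size_t)ret >= outlen) ? -1 : 0;
}
```

The name is always 16 hex digits, so the directory path has a fixed
length no matter how long the rootfs path is.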

They are NOT cleaned up upon container shutdown or reboot of the
container which can allow for some level of reboot persistence (as long
as it's not the host reboot).  The symlinks are unlinked and relinked
allowing for multiple containers having the same names (and extensions
in the cgroup names) even though I personally feel that's an abominable
(albeit rare) practice.  It should all work.
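
As a rough sketch of the unlink-and-relink step (relink() is a
hypothetical name, and unlike the patch it treats a failed unlink as an
error unless the link simply was not there):

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: point linkpath at target, replacing any link
 * left behind by an earlier container of the same name.
 * Returns 0 on success, -1 on error (errno is set).
 */
int relink(const char *target, const char *linkpath)
{
	/* A missing link is fine; anything else is a real failure. */
	if (unlink(linkpath) < 0 && errno != ENOENT)
		return -1;

	return symlink(target, linkpath);
}
```

Note that this is not atomic.  There is a short window between the
unlink and the symlink where the name does not exist, which may matter
if udev rules end up following these links.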

It does not currently incorporate any udev rules or hooks but is
eminently amenable to them.  I'm working on that.  That's the ultimate
goal, to facilitate more utilization of dynamic devices with symlinks
and device renumbering.

Right now, it simply replaces the use of tmpfs bind mounted to /dev in
the container with a subdirectory of devtmpfs.  With this, we can
address a few other problems wrt dynamic devices and udev rules,
including shared devices.  It includes some logic for automagically
switching on autodev which, I admit, is suboptimal but I would like to
see this mature where autodev=1 is the default and can be overridden in
the unusual case where it has to be disabled.

It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
host.  This is, potentially, a serious corner case if we have a host
with a static /dev on root and hosting a systemd based container.  That
is going to break.  Bad juju.  How common?  I don't know.  I feel that
check probably needs to go in but I felt like I wanted this code to get
out for discussion first.  What do we do if it is not the case?  I
suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
ourselves.  That code is also not there and the idea is open for
discussion.
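
One possible shape for that missing check, as a sketch only: walk the
mount table with getmntent() and look at the filesystem type of whatever
is mounted on /dev.  The helper names mount_has_type() and
host_dev_is_devtmpfs() are made up for this example, and reading
/proc/self/mounts is an assumption about where the table lives.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan the mount table in mtab and report whether
 * the last filesystem mounted at dir has the type fstype.
 * Returns 1 if it does, 0 if it does not (or nothing is mounted there),
 * -1 if the table cannot be opened.
 */
int mount_has_type(const char *mtab, const char *dir, const char *fstype)
{
	struct mntent *ent;
	int found = 0;
	FILE *f = setmntent(mtab, "r");

	if (!f)
		return -1;

	/* Later entries shadow earlier ones, so keep the last match. */
	while ((ent = getmntent(f)) != NULL) {
		if (strcmp(ent->mnt_dir, dir) == 0)
			found = (strcmp(ent->mnt_type, fstype) == 0);
	}

	endmntent(f);
	return found;
}

/* Is the host /dev a devtmpfs mount?  Same return values as above. */
int host_dev_is_devtmpfs(void)
{
	return mount_has_type("/proc/self/mounts", "/dev", "devtmpfs");
}
```

If this returns 0 we would fall back to whatever we decide on for the
static /dev case (the ${lxc_path}/.devtmpfs/ idea or plain tmpfs).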

Currently, I have this running with autodev = 1 for all my containers,
no matter if they are systemd based or upstart based or sysv init based.
All of them are working cleanly and this seems to have solved a few past
problems with bad modes when autodev was set to one for non-systemd
containers (early Fedora and CentOS).

This is a git diff against 1.0.0.alpha2.  After the jump...

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!

-- 
diff --git a/src/lxc/conf.c b/src/lxc/conf.c
index 208c08b..21baf20 100644
--- a/src/lxc/conf.c
+++ b/src/lxc/conf.c
@@ -29,6 +29,7 @@
 #include <string.h>
 #include <dirent.h>
 #include <unistd.h>
+#include <inttypes.h>
 #include <sys/wait.h>
 #include <sys/syscall.h>
 #include <time.h>
@@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
 	return 0;
 }
 
+
+/*
+ * Note: This is a verbatim copy of what is in monitor.c.  We're just
+ * using it here to generate a safe subdirectory in /dev/ for the
+ * container's /dev/
+ */
+
+/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
+ * FNV has good anti collision properties and we're not worried
+ * about pre-image resistance or one-way-ness, we're just trying to make
+ * the name unique in the 108 bytes of space we have.
+ */
+#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
+static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
+{
+	unsigned char *bp;
+
+	for(bp = buf; bp < (unsigned char *)buf + len; bp++)
+	{
+		/* xor the bottom with the current octet */
+		hval ^= (uint64_t)*bp;
+
+		/* gcc optimised:
+		 * multiply by the 64 bit FNV magic prime mod 2^64
+		 */
+		hval += (hval << 1) + (hval << 4) + (hval << 5) +
+			(hval << 7) + (hval << 8) + (hval << 40);
+	}
+
+	return hval;
+}
+
+/*
+ * Locate a devtmpfs mount (should be on /dev) and create a container
+ * subdirectory on it which we can then bind mount to the container
+ * /dev instead of mounting a tmpfs there.
+ * If we fail, return NULL.
+ * Else return the pointer to the name buffer with the string to
+ * the devtmpfs subdirectory.
+ */
+
+char *mk_devtmpfs(const char *name, char *path)
+{
+	int ret;
+	struct stat s;
+	char tmp_path[MAXPATHLEN];
+	char sym_path[MAXPATHLEN];
+	char *base_path = "/dev/.lxc"; 
+	uint64_t hash;
+
+	/*
+	 * We're starting off with a fixed path /dev and assuming this is
+	 * a devtmpfs mount.  These are poor assumptions.  We need to add
+	 * more robust code to find it and verify it then use it.
+	 */
+
+	if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
+		/* This is just making /dev/.lxc; it had better work or we're done */
+		ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
+		if ( ret ) {
+			return NULL;
+		}
+	}
+
+	/* Now check and/or create our container subname */
+	/* Symlink path... */
+	ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
+	if (ret < 0 || ret >= MAXPATHLEN)
+		return NULL;
+
+	/* Actual directory path */
+	hash = fnv_64a_buf(path, ret, FNV1A_64_INIT);
+	ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
+	/* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
+	if (ret < 0 || ret >= MAXPATHLEN)
+		return NULL;
+
+	if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
+		ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
+		if ( ret ) {
+			return NULL;
+		}
+	}
+
+	/* This sets up a symlink from the container name (cgroup name) in
+	 * /dev/.lxc over to the actual hash based device directory.  This
+	 * is strictly for a convenient cross reference and may be used
+	 * by udev rules to move devices in and out of the container spaces.
+	 */
+
+	unlink( sym_path );
+	ret = symlink(tmp_path, sym_path);
+
+	if ( ret < 0 ) {
+		SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
+	}
+
+	strcpy( path, tmp_path );
+	return path;
+}
+
+
 /*
  * Do we want to add options for max size of /dev and a file to
  * specify which devices to create?
  */
-static int mount_autodev(char *root)
+static int mount_autodev(const char *name, char *root)
 {
 	int ret;
+	struct stat s;
 	char path[MAXPATHLEN];
+	char devtmpfs_path[MAXPATHLEN];
 
 	INFO("Mounting /dev under %s\n", root);
 	ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
 	if (ret < 0 || ret > MAXPATHLEN)
 		return -1;
-	ret = mount("none", path, "tmpfs", 0, "size=100000");
+	if (mk_devtmpfs( name, devtmpfs_path ) ) {
+		ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
+	} else {
+		ret = mount("none", path, "tmpfs", 0, "size=100000");
+	}
 	if (ret) {
 		SYSERROR("Failed to mount /dev at %s\n", root);
 		return -1;
@@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
 	ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
 	if (ret < 0 || ret >= MAXPATHLEN)
 		return -1;
-	ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
-	if (ret) {
-		SYSERROR("Failed to create /dev/pts in container");
-		return -1;
+	/*
+	 * If we are running on a devtmpfs mapping, dev/pts may already exist.
+	 * If not, then create it and exit if that fails...
+	 */
+	if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
+		ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
+		if (ret) {
+			SYSERROR("Failed to create /dev/pts in container");
+			return -1;
+		}
 	}
 
 	INFO("Mounted /dev under %s\n", root);
@@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
 
 	new->loglevel = LXC_LOG_PRIORITY_NOTSET;
 	new->personality = -1;
+	new->autodev = -1;
 	new->console.log_path = NULL;
 	new->console.log_fd = -1;
 	new->console.path = NULL;
@@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
 	return 0;
 }
 
+/*
+ * This routine is called when the configuration does not already specify a value
+ * for autodev (mounting a file system on /dev and populating it in a container).
+ * If a hard override value has not been specified, then we try to apply some
+ * heuristics to determine if we should switch to autodev mode.
+ *
+ * For instance, if the container has an /etc/systemd/system directory then it
+ * is probably running systemd as the init process and it needs the autodev
+ * mount to prevent it from mounting devtmpfs on /dev on its own causing conflicts
+ * in the host.
+ *
+ * We may also want to enable autodev if the host has devtmpfs mounted on its
+ * /dev as this then enables us to use subdirectories under /dev for the container
+ * /dev directories and we can fake udev devices.
+ */
+int check_autodev( const char *rootfs )
+{
+	char absrootfs[MAXPATHLEN];
+	int ret;
+	struct stat s;
+	char path[MAXPATHLEN];
+
+	INFO("Testing for systemd in %s\n", rootfs);
+
+	if (rootfs == NULL || strlen(rootfs) == 0)
+		return -2;
+
+	if (!realpath(rootfs, absrootfs))
+		return -2;
+
+	/* Note here: we could instead check for /etc/systemd/system.conf  */
+	ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
+	if (ret < 0 || ret >= MAXPATHLEN)
+		return -2;
+
+	if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
+		return 1;
+
+
+	/* Add future checks here.
+	 *	Return positive if we should go autodev
+	 *	Return 0 if we should NOT go autodev
+	 *	Return negative if we encounter an error or can not determine...
+	 */
+
+	/* All else fails, disable autodev */
+	return 0;
+}
+
 int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
 {
 	if (setup_utsname(lxc_conf->utsname)) {
@@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
 		return -1;
 	}
 
-	if (lxc_conf->autodev) {
-		if (mount_autodev(lxc_conf->rootfs.mount)) {
+	if (lxc_conf->autodev < 0) {
+		lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
+	}
+
+	if (lxc_conf->autodev > 0) {
+		if (mount_autodev(name, lxc_conf->rootfs.mount)) {
 			ERROR("failed to mount /dev in the container");
 			return -1;
 		}
@@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
 		return -1;
 	}
 
-	if (lxc_conf->autodev) {
+	if (lxc_conf->autodev > 0) {
 		if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
 			ERROR("failed to run autodev hooks for container '%s'.", name);
 			return -1;
