[lxc-devel] [Not A Patch] [POC] Proof of concept code for using devtmpfs for autodev and more...
Michael H. Warfield
mhw at WittsEnd.com
Fri Nov 1 16:49:44 UTC 2013
On Thu, 2013-10-31 at 13:00 -0500, Serge Hallyn wrote:
> Quoting Michael H. Warfield (mhw at WittsEnd.com):
> > I did incorporate your suggestion of using the hash of the rootfs path
> > as the subdirectory under the hosts /dev/ for the container. I also
>
> (Printed this out to look it over, just putting all my comments together
> here) :
>
> 1. I think if /dev is not devtmpfs, we should just bail on this.
>
> 2. You say in comments that you're using the cgroup name, but it seems
> you're actually just using the container name?
Ok... I'm going to experiment with this a bit but check me on this...
In the routine "lxc_setup()" the first parameter is "name". Is that the
cgroup name or just the container name? I take it, from your remark,
this is just the container name and the unique cgroup name may be
something different. Is that something I should be pulling out of the
cgroup info structure?
> 3. The cgroup name used to be unique, but now each mounted cgroupfs
> can actually have a different name for the same container (if some
> of them didn't get cleaned out well).
Yeah, I'm trying to get my head wrapped around that code now and figure
out the name that I want to use there.
> I'm just thinking out loud here, so this may not be better, but how
> about
>
> 1. create /dev/.lxc as you're doing
>
> 2. (if container is going to use this) create /dev/.lxc/$nonce.
> We can use hash("$lxcpath/$lxcname"), or just mkstemp(), or
> just an increasing integer.
>
> 3. Create $lxcpath/$lxcname/.dev (if the container needs it) and
> shared-bind-mount /dev/.lxc/$nonce onto it. Now we can tell
> which /dev/.lxc/* is mounted by looking at the mount table.
Ok... Looking at this now... I seen in the $lxcpath/$lxcname where we
are creating rootfsproc and rootfssys. Why not continue that hallowed
tradition and add a rootfsdev as "$lxcpath/$lxcname/rootfsdev" and bind
mount them all together?
> 4. slave-bind-bind mount $lxcpath/$lxcname/.dev into the starting
> container's /dev.
> Not sure whether we should have lxc.autodev = 2 mean use this scheme,
> but I'd be fine with basically always doing this so long as /dev/ is
> devtmpfs and lxc.autodev is set for the container. (So making
> the container's /dev a tmpfs would just be a fallback).
I like this idea. Actually doesn't involve moving a lot of code,
either.
> Thoughts?
> -serge
Regards,
Mike
> > added additional symlinks for the container name to the hash
> > subdirectory. I'm not totally sure I'm happy with this, since the hash
> > changes if the rootfs is moved but the whole thing is thrown away if the
> > machine is rebooted anyways (shrug). So it's a minor niggle. Maybe a
> > persistent uuid for the container might be an idea worth considering.
> > Maybe it's too minor to worry too much about.
> >
> > I thought about adding an additional directory under them for
> > "containers" and "by-hashes" and maybe "by-uuid" but that can all be
> > addressed in the future. This is just down-and-dirty / prove it can
> > work, kind of code...
> >
> > I did "steal" some code from monitor.c over into conf.c for the hash
> > routine. Maybe that needs to be redone in a generalize function in case
> > we need to change it and, then, everything tags along. A thought there.
> >
> > --
> >
> > This code will create a "/dev/.lxc" subdirectory on the host and
> > populated it with "/dev/.lxc/hash({rootfs})" directories with
> > "/dev/.lxc/{container}" symlinks to them. Those directories will then
> > be bind mounted to the container /dev/ directories and populated when
> > autodev = 1.
> >
> > They are NOT cleaned up upon container shutdown or reboot of the
> > container which can allow for some level of reboot persistence (as long
> > as it's not the host reboot). The symlinks are unlinked and relinked
> > allowing for multiple containers having the same names (and extensions
> > in the cgroup names) even though I personally feel that's an abonminal
> > (albeit rare) practice. It should all work.
> >
> > It does not currently incorporate any udev rules or hooks but is
> > eminately amenable to them. I'm working on that. That's the ultimate
> > goal, to facilitate more utilization of dynamic devices with symlinks
> > and device renumbering.
> >
> > Right now, it simply replaces the use of tmpfs bind mounted to /dev in
> > the container with a subdirectory of devtmpfs. With this, we can
> > address a few other problems wrt dynamic devices and udev rules,
> > including shared devices. It includes some logic for automagically
> > switching on autodev which, I admit, is suboptimal but I would like to
> > see this mature where autodev=1 is the default and can be overridden in
> > the unusual case where it has to be disabled.
> >
> > It also does not, currently, CHECK if devtmpfs is mounted on /dev in the
> > host. This is, potentially, a serious corner case if we have a host
> > with a static /dev on root and hosting a systemd based container. That
> > is going to break. Bad juju. How common? I don't know. I feel that
> > check probably needs to go in but I felt like I wanted this code to get
> > out for discussion first. What do we do it is not the case? I
> > suggested a ${lxc_path}/.devtmpfs/ directory and mount an instantiation
> > ourselves. That code is also not there and the idea is open for
> > discussion.
> >
> > Currently, I have this running with autodev = 1 for all my containers,
> > no matter if they are systemd based or upstart based or sysv init based.
> > All of them are working cleanly and this seems to have solved a few past
> > problems with bad modes when autdev was set to one for non-systemd
> > containers (early Fedora and CentOS).
> >
> > This is a git diff against 1.0.0.alpha2. After the jump...
> >
> > Regards,
> > Mike
> > --
> > Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
> > /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
> > NIC whois: MHW9 | An optimist believes we live in the best of all
> > PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
> >
> > --
> > diff --git a/src/lxc/conf.c b/src/lxc/conf.c
> > index 208c08b..21baf20 100644
> > --- a/src/lxc/conf.c
> > +++ b/src/lxc/conf.c
> > @@ -29,6 +29,7 @@
> > #include <string.h>
> > #include <dirent.h>
> > #include <unistd.h>
> > +#include <inttypes.h>
> > #include <sys/wait.h>
> > #include <sys/syscall.h>
> > #include <time.h>
> > @@ -1164,20 +1165,128 @@ static int setup_rootfs_pivot_root(const char *rootfs, const char *pivotdir)
> > return 0;
> > }
> >
> > +
> > +/*
> > + * Note: This is a verbatum copy of what is in monitor.c. We're just
> > + * usint it here to generate a safe subdirectory in /dev/ for the
> > + * containers /dev/
> > + */
> > +
> > +/* Note we don't use SHA-1 here as we don't want to depend on HAVE_GNUTLS.
> > + * FNV has good anti collision properties and we're not worried
> > + * about pre-image resistance or one-way-ness, we're just trying to make
> > + * the name unique in the 108 bytes of space we have.
> > + */
> > +#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)
> > +static uint64_t fnv_64a_buf(void *buf, size_t len, uint64_t hval)
> > +{
> > + unsigned char *bp;
> > +
> > + for(bp = buf; bp < (unsigned char *)buf + len; bp++)
> > + {
> > + /* xor the bottom with the current octet */
> > + hval ^= (uint64_t)*bp;
> > +
> > + /* gcc optimised:
> > + * multiply by the 64 bit FNV magic prime mod 2^64
> > + */
> > + hval += (hval << 1) + (hval << 4) + (hval << 5) +
> > + (hval << 7) + (hval << 8) + (hval << 40);
> > + }
> > +
> > + return hval;
> > +}
> > +
> > +/*
> > + * Locate a devtmpfs mount (should be on /dev) and create a container
> > + * subdirectory on it which we can then bind mount to the container
> > + * /dev instead of mounting a tmpfs there.
> > + * If we fail, return NULL.
> > + * Else return the pointer to the name buffer with the string to
> > + * the devtmpfs subdirectory.
> > + */
> > +
> > +char *mk_devtmpfs(const char *name, char *path)
> > +{
> > + int ret;
> > + struct stat s;
> > + char tmp_path[MAXPATHLEN];
> > + char sym_path[MAXPATHLEN];
> > + char *base_path = "/dev/.lxc";
> > + uint64_t hash;
> > +
> > + /*
> > + * We're starting off with a fixed path /dev and assuming this is
> > + * a devtmpfs mount. These are poor assumptions. We need to add
> > + * more robust code to find it and verify it then use it.
> > + */
> > +
> > + if ( 0 != access(base_path, F_OK) || 0 != stat(base_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > + /* This is just making /dev/.lxc it better work or we're done */
> > + ret = mkdir(base_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > + if ( ret ) {
> > + return NULL;
> > + }
> > + }
> > +
> > + /* Now check and/or create our container subname */
> > + /* Symlink path... */
> > + ret = snprintf(sym_path, MAXPATHLEN, "%s/%s", base_path, name);
> > + if (ret < 0 || ret >= MAXPATHLEN)
> > + return NULL;
> > +
> > + /* Actual directory path */
> > + hash = fnv_64a_buf(path, ret, FNV1A_64_INIT);
> > + ret = snprintf(tmp_path, MAXPATHLEN, "%s/%016" PRIx64, base_path, hash);
> > + /* ret = snprintf(tmp_path, MAXPATHLEN, "%s/%s", base_path, name); */
> > + if (ret < 0 || ret >= MAXPATHLEN)
> > + return NULL;
> > +
> > + if ( 0 != access(tmp_path, F_OK) || 0 != stat(tmp_path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > + ret = mkdir(tmp_path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > + if ( ret ) {
> > + return NULL;
> > + }
> > + }
> > +
> > + /* This sets up a symlink from the container name (cgroup name) in
> > + * /dev/.lxc over to the actual hash based device directory. This
> > + * is strictly for a convenient cross reference and may be used
> > + * by udev rules to move devices in and out of the container spaces.
> > + */
> > +
> > + unlink( sym_path );
> > + ret = symlink(tmp_path, sym_path);
> > +
> > + if ( ret < 0 ) {
> > + SYSERROR("WARNING: Failed to create symlink '%s'->'%s'\n", sym_path, tmp_path);
> > + }
> > +
> > + strcpy( path, tmp_path );
> > + return path;
> > +}
> > +
> > +
> > /*
> > * Do we want to add options for max size of /dev and a file to
> > * specify which devices to create?
> > */
> > -static int mount_autodev(char *root)
> > +static int mount_autodev(const char *name, char *root)
> > {
> > int ret;
> > + struct stat s;
> > char path[MAXPATHLEN];
> > + char devtmpfs_path[MAXPATHLEN];
> >
> > INFO("Mounting /dev under %s\n", root);
> > ret = snprintf(path, MAXPATHLEN, "%s/dev", root);
> > if (ret < 0 || ret > MAXPATHLEN)
> > return -1;
> > - ret = mount("none", path, "tmpfs", 0, "size=100000");
> > + if (mk_devtmpfs( name, devtmpfs_path ) ) {
> > + ret = mount(devtmpfs_path , path, NULL, MS_BIND, 0 );
> > + } else {
> > + ret = mount("none", path, "tmpfs", 0, "size=100000");
> > + }
> > if (ret) {
> > SYSERROR("Failed to mount /dev at %s\n", root);
> > return -1;
> > @@ -1185,10 +1294,16 @@ static int mount_autodev(char *root)
> > ret = snprintf(path, MAXPATHLEN, "%s/dev/pts", root);
> > if (ret < 0 || ret >= MAXPATHLEN)
> > return -1;
> > - ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > - if (ret) {
> > - SYSERROR("Failed to create /dev/pts in container");
> > - return -1;
> > + /*
> > + * If we are running on a devtmpfs mapping, dev/pts may already exist.
> > + * If not, then create it and exit if that fails...
> > + */
> > + if ( 0 != access(path, F_OK) || 0 != stat(path, &s) || 0 == S_ISDIR(s.st_mode) ) {
> > + ret = mkdir(path, S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH);
> > + if (ret) {
> > + SYSERROR("Failed to create /dev/pts in container");
> > + return -1;
> > + }
> > }
> >
> > INFO("Mounted /dev under %s\n", root);
> > @@ -2370,6 +2485,7 @@ struct lxc_conf *lxc_conf_init(void)
> >
> > new->loglevel = LXC_LOG_PRIORITY_NOTSET;
> > new->personality = -1;
> > + new->autodev = -1;
> > new->console.log_path = NULL;
> > new->console.log_fd = -1;
> > new->console.path = NULL;
> > @@ -3036,6 +3152,55 @@ int uid_shift_ttys(int pid, struct lxc_conf *conf)
> > return 0;
> > }
> >
> > +/*
> > + * This routine is called when the configuration does not already specify a value
> > + * for autodev (mounting a file system on /dev and populating it in a container).
> > + * If a hard override value has not be specified, then we try to apply some
> > + * heuristics to determine if we should switch to autodev mode.
> > + *
> > + * For instance, if the container has an /etc/systemd/system directory then it
> > + * is probably running systemd as the init process and it needs the autodev
> > + * mount to prevent it from mounting devtmpfs on /dev on it's own causing conflicts
> > + * in the host.
> > + *
> > + * We may also want to enable autodev if the host has devtmpfs mounted on its
> > + * /dev as this then enable us to use subdirectories under /dev for the container
> > + * /dev directories and we can fake udev devices.
> > + */
> > +int check_autodev( const char *rootfs )
> > +{
> > + char absrootfs[MAXPATHLEN];
> > + int ret;
> > + struct stat s;
> > + char path[MAXPATHLEN];
> > +
> > + INFO("Testing for systemd in %s\n", rootfs);
> > +
> > + if (rootfs == NULL || strlen(rootfs) == 0)
> > + return -2;
> > +
> > + if (!realpath(rootfs, absrootfs))
> > + return -2;
> > +
> > + /* Note here: we could instead check for /etc/systemd/system.conf */
> > + ret = snprintf(path, MAXPATHLEN, "%s/etc/systemd/system", absrootfs);
> > + if (ret < 0 || ret > MAXPATHLEN)
> > + return -2;
> > +
> > + if ( 0 == access(path, F_OK) && 0 == stat(path, &s) && S_ISDIR(s.st_mode) )
> > + return 1;
> > +
> > +
> > + /* Add future checks here.
> > + * Return positive if we should go autodev
> > + * Return 0 if we should NOT go autodev
> > + * Return negative if we encounter an error or can not determine...
> > + */
> > +
> > + /* All else fails, disable autodev */
> > + return 0;
> > +}
> > +
> > int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath, struct cgroup_process_info *cgroup_info)
> > {
> > if (setup_utsname(lxc_conf->utsname)) {
> > @@ -3058,8 +3223,12 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> > return -1;
> > }
> >
> > - if (lxc_conf->autodev) {
> > - if (mount_autodev(lxc_conf->rootfs.mount)) {
> > + if (lxc_conf->autodev < 0) {
> > + lxc_conf->autodev = check_autodev(lxc_conf->rootfs.mount);
> > + }
> > +
> > + if (lxc_conf->autodev > 0) {
> > + if (mount_autodev(name, lxc_conf->rootfs.mount)) {
> > ERROR("failed to mount /dev in the container");
> > return -1;
> > }
> > @@ -3097,7 +3266,7 @@ int lxc_setup(const char *name, struct lxc_conf *lxc_conf, const char *lxcpath,
> > return -1;
> > }
> >
> > - if (lxc_conf->autodev) {
> > + if (lxc_conf->autodev > 0) {
> > if (run_lxc_hooks(name, "autodev", lxc_conf, lxcpath, NULL)) {
> > ERROR("failed to run autodev hooks for container '%s'.", name);
> > return -1;
> >
>
>
>
> > ------------------------------------------------------------------------------
> > October Webinars: Code for Performance
> > Free Intel webinars can help you accelerate application performance.
> > Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
> > the latest Intel processors and coprocessors. See abstracts and register >
> > http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk
>
> > _______________________________________________
> > Lxc-devel mailing list
> > Lxc-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/lxc-devel
>
>
--
Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131101/ea7906ea/attachment.pgp>
More information about the lxc-devel
mailing list