[Lxc-users] Containing a user-space application

Tue Jun 11 20:03:13 UTC 2013

Quoting Barry Jaspan (barry.jaspan at acquia.com):
> I am just getting started with LXC. I'm using Ubuntu 12.04 (Precise). After
> a week of reading and experimenting, I have the beginnings of a working
> prototype and a handful of questions. :)
> 
> First, my use case. I'm running a number of customer-provided applications
> on a shared host. My only goal for LXC (at least for today) is to provide a
> read-only root filesystem for each application while the host itself has a
> read-write root filesystem (I have other security mechanisms in place, too,
> that are out of scope for this list). I do not want a fully contained
> virtual OS with its own init; I just want to contain the app. Like the
> recent thread "Sharing container rootfs", I do *not* want a separate copy
> of the rootfs for each application. I need to share the host's rootfs for
> every app, and I'll explicitly bind-mount in directories that need to be
> unique or read-write for each app (e.g. /tmp).
> 
> On my prototype host, I have bind-mounted the host's root fs to
> /var/lib/lxc/cloud/rootfs with
> 
> mount --bind / /var/lib/lxc/cloud/rootfs
> mount -o remount,ro /var/lib/lxc/cloud/rootfs
> 
> My prototype LXC config file is quite simple:
> 
> lxc.utsname = container
> lxc.rootfs = /var/lib/lxc/cloud/rootfs
> # For each directory in the app container that needs to be read-write or
> # unique:
> lxc.mount.entry=/FOO /var/lib/lxc/cloud/rootfs/FOO none rw,bind 0 0
> 
> I start the application with lxc-execute -n uniquename -f
> /var/lib/lxc/cloud/config -- [cmd]. This basically seems to work. The root
> fs within the container is read-only, and the directories I am explicitly
> bind over-mounting are read-write. Yay. I am also getting a pid namespace,
> I guess by default, which is fine.
> 
> Here are some questions:
> 
> * I'm currently sharing the host's /dev, which I know is a mistake. I can
> make my own inside rootfs and bind-mount it in. Any suggestions for a

You can either set lxc.autodev = 1 to get a tmpfs /dev autopopulated
with the minimum you need, or look at the code in conf.c that fills
it in to see the list of devices.
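
As a rough sketch of the first option, the config quoted above could be
regenerated with autodev switched on. The gen_config helper is
hypothetical (it is not part of LXC), and the rootfs path and /tmp are
simply the values from the quoted message:

```shell
#!/bin/sh
# gen_config ROOTFS [DIR...]
# Hypothetical helper: prints an LXC config like the one quoted above,
# with lxc.autodev enabled and one rw bind-mount entry per DIR.
gen_config() {
    rootfs="$1"
    shift
    echo "lxc.utsname = container"
    echo "lxc.rootfs = $rootfs"
    # Ask LXC to populate a minimal /dev on a tmpfs inside the container
    echo "lxc.autodev = 1"
    for dir in "$@"; do
        echo "lxc.mount.entry=$dir $rootfs$dir none rw,bind 0 0"
    done
}

# Example: shared read-only rootfs with a writable /tmp
gen_config /var/lib/lxc/cloud/rootfs /tmp
```

Redirect the output to the config file passed to lxc-execute -f. Whether
lxc.autodev is honoured depends on the LXC version installed.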

> minimal device set to include? My goal is to prevent the application from
> escaping the container. I guess I can start with what the busybox template

You'll be needing to at least drop some capabilities then right?  (Oh,
just saw the end of the email, I see)

> creates and add whatever we discover our customer apps need.
> 
> * lxc-attach does not work. This isn't that critical as I don't think I
> need it, I just want to make sure I'm not doing something wrong.
> 
> host# lxc-attach -n test ps
> lxc-attach: No such file or directory - failed to open '/proc/16808/ns/pid'
> lxc-attach: failed to enter the namespace
> host# ls /proc/16808/ns
> ipc  net  uts
> 
> Why does /proc/PID/ns/pid not exist? Also note that I did not create an ipc

Your kernel is not new enough.
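
For context: as far as I know, the ipc, net and uts entries under
/proc/PID/ns have existed since Linux 3.0, while pid and mnt only
appeared in 3.8, which matches the listing above. A small sketch for
checking what a given kernel exposes (ns_missing is a made-up helper
name, and the list of namespaces checked is an assumption about what
lxc-attach wants to enter):

```shell
#!/bin/sh
# ns_missing [PID]
# Prints the namespace handles that are absent from /proc/PID/ns.
# Defaults to the current shell's pid.
ns_missing() {
    pid="${1:-$$}"
    for ns in ipc net uts pid mnt; do
        [ -e "/proc/$pid/ns/$ns" ] || echo "$ns"
    done
}

# Empty output means every handle in the list above is present.
ns_missing
```

If this prints pid or mnt, lxc-attach will fail the same way on that
kernel.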

> or net namespace (at least on purpose) but I do have those files.
> 
> * When I run "bash" as the command under lxc-execute, it is not connecting
> to the tty correctly (this isn't necessarily a problem, since my customer
> apps are not bash and are not interactive, but it makes me think something
> is wrong that should be fixed). The first character typed results in

That's because lxc is set up redirecting the console to /dev/console in
the container, expecting getty to hook up your task's stdin/out/err to
that.

When I've asked about this before, Daniel has said lxc-execute is meant
to be used with containers which do not have their own rootfs.  So if
you were to not set an lxc.rootfs, then this doesn't happen (pretty sure
I tested that just yesterday as a side effect of another test).

There may be an elegant 'fix' for this, by detecting that lxc-init is
being used and then assuming getty won't run.

> * Within my container, /etc/mtab is the host's mtab, so "mount" displays
> the host's mounts, while "cat /proc/mounts" displays the container's
> mounts (namespaced). I tried bind-mounting an empty, read-write file on top
> of /etc/mtab, but then it just stayed empty. What is supposed to update
> /etc/mtab? (For that matter, why isn't it just a symlink to /proc/mounts?)

Why don't you make it so?  (You can even bind-mount it using a mount
entry in the container's configuration)

> PS: For the curious, the other security mechanism I'm experimenting with is
> setting and locking all "securebits" and setting the customer app's
> capability bounding set to be empty; see capabilities(7) for details (this
> turns out to be somewhat tricky to achieve, but is possible). Once done,
> even if the customer achieves root, it will not have any of root's
> superuser capabilities. However, it *will* have full access to files owned
> by uid 0; that's why I want the root filesystem to be read-only.

Ah, I see.  Note you still might want to use seccomp to protect against
syscall p0wnage.
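
To confirm from inside the container that the bounding set described in
the postscript really is empty, one option is to read the CapBnd field
from /proc/PID/status. This is only a sketch; capbnd_empty is a
hypothetical helper name, and it checks nothing about securebits or
seccomp:

```shell
#!/bin/sh
# capbnd_empty [STATUS_FILE]
# Succeeds if the CapBnd line in STATUS_FILE is present and all zeros.
# Defaults to /proc/self/status, i.e. the process doing the reading,
# which inherits its bounding set from the calling shell.
capbnd_empty() {
    status="${1:-/proc/self/status}"
    bnd="$(awk '/^CapBnd:/ {print $2}' "$status")"
    # Must be non-empty and consist only of '0' characters
    [ -n "$bnd" ] && [ -z "$(printf '%s' "$bnd" | tr -d '0')" ]
}

if capbnd_empty; then
    echo "capability bounding set is empty"
else
    echo "capability bounding set is NOT empty (or could not be read)"
fi
```

Run it as the first thing in the contained command to catch a setup
that silently left capabilities in place.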

-serge