[lxc-users] Recent LXC / LXD and shared file system infrastructures
Guido Jäkel
G.Jaekel at DNB.DE
Thu Nov 19 20:36:40 UTC 2015
On 19.11.2015 03:25, Serge Hallyn wrote:
> Quoting Jäkel, Guido (G.Jaekel at dnb.de):
>> Dear experts,
>>
>> I wonder if the current versions of LXD (and LXC) are aware of a shared file infrastructure like NFS. I have been using LXC 0.8 for a couple of years on a setup based on a bunch of diskless blade servers (Cisco UCS) and a central NFS filer (NetApp). All the root filesystems (of the containers and the hosts) are formed by individual directory trees on the NFS. All setup of resources, network and filesystem (root and shared data) is externalized from the containers; all the configuration information lives on a shared resource, too. Because of this, I'm able to start each container (one at a time) on any host. Now it's time to upgrade (or rather rebuild) all of it to get the promises of recent cgroup handling, lxcfs, uid/gid shifting and so on.
>>
>> I want to ask whether this design is covered by the infrastructure assumptions of current LXD. In particular, if I "transfer" (copy or move) a container to another host, is it possible to configure the LXD daemon environment in such a way that the container root filesystem will not be copied from host A to host B, because it is "already there"? Is it safe to share caching directories for images?
>
> I'm not quite sure how your site is put together; I'd assume you have one large NFS server and the nodes simply mount it locally? (Nothing fancier going on?)
Dear Serge,
thank you for the discussion.
Nothing fancier at all. There are diskless blade servers which boot the kernel via PXE and mount their root filesystem via NFS. Yes, this is a subtree on a "large" (just a few TB) NFS volume, and there are also subtrees for the rootfs of each container and for shared data (business data organized in stages, common areas for program installations or shared configurations).
The interface between Linux and the file storage is at the filesystem layer (NFS), not at the block-device layer with a filesystem driver on the operating-system side. This is because all responsibility for actual sizing, mirroring/backup/restore and deduplication lies on the side of the filer.
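Just to make the picture concrete: in this setup a container is described by nothing more than a plain LXC config file that points the rootfs at a directory on the NFS mount. A minimal sketch, with purely hypothetical paths and names (not our real layout), would look something like this:

    # /nfs/lxc/containers/app01/config  (hypothetical path on the shared NFS volume)
    lxc.utsname = app01
    lxc.rootfs = /nfs/lxc/containers/app01/rootfs
    lxc.network.type = veth
    lxc.network.link = br0
    # bind a shared data subtree from the filer into the container
    lxc.mount.entry = /nfs/shared/data opt/data none bind 0 0

Since both the config and the rootfs live on the shared volume, any host that mounts the NFS can start this container from the same files.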
> Lxd stores its configuration information in a sqlite3 db, so the basic premise of your current setup doesn't work.
But every host (and every container) has its own filesystem and *may* get its own "local" sqlite db for the local lxd, if needed.
By the way, I would prefer any kind of textual format (even an "XML hell") for things like configuration information over any proprietary representation, because the latter violates KISS while yielding only marginal benefits for the user of an application. Of course it's "much simpler" for the developers of that application ;) But a user is forced to convert it to and from some textual representation whenever he wants to deal with it.
> In lxd you'd want to share the 'images' with all the nodes, and let each local lxd start new local containers based on the shared images. You could then, if you wanted, 'publish' the complete containers as new images in the central repository.
>
> Creating a container from an image can be done the following ways:
>
> 1. plain directory copy
This should be usable, because it's the same mechanism I currently use to manage things (cloning from another or from an unused (template) directory); a rough sketch follows below. But creating new containers is a very rare use case for me, even though there are a couple of containers that are "identical" at first start. The quickly changing things (on a timescale of weeks) we're dealing with are Java applications, and that change happens on a layer on top of this infrastructure.
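For illustration, the "plain directory copy" in my environment boils down to something like the following (hypothetical paths; on the NetApp one may also let the filer clone the tree instead of using cp):

    # clone a template tree into a new container directory on the NFS volume
    cp -a /nfs/lxc/templates/base /nfs/lxc/containers/app02
    # adjust hostname/network in the copied config, then start the container
    lxc-start -n app02 -f /nfs/lxc/containers/app02/config -d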
> The more interesting question is what we can do to speed up the transfers for you. One interesting thing might be to use LVs on PVs which sit on RBD devices shared with all hosts. I've never done this, would be interested to hear how it performs.
I think I don't want any transfer at all. If I start to play with it, I'll see whether there is a way to customize the transfer into a null operation, or whether I can simply abstain from some commands and use a stop-here-start-there instead of a move action.
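In other words, what I'd like a "move" to degenerate into is roughly this (hypothetical names; nothing is copied, because the config and the rootfs are already visible on both hosts):

    # on host A
    lxc-stop -n app01
    # on host B, pointing at the very same config and rootfs on the NFS share
    lxc-start -n app01 -f /nfs/lxc/containers/app01/config -d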
But a block-device layer is not an option for us: if I use a virtual block device from the filer (as our Windows guys do with VMware ESX), inside the filer it is just one big image file on the filer's internal filesystem. But our backup is based on the filesystem layer, too, which means it would back up the whole big image file every time. And even if this were a smart incremental backup, one would have to restore a whole image and mount it to get at its contents.