[lxc-devel] versioning the container monitor api

Tue Aug 27 20:04:18 UTC 2013

Hi Serge,

> 	I start a container running a crucial mail server.  I upgrade
> 	lxc.  The new lxc has changed the format of messages for the
> 	commands api.  Now I do 'lxc-list', which queries the running
> 	monitor to check its init pid with LXC_CMD_GET_INIT_PID.  The
> 	container monitor crashes on bad input.

Yes, that's a problem I have run into frequently as well.

> The lxc_af_unix_connect function could start with a handshake with a
> version number, or we could tack a version # onto the lxc_cmd_req
> struct.  Best would be if we agreed the client always sends its version
> to the monitor, then vice versa, and then both sides decide whether
> they can proceed (so both sides can log error).  We could just use
> a monotonically increasing int, hand-inserted.  However that's subject
> to error - if we make a change without remembering to update the version
> number, then we could still get a crash.  We could automate this perhaps
> by having a Makefile do some sort of check, i.e. hashing all the structs
> which may be communicated over the socket.

I think the real solution is far easier: the command interface used to
change a lot because it was much more limited than it is now, but its
basic structure now seems to be rather complete. Each request is just a
tuple (cmd, datalen, data_ptr (mostly ignored)), possibly followed on
the line by additional data of length datalen. Each response is (ret,
datalen, data_ptr (mostly ignored)), possibly followed by data of length
datalen. I don't see how even quite complicated stuff couldn't in
principle fit in there. The only question is what the semantics of
cmd/ret, datalen, data_ptr and the data itself are.
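
To make the framing concrete, here is a minimal sketch of how one
request would be laid out: a fixed-size header followed by datalen
payload bytes. The struct and function names (cmd_req_hdr,
frame_request) are made up for illustration and are not the definitions
in commands.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header mirroring the (cmd, datalen, data_ptr) tuple. */
struct cmd_req_hdr {
    int cmd;           /* command number */
    int datalen;       /* number of payload bytes following the header */
    const void *data;  /* only meaningful locally; the receiver ignores it */
};

/* Lay out one request as it would appear on the socket: header first,
 * then the payload. Returns the total number of bytes written, or 0 if
 * datalen is negative or the buffer is too small. */
static size_t frame_request(int cmd, const void *payload, int datalen,
                            unsigned char *buf, size_t buflen)
{
    struct cmd_req_hdr hdr;
    size_t total;

    assert(buf != NULL);
    if (datalen < 0)
        return 0;
    total = sizeof(hdr) + (size_t)datalen;
    if (total > buflen)
        return 0;

    memset(&hdr, 0, sizeof(hdr));  /* also zeroes padding and data_ptr */
    hdr.cmd = cmd;
    hdr.datalen = datalen;
    memcpy(buf, &hdr, sizeof(hdr));
    if (datalen > 0)
        memcpy(buf + sizeof(hdr), payload, (size_t)datalen);
    return total;
}
```

A response would be framed the same way, with ret in place of cmd.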

So we should just declare that for the current commands, the semantics
are completely fixed. Meaning that LXC_CMD_CONSOLE will always have the
same on-the-wire semantics as it currently does.

But let's suppose at some point in the future, LXC_CMD_CONSOLE is
supposed to change semantics completely. Then we change the enum to:

typedef enum {
  LXC_CMD_DEPRECATED1,  // <- LXC_CMD_CONSOLE was here
  ...,
  LXC_CMD_CONSOLE,      // <- newly added, gets a new number
  LXC_CMD_MAX,
} lxc_cmd_t;

Then we can change the semantics of datalen / data_ptr and additional
data and we will still be backwards compatible with all the other
options. We just have to make sure that the processing routines always
eat up all of the data, even if the command is not recognized, so that
the connection will be in a sane state after that and communication may
proceed.
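
The "eat up all of the data" part could look roughly like the sketch
below: read and discard exactly datalen bytes so that the next header
starts at a clean boundary. drain_payload is a hypothetical helper, not
an existing lxc function.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Read and discard exactly `datalen` bytes from fd.
 * Returns 0 on success, -1 on a negative length, early EOF or read error. */
static int drain_payload(int fd, int datalen)
{
    char scratch[256];

    if (datalen < 0)
        return -1;

    while (datalen > 0) {
        size_t want = (size_t)datalen < sizeof(scratch)
                          ? (size_t)datalen : sizeof(scratch);
        ssize_t got = read(fd, scratch, want);

        if (got < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, just retry */
            return -1;
        }
        if (got == 0)
            return -1;      /* peer closed before sending everything */
        datalen -= (int)got;
    }

    assert(datalen == 0);   /* we never read past the announced length */
    return 0;
}
```

The handler for an unrecognized command would call this before sending
its reply, so a later, known command on the same connection still
parses correctly.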

If the server doesn't recognize a command, it sends the trivial
response { -ENOSYS, 0, 0 } back to the client. The client then knows
that the server is too old / too new to support the command and has to
cope with it. For something like LXC_CMD_GET_STATE or
LXC_CMD_GET_INIT_PID one might want to write a fallback routine in the
client; for LXC_CMD_CONSOLE perhaps not. It depends on why the change
is required.
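
A client-side fallback might be sketched like this. Everything here is
hypothetical: the command numbers, the send_fn transport callback, and
the simplification that the pid is carried in ret. old_monitor stands
in for a monitor that only knows the old command and answers anything
else with -ENOSYS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Response header; the data pointer is left out for brevity. */
struct cmd_rsp {
    int ret;
    int datalen;
};

/* Hypothetical transport: sends one command, returns the response. */
typedef struct cmd_rsp (*send_fn)(int cmd);

/* Made-up command numbers for the old and the replacement command. */
enum { CMD_GET_INIT_PID_OLD = 4, CMD_GET_INIT_PID_NEW = 10 };

/* Try the new command first; if the monitor does not know it,
 * fall back to the old one. Returns the pid or a negative errno. */
static int get_init_pid(send_fn send)
{
    struct cmd_rsp rsp;

    assert(send != NULL);
    rsp = send(CMD_GET_INIT_PID_NEW);
    if (rsp.ret == -ENOSYS)
        rsp = send(CMD_GET_INIT_PID_OLD);
    return rsp.ret;
}

/* Stand-in for an older monitor: unknown commands get { -ENOSYS, 0 }. */
static struct cmd_rsp old_monitor(int cmd)
{
    struct cmd_rsp rsp = { -ENOSYS, 0 };

    if (cmd == CMD_GET_INIT_PID_OLD)
        rsp.ret = 1234;  /* pretend init pid */
    return rsp;
}
```

Calling get_init_pid(old_monitor) takes the fallback path; against a
monitor that knows the new command the second round trip is skipped.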

Add big fat comments in the appropriate parts of commands.h/commands.c
to make sure that nobody changes this (+ perhaps a few unit tests) and
there will be compatibility even between versions.
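
Besides comments, the numbering could be pinned by the compiler. The
sketch below uses a made-up subset of the enum (the CMD_* names and
values are placeholders, not the real list) and C11 _Static_assert so
that reordering or removing an entry breaks the build.

```c
#include <assert.h>

/* Hypothetical subset of the command enum, for illustration only. */
typedef enum {
    CMD_CONSOLE,       /* 0: frozen */
    CMD_STOP,          /* 1: frozen */
    CMD_GET_STATE,     /* 2: frozen */
    CMD_GET_INIT_PID,  /* 3: frozen */
    CMD_MAX,           /* new commands go directly above this line */
} cmd_t;

/* Compile-time guard: fails if an existing entry changes its number. */
_Static_assert(CMD_CONSOLE == 0, "CMD_CONSOLE wire value must not change");
_Static_assert(CMD_STOP == 1, "CMD_STOP wire value must not change");
_Static_assert(CMD_GET_STATE == 2, "CMD_GET_STATE wire value must not change");
_Static_assert(CMD_GET_INIT_PID == 3,
               "CMD_GET_INIT_PID wire value must not change");

/* Runtime counterpart for a unit test: CMD_MAX may grow as commands are
 * appended, but it can never drop below the number of frozen commands. */
static void test_cmd_max_only_grows(void)
{
    assert(CMD_MAX >= 4);
}
```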

> But we might want to try and accommodate newer clients talking to
> older versions, somehow. I suspect that'd be fragile, but it might
> be worthwhile.

I think that's generally a good idea (for clients post 1.0; I think for
1.0 it's reasonable to say we do a final incompatible break) and at
least for core functionality it should be policy that there will be
compatibility.

Just my 2¢.

-- Christian