[lxc-users] sysctl.conf and security/limits.conf tuning for running containers
Adrian Pepper
arpepper at uwaterloo.ca
Fri Sep 13 20:39:02 UTC 2019
I'll start this lengthy message with a table-of-contents of sorts.
=== Introduction ===
=== Only a limited number of containers could run usefully ===
=== github lxc/lxd production-setup.md ===
=== Recommended values seem arbitrary, perhaps excessive in some cases ===
=== /sbin/init sometimes missing on apparently healthy containers? ===
=== Using USB disk on container-heavy host used to exceed some queue limit ===
=== Feedback before I report my personal results where I work? ===
=== fs.file-max probably a red-herring ===
=== My previous pty tweaking now raises a distinct question ===
=== Trivial observation re: sysctl which helped me when I noted it ===
I claim my rambling all joins together, but anyone looking for specific
observations to critique, or questions to answer, may be helped by
scanning for the "===" lines. They also serve as summaries if you
read sequentially.
=== Introduction ===
One infers that some kernel tuning is required when running an lxc
environment. By running many conceptual hosts on one machine, you
violate assumptions that the default kernel parameters make.
A guide which gives its readers enough perspective to make most of
the required tuning intuitive would be ideal.
Myself, I run lxc (3.0.3) on my Ubuntu 18.04 workstation (16G RAM),
partly to learn about lxc, partly to keep obsolescent software around(!)
(also, previously, to use 18.04 versions of some things on Ubuntu 16.04)
and largely to prototype ideas for production in an easily created
(and destroyed) isolated setting. I also use it for data isolation,
a sort of modularization. I can easily take work I have been doing to
either of a couple of laptops also running lxc.
Towards the end here, I reveal that even our production servers,
presumably tuned by people who I generally figure have a lot more
kernel intuition than I do, do not consistently show some of the
settings I have managed to discover as important.
=== Only a limited number of containers could run usefully ===
I had had problems on my workstation running more than about 10
containers; subsequent ones would show as RUNNING, but have no IP
address. lxc-attach suggested /sbin/init was actually hung, with
no apparent way to recover them. I used to resort to shutting down
less-needed containers so that new ones could run usefully.
Then one day, I decided to pursue the problem a little harder.
=== github lxc/lxd production-setup.md ===
Eventually, mostly by checking my mbox archive of this list
(lxc-users at lists.linuxcontainers.org), I stumbled on...
https://github.com/lxc/lxd/blob/master/doc/production-setup.md
It's not clear to me what the context of that document really is.
Does it end up in the documentation shipped with lxd? (I still use lxc.)
But even read directly from the git repository, it provides useful
information.
I summarized that production-setup.md for myself...
/etc/security/limits.conf
#<domain>  <type>  <item>    <value>     <default>
*          soft    nofile    1048576     unset
*          hard    nofile    1048576     unset
root       soft    nofile    1048576     unset
root       hard    nofile    1048576     unset
*          soft    memlock   unlimited   unset
*          hard    memlock   unlimited   unset
/etc/sysctl.conf (effective)
fs.inotify.max_queued_events = 1048576      # default: 16384
fs.inotify.max_user_instances = 1048576     # default: 128
fs.inotify.max_user_watches = 1048576       # default: 8192
vm.max_map_count = 262144                   # default: 65530; max memory map areas per process
kernel.dmesg_restrict = 1                   # default: 0
net.ipv4.neigh.default.gc_thresh3 = 8192    # default: 1024; ARP table limit
net.ipv6.neigh.default.gc_thresh3 = 8192    # default: 1024; neighbour table limit
kernel.keys.maxkeys = 2000                  # default: 200; non-root key limit,
                                            # should be greater than the number of containers
net.core.netdev_max_backlog: "increase"; the document suggests 182757 (default 1000!)
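For anyone wanting to try these, here is a minimal sketch of how I
persist such settings on Ubuntu 18.04; the file name 60-lxc-tuning.conf
is only my own choice, and the values are the ones summarized above:

# /etc/sysctl.d/60-lxc-tuning.conf  (or append to /etc/sysctl.conf)
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144
kernel.dmesg_restrict = 1
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh3 = 8192
kernel.keys.maxkeys = 2000

Then, as root, reload without rebooting:

sysctl --system    # re-reads /etc/sysctl.d/*.conf and /etc/sysctl.conf

(The limits.conf entries, by contrast, only take effect for new login
sessions.)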
During this most recent investigation, I had already come to suspect
fs.inotify.max_user_watches (because a "tail" I ran reported that it
could not use inotify and had to fall back to polling).
(Hey, there I sound like a natural kernel geek, but actually I needed
a few web searches to correlate the tail diagnostic with that setting.)
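As an aside, if you want to check whether the watch limit is the one
actually being hit, a rough check like the following works for me
(counting the "inotify" lines in /proc/*/fdinfo/* is only an
approximation, and you need root to see every process):

sysctl fs.inotify.max_user_watches                       # current limit
cat /proc/*/fdinfo/* 2>/dev/null | grep -c '^inotify'    # watches currently registered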
production-setup.md also has suggestions about txqueuelen, but I will
assume for now those apply only to systems wanting to generate or
receive a lot of real network traffic.
=== Recommended values seem arbitrary, perhaps excessive in some cases ===
In the suggestions above, 1048576 is 1024*1024 and seems very
arbitrary. Hopefully these limits mostly just bound table sizes and so
don't consume much memory unless usage actually gets close to the
maximum. I actually used smaller values, a little more in line with
the proportions of the defaults shown above:
cscf-adm at scspc578-1804:~$ grep '^' /proc/sys/fs/inotify/*
/proc/sys/fs/inotify/max_queued_events:262144
/proc/sys/fs/inotify/max_user_instances:131072
/proc/sys/fs/inotify/max_user_watches:262144
cscf-adm at scspc578-1804:~$
Searching for more information about netdev_max_backlog turned up
https://community.mellanox.com/s/article/linux-sysctl-tuning
which suggests raising net.core.netdev_max_backlog to 250000,
so I went with that.
I still haven't figured out the significance of 182757, apparently
the product of two primes, 3 * 60919; nor can I see any significance
in any of its near-adjacent numbers.
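For what it's worth, you can get a hint about whether the backlog is
actually overflowing by looking at /proc/net/softnet_stat; as I read
the kernel documentation (so treat this as my assumption), the second
hex column on each per-CPU line counts packets dropped because
netdev_max_backlog was exceeded:

cat /proc/net/softnet_stat            # one line per CPU; 2nd column = drops (hex)
sysctl net.core.netdev_max_backlog    # current limit
sysctl -w net.core.netdev_max_backlog=250000    # the Mellanox suggestion (as root)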
After applying changes similar to the above, I observed very good
results. Whereas before I seemed to run into problems at around 12
containers, I am currently running 17, and have run more.
=== /sbin/init sometimes missing on apparently healthy containers? ===
Also, I previously observed that the number of /sbin/init processes
running was significantly smaller than the number of apparently
properly functional containers. The good news is that today there are
almost as many /sbin/init processes running as containers. The bad
news is that N(/sbin/init) == N(containers)-1, whereas I would think
it should equal N(containers)+1 (the host's own init plus one per
container).
(That is, by sshing to each container in turn and looking for init in
the "ps" output, I confirmed that two containers had no init process
running, yet both seem to be generally working.)
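For the record, here is the ad-hoc way I count them; the container
inits show up in the host's "ps" output with /sbin/init as the command
line, and lxc-ls can list the running containers:

ps -e -o args= | grep -c '^/sbin/init'    # init processes visible on the host (usually includes the host's own)
lxc-ls --running -1 | wc -l               # containers currently shown as RUNNING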
The total number of processes I run is, according to "ps", nearly
always less than 1000. (Usually "ps -adelfww").
I almost wonder if that was a transitory problem in Ubuntu 18.04 which
gets fixed in each container once the appropriate dist-upgrade is done.
=== Using USB disk on container-heavy host used to exceed some queue limit ===
One of these changes, probably either net.core.netdev_max_backlog or
fs.inotify.max_queued_events, seems to have had the pleasant side
effect of letting me write backups to a USB drive without flakiness in
my user interface; it also eliminated the diagnostics which used to
appear in that situation, reporting that some queue limit was being
raised because events had been lost.
=== Feedback before I report my personal results where I work? ===
Looking around at our production servers here, I observe that they do not
seem to have very aggressive tuning of these resources. Often they appear
to be using the defaults. I thought I'd see if I could get some more
feedback before approaching my boss about that (and besides he's away ill
today).
Perhaps there are other standard documents on this subject that you
can point me to.
=== fs.file-max probably a red-herring ===
Previously, I may have had mediocre success with increasing
/proc/sys/fs/file-max
Perhaps raising that value caused container rlimits to use higher
defaults, or something.
For a few kernel resources, and that one in particular, I discovered
you can monitor current consumption with another proc location.
/proc/sys/fs/file-nr does not seem to indicate high usage of that
resource.
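For reference, /proc/sys/fs/file-nr contains three numbers: allocated
file handles, free (allocated-but-unused) handles, and the maximum
(i.e. file-max). So a quick check looks like:

cat /proc/sys/fs/file-nr    # allocated  unused  maximum
cat /proc/sys/fs/file-max

In my case the first number stays far below the third, which is why I
call file-max a probable red herring.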
Perhaps the diagnostics I saw were really about inotify but sounded
as though they were talking about file descriptors themselves.
So when I feel like doing the experiment, I should try reducing my
raised /proc/sys/fs/file-max.
=== My previous pty tweaking now raises a distinct question ===
Another distinct problem earlier caused me to raise
/proc/sys/kernel/pty/max
Given the apparent value of /proc/sys/kernel/pty/reserve (1024 here),
does one need to set kernel.pty.max to (N*1024 plus the total number
of ptys you expect to allocate), where N is the number of containers
you expect to run concurrently?
/proc/sys/kernel/pty/nr
never seems particularly high now.
(kernel.pty.max is another of the apparently few system parameters for
which you can monitor the current usage, via kernel.pty.nr.)
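To make that question concrete, here is the back-of-envelope
arithmetic I have in mind; N_CONTAINERS and EXPECTED_PTYS are
placeholder numbers of my own, and the formula is only my guess at how
the reserve interacts with containers, not something I have confirmed:

N_CONTAINERS=20
EXPECTED_PTYS=200
echo $(( N_CONTAINERS * 1024 + EXPECTED_PTYS ))    # candidate value for kernel.pty.max
grep '^' /proc/sys/kernel/pty/{max,reserve,nr}     # current limit, reserve, and ptys in use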
=== Trivial observation re: sysctl which helped me when I noted it ===
"sysctl kernel.pty.max" <=> "cat /proc/sys/kernel/pty/max" sort of.
I.e. "sysctl A.B.C.D" <=> "cat /proc/sys/A/B/C/D"
Adrian Pepper
Computer Science Computing Facility
University of Waterloo, Ontario, Canada
arpepper at uwaterloo.ca