[lxc-users] sysctl.conf and security/limits.conf tuning for running containers
Andrey Repin
anrdaemon at yandex.ru
Sat Sep 14 07:01:54 UTC 2019
Greetings, Adrian Pepper!
> I'll start this lengthy message with a table-of-contents of sorts.
Next time, please post a new message when you open a new thread on the list.
> === Only a limited number of containers could run usefully ===
> I had had problems on my workstation running more than about 10
> containers; subsequent ones would show as RUNNING, but have no IP
> address. lxc-attach suggested /sbin/init was actually hung, with
> no apparent way to recover them. I used to resort to shutting down
> less-needed containers to allow new ones to run usefully.
>
> Then one day, I decided to try and pursue the problem a little harder.
>
> === github lxc/lxd production-setup.md ===
> Eventually, mostly by checking my mbox archive of this list
> (lxc-users at lists.linuxcontainers.org), I stumbled on...
> https://github.com/lxc/lxd/blob/master/doc/production-setup.md
> It's not clear to me what the context of that document really is.
> Does it end up in the contents of lxd? (I still use lxc).
> But even referenced directly from the git repository, it still
> provides useful information.
> I summarized that production-setup.md for myself...
> /etc/security/limits.conf
> #<domain> <type> <item> <value>
> * soft nofile 1048576 # def:unset
> * hard nofile 1048576 # def:unset
> root soft nofile 1048576 # def:unset
> root hard nofile 1048576 # def:unset
> * soft memlock unlimited # def:unset
> * hard memlock unlimited # def:unset
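A note for anyone applying these: limits.conf is read by pam_limits at
login, so the new values only appear in fresh sessions. A quick check,
as a sketch:

    # In a new login session, verify the nofile limits took effect:
    ulimit -Sn    # soft limit; expect 1048576
    ulimit -Hn    # hard limit; expect 1048576
    # Max locked memory (expect "unlimited" from the memlock lines):
    ulimit -l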
> /etc/sysctl.conf (effective)
> fs.inotify.max_queued_events 1048576 # def:16384
> fs.inotify.max_user_instances 1048576 # def:128
> fs.inotify.max_user_watches 1048576 # def:8192
> vm.max_map_count 262144 # def:65530 max memory map areas per proc
> kernel.dmesg_restrict 1 # def:0
> net.ipv4.neigh.default.gc_thresh3 8192 # def:1024 arp table limit
> net.ipv6.neigh.default.gc_thresh3 8192 # def:1024 arp table limit
> kernel.keys.maxkeys 2000 # def:200 non-root key limit
> # should be > number of containers
> net.core.netdev_max_backlog 182757 # def:1000(!) doc just says "increase"
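A side note for anyone applying these: you don't have to edit
/etc/sysctl.conf itself. A minimal sketch, assuming a distribution that
reads /etc/sysctl.d/ (the file name 99-lxc-tuning.conf is only an example):

    # /etc/sysctl.d/99-lxc-tuning.conf -- example fragment using the
    # values quoted above
    fs.inotify.max_queued_events = 1048576
    fs.inotify.max_user_instances = 1048576
    fs.inotify.max_user_watches = 1048576

    # Load all sysctl.d/ files immediately, without a reboot:
    sudo sysctl --system

    # Or set a single value ad hoc (not persistent across reboots):
    sudo sysctl -w fs.inotify.max_user_watches=1048576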
> During this most recent investigation, I happened to suspect
> fs.inotify.max_user_watches, because a "tail" I ran indicated that it
> could not use inotify and needed to poll instead.
> (Hey, there I sound like a natural kernel geek, but actually I needed
> a few web searches to correlate the tail diagnostic to the setting.)
> production-setup.md also has suggestions about txqueuelen, but I will
> assume for now those apply only to systems wanting to generate or
> receive a lot of real network traffic.
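For completeness, a sketch of the txqueuelen side (eth0 is a placeholder
interface name, and 10000 is only an example value; check
production-setup.md for its actual suggestion):

    # Inspect the current transmit queue length:
    ip link show dev eth0 | grep -o 'qlen [0-9]*'
    # Raise it (takes effect immediately; not persistent across reboots):
    sudo ip link set dev eth0 txqueuelen 10000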
> === Recommended values seem arbitrary, perhaps excessive in some cases ===
> In the suggestions above:
> 1048576 is 1024*1024 and seems very arbitrary.
> Hopefully, this mostly increases the size of edge-pointer tables and
> so doesn't consume much memory unless the resources actually get close
> to the maximum. I used smaller values, a little more in line with the
> proportions of the defaults (shown above).
> cscf-adm@scspc578-1804:~$ grep '^' /proc/sys/fs/inotify/*
> /proc/sys/fs/inotify/max_queued_events:262144
> /proc/sys/fs/inotify/max_user_instances:131072
> /proc/sys/fs/inotify/max_user_watches:262144
> cscf-adm@scspc578-1804:~$
> Searching for more info about netdev_max_backlog turned up
> https://community.mellanox.com/s/article/linux-sysctl-tuning
> which suggests raising net.core.netdev_max_backlog to 250000.
> So I went with that.
> I still haven't figured out the significance of 182757, the apparent
> product of two primes, 3 * 60919. Nor can I see any significance to
> any of its near-adjacent numbers.
> After applying changes similar to the above, I observed very good
> results. Whereas before I seemed to run into problems at around 12
> containers, I am currently running 17, and have run more.
It would be useful if you could discover/describe some direct ways to
investigate limits congestion. That would be much more helpful for tuning
container host systems for specific needs.
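As a starting point, here is a rough sketch for the inotify limits, which
are among the few where current consumption can be counted directly
(needs root to see other users' file descriptors; $PID is a placeholder):

    # Count inotify instances open system-wide, to compare against
    # fs.inotify.max_user_instances (which is enforced per user):
    sudo find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

    # Count the watches held by one process; each watch shows up as
    # an "inotify wd:..." line in the fdinfo files:
    sudo cat /proc/$PID/fdinfo/* 2>/dev/null | grep -c '^inotify'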
> === /sbin/init sometimes missing on apparently healthy containers? ===
> Also, I previously observed that the number of /sbin/init processes
> running was significantly fewer than the number of apparently properly
> functional containers. The good news is that today there are almost
> as many /sbin/init processes running as containers. The bad news is
> that N(/sbin/init) == N(containers)-1, whereas I would think it should
> equal N(containers)+1 (the extra one being the host's own /sbin/init).
> (That is, by sshing to each container in turn and looking for /init in
> the "ps" output, I confirmed that two containers had no /init running,
> but both seem to be generally working.)
Were they created from custom images?
What do they report as pid 1?
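A quick way to check both from the host, sketched with a placeholder
container name c1:

    # What the container itself sees as PID 1:
    lxc-attach -n c1 -- cat /proc/1/comm
    # The host-side PID of the container's init:
    lxc-info -n c1 -p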
> The total number of processes I run is, according to "ps", nearly
> always less than 1000. (Usually "ps -adelfww").
> I almost wonder if that was a transitory problem in Ubuntu 18.04 which
> gets fixed in the containers as the appropriate dist-upgrade gets done.
> === Using USB disk on container-heavy host used to exceed some queue limit ===
> One of these changes, probably either net.core.netdev_max_backlog or
> fs.inotify.max_queued_events, seems to have had the pleasant side effect
> of letting me write backups to a USB drive without flakiness in my user
> interface. It also got rid of the diagnostics which used to appear in
> that situation, about some queue limit being raised because of observed
> lost events.
More likely fs.inotify.max_queued_events.
> === My previous pty tweaking now raises a distinct question ===
> Another distinct problem caused me to raise
> /proc/sys/kernel/pty/max
> Given the apparent value /proc/sys/kernel/pty/reserve:1024,
> does one need to set kernel/pty/max to (N*1024 plus the total number
> of ptys you expect to allocate), where N is the number of containers
> you expect to run concurrently?
> /proc/sys/kernel/pty/nr
> never seems particularly high now.
> (/proc/sys/kernel/pty/max being another of the apparently few system
> parameters for which you can monitor the current usage).
Now, this is interesting.
I was routinely killing the 3 default login sessions started inside each
container, which seemed to exist for no apparent reason. It looks like I
wasn't far off in doing that.
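Since kernel/pty is one of the spots where usage is directly observable,
a small sketch for anyone tuning it (4096 is an arbitrary example value):

    # Currently allocated ptys vs. the configured maximum:
    cat /proc/sys/kernel/pty/nr /proc/sys/kernel/pty/max
    # Raise the maximum on the fly:
    sudo sysctl -w kernel.pty.max=4096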
> === Trivial observation re: sysctl which helped me when I noted it ===
> "sysctl kernel.pty.max" <=> "cat /proc/sys/kernel/pty/max" sort of.
> I.e. "sysctl A.B.C.D" <=> "cat /proc/sys/A/B/C/D"
Yep. sysctl is a sort of wrapper; you can achieve results similar to
sysctl / sysctl -w with a simple cat/echo to the respective "files" under
/proc.
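Concretely, with kernel.pty.max as the example key:

    # Reading; both print the same value:
    sysctl -n kernel.pty.max
    cat /proc/sys/kernel/pty/max
    # Writing; same effect either way (both need root, and neither
    # persists across a reboot):
    sysctl -w kernel.pty.max=8192
    echo 8192 > /proc/sys/kernel/pty/max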
--
With best regards,
Andrey Repin
Saturday, September 14, 2019 8:55:58
Sorry for my terrible English...