[lxc-users] sysctl.conf and security/limits.conf tuning for running containers
Adrian Pepper
arpepper at uwaterloo.ca
Fri Sep 13 20:39:02 UTC 2019
I'll start this lengthy message with a table-of-contents of sorts.
=== Introduction ===
=== Only a limited number of containers could run usefully ===
=== github lxc/lxd production-setup.md ===
=== Recommended values seem arbitrary, perhaps excessive in some cases ===
=== /sbin/init sometimes missing on apparently healthy containers? ===
=== Using USB disk on container-heavy host used to exceed some queue limit ===
=== Feedback before I report my personal results where I work? ===
=== fs.file-max probably a red-herring ===
=== My previous pty tweaking now raises a distinct question ===
=== Trivial observation re: sysctl which helped me when I noted it ===
I claim my rambling all joins together, but anyone looking for specific
observations to critique, or questions to answer, may be helped by
scanning for the "===" lines. They also serve as summaries if you
read sequentially.
=== Introduction ===
One infers that some kernel tuning is required when running an lxc
environment. By running many conceptual hosts on one machine, you
violate assumptions that the default kernel parameters make.
A guide which gives its readers enough perspective to make most of
the required tuning intuitive would be ideal.
Myself, I run lxc (3.0.3) on my Ubuntu 18.04 workstation (16G RAM),
partly to learn about lxc, partly to keep obsolescent software around(!)
(also, previously, to use 18.04 versions of some things on Ubuntu 16.04)
and largely to prototype ideas for production in an easily created
(and destroyed) isolated setting. I also use it for data isolation,
a sort of modularization. I can easily take work I have been doing to
either of a couple of laptops also running lxc.
Towards the end here, I reveal that even our production servers,
presumably tuned by people who I generally figure have a lot more
kernel intuition than I do, do not consistently show some of the
settings I have managed to discover as important.
=== Only a limited number of containers could run usefully ===
I had had problems on my workstation running more than about 10
containers; subsequent ones would show as RUNNING, but have no IP
address. lxc-attach suggested /sbin/init was actually hung, with
no apparent way to recover them. I used to resort to shutting down
less-needed containers so that new ones could run usefully.
Then one day, I decided to pursue the problem a little harder.
=== github lxc/lxd production-setup.md ===
Eventually, mostly by checking my mbox archive of this list
(lxc-users at lists.linuxcontainers.org), I stumbled on...
https://github.com/lxc/lxd/blob/master/doc/production-setup.md
It's not clear to me what the context of that document really is.
Does it end up in the documentation shipped with lxd? (I still use lxc.)
But even read directly from the git repository, it provides useful
information.
I summarized that production-setup.md for myself...
/etc/security/limits.conf
#<domain>  <type>  <item>    <value>     <default>
*          soft    nofile    1048576     unset
*          hard    nofile    1048576     unset
root       soft    nofile    1048576     unset
root       hard    nofile    1048576     unset
*          soft    memlock   unlimited   unset
*          hard    memlock   unlimited   unset
/etc/sysctl.conf (effective)
fs.inotify.max_queued_events = 1048576      # default: 16384
fs.inotify.max_user_instances = 1048576     # default: 128
fs.inotify.max_user_watches = 1048576       # default: 8192
vm.max_map_count = 262144                   # default: 65530; max memory map areas per process
kernel.dmesg_restrict = 1                   # default: 0
net.ipv4.neigh.default.gc_thresh3 = 8192    # default: 1024; ARP table limit
net.ipv6.neigh.default.gc_thresh3 = 8192    # default: 1024; neighbour table limit
kernel.keys.maxkeys = 2000                  # default: 200; non-root key limit,
                                            # should be greater than the number of containers
net.core.netdev_max_backlog: "increase"; the document suggests 182757 (default 1000!)
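For anyone wanting to try these, here is a minimal sketch of how I
persist such settings on Ubuntu 18.04; the file name 60-lxc-tuning.conf
is only my own choice, and the values are the ones summarized above:

# /etc/sysctl.d/60-lxc-tuning.conf  (or append to /etc/sysctl.conf)
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144
kernel.dmesg_restrict = 1
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv6.neigh.default.gc_thresh3 = 8192
kernel.keys.maxkeys = 2000

Then, as root, reload without rebooting:

sysctl --system    # re-reads /etc/sysctl.d/*.conf and /etc/sysctl.conf

(The limits.conf entries, by contrast, only take effect for new login
sessions.)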
During this most recent investigation, I had already come to suspect
fs.inotify.max_user_watches (because a "tail" I ran reported that it
could not use inotify and had to fall back to polling).
(Hey, there I sound like a natural kernel geek, but actually I needed
a few web searches to correlate the tail diagnostic with that setting.)
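As an aside, if you want to check whether the watch limit is the one
actually being hit, a rough check like the following works for me
(counting the "inotify" lines in /proc/*/fdinfo/* is only an
approximation, and you need root to see every process):

sysctl fs.inotify.max_user_watches                       # current limit
cat /proc/*/fdinfo/* 2>/dev/null | grep -c '^inotify'    # watches currently registered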
production-setup.md also has suggestions about txqueuelen, but I will
assume for now those apply only to systems wanting to generate or
receive a lot of real network traffic.
=== Recommended values seem arbitrary, perhaps excessive in some cases ===
In the suggestions above, 1048576 is 1024*1024 and seems very
arbitrary. Hopefully these limits mostly just bound table sizes and so
don't consume much memory unless usage actually gets close to the
maximum. I actually used smaller values, a little more in line with
the proportions of the defaults shown above:
cscf-adm at scspc578-1804:~$ grep '^' /proc/sys/fs/inotify/*
/proc/sys/fs/inotify/max_queued_events:262144
/proc/sys/fs/inotify/max_user_instances:131072
/proc/sys/fs/inotify/max_user_watches:262144
cscf-adm at scspc578-1804:~$
Searching for more information about netdev_max_backlog turned up
https://community.mellanox.com/s/article/linux-sysctl-tuning
which suggests raising net.core.netdev_max_backlog to 250000,
so I went with that.
I still haven't figured out the significance of 182757, apparently
the product of two primes, 3 * 60919; nor can I see any significance
in any of its near-adjacent numbers.
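For what it's worth, you can get a hint about whether the backlog is
actually overflowing by looking at /proc/net/softnet_stat; as I read
the kernel documentation (so treat this as my assumption), the second
hex column on each per-CPU line counts packets dropped because
netdev_max_backlog was exceeded:

cat /proc/net/softnet_stat            # one line per CPU; 2nd column = drops (hex)
sysctl net.core.netdev_max_backlog    # current limit
sysctl -w net.core.netdev_max_backlog=250000    # the Mellanox suggestion (as root)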
After applying changes similar to the above, I observed very good
results. Whereas before I seemed to run into problems at around 12
containers, I am currently running 17, and have run more.
=== /sbin/init sometimes missing on apparently healthy containers? ===
Also, I previously observed that the number of /sbin/init processes
running was significantly smaller than the number of apparently
properly functional containers. The good news is that today there are
almost as many /sbin/init processes running as containers. The bad
news is that N(/sbin/init) == N(containers)-1, whereas I would think
it should equal N(containers)+1 (the host's own init plus one per
container).
(That is, by sshing to each container in turn and looking for init in
the "ps" output, I confirmed that two containers had no init process
running, yet both seem to be generally working.)
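For the record, here is the ad-hoc way I count them; the container
inits show up in the host's "ps" output with /sbin/init as the command
line, and lxc-ls can list the running containers:

ps -e -o args= | grep -c '^/sbin/init'    # init processes visible on the host (usually includes the host's own)
lxc-ls --running -1 | wc -l               # containers currently shown as RUNNING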
The total number of processes I run is, according to "ps", nearly
always less than 1000. (Usually "ps -adelfww").
I almost wonder if that was a transitory problem in Ubuntu 18.04 which
gets fixed in each container once the appropriate dist-upgrade is done.
=== Using USB disk on container-heavy host used to exceed some queue limit ===
One of these changes, probably either net.core.netdev_max_backlog or
fs.inotify.max_queued_events, seems to have had the pleasant side
effect of letting me write backups to a USB drive without flakiness in
my user interface; it also eliminated the diagnostics which used to
appear in that situation, reporting that some queue limit was being
raised because events had been lost.
=== Feedback before I report my personal results where I work? ===
Looking around at our production servers here, I observe that they do not
seem to have very aggressive tuning of these resources. Often they appear
to be using the defaults. I thought I'd see if I could get some more
feedback before approaching my boss about that (and besides he's away ill
today).
Perhaps there are other standard documents on this subject that you
can point me to.
=== fs.file-max probably a red-herring ===
Previously, I may have had mediocre success with increasing
/proc/sys/fs/file-max
Perhaps raising that value caused container rlimits to use higher
defaults, or something.
For a few kernel resources, and that one in particular, I discovered
you can monitor current consumption with another proc location.
/proc/sys/fs/file-nr does not seem to indicate high usage of that
resource.
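For reference, /proc/sys/fs/file-nr contains three numbers: allocated
file handles, free (allocated-but-unused) handles, and the maximum
(i.e. file-max). So a quick check looks like:

cat /proc/sys/fs/file-nr    # allocated  unused  maximum
cat /proc/sys/fs/file-max

In my case the first number stays far below the third, which is why I
call file-max a probable red herring.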
Perhaps the diagnostics I saw were really about inotify but sounded
as though they were talking about file descriptors themselves.
So when I feel like doing the experiment, I should try reducing my
raised /proc/sys/fs/file-max.
=== My previous pty tweaking now raises a distinct question ===
Another distinct problem earlier caused me to raise
/proc/sys/kernel/pty/max
Given the apparent value of /proc/sys/kernel/pty/reserve (1024 here),
does one need to set kernel.pty.max to (N*1024 plus the total number
of ptys you expect to allocate), where N is the number of containers
you expect to run concurrently?
/proc/sys/kernel/pty/nr
never seems particularly high now.
(kernel.pty.max is another of the apparently few system parameters for
which you can monitor the current usage, via kernel.pty.nr.)
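To make that question concrete, here is the back-of-envelope
arithmetic I have in mind; N_CONTAINERS and EXPECTED_PTYS are
placeholder numbers of my own, and the formula is only my guess at how
the reserve interacts with containers, not something I have confirmed:

N_CONTAINERS=20
EXPECTED_PTYS=200
echo $(( N_CONTAINERS * 1024 + EXPECTED_PTYS ))    # candidate value for kernel.pty.max
grep '^' /proc/sys/kernel/pty/{max,reserve,nr}     # current limit, reserve, and ptys in use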
=== Trivial observation re: sysctl which helped me when I noted it ===
"sysctl kernel.pty.max" <=> "cat /proc/sys/kernel/pty/max" sort of.
I.e. "sysctl A.B.C.D" <=> "cat /proc/sys/A/B/C/D"
Adrian Pepper
Computer Science Computing Facility
University of Waterloo, Ontario, Canada
arpepper at uwaterloo.ca