[lxc-devel] [lxd/master] Add configuration keys for syscall interception
stgraber on Github
lxc-bot at linuxcontainers.org
Tue Jul 16 22:36:46 UTC 2019
From ebb3c23b837e06b0dede10a236b30445b82ca552 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:31:37 -0400
Subject: [PATCH 1/8] doc/userns: Update to match current behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
doc/userns-idmap.md | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/doc/userns-idmap.md b/doc/userns-idmap.md
index dcd61994c8..2038ed6559 100644
--- a/doc/userns-idmap.md
+++ b/doc/userns-idmap.md
@@ -35,10 +35,13 @@ and `newgidmap` (path lookup) can be found on the system, LXD will fail
the startup of any container until this is corrected as this shows a
broken shadow setup.
-If none of those 4 files can be found, then LXD will assume it's running
-on a host using an old version of shadow. In this mode, LXD will assume
-it can use any uids and gids above 65535 and will take the first 65536
-as its default map.
+
+If none of those files can be found, then LXD will assume a 1000000000
+uid/gid range starting at a base uid/gid of 1000000.
+
+This is the most common case and is usually the recommended setup when
+not running on a system which also hosts fully unprivileged containers
+(where the container runtime itself runs as a user).
## Varying ranges between hosts
The source map is sent when moving containers between hosts so that they
From fde9c86b1f804bbff68a0243079882646ae49df4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:47:03 -0400
Subject: [PATCH 2/8] doc: Add documentation for syscall interception
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
doc/syscall-interception.md | 53 +++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
create mode 100644 doc/syscall-interception.md
diff --git a/doc/syscall-interception.md b/doc/syscall-interception.md
new file mode 100644
index 0000000000..cb63a86b98
--- /dev/null
+++ b/doc/syscall-interception.md
@@ -0,0 +1,53 @@
+# System call interception
+LXD supports intercepting some specific system calls from unprivileged
+containers and, if they're considered to be safe, will execute them with
+elevated privileges on the host.
+
+Doing so comes with a performance impact for the syscall in question and
+will cause some work for LXD to evaluate the request and, if allowed,
+process it with elevated privileges.
+
+# Available system calls
+## mknod / mknodat
+The `mknod` and `mknodat` system calls can be used to create a variety of special files.
+
+Most commonly inside containers, they may be called to create block or character devices.
+Creating such devices isn't allowed in unprivileged containers as this
+is a very easy way to escalate privileges by allowing direct write
+access to resources like disks or memory.
+
+But there are files which are safe to create. For those, intercepting
+this syscall may unblock some specific workloads and allow them to run
+inside an unprivileged container.
+
+The devices which are currently allowed are:
+
+ - overlayfs whiteout (char 0:0)
+ - /dev/console (char 5:1)
+ - /dev/full (char 1:7)
+ - /dev/null (char 1:3)
+ - /dev/random (char 1:8)
+ - /dev/tty (char 5:0)
+ - /dev/urandom (char 1:9)
+ - /dev/zero (char 1:5)
+
+All file types other than block device and character device are sent to
+the kernel as usual, so enabling this feature doesn't change their
+behavior at all.
+
+This can be enabled by setting `security.syscalls.intercept.mknod` to `true`.
+
+## setxattr
+The `setxattr` system call is used to set extended attributes on files.
+
+The attributes which are handled by this currently are:
+
+ - trusted.overlay.opaque (overlayfs directory whiteout)
+
+Note that because the mediation must happen on a number of character
+strings, there is no easy way at present to only intercept the few
+attributes we care about. As we only allow the attributes above, this
+may result in breakage for other attributes that would have been
+previously allowed by the kernel.
+
+This can be enabled by setting `security.syscalls.intercept.setxattr` to `true`.
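
The allow list documented in this patch can be pictured as a lookup on the file type and the device number. The following is only a minimal sketch under that assumption, not LXD's actual handler; `devNum`, `allowedCharDevs` and `isAllowedDevice` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// File-type bits of st_mode as used by Linux.
const (
	sIFMT  = 0170000 // mask selecting the file-type bits
	sIFCHR = 0020000 // character device
	sIFBLK = 0060000 // block device
)

// devNum is a hypothetical helper type holding a major:minor pair.
type devNum struct{ major, minor uint32 }

// allowedCharDevs mirrors the list in doc/syscall-interception.md.
var allowedCharDevs = map[devNum]string{
	{0, 0}: "overlayfs whiteout",
	{5, 1}: "/dev/console",
	{1, 7}: "/dev/full",
	{1, 3}: "/dev/null",
	{1, 8}: "/dev/random",
	{5, 0}: "/dev/tty",
	{1, 9}: "/dev/urandom",
	{1, 5}: "/dev/zero",
}

// isAllowedDevice reports whether a device-creation request with the given
// mode and device number is on the documented allow list. Only character
// devices appear on that list, so anything else is rejected here.
func isAllowedDevice(mode, major, minor uint32) bool {
	if mode&sIFMT != sIFCHR {
		return false
	}
	_, ok := allowedCharDevs[devNum{major, minor}]
	return ok
}

func main() {
	fmt.Println(isAllowedDevice(sIFCHR|0666, 1, 3)) // /dev/null
	fmt.Println(isAllowedDevice(sIFCHR|0640, 1, 1)) // /dev/mem, not listed
	fmt.Println(isAllowedDevice(sIFBLK|0660, 8, 0)) // a block device
}
```

Keying the map on the major:minor pair keeps the check a single lookup, and rejecting by default means a device missing from the list is never created with elevated privileges.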
From 3b9cbade61fe7ba9326bd27c5a9c6fdb9fc1fea4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:52:57 -0400
Subject: [PATCH 3/8] api: Add container_syscall_intercept API extension
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
doc/api-extensions.md | 5 +++++
shared/version/api.go | 1 +
2 files changed, 6 insertions(+)
diff --git a/doc/api-extensions.md b/doc/api-extensions.md
index 76db7cad36..3bf3d1adad 100644
--- a/doc/api-extensions.md
+++ b/doc/api-extensions.md
@@ -798,3 +798,8 @@ Rework the resources API at /1.0/resources, especially:
## container\_exec\_user\_group\_cwd
Adds support for specifying User, Group and Cwd during `POST /1.0/containers/NAME/exec`.
+
+## container\_syscall\_intercept
+Adds the `security.syscalls.intercept.\*` configuration keys to control
+what system calls will be intercepted by LXD and processed with
+elevated permissions.
diff --git a/shared/version/api.go b/shared/version/api.go
index 4b785c0f5a..99fef0b913 100644
--- a/shared/version/api.go
+++ b/shared/version/api.go
@@ -159,6 +159,7 @@ var APIExtensions = []string{
"container_nic_ipfilter",
"resources_v2",
"container_exec_user_group_cwd",
+ "container_syscall_intercept",
}
// APIExtensionsCount returns the number of available API extensions.
From c13ac5af6ef0837a835c8c62327282e9f0825c4c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:02 -0400
Subject: [PATCH 4/8] lxd: Add config keys for syscall interception
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
lxd/container_lxc.go | 22 ++++-----
lxd/seccomp.go | 109 ++++++++++++++++++++++++++++++++++---------
shared/container.go | 10 ++--
3 files changed, 105 insertions(+), 36 deletions(-)
diff --git a/lxd/container_lxc.go b/lxd/container_lxc.go
index 1f83099253..6f1b5dcfba 100644
--- a/lxd/container_lxc.go
+++ b/lxd/container_lxc.go
@@ -1294,11 +1294,21 @@ func (c *containerLXC) initLXC(config bool) error {
}
// Setup Seccomp if necessary
- if ContainerNeedsSeccomp(c) {
+ if seccompContainerNeedsPolicy(c) {
err = lxcSetConfigItem(cc, "lxc.seccomp.profile", SeccompProfilePath(c))
if err != nil {
return err
}
+
+ // Setup notification socket
+ // System requirement errors are handled during policy generation instead of here
+ ok, err := seccompContainerNeedsIntercept(c)
+ if err == nil && ok {
+ err = lxcSetConfigItem(cc, "lxc.seccomp.notify.proxy", fmt.Sprintf("unix:%s", shared.VarPath("seccomp.socket")))
+ if err != nil {
+ return err
+ }
+ }
}
// Setup idmap
@@ -1865,16 +1875,6 @@ func (c *containerLXC) initLXC(config bool) error {
return err
}
- // NOTE: Don't fail in cases where liblxc is recent enough but libseccomp isn't
- // when we add mount() support with user-configurable
- // options, we will want a hard fail if the user configured it
- if !c.IsPrivileged() && !c.state.OS.RunningInUserNS && lxcSupportSeccompNotify(c.state) {
- err = lxcSetConfigItem(cc, "lxc.seccomp.notify.proxy", fmt.Sprintf("unix:%s", shared.VarPath("seccomp.socket")))
- if err != nil {
- return err
- }
- }
-
// Apply raw.lxc
if lxcConfig, ok := c.expandedConfig["raw.lxc"]; ok {
f, err := ioutil.TempFile("", "lxd_config_")
diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index c35eec9379..5f5c4dc468 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -349,11 +349,13 @@ init_module errno 38
finit_module errno 38
delete_module errno 38
`
-const SECCOMP_NOTIFY_POLICY = `mknod notify [1,8192,SCMP_CMP_MASKED_EQ,61440]
+
+const SECCOMP_NOTIFY_MKNOD = `mknod notify [1,8192,SCMP_CMP_MASKED_EQ,61440]
mknod notify [1,24576,SCMP_CMP_MASKED_EQ,61440]
mknodat notify [2,8192,SCMP_CMP_MASKED_EQ,61440]
mknodat notify [2,24576,SCMP_CMP_MASKED_EQ,61440]
-setxattr notify [3,1,SCMP_CMP_EQ]
+`
+const SECCOMP_NOTIFY_SETXATTR = `setxattr notify [3,1,SCMP_CMP_EQ]
`
const COMPAT_BLOCKING_POLICY = `[%s]
@@ -399,9 +401,10 @@ func SeccompProfilePath(c container) string {
return path.Join(seccompPath, c.Name())
}
-func ContainerNeedsSeccomp(c container) bool {
+func seccompContainerNeedsPolicy(c container) bool {
config := c.ExpandedConfig()
+ // Check for text keys
keys := []string{
"raw.seccomp",
"security.syscalls.whitelist",
@@ -415,50 +418,114 @@ func ContainerNeedsSeccomp(c container) bool {
}
}
- compat := config["security.syscalls.blacklist_compat"]
- if shared.IsTrue(compat) {
- return true
+ // Check for boolean keys that default to false
+ keys = []string{
+ "security.syscalls.blacklist_compat",
+ "security.syscalls.intercept.mknod",
+ "security.syscalls.intercept.setxattr",
}
- /* this are enabled by default, so if the keys aren't present, that
- * means "true"
- */
- default_, ok := config["security.syscalls.blacklist_default"]
- if !ok || shared.IsTrue(default_) {
- return true
+ for _, k := range keys {
+ if shared.IsTrue(config[k]) {
+ return true
+ }
+ }
+
+ // Check for boolean keys that default to true
+ keys = []string{
+ "security.syscalls.blacklist_default",
+ }
+
+ for _, k := range keys {
+ value, ok := config[k]
+ if !ok || shared.IsTrue(value) {
+ return true
+ }
}
return false
}
+func seccompContainerNeedsIntercept(c container) (bool, error) {
+ // No need if privileged
+ if c.IsPrivileged() {
+ return false, nil
+ }
+
+ // If nested, assume the host handles it
+ if c.DaemonState().OS.RunningInUserNS {
+ return false, nil
+ }
+
+ config := c.ExpandedConfig()
+
+ keys := []string{
+ "security.syscalls.intercept.mknod",
+ "security.syscalls.intercept.setxattr",
+ }
+
+ needed := false
+ for _, k := range keys {
+ if shared.IsTrue(config[k]) {
+ needed = true
+ break
+ }
+ }
+
+ if needed {
+ if !lxcSupportSeccompNotify(c.DaemonState()) {
+ return needed, fmt.Errorf("System doesn't support syscall interception")
+ }
+ }
+
+ return needed, nil
+}
+
func getSeccompProfileContent(c container) (string, error) {
config := c.ExpandedConfig()
+ // Full policy override
raw := config["raw.seccomp"]
if raw != "" {
return raw, nil
}
+ // Policy header
policy := SECCOMP_HEADER
-
whitelist := config["security.syscalls.whitelist"]
if whitelist != "" {
policy += "whitelist\n[all]\n"
policy += whitelist
- return policy, nil
+ } else {
+ policy += "blacklist\n"
+
+ default_, ok := config["security.syscalls.blacklist_default"]
+ if !ok || shared.IsTrue(default_) {
+ policy += DEFAULT_SECCOMP_POLICY
+ }
}
- policy += "blacklist\n"
+ // Syscall interception
+ ok, err := seccompContainerNeedsIntercept(c)
+ if err != nil {
+ return "", err
+ }
- default_, ok := config["security.syscalls.blacklist_default"]
- if !ok || shared.IsTrue(default_) {
- policy += DEFAULT_SECCOMP_POLICY
+ if ok {
+ if shared.IsTrue(config["security.syscalls.intercept.mknod"]) {
+ policy += SECCOMP_NOTIFY_MKNOD
+ }
+
+ if shared.IsTrue(config["security.syscalls.intercept.setxattr"]) {
+ policy += SECCOMP_NOTIFY_SETXATTR
+ }
}
- if !c.IsPrivileged() && !c.DaemonState().OS.RunningInUserNS && lxcSupportSeccompNotify(c.DaemonState()) {
- policy += SECCOMP_NOTIFY_POLICY
+ if whitelist != "" {
+ return policy, nil
}
+ // Additional blacklist entries
compat := config["security.syscalls.blacklist_compat"]
if shared.IsTrue(compat) {
arch, err := osarch.ArchitectureName(c.Architecture())
@@ -483,7 +550,7 @@ func SeccompCreateProfile(c container) error {
* the mtime on the file for any compiler purpose, so let's just write
* out the profile.
*/
- if !ContainerNeedsSeccomp(c) {
+ if !seccompContainerNeedsPolicy(c) {
return nil
}
diff --git a/shared/container.go b/shared/container.go
index 51e13308f0..5109d1d6ba 100644
--- a/shared/container.go
+++ b/shared/container.go
@@ -265,10 +265,12 @@ var KnownContainerConfigKeys = map[string]func(value string) error{
"security.idmap.isolated": IsBool,
"security.idmap.size": IsUint32,
- "security.syscalls.blacklist_default": IsBool,
- "security.syscalls.blacklist_compat": IsBool,
- "security.syscalls.blacklist": IsAny,
- "security.syscalls.whitelist": IsAny,
+ "security.syscalls.blacklist_default": IsBool,
+ "security.syscalls.blacklist_compat": IsBool,
+ "security.syscalls.blacklist": IsAny,
+ "security.syscalls.intercept.mknod": IsBool,
+ "security.syscalls.intercept.setxattr": IsBool,
+ "security.syscalls.whitelist": IsAny,
"snapshots.schedule": func(value string) error {
if value == "" {
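
The boolean-key handling in `seccompContainerNeedsPolicy` above treats an unset key differently depending on its default. A minimal standalone sketch of that rule follows, assuming a plain `map[string]string` in place of the container's expanded config. `isTrue` is a stand-in for `shared.IsTrue` (the spellings it accepts here are an assumption) and `boolKeysNeedPolicy` is a hypothetical name. The text-key check is left out because that part of the function is not visible in the hunk.

```go
package main

import (
	"fmt"
	"strings"
)

// isTrue is a stand-in for shared.IsTrue; the accepted spellings are an
// assumption made for this sketch.
func isTrue(v string) bool {
	switch strings.ToLower(v) {
	case "true", "1", "yes", "on":
		return true
	}
	return false
}

// boolKeysNeedPolicy applies the two boolean rules from the patch:
// keys defaulting to false count only when explicitly true, while keys
// defaulting to true also count when they are absent.
func boolKeysNeedPolicy(config map[string]string) bool {
	defaultFalse := []string{
		"security.syscalls.blacklist_compat",
		"security.syscalls.intercept.mknod",
		"security.syscalls.intercept.setxattr",
	}
	for _, k := range defaultFalse {
		if isTrue(config[k]) {
			return true
		}
	}

	defaultTrue := []string{
		"security.syscalls.blacklist_default",
	}
	for _, k := range defaultTrue {
		value, ok := config[k]
		if !ok || isTrue(value) {
			return true
		}
	}

	return false
}

func main() {
	// An empty config still needs a policy: blacklist_default is on by default.
	fmt.Println(boolKeysNeedPolicy(map[string]string{}))

	// Disabling the default blacklist and enabling nothing else needs none.
	fmt.Println(boolKeysNeedPolicy(map[string]string{
		"security.syscalls.blacklist_default": "false",
	}))
}
```

Splitting the keys by default value keeps the "absent means true" special case confined to one loop, which is what lets new default-false keys such as the intercept options be added to a list rather than as new branches.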
From 3d4eef7ed0a2b6844fd6419018acc6559edd6942 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:41 -0400
Subject: [PATCH 5/8] scripts: Add new config keys to bash completion
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
scripts/bash/lxd-client | 1 +
1 file changed, 1 insertion(+)
diff --git a/scripts/bash/lxd-client b/scripts/bash/lxd-client
index e4d9db13ad..4ace85cdcf 100644
--- a/scripts/bash/lxd-client
+++ b/scripts/bash/lxd-client
@@ -90,6 +90,7 @@ _have lxc && {
security.nesting security.privileged security.protection.delete \
security.protection.shift security.syscalls.blacklist \
security.syscalls.blacklist_compat security.syscalls.blacklist_default \
+ security.syscalls.intercept.mknod security.syscalls.intercept.setxattr \
snapshots.schedule snapshots.schedule.stopped snapshots.pattern \
volatile.apply_quota volatile.apply_template volatile.base_image \
volatile.idmap.base volatile.idmap.current volatile.idmap.next \
From 85a5508426d86c15a8f2b7c00f6a25b2bb503887 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:58 -0400
Subject: [PATCH 6/8] doc/containers: Add the syscall intercept options
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
doc/containers.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/doc/containers.md b/doc/containers.md
index daa7998faf..4c33c2c254 100644
--- a/doc/containers.md
+++ b/doc/containers.md
@@ -77,6 +77,8 @@ security.protection.shift | boolean | false | yes
security.syscalls.blacklist | string | - | no | container\_syscall\_filtering | A '\n' separated list of syscalls to blacklist
security.syscalls.blacklist\_compat | boolean | false | no | container\_syscall\_filtering | On x86\_64 this enables blocking of compat\_\* syscalls, it is a no-op on other arches
security.syscalls.blacklist\_default | boolean | true | no | container\_syscall\_filtering | Enables the default syscall blacklist
+security.syscalls.intercept.mknod | boolean | false | no | container\_syscall\_intercept | Handles the `mknod` and `mknodat` system calls (allows creation of a limited subset of char/block devices)
+security.syscalls.intercept.setxattr | boolean | false | no | container\_syscall\_intercept | Handles the `setxattr` system call (allows setting a limited subset of restricted extended attributes)
security.syscalls.whitelist | string | - | no | container\_syscall\_filtering | A '\n' separated list of syscalls to whitelist (mutually exclusive with security.syscalls.blacklist\*)
snapshots.schedule | string | - | no | snapshot\_scheduling | Cron expression (`<minute> <hour> <dom> <month> <dow>`)
snapshots.schedule.stopped | bool | false | no | snapshot\_scheduling | Controls whether or not stopped containers are to be snapshoted automatically
From 4ebccc1a682795f51401d4d7654c1ddea013e21c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:33:31 -0400
Subject: [PATCH 7/8] lxd/seccomp: Don't mask errors
---
lxd/seccomp.go | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index 5f5c4dc468..2f8ca5cab0 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -556,7 +556,7 @@ func SeccompCreateProfile(c container) error {
profile, err := getSeccompProfileContent(c)
if err != nil {
- return nil
+ return err
}
if err := os.MkdirAll(seccompPath, 0700); err != nil {
From 103459091cce6d3c9abbeafd48d2f9eaa8b70b59 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:34:07 -0400
Subject: [PATCH 8/8] lxd/seccomp: Rename getSeccompProfileContent to
seccompGetPolicyContent
---
lxd/seccomp.go | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index 2f8ca5cab0..e0830c7aa1 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -481,7 +481,7 @@ func seccompContainerNeedsIntercept(c container) (bool, error) {
return needed, nil
}
-func getSeccompProfileContent(c container) (string, error) {
+func seccompGetPolicyContent(c container) (string, error) {
config := c.ExpandedConfig()
// Full policy override
@@ -554,7 +554,7 @@ func SeccompCreateProfile(c container) error {
return nil
}
- profile, err := getSeccompProfileContent(c)
+ profile, err := seccompGetPolicyContent(c)
if err != nil {
return err
}