[lxc-devel] [lxd/master] Add configuration keys for syscall interception

stgraber on Github lxc-bot at linuxcontainers.org
Tue Jul 16 22:36:46 UTC 2019


From ebb3c23b837e06b0dede10a236b30445b82ca552 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:31:37 -0400
Subject: [PATCH 1/8] doc/userns: Update to match current behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 doc/userns-idmap.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/doc/userns-idmap.md b/doc/userns-idmap.md
index dcd61994c8..2038ed6559 100644
--- a/doc/userns-idmap.md
+++ b/doc/userns-idmap.md
@@ -35,10 +35,13 @@ and `newgidmap` (path lookup) can be found on the system, LXD will fail
 the startup of any container until this is corrected as this shows a
 broken shadow setup.
 
-If none of those 4 files can be found, then LXD will assume it's running
-on a host using an old version of shadow. In this mode, LXD will assume
-it can use any uids and gids above 65535 and will take the first 65536
-as its default map.
+
+If none of those files can be found, then LXD will assume a 1000000000
+uid/gid range starting at a base uid/gid of 1000000.
+
+This is the most common case and is usually the recommended setup when
+not running on a system which also hosts fully unprivileged containers
+(where the container runtime itself runs as a user).
 
 ## Varying ranges between hosts
 The source map is sent when moving containers between hosts so that they

From fde9c86b1f804bbff68a0243079882646ae49df4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:47:03 -0400
Subject: [PATCH 2/8] doc: Add documentation for syscall interception
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 doc/syscall-interception.md | 53 +++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 doc/syscall-interception.md

diff --git a/doc/syscall-interception.md b/doc/syscall-interception.md
new file mode 100644
index 0000000000..cb63a86b98
--- /dev/null
+++ b/doc/syscall-interception.md
@@ -0,0 +1,53 @@
+# System call interception
+LXD supports intercepting some specific system calls from unprivileged
+containers and, if they're considered to be safe, will execute them with
+elevated privileges on the host.
+
+Doing so comes with a performance impact for the syscall in question and
+will cause some work for LXD to evaluate the request and if allowed,
+process it with elevated privileges.
+
+# Available system calls
+## mknod / mknodat
+The `mknod` and `mknodat` system calls can be used to create a variety of special files.
+
+Most commonly inside containers, they may be called to create block or character devices.
+Creating such devices isn't allowed in unprivileged containers as this
+is a very easy way to escalate privileges by allowing direct write
+access to resources like disks or memory.
+
+But there are files which are safe to create. For those, intercepting
+this syscall may unblock some specific workloads and allow them to run
+inside an unprivileged container.
+
+The devices which are currently allowed are:
+
+ - overlayfs whiteout (char 0:0)
+ - /dev/console (char 5:1)
+ - /dev/full (char 1:7)
+ - /dev/null (char 1:3)
+ - /dev/random (char 1:8)
+ - /dev/tty (char 5:0)
+ - /dev/urandom (char 1:9)
+ - /dev/zero (char 1:5)
+
+All file types other than block device and character device are sent to
+the kernel as usual, so enabling this feature doesn't change their
+behavior at all.
+
+This can be enabled by setting `security.syscalls.intercept.mknod` to `true`.
+
+## setxattr
+The `setxattr` system call is used to set extended attributes on files.
+
+The attributes which are handled by this currently are:
+
+ - trusted.overlay.opaque (overlayfs directory whiteout)
+
+Note that because the mediation must happen on a number of character
+strings, there is no easy way at present to only intercept the few
+attributes we care about. As we only allow the attributes above, this
+may result in breakage for other attributes that would have been
+previously allowed by the kernel.
+
+This can be enabled by setting `security.syscalls.intercept.setxattr` to `true`.

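The device allow list documented in patch 2/8 can be read as a simple lookup on the (major, minor) pair of a character device. The sketch below is only an illustration of that list, not the code LXD uses: the `device` type, the `allowedCharDevices` map and the `isAllowedCharDevice` helper are hypothetical names introduced here.

```go
package main

import "fmt"

// device identifies a character device by major and minor number.
// Hypothetical type, used only for this illustration.
type device struct {
	major, minor uint32
}

// allowedCharDevices mirrors the list in doc/syscall-interception.md.
// The map itself is an assumption about how one might encode that list.
var allowedCharDevices = map[device]string{
	{0, 0}: "overlayfs whiteout",
	{5, 1}: "/dev/console",
	{1, 7}: "/dev/full",
	{1, 3}: "/dev/null",
	{1, 8}: "/dev/random",
	{5, 0}: "/dev/tty",
	{1, 9}: "/dev/urandom",
	{1, 5}: "/dev/zero",
}

// isAllowedCharDevice reports whether the major:minor pair is on the
// documented allow list, and returns its description if so.
func isAllowedCharDevice(major, minor uint32) (string, bool) {
	name, ok := allowedCharDevices[device{major, minor}]
	return name, ok
}

func main() {
	for _, d := range []device{{1, 3}, {5, 1}, {8, 0}} {
		if name, ok := isAllowedCharDevice(d.major, d.minor); ok {
			fmt.Printf("char %d:%d allowed (%s)\n", d.major, d.minor, name)
		} else {
			fmt.Printf("char %d:%d rejected\n", d.major, d.minor)
		}
	}
}
```

Anything not in the table, such as 8:0 in the loop above, would be refused, which matches the documented intent of only permitting devices that are safe to create.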
From 3b9cbade61fe7ba9326bd27c5a9c6fdb9fc1fea4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 17:52:57 -0400
Subject: [PATCH 3/8] api: Add container_syscall_intercept API extension
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 doc/api-extensions.md | 5 +++++
 shared/version/api.go | 1 +
 2 files changed, 6 insertions(+)

diff --git a/doc/api-extensions.md b/doc/api-extensions.md
index 76db7cad36..3bf3d1adad 100644
--- a/doc/api-extensions.md
+++ b/doc/api-extensions.md
@@ -798,3 +798,8 @@ Rework the resources API at /1.0/resources, especially:
 
 ## container\_exec\_user\_group\_cwd
 Adds support for specifying User, Group and Cwd during `POST /1.0/containers/NAME/exec`.
+
+## container\_syscall\_intercept
+Adds the `security.syscalls.intercept.\*` configuration keys to control
+which system calls will be intercepted by LXD and processed with
+elevated permissions.
diff --git a/shared/version/api.go b/shared/version/api.go
index 4b785c0f5a..99fef0b913 100644
--- a/shared/version/api.go
+++ b/shared/version/api.go
@@ -159,6 +159,7 @@ var APIExtensions = []string{
 	"container_nic_ipfilter",
 	"resources_v2",
 	"container_exec_user_group_cwd",
+	"container_syscall_intercept",
 }
 
 // APIExtensionsCount returns the number of available API extensions.

From c13ac5af6ef0837a835c8c62327282e9f0825c4c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:02 -0400
Subject: [PATCH 4/8] lxd: Add config keys for syscall interception
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 lxd/container_lxc.go |  22 ++++-----
 lxd/seccomp.go       | 109 ++++++++++++++++++++++++++++++++++---------
 shared/container.go  |  10 ++--
 3 files changed, 105 insertions(+), 36 deletions(-)

diff --git a/lxd/container_lxc.go b/lxd/container_lxc.go
index 1f83099253..6f1b5dcfba 100644
--- a/lxd/container_lxc.go
+++ b/lxd/container_lxc.go
@@ -1294,11 +1294,21 @@ func (c *containerLXC) initLXC(config bool) error {
 	}
 
 	// Setup Seccomp if necessary
-	if ContainerNeedsSeccomp(c) {
+	if seccompContainerNeedsPolicy(c) {
 		err = lxcSetConfigItem(cc, "lxc.seccomp.profile", SeccompProfilePath(c))
 		if err != nil {
 			return err
 		}
+
+		// Setup notification socket
+		// System requirement errors are handled during policy generation instead of here
+		ok, err := seccompContainerNeedsIntercept(c)
+		if err == nil && ok {
+			err = lxcSetConfigItem(cc, "lxc.seccomp.notify.proxy", fmt.Sprintf("unix:%s", shared.VarPath("seccomp.socket")))
+			if err != nil {
+				return err
+			}
+		}
 	}
 
 	// Setup idmap
@@ -1865,16 +1875,6 @@ func (c *containerLXC) initLXC(config bool) error {
 		return err
 	}
 
-	// NOTE: Don't fail in cases where liblxc is recent enough but libseccomp isn't
-	//       when we add mount() support with user-configurable
-	//       options, we will want a hard fail if the user configured it
-	if !c.IsPrivileged() && !c.state.OS.RunningInUserNS && lxcSupportSeccompNotify(c.state) {
-		err = lxcSetConfigItem(cc, "lxc.seccomp.notify.proxy", fmt.Sprintf("unix:%s", shared.VarPath("seccomp.socket")))
-		if err != nil {
-			return err
-		}
-	}
-
 	// Apply raw.lxc
 	if lxcConfig, ok := c.expandedConfig["raw.lxc"]; ok {
 		f, err := ioutil.TempFile("", "lxd_config_")
diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index c35eec9379..5f5c4dc468 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -349,11 +349,13 @@ init_module errno 38
 finit_module errno 38
 delete_module errno 38
 `
-const SECCOMP_NOTIFY_POLICY = `mknod notify [1,8192,SCMP_CMP_MASKED_EQ,61440]
+
+const SECCOMP_NOTIFY_MKNOD = `mknod notify [1,8192,SCMP_CMP_MASKED_EQ,61440]
 mknod notify [1,24576,SCMP_CMP_MASKED_EQ,61440]
 mknodat notify [2,8192,SCMP_CMP_MASKED_EQ,61440]
 mknodat notify [2,24576,SCMP_CMP_MASKED_EQ,61440]
-setxattr notify [3,1,SCMP_CMP_EQ]
+`
+const SECCOMP_NOTIFY_SETXATTR = `setxattr notify [3,1,SCMP_CMP_EQ]
 `
 
 const COMPAT_BLOCKING_POLICY = `[%s]
@@ -399,9 +401,10 @@ func SeccompProfilePath(c container) string {
 	return path.Join(seccompPath, c.Name())
 }
 
-func ContainerNeedsSeccomp(c container) bool {
+func seccompContainerNeedsPolicy(c container) bool {
 	config := c.ExpandedConfig()
 
+	// Check for text keys
 	keys := []string{
 		"raw.seccomp",
 		"security.syscalls.whitelist",
@@ -415,50 +418,114 @@ func ContainerNeedsSeccomp(c container) bool {
 		}
 	}
 
-	compat := config["security.syscalls.blacklist_compat"]
-	if shared.IsTrue(compat) {
-		return true
+	// Check for boolean keys that default to false
+	keys = []string{
+		"security.syscalls.blacklist_compat",
+		"security.syscalls.intercept.mknod",
+		"security.syscalls.intercept.setxattr",
 	}
 
-	/* this are enabled by default, so if the keys aren't present, that
-	 * means "true"
-	 */
-	default_, ok := config["security.syscalls.blacklist_default"]
-	if !ok || shared.IsTrue(default_) {
-		return true
+	for _, k := range keys {
+		if shared.IsTrue(config[k]) {
+			return true
+		}
+	}
+
+	// Check for boolean keys that default to true
+	keys = []string{
+		"security.syscalls.blacklist_default",
+	}
+
+	for _, k := range keys {
+		value, ok := config[k]
+		if !ok || shared.IsTrue(value) {
+			return true
+		}
 	}
 
 	return false
 }
 
+func seccompContainerNeedsIntercept(c container) (bool, error) {
+	// No need if privileged
+	if c.IsPrivileged() {
+		return false, nil
+	}
+
+	// If nested, assume the host handles it
+	if c.DaemonState().OS.RunningInUserNS {
+		return false, nil
+	}
+
+	config := c.ExpandedConfig()
+
+	keys := []string{
+		"security.syscalls.intercept.mknod",
+		"security.syscalls.intercept.setxattr",
+	}
+
+	needed := false
+	for _, k := range keys {
+		if shared.IsTrue(config[k]) {
+			needed = true
+			break
+		}
+	}
+
+	if needed {
+		if !lxcSupportSeccompNotify(c.DaemonState()) {
+			return needed, fmt.Errorf("System doesn't support syscall interception")
+		}
+	}
+
+	return needed, nil
+}
+
 func getSeccompProfileContent(c container) (string, error) {
 	config := c.ExpandedConfig()
 
+	// Full policy override
 	raw := config["raw.seccomp"]
 	if raw != "" {
 		return raw, nil
 	}
 
+	// Policy header
 	policy := SECCOMP_HEADER
-
 	whitelist := config["security.syscalls.whitelist"]
 	if whitelist != "" {
 		policy += "whitelist\n[all]\n"
 		policy += whitelist
-		return policy, nil
+	} else {
+		policy += "blacklist\n"
+
+		default_, ok := config["security.syscalls.blacklist_default"]
+		if !ok || shared.IsTrue(default_) {
+			policy += DEFAULT_SECCOMP_POLICY
+		}
 	}
 
-	policy += "blacklist\n"
+	// Syscall interception
+	ok, err := seccompContainerNeedsIntercept(c)
+	if err != nil {
+		return "", err
+	}
 
-	default_, ok := config["security.syscalls.blacklist_default"]
-	if !ok || shared.IsTrue(default_) {
-		policy += DEFAULT_SECCOMP_POLICY
+	if ok {
+		if shared.IsTrue(config["security.syscalls.intercept.mknod"]) {
+			policy += SECCOMP_NOTIFY_MKNOD
+		}
+
+		if shared.IsTrue(config["security.syscalls.intercept.setxattr"]) {
+			policy += SECCOMP_NOTIFY_SETXATTR
+		}
 	}
 
-	if !c.IsPrivileged() && !c.DaemonState().OS.RunningInUserNS && lxcSupportSeccompNotify(c.DaemonState()) {
-		policy += SECCOMP_NOTIFY_POLICY
+	if whitelist != "" {
+		return policy, nil
 	}
 
+	// Additional blacklist entries
 	compat := config["security.syscalls.blacklist_compat"]
 	if shared.IsTrue(compat) {
 		arch, err := osarch.ArchitectureName(c.Architecture())
@@ -483,7 +550,7 @@ func SeccompCreateProfile(c container) error {
 	 * the mtime on the file for any compiler purpose, so let's just write
 	 * out the profile.
 	 */
-	if !ContainerNeedsSeccomp(c) {
+	if !seccompContainerNeedsPolicy(c) {
 		return nil
 	}
 
diff --git a/shared/container.go b/shared/container.go
index 51e13308f0..5109d1d6ba 100644
--- a/shared/container.go
+++ b/shared/container.go
@@ -265,10 +265,12 @@ var KnownContainerConfigKeys = map[string]func(value string) error{
 	"security.idmap.isolated": IsBool,
 	"security.idmap.size":     IsUint32,
 
-	"security.syscalls.blacklist_default": IsBool,
-	"security.syscalls.blacklist_compat":  IsBool,
-	"security.syscalls.blacklist":         IsAny,
-	"security.syscalls.whitelist":         IsAny,
+	"security.syscalls.blacklist_default":  IsBool,
+	"security.syscalls.blacklist_compat":   IsBool,
+	"security.syscalls.blacklist":          IsAny,
+	"security.syscalls.intercept.mknod":    IsBool,
+	"security.syscalls.intercept.setxattr": IsBool,
+	"security.syscalls.whitelist":          IsAny,
 
 	"snapshots.schedule": func(value string) error {
 		if value == "" {

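The numbers in SECCOMP_NOTIFY_MKNOD above are the Linux file-type bits in decimal: 61440 is S_IFMT (0170000), 8192 is S_IFCHR (0020000) and 24576 is S_IFBLK (0060000). Assuming the rule format is [argument index, value, operator, mask], each rule matches when (mode & 61440) equals the given value, so only character and block device creation is sent to LXD. A minimal sketch of that comparison follows; the constant names and the `wouldNotify` helper are hypothetical.

```go
package main

import "fmt"

// File-type bits from the Linux stat ABI, written in decimal to match
// the values used in the seccomp rules.
const (
	sIFMT  = 61440 // 0170000: mask selecting the file-type bits
	sIFCHR = 8192  // 0020000: character device
	sIFBLK = 24576 // 0060000: block device
	sIFIFO = 4096  // 0010000: FIFO
	sIFREG = 32768 // 0100000: regular file
)

// wouldNotify mimics the masked-equality test on the mode argument:
// it returns true when the file-type bits select a char or block device.
// Hypothetical helper, not part of LXD.
func wouldNotify(mode uint32) bool {
	t := mode & sIFMT
	return t == sIFCHR || t == sIFBLK
}

func main() {
	fmt.Println("char device:", wouldNotify(sIFCHR|0666))
	fmt.Println("block device:", wouldNotify(sIFBLK|0600))
	fmt.Println("fifo:", wouldNotify(sIFIFO|0644))
	fmt.Println("regular file:", wouldNotify(sIFREG|0644))
}
```

Only the first two calls report true, which is consistent with the documentation in patch 2/8: all other file types go to the kernel as usual.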
From 3d4eef7ed0a2b6844fd6419018acc6559edd6942 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:41 -0400
Subject: [PATCH 5/8] scripts: Add new config keys to bash completion
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 scripts/bash/lxd-client | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/bash/lxd-client b/scripts/bash/lxd-client
index e4d9db13ad..4ace85cdcf 100644
--- a/scripts/bash/lxd-client
+++ b/scripts/bash/lxd-client
@@ -90,6 +90,7 @@ _have lxc && {
       security.nesting security.privileged security.protection.delete \
       security.protection.shift security.syscalls.blacklist \
       security.syscalls.blacklist_compat security.syscalls.blacklist_default \
+      security.syscalls.intercept.mknod security.syscalls.intercept.setxattr \
       snapshots.schedule snapshots.schedule.stopped snapshots.pattern \
       volatile.apply_quota volatile.apply_template volatile.base_image \
       volatile.idmap.base volatile.idmap.current volatile.idmap.next \

From 85a5508426d86c15a8f2b7c00f6a25b2bb503887 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:18:58 -0400
Subject: [PATCH 6/8] doc/containers: Add the syscall intercept options
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Stéphane Graber <stgraber at ubuntu.com>
---
 doc/containers.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/containers.md b/doc/containers.md
index daa7998faf..4c33c2c254 100644
--- a/doc/containers.md
+++ b/doc/containers.md
@@ -77,6 +77,8 @@ security.protection.shift               | boolean   | false             | yes
 security.syscalls.blacklist             | string    | -                 | no            | container\_syscall\_filtering        | A '\n' separated list of syscalls to blacklist
 security.syscalls.blacklist\_compat     | boolean   | false             | no            | container\_syscall\_filtering        | On x86\_64 this enables blocking of compat\_\* syscalls, it is a no-op on other arches
 security.syscalls.blacklist\_default    | boolean   | true              | no            | container\_syscall\_filtering        | Enables the default syscall blacklist
+security.syscalls.intercept.mknod       | boolean   | false             | no            | container\_syscall\_intercept        | Handles the `mknod` and `mknodat` system calls (allows creation of a limited subset of char/block devices)
+security.syscalls.intercept.setxattr    | boolean   | false             | no            | container\_syscall\_intercept        | Handles the `setxattr` system call (allows setting a limited subset of restricted extended attributes)
 security.syscalls.whitelist             | string    | -                 | no            | container\_syscall\_filtering        | A '\n' separated list of syscalls to whitelist (mutually exclusive with security.syscalls.blacklist\*)
 snapshots.schedule                      | string    | -                 | no            | snapshot\_scheduling                 | Cron expression (`<minute> <hour> <dom> <month> <dow>`)
 snapshots.schedule.stopped              | bool      | false             | no            | snapshot\_scheduling                 | Controls whether or not stopped containers are to be snapshoted automatically

From 4ebccc1a682795f51401d4d7654c1ddea013e21c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:33:31 -0400
Subject: [PATCH 7/8] lxd/seccomp: Don't mask errors

---
 lxd/seccomp.go | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index 5f5c4dc468..2f8ca5cab0 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -556,7 +556,7 @@ func SeccompCreateProfile(c container) error {
 
 	profile, err := getSeccompProfileContent(c)
 	if err != nil {
-		return nil
+		return err
 	}
 
 	if err := os.MkdirAll(seccompPath, 0700); err != nil {

From 103459091cce6d3c9abbeafd48d2f9eaa8b70b59 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?St=C3=A9phane=20Graber?= <stgraber at ubuntu.com>
Date: Tue, 16 Jul 2019 18:34:07 -0400
Subject: [PATCH 8/8] lxd/seccomp: Rename getSeccompProfileContent to
 seccompGetPolicyContent

---
 lxd/seccomp.go | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lxd/seccomp.go b/lxd/seccomp.go
index 2f8ca5cab0..e0830c7aa1 100644
--- a/lxd/seccomp.go
+++ b/lxd/seccomp.go
@@ -481,7 +481,7 @@ func seccompContainerNeedsIntercept(c container) (bool, error) {
 	return needed, nil
 }
 
-func getSeccompProfileContent(c container) (string, error) {
+func seccompGetPolicyContent(c container) (string, error) {
 	config := c.ExpandedConfig()
 
 	// Full policy override
@@ -554,7 +554,7 @@ func SeccompCreateProfile(c container) error {
 		return nil
 	}
 
-	profile, err := getSeccompProfileContent(c)
+	profile, err := seccompGetPolicyContent(c)
 	if err != nil {
 		return err
 	}

