为什么加载 seccomp 过滤器会影响允许和有效的功能集?

Why does loading seccomp filter affect permitted and effective capability set?

我最近在用 libcaplibseccomp 编写程序,我发现将它们一起使用时出现问题。

在下面的最小可重现示例中,我首先将当前进程的能力设置为仅 P(inheritable) = CAP_NET_RAW,并清除其他能力集。然后,我使用 SCMP_ACT_ALLOW 操作初始化一个 seccomp 过滤器(默认情况下允许所有系统调用),加载它并清理它。

最后,这个程序打印出它当前的能力,并在执行execve()后执行capsh --print来展示它的能力。

#include <linux/capability.h>
#include <sys/capability.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <seccomp.h>

#define CAPSH "/usr/sbin/capsh"

int main(void) {
    cap_value_t net_raw = CAP_NET_RAW;

    cap_t caps = cap_init();
    cap_set_flag(caps, CAP_INHERITABLE, 1, &net_raw, CAP_SET);
    if (cap_set_proc(caps)) {
        perror("cap_set_proc");
    }
    cap_free(caps);

    scmp_filter_ctx ctx;
    if ((ctx = seccomp_init(SCMP_ACT_ALLOW)) == NULL) {
        perror("seccomp_init");
    }

    int rc = 0;
    rc = seccomp_load(ctx); // comment this line later
    if (rc < 0)
        perror("seccomp_load");
    seccomp_release(ctx);

    ssize_t y = 0;
    printf("Process capabilities: %s\n", cap_to_text(cap_get_proc(), &y));
    
    char *argv[] = {
        CAPSH,
        "--print",
        NULL
    };
    execve(CAPSH, argv, NULL);
    return -1;

}

-lcap-lseccomp编译,在root用户(UID=EUID=0)下执行,得到:

Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+i
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

说明当前进程和执行的进程capsh都有inheritable set not empty only。但是,如果我注释行 rc = seccomp_load(ctx);,情况就不同了:

Process capabilities: = cap_net_raw+i
Current: = cap_net_raw+eip cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)

之前execve(),结果同上。但在那之后,所有其他功能都回到允许和有效的集合中。

我查阅了capabilities(7),在手册中找到了以下内容:

Capabilities and execution of programs by root
       In order to mirror traditional UNIX semantics, the kernel performs
       special treatment of file capabilities when a process with UID 0
       (root) executes a program and when a set-user-ID-root program is exe‐
       cuted.

       After having performed any changes to the process effective ID that
       were triggered by the set-user-ID mode bit of the binary—e.g.,
       switching the effective user ID to 0 (root) because a set-user-ID-
       root program was executed—the kernel calculates the file capability
       sets as follows:

       1. If the real or effective user ID of the process is 0 (root), then
          the file inheritable and permitted sets are ignored; instead they
          are notionally considered to be all ones (i.e., all capabilities
          enabled).  (There is one exception to this behavior, described
          below in Set-user-ID-root programs that have file capabilities.)

       2. If the effective user ID of the process is 0 (root) or the file
          effective bit is in fact enabled, then the file effective bit is
          notionally defined to be one (enabled).

       These notional values for the file's capability sets are then used as
       described above to calculate the transformation of the process's
       capabilities during execve(2).

       Thus, when a process with nonzero UIDs execve(2)s a set-user-ID-root
       program that does not have capabilities attached, or when a process
       whose real and effective UIDs are zero execve(2)s a program, the cal‐
       culation of the process's new permitted capabilities simplifies to:

           P'(permitted)   = P(inheritable) | P(bounding)

           P'(effective)   = P'(permitted)

       Consequently, the process gains all capabilities in its permitted and
       effective capability sets, except those masked out by the capability
       bounding set.  (In the calculation of P'(permitted), the P'(ambient)
       term can be simplified away because it is by definition a proper sub‐
       set of P(inheritable).)

       The special treatments of user ID 0 (root) described in this subsec‐
       tion can be disabled using the securebits mechanism described below.

这就是我感到困惑的地方:可继承集不为空,根据简化规则,允许集和有效集都不为空。但是,“加载seccomp过滤器”似乎违反了这个规则。

Seccomp 本身不执行此操作,但 libseccomp 执行此操作。

使用strace,你可以看到seccomp_load实际上执行了三个系统调用:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)  = 0
seccomp(SECCOMP_SET_MODE_STRICT, 1, NULL) = -1 EINVAL (Invalid argument)
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=7, filter=0x5572a6213930}) = 0

注意第一个看起来很可疑。

the kernel documentationno_new_privs

With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.

你引用的capabilities(7)

If the real or effective user ID of the process is 0 (root), then the file inheritable and permitted sets are ignored; instead they are notionally considered to be all ones (i.e., all capabilities enabled).

您的代码创建了一个空的功能集 (cap_t caps = cap_init()),并且只添加 CAP_NET_RAW 作为可继承的,不允许任何功能(如 = cap_net_raw+i 中)。然后,因为为这个线程设置了NO_NEW_PRIVS,当调用execve时,允许的集合是恢复到完整集合 就像根进程通常那样(UID = 0 或 EUID = 0)。这解释了您在使用 seccomp_load().

之前和之后从 capsh --print 中看到的内容

NO_NEW_PRIVS 标志一旦设置就无法重置(prctl(2)), and there's a reason seccomp_load() 默认设置它。

要防止seccomp_load()设置NO_NEW_PRIVS,在加载上下文之前添加以下代码:

seccomp_attr_set(ctx, SCMP_FLTATR_CTL_NNP, 0);

有关详细信息,请参阅 seccomp_attr_set(3)

但是,您可能应该 以正确的方式 将所需的功能也添加到允许的集合中。

cap_set_flag(caps, CAP_PERMITTED, 1, &net_raw, CAP_SET);