Are write failures to cgroup tasks deterministically non-persistent?

Question

Consider the following program.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

void
setup() {
    system("mkdir /sys/fs/cgroup/cpuset/TestingCpuset");
    system("echo 0,1 > /sys/fs/cgroup/cpuset/TestingCpuset/cpuset.cpus");
    system("echo 0 > /sys/fs/cgroup/cpuset/TestingCpuset/cpuset.mems");
}

int
main() {
    setup();
    // Picked to be the pid of an ordinary thread or process on the currently
    // running system.
    const char* validPid = "30100";
    const char* invalidPid = "2";
    const char* taskPath = "/sys/fs/cgroup/cpuset/TestingCpuset/tasks";
    int fd = open(taskPath, O_WRONLY);
    if (fd < 0) {
        fprintf(stderr, "Failed to open %s; errno %d: %s\n", taskPath, errno,
                strerror(errno));
        return 1;
    }
    int retVal = write(fd, invalidPid, strlen(invalidPid));
    if (retVal < 0) {
        fprintf(stderr, "Invalid write of %s to fd %d; errno %d: %s\n",
                invalidPid, fd, errno, strerror(errno));
    }

    retVal = write(fd, validPid, strlen(validPid));
    if (retVal < 0) {
        fprintf(stderr, "Invalid write of %s to fd %d; errno %d: %s\n",
                validPid, fd, errno, strerror(errno));
    }
}

The output of this program (run under sudo) is:

Invalid write of 2 to fd 3; errno 22: Invalid argument

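Note that the subsequent write does not fail; the failure of the first write does not cause the next write to fail.

Is this lack of failure persistence deterministic and reliable?

I looked at the write man page, but it says nothing about failure persistence.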

Answer 1

The write system call can fail for many reasons, and none of that state persists.

Answer 2

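In Linux, there is normally no error state associated with a file descriptor. See the references below.

However, before we go on, please do not check for errors using < 0. If an error occurs (for either open() or write()), the return value is -1. If write() succeeds, it returns the number of bytes written. Even for sysfs writes, you really should check that. There has indeed been a filesystem/kernel bug (or bug "family") where read()/write() returned a negative value other than -1 (one that did not actually indicate an error, but was a wrapped unsigned integer success value, for very large writes to ordinary files), and as a consequence the kernel nowadays limits all reads/writes to slightly less than 2 GiB. If everyone checked for errors using < 0, we would never have discovered it at all.

In my opinion, it is better to be a bit paranoid and catch unexpected errors than to assume and possibly lose data silently.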
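As a minimal sketch of the stricter check described above, the helper below distinguishes the documented failure value (-1), an unexpected negative value, and a short write. The name write_string_strict and the choice of EIO for the unexpected cases are assumptions made for illustration, not part of any library.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (name and EIO mapping are assumptions):
 * write the whole string with a single write() call and classify
 * the result strictly. Returns 0 on success, an errno code otherwise. */
int write_string_strict(int fd, const char *s)
{
    const size_t len = strlen(s);
    const ssize_t n = write(fd, s, len);

    if (n == -1)
        return errno;   /* documented failure: errno says why */
    if (n < 0)
        return EIO;     /* negative but not -1: should never happen */
    if ((size_t)n != len)
        return EIO;     /* short write: only part of the string was accepted */
    return 0;
}
```

In the program from the question, each write() to the tasks file could then be replaced by a call such as write_string_strict(fd, validPid), with any nonzero result passed to strerror().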

Is this lack of failure persistence deterministic and reliable?

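For kernel pseudofiles under /sys/, the answer is yes: each write is considered a separate operation. Earlier reads or writes on the same descriptor do not affect the outcome of the current write.

A write to a sysfs pseudofile simply calls the store() method of the tunable the pseudofile represents; see fs/sysfs/file.c (sysfs_kf_write() for ordinary attributes, sysfs_kf_bin_write() for binary ones). No state is recorded at all.

(We could discuss whether a tunable could record earlier assignment attempts and change its behaviour based on them, but let's just say that Linus Torvalds would not let that kind of thing "fly" at all.)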
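The same pattern can be observed without root privileges. The sketch below uses an eventfd as a stand-in for the cgroup tasks file (an assumption made purely for illustration; it exercises a different kernel code path and proves nothing about cgroups themselves). An eventfd rejects any write shorter than 8 bytes with EINVAL, so a malformed write followed by a well-formed one mirrors the two writes in the question. The function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 0 if a rejected write is followed by a successful write on
 * the same descriptor, -1 otherwise. *failed_errno receives the errno
 * of the rejected write (0 if it was unexpectedly accepted). */
int failed_write_does_not_stick(int *failed_errno)
{
    int fd = eventfd(0, 0);
    if (fd == -1)
        return -1;

    /* An eventfd only accepts 8-byte values, so a 1-byte write is rejected. */
    ssize_t bad = write(fd, "2", 1);
    *failed_errno = (bad == -1) ? errno : 0;

    /* A well-formed write on the same descriptor, right afterwards. */
    uint64_t one = 1;
    ssize_t good = write(fd, &one, sizeof one);

    close(fd);
    return (bad == -1 && good == (ssize_t)sizeof one) ? 0 : -1;
}
```

Calling the function from main() and printing strerror(*failed_errno) gives a message analogous to the single "Invalid argument" line in the question's output.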

In general, the Linux kernel does not store any error state in the file description. If we look at fs/read_write.c:write() (look for SYSCALL_DEFINE3(write, ...)), we can see that the write() syscall in current kernels invokes ksys_write(), which verifies that the descriptor is valid (returning an EBADF error otherwise) and invokes vfs_write(). (It should be noted that if that succeeds, the file position related to the descriptor is updated using file_pos_write(); the file position is not updated atomically. Therefore, multithreaded concurrent writes to the same file descriptor in Linux should use pwrite() or pwritev() instead of write(), to avoid the race window with respect to file position updates.)
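To illustrate the pwrite() recommendation above, here is a small sketch in which every write carries its own offset, so the shared file position is never consulted or updated. The helper names, the fixed record size and the temporary file path are assumptions chosen for the example.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store record number `index` at its own offset.
 * Returns 0 if the whole record was written, -1 otherwise. */
int write_record_at(int fd, size_t index, const void *rec, size_t rec_size)
{
    const off_t offset = (off_t)(index * rec_size);
    const ssize_t n = pwrite(fd, rec, rec_size, offset);

    if (n == -1)
        return -1;
    return ((size_t)n == rec_size) ? 0 : -1;
}

/* Writes two records out of order into a temporary file, then checks
 * the resulting layout. Returns 0 on success, -1 otherwise. */
int pwrite_demo(void)
{
    char path[] = "/tmp/pwrite-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);   /* the file disappears once fd is closed */

    int ok = write_record_at(fd, 1, "BBBB", 4) == 0
          && write_record_at(fd, 0, "AAAA", 4) == 0;

    char buf[8];
    ok = ok && pread(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf
            && memcmp(buf, "AAAABBBB", 8) == 0
            && lseek(fd, 0, SEEK_CUR) == 0;   /* file position never moved */

    close(fd);
    return ok ? 0 : -1;
}
```

With several threads, each one would call write_record_at() with a distinct index; since no call depends on the descriptor's file position, the order in which the threads run does not affect where each record lands.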

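In any case, vfs_write() does some error checking (EBADF, EINVAL, EFAULT) and bookkeeping, and calls __vfs_write(), which is a wrapper that calls the appropriate filesystem-specific function, file->f_op->write() or file->f_op->write_iter().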

(We can also look at fs/file_table.c for how the Linux kernel manages its internal file descriptor table (per userspace process), at include/linux/fdtable.h:struct fdtable for the descriptor table itself, and at include/linux/fs.h:struct file for the definition of a Linux file description. There are no members in any of these structures related to "error state" at all. However, it is useful to note the f_op member in struct file: it is a pointer to a struct file_operations structure, which contains the per-filesystem handlers for basic file operations related to this particular open file (see include/linux/fs.h:struct file_operations).)

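(Note that in Linux, system calls return a single integer. For error cases, this integer contains the negative error number. Zero and positive values are considered success. The C library maintains errno entirely in userspace. If you bypass the C library and issue the raw system call yourself, you need to detect the error condition and maintain errno according to your own needs; the C library's wrappers, including glibc's syscall(), already do this for you. So, when you see a kernel system call returning, say, -EINVAL, it means it returns the error EINVAL to userspace. The C library takes care of converting that to -1 with errno == EINVAL.)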
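The conversion described above can be sketched as a toy userspace function. This is not the C library's actual code; the function name is hypothetical, and the -4095..-1 range reflects the Linux convention of reserving that window for error codes.

```c
#include <errno.h>

/* Toy model of the wrapper logic (hypothetical name): map a raw kernel
 * return value to the C library convention of -1 plus errno. */
long kernel_result_to_libc(long raw)
{
    /* By Linux convention, only -4095..-1 denote errors;
     * every other value is a success result. */
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, kernel_result_to_libc(-EINVAL) yields -1 with errno set to EINVAL, while a success value such as 5 is passed through unchanged and errno is left alone.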

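Again, there is no error state recorded in the descriptor, and each operation occurs independently of the previous ones (except for the file position, which, at the time of writing, is itself not updated atomically). Some filesystem could in theory track operations and maintain an internal error state associated with the descriptor, but unless that were a well-documented feature of the filesystem that other implementations honor, it is unlikely that the Linux kernel developers would actually allow such a thing.

It is important to realize that there are two key principles Linux kernel developers must follow (because Linus enforces them): public kernel interfaces (system calls, /proc and /sys pseudofiles) are stable and compatible across kernel versions (see this LKML message); and sane practice trumps theory, even if mandated by some standard. See for example Torvalds' Wikiquotes, or his posts on the Linux Kernel mailing list (marc.info mirror; lkml.org here).

The reason I trust his opinions is, as he himself put it, "because they know they don't have to". I (try to) do the same myself, which is why this answer hopefully contains enough references for you to verify it for yourself.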

c

linux

system-calls

cgroups