How does `lseek` help determine whether a file is empty?

I was looking at the source code of cat from the GNU coreutils, in particular the circle detection. They compare device and inode, and that works fine. There is, however, an extra case where they allow the output to be an input if the input is empty. Looking at the code, this must be the `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size` part. I read the man pages and a discussion I found via git blame, but I still don't quite understand why this call to `lseek` is needed.

This is the gist of how cat detects whether it would exhaust the disk indefinitely (note that some error checks were also removed for brevity; the full source code is linked above):

struct stat stat_buf;
fstat(STDOUT_FILENO, &stat_buf);
out_dev = stat_buf.st_dev;
out_ino = stat_buf.st_ino;
out_isreg = S_ISREG (stat_buf.st_mode) != 0;

// ...
// for <infile> in inputs {
    input_desc = open (infile, file_open_mode); // or STDIN_FILENO
    fstat(input_desc, &stat_buf);
    /* Don't copy a nonempty regular file to itself, as that would
       merely exhaust the output device.  It's better to catch this
       error earlier rather than later.  */
    if (out_isreg 
        && stat_buf.st_dev == out_dev && stat_buf.st_ino == out_ino
        && lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size)         // <--- This is the important line
    {
      // ...
    }
// } (end of for)

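I have two possible explanations, but both seem a bit odd.

  1. According to some standard (POSIX), a file can be "empty" even though it still contains some information (as counted by `st_size`), and `lseek` or `open` respects this through some default offset. I don't know why that would be the case, because empty means empty, right?
  2. The comparison is really a "clever" combination of two conditions. This made sense to me at first, because if `input_desc` were `STDIN_FILENO` and no file were redirected to stdin, `lseek` would fail with `ESPIPE` (according to the man page) and return -1. The whole statement would then be `lseek(...) == -1 || stat_buf.st_size > 0`. But this cannot be true, because the check only happens when device and inode are the same, and that can only happen if a) stdin and stdout point to the same pty, but then `out_isreg` would be false, or b) stdin and stdout point to the same file, but then `lseek` cannot return -1, right?

I also wrote a small program that prints the return values and errno for the important parts, but nothing stood out: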
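
To make the second explanation concrete, here is a minimal C sketch (not taken from coreutils) of what `lseek` reports for a pipe versus a regular file. The helper names `current_offset`, `pipe_is_unseekable` and `tmpfile_offset` are made up for illustration, and the `ESPIPE` result assumes a POSIX system such as Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of coreutils): return the current offset
   of fd, or (off_t)-1 with errno set if fd cannot be seeked.  */
static off_t current_offset(int fd)
{
    assert(fd >= 0);
    errno = 0;
    return lseek(fd, 0, SEEK_CUR);
}

/* Return 1 if lseek on the read end of a fresh pipe fails with ESPIPE,
   0 if it does not, and -1 if the pipe could not be created.  */
static int pipe_is_unseekable(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    off_t pos = current_offset(fds[0]);
    int saved_errno = errno;
    close(fds[0]);
    close(fds[1]);
    return pos == (off_t)-1 && saved_errno == ESPIPE;
}

/* Return the offset reported for a newly created, empty temporary file,
   or (off_t)-1 if the file could not be created.  */
static off_t tmpfile_offset(void)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return (off_t)-1;
    off_t pos = current_offset(fileno(fp));
    fclose(fp);
    return pos;
}
```

Calling `pipe_is_unseekable()` and `tmpfile_offset()` from a small `main` shows the two cases side by side: the pipe cannot be seeked, while the regular file simply reports its current offset. Since cat only evaluates the `lseek` clause after `out_isreg` and the device/inode comparison have succeeded, the descriptor is a regular file at that point, so the `ESPIPE` case does not come up there.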

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  struct stat out_stat;
  struct stat in_stat;

  if (fstat(STDOUT_FILENO, &out_stat) < 0)
    exit(1);

  printf("this is written to stdout / into the file\n");

  int fd;
  if (argc > 1)
    fd = open(argv[1], O_RDONLY);
  else
    fd = STDIN_FILENO;

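  if (fstat(fd, &in_stat) < 0)
    exit(1);
  errno = 0; /* clear errno so a stale value is not reported below */
  off_t res = lseek(fd, 0, SEEK_CUR);
  fprintf(stderr,
          "errno after lseek = %d, EBADF = %d, EINVAL = %d, EOVERFLOW = %d, "
          "ESPIPE = %d\n",
          errno, EBADF, EINVAL, EOVERFLOW, ESPIPE);

  /* off_t has no portable printf format, so cast to long long */
  fprintf(stderr, "input:\n\tlseek(...) = %lld\n\tst_size = %lld\n",
          (long long)res, (long long)in_stat.st_size);

  printf("outsize is %lld", (long long)out_stat.st_size);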
}

$ touch empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
        lseek(...) = 0
        st_size = 0
$ echo x > empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
        lseek(...) = 0
        st_size = 0

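So my research did not answer my final question: judging from the cat source code, how does `lseek` help determine whether a file is empty in this example?

This is my attempt at reverse engineering it. I could not find any public discussion that explains why the `lseek()` was put there (no other place in GNU coreutils does this).

The guiding question is: when is the condition `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size` false?

Test case: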

#!/bin/bash
# (edited based on comments)

set -x

# arrange for cat to start off past the end of a non-empty file

echo abcdefghi > /tmp/so/catseek/input
# get the shell to open the input file for reading & writing as file descriptor 7
exec 7<>/tmp/so/catseek/input
# read the whole file via that descriptor (but leave it open)
dd <&7
# ask linux what the current file position of file descriptor 7 is
# should be everything dd read, namely 10 bytes, the size of the file
grep ^pos: /proc/self/fdinfo/7
# run cat, with pre and post content so that we know how to locate the interesting part
# "-" will cause cat to reuse its file descriptor 0 rather than creating a new file descriptor
# the redirections tell the shell to redirect file descriptors 1 and 0 to/from our open file descriptor 7
# which, as you'll remember, already has a file position of 10 bytes
strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post <&7 >&7
# now let's see what's in the file
cat /tmp/so/catseek/input

With:

$ cat /tmp/so/catseek/pre
pre
$ cat /tmp/so/catseek/post
post

cat with `lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size`:

+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 2.0641e-05 s, 484 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos:    10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
lseek(0, 0, SEEK_CUR)                   = 14
+++ exited with 0 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post

cat with `0 < stat_buf.st_size`:

+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 3.6415e-05 s, 275 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos:    10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
./src/cat: -: input file is output file
+++ exited with 1 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post

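As you can see, the file position can already be at or past the end of the file when cat starts. Checking only the file size would make cat skip the file, but it would also trigger a failure, because the code inside the if statement is: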

error (0, 0, _("%s: input file is output file"), infile);
ok = false;
goto contin;

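Using `lseek()` allows cat to say "oh, the file is the same and it isn't empty, but our reads will still come back empty, because that is how reading past EOF works, so we can allow this case".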
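
The same situation can be reproduced without the shell. The C sketch below is only an illustration under stated assumptions: `has_unread_data` and `demo` are hypothetical names, the temporary file stands in for the shared descriptor 7 from the test script, and only the last clause of cat's check is mirrored (the device and inode comparison is left out).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper mirroring the last clause of cat's check: return 1
   when bytes remain between the current offset and st_size, 0 when the
   offset is already at or past the end, and -1 if fstat fails.  */
static int has_unread_data(int fd)
{
    struct stat st;
    assert(fd >= 0);
    if (fstat(fd, &st) != 0)
        return -1;
    return lseek(fd, 0, SEEK_CUR) < st.st_size;
}

/* Write 10 bytes to a temporary file, then evaluate the check once with
   the offset at the end and once with the offset at the start.
   Results go to the out parameters; returns 0 on success, -1 on error.  */
static int demo(int *at_start, int *at_end, ssize_t *read_at_end)
{
    char buf[16];
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    int fd = fileno(fp);

    if (write(fd, "abcdefghi\n", 10) != 10) {
        fclose(fp);
        return -1;
    }

    /* After the write the offset is 10, equal to st_size.  */
    *at_end = has_unread_data(fd);
    *read_at_end = read(fd, buf, sizeof buf); /* read at EOF */

    /* Rewind: now all 10 bytes lie ahead of the offset.  */
    if (lseek(fd, 0, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }
    *at_start = has_unread_data(fd);

    fclose(fp);
    return 0;
}
```

With the offset at the end of a non-empty file, the check is false and `read` returns 0, which is the case cat lets through. After rewinding to the start, the check is true, which is the case cat rejects when input and output are the same file.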