Detecting and removing corrupt null-filled files efficiently and reliably

Question

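I need a portable shell function/script that checks whether a file is filled with null bytes.

I have considered looking at file signatures, magic bytes and so on, but I don't want to rely on any indirect assumptions (other than known mechanics, such as the hexdump output below).

This is what I am using right now: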

#!/bin/bash
# Exit codes:
#   0: File is good
#   1: Validation error
#   2: File is cactus (null-filled)

file="$1"
bail () {
    >&2 echo "$file doesn't exist"
    exit 1
}

[ -f "$1" ] || bail                                         # Check valid filename
result="$(head -c4 "$file" | hexdump -ve '1/1 "%.2X"')"     # Get first 4 bytes as pre-condition
if [ "$result" == "0000" ] || [ "$result" == "00000000" ]
then
    # Check for large contiguous blocks of null.
    # The inner "if" runs in a pipeline subshell; since nothing follows it,
    # its exit status becomes the exit status of the script.
    head -c10000 "$file" | hexdump | \
    if [[ "$(wc -l <<<"$(cat -)")" -le 4 ]]                 # By virtue of pre-condition, all output must be null
    then
        exit 2
    else
        exit 0
    fi
else
    exit 0
fi

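The idea is:

Check the first 4 bytes as a pre-condition.
Use hexdump without -v so that repeated lines of the pre-condition value (null bytes) are collapsed.
Limit excess output with head -c (the script above reads at most 10000 bytes).
Use wc -l to check for <= 4 lines. Anything more indicates a change from null.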
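
To make the grouping step concrete, here is a minimal sketch of the same four steps. It is not the script above: od stands in for hexdump (an assumption made here because od is specified by POSIX and likewise collapses repeated output lines into a single * when -v is omitted), and the function names dump_line_count and looks_null_filled are hypothetical.

```shell
#!/bin/sh
# Sketch only: od is swapped in for hexdump (assumption, not the original script).

# Print how many lines od emits for the first 10000 bytes of file $1.
# Without -v, od replaces runs of identical output lines with a single "*".
dump_line_count() {
    head -c 10000 "$1" | od | wc -l | tr -d ' '
}

# Succeed when the leading bytes are null and the collapsed dump has <= 4 lines.
looks_null_filled() {
    # Pre-condition: hex of the first 4 bytes (or 2, for a 2-byte file) is all zeros.
    first=$(head -c 4 "$1" | od -An -v -tx1 | tr -d ' \n')
    case $first in
        0000|00000000) ;;
        *) return 1 ;;
    esac
    [ "$(dump_line_count "$1")" -le 4 ]
}
```

The heuristic has a blind spot that comes from the line-count test itself: a short file that starts with four null bytes produces very few dump lines regardless of what follows, so it would also be reported as null-filled.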

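The reason for some of the quirky and indirect syntax is that bash gave me this warning on some machines: bash: warning: command substitution: ignored null byte in input, and I found that this works around it.

Iterating over files with the above seems slow, but it is workable for now: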

real    0m0.026s
user    0m0.009s
sys     0m0.021s

Is there a better, more efficient way to do this?

Answer 1

If all the systems on which you need to run the check support /dev/zero, then you can test whether a file contains only null bytes with:

[[ $(LC_ALL=C cmp -- "$file" /dev/zero 2>&1) == 'cmp: EOF on '* ]]

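The POSIX man page for cmp (cmp (The Open Group Base Specifications Issue 7)) gives a precise specification for the STDERR output in the POSIX locale. LC_ALL=C forces cmp to use the POSIX locale, so the comparison against 'cmp: EOF on '* works reliably.
The test succeeds for empty files. If you don't want that, you can add a non-empty check to the test: [[ -s $file && ... ]].
The -- in the cmp command arguments prevents files whose names begin with - from being treated as cmp options.
Beware of very large files, or files that appear to be very large (sparse files). cmp may take a long time to process such files. You may want to consider skipping files above a threshold size.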
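
The notes above can be combined into one function. The following is a minimal sketch in POSIX sh, using case in place of [[ ... ]] so that it does not depend on bash. The function name is_null_filled, the nf_ variable names, the return-code scheme, and the 100 MiB default threshold are all assumptions made for illustration, not part of the answer.

```shell
#!/bin/sh
# Hypothetical helper: is_null_filled FILE [MAX_BYTES]
# Return codes (illustrative scheme, not from the original answer):
#   0: file is non-empty and contains only null bytes
#   1: file is empty or contains at least one non-null byte
#   2: FILE is not a regular file
#   3: file is larger than MAX_BYTES and was skipped
is_null_filled() {
    nf_file=$1
    nf_max=${2:-104857600}            # assumed default threshold: 100 MiB

    [ -f "$nf_file" ] || return 2
    [ -s "$nf_file" ] || return 1     # treat empty files as "not null-filled"

    # $(( ... )) strips the padding some wc implementations add.
    nf_size=$(( $(wc -c < "$nf_file") ))
    [ "$nf_size" -le "$nf_max" ] || return 3

    # In the POSIX locale, cmp reports "cmp: EOF on <name>" on stderr when the
    # shorter file matches the longer one over its whole length.
    case $(LC_ALL=C cmp -- "$nf_file" /dev/zero 2>&1) in
        'cmp: EOF on '*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could then delete a corrupt file only on a definite match, for example: is_null_filled "$f" && rm -- "$f". Files that return 3 are left alone so that a very large or sparse file never triggers a long cmp run.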
