Do md5sum on files inside a directory and check if there are identical files

Question

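I'm studying shell scripting and have an exercise that asks me to compute the md5 hash of every file in a folder. It also asks that, if two files have the same hash, their names be printed in the terminal. My code does this, but once a match is found it is printed twice. I don't know how to exclude the first file name from the following iterations. One more thing: creating any temporary files to help with the task is forbidden.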

#!/bin/bash

ifs=$IFS
IFS=$'\n'

echo "Verifying the files inside the directory..."

for file1 in $(find . -maxdepth 1 -type f | cut -d "/" -f2); do
  md51=$(md5sum $file1  | cut -d " " -f1)
  for file2 in $(find . -maxdepth 1 -type f | cut -d "/" -f2 | grep -v "$file1"); do
    md52=$(md5sum $file2 | cut -d " " -f1)
    if [ "$md51" == "$md52" ]; then
      echo "Files $file1 e $file2 are the same."
    fi
  done
done

I would also like to know whether there is a more efficient way to do this.

Answer 1

This:

mapfile -t list < <(find . -maxdepth 1 -type f -exec md5sum {} + | sort)
mapfile -t dups < <(printf "%s\n" "${list[@]}" | grep -f <(printf "^%s\n" "${list[@]}" | sed 's/ .*//' | sort | uniq -d))

# the array dups now contains all duplicates along with their md5sums
# you can print the array with a simple
printf "%s\n" "${dups[@]}"

and produces output like this:

3b0332e02daabf31651a5a0d81ba830a  ./f2.txt
3b0332e02daabf31651a5a0d81ba830a  ./fff
c9eb23b681c34412f6e6f3168e3990a4  ./both.txt
c9eb23b681c34412f6e6f3168e3990a4  ./f_out
d41d8cd98f00b204e9800998ecf8427e  ./aa
d41d8cd98f00b204e9800998ecf8427e  ./abc def.xxx
d41d8cd98f00b204e9800998ecf8427e  ./dudu
d41d8cd98f00b204e9800998ecf8427e  ./start
d41d8cd98f00b204e9800998ecf8427e  ./xx_yy

The following is added only for prettier output:

echo "duplicates:"
while read md5; do
        echo "$md5"
        printf "%s\n" "${dups[@]}" | grep "$md5" | sed 's/[^ ]* /  /'
done < <(printf "%s\n" "${dups[@]}" | sed 's/ .*//' | sort -u)

which prints something like:

3b0332e02daabf31651a5a0d81ba830a
   ./f2.txt
   ./fff
c9eb23b681c34412f6e6f3168e3990a4
   ./both.txt
   ./f_out
d41d8cd98f00b204e9800998ecf8427e
   ./aa
   ./abc def.xxx
   ./dudu
   ./start
   ./xx_yy

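Warning: this works only if the file names do not contain the \n (newline) character. Modifying the script to work in general needs bash 4.4+, where mapfile knows the -d parameter.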
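To illustrate the warning above, here is a rough sketch of what a NUL-delimited variant might look like. It assumes bash 4.4+ (for mapfile -d) and GNU coreutils 8.25 or newer (for md5sum -z, which also disables file name escaping). The function name print_dups_nul and its directory argument are made up for this example and are not part of the original answer.

```shell
#!/bin/bash
# Sketch only: assumes bash 4.4+ and GNU md5sum with -z/--zero.
# print_dups_nul is a hypothetical helper name.
print_dups_nul() {
  local dir=${1:-.}
  local list=() entry md5 prev_md5='' prev_entry='' printed=0

  # Every record is "<md5>  <path>" terminated by NUL instead of newline,
  # so a newline inside a file name cannot split a record.
  mapfile -d '' -t list < <(
    find "$dir" -maxdepth 1 -type f -exec md5sum -z {} + | sort -z
  )

  for entry in "${list[@]}"; do
    md5=${entry%% *}                 # text before the first space
    if [[ $md5 == "$prev_md5" ]]; then
      # Print the first member of a group only once.
      (( printed )) || printf '%s\n' "$prev_entry"
      printf '%s\n' "$entry"
      printed=1
    else
      printed=0
    fi
    prev_md5=$md5
    prev_entry=$entry
  done
}
```

Calling print_dups_nul . lists every file in the current directory that shares its hash with at least one other file. A name containing a newline is still printed across two lines, but it is no longer mistaken for two separate records.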

Answer 2

Here is a more efficient way that does not use any temporary files:

#!/bin/bash

# get the sorted md5sum list of all files into an array in one shot
readarray -t arr < <(find . -maxdepth 1 -type f -exec md5sum {} + | sort)
# loop through the array and compare md5sum of contiguous items
for i in "${arr[@]}"; do
  md5="${i%% *}" # extract the md5sum part (everything before the first space)
  [[ "$md5" = "$prev_md5" ]] && printf '%s\n' "$prev_i" "$i"
  prev_md5="$md5"
  prev_i="$i"
done | sort -u

The sort -u is needed to remove the repeated lines printed when more than two files are identical.
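As a possible alternative to the loop, the comparison of adjacent hashes can probably be left to uniq. This is only a sketch: it assumes GNU uniq (the -w and --all-repeated options are GNU extensions) and file names without newlines or backslashes, because md5sum escapes such names and prefixes the line with a backslash. The function name list_dups_uniq is invented for the example.

```shell
#!/bin/bash
# Sketch only: assumes GNU uniq; list_dups_uniq is a hypothetical helper name.
list_dups_uniq() {
  local dir=${1:-.}
  # An MD5 hex digest is 32 characters long, so -w 32 makes uniq compare
  # only the hash column. --all-repeated=separate prints every line that
  # belongs to a repeated group, with a blank line between groups.
  find "$dir" -maxdepth 1 -type f -exec md5sum {} + |
    sort |
    uniq -w 32 --all-repeated=separate
}
```

Running list_dups_uniq . prints each group of identical files once, so no sort -u pass is needed afterwards.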
