Why is the comparison of two md5sum files not working properly?
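I have two lists of files with their md5sum checksums, and the lists contain different paths for the same files.
Example of the content of the first file with checksums (server.list):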
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz/
6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz
Example of the content of the second file with checksums (downloaded.list):
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03 /home/projects/fastq1_L002_R1_001.fastq.gz
When I run the following command, I get the output shown below it:
awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list
fastq1_L001_R1_001.fastq.gz has a different md5sum
fastq1_L001_R2_001.fastq.gz has a different md5sum
fastq1_L002_R2_001.fastq.gz has a different md5sum
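Why do I get this message when the first column is the same in both files? Can someone enlighten me on this issue?
Edit:
If I delete the paths and leave only the file names, it works fine.
Edit 2:
As mentioned, there is another possible form of the file path, one that does not start with /. In that case I cannot use / as the field separator.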
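Assumptions:
- the file name (without path) and the md5sum must both match
- the file names may not be listed in the same order
- a file name may not exist in both files
Sample data: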
$ head downloaded.list server.list
==> downloaded.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz # match
YYYYf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R5_911.fastq.gz # different md5sum
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz # match
MNOPf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R8_abc.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R9_004.fastq.gz # different filename but matching md5sum (vs last line of other file)
==> server.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz # match
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz # match
XXXXf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R5_911.fastq.gz # different md5sum
TUVWff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L999_R6_922.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R7_933.fastq.gz # different filename but matching md5sum (vs last line of other file)
One awk idea that addresses the white space issue and also verifies that the file names match:
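awk '                                  # stick with default field delimiter of white space but ...
{ md5sum=$1
  n=split($2,arr,"/")                  # split 2nd field on "/" delimiter
  fname=arr[n]                         # last element is the file name without its path
  if (FNR==NR)                         # 1st file: remember the md5sum for each file name
     filearray[fname]=md5sum
  else {                               # 2nd file: skip entries whose name and md5sum both match
     if (fname in filearray && filearray[fname] == md5sum)
        next
     printf "%s has a different md5sum\n",fname
  }
}
' downloaded.list server.list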
This generates:
fastq1_L001_R5_911.fastq.gz has a different md5sum
fastq1_L999_R6_922.fastq.gz has a different md5sum
fastq1_L001_R7_933.fastq.gz has a different md5sum
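The output above uses the same message for a file whose checksum differs and for a file name that is missing from downloaded.list. If telling those two cases apart is useful, the same idea can be sketched as follows. The function name `compare_lists`, the array name `seen`, and the wording of the "missing" message are illustrative assumptions, not part of the original answer. As in the code above, it assumes file names contain no white space.

```shell
# Hypothetical helper: compare_lists DOWNLOADED_LIST SERVER_LIST
# For each entry in SERVER_LIST, report whether its file name is absent from
# DOWNLOADED_LIST or present there with a different checksum.
compare_lists() {
    awk '
    {
        md5sum = $1 ""                       # append "" so checksums always compare as strings
        n = split($2, parts, "/")            # split the path on "/"
        while (n > 1 && parts[n] == "") n--  # tolerate a trailing "/" on the path
        fname = parts[n]                     # last non-empty component is the file name
    }
    FNR == NR             { seen[fname] = md5sum; next }   # 1st file: remember checksums
    !(fname in seen)      { printf "%s is missing from %s\n", fname, ARGV[1]; next }
    seen[fname] != md5sum { printf "%s has a different md5sum\n", fname }
    ' "$1" "$2"
}
```

Calling `compare_lists downloaded.list server.list` on the sample data would flag fastq1_L001_R5_911.fastq.gz as having a different md5sum and the other two names as missing, because those names never appear in downloaded.list.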
The space in $1, which is used as the array key, caused the problem. Remove it:
awk -F"/" '{gsub(/ /, "", $1)}; FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' list1.txt list2.txt
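To see what the array key actually holds under `-F"/"`, it can help to print the first field between brackets. The sketch below does that for two sample lines from the question. The helper names `show_key` and `show_stripped_key` are made up for illustration. It also shows a limitation: for a path that does not start with /, removing spaces still leaves the leading directory name glued to the checksum, so the key would not match the other list.

```shell
# Print the first "/"-separated field in brackets so embedded spaces are visible.
show_key() {
    printf '%s\n' "$1" | awk -F"/" '{ printf "[%s]\n", $1 }'
}

# Same, but after removing spaces from the first field with gsub.
show_stripped_key() {
    printf '%s\n' "$1" | awk -F"/" '{ gsub(/ /, "", $1); printf "[%s]\n", $1 }'
}

show_key '2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz'
# prints [2c03ff18a643a1437ec0cf051b8b7b9d ]   (note the trailing space)

show_key '6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz'
# prints [6e6bcd84f264233cf7c428c0cfdc0c03 tmp]

show_stripped_key '6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz'
# prints [6e6bcd84f264233cf7c428c0cfdc0c03tmp]
```

Because the key is everything before the first /, any difference in the white space between the checksum and the path, or a relative path, changes the key even when the checksum itself is identical.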