How to create and save pairwise Hamming distances of image perceptual hashes for input to a clustering algorithm
Hoping someone can provide guidance on how to calculate pairwise Hamming distances for a bunch of hashes and then cluster them. I'm not too concerned about performance; looking at what I'm doing and what I want to do, it's going to be slow regardless, and it won't be run over and over again.
So... in short, I mistakenly wiped 1000s of photos off a drive with no backup (I know... bad practice). Using various tools I was able to recover a very high percentage of them from the drive, but that still left hundreds of the 1000s of photos unaccounted for. Because of the techniques used to recover some of the photos (e.g. file carving), some of the images are corrupt to varying degrees, others are identical copies, and some are visually essentially identical but differ byte for byte.
What I want to do is the following:
- check each image and determine whether the image file is structurally corrupt (done)
- generate a perceptual hash (fingerprint) for each image so that images can be compared for similarity and clustered (the fingerprint part is done)
- calculate the pairwise distances between the fingerprints
- cluster the pairwise distances so that similar images can be reviewed together to help with manual cleanup
In the attached script you will notice a couple of places where I calculate hashes; I'll explain so as not to cause confusion (a short sketch of the hash-and-compare step follows this list)...
- for images supported by PIL, I generate three hashes: the first for the image as-is, the second rotated 90 degrees, and the third rotated 180 degrees. This is done so that once the pairwise calculations are complete, I can account for images that simply differ in orientation.
- for raw images not supported by PIL, I prefer the hashes generated from the extracted embedded preview image. I do this rather than use the raw image because, when the raw image file is corrupt, there is a good chance the preview image is still intact thanks to its smaller size, giving a better shot at recognizing whether the image is similar to others.
- the other place hashes are generated is during a last-ditch effort to identify corrupt raw images. I compare the hash of the extracted/converted raw image against the hash of the extracted embedded preview image, and if the similarity does not meet a defined threshold, I assume the raw file as a whole is probably corrupt.
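For reference, a minimal sketch of that hash-and-compare step (the helper names here are mine, not from the script; it relies on the fact that subtracting two imagehash.ImageHash objects returns the number of differing bits):

from PIL import Image
import imagehash

def rotation_hashes(path, hash_size=32):
    """Perceptual hashes for the image as-is, rotated 90 and rotated 180 degrees."""
    with Image.open(path) as im:
        return [imagehash.dhash(im.rotate(90 * x, expand=True), hash_size)
                for x in range(3)]

def min_hamming(hashes_a, hashes_b):
    """Most similar (smallest) normalized Hamming distance across all pairings."""
    bits = hashes_a[0].hash.size  # total bits per hash (hash_size ** 2)
    return min((ha - hb) / bits for ha in hashes_a for hb in hashes_b)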
What I need guidance on is how to accomplish the following:
- take the three hashes for each image and calculate the pairwise Hamming distances
- keep only the most similar Hamming distance for each image comparison
- feed the results into scipy hierarchical clustering so that I can group similar images
I'm just learning Python, so that's part of my challenge... From what I've gathered from Google, I think I can get there by first using scipy.spatial.distance.pdist, then processing that to keep the most similar distance for each image comparison, and then feeding that to a scipy clustering function. But I don't know how to organize this, supply things in the right format, etc. Can anyone provide some guidance? (One possible arrangement is sketched below.)
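A minimal, non-authoritative sketch of how those three steps could fit together, assuming all_hashes is a list with one list of ImageHash objects per image (e.g. the three rotation hashes described above). scipy's linkage accepts a precomputed condensed distance vector directly, so it can stand in for pdist output:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pairwise_min_distances(all_hashes):
    """Condensed distance vector (pdist layout), keeping only the most
    similar (smallest) normalized Hamming distance per image pair."""
    bits = all_hashes[0][0].hash.size  # total bits per hash (hash_size ** 2)
    out = []
    for i in range(len(all_hashes)):
        for j in range(i + 1, len(all_hashes)):
            # subtracting two ImageHash objects yields the number of differing bits
            out.append(min(ha - hb for ha in all_hashes[i] for hb in all_hashes[j]) / bits)
    return np.array(out)

X = pairwise_min_distances(all_hashes)             # plays the role of pdist output
Z = linkage(X, method='single')                    # hierarchical clustering
groups = fcluster(Z, t=0.2, criterion='distance')  # flat groups below a threshold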
Here is my current script for reference, in case anyone finds it interesting; note that I would need to change it to store the hashes in some kind of dictionary, or some kind of on-disk storage.
from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg', '.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')
corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')

for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes = [None] * 4
            print(filename)
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                metadata = pyexiv2.ImageMetadata(filepath)
                metadata.read()
                rotate = 0
                try:
                    im = Image.open(filepath)
                except:
                    pass
                else:
                    for x in range(3):
                        hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)), 32)
                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg', '.jpeg'):
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'
                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'
            # raw image processing
            else:
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]
                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)
                    rotate = 0
                    try:
                        im = Image.open(temp_preview.name)
                    except:
                        pass
                    else:
                        for x in range(4):
                            hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)), 32)
                    # close temp file
                    temp_preview.close()
                # try to load raw using libraw via rawpy first,
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    with rawpy.imread(filepath) as im:
                        pass
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'
                else:
                    # as a final last ditch effort compare perceptual hashes of extracted
                    # raw and embedded preview to detect possible internal corruption
                    if len(metadata_orig.previews) > 0:
                        # extract and convert raw to jpeg image using ufraw
                        temp_raw = NamedTemporaryFile(suffix='.jpg')
                        try:
                            check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg',
                                        '--compression=95', '--noexif', '--lensfun=none',
                                        '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],
                                       stdout=devnull, stderr=devnull)
                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'Ufraw-conv'
                        else:
                            rhash = imagehash.dhash(Image.open(temp_raw.name), 32)
                            # compare preview with raw image and compute the most similar hamming distance (best)
                            hamdiff = 0.0
                            for h in range(4):
                                # fraction of matching hex digits between the two hash strings (a similarity measure)
                                hamdiff = max((256 - sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(str(hashes[h]), str(rhash)))) / 256, hamdiff)
                            if hamdiff < .7:  # raw file is probably corrupt
                                corrupt_str = 'hash' + str(round(hamdiff * 100, 2))
                        # close temp files
                        temp_raw.close()
                        print(hamdiff)
                        print(rhash)
                        print(hashes[0])
                        print(hashes[1])
                        print(hashes[2])
                        print(hashes[3])
            # tag the filename if corruption was detected, ensuring files already tagged are re-tagged
            mo = corruptRegex.search(filename)
            if corrupt_str is not None:
                if mo is not None:
                    os.rename(filepath, os.path.join(root, re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext))
                else:
                    os.rename(filepath, os.path.join(root, os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext))
            else:
                if mo is not None:
                    os.rename(filepath, os.path.join(root, re.sub(corruptRegex, '', filename) + ext))
EDIT
Just wanted to provide an update with what I eventually came up with, which seems to work very well for my intended purpose; maybe it will be useful to other users in a similar situation. The script could still use some polish, but otherwise everything is there. As I'm fairly new to Python, if anyone sees something that could be greatly improved, please let me know.
The script does the following:
- attempts to detect image corruption in terms of file structure, using various methods. For the raw image formats (NEF, DNG, TIF), I sometimes found that a corrupt image could still load fine, so I decided to hash both the embedded preview image and a .jpg extracted from the raw image data, and compare the hashes; if they are not similar enough, I assume the image is corrupt in some form.
- creates perceptual hashes for every image that can be loaded. Three are created for the base file (as-is, rotated 90 degrees, rotated 180 degrees). In addition, for raw images, 3 more hashes are created for the extracted preview image; that way, if the raw image is corrupt, we still have hashes based on the full image (assuming the preview is fine).
- images identified as corrupt are renamed with a suffix indicating the corruption and what determined it.
- pairwise Hamming distances are calculated by comparing the hashes across all file pairs and are stored in a numpy array.
- the square form of the pairwise distances is fed to fastcluster for clustering
- the output of fastcluster is used to generate a dendrogram to visualize clusters of similar images
I save the numpy array to disk so that I can later re-run the fastcluster/dendrogram part, which is slow, without recalculating the hashes for each file. This is something I had to alter the script to allow for...
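A minimal sketch of that re-run, assuming the square distance matrix was saved to numpyarray.npy as in the script below (note that the script persists only the matrix, so the filelist labels would need to be saved separately, e.g. with another np.save, to label the dendrogram):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import dendrogram
import fastcluster
import matplotlib.pyplot as plt

a = np.load('numpyarray.npy')    # square pairwise-distance matrix saved earlier
X = squareform(a)                # condensed form that fastcluster expects
linkage = fastcluster.single(X)  # single-linkage hierarchical clustering
dendrogram(linkage, orientation='right')
plt.savefig('dendrogram_rerun.pdf', bbox_inches='tight')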
from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import fastcluster
import matplotlib.pyplot as plt

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg', '.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')
corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')

hashes = []
filelist = []
for root, _, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        relpath = os.path.relpath(root, sys.argv[1])
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes_tmp = []
            rhash = []
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                try:
                    im = Image.open(filepath)
                    for x in range(3):
                        hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1), 32)))
                except:
                    pass
                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg', '.jpeg'):
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'
                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'
            # raw image processing
            if ext in (rawimageext):
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]
                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)
                    try:
                        im = Image.open(temp_preview.name)
                        for x in range(3):
                            hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1), 32)))
                    except:
                        pass
                # try to load raw using libraw via rawpy first,
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    im = rawpy.imread(filepath)
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'
                else:
                    # as a final last ditch effort compare perceptual hashes of extracted
                    # raw and embedded preview to detect possible internal corruption
                    # extract and convert raw to jpeg image using ufraw
                    temp_raw = NamedTemporaryFile(suffix='.jpg')
                    try:
                        check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg',
                                    '--compression=95', '--noexif', '--lensfun=none',
                                    '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],
                                   stdout=devnull, stderr=devnull)
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'Ufraw-conv'
                    else:
                        try:
                            im = Image.open(temp_raw.name)
                            for x in range(3):
                                rhash.append(str(imagehash.dhash(im.rotate(90 * x, expand=1), 32)))
                        except:
                            pass
                    # compare preview with raw image and keep the most similar hamming distance (best)
                    if len(hashes_tmp) > 0 and len(rhash) > 0:
                        hamdiff = 1
                        for rh in rhash:
                            # fraction of differing hex digits between the two hash strings
                            hamdiff = min(hamdiff, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(hashes_tmp[0], rh)) / len(hashes_tmp[0])))
                        if hamdiff > .3:  # raw file is probably corrupt
                            corrupt_str = 'hash' + str(round(hamdiff * 100, 2))
                    hashes_tmp = hashes_tmp + rhash
            # tag the filename if corruption was detected, ensuring files already tagged are re-tagged
            mo = corruptRegex.search(filename)
            newfilename = None
            if corrupt_str is not None:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext
                else:
                    newfilename = os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext
            else:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '', filename) + ext
            if newfilename is not None:
                os.rename(filepath, os.path.join(root, newfilename))
            if len(hashes_tmp) > 0:
                hashes.append(hashes_tmp)
                if newfilename is not None:
                    filelist.append(os.path.join(relpath, newfilename))
                else:
                    filelist.append(os.path.join(relpath, filename))

print(len(filelist))
print(len(hashes))

# square pairwise-distance matrix: for each file pair keep the most similar
# (smallest) hamming distance across all of their hashes
a = np.empty(shape=(len(filelist), len(filelist)))
for hash_idx1, hash in enumerate(hashes):
    a[hash_idx1, hash_idx1] = 0
    hash_idx2 = hash_idx1 + 1
    while hash_idx2 < len(hashes):
        ham_dist = 1
        for h1 in hash:
            for h2 in hashes[hash_idx2]:
                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2))) / len(h1))
        a[hash_idx1, hash_idx2] = ham_dist
        a[hash_idx2, hash_idx1] = ham_dist
        hash_idx2 = hash_idx2 + 1
print(a)

X = squareform(a)
print(X)
linkage = fastcluster.single(X)
clustdict = {i: [i] for i in range(len(linkage) + 1)}

fig = plt.figure(figsize=(25, 25))
plt.title('test title')
plt.xlabel('perceptual hash hamming distance')
plt.axvline(x=.15, c='red', linestyle='--')
dg = dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
ax = fig.gca()
ax.set_xlim(-.01, ax.get_xlim()[1])
plt.savefig('foo1.pdf', bbox_inches='tight', dpi=100)
plt.show()

with open('numpyarray.npy', 'wb') as f:
    np.save(f, a)
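One note on reading the result: instead of eyeballing the dendrogram, flat group labels can also be pulled from the same linkage with scipy's fcluster (a sketch; the 0.15 threshold mirrors the axvline above):

from scipy.cluster.hierarchy import fcluster

# one flat cluster label per file at the marked distance threshold
groups = fcluster(linkage, t=.15, criterion='distance')
for label, fname in sorted(zip(groups, filelist)):
    print(label, fname)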
It took a while... but I eventually figured things out and ended up with a script that does a good job of identifying whether an image is corrupt, and then uses perceptual hashes to try to group similar images together.
from PIL import Image, ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import Popen, PIPE
import shlex
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster
from scipy.spatial.distance import squareform
import fastcluster
#import matplotlib.pyplot as plt
import math
import string
from wand.image import Image as wImage
import wand.exceptions
from io import BytesIO
from datetime import datetime
#import fd_table_status

def redirect_stdout():
    # redirect the fd-level stdout/stderr to /dev/null (silences noisy C libraries)
    # while keeping Python-level print() working on duplicated descriptors
    print("Redirecting stdout and stderr")
    sys.stdout.flush()  # <--- important when redirecting to files
    sys.stderr.flush()
    newstdout = os.dup(1)
    newstderr = os.dup(2)
    devnull = os.open(os.devnull, os.O_WRONLY)
    devnull2 = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)
    os.dup2(devnull2, 2)
    os.close(devnull)
    os.close(devnull2)
    sys.stdout = os.fdopen(newstdout, 'w')
    sys.stderr = os.fdopen(newstderr, 'w')
redirect_stdout()

def ct(linkage_matrix, flist, score):
    # walk each leaf up the dendrogram while the merge distance stays at or
    # below 'score', accumulating a zero-padded cluster-id string per file
    cluster_id = []
    for fidx, file_ in enumerate(flist):
        fcluster_idx = None  # default when the leaf never merges below the score
        link_ = np.where(linkage_matrix[:, :2] == fidx)[0]
        if len(link_) == 1:
            link = link_[0]
            if linkage_matrix[link][2] <= score:
                fcluster_idx = str(link).zfill(len(str(len(linkage_matrix))))
                while True:
                    match = np.where(linkage_matrix[:, :2] == link + 1 + len(linkage_matrix))[0]
                    if len(match) == 1:
                        link = match[0]
                        link_d = linkage_matrix[link]
                        if link_d[2] <= score:
                            fcluster_idx = str(match[0]).zfill(len(str(len(linkage_matrix)))) + fcluster_idx
                        else:
                            break
                    else:
                        break
            else:
                fcluster_idx = None
        cluster_id.append(fcluster_idx)
    return cluster_id

def get_exitcode_stdout_stderr(cmd):
    """
    Execute the external command and get its exitcode, stdout and stderr.
    """
    args = shlex.split(cmd)
    proc = Popen(args, stdout=PIPE, stderr=PIPE, close_fds=True)
    out, err = proc.communicate()
    exitcode = proc.returncode
    del proc
    return exitcode, out, err

if os.path.isdir(sys.argv[1]):
    start_time = datetime.now()
    # allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
    ImageFile.LOAD_TRUNCATED_IMAGES = True
    # image files this script will handle
    # PIL supported image formats
    stdimageext = ('.jpg', '.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
    # libraw/ufraw supported formats
    rawimageext = ('.nef', '.dng', '.tif', '.tiff')
    corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
    groupRegex = re.compile(r'^\[\d+\]_')
    ufrawRegex = re.compile(r'Corrupt data near|Unexpected end of file|has the wrong dimensions!|Cannot open file|Cannot decode file|requests a nonexistent image!')
    # clearing dirs prunes the outer walk to the top directory only;
    # the inner walk below does the actual recursion, one directory at a time
    for subdirs, dirs, files in os.walk(sys.argv[1]):
        files.clear()
        dirs.clear()
        for root, _, files in os.walk(subdirs):
            print('\n******** Processing files in ' + root)
            hashes = []
            w_hash = []
            w_hash_idx = []
            filelist = []
            files_ = []
            cnt = 0
            for f in files:
                #cnt = cnt + 1
                #if cnt < 10:
                files_.append(f)
                continue
            cnt = 0
            for f_idx, fname in enumerate(files_):
                e = None
                ext = os.path.splitext(fname.lower())[1]
                filepath = os.path.join(root, fname)
                imformat = ''
                hashes_tmp = []
                # reset corrupt string
                corrupt_str = None
                if ext in (stdimageext + rawimageext):
                    print(str(int(round(((f_idx + 1) / len(files_)) * 100))) + '%' + ' : ' + fname + '....', end='', flush=True)
                    try:
                        with wImage(filename=filepath) as im:
                            imformat = '.' + im.format.lower()
                            ext = imformat if imformat != '' else ext
                            with im.convert('jpeg') as converted:
                                jpeg_bin = converted.make_blob()
                                with Image.open(BytesIO(jpeg_bin)) as im2:
                                    hash_image = []
                                    for x in range(3):
                                        print('.', end='', flush=True)
                                        hash_i = str(imagehash.dhash(im2.rotate(90 * x, expand=1), 32))
                                        if ''.join(set(hash_i)) != '0':
                                            hash_image.append(hash_i)
                                    if hash_image:
                                        hash_image.append(1)
                                        hashes_tmp.append(hash_image)
                    except:
                        e = sys.exc_info()[0]
                        errcode = str([k for k, v in wand.exceptions.TYPE_MAP.items() if v == e][0]).zfill(3)
                        if int(errcode[-2:]) in (15, 25, 30, 35, 40, 50, 55):
                            corrupt_str = 'magick'
                    finally:
                        try:
                            im.close()
                        except:
                            pass
                        try:
                            im2.close()
                        except:
                            pass
                    if ext in (stdimageext):
                        try:
                            with Image.open(filepath) as im:
                                hash_image = []
                                for x in range(3):
                                    print('.', end='', flush=True)
                                    hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1), 32))
                                    if ''.join(set(hash_i)) != '0':
                                        hash_image.append(hash_i)
                                if hash_image:
                                    hash_image.append(2)
                                    hashes_tmp.append(hash_image)
                        except:
                            pass
                        finally:
                            try:
                                im.close()
                            except:
                                pass
                        # use jpeginfo against all jpg images as its pretty accurate
                        if ext in ('.jpg', '.jpeg'):
                            print('.', end='', flush=True)
                            cmd = 'jpeginfo --check "' + filepath + '"'
                            exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                            if exitcode == 1:
                                corrupt_str = 'JpegInfo' if corrupt_str is None else corrupt_str
                        if corrupt_str is None:
                            try:
                                with Image.open(filepath) as im:
                                    print('.', end='', flush=True)
                                    im.verify()
                            except:
                                e = sys.exc_info()[0]
                                corrupt_str = 'PIL_Verify' if corrupt_str is None else corrupt_str
                            else:
                                try:
                                    with Image.open(filepath) as im:
                                        print('.', end='', flush=True)
                                        temp = im.copy()
                                        im.load()
                                except:
                                    e = sys.exc_info()[0]
                                    corrupt_str = 'PIL_Load' if corrupt_str is None else corrupt_str
                                finally:
                                    try:
                                        temp.close()
                                    except:
                                        pass
                                    try:
                                        im.close()
                                    except:
                                        pass
                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp.close()
                                except:
                                    pass
                    # raw image processing
                    if ext in (rawimageext):
                        print('.', end='', flush=True)
                        # try to load raw using libraw via rawpy first,
                        # generally if libraw can't load it then ufraw extraction would also fail
                        if corrupt_str is None:
                            try:
                                with rawpy.imread(filepath) as raw:
                                    rgb = raw.postprocess(use_camera_wb=True)
                                    temp_raw = NamedTemporaryFile(suffix='.jpg')
                                    Image.fromarray(rgb).save(temp_raw.name)
                                    with Image.open(temp_raw.name) as im:
                                        hash_image = []
                                        for x in range(3):
                                            print('.', end='', flush=True)
                                            hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1), 32))
                                            if ''.join(set(hash_i)) != '0':
                                                hash_image.append(hash_i)
                                        if hash_image:
                                            hash_image.append(3)
                                            hashes_tmp.append(hash_image)
                            except rawpy.LibRawFatalError:
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_FE'
                            except rawpy.LibRawNonFatalError:
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_NFE'
                            except:
                                #print(sys.exc_info())
                                corrupt_str = 'Libraw'
                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp_raw.close()
                                except:
                                    pass
                                try:
                                    raw.close()
                                except:
                                    pass
                        if corrupt_str is None:
                            # as a final last ditch effort compare perceptual hashes of extracted
                            # raw and embedded preview to detect possible internal corruption
                            # extract and convert raw to jpeg image using ufraw
                            temp_raw = NamedTemporaryFile(suffix='.jpg')
                            cmd = 'ufraw-batch --wb=camera --rotate=camera --out-type=jpg --compression=95 --noexif --lensfun=none --auto-crop --output=' + temp_raw.name + ' --overwrite "' + filepath + '"'
                            print('.', end='', flush=True)
                            exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                            if exitcode == 1 or ufrawRegex.search(str(err)) is not None:
                                corrupt_str = 'Ufraw' if corrupt_str is None else corrupt_str
                            tmpfilesize = os.stat(temp_raw.name).st_size
                            if tmpfilesize > 0:
                                try:
                                    with Image.open(temp_raw.name) as im:
                                        hash_image = []
                                        for x in range(3):
                                            print('.', end='', flush=True)
                                            hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1), 32))
                                            if ''.join(set(hash_i)) != '0':
                                                hash_image.append(hash_i)
                                        if hash_image:
                                            hash_image.append(4)
                                            hashes_tmp.append(hash_image)
                                except:
                                    pass
                                finally:
                                    try:
                                        im.close()
                                    except:
                                        pass
                            try:
                                temp_raw.close()
                            except:
                                pass
                        # attempt to extract preview images
                        imfile = filepath
                        try:
                            with pyexiv2.ImageMetadata(imfile) as metadata_orig:
                                metadata_orig.read()
                                #for i,p in enumerate(metadata_orig.previews):
                                if metadata_orig.previews:
                                    preview = metadata_orig.previews[-1]
                                    # save preview to temp file
                                    temp_preview = NamedTemporaryFile()
                                    preview.write_to_file(temp_preview.name)
                                    os.rename(temp_preview.name + preview.extension, temp_preview.name)
                                    try:
                                        with Image.open(temp_preview.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.', end='', flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1), 32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(5)
                                                hashes_tmp.append(hash_image)
                                    except:
                                        pass
                                    finally:
                                        try:
                                            temp_preview.close()
                                        except:
                                            pass
                                        try:
                                            im.close()
                                        except:
                                            pass
                        except:
                            pass
                        finally:
                            try:
                                metadata_orig.close()
                            except:
                                pass
                    # compare hashes for all images that were found or extracted and find most dissimilar hamming distance (worst)
                    if len(hashes_tmp) > 1:
                        #print('checking_hashes')
                        print('.', end='', flush=True)
                        scores = []
                        for h_idx, hash in enumerate(hashes_tmp):
                            i = h_idx + 1
                            while i < len(hashes_tmp):
                                ham_dist = 1
                                for h1 in hash[:-1]:
                                    for h2 in hashes_tmp[i][:-1]:
                                        # fraction of differing hex digits between the two hash strings
                                        ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2))) / len(h1))
                                # only score comparisons between the embedded preview (source id 5) and the other sources
                                if (hash[-1] == 5 and hashes_tmp[i][-1] != 5) or (hash[-1] != 5 and hashes_tmp[i][-1] == 5):
                                    scores.append([ham_dist, hash[-1], hashes_tmp[i][-1]])
                                i = i + 1
                        if scores:
                            worst = sorted(scores, key=lambda x: x[0])[-1]
                            if worst[0] > 0.3:
                                worst1 = str(worst[1])
                                worst2 = str(worst[2])
                                corrupt_str = 'hash' + str(round(worst[0] * 100, 2)) + '_' + worst1 + '-' + worst2 if corrupt_str is None else corrupt_str
                    # tag the filename if corruption was detected, ensuring files already tagged are re-tagged
                    mo = corruptRegex.search(fname)
                    newfilename = None
                    if corrupt_str is not None:
                        print('Corrupt: ' + corrupt_str)
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', fname) + ext
                        else:
                            newfilename = os.path.splitext(fname)[0] + '_[' + corrupt_str + ']' + ext
                    else:
                        print('OK!')
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '', fname) + ext
                    # remove group index from name if present, this will be assigned in the next step if needed
                    newfilename = newfilename if newfilename is not None else fname
                    mo = groupRegex.search(newfilename)
                    if mo is not None:
                        newfilename = re.sub(groupRegex, '', newfilename)
                    if hashes_tmp:
                        # set() deduplicates the flattened list of hash strings (source ids stripped)
                        hashes.append(set([item for sublist in hashes_tmp for item in sublist[:-1]]))
                        filelist.append([root, fname, newfilename, len(hashes_tmp)])
            print('******** Grouping similar images... ************')
            if len(hashes) > 1:
                # condensed pairwise distances: keep the most similar (smallest)
                # hamming distance across all hash pairings for each file pair
                scores = []
                for h_idx, hash in enumerate(hashes):
                    i = h_idx + 1
                    while i < len(hashes):
                        ham_dist = 1
                        for h1 in hash:
                            for h2 in hashes[i]:
                                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2))) / len(h1))
                        scores.append(ham_dist)
                        i = i + 1
                X = np.array(scores)
                linkage = fastcluster.single(X)
                w_hash_idx = [el_idx for el_idx, el in enumerate(filelist) if el[3] > 0]
                w_hash = [filelist[i] for i in w_hash_idx]
                test = ct(linkage, [el[2] for el in w_hash], .2)
                for i, prfx in enumerate(test):
                    curfilename = w_hash[i][2]
                    mo = groupRegex.search(curfilename)
                    newfilename = None
                    if prfx is not None:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '[' + prfx + ']_', curfilename)
                        else:
                            newfilename = '[' + prfx + ']_' + curfilename
                    else:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '', curfilename)
                    # if newfilename is not None:
                    filelist[w_hash_idx[i]][2] = newfilename if newfilename is not None else curfilename
                #fig = plt.figure(figsize=(25,25))
                #plt.title(root)
                #plt.xlabel('perceptual hash hamming distance')
                #plt.axvline(x=.15,c='red',linestyle='--')
                #dg = dendrogram(linkage, labels=[el[2] for el in w_hash], orientation='right', show_leaf_counts=True)
                #ax = fig.gca()
                #ax.set_xlim(-.02,ax.get_xlim()[1])
                #plt.show()
                #plt.savefig(os.path.join(root,'dendrogram.pdf'), bbox_inches='tight', dpi=100)
                w_hash.clear()
                w_hash_idx.clear()
            print('******** Renaming files if applicable... ************')
            for fr in filelist:
                if fr[1] != fr[2]:
                    #print(fr[1] + ' -- ' + fr[2])
                    path = fr[0]
                    os.rename(os.path.join(path, fr[1]), os.path.join(path, fr[2]))
            filelist.clear()
    duration = datetime.now() - start_time
    days = divmod(duration.total_seconds(), 86400)  # get days (without [0]!)
    hours = divmod(days[1], 3600)                   # use remainder of days to calc hours
    minutes = divmod(hours[1], 60)                  # use remainder of hours to calc minutes
    seconds = divmod(minutes[1], 1)                 # use remainder of minutes to calc seconds
    print("Time to complete: %d days, %d:%d:%d" % (days[0], hours[0], minutes[0], seconds[0]))