Find near duplicate and faked images
I'm using a perceptual hashing technique to find near-duplicate and exact-duplicate images. The code works perfectly for finding exact duplicates. However, finding near duplicates and slightly modified images seems to be difficult, because the difference score between their hashes is often similar to the hash difference between completely different, random images.
To get around this, I tried pixelating the near-duplicate images down to 50x50 pixels and making them black/white, but I still don't get what I need (a small difference score).
Here is a sample of a near-duplicate image pair:
Image 1 (a1.jpg):
Image 2 (b1.jpg):
The difference between these images' hash scores is: 24
When pixelated (50x50 pixels), they look like this:
rs_a1.jpg
rs_b1.jpg
The hash difference score of the pixelated images is even bigger!: 26
Below are the two sample near-duplicate image pairs requested by @ann zen:
Pair 1
Pair 2
The code I use to reduce the image size is this:
from PIL import Image

with Image.open(image_path) as image:
    reduced_image = image.resize((50, 50)).convert('RGB').convert("1")
And the code for comparing two image hashes:
from PIL import Image
import imagehash

with Image.open(image1_path) as img1:
    hashing1 = imagehash.phash(img1)
with Image.open(image2_path) as img2:
    hashing2 = imagehash.phash(img2)

print('difference : ', hashing1 - hashing2)
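For reference, the difference printed above is a Hamming distance over imagehash's default 64-bit phash, so it can be normalized into a similarity score. A minimal sketch (the phash_similarity helper is illustrative, not part of my actual code):

import imagehash
from PIL import Image

def phash_similarity(path1, path2):
    # imagehash.phash uses an 8x8 hash by default, i.e. 64 bits;
    # subtracting two hashes gives their Hamming distance (0-64)
    with Image.open(path1) as img1, Image.open(path2) as img2:
        distance = imagehash.phash(img1) - imagehash.phash(img2)
    return 1 - distance / 64

# e.g. the distance of 24 above corresponds to 1 - 24/64 = 0.625 similarity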
Rather than pixelating the images before finding the difference/similarity between them, simply give them some blur using the cv2.GaussianBlur() method, and then use the cv2.matchTemplate() method to find the similarity between them:
import cv2
import numpy as np

def process(img):
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(img_gray, (43, 43), 21)

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for img1, img2 in zip(img1s, img2s):
    conf = confidence(img1, img2)
    print(f"Confidence: {round(conf * 100, 2)}%")
Output:
Confidence: 83.6%
Confidence: 84.62%
Confidence: 87.24%
Here are the images used for the program above:
img1_1.jpg & img2_1.jpg:
img1_2.jpg & img2_2.jpg:
img1_3.jpg & img2_3.jpg:
To prove that the blurring doesn't produce real false positives, I ran this program:
import cv2
import numpy as np

def process(img):
    h, w, _ = img.shape
    img = cv2.resize(img, (350, h * w // 350))
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(img_gray, (43, 43), 21)

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for i, img1 in enumerate(img1s, 1):
    for j, img2 in enumerate(img2s, 1):
        conf = confidence(img1, img2)
        print(f"img1_{i} img2_{j} Confidence: {round(conf * 100, 2)}%")
Output:
img1_1 img2_1 Confidence: 84.2% # Corresponding images
img1_1 img2_2 Confidence: -10.86%
img1_1 img2_3 Confidence: 16.11%
img1_2 img2_1 Confidence: -2.5%
img1_2 img2_2 Confidence: 84.61% # Corresponding images
img1_2 img2_3 Confidence: 43.91%
img1_3 img2_1 Confidence: 14.49%
img1_3 img2_2 Confidence: 59.15%
img1_3 img2_3 Confidence: 87.25% # Corresponding images
Notice that the program only outputs a high confidence level (84+%) when the images are matched with their corresponding image.
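As a minimal sketch (not part of the program above), that observation can be turned into a yes/no check by reusing the confidence() function; the 0.8 cutoff is an assumption sitting between the observed match scores (84+%) and non-match scores (below 60%):

# Hypothetical helper: classify a pair as near-duplicate when the blurred
# template-matching confidence clears the chosen cutoff
def is_near_duplicate(img1, img2, cutoff=0.8):
    return confidence(img1, img2) >= cutoff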
For comparison, here are the results without blurring the images:
import cv2
import numpy as np

def process(img):
    h, w, _ = img.shape
    img = cv2.resize(img, (350, h * w // 350))
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def confidence(img1, img2):
    res = cv2.matchTemplate(process(img1), process(img2), cv2.TM_CCOEFF_NORMED)
    return res.max()

img1s = list(map(cv2.imread, ["img1_1.jpg", "img1_2.jpg", "img1_3.jpg"]))
img2s = list(map(cv2.imread, ["img2_1.jpg", "img2_2.jpg", "img2_3.jpg"]))

for i, img1 in enumerate(img1s, 1):
    for j, img2 in enumerate(img2s, 1):
        conf = confidence(img1, img2)
        print(f"img1_{i} img2_{j} Confidence: {round(conf * 100, 2)}%")
Output:
img1_1 img2_1 Confidence: 66.73%
img1_1 img2_2 Confidence: -6.97%
img1_1 img2_3 Confidence: 11.01%
img1_2 img2_1 Confidence: 0.31%
img1_2 img2_2 Confidence: 65.33%
img1_2 img2_3 Confidence: 31.8%
img1_3 img2_1 Confidence: 9.57%
img1_3 img2_2 Confidence: 39.74%
img1_3 img2_3 Confidence: 61.16%
Here's a quantitative method to determine both duplicate and near-duplicate images, using the sentence-transformers library, which provides an easy way to compute dense vector representations for images. Specifically, we can use the OpenAI Contrastive Language-Image Pre-Training (CLIP) Model, a neural network already trained on a variety of (image, text) pairs. To find image duplicates and near-duplicates, we encode all images into vector space and then find high-density regions, which correspond to areas where the images are fairly similar.
When two images are compared, they are given a score between 0 and 1.00. We can use a threshold parameter to decide whether two images are similar or different. A lower threshold will result in larger clusters which contain fewer similar images. A duplicate image will have a score of 1.00, meaning the two images are exactly the same. To find near-duplicate images, we can set the threshold to any arbitrary value, say 0.9. For instance, if the determined score between two images is greater than 0.9, then we can conclude they are near-duplicate images.
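A minimal sketch of that decision rule for just two images (the file names and the 0.9 cutoff here are placeholders, not part of the full script below):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Encode each image into a dense CLIP embedding
emb1 = model.encode(Image.open('a1.jpg'), convert_to_tensor=True)
emb2 = model.encode(Image.open('b1.jpg'), convert_to_tensor=True)

# Cosine similarity between the two embeddings
score = util.cos_sim(emb1, emb2).item()
print("Score: {:.3f}%".format(score * 100))
if score > 0.9:
    print("Near-duplicate images")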
An example:
This dataset has 5 images. Notice how cat #1 is duplicated while the others are distinct.
Finding duplicate images
Score: 100.000%
.\cat1 copy.jpg
.\cat1.jpg
Both cat1 and its copy are the same.
Finding near-duplicate images
Score: 91.116%
.\cat1 copy.jpg
.\cat2.jpg
Score: 91.116%
.\cat1.jpg
.\cat2.jpg
Score: 91.097%
.\bear1.jpg
.\bear2.jpg
Score: 59.086%
.\bear2.jpg
.\cat2.jpg
Score: 56.025%
.\bear1.jpg
.\cat2.jpg
Score: 53.659%
.\bear1.jpg
.\cat1 copy.jpg
Score: 53.659%
.\bear1.jpg
.\cat1.jpg
Score: 53.225%
.\bear2.jpg
.\cat1.jpg
We get more interesting score comparison results between the different images. The higher the score, the more similar, and the lower the score, the less similar. With a threshold of 0.9 or 90%, we can filter out the near-duplicate images, as sketched below.
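A minimal sketch of that filter (illustrative, not part of the script below), assuming processed_images is the list of (score, image_id1, image_id2) triplets returned by util.paraphrase_mining_embeddings:

# Keep pairs above the 0.9 threshold but below 1.0, i.e. near-duplicates
# rather than exact copies
near_dups = [(score, id1, id2) for score, id1, id2 in processed_images
             if 0.9 < score < 1.0]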
Comparing two images
Score: 91.097%
.\bear1.jpg
.\bear2.jpg
Score: 91.116%
.\cat1.jpg
.\cat2.jpg
Score: 93.715%
.\tower1.jpg
.\tower2.jpg
Code
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import os

# Load the OpenAI CLIP Model
print('Loading CLIP Model...')
model = SentenceTransformer('clip-ViT-B-32')

# Next we compute the embeddings
# To encode an image, you can use the following code:
# from PIL import Image
# encoded_image = model.encode(Image.open(filepath))
image_names = list(glob.glob('./*.jpg'))
print("Images:", len(image_names))
encoded_image = model.encode([Image.open(filepath) for filepath in image_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)

# Now we run the clustering algorithm. This function compares images against
# all other images and returns a list with the pairs that have the highest
# cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
NUM_SIMILAR_IMAGES = 10

# =================
# DUPLICATES
# =================
print('Finding duplicate images...')
# Filter list for duplicates. Results are triplets (score, image_id1, image_id2) and are sorted in decreasing order
# A duplicate image will have a score of 1.00
duplicates = [image for image in processed_images if image[0] >= 1]

# Output the top X duplicate images
for score, image_id1, image_id2 in duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])

# =================
# NEAR DUPLICATES
# =================
print('Finding near duplicate images...')
# Use a threshold parameter to identify two images as similar. By setting the threshold lower,
# you will get larger clusters which have less similar images in it. Threshold 0 - 1.00
# A threshold of 1.00 means the two images are exactly the same. Since we are finding near
# duplicate images, we can set it at 0.99 or any number 0 < X < 1.00.
threshold = 0.99
near_duplicates = [image for image in processed_images if image[0] < threshold]

for score, image_id1, image_id2 in near_duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])