Is there a way to run OpenCV's SIFT faster?
I have a directory of images that contains many unidentified duplicates. My goal is to identify the duplicates. Because the duplicates have been cropped, resized, or converted to a different image format, they cannot be detected by comparing their hashes.

I wrote a script that successfully detects the duplicates, but it has one major drawback: it is slow. In a test run on a folder containing 60 items, it took five hours to finish (this may also reflect my increasingly buggy code and slow computer). Since there are roughly 66,000 images in my directory, I estimate the script would take 229 days to complete.

Can anyone suggest a solution? My research has revealed that you can free up memory by "releasing" the image stored in a variable as the loop completes, but all the information on how to do this seems to be written in C, not Python. I was also thinking of trying ORB instead of SIFT, but I am worried about its accuracy. Does anyone have advice on which of these two options is the better choice, or on a way to rewrite the script so it uses less memory? Many thanks.
from __future__ import division
import cv2
import numpy as np
import glob
import pandas as pd

listOfTitles1 = []
listOfTitles2 = []
listOfSimilarities = []

# Sift and Flann
sift = cv2.xfeatures2d.SIFT_create()
index_params = dict(algorithm=0, trees=5)
search_params = dict()
flann = cv2.FlannBasedMatcher(index_params, search_params)

# Load all the images1
countInner = 0
countOuter = 1

folder = r"/Downloads/images/**/*"

for a in glob.iglob(folder, recursive=True):
    for b in glob.iglob(folder, recursive=True):
        if not a.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
            continue
        if not b.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
            continue
        if b.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
            countInner += 1

        print(countInner, "", countOuter)

        if countInner <= countOuter:
            continue

        image1 = cv2.imread(a)
        kp_1, desc_1 = sift.detectAndCompute(image1, None)

        image2 = cv2.imread(b)
        kp_2, desc_2 = sift.detectAndCompute(image2, None)

        matches = flann.knnMatch(desc_1, desc_2, k=2)
        good_points = []

        if good_points == 0:
            continue

        for m, n in matches:
            if m.distance < 0.6*n.distance:
                good_points.append(m)
        number_keypoints = 0
        if len(kp_1) >= len(kp_2):
            number_keypoints = len(kp_1)
        else:
            number_keypoints = len(kp_2)

        percentage_similarity = float(len(good_points)) / number_keypoints * 100

        listOfSimilarities.append(str(int(percentage_similarity)))
        listOfTitles2.append(b)
        listOfTitles1.append(a)
    countInner = 0
    if a.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
        countOuter += 1

zippedList = list(zip(listOfTitles1, listOfTitles2, listOfSimilarities))
print(zippedList)

dfObj = pd.DataFrame(zippedList, columns=['Original', 'Title', 'Similarity'])
dfObj.to_csv(r"/Downloads/images/DuplicateImages3.csv")
I think you can get a significant performance improvement with a couple of simple changes:
- First, since you are interested in comparing pairs of images, your loop could look like this:

files = ... # preload all file names with glob

for a_idx in range(len(files)):
    for b_idx in range(a_idx + 1, len(files)): # note the inner loop starts at a_idx + 1
        image_1 = cv2.imread(files[a_idx])
        image_2 = cv2.imread(files[b_idx])

This considers each pair exactly once, so you never evaluate both (a, b) and (b, a); an equivalent formulation with itertools is sketched after this list.
- Second, there is no need to recompute the features of a while comparing it against every b:

for a_idx in range(len(files)):
    image_1 = cv2.imread(files[a_idx])
    kp_1, desc_1 = sift.detectAndCompute(image_1, None) # never recompute SIFT for a!
    for b_idx in range(a_idx + 1, len(files)):
        image_2 = cv2.imread(files[b_idx])
        kp_2, desc_2 = sift.detectAndCompute(image_2, None)
- I would also check the image sizes. My guess is that a few very large images are slowing down your inner loop. Even all 60*60 == 3600 pairs should not take that long. If the images are really large, you can downsample them for efficiency; a resizing sketch follows this list.
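On the first point, the standard library can also generate the unique pairs directly. A minimal sketch using itertools.combinations; the glob pattern here borrows the "SiftImages/*" folder from the answer below purely as an example path, and the matching body is elided:

import glob
import itertools

import cv2

files = sorted(glob.glob("SiftImages/*"))  # example path borrowed from the answer below

# combinations(files, 2) yields each unordered pair exactly once:
# (a, b) is produced, (b, a) never is, and no file is paired with itself
for path_a, path_b in itertools.combinations(files, 2):
    image_1 = cv2.imread(path_a)
    image_2 = cv2.imread(path_b)
    # ... detect, match, and score the pair as in the original script ...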
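On the last point, a minimal downsampling sketch, assuming a hypothetical cap of 1024 pixels on the longest side (the max_dim value is not from the original post; tune it to your data):

import cv2

def load_downsampled(path, max_dim=1024):  # max_dim is an assumed cap, not from the post
    """Read an image and shrink its longest side to at most max_dim pixels."""
    image = cv2.imread(path)
    if image is None:  # unreadable or non-image file
        return None
    h, w = image.shape[:2]
    scale = max_dim / max(h, w)
    if scale < 1.0:  # only ever shrink, never enlarge
        image = cv2.resize(image, (int(w * scale), int(h * scale)),
                           interpolation=cv2.INTER_AREA)
    return image

SIFT is scale-invariant up to a point, so moderate downsampling usually preserves enough keypoints for duplicate detection while making detectAndCompute much cheaper.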
I ran your existing implementation on my computer, on 100 images. That code took 6 hours and 31 minutes to run. Then I changed the implementation as I had suggested in the comments, computing sift.detectAndCompute only once per image, caching the results, and using the cached results in the comparisons. This reduced the execution time on my computer from 6 hours 31 minutes to 6 minutes 29 seconds on the same 100 images. I don't know whether this will be fast enough for all of your images, but it is a significant reduction.

See my modified implementation below.
from __future__ import division
import cv2
import numpy as np
import glob
import pandas as pd

listOfTitles1 = []
listOfTitles2 = []
listOfSimilarities = []

# Sift and Flann
sift = cv2.xfeatures2d.SIFT_create()
index_params = dict(algorithm=0, trees=5)
search_params = dict()
flann = cv2.FlannBasedMatcher(index_params, search_params)

# Load all the images1
countInner = 0
countOuter = 1

folder = r"/Downloads/images/**/*"
folder = "SiftImages/*"

siftOut = {}
for a in glob.iglob(folder, recursive=True):
    if not a.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
        continue
    image1 = cv2.imread(a)
    kp_1, desc_1 = sift.detectAndCompute(image1, None)
    siftOut[a] = (kp_1, desc_1)

for a in glob.iglob(folder, recursive=True):
    if not a.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
        continue
    (kp_1, desc_1) = siftOut[a]
    for b in glob.iglob(folder, recursive=True):
        if not b.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
            continue
        if b.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
            countInner += 1

        print(countInner, "", countOuter)

        if countInner <= countOuter:
            continue

        #### image1 = cv2.imread(a)
        #### kp_1, desc_1 = sift.detectAndCompute(image1, None)
        ####
        #### image2 = cv2.imread(b)
        #### kp_2, desc_2 = sift.detectAndCompute(image2, None)
        (kp_2, desc_2) = siftOut[b]

        matches = flann.knnMatch(desc_1, desc_2, k=2)
        good_points = []

        if good_points == 0:
            continue

        for m, n in matches:
            if m.distance < 0.6*n.distance:
                good_points.append(m)
        number_keypoints = 0
        if len(kp_1) >= len(kp_2):
            number_keypoints = len(kp_1)
        else:
            number_keypoints = len(kp_2)

        percentage_similarity = float(len(good_points)) / number_keypoints * 100

        listOfSimilarities.append(str(int(percentage_similarity)))
        listOfTitles2.append(b)
        listOfTitles1.append(a)
    countInner = 0
    if a.lower().endswith(('.jpg','.png','.tif','.tiff','.gif')):
        countOuter += 1

zippedList = list(zip(listOfTitles1, listOfTitles2, listOfSimilarities))
print(zippedList)

dfObj = pd.DataFrame(zippedList, columns=['Original', 'Title', 'Similarity'])
### dfObj.to_csv(r"/Downloads/images/DuplicateImages3.csv")
dfObj.to_csv(r"DuplicateImages3.2.csv")