Given a list of filenames, group each filename with all of its derivatives based on filename similarity
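I may not be searching with the best terms, but nothing I have found so far solves my problem, and I don't really know where to start or even which mechanisms to look into.
I have a large number of image files in different locations on my hard drive, and I'm trying to clean them up by removing duplicates. Most of them are easy to find using hash codes, but I have many corrupted or edited versions that are not so easy to find. I know I will need some user interaction to identify and remove (archive) the unwanted files. I will do some further processing to make sure metadata such as dates and geotags is correct (also to find possible matching files), and then present similar images together with all known data through a simple HTML interface.
One of the steps I have identified is grouping files whose names are similar, or whose names contain part of another filename. Sometimes these may be completely unrelated, hence the need for user interaction.
Below is a sample of the files. I'd like to group them by similar filename, ignoring the path and file extension.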
[
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg",
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/Holidays/IMAG0132.jpg",
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
"/Users/stu/Photos/2014/IMG_20140412_195110.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/2013/IMAG0126546.jpg"
]
The file list above should produce output like the following:
{
"IMG_20140413_072335": [
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg"
],
"IMAG0097": [
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg"
],
"IMAG0126": [
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/2013/IMAG0126546.jpg"
],
"IMAG0132": [
"/Users/stu/Photos/Holidays/IMAG0132.jpg"
],
"IMG_20140322_142557": [
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg"
],
"IMG_20140330_200132": [
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg"
],
"IMG_20140412_195105": [
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png"
],
"IMG_20140412_195110": [
"/Users/stu/Photos/2014/IMG_20140412_195110.png"
],
"IMG_20140413_143245": [
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png"
]
}
Any ideas on how to do this in Python 3?
Thanks
Edit: I have just added a few more examples to the sample set of filenames.
The following works for me:
import os
from pprint import pprint

# `t` is the list of file paths from the question
d = dict()
for i in t:
    # name before the first "."; without an extension, the base name is unchanged
    tmp = os.path.basename(i).split(".")[0]
    # "(n)" is the typical Windows suffix for copies with the same name,
    # so keep only the part before it
    k = tmp.split("(")[0]
    d.setdefault(k, [])  # create the group for this key if it does not exist yet
    if k in tmp:
        d[k].append(i)

# Sanity check: every path must have landed in a group
if sum([len(i) for i in d.values()]) != len(t):
    raise ValueError("The sanity check wasn't successful !")
pprint(d)
Result:
{'IMAG0097': ['/Users/stu/Photos/2013/IMAG0097.jpg',
'/Users/stu/Photos/2014/IMAG0097.jpg'],
'IMAG0126': ['/Users/stu/Photos/2013/IMAG0126.jpg'],
'IMAG0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
'IMG_20140322_142557-edited': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
'IMG_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
'IMG_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
'IMG_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
'IMG_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335.jpg',
'/Users/stu/Documents/Backup/IMG_20140413_072335(4).png',
'/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg',
'/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg',
'/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg'],
'IMG_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png',
'/Users/stu/Photos/2014/IMG_20140413_143245(7).png',
'/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245.png',
'/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png',
'/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg',
'/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg',
'/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg',
'/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png']}
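In the result above, `IMG_20140322_142557-edited` keeps its suffix in the key, because only the `(n)` counter is split off. A minimal sketch of one way to strip both kinds of suffix with a regular expression follows. The pattern and the helper names `base_key` and `group_by_key` are assumptions made for illustration, not part of the answer, and the pattern only knows about `(n)` and `-edited`:

```python
import os
import re
from collections import defaultdict

# Hypothetical pattern: one or more trailing "(n)" counters or "-edited" markers.
_SUFFIX = re.compile(r"(\(\d+\)|-edited)+$")


def base_key(path):
    """Return the file stem with trailing "(n)" / "-edited" suffixes removed."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return _SUFFIX.sub("", stem)


def group_by_key(paths):
    """Map each base key to the list of paths that share it."""
    groups = defaultdict(list)
    for p in paths:
        groups[base_key(p)].append(p)
    return dict(groups)


sample = [
    "/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245.png",
]
print(group_by_key(sample))
```

Suffixes such as `_01` or `-9237` are deliberately left alone here; whether they mark a derivative or a different photo is a judgment call that probably belongs to the user-interaction step.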
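So I came up with a solution that does what I need. I'm not sure it is the best way to solve this, but it works.
First, I build a dictionary with the full path as the key and the filename minus its extension as the value. I then sort it by value length, so that when I iterate I start with the shortest names. Then I simply loop through and check every later (longer) entry for one that contains the current name, grouping the matches together. I also allow for short filenames by comparing the lengths of the two values and only matching when their ratio reaches a threshold (0.5 in the example below).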
import os
from pprint import pprint
files = [
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg",
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/Holidays/IMAG0132.jpg",
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
"/Users/stu/Photos/2014/IMG_20140412_195110.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/20.png",
"/Users/stu/Photos/203.png",
"/Users/stu/Photos/2021.png",
"/Users/stu/Photos/2021q.png",
"/User/2.jpg"
]
relevanceFactor = 0.5
rawFiles = {}
for file in files:
    rawFiles[file] = os.path.splitext(os.path.basename(file))[0]
sortedFiles = sorted(rawFiles.items(), key=lambda kv: (len(kv[1]), kv[0]))
alreadyGrouped = []
groupedFiles = {}
for i, file in enumerate(sortedFiles):
    fullPath = file[0]
    cluster = file[1]
    if cluster not in alreadyGrouped:
        groupedFiles[cluster] = [fullPath]
        for compareFile in sortedFiles[i+1:]:
            compareFullPath = compareFile[0]
            compareCluster = compareFile[1]
            if len(cluster)/len(compareCluster) < relevanceFactor:
                break
            if (compareCluster not in alreadyGrouped
                    and cluster in compareCluster):
                alreadyGrouped.append(compareCluster)
                groupedFiles[cluster].append(compareFullPath)
        if cluster not in alreadyGrouped:
            alreadyGrouped.append(cluster)
pprint(groupedFiles)
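To see what the threshold does to the short names at the end of the list, here is a minimal sketch of the same containment idea wrapped in a function. The name `group_by_containment` and its `relevance` parameter are hypothetical. One deliberate difference from the code above: it tracks which entries (rather than which stems) have been grouped, on the assumption that two derivative files with the same stem in different folders should both be kept in the group.

```python
import os


def group_by_containment(paths, relevance=0.5):
    """Group paths whose stem contains a shorter stem that seeded a group."""
    # (stem, path) pairs, shortest stems first so base names seed the groups
    entries = sorted(
        ((os.path.splitext(os.path.basename(p))[0], p) for p in paths),
        key=lambda e: (len(e[0]), e[1]),
    )
    taken = set()  # indices of entries already placed in a group
    groups = {}
    for i, (stem, path) in enumerate(entries):
        if i in taken:
            continue
        taken.add(i)
        members = groups.setdefault(stem, [])
        members.append(path)
        for j in range(i + 1, len(entries)):
            other_stem, other_path = entries[j]
            # Stop once the seed is too short relative to the candidate;
            # entries are sorted by length, so later ones are longer still.
            if len(stem) < relevance * len(other_stem):
                break
            if j not in taken and stem in other_stem:
                taken.add(j)
                members.append(other_path)
    return groups


short_names = [
    "/Users/stu/Photos/20.png",
    "/Users/stu/Photos/203.png",
    "/Users/stu/Photos/2021.png",
    "/Users/stu/Photos/2021q.png",
    "/User/2.jpg",
]
print(group_by_containment(short_names))
# {'2': ['/User/2.jpg', '/Users/stu/Photos/20.png'],
#  '203': ['/Users/stu/Photos/203.png'],
#  '2021': ['/Users/stu/Photos/2021.png', '/Users/stu/Photos/2021q.png']}
```

With a threshold of 0.5, `2` still absorbs `20` (ratio exactly 0.5) but not `203`, and `2021` is not pulled into the `2` group even though it contains that digit.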
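Your photos appear to be identified by the part of the name between the last 'G' (from 'IMG' or 'IMAG') and the next '.', '(' or '-'.
Using that part of the string as a key, we can easily group the filenames into a dictionary of lists.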
files = ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg', '/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/Holidays/IMAG0132.jpg', '/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg', '/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg', '/Users/stu/Downloads/Photos/IMG_20140412_195105.png', '/Users/stu/Photos/2014/IMG_20140412_195110.png', '/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg', 
'/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg', '/Users/stu/Photos/2013/IMAG0126546.jpg']
def photo_id(filename):
    i = filename.rfind('G') + 1
    j1 = filename.find('.', i)
    j2 = filename.find('(', i)
    j3 = filename.find('-', i)
    j = min(j for j in (j1,j2,j3,len(filename)) if j > -1)
    return filename[i:j]

photos = {}
for filename in files:
    photos.setdefault(photo_id(filename), []).append(filename)
print(photos)
# {'_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg'],
# '0097': ['/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg'],
# '0126': ['/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg'],
# '0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
# '_20140322_142557': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
# '_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
# '_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
# '_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
# '_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png'],
# '_20140413_072335_01': ['/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg'],
# '_20140413_072335_9352': ['/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg'],
# '0126546': ['/Users/stu/Photos/2013/IMAG0126546.jpg']}
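Because `rfind('G')` searches the whole path, an upper-case extension such as `.PNG` or `.JPG` would move the match into the extension. A minimal sketch of a variant that looks only at the base name is given below; it assumes the names start with `IMG` or `IMAG`, and the helper name `photo_id_from_basename` is made up for this example:

```python
import os
import re

# Assumed naming scheme: "IMG" or "IMAG" prefix, then the id up to '.', '(' or '-'.
_ID = re.compile(r"^IMA?G([^.(-]*)")


def photo_id_from_basename(path):
    """Extract the id from the base name; fall back to the stem if no prefix matches."""
    name = os.path.basename(path)
    m = _ID.match(name)
    return m.group(1) if m else os.path.splitext(name)[0]


for p in (
    "/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
    "/Users/stu/Photos/2013/IMAG0126-edited.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335_01.PNG",
):
    print(photo_id_from_basename(p))
```

Like the original function, this still treats `_01` as part of the id, so those files end up in their own groups.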