Extract features for similar file names
I have some files, for example
shghgssd_1212.jpg
shghgssd_ewewe.jpg
shghgssd_opopo.jpg
sdsdsdj_weuwie.jpg
sdsdsdj_12143.jpg
sdsdsdj_eteyyw.jpg
and I need to get the sum of their vectors, but only over the files whose names start with the same prefix up to the "_" character. That is, for
shghgssd_1212.jpg
shghgssd_ewewe.jpg
shghgssd_opopo.jpg
and also for
sdsdsdj_weuwie.jpg
sdsdsdj_12143.jpg
sdsdsdj_eteyyw.jpg
which also start with the same name, I need to save the sum of their vectors for each group.
I tried:
def extract_features(directory):
    results = []
    names = []
    for name in listdir(directory):
        filename = directory + '/' + name
        names.append(name)
    for name in listdir(directory):
        for n in names:
            if (name is n):
                image = load_img(filename, target_size=(224, 224, 3))
                vec = model(preprocess(image).unsqueeze(0).cuda())
                vec1 = vec.sum(dim=0).cpu().detach().numpy()
                for ind in range(batch_size):
                    results.append(vec1[ind])
    return results
but it extracted nothing, because I could not match the file names that start with the same prefix.
Is there a better solution?
Here is an example that correctly groups the files into a dictionary, where each key is the relevant name before the underscore (`_`) and the values are the paths of the files sharing that name.
from pathlib import Path
from itertools import groupby
from collections import defaultdict

path = Path("path/to/dir")

def group_files(directory: Path, ext="jpg"):  # We must use a pathlib object here
    # We can glob the pathlib object with our preferred extension;
    # keep the full path so the files can be loaded later
    list_of_files = [str(p) for p in directory.glob(f"**/*.{ext}")]
    file_dict = defaultdict(list)  # File storage in a dict
    keyfunc = lambda x: Path(x).name.split("_")[0]  # Split the file name on '_' and get the first word
    data = sorted(list_of_files, key=keyfunc)  # sort on the name before the '_'
    for k, g in groupby(data, keyfunc):
        file_dict[k] = list(g)
    return file_dict
dict_of_files = group_files(path)
# Now we have something like
# {"shghgssd" : ["shghgssd_1212.jpg", "shghgssd_ewewe.jpg", "shghgssd_opopo.jpg"]}
# But with full paths, not printed for brevity
# This means that you can iterate over the keys and values and get some operation going
sums_of_vecs = []
for key, value in dict_of_files.items():  # key is a string, value is a list
    print(f"Treating the files with the {key}_... prefix")
    for filepath in value:
        # DO YOUR COMPUTATION HERE
        # APPEND TO RESULTS
        pass  # nothing happens here...
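The DO YOUR COMPUTATION HERE part could look something like the following. This is a minimal sketch with a dummy `extract_vec` standing in for the real model call (`model(preprocess(load_img(filepath)).unsqueeze(0).cuda())` from the question); swap in that call and replace the plain Python lists with your actual tensors.

```python
def extract_vec(filepath):
    # Dummy stand-in for the real model call so the sketch runs standalone;
    # in practice this would load the image and run it through the network.
    return [1.0, 2.0, 3.0]

def sum_vectors_per_group(dict_of_files):
    # For each prefix group, sum the per-file vectors component-wise.
    sums = {}
    for prefix, paths in dict_of_files.items():
        vecs = [extract_vec(p) for p in paths]
        sums[prefix] = [sum(comp) for comp in zip(*vecs)]
    return sums

groups = {"shghgssd": ["shghgssd_1212.jpg", "shghgssd_ewewe.jpg"]}
print(sum_vectors_per_group(groups))  # -> {'shghgssd': [2.0, 4.0, 6.0]}
```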
Note that if you control how the files are created, it would be better to put each batch you plan to process further into its own directory and then load them with image_dataset_from_directory.
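If you do go the per-directory route, the reshuffling itself is simple. Here is a sketch that uses a throwaway temp directory with empty dummy files so it runs standalone; in practice `src` would be your real image directory, and `image_dataset_from_directory` refers to the Keras utility `tf.keras.utils.image_dataset_from_directory`.

```python
import shutil
import tempfile
from pathlib import Path

# Throwaway directory with dummy files just for the demo;
# in practice `src` is the real image directory.
src = Path(tempfile.mkdtemp())
for name in ["shghgssd_1212.jpg", "shghgssd_ewewe.jpg", "sdsdsdj_12143.jpg"]:
    (src / name).touch()

# Move each file into a subdirectory named after its prefix before "_"
for f in list(src.glob("*.jpg")):
    prefix_dir = src / f.name.split("_")[0]
    prefix_dir.mkdir(exist_ok=True)
    shutil.move(str(f), str(prefix_dir / f.name))

print(sorted(p.name for p in src.iterdir()))  # -> ['sdsdsdj', 'shghgssd']
```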