Extract features for files with similar names

I have some files, for example:

shghgssd_1212.jpg
shghgssd_ewewe.jpg
shghgssd_opopo.jpg
sdsdsdj_weuwie.jpg
sdsdsdj_12143.jpg
sdsdsdj_eteyyw.jpg

and I need the sum of their vectors, but only over files whose names start with the same prefix, up to the "_" character, such as
shghgssd_1212.jpg
shghgssd_ewewe.jpg
shghgssd_opopo.jpg

and likewise these:

sdsdsdj_weuwie.jpg
sdsdsdj_12143.jpg
sdsdsdj_eteyyw.jpg

which also share a prefix. I need to save the sum of the vectors for each such group.

I tried:

def extract_features(directory):    
    results = []
    names = [] 
    for name in listdir(directory):    
        filename=directory + '/' + name
        names.append(name)
    for name in listdir(directory):
        for n in names:
            if (name is n):
                image= load_img(filename, target_size=(224, 224,3))
                vec=model(preprocess(image).unsqueeze(0).cuda())
                vec1 = vec.sum(dim=0).cpu().detach().numpy()
                for ind in range(batch_size):
                    results.append(vec1[ind])
    return results

but nothing is extracted, because I cannot pick out the file names that start with the same prefix.

Is there a better solution?

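Here is an example that groups the files correctly in a dictionary: each key is the part of the name before the underscore (_), and each value is the list of paths of the files that share that prefix.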

from pathlib import Path
from itertools import groupby
from collections import defaultdict

path = Path("path/to/dir")


def group_files(directory: Path, ext="jpg"):  # We must use a pathlib object here
    # We can glob the pathlib object with our preferred extension
    list_of_files = list(directory.glob(f"**/*.{ext}"))  # Keep the full Path objects
    file_dict = defaultdict(list)  # File storage in a dict
    keyfunc = lambda p: p.name.split("_")[0]  # Split the file name on '_' and take the first part
    data = sorted(list_of_files, key=keyfunc)  # groupby needs the input sorted by the same key
    for k, g in groupby(data, keyfunc):
        file_dict[k] = list(g)
    return file_dict


dict_of_files = group_files(path)

# Now we have something like
# {"shghgssd" : ["shghgssd_1212.jpg", "shghgssd_ewewe.jpg", "shghgssd_opopo.jpg"]}
# But with full paths, not printed for brevity
# This means that you can iterate over the keys and values and run some operation per group

sums_of_vecs = []
for key, value in dict_of_files.items():  # key is a string, value is a list
    print(f"Treating the files with the {key}_... prefix")
    for filepath in value:
        # DO YOUR COMPUTATION HERE
        # APPEND TO RESULTS
        pass  # nothing happens here...
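
To make the placeholder loop above concrete, here is a minimal sketch of the per-prefix sum. It assumes a hypothetical `embed` callable that turns one file path into a flat sequence of numbers; in the question that would be a wrapper around `load_img`, `preprocess` and `model`. `fake_embed` is only a stand-in so the sketch runs without a model, and `sum_vectors_by_prefix` is a name made up for this example.

```python
from pathlib import Path


def sum_vectors_by_prefix(paths, embed):
    """Return {prefix: element-wise sum of the vectors of all files with that prefix}.

    `embed` is a hypothetical callable: path -> sequence of numbers.
    """
    sums = {}
    for path in map(Path, paths):
        prefix = path.name.split("_")[0]  # part of the file name before the first '_'
        vec = [float(x) for x in embed(path)]
        if prefix in sums:
            # Add this file's vector to the running sum of its group.
            sums[prefix] = [a + b for a, b in zip(sums[prefix], vec)]
        else:
            sums[prefix] = vec
    return sums


def fake_embed(path):
    # Placeholder feature, NOT a real model: [1, length of the file stem].
    return [1.0, float(len(path.stem))]


files = [
    "shghgssd_1212.jpg", "shghgssd_ewewe.jpg", "shghgssd_opopo.jpg",
    "sdsdsdj_weuwie.jpg", "sdsdsdj_12143.jpg", "sdsdsdj_eteyyw.jpg",
]
sums = sum_vectors_by_prefix(files, fake_embed)
for prefix, total in sums.items():
    print(prefix, total)
```

If `embed` returns NumPy arrays or tensors instead of plain sequences, the list comprehension can probably be replaced by a single `+`, and the result is one summed vector per prefix rather than one flat `results` list.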

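Note that if you control how the files are created, it would be better to put each batch you plan to process together in its own directory, and then load them with Keras's image_dataset_from_directory.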
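
If the files already exist in one flat directory, a one-off reorganisation along these lines might get you to that layout. This is only a sketch: `split_into_prefix_dirs` is a hypothetical helper, it assumes the prefix is everything before the first underscore, and it moves files in place, so try it on a copy first.

```python
import shutil
from pathlib import Path


def split_into_prefix_dirs(directory: Path, ext: str = "jpg") -> None:
    """Move every `<prefix>_<rest>.<ext>` file into `directory/<prefix>/`."""
    # Materialise the listing first, since we modify the directory while looping.
    for file in list(directory.glob(f"*.{ext}")):
        if "_" not in file.name:
            continue  # no prefix to group on, leave the file where it is
        prefix = file.name.split("_")[0]
        target_dir = directory / prefix
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(file), str(target_dir / file.name))


# Example call (the path is a placeholder):
# split_into_prefix_dirs(Path("path/to/dir"))
```

After this, each prefix has its own subdirectory. Keep in mind that directory-based loaders such as image_dataset_from_directory treat each subdirectory as a class label by default.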