Find files with the same initial part of the name in two folders
I read the files in two folders with listdir:
from os import listdir
list_1 = [file for file in listdir("./folder1/") if file.endswith(".csv")]
list_2 = [file for file in listdir("./folder2/") if file.endswith(".json")]
Now I have two lists:
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
I want to find the two corresponding sub-lists containing the files whose names start with the same part. In other words:
list_1b = ['12_a1_pp.csv', '32_a3_pp.csv']
list_2b = ['12_a1.json', '32_a3.json']
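How can I do this?
PS: Note that the listdir part is probably not essential to answering the question. I only included it because, if the results of listdir are guaranteed to be in alphabetical order, that might help when iterating over the two lists. Of course, in this simple example the lists are short, but in the real use case they contain hundreds of files.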
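On the ordering question: the Python documentation says os.listdir returns entries in arbitrary order, so alphabetical order should not be assumed. If sorted input is wanted, a minimal sketch is to sort explicitly. The helper name list_files is hypothetical and not part of the question's code.

```python
from os import listdir


def list_files(folder, extension):
    """Return the names in `folder` that end with `extension`, sorted by name.

    `list_files` is a hypothetical helper introduced for illustration.
    """
    # os.listdir makes no ordering guarantee, so sort the result explicitly.
    return sorted(name for name in listdir(folder) if name.endswith(extension))
```

It would be called as list_files("./folder1/", ".csv"). The set-based answers below do not need sorted input, so this only matters if the two lists are walked in step.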
Here is one way, using dictionary comprehensions and set.intersection.
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
start_1 = {k: '_'.join(k.split('_')[:-1]) for k in list_1}
start_2 = {k: k.split('.')[0] for k in list_2}
start_intersect = set(start_1.values()) & set(start_2.values())
list_1b = [k for k, v in start_1.items() if v in start_intersect]
list_2b = [k for k, v in start_2.items() if v in start_intersect]
This also works if the file names end in "_XY.csv" for any "XY". It relies on the format of the file name rather than on individual characters.
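To illustrate the point about the "_XY.csv" format, here is a small sketch that applies the same split/join expression to names with different suffixes. The file names and the function name stem_before_last_underscore are made up for the example.

```python
def stem_before_last_underscore(name):
    """Drop the last '_'-separated piece, e.g. '12_a1_pp.csv' -> '12_a1'."""
    return '_'.join(name.split('_')[:-1])


# Hypothetical file names with different two-letter suffixes.
names = ['12_a1_pp.csv', '32_a3_qq.csv', '45_a17_zz.csv']
stems = [stem_before_last_underscore(n) for n in names]
print(stems)  # ['12_a1', '32_a3', '45_a17']
```

The suffix itself is never inspected. Only the position of the last underscore matters.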
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
list_1_C = [i.split(".")[0].replace("_pp", "") for i in list_1] #Check List
list_2_C = [i.split(".")[0] for i in list_2] #Check List
print([list_1[i] for i, v in enumerate(list_1_C) if v in list_2_C])
print([list_2[i] for i, v in enumerate(list_2_C) if v in list_1_C])
Output:
['12_a1_pp.csv', '32_a3_pp.csv']
['12_a1.json', '32_a3.json']
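Since the question mentions hundreds of files, note that a test like `v in list_2_C` scans a list each time. A possible variation, sketched here with the same stem logic, keeps sets next to the lists for membership tests. The names stems_1, stems_2, set_1 and set_2 are introduced only for this example.

```python
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

# Same stems as in the answer above, one per file name.
stems_1 = [name.split(".")[0].replace("_pp", "") for name in list_1]
stems_2 = [name.split(".")[0] for name in list_2]

# Sets give constant-time lookups on average instead of a list scan.
set_1, set_2 = set(stems_1), set(stems_2)

list_1b = [name for name, stem in zip(list_1, stems_1) if stem in set_2]
list_2b = [name for name, stem in zip(list_2, stems_2) if stem in set_1]
print(list_1b)  # ['12_a1_pp.csv', '32_a3_pp.csv']
print(list_2b)  # ['12_a1.json', '32_a3.json']
```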
It is simple once you think about it, so here it is:
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']
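list_1b, list_2b = [], []
# Stems of the .json names, e.g. '12_a1'
starters = [eachfile.partition(".")[0] for eachfile in list_2]
for eachelement in starters:
    for eachfile in list_1:
        # Plain prefix test against each .csv name
        if eachfile.startswith(eachelement):
            list_1b.append(eachfile)
            list_2b.append(eachelement + ".json")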
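One caveat with a bare startswith test: a stem such as '12_a1' is also a prefix of '12_a12_pp.csv'. A possible refinement, assuming the .csv names always have an underscore right after the shared stem, is to include that separator in the prefix. The extra file '12_a12_pp.csv' is made up to show the difference.

```python
# '12_a12_pp.csv' is a made-up extra entry that a bare prefix test would match.
list_1 = ['12_a1_pp.csv', '12_a12_pp.csv', '32_a3_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']

list_1b, list_2b = [], []
for json_name in list_2:
    stem = json_name.partition(".")[0]
    # Require the separator right after the stem so '12_a1' does not match '12_a12'.
    matches = [csv_name for csv_name in list_1 if csv_name.startswith(stem + "_")]
    if matches:
        list_1b.extend(matches)
        list_2b.append(json_name)

print(list_1b)  # ['12_a1_pp.csv', '32_a3_pp.csv']
print(list_2b)  # ['12_a1.json', '32_a3.json']
```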
Also, if you want something specific to this case:
collective_set_1 = { each.replace("_pp.csv","") for each in list_1}
collective_set_2 = { each.replace(".json","") for each in list_2}
intersection = collective_set_1.intersection(collective_set_2)
list_1b = [ each+"_pp.csv" for each in intersection ]
list_2b = [ each+".json" for each in intersection ]
A more Pythonic way is to use the & (intersection) operator on sets:
common = set(x[:-7] for x in list_1) & set(x[:-5] for x in list_2)  # strip '_pp.csv' (7 chars) and '.json' (5 chars)
list_1b = [x + '_pp.csv' for x in common]
list_2b = [x + '.json' for x in common]
Edit: if you need to split each list on a specific character (see the comments), here is an updated version (it searches for the last '_' in list_1 and the last '.' in list_2):
common = set(x[:x.rindex('_')] for x in list_1) & set(x[:x.rindex('.')] for x in list_2)
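For completeness, here is a sketch of how the rindex version might be turned into the two result lists. It filters the original lists instead of re-appending fixed suffixes, so the input order is kept and no particular suffix is assumed. The helper names csv_stem and json_stem are introduced only for this example.

```python
list_1 = ['12_a1_pp.csv', '32_a3_pp.csv', '45_a17_pp.csv', '81_a123_pp.csv']
list_2 = ['12_a1.json', '32_a3.json', '61_a54.json']


def csv_stem(name):
    """Everything before the last '_' (str.rindex raises ValueError if there is none)."""
    return name[:name.rindex('_')]


def json_stem(name):
    """Everything before the last '.' (str.rindex raises ValueError if there is none)."""
    return name[:name.rindex('.')]


common = {csv_stem(x) for x in list_1} & {json_stem(x) for x in list_2}

# Filter the original lists so order and exact file names are preserved.
list_1b = [x for x in list_1 if csv_stem(x) in common]
list_2b = [x for x in list_2 if json_stem(x) in common]
print(list_1b)  # ['12_a1_pp.csv', '32_a3_pp.csv']
print(list_2b)  # ['12_a1.json', '32_a3.json']
```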