How to iterate through a folder and get certain files grouped by subfolder?

Question

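I have a folder with multiple subfolders, and each subfolder contains 3-4 files that I need. I am trying to iterate through the folder and put all the files from each subfolder into a dictionary, which is later dumped to a JSON file.

So far I have managed to do this for individual files, and the JSON file looks like this:

Here is the code: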

import os
import json
myDir = r"\\iads011n\ContinuousTesting\DailyTesting\REPORTS"  # raw string keeps the backslashes of the UNC path
filelist = []
for path, subdirs, files in os.walk(myDir):
    for file in files:
        if (file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.XLS')) and "Release" in file and "Integrated" not in file:
            filelist.append(os.path.join(file))

myDict = dict(zip(range(len(filelist)), filelist))

result=[]
for k,v in myDict.items():
    result.append({'id' : k, 'name' : v})

with open('XLList.json', 'w') as json_file:
    json.dump(result, json_file)

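But what I am trying to achieve is:

This is the folder:

The contents of one of the subfolders look like this:

So basically I need to group all the xls/xlsx files that sit under the same subfolder. The main problem is that not all subfolders contain the same items: some may have only one xlsx file, others 3 or 4, and so on.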

Answer 1

The problem is that you are not "storing" which folder each file belongs to. A solution would be:

result = []
for i, (path, subdirs, files) in enumerate(os.walk(myDir)): #use enumerate to track folder id
    subdir = {"id": i}
    j = 0 #file counter in subfolder
    for file in files:
        if (file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.XLS')) and "Release" in file and "Integrated" not in file:
            subdir[f"name{j}"] = file
            j += 1
    result.append(subdir)
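
To try the grouping end to end, here is a minimal self-contained sketch. It assumes a throwaway directory tree created with tempfile; the folder and file names (run_a, Release_1.xlsx, and so on) and the helper names is_wanted and group_by_folder are made up for illustration and are not part of the original code.

```python
import json
import os
import tempfile

def is_wanted(name):
    """Same filter as above: Excel file with "Release" but not "Integrated" in its name."""
    return (name.endswith((".xlsx", ".xls", ".XLS"))
            and "Release" in name and "Integrated" not in name)

def group_by_folder(root):
    """Return one dict per folder visited by os.walk: {"id": i, "name0": ..., ...}."""
    result = []
    for i, (path, subdirs, files) in enumerate(os.walk(root)):
        subdir = {"id": i}
        for j, name in enumerate(sorted(f for f in files if is_wanted(f))):
            subdir[f"name{j}"] = name
        result.append(subdir)
    return result

# Build a small hypothetical tree to run the function on
with tempfile.TemporaryDirectory() as root:
    tree = {"run_a": ["Release_1.xlsx", "Integrated_Release.xlsx"],
            "run_b": ["Release_2.xls", "notes.txt"]}
    for folder, names in tree.items():
        os.makedirs(os.path.join(root, folder))
        for name in names:
            open(os.path.join(root, folder, name), "w").close()
    grouped = group_by_folder(root)

print(json.dumps(grouped, indent=2))
```

Note that os.walk also visits the top-level folder itself, so the first entry here holds only an id. The order in which the subfolders are visited is not guaranteed.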

Edit

To ignore folders that contain no useful files:

result = []
i = 0 #track folder id manually
for path, subdirs, files in os.walk(myDir): 
    subdir = {}
    j = 0 #file counter in subfolder
    for file in files:
        if (file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.XLS')) and "Release" in file and "Integrated" not in file:
            subdir[f"name{j}"] = file
            j += 1
    if len(subdir) > 0:
        subdir = {"id": i, **subdir} #add the id while keeping the collected names
        result.append(subdir)
        i += 1 #increase counter
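
As a possible variation, the sketch below keeps the folder name next to a list of its files instead of numbered name0, name1, ... keys, which can be easier to consume when the number of files varies per folder. This layout is an assumption and not what the question asked for; the function name group_nonempty, the "folder" and "files" keys, and the sample paths are hypothetical. The extension check is also made case-insensitive here, which is slightly broader than the original condition.

```python
import os

def group_nonempty(walk):
    """walk: any iterable of (path, subdirs, files) triples, e.g. os.walk(myDir)."""
    result = []
    for path, subdirs, files in walk:
        wanted = [f for f in files
                  if f.lower().endswith((".xlsx", ".xls"))
                  and "Release" in f and "Integrated" not in f]
        if wanted:  # skip folders with no matching files
            result.append({"id": len(result),  # ids stay consecutive
                           "folder": os.path.basename(path),
                           "files": wanted})
    return result

# Hand-written triples standing in for os.walk output (hypothetical names)
fake_walk = [
    ("reports", ["run_a", "run_b"], []),
    ("reports/run_a", [], ["Release_1.xlsx", "Integrated_Release.xlsx"]),
    ("reports/run_b", [], ["notes.txt"]),
]
grouped = group_nonempty(fake_walk)
print(grouped)
```

On real data the call would be group_nonempty(os.walk(myDir)).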

Answer 2

Updated solution that separates the files of each subfolder into individual objects:

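...  # other code before it

import re
results = []
for path, subdirs, files in os.walk(myDir):
    folder_id = your_algo_to_get_id()  # placeholder: supply your own way of producing an id
    data = {'id': folder_id}
    # i counts every file in the folder, so the name indices can have gaps
    for i, file in enumerate(files):
        # case-insensitive: "release" in the name, .xls/.xlsx extension, no "integrated"
        if (re.search(r"[\s\S]*release[\s\S]*\.xlsx?$", file, re.I)
                and "integrated" not in file.lower()):
            data[f'name{i}'] = file
    results.append(data)  # output: [{'id': 0, 'name1': '...', ...}, {'id': 1, 'name1': '...'}, ...]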

Old solution

Assuming id = 0:

result = {'id': 0}
for i, filename in enumerate(filelist):
    result[f'name{i}'] = filename

The JSON output of result would be:

{
  "id": 0,
  "name0": "some-filename.xlsx",
  "name1": "some-filename.xlsx",
  "name2": "some-filename.xlsx",
  "name3": "some-filename.xlsx",
  ...
}

enumerate is a Python built-in function. You can also start counting from 1 if you don't want a name0 key:

result = {'id': 0}
for i, filename in enumerate(filelist, 1):
    ...
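
Filled in, the loop with a start value of 1 might look like the following sketch; the two entries in filelist are hypothetical sample names.

```python
filelist = ["a.xlsx", "b.xls"]  # hypothetical sample names

result = {"id": 0}
for i, filename in enumerate(filelist, 1):  # second argument is the start value
    result[f"name{i}"] = filename

print(result)  # {'id': 0, 'name1': 'a.xlsx', 'name2': 'b.xls'}
```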

A suggestion for your code:

for path, subdirs, files in os.walk(myDir):
    for file in files:
        if (file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.XLS')) and "Release" in file and "Integrated" not in file:
            filelist.append(os.path.join(file))

I would suggest matching the file names with the regular expression r"xlsx?$" and the ignore-case flag, so that a single condition handles all cases:

import re

test_filenames = ["sample-name.XLSX", "sample-name.xlsx", "sample-name.xls", "sample-name.XLS"]
for filename in test_filenames:
    if re.search(r"xlsx?$", filename, re.I):
        print(filename, "matches")  # every name in this list matches
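
One caveat: r"xlsx?$" contains no dot, so it also matches any name that merely ends in the letters "xls". A slightly stricter sketch, assuming the goal is to match only the file extension, escapes the dot. The constant name EXCEL_RE and the sample names are hypothetical.

```python
import re

# Escaped dot: match only a real .xls / .xlsx extension, in any letter case
EXCEL_RE = re.compile(r"\.xlsx?$", re.IGNORECASE)

names = ["report.XLSX", "report.xls", "report.xlsm", "notes_xls", "report.xlsx.bak"]
matches = [n for n in names if EXCEL_RE.search(n)]
print(matches)  # ['report.XLSX', 'report.xls']
```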


