How to unzip files from multiple directories at the same time?

Question

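I have 13 directories, each containing about 30 very large zipped files. With the code below I create a copy of the directory structure, unzip the files, and rename them. My big problem is that it runs slowly. Doing all of this for one directory takes about 6 or 7 minutes, so for all directories I need about 7 x 13 = 91 minutes.

Is there an option to speed this up? If the program could process the 13 directories in parallel, all 13 might take only about 7 minutes. I have read about multiprocessing, but I don't know how to implement it in my current code.

This is my code: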

import os
import fnmatch
import zipfile
import re
import datetime

pattern = '*.zip'
for root, dirs, files in os.walk(data_files): 
    for filename in fnmatch.filter(files, pattern): 

        path = os.path.join(root, filename)
        date_zipped_file_s = re.search(r'-(.\d+)-', filename).group(1)
        date_zipped_file = datetime.datetime.strptime(date_zipped_file_s, '%Y%m%d').date()      
              
        #Create the new directory location
        new_dir = os.path.normpath(os.path.join(os.path.relpath(path, start=data_files), ".."))

        #Join the directory names counter_part and create their paths.
        new = os.path.join(counter_part, new_dir)

        #Create the directories
        if (not os.path.exists(new)):
            os.makedirs(new)
        zipfile.ZipFile(path).extractall(new) 
    
        #Get all the extracted files
        files = os.listdir(new)
    
        #Rename all the files in the created directories
        for file in files:
            filesplit = os.path.splitext(os.path.basename(file))
            if not re.search(r'_\d{8}.', file):
                os.rename(os.path.join(new, file), os.path.join(new, filesplit[0]+'_'+date_zipped_file_s+filesplit[1]))

Answer 1

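No, you should not expect multiprocessing to cut the running time by a factor of 13. For that you would probably need a machine with (1) at least 13 physical cores that are not running any other processes, and (2) some type of solid-state drive that can handle at least 13 concurrent I/O requests without any noticeable degradation in the response time of any of them.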
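As a rough starting point for the pool size, the number of workers can be capped at the number of CPUs the machine reports. The sketch below is only an illustration: `choose_pool_size` is a hypothetical helper, not something the multiprocessing module provides, and `os.cpu_count()` reports logical CPUs, which may be more than the number of physical cores.

```python
import os

def choose_pool_size(n_directories, cpu_count=None):
    """Hypothetical helper: use one worker per directory, capped at the CPU count."""
    if cpu_count is None:
        # os.cpu_count() returns logical CPUs and may return None
        cpu_count = os.cpu_count() or 1
    return max(1, min(n_directories, cpu_count))

print(choose_pool_size(13))
```

Even with enough cores, the disk may still be the bottleneck, so the value this returns is only an upper bound worth trying, not a guarantee of a matching speedup.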

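How much improvement, if any, will you get? There is only one way to find out. The following code processes all the input zip file names and creates 13 lists of (root, filename) tuples, where each list has a root value that differs from that of every other list. A multiprocessing pool of size 13 is created, and each process in the pool is given one of these 13 lists to process.

You need to update the definitions of data_files and counter_part in the following code: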

import os
import fnmatch
import zipfile
import re
import datetime
from multiprocessing import Pool

# Change the following two lines to real paths.
# Both are defined at module level so that every worker process can see them:
data_files = '?'
counter_part = '?'

def generate_file_lists():
    """
    Yield one list of (root, filename) tuples per directory that
    contains zip files; root is the same for all tuples in a list.
    """
    pattern = '*.zip'
    for root, dirs, files in os.walk(data_files):
        # os.walk visits each directory exactly once, so every
        # iteration produces the complete list for that root.
        file_list = [(root, filename) for filename in fnmatch.filter(files, pattern)]
        if file_list:
            yield file_list

def unzip(file_list):
    """
    file_list is a list of (root, filename) tuples where
    root is the same for all tuples.
    """
    for root, filename in file_list:
        path = os.path.join(root, filename)
        date_zipped_file_s = re.search(r'-(.\d+)-', filename).group(1)
        # Parsing the date also checks that it has the expected format
        date_zipped_file = datetime.datetime.strptime(date_zipped_file_s, '%Y%m%d').date()

        # Create the new directory location
        new_dir = os.path.normpath(os.path.join(os.path.relpath(path, start=data_files), ".."))

        # Join the directory names counter_part and create their paths.
        new = os.path.join(counter_part, new_dir)

        # Create the directories
        os.makedirs(new, exist_ok=True)
        with zipfile.ZipFile(path) as archive:
            archive.extractall(new)

        # Get all the extracted files
        extracted = os.listdir(new)

        # Rename all the files in the created directories
        for file in extracted:
            filesplit = os.path.splitext(os.path.basename(file))
            if not re.search(r'_\d{8}.', file):
                os.rename(os.path.join(new, file), os.path.join(new, filesplit[0]+'_'+date_zipped_file_s+filesplit[1]))

# Required for Windows:
if __name__ == '__main__':
    with Pool(13) as pool:
        pool.map(unzip, generate_file_lists())
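
To see how much improvement you actually get, you could time the run for a few pool sizes and compare the results. A minimal sketch, assuming a hypothetical helper `time_call` (not part of the code above) that wraps any callable and reports the elapsed wall-clock time:

```python
import time

def time_call(func, *args, **kwargs):
    """Hypothetical helper: call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Possible usage inside the `if __name__ == '__main__':` block, assuming
# `unzip`, `generate_file_lists` and `Pool` from the answer are available:
#
#     file_lists = list(generate_file_lists())
#     for size in (1, 4, 13):
#         with Pool(size) as pool:
#             _, seconds = time_call(pool.map, unzip, file_lists)
#         print(size, 'processes:', round(seconds, 1), 'seconds')
```

Each timed run should write to an empty counter_part directory. Extracting over the output of an earlier run would skew the timings, and the rename step may fail on Windows if the target file already exists.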
