Reading multiple files sequentially, group-wise

Question

I have the following data in a directory:

 IU.WRT.00.MTR.1999.081.081015.txt
 IU.WRT.00.MTS.2007.229.022240.txt
 IU.WRT.00.MTR.2007.229.022240.txt
 IU.WRT.00.MTT.1999.081.081015.txt
 IU.WRT.00.MTS.1999.081.081015.txt
 IU.WRT.00.MTT.2007.229.022240.txt

First, I want to group the data by the shared pattern of three files (which differ only in R, S, T), like this:

IU.WRT.00.MTR.1999.081.081015.txt
IU.WRT.00.MTS.1999.081.081015.txt
IU.WRT.00.MTT.1999.081.081015.txt

and apply some operations to them.

Then I want to read the files

IU.WRT.00.MTT.2007.229.022240.txt
IU.WRT.00.MTS.2007.229.022240.txt
IU.WRT.00.MTR.2007.229.022240.txt

and apply similar operations to them.

I want to continue in the same way for millions of datasets.

I tried this sample script:

import os
import glob
import matplotlib.pyplot as plt
from collections import defaultdict

def groupfiles(pattern):
    files = glob.glob(pattern)
    filedict = defaultdict(list)
    for file in files:
        parts = file.split(".")
        filedict[".".join([parts[5], parts[6], parts[7]])].append(file)
    for filegroup in filedict.values():
        yield filegroup
 
for relatedfiles in groupfiles('*.txt'):
    print(relatedfiles)

    for filename in relatedfiles:
        print(filename)

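However, it reads the files one at a time, whereas I need to read three files at a time (that is, following the sorted order, it should read the first three files, then the next three, and so on). I hope someone can help. Thanks in advance.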

Answer 1

Sort the list of files by multiple keys.

import os
files = [f for f in os.listdir("C:/username/folder") if f.endswith(".txt")]
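# Sort by fields 4-6 (1999.081.081015 in the first name), then by field 3 (MTR/MTS/MTT).
# Field 6 is included so that two recordings from the same day do not interleave.
grouped = sorted(files, key=lambda x: (x.split(".")[4:7], x.split(".")[3]))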

>>> grouped
['IU.WRT.00.MTR.1999.081.081015.txt',
 'IU.WRT.00.MTS.1999.081.081015.txt',
 'IU.WRT.00.MTT.1999.081.081015.txt',
 'IU.WRT.00.MTR.2007.229.022240.txt',
 'IU.WRT.00.MTS.2007.229.022240.txt',
 'IU.WRT.00.MTT.2007.229.022240.txt']
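The same sort can be written with a named key function, which makes the field positions easier to check. This is a minimal sketch, assuming every name has the eight dot-separated fields shown in the question; `sort_key` is a hypothetical helper name, not part of any library.

```python
def sort_key(filename):
    # Split 'IU.WRT.00.MTR.1999.081.081015.txt' into its dot-separated fields.
    parts = filename.split(".")
    # parts[4:7] -> ['1999', '081', '081015'] identifies the group;
    # parts[3]   -> 'MTR' / 'MTS' / 'MTT' orders the files inside a group.
    return (parts[4:7], parts[3])

# File names in the order they are listed in the question.
files = [
    "IU.WRT.00.MTR.1999.081.081015.txt",
    "IU.WRT.00.MTS.2007.229.022240.txt",
    "IU.WRT.00.MTR.2007.229.022240.txt",
    "IU.WRT.00.MTT.1999.081.081015.txt",
    "IU.WRT.00.MTS.1999.081.081015.txt",
    "IU.WRT.00.MTT.2007.229.022240.txt",
]

grouped = sorted(files, key=sort_key)
for name in grouped:
    print(name)
```

Because the fields are zero-padded strings, plain string comparison puts them in chronological order for names like these.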

Use the grouper recipe from the itertools documentation to iterate over the sorted list in groups of three.

from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

for f in grouper(grouped, 3):  # f is a tuple of three file names
    # your file operations here; a short final chunk is padded with None
    print(f)
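Chunking in fixed threes assumes every group is complete. If one of the R/S/T files could be missing, grouping by the shared fields avoids shifting all later groups. The following is a sketch of that alternative using `itertools.groupby`, not the method above; `event_key` and `group_by_event` are hypothetical names, and the field positions are assumed from the names in the question.

```python
from itertools import groupby

def event_key(filename):
    # Fields 4..6 of the dot-separated name, e.g. ['1999', '081', '081015'].
    return filename.split(".")[4:7]

def group_by_event(filenames):
    # groupby only merges adjacent items, so sort on the same key first;
    # the full name is the tie-breaker, giving MTR, MTS, MTT order.
    ordered = sorted(filenames, key=lambda f: (event_key(f), f))
    for key, members in groupby(ordered, key=event_key):
        yield ".".join(key), list(members)

names = [
    "IU.WRT.00.MTR.1999.081.081015.txt",
    "IU.WRT.00.MTS.2007.229.022240.txt",
    "IU.WRT.00.MTR.2007.229.022240.txt",
    "IU.WRT.00.MTT.1999.081.081015.txt",
    "IU.WRT.00.MTS.1999.081.081015.txt",
    "IU.WRT.00.MTT.2007.229.022240.txt",
]

groups = list(group_by_event(names))
for event, members in groups:
    # Open and process the files of one group together here.
    print(event, members)
```

Each yielded list holds however many files share the key, so a group with only two files stays a group of two instead of borrowing a file from the next one.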
