Importing files in a specific order using glob

Question

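I have a long time series in which each year is archived in a folder named after that year. Within each folder, however, the data is not stored in a single file but split into one file per month.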

For example: 1954 > APR, AUG, DEC ... SEP

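When I import these files with glob and build a DataFrame with pandas, they are read in that same order (as above). What I need instead is the proper calendar sequence of months (JAN, FEB, MAR, ...) so that I can plot and work with the data. So my question is: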

Is there any way to force glob to import the files in a specific order, or to rearrange them afterwards with pandas?

    import glob
    import pandas as pd

    path = r'path'
    allFiles = glob.glob(path+"/*.dtf")

    list_ = []
    for file_ in allFiles:
      # \s+ (one or more whitespace characters) is the usual whitespace separator
      df = pd.read_csv(file_,header = None,sep=r"\s+")
      list_.append(df)
    df = pd.concat(list_)

Thanks.

Answer 1

You can use concat with the keys parameter set to the file names:

The test data is here.

import glob
import pandas as pd

path = r'path-dtfs'
# add /* to also read the year subfolders
allFiles = glob.glob(path+"/*/*.dtf")
print (allFiles)
['path\1954\FEB.dtf', 'path\1954\JAN.dtf', 'path\1955\APR.dtf', 'path\1955\MAR.dtf']

list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,header = None,sep=r"\s+")
    list_.append(df)

Then create the new columns with split and insert. For correct sorting you need an ordered categorical together with sort_values:

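# wrap the chained calls in parentheses so they can span several lines
df = (pd.concat(list_, keys=allFiles)
        .reset_index(level=1, drop=True)
        .rename_axis('years')
        .reset_index())

# normalize Windows separators, then split the path into its parts
s = df['years'].str.replace('\\', '/', regex=False).str.split('/')
df['years'] = s.str[-2].astype(int)
df.insert(1, 'months', s.str[-1].str.replace('.dtf', '', regex=False))

# list every month in calendar order; names must match the file names
cats = ['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC']
df['months'] = pd.Categorical(df['months'], categories=cats, ordered=True)
df = df.sort_values(['years','months']).reset_index(drop=True)
print (df)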
   years months  0  1  2
0   1954    JAN  0  1  2
1   1954    JAN  1  5  8
2   1954    FEB  0  9  6
3   1954    FEB  1  6  4
4   1955    MAR  5  6  8
5   1955    MAR  4  7  9
6   1955    APR  0  3  6
7   1955    APR  1  4  1
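
The ordered-categorical idea can be tried in isolation. The following is only a minimal sketch on made-up data (the `value` column and the toy rows are assumptions, not the asker's files), using `pd.Categorical`, which is the usual way to build an ordered categorical in recent pandas versions:

```python
import pandas as pd

# Hypothetical toy rows standing in for the concatenated monthly files.
df = pd.DataFrame({
    "years": [1955, 1954, 1955, 1954],
    "months": ["APR", "FEB", "MAR", "JAN"],
    "value": [4, 2, 3, 1],
})

# Calendar order of the month labels; they must match the file names.
month_order = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# An ordered categorical sorts by position in `month_order`, not alphabetically.
df["months"] = pd.Categorical(df["months"], categories=month_order, ordered=True)
df_sorted = df.sort_values(["years", "months"]).reset_index(drop=True)
print(df_sorted)
```

Without the categorical, sorting the plain strings would put APR before FEB and JAN.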

Another solution is to create a datetime column with str.extract and to_datetime:

df = (pd.concat(list_, keys=allFiles)
        .reset_index(level=1, drop=True)
        .rename_axis('dates')
        .reset_index())
# capture e.g. 1954\FEB from path\1954\FEB.dtf (Windows-style separators)
df['dates'] = df['dates'].str.extract(r'path\\(.*)\.dtf', expand=False)
df['dates'] = pd.to_datetime(df['dates'], format='%Y\\%b')
df = df.sort_values('dates').reset_index(drop=True)
print (df)
       dates  0  1  2
0 1954-01-01  0  1  2
1 1954-01-01  1  5  8
2 1954-02-01  0  9  6
3 1954-02-01  1  6  4
4 1955-03-01  5  6  8
5 1955-03-01  4  7  9
6 1955-04-01  0  3  6
7 1955-04-01  1  4  1
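
The same idea can be written so that it does not depend on the path separator. This is a rough sketch under a few assumptions: the sample paths are invented, the month part of the file name is a three-letter English abbreviation, and `%b` is parsed in an English locale. The month is title-cased first so it matches the "Jan", "Feb" form that `%b` expects:

```python
import pandas as pd

# Hypothetical file paths; either "/" or "\" may appear as separator.
paths = pd.Series(["path/1955/APR.dtf", "path/1954/FEB.dtf", "path/1954/JAN.dtf"])

# Two capture groups: the 4-digit year folder and the 3-letter month file stem.
parts = paths.str.extract(r"(\d{4})[\\/]([A-Za-z]{3})\.dtf$")

# Build strings like "1954-Feb" and parse them as year plus abbreviated month.
dates = pd.to_datetime(parts[0] + "-" + parts[1].str.title(), format="%Y-%b")

# Reorder the paths chronologically.
ordered = paths.loc[dates.sort_values().index].tolist()
print(ordered)
```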

A similar solution uses monthly periods via to_period:

df = (pd.concat(list_, keys=allFiles)
        .reset_index(level=1, drop=True)
        .rename_axis('periods')
        .reset_index())
# capture e.g. 1954\FEB from path\1954\FEB.dtf (Windows-style separators)
df['periods'] = df['periods'].str.extract(r'path\\(.*)\.dtf', expand=False)
df['periods'] = pd.to_datetime(df['periods'], format='%Y\\%b').dt.to_period('M')
df = df.sort_values('periods').reset_index(drop=True)

print (df)
  periods  0  1  2
0 1954-01  0  1  2
1 1954-01  1  5  8
2 1954-02  0  9  6
3 1954-02  1  6  4
4 1955-03  5  6  8
5 1955-03  4  7  9
6 1955-04  0  3  6
7 1955-04  1  4  1
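
As a small illustration of why periods are convenient here, monthly periods sort chronologically and print without a day component. The dates below are made up for the sketch:

```python
import pandas as pd

# Hypothetical first-of-month timestamps, deliberately out of order.
dates = pd.Series(pd.to_datetime(["1955-03-01", "1954-02-01", "1954-01-01"]))

# Convert to monthly periods; the day is dropped from the representation.
periods = dates.dt.to_period("M")

sorted_periods = periods.sort_values().astype(str).tolist()
print(sorted_periods)
```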

Answer 2

You can use a function as the key when sorting a list.

Assume your list of files, allFiles, is (thanks to @jezrael for the sample list):

allFiles = ['path/1954/FEB.dtf', 'path/1954/JAN.dtf',
            'path/1955/APR.dtf', 'path/1955/MAR.dtf']

Then define your key as

d = dict(JAN=0, FEB=1, MAR=2, APR=3)

def key(path):
    y, m = path.rsplit('.', 1)[0].split('/')[-2:]
    return int(y), d[m]

and use it with Python's sorted function:

sorted(allFiles, key=key)

['path/1954/JAN.dtf',
 'path/1954/FEB.dtf',
 'path/1955/MAR.dtf',
 'path/1955/APR.dtf']

Or you can sort the list in place with

allFiles.sort(key=key)

When importing, you can use:

pd.concat(
    [pd.read_csv(file_,header = None,sep=r"\s+")
     for file_ in sorted(allFiles, key=key)]
)
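
The key above splits on '/', so it assumes forward slashes. A possible separator-independent variant, sketched here with the standard library only, uses `pathlib.PureWindowsPath`, which accepts both '/' and '\' on any operating system. The names `MONTH_INDEX` and `year_month_key` and the sample paths are made up for illustration:

```python
from pathlib import PureWindowsPath

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# Map each month label to its calendar position (1..12).
MONTH_INDEX = {name: i for i, name in enumerate(MONTHS, start=1)}

def year_month_key(path):
    """Return (year, month number) for a path like 'path/1954/FEB.dtf'."""
    p = PureWindowsPath(path)  # parses both "/" and "\" separators
    return int(p.parent.name), MONTH_INDEX[p.stem.upper()]

# Hypothetical file list mixing both separator styles.
files = ["path/1955/APR.dtf", r"path\1954\FEB.dtf",
         "path/1955/MAR.dtf", r"path\1954\JAN.dtf"]
ordered = sorted(files, key=year_month_key)
print(ordered)
```

A file whose stem is not one of the twelve labels raises a KeyError, which makes naming mistakes visible early.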

Answer 3

When collecting the file names in your code, you can use sorted with the month's index as the key, i.e.

import glob
import os
import pandas as pd

path = r'path'
# month names must match the file names (without extension) exactly
months = ["JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"]
allFiles = sorted(glob.glob(path+"/*.dtf"), key=lambda filename: months.index(os.path.splitext(os.path.basename(filename))[0]))

list_ = []
for file_ in allFiles:
  df = pd.read_csv(file_,header = None,sep=r"\s+")
  list_.append(df)
df = pd.concat(list_)
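
Note that list.index raises ValueError if a file name is not in the months list. If the folder may contain other .dtf files, one option (only a sketch; the helper name `month_position` and the sample file names are hypothetical) is to send unknown names to the end and ignore case:

```python
import os

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def month_position(filename):
    """Calendar position of the file's month; unknown names sort last."""
    stem = os.path.splitext(os.path.basename(filename))[0].upper()
    return MONTHS.index(stem) if stem in MONTHS else len(MONTHS)

# Hypothetical directory listing with one non-month file.
files = ["data/SEP.dtf", "data/notes.dtf", "data/jan.dtf", "data/APR.dtf"]
ordered = sorted(files, key=month_position)
print(ordered)
```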

Hope this helps.
