Efficient way to rebuild a dictionary of dataframes

Question

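I have a dictionary filled with multiple dataframes. I am looking for an efficient way to change its key structure, but the solution I found is rather slow once more dataframes, or larger ones, are involved. So I would like to ask whether anyone knows a more convenient, more efficient, or faster approach than mine. First, I created this example to show where I started: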

import pandas as pd
import numpy as np

# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}

# fill dic with random entries
for t1 in teams:

    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30), 
                                  'Goals': pd.Series(np.random.randint(0,5, size = 30)),
                                  'Chances': pd.Series(np.random.randint(0,15, size = 30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size = 30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size = 30))})

    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None

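I now have a dictionary in which each key is a team, so there is one dataframe per team holding its match performance over a period of time. I would rather restructure this dictionary so that the keys are dates instead of teams: one dataframe per date, containing every team's performance on that date. I managed to do this with the following code, which works but becomes really slow once I add more teams and performance factors: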

# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index = teams, columns = perf)

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

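Because I am using a nested loop, restructuring the dictionary is slow. Does anyone know how I could improve the second piece of code? I am not necessarily looking only for a solution, but also for the logic or an idea of how to do it better.

Thanks in advance, any help is much appreciated.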

Answer 1

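Creating Pandas dataframes the way you do is (strangely) very slow, and so is direct indexing.

Copying a dataframe, on the other hand, is surprisingly fast. You can therefore create one empty reference dataframe and copy it as many times as needed. Here is the code: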

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

This is about 2 times faster than the reference (the code in the question) on my machine.
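
To check the construct-versus-copy claim on your own machine, a minimal timing sketch could look like the following. The column names are taken from the question; the repetition count `n` is an arbitrary choice for this sketch, and the absolute numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]

# Empty template dataframe, as in the answer above
zygote = pd.DataFrame(index=teams, columns=perf)

n = 1000  # number of repetitions (arbitrary choice for this sketch)

# Time building a fresh empty dataframe versus copying the template
t_construct = timeit.timeit(lambda: pd.DataFrame(index=teams, columns=perf), number=n)
t_copy = timeit.timeit(lambda: zygote.copy(), number=n)

print(f"construct: {t_construct / n * 1e6:.1f} us per call")
print(f"copy:      {t_copy / n * 1e6:.1f} us per call")
```

This only measures the cost of creating the empty frame. The `.loc` assignments in the inner loop are a separate cost and are not covered here.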

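Overcoming the slow direct indexing of dataframes is tricky. We can use numpy to do that: convert the dataframes to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices back to dataframes. Note that this approach assumes that all values are integers and that the input dataframes are well structured (same dates and columns, in the same order).

Here is the final implementation: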

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# Create a numpy array from Pandas dataframes
# Assume the `dates` index and the `perf` columns are identical, and in the same order, in all dataframes
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId,tName in enumerate(teams):
    full[tId,:,:] = dic_teams[tName].to_numpy()

# New structure where key = date, created from the numpy array
for dId,dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:,dId,pId] for pId,pName in enumerate(perf)}, index = teams)

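This implementation is 6.4 times faster than the reference on my machine. Note that, sadly, about 75% of the time is spent in the pd.DataFrame calls. So if you want faster code, use a basic 3D numpy array!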
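
As a rough sketch of what working with a basic 3D numpy array could look like: the data is stacked once, transposed so that the date axis comes first, and then addressed through plain dictionaries that map labels to positions. The random data, the lookup dictionaries (`date_pos`, `team_pos`, `perf_pos`) and the helper `frame_for` are assumptions made for this illustration and are not part of the code above. It relies on the same assumptions as before: integer values, and identical dates and columns in every dataframe.

```python
import numpy as np
import pandas as pd

teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]
dates = pd.date_range("20180101", periods=30)

# Stand-in data with the same layout as in the question (seeded, made up)
rng = np.random.default_rng(0)
dic_teams = {
    t: pd.DataFrame(rng.integers(0, 20, size=(len(dates), len(perf))),
                    index=dates, columns=perf)
    for t in teams
}

# Stack to (team, date, perf), then swap the first two axes -> (date, team, perf)
by_date = np.stack([dic_teams[t].to_numpy() for t in teams]).transpose(1, 0, 2)

# Label -> position lookups take over the role of the pandas indexes
date_pos = {d: i for i, d in enumerate(dates)}
team_pos = {t: i for i, t in enumerate(teams)}
perf_pos = {p: i for i, p in enumerate(perf)}


def frame_for(d):
    """Build the per-date dataframe only when one is actually needed."""
    return pd.DataFrame(by_date[date_pos[d]], index=teams, columns=perf)


d = pd.Timestamp("2018-01-05")
day = by_date[date_pos[d]]                                   # (teams, perf) view, no copy
chelsea_fouls = day[team_pos["Chelsea"], perf_pos["Fouls"]]  # scalar lookup
print(frame_for(d))
```

The trade-off is that labels are no longer attached to the data, so the position dictionaries have to be kept in sync with the array by hand. In exchange, no `pd.DataFrame` call is made unless a dataframe is explicitly requested.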
