Calculate Mean Absolute Error for each row of a Pandas dataframe

Question

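Below is a sample of the pandas dataframe I am working with.
I want to calculate the mean absolute error (MAE) for each row, but only over the columns relevant to the value in the ID column.

There can be 2, 4, 6 or 8 columns relevant to the ID value. For example, the relevant columns for 'id4' are 'id4_signal1_true' and 'id4_signal1_pred'. For 'id1' they are 'id1_signal1_true', 'id1_signal2_true', 'id1_signal1_pred' and 'id1_signal2_pred'.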

import numpy as np
import pandas as pd

# named `rows` rather than `list` so the built-in list() stays usable
rows = [['id4',0.37,0.97,0.21,0.54,0.11,0.38,0.95,0.2,0.5,0.23],
        ['id1',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23],
        ['id3',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23]]

df = pd.DataFrame(rows, columns =['ID','id1_signal1_true','id1_signal2_true','id4_signal1_true','id3_signal1_true',
                                  'id3_signal2_true','id1_signal1_pred','id1_signal2_pred','id4_signal1_pred',
                                  'id3_signal1_pred','id3_signal2_pred'])

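I want to compute the MAE for each row using only the relevant columns. For example, for the first row it should be mean(abs(id4_signal1_true - id4_signal1_pred)), and for the second row it should be mean(abs(id1_signal1_true - id1_signal1_pred), abs(id1_signal2_true - id1_signal2_pred)).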
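As a quick sanity check, the two formulas above can be evaluated by hand on the sample values. This is only an illustrative sketch; the helper name `mae` is hypothetical and not part of the original code.

```python
import numpy as np

def mae(true_values, pred_values):
    # Mean of the element-wise absolute differences (hypothetical helper).
    diff = np.asarray(true_values, dtype=float) - np.asarray(pred_values, dtype=float)
    return float(np.mean(np.abs(diff)))

# First row (ID = 'id4'): one signal, so a single true/pred pair.
mae_row1 = mae([0.21], [0.20])

# Second row (ID = 'id1'): two signals, so two true/pred pairs.
mae_row2 = mae([0.41, 0.44], [0.41, 0.48])

print(round(mae_row1, 2), round(mae_row2, 2))  # prints: 0.01 0.02
```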

Below is a screenshot of the expected output. MAE is the column I want to obtain.

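I solved this with the code below, and it works correctly. The only problem is that I have about 2 million rows, so it takes hours. I would like to find a more efficient way to do this. Any help is much appreciated.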

Attack = df
ID_MAE = []
for i in range(len(Attack['ID'])):

    # select the columns whose names contain the current row's ID
    signals = np.array(Attack[Attack.columns[Attack.columns.str.contains(Attack.ID[i])]])
    signal = signals[i]  # keep only the current row

    no_of_signals = int(len(signal)/2)  # number of signals (true/pred pairs)
    reshaped_arr = np.reshape(signal, (2, no_of_signals))
    signal_true = reshaped_arr[0]  # true values (the *_true columns come first)
    signal_pred = reshaped_arr[1]  # predicted values

    # mae calculation
    MAE = np.mean(np.abs(signal_true - signal_pred), axis=0)
    ID_MAE.append(MAE)

df['MAE'] = ID_MAE

Answer 1

Try:

Split the columns into a three-level header
Group by ID and Signal and compute the MAE
Select the correct MAE for each row
Collapse the multi-level header back into a single level.

df = df.set_index("ID").rename_axis(None)
df.columns = df.columns.str.split("_",expand=True)
df = df.rename_axis(["ID","Signal","Type"],axis=1).sort_values(["ID","Signal"],axis=1)
MAE = df.groupby(["ID","Signal"], axis=1).diff().abs().groupby("ID", axis=1).mean()
df.columns = df.columns.map("_".join)

df["MAE"] = df.index.to_series().apply(lambda x: MAE.at[x,x])

>>> df
     id1_signal1_true  id1_signal1_pred  ...  id4_signal1_pred   MAE
id4              0.37              0.38  ...               0.2  0.01
id1              0.41              0.41  ...               0.2  0.02
id3              0.41              0.41  ...               0.2  0.08

[3 rows x 11 columns]
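Note that `groupby(..., axis=1)` is deprecated in recent pandas releases, so depending on the installed version the snippet above may warn or fail. A minimal sketch of the same idea on the transposed frame follows. It assumes the index labels (the IDs) are unique, as in the three-row sample; the names `wide`, `t`, `abs_err` and `mae` are illustrative only.

```python
import pandas as pd

rows = [['id4', 0.37, 0.97, 0.21, 0.54, 0.11, 0.38, 0.95, 0.2, 0.5, 0.23],
        ['id1', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23],
        ['id3', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23]]
cols = ['ID', 'id1_signal1_true', 'id1_signal2_true', 'id4_signal1_true',
        'id3_signal1_true', 'id3_signal2_true', 'id1_signal1_pred',
        'id1_signal2_pred', 'id4_signal1_pred', 'id3_signal1_pred',
        'id3_signal2_pred']
df = pd.DataFrame(rows, columns=cols).set_index("ID").rename_axis(None)

# Work on a copy so the flat column names of df stay untouched.
wide = df.copy()
wide.columns = wide.columns.str.split("_", expand=True)
wide = wide.rename_axis(["ID", "Signal", "Type"], axis=1)

# Transpose: each row is now one (ID, Signal, Type) column of the original.
# Sorting puts the 'pred' and 'true' rows of each signal next to each other.
t = wide.T.sort_index()

# Difference inside each (ID, Signal) pair, then the mean per ID.
abs_err = t.groupby(level=["ID", "Signal"]).diff().abs()
mae = abs_err.groupby(level="ID").mean().T  # rows: original index, columns: IDs

# Pick, for each row, the MAE of its own ID (assumes unique index labels).
df["MAE"] = [mae.at[i, i] for i in df.index]
print(df["MAE"].round(2).tolist())
```

With repeated IDs in the index, `mae.at[i, i]` would no longer return a single value, so this sketch would need a different lookup step.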

Answer 2

import numpy as np
df2 = df.set_index('ID').apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.name)]].to_dict().items())))).reshape(-1,2)),  axis=1)
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())

Output:

ID
id4    0.01
id1    0.02
id3    0.08
dtype: float64

Update:

Alternatively, you can do:

df2 = df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)),  axis=1)
df["MAE"] = df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())

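Alright, now for the explanation. Note: I am explaining the code from the update section.

What is the problem? You have some columns and rows, and for each row you want only the corresponding columns, that is, the columns whose names start with that row's ID. So, a first idea: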

df.apply(lambda x: x.index.str.startswith(x.ID), axis=1)

Output:

0    [False, False, False, True, False, False, Fals...
1    [False, True, True, False, False, False, True,...
2    [False, False, False, False, True, True, False...

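As you can see, for each row this indicates whether each column name starts with the row's ID (note: the IDs are id4, id1, ...).

Next, you need to get all the corresponding columns, using: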

df.apply(lambda x: (ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)], axis=1)

Output:

                  1                 2                 3                 4                 5                 6                 7                 8                 9                10
0               NaN               NaN  id4_signal1_true               NaN               NaN               NaN               NaN  id4_signal1_pred               NaN               NaN
1  id1_signal1_true  id1_signal2_true               NaN               NaN               NaN  id1_signal1_pred  id1_signal2_pred               NaN               NaN               NaN
2               NaN               NaN               NaN  id3_signal1_true  id3_signal2_true               NaN               NaN               NaN  id3_signal1_pred  id3_signal2_pred

As you know, you can pass a list of booleans as an indexer to a pandas Series and get back only the entries where the value is True.
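For readers less familiar with boolean indexing, here is a small stand-alone sketch of that mechanism. The shortened list of column names is made up for illustration.

```python
import pandas as pd

# A shortened, illustrative list of column names.
names = pd.Series(["ID", "id1_signal1_true", "id4_signal1_true", "id4_signal1_pred"])

# Boolean mask: True where the name starts with the given ID.
mask = names.str.startswith("id4")

# Indexing with the mask keeps only the entries marked True.
selected = names[mask]
print(selected.tolist())  # ['id4_signal1_true', 'id4_signal1_pred']
```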

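Wait, it can be simpler (because x.index is itself an Index with a .str accessor, so there is no need to build a separate Series):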

df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]], axis=1)

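Now we have all the corresponding columns. As you can see, some columns in the combined result are NaN and we have to get rid of them, so use to_dict().items() to convert each row's data into a list of name-value pairs: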

df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]].to_dict().items(), axis=1)

Output:

0    ((id4_signal1_true, 0.21), (id4_signal1_pred, ...
1    ((id1_signal1_true, 0.41), (id1_signal2_true, ...
2    ((id3_signal1_true, 0.54), (id3_signal2_true, ...
dtype: object

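Why do we need the names? Because we need to compute the MAE between the correct pairs.

Now we have the pairs, but they are not in the right order. How do we sort them? We know that the two members of a matching pair have the same name except for the last part: pred and true. So let's sort them by name: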

df.apply(lambda x: sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items()),  axis=1)

Output:

0    [(id4_signal1_pred, 0.2), (id4_signal1_true, 0...
1    [(id1_signal1_pred, 0.41), (id1_signal1_true, ...
2    [(id3_signal1_pred, 0.5), (id3_signal1_true, 0...
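To see why plain sorting is enough, here is a small sketch using the name-value pairs of the 'id1' row. Tuples are compared by their first element, so each `_pred` entry ends up directly before its matching `_true` entry.

```python
# Name-value pairs for the 'id1' row, in the original column order.
pairs = [
    ("id1_signal1_true", 0.41),
    ("id1_signal2_true", 0.44),
    ("id1_signal1_pred", 0.41),
    ("id1_signal2_pred", 0.48),
]

# Tuples sort by their first element (the name), so the pred/true
# entries of the same signal become neighbours.
ordered = sorted(pairs)
names = [name for name, _ in ordered]
print(names)
```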

Now they are in the right order and we can compute the MAE for each pair. We no longer need the names, so map over each list and take the second elements:

df.apply(lambda x: list(map(lambda t: t[1], sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))),  axis=1)

Output:

0                 [0.2, 0.21]
1    [0.41, 0.41, 0.48, 0.44]
2     [0.5, 0.54, 0.23, 0.11]
dtype: object

Now we can compute the MAE for each pair, but how do we turn each flat list into a list of pairs? NumPy! With .reshape(-1,2) we convert it into pairs:

df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)),  axis=1)

Output:

0                   [[0.2, 0.21]]
1    [[0.41, 0.41], [0.48, 0.44]]
2     [[0.5, 0.54], [0.23, 0.11]]
dtype: object
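As a stand-alone illustration of the reshape step, the sketch below uses the flat values from the 'id1' row. The -1 lets NumPy infer the number of rows, so the same call works for 1, 2, 3 or 4 signals.

```python
import numpy as np

# Flat values for the 'id1' row after sorting: pred, true, pred, true.
flat = np.array([0.41, 0.41, 0.48, 0.44])

# -1 lets NumPy infer the number of rows; each row is one (pred, true) pair.
pairs = flat.reshape(-1, 2)

# Absolute error per pair: |true - pred|.
abs_diff = np.abs(pairs[:, 1] - pairs[:, 0])
print(pairs.shape, abs_diff)
```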

Wait, we are already using NumPy, so why not use it further?

df.apply(lambda x: np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2),  axis=1)

This converts the sorted output to a numpy.array and takes the second elements with [:,1]. Now just compute the absolute difference for each pair:

df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]))

Output:

0                         [0.009999999999999981]
1                     [0.0, 0.03999999999999998]
2    [0.040000000000000036, 0.12000000000000001]
dtype: object

We have computed the absolute difference of each pair. And we can simplify it again:

df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))), axis=1)

Finally, we compute the mean of each numpy.array.

The third, simpler and faster way:

df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)

I tried to explain it in simple words; I hope it helps.
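Since the question mentions roughly 2 million rows, and `df.apply(..., axis=1)` still calls a Python function once per row, one possible alternative is to loop over the distinct ID values instead and let NumPy process all rows of each ID at once. The following is a sketch under stated assumptions, not a benchmarked solution: `rowwise_mae` is a hypothetical helper, and it assumes every relevant column is named `<id>_<signal>_true` or `<id>_<signal>_pred`, with a matching `_pred` column for every `_true` column.

```python
import numpy as np
import pandas as pd

def rowwise_mae(df, id_col="ID"):
    # Hypothetical helper: one vectorized pass per distinct ID value.
    out = np.full(len(df), np.nan)
    for key in df[id_col].unique():
        # Columns for this ID; the trailing "_" keeps 'id1' from matching 'id10'.
        true_cols = sorted(c for c in df.columns
                           if c.startswith(f"{key}_") and c.endswith("_true"))
        # Assumes each *_true column has a *_pred counterpart.
        pred_cols = [c[:-len("_true")] + "_pred" for c in true_cols]
        mask = (df[id_col] == key).to_numpy()
        t = df.loc[mask, true_cols].to_numpy(dtype=float)
        p = df.loc[mask, pred_cols].to_numpy(dtype=float)
        out[mask] = np.abs(t - p).mean(axis=1)
    return pd.Series(out, index=df.index, name="MAE")

rows = [['id4', 0.37, 0.97, 0.21, 0.54, 0.11, 0.38, 0.95, 0.2, 0.5, 0.23],
        ['id1', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23],
        ['id3', 0.41, 0.44, 0.21, 0.54, 0.11, 0.41, 0.48, 0.2, 0.5, 0.23]]
cols = ['ID', 'id1_signal1_true', 'id1_signal2_true', 'id4_signal1_true',
        'id3_signal1_true', 'id3_signal2_true', 'id1_signal1_pred',
        'id1_signal2_pred', 'id4_signal1_pred', 'id3_signal1_pred',
        'id3_signal2_pred']
sample = pd.DataFrame(rows, columns=cols)
sample["MAE"] = rowwise_mae(sample)
print(sample[["ID", "MAE"]])
```

The number of Python-level iterations here equals the number of distinct IDs rather than the number of rows, which is the reason to expect it to scale better. Actual timings would have to be measured on the real data.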
