Customized Moving Average on Pandas Dataframe With GroupBy

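I have a dataset with the columns ID and Feature_1. Feature_1 can be understood as the duration of a particular session in seconds. I also have a custom function that computes a moving average but, instead of leaving NaN in the first positions where the window is not yet full, fills them with a simple (cumulative) average. Here it is: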

import numpy as np

def moving_average_mit_anfang(x, w):
    # First part: simple (cumulative) average
    first_part_result = np.cumsum(x) / np.cumsum(np.ones(len(x)))
    # If the user has more sessions than the window width, compute the moving average
    if len(x) > w:
        # Second part: moving average with window w
        sec_part_result = np.convolve(x, np.ones(w), 'valid') / w
        return np.append(first_part_result[:-len(sec_part_result)], sec_part_result)
    # Otherwise compute only the simple average
    else:
        return first_part_result
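
To make the behaviour of the function concrete, here is a small sketch that runs it on the Feature_1 values of ID 1 from the example data (4, 21, 45, 39, in order of appearance). The variable name sessions is only an illustration. With a window of 2, the first position falls back to the cumulative average and the remaining positions are full-window means, so the output has exactly one value per input element.

```python
import numpy as np

def moving_average_mit_anfang(x, w):
    # Cumulative (simple) average, used for the first w - 1 positions
    first_part_result = np.cumsum(x) / np.cumsum(np.ones(len(x)))
    if len(x) > w:
        # Full-window moving average; 'valid' yields len(x) - w + 1 values
        sec_part_result = np.convolve(x, np.ones(w), 'valid') / w
        return np.append(first_part_result[:-len(sec_part_result)], sec_part_result)
    return first_part_result

# Feature_1 values of ID 1, in order of appearance (illustrative variable name)
sessions = np.array([4, 21, 45, 39])

result_w2 = moving_average_mit_anfang(sessions, 2)
result_w1 = moving_average_mit_anfang(sessions, 1)

print(result_w2)                        # averages for window width 2
print(result_w1)                        # window width 1 returns the values themselves
print(len(result_w2) == len(sessions))  # output length matches input length
```

The length check at the end matters for the groupby step: the function has to return as many values as the group has rows.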

I need to apply this function to the column Feature_1 so that, for each ID, I get the current average in order of that ID's appearance.

Example dataframe:

import pandas as pd

test_df = pd.DataFrame(data={'ID': [1, 2, 3, 2, 3, 1, 2, 1, 3, 3, 3, 2, 1],
                             'Feature_1': [4, 5, 6, 73, 2, 21, 13, 45, 32, 9, 18, 45, 39]})

I tried this:

test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))

and got this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-6cc3e6c9b134> in <module>
----> 1 test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))

~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
    505 
    506         if not isinstance(func, str):
--> 507             return self._transform_general(func, *args, **kwargs)
    508 
    509         elif func not in base.transform_kernel_allowlist:

~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
    535                 res = res._values
    536 
--> 537             results.append(klass(res, index=group.index))
    538 
    539         # check for empty "results" to avoid concat ValueError

~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    346                 try:
    347                     if len(index) != len(data):
--> 348                         raise ValueError(
    349                             f"Length of passed values is {len(data)}, "
    350                             f"index implies {len(index)}."

ValueError: Length of passed values is 6, index implies 4.
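
The message says that, for one group, the function returned 6 values while that group has only 4 rows. When transform is given a function that returns an array, it expects one output value per row of the group. A minimal diagnostic sketch, assuming the example dataframe from above: check_group_lengths and only_full_windows are hypothetical helper names (not part of pandas), and only_full_windows is just a stand-in for a function with a length mismatch, not the function that produced this traceback.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame(data={'ID': [1, 2, 3, 2, 3, 1, 2, 1, 3, 3, 3, 2, 1],
                             'Feature_1': [4, 5, 6, 73, 2, 21, 13, 45, 32, 9, 18, 45, 39]})

def only_full_windows(x, w):
    # Hypothetical stand-in: returns len(x) - w + 1 values, i.e. too few for transform
    return np.convolve(x, np.ones(w), 'valid') / w

def check_group_lengths(df, func, *args):
    # Hypothetical helper: map each ID to (rows in group, values returned by func)
    report = {}
    for group_id, group in df.groupby('ID')['Feature_1']:
        report[group_id] = (len(group), len(func(group, *args)))
    return report

report = check_group_lengths(test_df, only_full_windows, 2)
mismatched = [gid for gid, (n_rows, n_out) in report.items() if n_rows != n_out]
print(report)
print(mismatched)  # IDs for which transform would reject the result
```

Running a check like this on the real function shows which group and which window width produce the mismatch before transform is called.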

The output should look like this (here for a window width of 2):

    ID  Feature_1  Custom average
0    1          4             4.0
1    2          5             5.0
2    3          6             6.0
3    2         73            39.0
4    3          2             4.0
5    1         21            12.5
6    2         13            43.0
7    1         45            33.0
8    3         32            17.0
9    3          9            20.5
10   3         18            13.5
11   2         45            29.0
12   1         39            42.0

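Your new solution works. You can also omit the lambda for a simpler solution (the lambda version works as well) and pass the window width to transform as an extra positional argument: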

test_df['Custom average'] = test_df.groupby('ID')['Feature_1'].transform(moving_average_mit_anfang,2)
print(test_df)
    ID  Feature_1  Custom average
0    1          4             4.0
1    2          5             5.0
2    3          6             6.0
3    2         73            39.0
4    3          2             4.0
5    1         21            12.5
6    2         13            43.0
7    1         45            33.0
8    3         32            17.0
9    3          9            20.5
10   3         18            13.5
11   2         45            29.0
12   1         39            42.0
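
As a possible alternative, not taken from the answer above, pandas' built-in rolling window with min_periods=1 averages over whatever values are available while the window is still filling up, which on the example data gives the same numbers as the custom function. A sketch under that assumption, with w as an illustrative variable name for the window width:

```python
import pandas as pd

test_df = pd.DataFrame(data={'ID': [1, 2, 3, 2, 3, 1, 2, 1, 3, 3, 3, 2, 1],
                             'Feature_1': [4, 5, 6, 73, 2, 21, 13, 45, 32, 9, 18, 45, 39]})

w = 2  # window width (illustrative name)

# Per ID: rolling mean over the last w sessions; min_periods=1 lets the
# first positions use the values seen so far instead of producing NaN
test_df['Custom average'] = (
    test_df.groupby('ID')['Feature_1']
           .transform(lambda s: s.rolling(w, min_periods=1).mean())
)
print(test_df)
```

This keeps the original row order and index, so the new column lines up with the dataframe without any extra alignment step.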