Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows

Question

I'm a biology student who is fairly new to Python, and I was hoping someone could help with a problem I have yet to solve.

With some earlier code I have created a pandas DataFrame like the example below:

Distance.     No. of values        Mean rSquared
    1                   500                  0.6
    2                    80                  0.3
    3                    40                  0.4
    4                    30                  0.2
    5                    50                  0.2
    6                    30                  0.1

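I can provide the code I used to create this DataFrame, but I don't think it's particularly relevant.

I need to sum the "No. of values" column until I reach a value >= 100, and then combine the data of the adjacent rows that made up that sum, taking the weighted average of the distance and of the mean r2 values, as in the example below: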

Mean Distance.             No. Of values             Mean rSquared
1                          500                       0.6
(80*2+40*3)/120            (80+40) = 120             (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110       (30+50+30) = 110          (30*0.2+50*0.2+30*0.1)/110

etc...

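I know pandas has its .cumsum function, which I might be able to use inside a for loop, with an if statement that checks the upper limit and resets the sum to 0 once it is greater than or equal to that limit. However, I haven't a clue how to average the adjacent rows.

Any help would be appreciated!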

Answer 1

You can use this code snippet to solve your problem.

import pandas as pd

# First, compute the weighted values (value * count) so that group sums can be divided by the group count later
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]


min_threshold = 100
indexes = []
temp_sum = 0

# placeholder for final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]

# reset the index so that positional .iloc lookups match the loop counter
df = df.reset_index(drop=True)

# main loop to check and compute the desired output
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)

    # once the running sum reaches 'min_threshold', emit one combined row
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
    
        # create temporary dataframe and concatenate with the 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df], ignore_index=True)
    
        # reset the variables
        temp_sum = 0
        indexes = []
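
As a quick check of the same idea, here is a minimal self-contained sketch that runs on the sample data from the question. It is a compact variant rather than the exact loop above: the function name `merge_until_threshold` is made up for illustration, and it assumes the column is called "Distance" (without the trailing period shown in the question's table). Like the loop above, it drops any trailing rows that never reach the threshold.

```python
import pandas as pd

def merge_until_threshold(df, min_threshold=100):
    """Combine consecutive rows until 'No. of values' sums to >= min_threshold."""
    rows = []
    count = 0      # running 'No. of values' for the current group
    w_dist = 0.0   # running sum of Distance * No. of values
    w_r2 = 0.0     # running sum of Mean rSquared * No. of values
    for dist, n, r2 in zip(df["Distance"], df["No. of values"], df["Mean rSquared"]):
        count += n
        w_dist += dist * n
        w_r2 += r2 * n
        if count >= min_threshold:
            # close the group: weighted means are weighted sums / total count
            rows.append([w_dist / count, count, w_r2 / count])
            count, w_dist, w_r2 = 0, 0.0, 0.0
    # Assumption: leftover rows below the threshold are discarded.
    return pd.DataFrame(rows, columns=["Distance", "No. of values", "Mean rSquared"])

sample = pd.DataFrame({
    "Distance": [1, 2, 3, 4, 5, 6],
    "No. of values": [500, 80, 40, 30, 50, 30],
    "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1],
})
result = merge_until_threshold(sample)
print(result)
```

On the sample data this should give three rows with counts 500, 120 and 110, matching the table in the question.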

Answer 2

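NumPy has a function, numpy.frompyfunc, that you can use to get a cumulative value that resets based on a threshold.

Here's how to implement it. With that, you can figure out the index at which the running total reaches the threshold. Use those indexes to calculate the Mean Distance and Mean rSquared for the values in your original DataFrame.

I also leveraged @sujanay's idea of calculating the weighted values first.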

c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
     [4,30,0.2], [5,50,0.2], [6,30,0.1]]

import pandas as pd
import numpy as np

df = pd.DataFrame(d,columns=c)

#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]

#use numpy.frompyfunc to set up the threshold condition:
#keep adding while the running total is below 100, otherwise restart from the current value

sumvals = np.frompyfunc(lambda a,b: a+b if a < 100 else b,2,1)

#assign value to cumvals based on threshold
#(np.object was removed in NumPy 1.24; the builtin object works on all versions)
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)

#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()

#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]

#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1

print (df)

The output of this will be:

   Distance  No. of values  ...  Mean Distance  New Mean rSquared
0         1            500  ...           1.00               0.60
1         2             80  ...            NaN                NaN
2         3             40  ...           2.33               0.33
3         4             30  ...            NaN                NaN
4         5             50  ...            NaN                NaN
5         6             30  ...           5.00               0.17

If you want to extract only the records that are non-NaN, you can do:

final_df = df[df['Mean Distance'].notnull()]

This will result in:

   Distance  No. of values  ...  Mean Distance  New Mean rSquared
0         1            500  ...           1.00               0.60
2         3             40  ...           2.33               0.33
5         6             30  ...           5.00               0.17

I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here.
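
To see what the accumulate step produces on its own, here is a small sketch on the question's counts. The helper name `threshold_cumsum` is hypothetical, and it uses the strict comparison `a < threshold` so that a running total of exactly the threshold also closes a group.

```python
import numpy as np

def threshold_cumsum(values, threshold=100):
    """Running total that restarts after it has reached the threshold."""
    # a is the running total so far, b is the next value
    step = np.frompyfunc(lambda a, b: a + b if a < threshold else b, 2, 1)
    # dtype=object is needed because frompyfunc ufuncs return Python objects
    return step.accumulate(np.asarray(values), dtype=object)

counts = [500, 80, 40, 30, 50, 30]
running = threshold_cumsum(counts)
print(running.tolist())  # [500, 80, 120, 30, 80, 110]

# positions where a group is complete (running total >= 100)
group_ends = [i for i, v in enumerate(running) if v >= 100]
print(group_ends)  # [0, 2, 5]
```

The positions in `group_ends` are the same rows that end up with non-NaN values in the output above.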

Answer 3

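If you figure out the groupings first, pandas groupby functionality will do a lot of the remaining work for you. A loop is appropriate for getting the groupings (unless somebody has a clever one-liner):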

>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]

Before doing the grouped operations, you need to use the "No. of values" information to get the weighting:

df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)

Now get the sums like this:

>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0    500
1    120
2    110
Name: No. of values, dtype: int64

And finally the weighted group averages like this:

>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
   Distance.  Mean rSquared
0   1.000000       0.600000
1   2.333333       0.333333
2   5.000000       0.172727
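
Putting the steps above together, a self-contained sketch might look like the following. The function names `group_labels` and `weighted_group_means` are invented for illustration, the column name "Distance." (with the trailing period) is taken from the question's table, and unlike the in-place multiply above this version leaves the original DataFrame untouched.

```python
import pandas as pd

def group_labels(counts, threshold=100):
    """Label each row; a new group starts once the previous one has reached the threshold."""
    labels, label, running = [], 0, 0
    for n in counts:
        if running >= threshold:
            running = 0
            label += 1
        running += n
        labels.append(label)
    return labels

def weighted_group_means(df, threshold=100):
    labels = group_labels(df["No. of values"], threshold)
    # weight each value by its count without modifying df
    weighted = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
    sums = df.groupby(labels)["No. of values"].sum()
    out = weighted.groupby(labels).sum().div(sums, axis=0)
    out.insert(1, "No. of values", sums)
    return out

df = pd.DataFrame({
    "Distance.": [1, 2, 3, 4, 5, 6],
    "No. of values": [500, 80, 40, 30, 50, 30],
    "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1],
})
summary = weighted_group_means(df)
print(summary)
```

Note that with this grouping any trailing rows that do not reach the threshold still form a final, smaller group rather than being dropped.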
