Cumulative sum of a pandas column until a maximum value is met, and average adjacent rows
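I'm a biology student and fairly new to Python, and I'm hoping someone can help with a problem I haven't been able to solve.
With some earlier code, I created a pandas DataFrame like the example below: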
Distance.   No. of values   Mean rSquared
1           500             0.6
2           80              0.3
3           40              0.4
4           30              0.2
5           50              0.2
6           30              0.1
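I can provide the code I used to create this DataFrame, but I don't think it's particularly relevant.
I need to sum the "No. of values" column until I reach a value >= 100, and then combine the data of the adjacent rows, taking the weighted average of the distance and of the mean r2 values, as in the example below: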
Mean Distance.           No. of values        Mean rSquared
1                        500                  0.6
(80*2+40*3)/120          (80+40) = 120        (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110     (30+50+30) = 110     (30*0.2+50*0.2+30*0.1)/110
etc...
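I know pandas has its .cumsum function, which I could put into a for loop with an if statement that checks the upper limit and resets the sum to 0 once it is greater than or equal to that limit. However, I have no idea how to average the adjacent rows.
Any help would be appreciated!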
You can use this snippet to solve your problem.
import pandas as pd

# First, compute the weighted values
# (assumes 'df' has the columns "Distance", "No. of values" and "Mean rSquared")
df["weighted_distance"] = df["Distance"] * df["No. of values"]
df["weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# collected rows for the final result
rows = []
columns = ["Distance", "No. of values", "Mean rSquared"]
# reset the index so that labels and positions agree in the loop below
df = df.reset_index(drop=True)
# main loop to check and compute the desired output
for index in df.index:
    temp_sum += df.loc[index, "No. of values"]
    indexes.append(index)
    # once the sum reaches 'min_threshold', emit one merged row
    if temp_sum >= min_threshold:
        temp_distance = df.loc[indexes, "weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.loc[indexes, "weighted_mean_rSquared"].sum() / temp_sum
        rows.append([temp_distance, temp_sum, temp_mean_rSquared])
        # reset the variables
        temp_sum = 0
        indexes = []
# note: trailing rows that never reach the threshold are not included
final_df = pd.DataFrame(rows, columns=columns)
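To try the loop above on the question's data, it may help to wrap it in a function. The following is only a sketch: the function name `merge_rows_until_threshold` and the `keep_remainder` flag are made up for illustration, the column names are assumed to be "Distance", "No. of values" and "Mean rSquared", and keeping a trailing group that never reaches the threshold is an assumption about what is wanted.

```python
import pandas as pd


def merge_rows_until_threshold(df, threshold=100, keep_remainder=True):
    """Merge consecutive rows until 'No. of values' sums to at least `threshold`."""
    rows = []
    count = 0
    w_dist = 0.0
    w_r2 = 0.0
    for dist, n, r2 in zip(df["Distance"], df["No. of values"], df["Mean rSquared"]):
        count += n
        w_dist += dist * n
        w_r2 += r2 * n
        if count >= threshold:
            # Close the current group and start a new one.
            rows.append((w_dist / count, count, w_r2 / count))
            count, w_dist, w_r2 = 0, 0.0, 0.0
    # Assumption: a trailing group below the threshold is kept rather than dropped.
    if keep_remainder and count > 0:
        rows.append((w_dist / count, count, w_r2 / count))
    return pd.DataFrame(rows, columns=["Mean Distance", "No. of values", "Mean rSquared"])


sample = pd.DataFrame({
    "Distance": [1, 2, 3, 4, 5, 6],
    "No. of values": [500, 80, 40, 30, 50, 30],
    "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1],
})
merged = merge_rows_until_threshold(sample)
print(merged)
```

On the sample data this should give three rows with counts 500, 120 and 110, matching the expected table in the question.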
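NumPy has a function, numpy.frompyfunc, that you can use to get a cumulative value based on a threshold.
Here is how to implement it. With it, you can work out the indexes at which the running total reaches the threshold, and then use those to compute the Mean Distance and Mean rSquared from the values in the original DataFrame.
I also used @sujanay's idea of computing the weighted values first.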
import pandas as pd
import numpy as np

c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
     [4,30,0.2], [5,50,0.2], [6,30,0.1]]
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean rSquared first
df["w_distance"] = df["Distance"] * df["No. of values"]
df["w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to set up the threshold condition:
#keep adding while the running total is below 100, otherwise start over
sumvals = np.frompyfunc(lambda a,b: a+b if a < 100 else b,2,1)
#assign value to cumvals based on threshold
#(np.object has been removed from NumPy; the builtin object is used instead)
df['cumvals'] = sumvals.accumulate(df['No. of values'].to_numpy(), dtype=object).astype(int)
#find all records that have >= 100 as cumulative value
idx = df.index[df['cumvals'] >= 100].tolist()
#if the last row is not in idx, add it so that a trailing partial group is kept
if (len(df)-1) not in idx:
    idx.append(len(df)-1)
#iterate through idx and calculate Mean Distance and Mean rSquared for each set
i = 0
for j in idx:
    df.loc[j,'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j,'cumvals']).round(2)
    df.loc[j,'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j,'cumvals']).round(2)
    i = j+1
print(df)
The output will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the non-NaN records, you can do this:
final_df = df[df['Mean Distance'].notnull()]
This results in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I referred to BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here.
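One detail worth checking in the frompyfunc approach is the comparison inside the lambda. Since a group should close as soon as its total is >= 100, the running total presumably has to restart when the previous total is exactly 100 as well. The sketch below uses made-up counts (not from the question) whose first two entries sum to exactly 100, and a hypothetical helper `running_totals`, to show how `a < 100` and `a <= 100` differ in that case.

```python
import numpy as np

threshold = 100
# Made-up counts for illustration; the first two sum to exactly the threshold.
counts = np.array([60, 40, 30, 80])


def running_totals(keep_adding):
    """Running total that restarts whenever keep_adding(previous_total) is False."""
    step = np.frompyfunc(lambda acc, n: acc + n if keep_adding(acc) else n, 2, 1)
    return step.accumulate(counts, dtype=object).astype(int)


strict = running_totals(lambda acc: acc < threshold)
loose = running_totals(lambda acc: acc <= threshold)
print(strict.tolist())  # [60, 100, 30, 110]
print(loose.tolist())   # [60, 100, 130, 80]
```

With the strict comparison the total restarts right after reaching 100; with `<=` the third row is still added to the already complete group.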
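If you figure out the grouping first, pandas' groupby functionality will do much of the remaining work for you. A loop is appropriate for getting the grouping (unless somebody has a clever one-liner):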
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
...
>>> groups
[0, 1, 1, 2, 2, 2]
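For what it's worth, the same group labels can probably be produced without an explicit loop by using itertools.accumulate from the standard library. This is only a sketch of one possible variant, not necessarily clearer than the loop; the names `running` and `closed` are chosen here for illustration, and the counts are copied from the question's "No. of values" column.

```python
from itertools import accumulate

threshold = 100
counts = [500, 80, 40, 30, 50, 30]  # the "No. of values" column from the question

# Running total that starts over once the previous total has reached the threshold.
running = list(accumulate(counts, lambda total, n: total + n if total < threshold else n))
# 1 where a group is complete, 0 elsewhere.
closed = [int(total >= threshold) for total in running]
# A row's group label is the number of groups completed before it.
groups = [0] + list(accumulate(closed))[:-1]
print(groups)  # [0, 1, 1, 2, 2, 2]
```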
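Before doing the grouped operations, you need to use the number-of-values information to get the weighting (note that this overwrites the two columns with their weighted values):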
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
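Putting the steps above together, a self-contained sketch might look like the following. It assumes the column names from the question (including the trailing dot in "Distance."), and, unlike the snippet above, it keeps the weighted values in a separate frame (`weighted`, a name chosen here) so the original columns are not overwritten. The group labels are wrapped in a Series aligned to the DataFrame's index before being passed to groupby.

```python
import pandas as pd

df = pd.DataFrame({
    "Distance.": [1, 2, 3, 4, 5, 6],
    "No. of values": [500, 80, 40, 30, 50, 30],
    "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1],
})

# Step 1: label each row with its group, starting a new group
# once the running count has reached the threshold.
threshold = 100
labels = []
group = 0
running = 0
for n in df["No. of values"]:
    if running >= threshold:
        running = 0
        group += 1
    running += n
    labels.append(group)
groups = pd.Series(labels, index=df.index, name="group")

# Step 2: weight the columns by the number of values (kept in a separate frame).
weighted = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)

# Step 3: per-group totals, then weighted averages.
sums = df.groupby(groups)["No. of values"].sum()
result = weighted.groupby(groups).sum().div(sums, axis=0)
result.insert(1, "No. of values", sums)
print(result)
```

The result has one row per group, with the group's total count next to its weighted mean distance and weighted mean rSquared.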