Python/ Pandas: Finding a left and right max
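I have a pandas data frame with a region in the first column and 8 years of quarterly data in the remaining columns. There are about 4,400 rows. Here is a sample: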
idx Q12000 Q22000 Q32000 Q42000 Q12001 Q22001 Q32001 Q42001 Q12002 Q22002 Q32002 Q42002
0 4085280.0 4114911.0 4108089.0 4111713.0 4055699.0 4076430.0 4043219.0 4039370.0 4201158.0 4243119.0 4231823.0 4254681.0
1 21226.0 21566.0 21804.0 22072.0 21924.0 23232.0 22748.0 22258.0 22614.0 22204.0 22500.0 22660.0
2 96400.0 102000.0 98604.0 97086.0 96354.0 103054.0 97824.0 95958.0 115938.0 123064.0 120406.0 120648.0
3 23820.0 24116.0 24186.0 23726.0 23504.0 23574.0 23162.0 23078.0 22306.0 22334.0 22152.0 22080.0
4 7838.0 7906.0 7714.0 7676.0 7480.0 7520.0 7102.0 6722.0 8324.0 8166.0 8208.0 8326.0
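Here is an image describing what I am trying to calculate:
[image: timeline]
- nadir: the lowest point (min)
- nadir_qtr: the quarter in which the nadir occurred
- pre-peak: the highest point before the nadir
- pre-peak_qtr: the quarter in which the pre-peak occurred
- post-peak: the highest point after the nadir
- post-peak_qtr: the quarter in which the post-peak occurred
- recovery: the quarter after the nadir in which the value exceeds the pre-peak value
I can calculate the nadir easily.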
df['nadir'] = df.iloc[:,2:].min(axis=1)
df['nadir_qtr'] = df.iloc[:,2:].idxmin(axis=1)
idx Q12000 Q22000 Q32000 Q42000 Q12001 Q22001 Q32001 Q42001 Q12002 Q22002 Q32002 Q42002 nadir nadir_qtr
0 4085280.0 4114911.0 4108089.0 4111713.0 4055699.0 4076430.0 4043219.0 4039370.0 4201158.0 4243119.0 4231823.0 4254681.0 4039370.0 Q42001
1 21226.0 21566.0 21804.0 22072.0 21924.0 23232.0 22748.0 22258.0 22614.0 22204.0 22500.0 22660.0 21226 Q12000
2 96400.0 102000.0 98604.0 97086.0 96354.0 103054.0 97824.0 95958.0 115938.0 123064.0 120406.0 120648.0 95958.0 Q42001
3 23820.0 24116.0 24186.0 23726.0 23504.0 23574.0 23162.0 23078.0 22306.0 22334.0 22152.0 22080.0 22080.0 Q42002
4 7838.0 7906.0 7714.0 7676.0 7480.0 7520.0 7102.0 6722.0 8324.0 8166.0 8208.0 8326.0 6722.0 Q42001
But when it comes to getting the pre- or post-peak values or quarters, I am stuck. The closest I have come is this:
df['pre-peak'] = df.loc[:,:df['nadir_qtr']].max(axis=1)
df['pre-peak_qtr'] = df.loc[:,:df['nadir_qtr']].idxmax(axis=1)
Expected output:
idx Q12000 Q22000 Q32000 Q42000 Q12001 Q22001 Q32001 Q42001 Q12002 Q22002 Q32002 Q42002 nadir nadir_qtr pre-peak pre-peak_qtr
0 4085280.0 4114911.0 4108089.0 4111713.0 4055699.0 4076430.0 4043219.0 4039370.0 4201158.0 4243119.0 4231823.0 4254681.0 4039370.0 Q42001 4114911.0 Q22000
1 21226.0 21566.0 21804.0 22072.0 21924.0 23232.0 22748.0 22258.0 22614.0 22204.0 22500.0 22660.0 21226.0 Q12000 NaN NaN
2 96400.0 102000.0 98604.0 97086.0 96354.0 103054.0 97824.0 95958.0 115938.0 123064.0 120406.0 120648.0 95958.0 Q42001 103054.0 Q22001
3 23820.0 24116.0 24186.0 23726.0 23504.0 23574.0 23162.0 23078.0 22306.0 22334.0 22152.0 22080.0 22080.0 Q42002 24186.0 Q32000
4 7838.0 7906.0 7714.0 7676.0 7480.0 7520.0 7102.0 6722.0 8324.0 8166.0 8208.0 8326.0 6722.0 Q42001 7906.0 Q22000
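But every variation of this gives me either wrong data or an error, most commonly:
TypeError: reduction operation 'argmax' not allowed for this dtype
I have tried many strategies: forcing iteration over each row as a numpy array, splitting each row apart. I am really stuck.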
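A slice such as df.loc[:, :df['nadir_qtr']] takes a single end label for the whole frame, so it cannot express a different cut-off column for each row. One possible way around this, sketched below under the assumption that the frame holds only the numeric quarter columns, is to build a boolean mask of the columns to the left of each row's nadir and blank out everything else before taking the row maximum. The names vals, nadir_pos, before and pre are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question, numeric quarter columns only (assumption).
quarters = [f"Q{q}{y}" for y in (2000, 2001, 2002) for q in (1, 2, 3, 4)]
values = [
    [4085280, 4114911, 4108089, 4111713, 4055699, 4076430,
     4043219, 4039370, 4201158, 4243119, 4231823, 4254681],
    [21226, 21566, 21804, 22072, 21924, 23232,
     22748, 22258, 22614, 22204, 22500, 22660],
    [96400, 102000, 98604, 97086, 96354, 103054,
     97824, 95958, 115938, 123064, 120406, 120648],
    [23820, 24116, 24186, 23726, 23504, 23574,
     23162, 23078, 22306, 22334, 22152, 22080],
    [7838, 7906, 7714, 7676, 7480, 7520,
     7102, 6722, 8324, 8166, 8208, 8326],
]
df = pd.DataFrame(values, columns=quarters, dtype=float)

vals = df.to_numpy()
nadir_pos = vals.argmin(axis=1)               # column position of each row's minimum
col_pos = np.arange(vals.shape[1])
before = col_pos < nadir_pos[:, None]         # True for columns left of that row's nadir

pre = df.where(before)                        # everything from the nadir onward becomes NaN
pre_peak = pre.max(axis=1)                    # NaN when no quarter precedes the nadir

# idxmax is only asked about rows that actually have a quarter before the nadir;
# the remaining rows are filled with NaN by reindex.
has_pre = before.any(axis=1)
pre_peak_qtr = pre.loc[has_pre].idxmax(axis=1).reindex(df.index)

print(pd.concat([pre_peak.rename("pre-peak"),
                 pre_peak_qtr.rename("pre-peak_qtr")], axis=1))
```

Flipping the comparison (col_pos > nadir_pos[:, None]) would give the matching mask for the post-peak side.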
Here is one approach that uses 'helper' functions:
# create the data frame
from io import StringIO
import pandas as pd
data = ''' Q12000 Q22000 Q32000 Q42000 Q12001 Q22001 Q32001 Q42001 Q12002 Q22002 Q32002 Q42002
0 4085280.0 4114911.0 4108089.0 4111713.0 4055699.0 4076430.0 4043219.0 4039370.0 4201158.0 4243119.0 4231823.0 4254681.0
1 21226.0 21566.0 21804.0 22072.0 21924.0 23232.0 22748.0 22258.0 22614.0 22204.0 22500.0 22660.0
2 96400.0 102000.0 98604.0 97086.0 96354.0 103054.0 97824.0 95958.0 115938.0 123064.0 120406.0 120648.0
3 23820.0 24116.0 24186.0 23726.0 23504.0 23574.0 23162.0 23078.0 22306.0 22334.0 22152.0 22080.0
4 7838.0 7906.0 7714.0 7676.0 7480.0 7520.0 7102.0 6722.0 8324.0 8166.0 8208.0 8326.0
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python')
Second, define the helper functions:
def calc_nadir(s):
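    # s is one row of the frame, passed in by df.apply(..., axis=1)
    assert isinstance(s, pd.Series)
    return s.min()

def calc_nadir_qtr(s):
    # position (0-based column number) of the lowest value
    return s.argmin()

def calc_pre_peak(s):
    # highest value before the nadir; NaN if the nadir is the first quarter
    return s.iloc[:s.argmin()].max()

def calc_pre_peak_quarter(s):
    try:
        qtr = s.iloc[:s.argmin()].argmax()
    except ValueError:
        # nothing precedes the nadir, so there is no pre-peak quarter
        qtr = None
    return qtr

def calc_post_peak(s):
    # highest value from the nadir onward (the nadir quarter itself is included)
    return s.iloc[s.argmin():].max()

def calc_post_peak_qtr(s):
    # argmax is relative to the slice, so add the nadir position back
    return s.iloc[s.argmin():].argmax() + s.argmin()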
Third, we use the helper functions and assemble the results:
nadir = df.apply(lambda x: calc_nadir(x), axis=1).rename('nadir')
nadir_qtr = df.apply(lambda x: calc_nadir_qtr(x), axis=1).rename('nadir_qtr')
pre_peak = df.apply(lambda x: calc_pre_peak(x), axis=1).rename('pre_peak')
pre_peak_qtr = df.apply(lambda x: calc_pre_peak_quarter(x), axis=1).rename('pre_peak_qtr')
post_peak = df.apply(lambda x: calc_post_peak(x), axis=1).rename('post_peak')
post_peak_qtr = df.apply(lambda x: calc_post_peak_qtr(x), axis=1).rename('post_peak_qtr')
results = pd.concat([nadir, nadir_qtr, pre_peak, pre_peak_qtr,
                     post_peak, post_peak_qtr], axis=1)
print(results)
nadir nadir_qtr pre_peak pre_peak_qtr post_peak post_peak_qtr
0 4039370.0 7 4114911.0 1.0 4254681.0 11
1 21226.0 0 NaN NaN 23232.0 5
2 95958.0 7 103054.0 5.0 123064.0 9
3 22080.0 11 24186.0 2.0 22080.0 11
4 6722.0 7 7906.0 1.0 8326.0 11
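The helpers above report quarters as column positions (7, 1.0 and so on), whereas the question asks for labels such as Q22000 and also for a recovery quarter. The following is a rough sketch of how both might be obtained in one pass per row. The function name summarize_row and the column name recovery_qtr are made up for illustration. Unlike calc_post_peak, it looks strictly after the nadir, and it reads "recovery" as the first later quarter whose value exceeds the pre-peak; both are assumptions about what the question intends.

```python
import numpy as np
import pandas as pd

# Sample data from the question, numeric quarter columns only (assumption).
quarters = [f"Q{q}{y}" for y in (2000, 2001, 2002) for q in (1, 2, 3, 4)]
values = [
    [4085280, 4114911, 4108089, 4111713, 4055699, 4076430,
     4043219, 4039370, 4201158, 4243119, 4231823, 4254681],
    [21226, 21566, 21804, 22072, 21924, 23232,
     22748, 22258, 22614, 22204, 22500, 22660],
    [96400, 102000, 98604, 97086, 96354, 103054,
     97824, 95958, 115938, 123064, 120406, 120648],
    [23820, 24116, 24186, 23726, 23504, 23574,
     23162, 23078, 22306, 22334, 22152, 22080],
    [7838, 7906, 7714, 7676, 7480, 7520,
     7102, 6722, 8324, 8166, 8208, 8326],
]
df = pd.DataFrame(values, columns=quarters, dtype=float)


def summarize_row(row):
    """Hypothetical helper: nadir, peaks and recovery for one row, using labels."""
    pos = int(row.to_numpy().argmin())        # position of the nadir
    before = row.iloc[:pos]                   # quarters strictly before the nadir
    after = row.iloc[pos + 1:]                # quarters strictly after the nadir

    pre_peak = before.max() if len(before) else np.nan
    post_peak = after.max() if len(after) else np.nan

    # First later quarter that exceeds the pre-peak; the comparison is False
    # everywhere when pre_peak is NaN, so no recovery is reported in that case.
    recovered = after[after > pre_peak]

    return pd.Series({
        "nadir": row.iloc[pos],
        "nadir_qtr": row.index[pos],
        "pre_peak": pre_peak,
        "pre_peak_qtr": before.idxmax() if len(before) else None,
        "post_peak": post_peak,
        "post_peak_qtr": after.idxmax() if len(after) else None,
        "recovery_qtr": recovered.index[0] if len(recovered) else None,
    })


result = df.apply(summarize_row, axis=1)
print(result)
```

For the five sample rows, rows 0, 2 and 4 recover in Q12002, while rows 1 and 3 report no recovery because the nadir falls in the first and last quarter respectively. With about 4,400 rows a row-wise apply like this should still be quick enough, though a masked, column-wise computation would scale better.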