Iterate over unique elements of df.index to find the minimum in a column

My df looks like this:

import datetime as dt
import pandas as pd

data = [{'expiry': dt.datetime(2020,6,26), 'strike': 137.5, 'diff': 0.797}, 
        {'expiry': dt.datetime(2020,6,26), 'strike': 138.0, 'diff': 0.305}, 
        {'expiry': dt.datetime(2020,6,26), 'strike': 138.5, 'diff': 0.188}, 
        {'expiry': dt.datetime(2020,6,26), 'strike': 139.0, 'diff': 0.688}, 
        {'expiry': dt.datetime(2020,7,24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020,7,24), 'strike': 138.0, 'diff': 0.305}, 
        {'expiry': dt.datetime(2020,7,24), 'strike': 138.5, 'diff': 0.203}, 
        {'expiry': dt.datetime(2020,7,24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

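I'm looking for the row with the minimum diff for each unique index value (expiry). The following works but is slow. I'm looking for a faster way to do this, whether in pure Python, NumPy, or pandas.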

atm_df = pd.DataFrame()
for date in df.index.unique():
    _df = df.loc[date]
    atm_df = atm_df.append(_df.loc[(_df['diff'] == _df['diff'].min())])
atm_df

The desired output looks like this (I don't mind whether it is a df or a dict):

            strike  diff
expiry      
2020-06-26  138.5   0.188
2020-07-24  138.5   0.203

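You can use pandas groupby on the index and aggregate with min to get the minimum of the diff column for each expiry. Compare that grouped result with the values in diff (eq aligns the two on the index), then use the resulting boolean Series to index the dataframe.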

df.loc[df['diff'].eq(df.groupby(level=0)['diff'].min())]

           strike   diff
expiry      
2020-06-26  138.5   0.188
2020-07-24  138.5   0.203
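
A closely related variant, sketched here as an illustration rather than as part of the answer above, uses transform('min') so that the group minimum is broadcast back to every row before the comparison. That avoids relying on index alignment between a duplicated index and a unique one. The names group_min and result are only illustrative.

```python
import datetime as dt
import pandas as pd

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

# transform returns one value per original row: the minimum diff of that row's expiry
group_min = df.groupby(level=0)['diff'].transform('min')

# keep the rows whose diff equals their group's minimum
result = df[df['diff'].eq(group_min)]
print(result)
```

Like the eq-based one-liner, this keeps every row that ties for the minimum within an expiry.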

Just a learning experience for me: here is an attempt in pure Python.

from itertools import groupby
from operator import itemgetter

# convert to a NumPy array of rows: (expiry, strike, diff)
m = df.reset_index().to_numpy()

# itertools.groupby only groups consecutive items, so the data must be
# sorted by the grouping key; here it is already sorted by expiry

# the first item in each row, expiry, is our grouping key

grp_key = itemgetter(0)

#we need the rows with the minimum for diff
diff_min = itemgetter(-1)

columns = df.reset_index().columns

outcome = [dict(zip(columns, min(value,key=diff_min)))
           for key,value 
           in groupby(m, grp_key)
           ]

outcome

[{'expiry': Timestamp('2020-06-26 00:00:00'), 'strike': 138.5, 'diff': 0.188},
 {'expiry': Timestamp('2020-07-24 00:00:00'), 'strike': 138.5, 'diff': 0.203}]

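Update: thanks to @steff for pointing me to the dictionary. If needed, the computation can be done there, before the data is read into pandas. We'll use the same steps involving itemgetter and itertools' groupby:
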
#sort data
data = sorted(data, key = itemgetter('expiry'))

outcome = [min(value, key = itemgetter("diff"))
           for _,value 
           in groupby(data,key=itemgetter("expiry"))]

outcome

[{'expiry': datetime.datetime(2020, 6, 26, 0, 0),
  'strike': 138.5,
  'diff': 0.188},
 {'expiry': datetime.datetime(2020, 7, 24, 0, 0),
  'strike': 138.5,
  'diff': 0.203}]
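
If sorting is undesirable, a single pass with a plain dict gives the same result. This is a minimal sketch, not part of the answer above; the name best is hypothetical, and it assumes that on a tie the first row seen should win.

```python
import datetime as dt

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]

# best maps each expiry to the row with the smallest diff seen so far
best = {}
for row in data:
    key = row['expiry']
    # strict < keeps the first row on ties (an assumption of this sketch)
    if key not in best or row['diff'] < best[key]['diff']:
        best[key] = row

outcome = list(best.values())
print(outcome)
```

Because no sort is needed, this is one linear pass over the rows, and the input order of the expiries is preserved.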

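Use min with level, then you can use eq to compare the series with the extracted minimum. (The level argument of min was deprecated in pandas 1.3 and removed in 2.0; on newer versions df['diff'].groupby(level=0).min() produces the same per-expiry minimum.)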

df[df['diff'].eq(df['diff'].min(level=0))]

Output:

            strike   diff
expiry                   
2020-06-26   138.5  0.188
2020-07-24   138.5  0.203
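
The eq comparison keeps every row that ties for the minimum. If exactly one row per expiry is wanted, one option is groupby with idxmin on a dataframe whose index has been reset. This is a sketch of a different technique, not the approach above; flat and row_ids are illustrative names.

```python
import datetime as dt
import pandas as pd

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

# idxmin returns index labels, so the duplicated expiry index is
# replaced by a unique integer index first
flat = df.reset_index()

# one integer row label per expiry: the position of its smallest diff
row_ids = flat.groupby('expiry')['diff'].idxmin()

# select those rows and restore expiry as the index
result = flat.loc[row_ids].set_index('expiry')
print(result)
```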

One based on np.minimum.reduceat:

sidx = df.index.argsort()
df_s = df.iloc[sidx]
I = df_s.index.values

cutidx = np.flatnonzero(np.r_[True,I[:-1]!=I[1:]])
out = np.minimum.reduceat(df_s.values, cutidx, axis=0)
df_out = pd.DataFrame(out, index=I[cutidx], columns=df_s.columns)

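If the input dataframe is already sorted by index, use df directly as df_s. Note that reduceat with axis=0 reduces each column independently, so the strike in df_out is the smallest strike in each group (137.5 for the sample data), not the strike of the row with the minimum diff.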
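
To keep whole rows with the same reduceat idea, one possibility is to reduce only the diff column and then build a mask from the group minima. The following is a hedged sketch, not the answer's code; group_sizes and mask are names introduced here, and a mergesort is used only so that equal index values keep their original order.

```python
import datetime as dt
import numpy as np
import pandas as pd

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

# sort by index so that each expiry forms one contiguous block
df_s = df.sort_index(kind='mergesort')
I = df_s.index.values

# start position of each block: wherever the index value changes
cutidx = np.flatnonzero(np.r_[True, I[:-1] != I[1:]])

# minimum of diff within each block
diff = df_s['diff'].to_numpy()
group_min = np.minimum.reduceat(diff, cutidx)

# repeat each block's minimum over its rows and compare element-wise
group_sizes = np.diff(np.r_[cutidx, len(diff)])
mask = diff == np.repeat(group_min, group_sizes)

df_out = df_s[mask]
print(df_out)
```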