Iterate over unique elements of df.index to find the minimum in a column
My df looks like this:
import datetime as dt
import pandas as pd
data = [{'expiry': dt.datetime(2020,6,26), 'strike': 137.5, 'diff': 0.797},
{'expiry': dt.datetime(2020,6,26), 'strike': 138.0, 'diff': 0.305},
{'expiry': dt.datetime(2020,6,26), 'strike': 138.5, 'diff': 0.188},
{'expiry': dt.datetime(2020,6,26), 'strike': 139.0, 'diff': 0.688},
{'expiry': dt.datetime(2020,7,24), 'strike': 137.5, 'diff': 0.805},
{'expiry': dt.datetime(2020,7,24), 'strike': 138.0, 'diff': 0.305},
{'expiry': dt.datetime(2020,7,24), 'strike': 138.5, 'diff': 0.203},
{'expiry': dt.datetime(2020,7,24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')
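I'm looking for the row with the minimum diff for each unique index value (expiry). The following works but is slow. I'm looking for a faster way to do this, whether in pure Python, NumPy, or pandas.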
atm_df = pd.DataFrame()
for date in df.index.unique():
_df = df.loc[date]
atm_df = atm_df.append(_df.loc[(_df['diff'] == _df['diff'].min())])
atm_df
The desired output looks like this (I don't mind whether it's a df or a dict):
strike diff
expiry
2020-06-26 138.5 0.188
2020-07-24 138.5 0.203
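You can use pandas groupby on the index and aggregate with min to get the minimum of the diff column for each expiry. Compare the grouped result with the values in diff, then use the resulting boolean mask to index the DataFrame.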
df.loc[df['diff'].eq(df.groupby(level=0)['diff'].min())]
strike diff
expiry
2020-06-26 138.5 0.188
2020-07-24 138.5 0.203
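For readers who want to run the comparison end to end, here is a minimal sketch on the question's sample data. It swaps in `transform('min')`, which broadcasts each group's minimum back onto the rows, instead of relying on index alignment inside `eq`; that substitution and the name `group_min` are my own assumptions, not part of the answer above.

```python
import datetime as dt
import pandas as pd

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

# Broadcast each expiry's minimum 'diff' onto every row of that expiry.
group_min = df.groupby(level=0)['diff'].transform('min')

# Keep the rows whose 'diff' equals their group's minimum.
# If several rows tie for the minimum, all of them are kept.
atm_df = df[df['diff'] == group_min]
print(atm_df)
```

Because `group_min` has the same index as `df`, the comparison is row by row and does not depend on how pandas aligns a duplicated index against a unique one.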
This was mostly a learning exercise for me: an attempt in pure Python.
from itertools import groupby
from operator import itemgetter
# convert the frame to a NumPy array of rows
m = df.reset_index().to_numpy()
# itertools.groupby only groups consecutive items, so the data must be
# sorted by the grouping key; here it is already sorted by expiry
# the first item in each row, expiry, is the grouping key
grp_key = itemgetter(0)
#we need the rows with the minimum for diff
diff_min = itemgetter(-1)
columns = df.reset_index().columns
outcome = [dict(zip(columns, min(value,key=diff_min)))
for key,value
in groupby(m, grp_key)
]
outcome
[{'expiry': Timestamp('2020-06-26 00:00:00'), 'strike': 138.5, 'diff': 0.188},
{'expiry': Timestamp('2020-07-24 00:00:00'), 'strike': 138.5, 'diff': 0.203}]
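Update: thanks to @steff for pointing me to the dictionaries. If needed, the computation can be done there, before the data is read into pandas. We'll use the same steps, involving itemgetter and itertools' groupby: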
#sort data
data = sorted(data, key = itemgetter('expiry'))
outcome = [min(value, key = itemgetter("diff"))
for _,value
in groupby(data,key=itemgetter("expiry"))]
outcome
[{'expiry': datetime.datetime(2020, 6, 26, 0, 0),
'strike': 138.5,
'diff': 0.188},
{'expiry': datetime.datetime(2020, 7, 24, 0, 0),
'strike': 138.5,
'diff': 0.203}]
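The sort above is only there because itertools' groupby needs equal keys to be adjacent. As a possible alternative (my own sketch, not from the answer), a single pass with a plain dict avoids the sort altogether; `best` is a hypothetical name, and on ties the first row seen for an expiry is kept.

```python
import datetime as dt

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]

# Map each expiry to the row with the smallest 'diff' seen so far.
best = {}
for row in data:
    key = row['expiry']
    if key not in best or row['diff'] < best[key]['diff']:
        best[key] = row

# dicts preserve insertion order, so expiries appear in first-seen order.
outcome = list(best.values())
print(outcome)
```

This is one pass over the rows with no sorting step, at the cost of holding one row per expiry in the dict.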
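Use min with level, then you can use eq to compare the series with the extracted minimum. Note that the level argument to min was deprecated in later pandas releases and removed in pandas 2.0; there, df['diff'].groupby(level=0).min() gives the same grouped minimum.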
df[df['diff'].eq(df['diff'].min(level=0))]
Output:
strike diff
expiry
2020-06-26 138.5 0.188
2020-07-24 138.5 0.203
An approach based on np.minimum.reduceat:
import numpy as np
sidx = df.index.argsort()
df_s = df.iloc[sidx]
I = df_s.index.values
cutidx = np.flatnonzero(np.r_[True,I[:-1]!=I[1:]])
out = np.minimum.reduceat(df_s.values, cutidx, axis=0)
df_out = pd.DataFrame(out, index=I[cutidx], columns=df_s.columns)
If the input DataFrame is already sorted by index, use df directly as df_s.
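One thing to watch with this approach: np.minimum.reduceat reduces each column independently, so on the question's sample data the strike column comes out as 137.5 (the smallest strike per expiry), not the 138.5 that sits on the row with the smallest diff. If whole rows are wanted, a variant along the following lines may help. It is my own sketch rather than the answer's method, and `codes`, `order`, and `first` are hypothetical names: it sorts by expiry and then by diff with np.lexsort and takes the first row of each expiry.

```python
import datetime as dt
import numpy as np
import pandas as pd

data = [{'expiry': dt.datetime(2020, 6, 26), 'strike': 137.5, 'diff': 0.797},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 138.5, 'diff': 0.188},
        {'expiry': dt.datetime(2020, 6, 26), 'strike': 139.0, 'diff': 0.688},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 137.5, 'diff': 0.805},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.0, 'diff': 0.305},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 138.5, 'diff': 0.203},
        {'expiry': dt.datetime(2020, 7, 24), 'strike': 139.0, 'diff': 0.703}]
df = pd.DataFrame(data).set_index('expiry')

# Integer code per expiry (0, 1, ...) in sorted order of the unique expiries.
_, codes = np.unique(df.index.values, return_inverse=True)

# lexsort treats the LAST key as primary: sort by expiry code, then by diff.
order = np.lexsort((df['diff'].to_numpy(), codes))
codes_s = codes[order]

# Positions (within the sorted order) where a new expiry starts.
first = np.flatnonzero(np.r_[True, codes_s[:-1] != codes_s[1:]])

# The first row of each expiry block is the one with the smallest diff.
df_out = df.iloc[order[first]]
print(df_out)
```

Unlike the boolean-mask answers, this returns exactly one row per expiry even when several rows tie for the minimum.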