Within-week operation on a pandas DataFrame indexed by datetime
I would like to get the most frequent value per week from each column of a DataFrame indexed by datetime. I know this can be done when the DataFrame's entries are all int or float, but I am looking for a general method that does not rely on int or float dtypes.
Here is an example where each entry in the DataFrame is a tuple:
2015-11-15 00:00:00 (3, 10.0, 0) nan
2015-11-16 00:00:00 nan nan
2015-11-17 00:00:00 nan nan
2015-11-18 00:00:00 (3, 10.0, 0) nan
2015-11-19 00:00:00 (3, 10.0, 0) nan
2015-11-20 00:00:00 (4, 8.2, 0) nan
2015-11-21 00:00:00 (4, 8.2, 0) nan
2015-11-22 00:00:00 (4, 8.2, 0) (1, 1.4, 1)
2015-11-23 00:00:00 (3, 18.0, 1) (3, 10.0, 0)
2015-11-26 00:00:00 (4, 8.2, 0) (1, 1.4, 1)
2015-11-27 00:00:00 (4, 8.2, 0) (3, 10.0, 0)
2015-11-28 00:00:00 nan (1, 1.4, 1)
2015-11-29 00:00:00 (4, 8.2, 0) (3, 10.0, 0)
2015-11-30 00:00:00 (4, 8.2, 0) (1, 1.4, 1)
This should be transformed into a DataFrame consisting of the most frequent tuple within each week, as follows:
2015-11-15 00:00:00 (3, 10.0, 0) nan
2015-11-22 00:00:00 (4, 8.2, 0) (1, 1.4, 1)
My priority is efficiency; speed really matters in my application.
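For reference, the example above can be reproduced with something like the following (a minimal sketch; the column names One and Two are assumptions matching the answer below):

import numpy as np
import pandas as pd

idx = pd.to_datetime([
    '2015-11-15', '2015-11-16', '2015-11-17', '2015-11-18', '2015-11-19',
    '2015-11-20', '2015-11-21', '2015-11-22', '2015-11-23', '2015-11-26',
    '2015-11-27', '2015-11-28', '2015-11-29', '2015-11-30',
])
df = pd.DataFrame({
    'One': [(3, 10.0, 0), np.nan, np.nan, (3, 10.0, 0), (3, 10.0, 0),
            (4, 8.2, 0), (4, 8.2, 0), (4, 8.2, 0), (3, 18.0, 1),
            (4, 8.2, 0), (4, 8.2, 0), np.nan, (4, 8.2, 0), (4, 8.2, 0)],
    'Two': [np.nan] * 7 + [(1, 1.4, 1), (3, 10.0, 0), (1, 1.4, 1),
                           (3, 10.0, 0), (1, 1.4, 1), (3, 10.0, 0), (1, 1.4, 1)],
}, index=pd.Index(idx, name='Date'))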
Edit: here is another example where the suggested method gives unexpected labels.
3046920017503 3046920017541
index
2015-11-15 NaN NaN
2015-11-16 NaN NaN
2015-11-17 NaN NaN
2015-11-18 NaN NaN
2015-11-19 NaN NaN
2015-11-20 NaN NaN
2015-11-21 NaN NaN
2015-11-22 NaN NaN
2015-11-23 NaN NaN
2015-11-24 NaN NaN
2015-11-25 NaN NaN
2015-11-26 NaN NaN
2015-11-27 NaN NaN
2015-11-28 NaN NaN
2015-11-29 NaN NaN
2015-11-30 NaN NaN
2015-12-01 (3, 10.0, 0) (3, 10.0, 0)
2015-12-02 (3, 10.0, 0) (3, 10.0, 0)
2015-12-03 (3, 10.0, 0) (3, 10.0, 0)
2015-12-04 (3, 10.0, 0) (3, 10.0, 0)
2015-12-05 (3, 10.0, 0) (3, 10.0, 0)
2015-12-06 (3, 10.0, 0) (3, 10.0, 0)
This should be transformed into:
2015-11-15 NaN NaN
2015-11-22 NaN NaN
2015-11-29 (3, 10.0, 0) (3, 10.0, 0)
However, the result of the suggested method is:
3046920017503 3046920017541
index
2015-12-05 (3, 10.0, 0) (3, 10.0, 0)
2015-12-12 (3, 10.0, 0) (3, 10.0, 0)
Suppose this is my DataFrame df:
One Two
Date
2015-11-15 (3, 10.0, 0) NaN
2015-11-16 NaN NaN
2015-11-17 NaN NaN
2015-11-18 (3, 10.0, 0) NaN
2015-11-19 (3, 10.0, 0) NaN
2015-11-20 (4, 8.2, 0) NaN
2015-11-21 (4, 8.2, 0) NaN
2015-11-22 (4, 8.2, 0) (1, 1.4, 1)
2015-11-23 (3, 18.0, 1) (3, 10.0, 0)
2015-11-26 (4, 8.2, 0) (1, 1.4, 1)
2015-11-27 (4, 8.2, 0) (3, 10.0, 0)
2015-11-28 NaN (1, 1.4, 1)
2015-11-29 (4, 8.2, 0) (3, 10.0, 0)
2015-11-30 (4, 8.2, 0) (1, 1.4, 1)
import numpy as np
import pandas as pd

# 'W-Sat' tells pandas to end weeks on Saturday, i.e. weeks run Sunday through Saturday.
df.stack().groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(lambda s: s.value_counts().idxmax()).unstack()
One Two
Date
2015-11-21 (3, 10.0, 0) None
2015-11-28 (4, 8.2, 0) (1, 1.4, 1)
2015-12-05 (4, 8.2, 0) (3, 10.0, 0)
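This works for arbitrary hashable entries because value_counts only needs hashability, not a numeric dtype. A quick sanity check on a tuple-valued Series (a minimal sketch):

import pandas as pd

s = pd.Series([(3, 10.0, 0), (4, 8.2, 0), (4, 8.2, 0)])
# value_counts hashes each tuple; idxmax returns the most frequent one.
print(s.value_counts().idxmax())  # (4, 8.2, 0)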
Another way to accomplish this is to stack first and then manipulate the index level values:
ds = df.stack()
# Whole weeks elapsed since the earliest date in the index.
g1 = (ds.index.get_level_values(0) - ds.index.levels[0].min()).days // 7
# The column names, i.e. the second level of the stacked index.
g2 = ds.index.get_level_values(1)
ds.groupby([g1, g2]).apply(lambda s: s.value_counts().idxmax()).unstack()
One Two
0 (3, 10.0, 0) None
1 (4, 8.2, 0) (1, 1.4, 1)
2 (4, 8.2, 0) (3, 10.0, 0)
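Since the question stresses speed, a hypothetical micro-optimization is to compute each group's mode with collections.Counter instead of building a full value_counts Series; whether it actually wins depends on group sizes, so benchmark before committing:

from collections import Counter

import numpy as np

def counter_mode(s):
    # most_common(1) returns [(value, count)] for the top value;
    # an empty or all-NaN group yields an empty list, hence the guard.
    top = Counter(s.dropna()).most_common(1)
    return top[0][0] if top else np.nan

df.stack().groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(counter_mode).unstack()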
If you have np.nan spanning an entire week and you want to return np.nan for those weeks, we need to tell stack not to drop NaNs and pass a function to apply that can handle those np.nan values:
def value_counts_idxmax(s):
    try:
        return s.value_counts().idxmax()
    except ValueError:
        # value_counts is empty for an all-NaN group, so idxmax raises.
        return np.nan

df.stack(dropna=False).groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(value_counts_idxmax).unstack()
3046920017503 3046920017541
index
2015-11-21 NaN NaN
2015-11-28 NaN NaN
2015-12-05 (3, 10.0, 0) (3, 10.0, 0)
2015-12-12 (3, 10.0, 0) (3, 10.0, 0)
Or, with the second approach:
ds = df.stack(dropna=False)
g1 = (ds.index.get_level_values(0) - ds.index.levels[0].min()).days // 7
g2 = ds.index.get_level_values(1)
ds.groupby([g1, g2]).apply(value_counts_idxmax).unstack()
3046920017503 3046920017541
0 NaN NaN
1 NaN NaN
2 (3, 10.0, 0) (3, 10.0, 0)
3 (3, 10.0, 0) (3, 10.0, 0)
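Regarding the edit: the 'W-Sat' labels mark the Saturday that ends each week, whereas the expected output labels weeks by their Sunday start. One way to reconcile them, sketched under the assumption that the dropna=False result above is what you want, is to shift the index back six days after grouping:

result = df.stack(dropna=False).groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(value_counts_idxmax).unstack()

# Each label is the Saturday ending a week; that week began six days
# earlier, on Sunday, matching the labels in the expected output.
result.index = result.index - pd.Timedelta(days=6)

For the integer-labeled second approach, the same dates can be rebuilt from the group numbers, e.g. ds.index.levels[0].min() + pd.to_timedelta(7 * week_numbers, unit='D'), where week_numbers stands for the integer index of that result.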