Within-week operation on a pandas DataFrame indexed by datetime

I want to get the most frequent value per week from each column of a DataFrame indexed by datetime. I know this can be done when the DataFrame's entries are all int or float, but I am looking for a general method that does not rely on int/float dtypes.

Here is an example in which every entry of the DataFrame is a tuple:

2015-11-15 00:00:00   (3, 10.0, 0)   nan
2015-11-16 00:00:00   nan            nan
2015-11-17 00:00:00   nan            nan
2015-11-18 00:00:00   (3, 10.0, 0)   nan
2015-11-19 00:00:00   (3, 10.0, 0)   nan
2015-11-20 00:00:00   (4, 8.2, 0)    nan
2015-11-21 00:00:00   (4, 8.2, 0)    nan
2015-11-22 00:00:00   (4, 8.2, 0)    (1, 1.4, 1)
2015-11-23 00:00:00   (3, 18.0, 1)   (3, 10.0, 0)
2015-11-26 00:00:00   (4, 8.2, 0)    (1, 1.4, 1)
2015-11-27 00:00:00   (4, 8.2, 0)    (3, 10.0, 0)
2015-11-28 00:00:00   nan            (1, 1.4, 1)
2015-11-29 00:00:00   (4, 8.2, 0)    (3, 10.0, 0)
2015-11-30 00:00:00   (4, 8.2, 0)    (1, 1.4, 1)

This should reduce to a DataFrame made up of the most frequent tuple within each week, as follows:

2015-11-15 00:00:00   (3, 10.0, 0)   nan
2015-11-22 00:00:00   (4, 8.2, 0)    (1, 1.4, 1)

My priority is efficiency: speed really matters in my application.
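For reference, a frame like the sample above can be built as follows (a sketch; the column names One/Two follow the answer below, and the tuple values follow the sample data):

```python
import numpy as np
import pandas as pd

# datetime index matching the sample (note 2015-11-24/25 are absent)
dates = pd.to_datetime([
    '2015-11-15', '2015-11-16', '2015-11-17', '2015-11-18', '2015-11-19',
    '2015-11-20', '2015-11-21', '2015-11-22', '2015-11-23', '2015-11-26',
    '2015-11-27', '2015-11-28', '2015-11-29', '2015-11-30',
])
one = [(3, 10.0, 0), np.nan, np.nan, (3, 10.0, 0), (3, 10.0, 0),
       (4, 8.2, 0), (4, 8.2, 0), (4, 8.2, 0), (3, 18.0, 1),
       (4, 8.2, 0), (4, 8.2, 0), np.nan, (4, 8.2, 0), (4, 8.2, 0)]
two = [np.nan] * 7 + [(1, 1.4, 1), (3, 10.0, 0), (1, 1.4, 1),
                      (3, 10.0, 0), (1, 1.4, 1), (3, 10.0, 0), (1, 1.4, 1)]
# mixed tuples and NaN give object-dtype columns
df = pd.DataFrame({'One': one, 'Two': two}, index=dates)
df.index.name = 'Date'
```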

EDIT

           3046920017503 3046920017541
index                                 
2015-11-15           NaN           NaN
2015-11-16           NaN           NaN
2015-11-17           NaN           NaN
2015-11-18           NaN           NaN
2015-11-19           NaN           NaN
2015-11-20           NaN           NaN
2015-11-21           NaN           NaN
2015-11-22           NaN           NaN
2015-11-23           NaN           NaN
2015-11-24           NaN           NaN
2015-11-25           NaN           NaN
2015-11-26           NaN           NaN
2015-11-27           NaN           NaN
2015-11-28           NaN           NaN
2015-11-29           NaN           NaN
2015-11-30           NaN           NaN
2015-12-01  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-02  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-03  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-04  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-05  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-06  (3, 10.0, 0)  (3, 10.0, 0)

should be transformed into:

2015-11-15           NaN           NaN
2015-11-22           NaN           NaN
2015-11-29  (3, 10.0, 0)  (3, 10.0, 0)

But the result of the suggested method is:

           3046920017503 3046920017541
index                                 
2015-12-05  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-12  (3, 10.0, 0)  (3, 10.0, 0)

Suppose this is my DataFrame df:

                     One           Two
Date                                  
2015-11-15  (3, 10.0, 0)           NaN
2015-11-16           NaN           NaN
2015-11-17           NaN           NaN
2015-11-18  (3, 10.0, 0)           NaN
2015-11-19  (3, 10.0, 0)           NaN
2015-11-20   (4, 8.2, 0)           NaN
2015-11-21   (4, 8.2, 0)           NaN
2015-11-22   (4, 8.2, 0)   (1, 1.4, 1)
2015-11-23  (3, 18.0, 1)  (3, 10.0, 0)
2015-11-26   (4, 8.2, 0)   (1, 1.4, 1)
2015-11-27   (4, 8.2, 0)  (3, 10.0, 0)
2015-11-28           NaN   (1, 1.4, 1)
2015-11-29   (4, 8.2, 0)  (3, 10.0, 0)
2015-11-30   (4, 8.2, 0)   (1, 1.4, 1)

# 'W-Sat' tells pandas to end weeks on Saturday.
df.stack().groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(lambda s: s.value_counts().idxmax()).unstack()

                     One           Two
Date                                  
2015-11-21  (3, 10.0, 0)          None
2015-11-28   (4, 8.2, 0)   (1, 1.4, 1)
2015-12-05   (4, 8.2, 0)  (3, 10.0, 0)
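A note on the labels: 'W-Sat' stamps each week with its Saturday end date, while the question shows week-start (Sunday) dates. If start dates are wanted instead (an assumption about the desired labelling, not part of the original answer), one option is to shift the resulting index back six days. A stand-alone illustration:

```python
import pandas as pd

# two week-ending labels as produced by the 'W-Sat' grouping above
res = pd.DataFrame(
    {'One': [(3, 10.0, 0), (4, 8.2, 0)]},
    index=pd.to_datetime(['2015-11-21', '2015-11-28']),
)
# shift each Saturday end label back to the Sunday that opens its week
res.index = res.index - pd.Timedelta(days=6)
```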

Another way to accomplish this is to stack first and then manipulate the index level values:

ds = df.stack()
g1 = (ds.index.get_level_values(0) - ds.index.levels[0].min()).days // 7
g2 = ds.index.get_level_values(1)
ds.groupby([g1, g2]).apply(lambda s: s.value_counts().idxmax()).unstack()

            One           Two
0  (3, 10.0, 0)          None
1   (4, 8.2, 0)   (1, 1.4, 1)
2   (4, 8.2, 0)  (3, 10.0, 0)
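This variant labels rows with integer week numbers rather than dates. If dates are needed, the integers can be mapped back to week-start dates; a small sketch, assuming weeks are counted from the minimum of the original index (2015-11-15):

```python
import pandas as pd

# integer week labels as produced by the groupby above
weeks = pd.Index([0, 1, 2])
start = pd.Timestamp('2015-11-15')  # ds.index.levels[0].min() in the answer
# each label is a whole number of weeks past the start date
dates = start + pd.to_timedelta(weeks * 7, unit='D')
```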

If you have np.nan spanning entire weeks and you want to return np.nan for those weeks, we need to tell stack not to drop NaN values, and pass apply a function that can handle those np.nan:

def value_counts_idxmax(s):
    try:
        return s.value_counts().idxmax()
    except ValueError:
        # an all-NaN group has an empty value_counts, so idxmax raises
        return np.nan

df.stack(dropna=False).groupby(
    [pd.Grouper(level=0, freq='W-Sat'), pd.Grouper(level=1)]
).apply(value_counts_idxmax).unstack()


           3046920017503 3046920017541
index                                 
2015-11-21           NaN           NaN
2015-11-28           NaN           NaN
2015-12-05  (3, 10.0, 0)  (3, 10.0, 0)
2015-12-12  (3, 10.0, 0)  (3, 10.0, 0)

Or, with the second approach:

ds = df.stack(dropna=False)
g1 = (ds.index.get_level_values(0) - ds.index.levels[0].min()).days // 7
g2 = ds.index.get_level_values(1)
ds.groupby([g1, g2]).apply(value_counts_idxmax).unstack()

  3046920017503 3046920017541
0           NaN           NaN
1           NaN           NaN
2  (3, 10.0, 0)  (3, 10.0, 0)
3  (3, 10.0, 0)  (3, 10.0, 0)
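On the question's speed concern: the per-group lambda can also be written with Series.mode, which ignores NaN and so handles all-NaN groups without a try/except. This is not benchmarked here, so treat it as a sketch of an alternative rather than a recommendation:

```python
import numpy as np
import pandas as pd

def most_frequent(s):
    # mode() ignores NaN; an empty result means the group was all NaN
    m = s.mode()
    return m.iat[0] if len(m) else np.nan

# drop-in for the lambda: ds.groupby([g1, g2]).apply(most_frequent).unstack()
print(most_frequent(pd.Series([(4, 8.2, 0), (4, 8.2, 0), (3, 18.0, 1)], dtype=object)))
```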