Working with a multiindex dataframe, to get summation results over a boolean column, based on a condition from another column

Question

We have a multiindex dataframe that looks like this:

                                  date   condition_1    condition_2
item1   0    2021-06-10 06:30:00+00:00          True          False
        1    2021-06-10 07:00:00+00:00         False           True
        2    2021-06-10 07:30:00+00:00          True           True
item2   3    2021-06-10 06:30:00+00:00          True          False
        4    2021-06-10 07:00:00+00:00          True           True
        5    2021-06-10 07:30:00+00:00          True           True
item3   6    2021-06-10 06:30:00+00:00          True           True
        7    2021-06-10 07:00:00+00:00         False           True
        8    2021-06-10 07:30:00+00:00          True           True

The date values repeat across items (because the df is the result of a default concat on a dictionary of dataframes).
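For anyone who wants to reproduce the frame, here is a minimal sketch of how it could be built. The per-item frames and the `frames` dictionary are assumptions for illustration; only the column names and values come from the table above.

```python
import pandas as pd

# The three timestamps shared by every item (UTC, as in the table above)
dates = pd.to_datetime(
    ["2021-06-10 06:30", "2021-06-10 07:00", "2021-06-10 07:30"], utc=True
)

# Hypothetical per-item frames; each keeps its own integer index
frames = {
    "item1": pd.DataFrame(
        {"date": dates,
         "condition_1": [True, False, True],
         "condition_2": [False, True, True]},
        index=[0, 1, 2],
    ),
    "item2": pd.DataFrame(
        {"date": dates,
         "condition_1": [True, True, True],
         "condition_2": [False, True, True]},
        index=[3, 4, 5],
    ),
    "item3": pd.DataFrame(
        {"date": dates,
         "condition_1": [True, False, True],
         "condition_2": [True, True, True]},
        index=[6, 7, 8],
    ),
}

# A default concat on a dict uses the keys as the outer index level,
# which is why the same dates show up once per item
df = pd.concat(frames)
print(df)
```

Note that the outer index level produced this way has no name, which matters for any code that refers to it as 'item'.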

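The logic we basically want to vectorize is: "for every date where condition_1 is True for all items, sum the number of occurrences where condition_2 is True across all of them, and put that in a new result column".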

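Based on the example above, the result would basically look like this (comments on how each value is derived are next to the result column):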

                                  date   condition_1    condition_2    result
item1   0    2021-06-10 06:30:00+00:00          True          False         1 [because condition_1 is True for all items and condition_2 is True once]
        1    2021-06-10 07:00:00+00:00         False           True         0 [condition_1 is not True for all items so condition_2 is irrelevant]
        2    2021-06-10 07:30:00+00:00          True           True         3 [both conditions are True for all 3 items]
item2   3    2021-06-10 06:30:00+00:00          True          False         1 [a repeat for the same reasons]
        4    2021-06-10 07:00:00+00:00          True           True         0 [a repeat for the same reasons]
        5    2021-06-10 07:30:00+00:00          True           True         3 [a repeat for the same reasons]
item3   6    2021-06-10 06:30:00+00:00          True           True         1 [a repeat for the same reasons]
        7    2021-06-10 07:00:00+00:00         False           True         0 [a repeat for the same reasons]
        8    2021-06-10 07:30:00+00:00          True           True         3 [a repeat for the same reasons]

Answer 1

Here's what I came up with.

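def cond_sum(g):
    # condition_2 count for this date, zeroed unless condition_1 holds for every item
    return g['condition_1'].all() * g['condition_2'].sum()

# Name the outer level 'item' (it is unnamed after a default concat), then move it to a column
df = df.rename_axis(index=['item', None]).reset_index(level=0)
per_date = df.groupby('date')[['condition_1', 'condition_2']].apply(cond_sum)
# Map the per-date totals back onto the rows; assigning them directly would misalign on the index
df['result'] = df['date'].map(per_date)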

Then, if you want the original index back, you can add it again:

df = df.set_index('item', append=True).swaplevel()

Note that you mentioned vectorization, so you could swap the apply for:

dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg['condition_1'] * dfg['condition_2'])
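As a possible alternative that leaves the MultiIndex untouched, the same logic can be sketched with `groupby(...).transform`, which broadcasts each per-date aggregate back to the original rows. This is an illustration rather than part of the answer above; the frame is rebuilt here from the question's table, and the level name 'item' is an assumption.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2021-06-10 06:30", "2021-06-10 07:00", "2021-06-10 07:30"], utc=True
)

# Rebuild the question's frame; the outer level name 'item' is assumed
index = pd.MultiIndex.from_arrays(
    [["item1"] * 3 + ["item2"] * 3 + ["item3"] * 3, list(range(9))],
    names=["item", None],
)
df = pd.DataFrame(
    {
        "date": dates.tolist() * 3,
        "condition_1": [True, False, True, True, True, True, True, False, True],
        "condition_2": [False, True, True, False, True, True, True, True, True],
    },
    index=index,
)

by_date = df.groupby("date")
# True on every row of a date only if condition_1 holds for all items on that date
all_c1 = by_date["condition_1"].transform("all")
# Number of items with condition_2 True on that date, repeated on every row
sum_c2 = by_date["condition_2"].transform("sum")

# Keep the count where condition_1 held for everyone, otherwise 0
df["result"] = sum_c2.where(all_c1, 0).astype(int)
print(df)
```

Because `transform` returns a Series aligned to the original index, no `reset_index` / `set_index` round trip is needed, and the `result` column comes out as 1, 0, 3 repeated for each item, matching the expected output in the question.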

numpy

dataframe

pandas