Getting interquartile range and median from pandas groupby, zero-padding for all unmentioned dates

Question

I have a dataframe like this (except that mine is very large):

user1      user2   day   hour  quantity
-------------------------------------
Alice      Bob      1     12     250
Alice      Bob      1     13     250
Bob        Carol    1     10     20
Alice      Bob      4     1      600
.
.
.

...and then say I get the following groupby and aggregate (by user1, user2 and day):

user1      user2   day   quantity
---------------------
Alice      Bob      1      500
                    4      600
Bob        Carol    1      20
                    3      100

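where day should range over 0-364 (365 days). What I want is the interquartile range (and median) of the per-day quantities for each user pair over all days, except that the zeros aren't counted, since days with no activity never appear in the data.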

Life would be easier if I had explicit zeros for all the excluded days:

user1    user2    day   quantity
---------------------
Alice    Bob      1      500
                  2      0
                  3      0
                  4      600
.....
Bob      Carol    1      20
                  2      0
                  3      100
...

... because then I could do df.reset_index().agg({'quantity':scipy.stats.iqr}), but I'm working with a very large dataframe (the example above is a dummy one), and reindexing with zeros is not feasible.

I have an idea of how to do it: since I know there are 365 days, I should just pad the rest of the numbers with zeros:

Alice-Bob: [500,600] + (365-2) * [0]

and get scipy.stats.iqr (and the median) of that. However, this would involve iterating over all user1-user2 pairs. From experience, that takes a lot of time.
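For a single pair, the padding idea can be sketched as below. This is only a minimal illustration using the Alice-Bob totals from the example; N_DAYS and the variable names are made up for the sketch.

```python
import numpy as np
from scipy import stats

N_DAYS = 365  # assumed length of the full day range (0-364)

# Daily totals actually present for Alice-Bob in the example above.
observed = np.array([500, 600])

# Pad with one zero for every day that has no row.
padded = np.concatenate([observed, np.zeros(N_DAYS - observed.size)])

median = np.median(padded)
iqr = stats.iqr(padded)  # Q3 - Q1 of the padded values
```

With only 2 non-zero days out of 365, both quartiles and the median fall on padded zeros, so both statistics are 0 for this pair.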

Is there a vectorized solution? I also have to get the median, and I think the same approach should apply.

Answer 1

To take advantage of the zeros without putting them in the dataframe, you can use something like this:

import numpy as np

# One row per (user1, user2) pair, holding its days and per-day quantities as tuples.
# Note: mean() reproduces the output shown below; sum() would give the daily totals from the question.
test = df.groupby(['user1', 'user2', 'day'])['quantity'].mean().reset_index()\
         .groupby(['user1', 'user2'])\
         .agg({'day': lambda x: tuple(x), 'quantity': lambda x: tuple(x)})\
         .reset_index()

def med_from_tuple(row):
    # start with all 365 days at zero, then overwrite the days that are present in the dataframe
    z = np.zeros(365)
    np.put(z, row['day'], row['quantity'])
    return np.median(z)

test['example'] = test.apply(med_from_tuple, axis=1)

This produces the median of the quantities as if the zeros were present in the dataframe.

test
#   user1  user2     day    quantity   example
#0  Alice    Bob  (1, 4)  (250, 600)       0.0
#1    Bob  Carol    (1,)       (20,)       0.0
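
The same zero-array idea can probably be pushed further to get the IQR as well and to avoid the row-wise apply. The sketch below is one possible way, not a tested solution for the real data: it assumes day is an integer in 0-364, rebuilds only the four rows visible in the question as toy data, and scatters the hourly quantities into a dense (pair, day) NumPy array with np.add.at, which also sums the hours of a day. The names N_DAYS, pair_id, dense and result are made up for the illustration.

```python
import numpy as np
import pandas as pd

N_DAYS = 365  # assumption: days are integers in 0..364

# Toy data: only the rows visible in the question.
df = pd.DataFrame({
    "user1": ["Alice", "Alice", "Bob", "Alice"],
    "user2": ["Bob", "Bob", "Carol", "Bob"],
    "day": [1, 1, 1, 4],
    "hour": [12, 13, 10, 1],
    "quantity": [250, 250, 20, 600],
})

# Integer id for every (user1, user2) pair, numbered in sorted group order.
grouped = df.groupby(["user1", "user2"])
pair_id = grouped.ngroup().to_numpy()

# Dense (pair, day) matrix; np.add.at accumulates repeated (pair, day) cells,
# so hourly rows are summed into daily totals and untouched cells stay zero.
dense = np.zeros((grouped.ngroups, N_DAYS))
np.add.at(dense, (pair_id, df["day"].to_numpy()), df["quantity"].to_numpy())

# Quartiles of every row at once, zeros included.
q1, med, q3 = np.percentile(dense, [25, 50, 75], axis=1)

result = pd.DataFrame(
    {"median": med, "iqr": q3 - q1},
    index=grouped.size().index,  # pair labels, same order as ngroup()
)
```

The dense array needs n_pairs * 365 floats, which is far lighter than reindexing the dataframe but may still be too much for a very large number of pairs; in that case the pairs could be processed in batches of pair_id.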
