Datetime rolling count per category in Pandas

Question

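Starting from a DataFrame with date and user columns, I'd like to add a third column, count_past_5_days, indicating how many times each row's user appears in the rolling window of the past 5 days: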

date	user	count_past_5_days
2020-01-01	abc	1
2020-01-01	def	1
2020-01-02	abc	2
2020-01-03	abc	3
2020-01-04	abc	4
2020-01-04	def	2
2020-01-04	ghi	1
2020-01-05	abc	5
2020-01-06	abc	5
2020-01-07	abc	5

I tried the following:

df.set_index('date').rolling('5D')['user'].count()

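But this gives the total count over the past five rolling days, not just the count for the current row's user. How can I get this rolling count for only the specific user in each row?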
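To make the problem reproducible, here is a minimal sketch that rebuilds the sample data and runs the ungrouped window. The helper column `n` (a column of ones) is an assumption added for illustration; summing it stands in for counting `user` directly.

```python
import pandas as pd

# Sample data from the question; dates are parsed to datetime64 so that
# the '5D' offset window is valid.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07",
    ]),
    "user": ["abc", "def", "abc", "abc", "abc", "def", "ghi", "abc", "abc", "abc"],
})

# Hypothetical helper column of ones: summing it inside a window counts rows.
total = df.assign(n=1).set_index("date")["n"].rolling("5D").sum()

# The window sees every row regardless of user, so the count can exceed
# the per-user maximum of 5 shown in the desired output.
print(total.max() > 5)
```

Because the window is not split by user, rows belonging to def and ghi are counted alongside abc's rows.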

Answer 1

Try this; you can chain rolling onto groupby:

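(
    df.set_index('date')
    .groupby('user')['user']
    .rolling('5D')                # 5-day window within each user's rows
    .count()
    .rename('count_past_5_days')
    .reset_index()
    # restore date order; ignore_index renumbers the rows 0..9 as shown below
    .sort_values(['date', 'user'], ignore_index=True)
)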

Output:

  user       date  count_past_5_days
0  abc 2020-01-01                1.0
1  def 2020-01-01                1.0
2  abc 2020-01-02                2.0
3  abc 2020-01-03                3.0
4  abc 2020-01-04                4.0
5  def 2020-01-04                2.0
6  ghi 2020-01-04                1.0
7  abc 2020-01-05                5.0
8  abc 2020-01-06                5.0
9  abc 2020-01-07                5.0
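If the goal is to attach the result to the original frame as a new column, one option is to merge the counts back on `user` and `date`. This is only a sketch: it assumes each (user, date) pair appears at most once, as in the sample data, and it sums a column of ones rather than counting `user` directly. The cast to int is optional.

```python
import pandas as pd

# Sample data from the question, with dates parsed to datetime64.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07",
    ]),
    "user": ["abc", "def", "abc", "abc", "abc", "def", "ghi", "abc", "abc", "abc"],
})

# Per-user rolling count: a column of ones summed over a 5-day window.
counts = (
    df.assign(count_past_5_days=1)
    .set_index("date")
    .groupby("user")["count_past_5_days"]
    .rolling("5D")
    .sum()
    .astype(int)       # rolling results are float; counts read better as int
    .reset_index()     # back to plain columns: user, date, count_past_5_days
)

# A left merge keeps the original row order of df.
# Assumption: (user, date) pairs are unique, otherwise rows would multiply.
result = df.merge(counts, on=["user", "date"], how="left")
print(result)
```

The merge keeps `df` untouched and returns a new frame with the extra column.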

Answer 2

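You can sum a 'dummy' column whose values are all 1. This is similar to the dummy-column trick pd.crosstab uses under the hood, though here we get to name the output column directly.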

out = (
    df.assign(count_past_5_days=1)
    .groupby('user')
    .rolling('5D', on='date')['count_past_5_days']
    .sum()
)

print(out)
user  date      
abc   2020-01-01    1.0
      2020-01-02    2.0
      2020-01-03    3.0
      2020-01-04    4.0
      2020-01-05    5.0
      2020-01-06    5.0
      2020-01-07    5.0
def   2020-01-01    1.0
      2020-01-04    2.0
ghi   2020-01-04    1.0
Name: count_past_5_days, dtype: float64

This produces a Series whose values match what you want. If you'd like the output to line up visually with your input, you can use any of the following:

out.sort_index(level='date').reset_index()
out.reset_index().sort_values('date')
out.reindex(pd.MultiIndex.from_frame(df).swaplevel()).reset_index()

Note that if your data happens to be unsorted, option 3 preserves the original row order.

>>> out.sort_index(level='date').reset_index()
  user       date  count_past_5_days
0  abc 2020-01-01                1.0
1  def 2020-01-01                1.0
2  abc 2020-01-02                2.0
3  abc 2020-01-03                3.0
4  abc 2020-01-04                4.0
5  def 2020-01-04                2.0
6  ghi 2020-01-04                1.0
7  abc 2020-01-05                5.0
8  abc 2020-01-06                5.0
9  abc 2020-01-07                5.0
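The order-preserving behaviour of option 3 can be sketched on a shuffled copy of the sample data. The names `shuffled` and `aligned` are hypothetical, and the sketch assumes (as recent pandas versions appear to require) that dates must be sorted within each user before a time-based rolling window is applied, so it sorts a copy first and then reindexes back to the shuffled order.

```python
import pandas as pd

# Sample data from the question, with dates parsed to datetime64.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
        "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07",
    ]),
    "user": ["abc", "def", "abc", "abc", "abc", "def", "ghi", "abc", "abc", "abc"],
})

# Hypothetical unsorted input: the same rows in a random order.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Compute the per-user rolling count on a date-sorted copy.
out = (
    shuffled.sort_values("date")
    .assign(count_past_5_days=1)
    .set_index("date")
    .groupby("user")["count_past_5_days"]
    .rolling("5D")
    .sum()
)

# out is indexed by (user, date); reindexing with the shuffled rows' own
# (user, date) pairs returns the counts in the shuffled row order.
target = pd.MultiIndex.from_frame(shuffled[["user", "date"]])
aligned = out.reindex(target).reset_index()
print(aligned)
```

Selecting the columns in (user, date) order when building the target index avoids the swaplevel step; it relies on the (user, date) pairs being unique.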
