Longest continuous streaks of multiple users

Question

I am trying to find a solution to the following problem:

Provided a table with user_id and the dates they visited the platform, find the top 100 users with the longest continuous streak of visiting the platform as of yesterday.

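I found answers that explain how to do this for a single user. However, I am not sure how to do it for multiple users.

A naive idea would be to get all unique users and apply those answers in a for loop to find the users with the longest run of consecutive visits. However, I am interested in a vectorized approach if one is possible.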
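For reference, the naive per-user loop described above might look like the following sketch. The helper name `longest_streak` and the toy `visits` frame are made up for illustration; only the column names `uid` and `date_val` come from the question. It is useful mainly as a slow baseline to check a vectorized result against.

```python
import pandas as pd

def longest_streak(dates):
    """Longest run of consecutive calendar days among the given dates."""
    best = run = 0
    prev = None
    for day in sorted(set(dates)):
        # Extend the run if this day directly follows the previous one, else restart
        if prev is not None and day - prev == pd.Timedelta(days=1):
            run += 1
        else:
            run = 1
        best = max(best, run)
        prev = day
    return best

# Hypothetical toy data: user 1 visits on Jan 1-3 and Jan 5, user 2 on Jan 1 and Jan 3-4
visits = pd.DataFrame({
    "uid": [1, 1, 1, 1, 2, 2, 2],
    "date_val": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
        "2021-01-01", "2021-01-03", "2021-01-04",
    ]),
})

# One pass per user: simple, but a Python-level loop over every user
result = {
    int(uid): longest_streak(visits.loc[visits["uid"] == uid, "date_val"])
    for uid in visits["uid"].unique()
}
top = pd.Series(result).sort_values(ascending=False).head(100)
print(top)
```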

In case it is needed, this is the code I used:

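import numpy as np
import pandas as pd

# 400 random (date, uid) pairs: 10 users, 20 possible days starting at the Unix epoch
date_series = pd.Series(np.random.randint(0,10, 400), index=pd.to_datetime(np.random.randint(0,20, 400)*1e9*24*3600), name="uid")
df = date_series.reset_index().rename({"index":"date_val"}, axis=1).drop_duplicates().reset_index(drop=True)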

For a given user ID (say uid = 1), I can find the longest streak as follows:

sub_df = df[df.uid==1].sort_values("date_val")
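# Flag the first day of each streak (no visit on the day before), label streaks
# with the running total of flags, then take the size of the largest label group.
# Comparing against shift(-1) would flag the last day of a streak instead and split it off.
(sub_df.date_val - pd.Timedelta(days=1) != sub_df.date_val.shift(1)).cumsum().value_counts().max()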
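To make the cumulative-sum trick concrete, here is a minimal sketch on hand-picked dates for one user (the dates are invented for illustration). Every row that does not follow the previous row by exactly one day raises a flag, and the running total of flags gives each run of consecutive days its own label.

```python
import pandas as pd

# Hypothetical visit dates for a single user: a 3-day run, a lone day, a 2-day run
dates = pd.Series(pd.to_datetime([
    "2021-01-01", "2021-01-02", "2021-01-03",
    "2021-01-05",
    "2021-01-07", "2021-01-08",
]))

# True wherever the gap to the previous visit is not exactly one day;
# the first row has no previous visit (NaT), so it also starts a streak
starts = dates.diff() != pd.Timedelta(days=1)

# Running count of streak starts: one integer label per streak
labels = starts.cumsum()

# The most frequent label belongs to the longest streak
longest = labels.value_counts().max()

print(labels.tolist())  # [1, 1, 1, 2, 3, 3]
print(longest)          # 3
```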

But I do not see how to do something similar for all users in the original dataframe (df) with a vectorized approach (no for loop).

Answer 1

I took the long way around; there may well be a shorter one. Let's try:

df = df.sort_values(by=['uid', 'date_val'])  # Sort df by user, then by date

# Check sequence: gap in days to the previous visit, computed within each user
df = df.assign(diff=df.groupby('uid')['date_val'].diff().dt.days)

# Create a grouper: a new streak starts wherever the gap is not exactly one day.
# This covers each user's first row (diff is NaN) and single-day streaks, which
# would otherwise be merged into the preceding streak.
s = df['diff'].ne(1).cumsum()

# Get streak length
df['streak'] = df.groupby([s, 'uid'])['date_val'].transform('count')

# Isolate max streak
new = df[df['streak'] == df.groupby('uid')['streak'].transform('max')].drop(columns=['diff']).sort_values(by=['uid', 'date_val'])

Answer 2

With the help of @wwnde's answer, I arrived at the following solution. Posting it in case someone finds it useful.

df.sort_values(["uid", "date_val"], inplace=True)  # sort the df

df["diff1"] = df.date_val.diff().dt.days  # new column to store the date diff

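# to create groups of consecutive days: start a new segment when the gap is not
# one day or when the row belongs to a different user than the previous row
# (shift(-1) would compare with the next row and split off each user's last day)
df["user_segments"] = ((df.diff1 != 1)|(df.uid != df.uid.shift())).cumsum()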

# parentheses replace the backslash continuations, which cannot be followed by a comment
(
    df.groupby(["uid", "user_segments"])["date_val"].count().reset_index()  # now date_val column has consecutive day counts
    .groupby("uid")["date_val"].max()  # then group by and get the max for each user
    .sort_values(ascending=False).iloc[:100]  # finally sort it and get the first 100 users
)
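The code above ranks users by their longest streak at any time, while the question asks for streaks "as of yesterday". One possible reading is that only streaks whose last visit falls on yesterday's date should count. The sketch below follows that interpretation and is not taken from either answer; the function name `top_streaks`, its `as_of` parameter, and the toy `visits` frame are made up for illustration, and it assumes `date_val` holds midnight-normalized dates as in the question's sample data.

```python
import pandas as pd

def top_streaks(df, n=100, as_of=None):
    """Return the n users with the longest run of consecutive visit days.

    If as_of is given, only streaks whose last visit is on that date count.
    """
    d = df[["uid", "date_val"]].drop_duplicates().sort_values(["uid", "date_val"])
    # Gap to the previous visit within each user (NaT on each user's first row)
    gap = d.groupby("uid")["date_val"].diff()
    # A new streak starts wherever the gap is not exactly one day
    seg = (gap != pd.Timedelta(days=1)).cumsum().rename("seg")
    # One row per streak: its user, its length, and its last day
    streaks = (
        d.groupby(["uid", seg])["date_val"]
        .agg(length="count", end="max")
        .reset_index()
    )
    if as_of is not None:
        streaks = streaks[streaks["end"] == pd.Timestamp(as_of)]
    return streaks.groupby("uid")["length"].max().nlargest(n)

# Hypothetical toy data
visits = pd.DataFrame({
    "uid": [1, 1, 1, 1, 2, 2, 2, 3],
    "date_val": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-05",
        "2021-01-01", "2021-01-04", "2021-01-05",
        "2021-01-05",
    ]),
})

overall = top_streaks(visits)                      # longest streak at any time
current = top_streaks(visits, as_of="2021-01-05")  # only streaks ending on that day
print(overall)
print(current)
```

On live data, `as_of` could be set to `pd.Timestamp.today().normalize() - pd.Timedelta(days=1)` to mean yesterday. Users with no visit on that date drop out of the result entirely under this interpretation.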

Tags: python, series, dataframe, pandas