How to get the last 5 rows per column value in pandas (5 actions before abandonment)

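I'm relatively new to Python and to this forum, so thanks in advance for your help.

I'm trying to get the last 5 actions a customer performed before leaving a web page.

If I have a data sample like this: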

   index  session_uuid  timestamp   action
0      1             1          2  action1
1      2             1          4  action2
2      3             1          5  action3
3      4             1          7  action4
4      5             2          2  action1
5      6             2          4  action2
6      7             2         10  action3
7      8             2         15  action4

And the desired result would be:

session_uuid  action-1  action-2  action-3  action-4  action-5
1             action4   action3   action2   action1
2             action4   action3   action2   action1

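Python 3 is preferred. I tried the df.tail() function, but I'm not sure how to group by each session and then transpose the result into separate columns.

Assuming session_uuid identifies the user, here is an example that returns only the last two actions. Change the 2 to 5 if you like.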

import numpy as np
import pandas as pd

df = pd.DataFrame({'session_uuid': [1, 1, 1, 1, 2, 2, 2, 2],
          'timestamp': [2, 4, 5, 7, 2, 4, 10, 15],
          'action': ['action1', 'action2', 'action3', 'action4', 'action1', 'action2', 'action3', 'action4']})
print(df)
   session_uuid  timestamp   action
0             1          2  action1
1             1          4  action2
2             1          5  action3
3             1          7  action4
4             2          2  action1
5             2          4  action2
6             2         10  action3
7             2         15  action4

# first sort the values, then groupby users
df = df.sort_values(['session_uuid','timestamp'])
df1 = df.groupby('session_uuid')['action'].apply(lambda x: list(x)[-2:])
print(df1)
session_uuid
1    [action3, action4]
2    [action3, action4]

If you prefer a DataFrame rather than a Series:

df1 = df1.to_frame('action').reset_index()
print(df1)
   session_uuid              action
0             1  [action3, action4]
1             2  [action3, action4]
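To go one step further and put each action in its own column, as in the desired result, the lists can be expanded into a DataFrame. This is only a minimal sketch on the same sample data: the helper name `last_n_wide` and the `action-N` column labels are made up here for illustration, and it assumes that sessions with fewer than n actions should simply be padded with missing values on the right.

```python
import pandas as pd

df = pd.DataFrame({
    'session_uuid': [1, 1, 1, 1, 2, 2, 2, 2],
    'timestamp': [2, 4, 5, 7, 2, 4, 10, 15],
    'action': ['action1', 'action2', 'action3', 'action4',
               'action1', 'action2', 'action3', 'action4'],
})

def last_n_wide(frame, n):
    """Hypothetical helper: last n actions per session, newest first, one column each."""
    ordered = frame.sort_values(['session_uuid', 'timestamp'])
    # list of the last n actions per session, reversed so the newest comes first
    lists = ordered.groupby('session_uuid')['action'].apply(lambda x: list(x)[-n:][::-1])
    # expand each list into columns; shorter lists are padded with missing values
    wide = pd.DataFrame(lists.tolist(), index=lists.index).reindex(columns=range(n))
    wide.columns = [f'action-{i + 1}' for i in range(n)]
    return wide.reset_index()

print(last_n_wide(df, 5))
```

With the sample data every session has only four actions, so the `action-5` column comes out empty.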

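df.tail() returns the last rows (5 by default) of the whole DataFrame, not of each session. What you're looking for is a bit more involved. Below is some sample code that solves the problem and generalizes to the last K rows: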

import pandas as pd
import numpy as np

# create the dataset example
index = [1, 2, 3, 4, 5, 6, 7, 8]
session_uuid = [1, 1, 1, 1, 2, 2, 2, 2]
timestamp = [2, 4, 5, 7, 2, 4, 10, 15]
action = ["action1", "action2", "action3", "action4",
          "action1", "action2", "action3", "action4"]
df = pd.DataFrame(
    { 
        "index": index,
        "session_uuid": session_uuid,
        "timestamp": timestamp,
        "action": action
    }
)
# the number of `last` actions you want
k = 2
# sort by session and timestamp, then keep the last k rows of each session
last_k = df.sort_values(["session_uuid", "timestamp"]).groupby("session_uuid").tail(k)

# go through each group (one per uuid) and collect its actions;
# DataFrame.append was removed in pandas 2.0, so gather the rows in a dict first
rows = {}
for uuid, group in last_k.groupby("session_uuid"):
    # a Series shorter than k is padded with NaN when the frame is built
    rows[uuid] = pd.Series(group["action"].to_numpy())

# one row per session, with k numbered columns
final_df = pd.DataFrame.from_dict(rows, orient="index").reindex(columns=np.arange(k))
# turn the uuid index into a leading session_uuid column for reference
final_df = final_df.rename_axis("session_uuid").reset_index()
print(final_df)

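This code takes your sample dataset and returns the last two actions of each group (you can adjust k). If a group has fewer than K values, the gaps are filled with NaN.

Example output: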

   session_uuid        0        1
0             1  action3  action4
1             2  action3  action4

Or, if a session has fewer than K actions:

   session_uuid        0        1
0             1  action1      NaN
1             2  action3  action4
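The same reshaping can also be done without an explicit loop. The following is just a sketch of one alternative, not the approach used above: it numbers the rows inside each session with `cumcount` and then pivots on that number. The column name `pos` is introduced here purely for illustration, and the sample data is trimmed so that session 1 has a single action, to show the NaN padding.

```python
import pandas as pd

df = pd.DataFrame({
    "session_uuid": [1, 2, 2, 2, 2],
    "timestamp": [2, 2, 4, 10, 15],
    "action": ["action1", "action1", "action2", "action3", "action4"],
})
k = 2

# last k rows per session, in timestamp order
last_k = (
    df.sort_values(["session_uuid", "timestamp"])
    .groupby("session_uuid")
    .tail(k)
    .copy()
)
# position of each row within its session: 0, 1, ..., at most k - 1
last_k["pos"] = last_k.groupby("session_uuid").cumcount()
# one row per session, one column per position; missing positions become NaN
wide = (
    last_k.pivot(index="session_uuid", columns="pos", values="action")
    .reindex(columns=range(k))
    .rename_axis(columns=None)
    .reset_index()
)
print(wide)
```

The `reindex(columns=range(k))` step keeps the result at exactly k action columns even when no session reaches k actions.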