How to get the last 5 rows per column value in pandas (5 actions before abandonment)
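I'm relatively new to Python and to this forum, so thanks in advance for your help.
I'm trying to get the last 5 actions a customer performed before leaving a website page.
If I have a data sample like this: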
index session_uuid timestamp action
0 1 1 2 action1
1 2 1 4 action2
2 3 1 5 action3
3 4 1 7 action4
4 5 2 2 action1
5 6 2 4 action2
6 7 2 10 action3
7 8 2 15 action4
And the desired result would be:
session_uuid - action-1 - action-2 - action-3 - action-4 - action-5
1 action4 action3 action2 action1
2 action4 action3 action2 action1
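Python 3 is preferred. I tried the df.tail() function, but I'm not sure how to group by each session and then transpose the result into separate columns.
Assuming session_uuid identifies the user, here is an example that returns only the last two actions. Change the 2 to 5 if you like.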
import numpy as np
import pandas as pd
df = pd.DataFrame({'session_uuid': [1, 1, 1, 1, 2, 2, 2, 2],
'timestamp': [2, 4, 5, 7, 2, 4, 10, 15],
'action': ['action1', 'action2', 'action3', 'action4', 'action1', 'action2', 'action3', 'action4']})
print(df)
session_uuid timestamp action
0 1 2 action1
1 1 4 action2
2 1 5 action3
3 1 7 action4
4 2 2 action1
5 2 4 action2
6 2 10 action3
7 2 15 action4
# first sort the values, then groupby users
df = df.sort_values(['session_uuid','timestamp'])
df1 = df.groupby('session_uuid')['action'].apply(lambda x: list(x)[-2:])
print(df1)
session_uuid
1 [action3, action4]
2 [action3, action4]
If you prefer a DataFrame rather than a Series:
df1 = df1.to_frame('action').reset_index()
print(df1)
session_uuid action
0 1 [action3, action4]
1 2 [action3, action4]
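To get one column per action, as in the layout the question asks for, the list column can be skipped altogether. The following is only a sketch on the same sample data: the column names action-1 to action-k and the helper column `rank` are assumptions made for illustration, with action-1 taken to mean the most recent action.

```python
import pandas as pd

df = pd.DataFrame({
    "session_uuid": [1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": [2, 4, 5, 7, 2, 4, 10, 15],
    "action": ["action1", "action2", "action3", "action4"] * 2,
})

k = 5  # number of trailing actions to keep per session

# sort chronologically, then keep the last k rows of every session
last_k = (
    df.sort_values(["session_uuid", "timestamp"])
    .groupby("session_uuid")
    .tail(k)
    .copy()
)

# rank the actions inside each session, counting back from the end (1 = most recent)
last_k["rank"] = last_k.groupby("session_uuid").cumcount(ascending=False) + 1

# one row per session, one column per rank; ranks a session does not have become NaN
wide = (
    last_k.pivot(index="session_uuid", columns="rank", values="action")
    .reindex(columns=range(1, k + 1))
    .add_prefix("action-")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "rank" label from the header

print(wide)
```

Because each session in the sample has only four actions, the action-5 column is filled with NaN.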
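df.tail() only returns the end of the whole dataset. What you're looking for is a bit more involved. Here is some sample code that solves the problem and generalizes to the last K rows: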
import pandas as pd
import numpy as np
# create the dataset example
index = [1, 2, 3, 4, 5, 6, 7, 8]
session_uuid = [1, 1, 1, 1, 2, 2, 2, 2]
timestamp = [2, 4, 5, 7, 2, 4, 10, 15]
action = ["action1", "action2", "action3", "action4",
"action1", "action2", "action3", "action4"]
df = pd.DataFrame(
{
"index": index,
"session_uuid": session_uuid,
"timestamp": timestamp,
"action": action
}
)
# the number of `last` actions you want
k = 2
# sort by timestamp inside each session, then keep the last k rows of each session
last_k = df.sort_values(["session_uuid", "timestamp"]).groupby("session_uuid").tail(k)
# collect the actions of each session into one row, keyed by session_uuid
rows = {uuid: pd.Series(group["action"].to_numpy()) for uuid, group in last_k.groupby("session_uuid")}
# build a DataFrame with k numbered columns; sessions with fewer than k actions are padded with NaN
final_df = pd.DataFrame.from_dict(rows, orient="index").reindex(columns=np.arange(k))
# move the session_uuid index into the first column for reference
final_df = final_df.rename_axis("session_uuid").reset_index()
print(final_df)
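This code takes your sample dataset and returns the last two actions of each group (you can adjust k). If a group has fewer than K values, the missing positions are filled with NaN.
Sample output: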
session_uuid 0 1
0 1 action3 action4
1 2 action3 action4
Or, if a session has fewer than K actions:
session_uuid 0 1
0 1 action1 NaN
1 2 action3 action4
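The difference between df.tail() and a per-session tail can be checked directly. This is a small sketch on the same sample data, assuming the rows are already sorted by timestamp within each session:

```python
import pandas as pd

df = pd.DataFrame({
    "session_uuid": [1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": [2, 4, 5, 7, 2, 4, 10, 15],
    "action": ["action1", "action2", "action3", "action4"] * 2,
})

# DataFrame.tail only looks at the end of the whole frame,
# so both rows it returns belong to session 2
whole_tail = df.tail(2)
print(whole_tail)

# GroupBy.tail keeps the last rows of every session
# and leaves them in their original row order
per_session = df.groupby("session_uuid").tail(2)
print(per_session)
```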