Replacing the source in click-stream data
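I have click-stream data for an e-commerce site. Some customers can choose to buy a product using a loan/finance option. Unfortunately, this creates a new referral source, labelled 'finance' in the reprex below. It also creates one or more new sessions.
I want to replace the source 'finance' with the source of the same user's previous session.
In the example, all observations for sessions 4-6871.2 and 4-6871.3 would have the source '(direct)', as in session 4-6871.1, while 3-6871.1 would have 'google', the source of session 3-6871.0.
I need to do this on a much larger dataset, so I need logic that finds the sessions whose source is 'finance' and replaces each instance of 'finance' with the immediately preceding source from that user's previous session.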
Reprex data from dput:
structure(list(userId = c("6.154032", "6.154032", "6.154032",
"6.154032", "6.154032", "6.154032", "6.154032", "6.154032", "6.154032",
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036",
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036",
"8.154036", "8.154036"), session_Id = c("4-6871.0", "4-6871.0",
"4-6871.0", "4-6871.1", "4-6871.1", "4-6871.1", "4-6871.2", "4-6871.2",
"4-6871.3", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0",
"3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1",
"3-6871.1", "3-6871.1", "3-6871.1"), timeStamp = structure(c(1540294773,
1540294828, 1540294841, 1540307321, 1540307341, 1540307718, 1540308709,
1540308749, 1540311289, 1540330293, 1540330309, 1540330475, 1540330541,
1540330663, 1540331041, 1540331164, 1540331168, 1540331312, 1540331459,
1540331465, 1540331579, 1540331603, 1540331630), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), source = c("(direct)", "(direct)",
"(direct)", "(direct)", "(direct)", "(direct)", "finance", "finance",
"finance", "google", "google", "google", "google", "google",
"finance", "finance", "finance", "finance", "finance", "finance",
"finance", "finance", "finance")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -23L))
Perhaps something in your full data structure makes this solution invalid, but here is a candidate:
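library(dplyr)  # provides arrange() and the vector-shifting lag()

# Order rows by user, then chronologically within each user
df <- arrange(df, userId, timeStamp)
# Collapse consecutive identical sources into runs
tmp <- rle(df$source)
# Give each "finance" run the value of the run immediately before it
tmp$values[tmp$values == "finance"] <- lag(tmp$values)[tmp$values == "finance"]
# Expand the runs back to one value per row
df$source <- inverse.rle(tmp)
table(df$source)
# (direct)   google
#        9       14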
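The arrange() call makes sure the rows are in the right order. Then, assuming that no user's first source is "finance", the rle() steps replace every run of "finance" entries with the source that immediately precedes it. Without that assumption, a user whose first rows are "finance" would inherit the last source of the previous user in the sorted data (or NA for the very first user).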
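If that assumption might not hold in the larger dataset, one possible variation is to run the same rle() logic separately for each user, so that a leading "finance" run becomes NA instead of borrowing another user's source. The following is only a sketch: the helper name replace_with_previous_run and the toy data frame are made up for illustration and are not part of the original question or answer.

```r
library(dplyr)

# Hypothetical helper: replace each run of `target` with the value of the
# run just before it; a leading run has no predecessor and becomes NA.
replace_with_previous_run <- function(x, target = "finance") {
  runs <- rle(x)
  # Guard against NA sources so the logical index never contains NA
  is_target <- !is.na(runs$values) & runs$values == target
  runs$values[is_target] <- dplyr::lag(runs$values)[is_target]
  inverse.rle(runs)
}

# Toy data (invented for this example): user "b" starts with "finance"
toy <- data.frame(
  userId    = c("a", "a", "a", "b", "b", "b"),
  timeStamp = as.POSIXct(c(1, 2, 3, 1, 2, 3), origin = "1970-01-01", tz = "UTC"),
  source    = c("google", "finance", "finance", "finance", "(direct)", "finance"),
  stringsAsFactors = FALSE
)

fixed <- toy %>%
  arrange(userId, timeStamp) %>%   # chronological order within each user
  group_by(userId) %>%             # runs never cross a user boundary
  mutate(source = replace_with_previous_run(source)) %>%
  ungroup()

fixed$source
# "google" "google" "google" NA "(direct)" "(direct)"
```

The NA left for user "b" marks a "finance" session with no earlier session to inherit from, which may be easier to inspect afterwards than a silently borrowed value.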