How to create user paths from clickstream data
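I have some clickstream data that I want to run an attribution analysis on, but I need it in a specific format for both converting and non-converting users.
Representative data: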
df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002,
2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set",
"Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"),
Date_time = structure(c(1540103940, 1540104060, 1540104240,
1540318080, 1540318680, 1540318859, 1540314360, 1540413060,
1540413240, 1540538460, 1540538640, 1540629660, 1540755060,
1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Source = c("Facebook", "Facebook", "Facebook", "Google",
"Email", "Google", "Email", "Referral", "Referral", "Facebook",
"Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
cols = list(User_ID = structure(list(), class = c("collector_double",
"collector")), Session_ID = structure(list(), class = c("collector_character",
"collector")), Date_time = structure(list(format = ""), class = c("collector_datetime",
"collector")), Source = structure(list(), class = c("collector_character",
"collector")), Conversion = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Then set the classes:
library(dplyr)

df <- df %>%
  mutate(User_ID = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time = as.POSIXct(Date_time)
  )
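I want to get every user's path to purchase or, for journeys that did not lead to a purchase, the user's total path.
A new column path would be formatted like this: Facebook > Facebook > Facebook > Email > Email for user 2001. I know how to do the concatenation with
mutate(path = paste0(Source, collapse = " > "))
The complications are: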
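- Most session IDs are "Not Set", which means they are missing
- Some users may convert more than once
- Some users may convert, then return without converting again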
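Each row would be either:
- A conversion by user ID. Most converting users convert only once, but some may convert several times, in which case there would be one row per conversion. The path column would reflect the path up to the conversion; for a user's second or subsequent conversion, it would show only the path since the previous conversion.
- Or an unconverted user journey, with its total path in the format above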
For the reprex above, the result would look like this:
# A tibble: 5 x 5
User_ID Session_ID Date_time Conversion Path
<dbl> <chr> <dttm> <dbl> <chr>
1 2001 1007 2018-10-23 17:06:00 1 Facebook > Facebook > Facebook > Email > Email
2 2002 Not Set 2018-10-24 20:34:00 1 Google > Google > Referral > Referral
3 2003 Not Set 2018-10-26 07:24:00 0 Facebook > Facebook
4 2002 1009 2018-10-28 19:34:00 0 Referral > Direct
5 2001 Not Set 2018-10-29 08:50:00 1 Google > Direct
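... where:
- User 2001 converted twice, with each path shown separately;
- User 2002 converted, then came back but did not convert, so the converting and non-converting paths are shown as separate rows;
- User 2003 never converted, so that path is shown.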
Here is one approach using dplyr:
library(dplyr)

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%
  # For each user, split between known and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1;
  #   new path if prior event was a conversion
  #   (computed for reference; not used for grouping below)
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
  # Label the journey by joining every Source in the group
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep the last step in each group
  filter(row_number() == n()) %>%
  ungroup() %>%
  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)
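The path_num step in the pipeline above may be easier to follow in isolation. The following is a minimal sketch on a hypothetical vector of conversion flags (the `conversion` vector is invented for illustration and is not part of the sample data); it only shows how `cumsum(lag(...)) + 1` numbers the paths.

```r
library(dplyr)

# Hypothetical conversion flags for one user's events, in order
conversion <- c(0, 0, 1, 0, 1, 0)

# lag() shifts the flags by one position, so a conversion increments the
# counter on the *next* event; cumsum() accumulates those increments
path_num <- cumsum(lag(conversion, default = 0)) + 1

print(path_num)
# [1] 1 1 1 2 2 3
```

The converting event itself stays in the path it closes, and the event after it opens the next path.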
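The result in df2 differs slightly from the expected output in the question, but I believe it is consistent with the sample data provided: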
> df2
# A tibble: 5 x 5
User_ID Session_ID Date_time Conversion Path
<fct> <fct> <dttm> <dbl> <chr>
1 2001 1007 2018-10-23 17:06:00 1 Facebook > Facebook > Facebook > Email > Email
2 2002 Not Set 2018-10-24 20:34:00 1 Referral > Referral
3 2003 Not Set 2018-10-26 07:24:00 0 Facebook > Facebook
4 2002 1009 2018-10-28 19:34:00 0 Google > Google > Referral > Direct
5 2001 Not Set 2018-10-29 08:50:00 1 Google > Direct
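As a possible variation, and not the method used above, the paths could be split per conversion rather than per known or unknown session by using path_num as a grouping variable. This sketch assumes the timestamps are reliable enough to order each user's events; the name `paths_by_conversion` is made up for illustration, and the data frame is a condensed copy of the sample data from the question.

```r
library(dplyr)

# Sample data condensed from the question (same values, plain data frame)
df <- data.frame(
  User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 2001, 2002,
              2002, 2003, 2003, 2001, 2002, 2002, 2001),
  Session_ID = c("1001", "1002", "1003", "1004", "1005", "1006", "1007",
                 "Not Set", "Not Set", "Not Set", "Not Set", "Not Set",
                 "1008", "1009", "Not Set"),
  Date_time = as.POSIXct(
    c(1540103940, 1540104060, 1540104240, 1540318080, 1540318680,
      1540318859, 1540314360, 1540413060, 1540413240, 1540538460,
      1540538640, 1540629660, 1540755060, 1540755240, 1540803000),
    origin = "1970-01-01", tz = "UTC"),
  Source = c("Facebook", "Facebook", "Facebook", "Google", "Email",
             "Google", "Email", "Referral", "Referral", "Facebook",
             "Facebook", "Google", "Referral", "Direct", "Direct"),
  Conversion = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

paths_by_conversion <- df %>%
  # Order each user's events chronologically
  # (assumption: Date_time is trustworthy for ordering)
  arrange(User_ID, Date_time) %>%
  group_by(User_ID) %>%
  # A new path starts on the event right after a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
  # Collapse to one row per user and path, keeping the last event's details
  group_by(User_ID, path_num) %>%
  summarise(
    Session_ID = last(Session_ID),
    Date_time  = last(Date_time),
    Conversion = last(Conversion),
    Path       = paste(Source, collapse = " > "),
    .groups    = "drop"
  ) %>%
  select(-path_num) %>%
  arrange(Date_time)

print(paths_by_conversion)
```

On the sample data this yields five rows. The rows for users 2002 and 2003 match the expected output in the question. For user 2001 the split differs (Facebook > Facebook > Facebook > Email, then Email > Google > Direct), because session 1007 carries an earlier timestamp than session 1005. If session order is more trustworthy than the timestamps, sorting by Session_ID first, as in the pipeline above, may be the better choice.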