How to create user paths from clickstream data

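I have some clickstream data that I want to run an attribution analysis on, but I need it in a specific format for both converting and non-converting users.

Representative data: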

df <- structure(list(
  User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 2001, 2002, 2002, 2003,
              2003, 2001, 2002, 2002, 2001),
  Session_ID = c("1001", "1002", "1003", "1004", "1005", "1006", "1007",
                 "Not Set", "Not Set", "Not Set", "Not Set", "Not Set",
                 "1008", "1009", "Not Set"),
  Date_time = structure(c(1540103940, 1540104060, 1540104240, 1540318080,
                          1540318680, 1540318859, 1540314360, 1540413060,
                          1540413240, 1540538460, 1540538640, 1540629660,
                          1540755060, 1540755240, 1540803000),
                        class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  Source = c("Facebook", "Facebook", "Facebook", "Google", "Email", "Google",
             "Email", "Referral", "Referral", "Facebook", "Facebook",
             "Google", "Referral", "Direct", "Direct"),
  Conversion = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)),
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -15L),
  spec = structure(list(
    cols = list(
      User_ID = structure(list(), class = c("collector_double", "collector")),
      Session_ID = structure(list(), class = c("collector_character", "collector")),
      Date_time = structure(list(format = ""), class = c("collector_datetime", "collector")),
      Source = structure(list(), class = c("collector_character", "collector")),
      Conversion = structure(list(), class = c("collector_double", "collector"))),
    default = structure(list(), class = c("collector_guess", "collector")),
    skip = 1), class = "col_spec"))

Then set the column classes:

library(dplyr)

df <- df %>% 
  mutate(User_ID    = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time  = as.POSIXct(Date_time))

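I want to get every path of visits that led to a purchase, plus the overall path of visits for users that did not end in a purchase.

The new column Path should look like this for user 2001, for example: Facebook > Facebook > Facebook > Email > Email. I know how to build the string itself with mutate(path = paste0(Source, collapse = " > ")).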
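As a minimal sketch of that paste0() idea, the snippet below collapses each user's sources into one path string per user with group_by() and summarise(). The toy data frame `clicks` and its values are made up for illustration; it is not the reprex above, and it ignores sessions and conversions entirely.

```r
library(dplyr)

# Hypothetical toy data, assumed to be in the order the events occurred
clicks <- tibble(
  User_ID = c(1, 1, 2, 1, 2),
  Source  = c("Facebook", "Email", "Google", "Direct", "Referral")
)

# One row per user, with that user's sources joined in row order
paths <- clicks %>%
  group_by(User_ID) %>%
  summarise(Path = paste0(Source, collapse = " > "))

print(paths)
```

Using summarise() rather than mutate() returns one row per group, so there is no need to filter down to the last row afterwards.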

The complications are:

Each row can be:

For the reprex above, the result would look like this:

# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
    <dbl> <chr>      <dttm>                   <dbl> <chr>                                         
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral         
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct                             
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct     

... where:

Here is one approach using dplyr:

library(dplyr)

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%

  # For each user, split between known and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1;
  #   new path if prior event was a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%

  # Regroup so that each numbered path is handled separately
  group_by(User_ID, known_session, path_num) %>%
  # Label path journey by combining every step in the path
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep last step in each path
  filter(row_number() == n()) %>%
  ungroup() %>%

  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)
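
To see how the path_num counter behaves, here is a small sketch on a made-up sequence of events for a single user (the `events` data frame is hypothetical, not part of the reprex). lag() shifts the conversion flags down one row, so the cumulative sum only increases on the row after a conversion. The converting event therefore stays at the end of its own path.

```r
library(dplyr)

# Hypothetical events for one user, assumed to be already sorted
events <- tibble(
  Source     = c("Google", "Email", "Direct", "Google", "Referral"),
  Conversion = c(0, 1, 0, 0, 1)
)

events <- events %>%
  # The lagged flag is 1 only on the row following a conversion,
  # so the counter increments at the start of the next path
  mutate(path_num = cumsum(dplyr::lag(Conversion, default = 0)) + 1)

# Collapse each numbered path into a single string
paths <- events %>%
  group_by(path_num) %>%
  summarise(Path       = paste0(Source, collapse = " > "),
            Conversion = max(Conversion))

print(events$path_num)
print(paths)
```

Without the lag(), a plain cumsum(Conversion) would start the new path on the converting row itself, which would attach the conversion to the wrong journey.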

My results differ slightly from yours, but I believe they are consistent with the sample data provided:

> df2
# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
  <fct>   <fct>      <dttm>                   <dbl> <chr>                                         
1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral                           
3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct           
5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct