Frequent Sequence stats

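I have a dataset of purchase transactions. Below is a dummy dataset for illustration. I want to figure out how to reshape/dcast it to get the most frequent purchase sequences.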

require(data.table)

MainID=c('A1','A1','A2','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016')

df=data.table(MainID,Purchase,Date)
head(df)

   MainID Purchase      Date
1:     A1        A  1/1/2014
2:     A1        B 5/23/2015
3:     A2        C 6/12/2015
4:     C1        A  3/3/2013
5:     C1        A  5/5/2014
6:     C1        D 7/21/2014

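Now I am looking for the different combinations of 2-item sequences (pairs). For the dataset above, the unique sequence pairs are: A leads to B, B leads to C, A leads to D, E leads to B, and finally C leads to E. Note that I am not considering A to A here: I am looking at sequences of different products, not the same product. So in the output I want to ignore all such same-product sequences.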

Desired output:

Pair                  Occurrence         No of customers        % confidence 
A leads to B             1                    3                    1/3
B leads to C             2                    3                    2/3
A leads to D             1                    3                    1/3
E leads to B             1                    3                    1/3
C leads to E             2                    3                    2/3 

I am aware of sequencing algorithms, but here I am only after some basic descriptive analysis.

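If I understand what you want, this may work. Note that I changed A2 to A1 in your data, and I added a date so that Date has length 11. I also created a tibble directly instead of using data.table.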

MainID=c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016', '8/8/2016')
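library(dplyr)

df = tibble(MainID, Purchase, Date)
# Count customers up front: summarise() below drops the MainID column
n_customers <- n_distinct(df$MainID)
df2 <- df %>%
  # Parse the dates so that sorting is chronological, not alphabetical
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(MainID) %>%
  arrange(MainID, Date) %>%
  # Pair each purchase with the same customer's next purchase
  mutate(Next = lead(Purchase, 1),
         Pair = paste(Purchase, "leads to", Next)) %>%
  # Drop each customer's last purchase and same-product pairs such as A -> A
  filter(!is.na(Next), Purchase != Next) %>%
  ungroup() %>%
  group_by(Pair) %>%
  summarise(Occurence = n()) %>%
  mutate(N_consumers = n_customers,
         Percent_confidence = paste0(Occurence, "/", N_consumers))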

df2
# A tibble: 5 × 4
          Pair Occurence N_consumers Percent_confidence
         <chr>     <int>       <int>              <chr>
1 A leads to B         1           3                1/3
2 A leads to D         1           3                1/3
3 B leads to C         2           3                2/3
4 C leads to E         2           3                2/3
5 E leads to B         1           3                1/3
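
Since the question starts from a data.table, the same steps can also be sketched with data.table alone. This is only a minimal sketch under the same assumptions as above (A2 changed to A1, an eleventh date added); the names dt, n_customers, pairs, and the Confidence column are illustrative and not from the original post. It parses the dates, pairs each purchase with the same customer's next purchase using shift(), and counts the distinct-product pairs.

```r
library(data.table)

# Same adjusted data as in the answer above (A2 -> A1, eleventh date added)
dt <- data.table(
  MainID   = c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2'),
  Purchase = c('A','B','C','A','A','D','E','B','C','E','E'),
  Date     = c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014',
               '1/3/2016','4/5/2016','7/7/2016','6/27/2016','8/8/2016')
)

# Parse the dates so that ordering is chronological rather than alphabetical
dt[, Date := as.Date(Date, format = "%m/%d/%Y")]
setorder(dt, MainID, Date)

# Next purchase of the same customer (NA for each customer's last purchase)
dt[, Next := shift(Purchase, type = "lead"), by = MainID]

# Number of distinct customers (hypothetical helper name)
n_customers <- uniqueN(dt$MainID)

# Keep only pairs of different products, then count each pair
pairs <- dt[!is.na(Next) & Purchase != Next,
            .(Occurrence = .N),
            by = .(Pair = paste(Purchase, "leads to", Next))]
pairs[, `:=`(N_customers = n_customers,
             Confidence  = Occurrence / n_customers)]

pairs
```

Here the confidence is kept as a number (Occurrence divided by the number of customers) rather than a string such as "1/3", which makes it easier to sort or filter on; paste0(Occurrence, "/", N_customers) would reproduce the string form.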