Frequent Sequence stats
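I have a dataset of purchase transactions. Below is a dummy dataset for illustration. I want to figure out how to reshape/dcast it to get the most frequent purchase sequences.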
require(data.table)
MainID=c('A1','A1','A2','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016')
df=data.table(MainID,Purchase,Date)
head(df)
   MainID Purchase      Date
1:     A1        A  1/1/2014
2:     A1        B 5/23/2015
3:     A2        C 6/12/2015
4:     C1        A  3/3/2013
5:     C1        A  5/5/2014
6:     C1        D 7/21/2014
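What I am looking for is every combination of two-step sequences (pairs of consecutive purchases). For the dataset above, the unique sequence pairs are: A leads to B, B leads to C, A leads to D, E leads to B, and finally C leads to E.
Note that I am not considering A to A here. I am looking at sequences of different products, not the same product, so the output should ignore all same-product sequences.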
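To make the pairing rule concrete, here is a minimal base R sketch for a single customer. The `purchases` vector is a hypothetical stand-in that mirrors customer C1 in the dummy data (A, A, D), and it assumes the purchases are already in chronological order.

```r
# Purchases of one customer, assumed to be sorted by date already
# (mirrors customer C1 in the dummy data: A, A, D).
purchases <- c("A", "A", "D")

# Pair each purchase with the one that follows it.
current   <- head(purchases, -1)  # every purchase except the last
following <- tail(purchases, -1)  # every purchase except the first

# Keep only transitions between different products,
# so "A leads to A" is dropped.
keep  <- current != following
pairs <- paste(current[keep], "leads to", following[keep])
print(pairs)
```

For this customer only "A leads to D" survives, which is why A to A never shows up in the desired output.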
Desired output:
Pair           Occurrence   No of customers   % confidence
A leads to B   1            3                 1/3
B leads to C   2            3                 2/3
A leads to D   1            3                 1/3
E leads to B   1            3                 1/3
C leads to E   2            3                 2/3
I am aware of sequencing algorithms, but here I am only looking for some basic descriptive analysis.
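If I understand what you want, this may work. Note that I changed A2 to A1 in your data, and I added a date so that Date has length 11. I also created a tibble directly instead of using data.table.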
library(dplyr)

MainID <- c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase <- c('A','B','C','A','A','D','E','B','C','E','E')
Date <- c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016','8/8/2016')
df <- tibble(MainID, Purchase, Date)

# Count customers up front; MainID is no longer a column after summarise()
n_customers <- n_distinct(df$MainID)

df2 <- df %>%
  # Parse the dates so that sorting is chronological rather than alphabetical
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(MainID, Date) %>%
  group_by(MainID) %>%
  mutate(Next = lead(Purchase, 1),
         Pair = paste(Purchase, "leads to", Next)) %>%
  ungroup() %>%
  # Drop each customer's last purchase and same-product pairs such as A to A
  filter(!is.na(Next), Purchase != Next) %>%
  group_by(Pair) %>%
  summarise(Occurrence = n()) %>%
  mutate(N_consumers = n_customers,
         Percent_confidence = paste0(Occurrence, "/", N_consumers))
df2
# A tibble: 5 × 4
  Pair         Occurrence N_consumers Percent_confidence
  <chr>             <int>       <int> <chr>
1 A leads to B          1           3 1/3
2 A leads to D          1           3 1/3
3 B leads to C          2           3 2/3
4 C leads to E          2           3 2/3
5 E leads to B          1           3 1/3
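Since the question started from data.table, the same idea can be sketched there as well. This is only one possible translation, not the asker's or answerer's code: it reuses the modified data from the answer (A2 changed to A1, an eleventh date added), assumes the dates are in month/day/year format, and the names `dt`, `pairs`, `n_customers`, `N_customers` and `Confidence` are made up for illustration.

```r
library(data.table)

# Same illustrative data as in the answer (A2 changed to A1, 11 dates).
dt <- data.table(
  MainID   = c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2'),
  Purchase = c('A','B','C','A','A','D','E','B','C','E','E'),
  Date     = c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014',
               '1/3/2016','4/5/2016','7/7/2016','6/27/2016','8/8/2016')
)

# Parse the dates (assumed month/day/year) so ordering is chronological.
dt[, Date := as.Date(Date, format = "%m/%d/%Y")]
setorder(dt, MainID, Date)

# Next purchase within each customer; NA for the customer's last purchase.
dt[, Next := shift(Purchase, type = "lead"), by = MainID]

# Number of distinct customers in the whole dataset.
n_customers <- uniqueN(dt$MainID)

# Keep transitions between different products, then count each pair.
pairs <- dt[!is.na(Next) & Purchase != Next,
            .(Occurrence = .N),
            keyby = .(Pair = paste(Purchase, "leads to", Next))]
pairs[, `:=`(N_customers = n_customers,
             Confidence  = Occurrence / n_customers)]
print(pairs)
```

On this data it should give the same five pairs and counts as the tibble above, with the confidence stored as a number rather than as a "1/3" string. Note that `Occurrence` counts transitions, not customers, so a customer who repeats the same pair would be counted more than once.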