Data prep for association rules in R: data frame to transactions
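My data comes from a SQL database and is in tabular form, with multiple rows per transaction. Rather than using only the "product" field, I would like to use all the other columns in the data frame as well.
My data looks like this: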
transID <- c('1','1','2','3')
state <- c('TX','TX','CA','MA')
product <- c('Oranges','Banana','Fish','Cheese')
Month <- c('January','January','Febuary','March')
Place <- c('A','A','B','C')
transactions <- data.frame(transID,state,product,Month,Place)
transactions
transID state product Month Place
1 1 TX Oranges January A
2 1 TX Banana January A
3 2 CA Fish Febuary B
4 3 MA Cheese March C
Ideally, my data would look like this:
1 (TX,Oranges,Banana,January,A)
2 (CA,Fish,Febuary,B)
3 (MA, Cheese, March,C)
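What is the best way to convert this kind of data into transaction format?
I tried the following, but all I get is rows 1 and 2 joined together as one transaction: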
library(plyr)  # ddply() comes from the plyr package
transactionData <- ddply(transactions, c("transID"),
                         function(df1) paste(df1$state,
                                             df1$product,
                                             df1$Month,
                                             df1$Place,
                                             collapse = ","))
How about reshaping it like this?
reshape(transactions,v.names = "product",timevar = "product",idvar = "state", direction = "wide")
transID state Month Place product.Oranges product.Banana product.Fish product.Cheese
1 1 TX January A Oranges Banana <NA> <NA>
3 2 CA Febuary B <NA> <NA> Fish <NA>
4 3 MA March C <NA> <NA> <NA> Cheese
Here is a base R solution:
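# Note: passing a data frame as X to tapply() needs a recent R (4.3.0 or later, I believe)
stack(tapply(transactions[, -1],
             transactions[, 1, drop = FALSE],
             FUN = function(DF) {
               # use.names belongs to unlist(), not unique()
               paste(unique(unlist(DF, use.names = FALSE)), collapse = ',')
             }))[, 2:1]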
# ind values
#1 1 TX,Oranges,Banana,January,A
#2 2 CA,Fish,Febuary,B
#3 3 MA,Cheese,March,C
The main part is the tapply() call, which splits by transID, then unlists the rest of the data.frame and keeps only the unique values. This is the output of the tapply() call:
1 2 3
"TX,Oranges,Banana,January,A" "CA,Fish,Febuary,B" "MA,Cheese,March,C"
stack() and [, 2:1] are purely cosmetic; they just turn the result into a tidy data.frame with the columns in a sensible order.
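The same split-then-deduplicate idea can also be written with split() and lapply(), which may be easier to follow step by step. This is only a sketch of the approach described above, rebuilt on the question's sample data; the names groups, items and result are made up for illustration.

```r
# Sample data from the question (the "Febuary" spelling is kept as posted)
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# One small data frame per transID, holding every column except the ID
groups <- split(transactions[, -1], transactions$transID)

# Flatten each group column by column and drop repeated values
items <- lapply(groups, function(g) unique(unlist(g, use.names = FALSE)))

# Collapse each item vector into a single comma-separated string
result <- data.frame(
  transID = names(items),
  items   = vapply(items, paste, character(1), collapse = ","),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Keeping the intermediate items list around can be handy, since a named list of item vectors is also the shape that gets coerced to transactions further down.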
This is a bit awkward to do because data.frames store factors.
library("arules")
# make all columns into items
df <- data.frame(
id = transactions$transID,
items = factor(c(as.character(transactions$state),
as.character(transactions$product),
as.character(transactions$Month),
as.character(transactions$Place))))
# remove duplicated state, month and place entries
df <- df[!duplicated(df),]
# this is from the manual page '? transactions'
trans <- as(split(df[,"items"], df[,"id"]), "transactions")
inspect(trans)
items transactionID
[1] {A,Banana,January,Oranges,TX} 1
[2] {B,CA,Febuary,Fish} 2
[3] {C,Cheese,MA,March} 3
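One thing to watch with this layout: once every column is pooled into one item set, an item such as A no longer says which column it came from. A possible variation, sketched below in base R, is to prefix each value with its column name before splitting by ID. This is an assumption about what may be useful, not part of the answer above; cols, long and item_list are illustrative names.

```r
# Sample data from the question (the "Febuary" spelling is kept as posted)
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

cols <- c("state", "product", "Month", "Place")

# Build a long (id, item) table with items labelled "column=value"
long <- do.call(rbind, lapply(cols, function(col) {
  data.frame(id   = transactions$transID,
             item = paste(col, transactions[[col]], sep = "="),
             stringsAsFactors = FALSE)
}))

# Drop the repeated state/month/place rows within a transaction
long <- unique(long)

# Named list with one character vector of items per transaction ID
item_list <- split(long$item, long$id)

# With arules loaded, this list should coerce the same way as above:
# trans <- as(item_list, "transactions")
```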
Hope this helps.