R arules: preparing a dataset for transactions
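I prepared a dataset to read in as transactions with the arules package in R. However, one of my preprocessing steps causes a problem when I call itemFrequencyPlot: the item with the highest frequency is " ". Does anyone have a suggestion for fixing this?
Original data: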
data <- as.data.frame(matrix(NA, nrow = 10, ncol = 3))
colnames(data) <- c("Customer", "OrderDate", "Product")
data$Customer <- c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally")
data$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct")
data$Product <- c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk", "Bread", "Butter", "Eggs", "Wine")
I then transform it as follows:
library(reshape2)
library(dplyr)
newdata <- data %>%
  group_by(Customer, OrderDate) %>%
  mutate(ProductValue = paste0("Product", 1:n())) %>%
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)
newdata[is.na(newdata)] <- " "
newdata <- newdata[ , 3:6]
# convert character columns to factors
newdata[sapply(newdata, is.character)] <- lapply(newdata[sapply(newdata, is.character)], as.factor)
Use write.table to create a CSV file without column names so that arules can read it:
write.table(newdata, "transactions.csv", row.names = FALSE, col.names = FALSE, sep = ",")
Read the CSV file as transactions with the arules package:
library(arules)
transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket")
This does not work; it throws an error. After reading earlier questions on Stack Overflow, I was able to get around it as follows:
transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket", rm.duplicates = TRUE)
itemFrequencyPlot(transactiondata, topN = 5)
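The resulting plot lists " " as the most frequent item, which is not actually the case; it is an artifact of my preprocessing. Any suggestions for solving this would be much appreciated!
I would do it like this (following the example on the transactions manual page):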
data_list <- split(data$Product, paste(data$OrderDate, data$Customer))
trans <- as(data_list, "transactions")
inspect(trans)
    items                    transactionID
[1] {Milk}                   1-Oct John
[2] {Bread,Eggs}             2-Oct John
[3] {Butter,Eggs,Milk}       2-Oct Tom
[4] {Bread,Butter,Eggs,Wine} 3-Oct Sally
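Before plotting, it can help to check the item counts numerically and confirm that no blank item has crept in. The following is a minimal sketch that rebuilds the example data from the question so it runs on its own; the names `counts` and the final blank-label check are my own additions for illustration.

```r
library(arules)

# Rebuild the example data from the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# One basket per (OrderDate, Customer) pair, coerced to transactions
trans <- as(split(data$Product, paste(data$OrderDate, data$Customer)),
            "transactions")

# Absolute count of baskets containing each item, most frequent first
counts <- sort(itemFrequency(trans, type = "absolute"), decreasing = TRUE)
print(counts)

# Sanity check (illustrative): no empty or whitespace-only item labels
stopifnot(!any(trimws(itemLabels(trans)) == ""))
```

With this data, Eggs appears in 3 of the 4 baskets, so it should lead the frequency plot instead of " ".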
itemFrequencyPlot(trans, topN = 5)
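If you do need to go through a CSV file, the blank item can probably be avoided by not padding the baskets at all. The " " in the question comes from replacing NA with " " in the wide table; depending on the arules version, read.transactions may then treat that padding as an item. A sketch of one alternative, assuming a reasonably recent arules where read.transactions accepts `header = TRUE` with column names in `cols`: write one (transaction, item) pair per row and read it with `format = "single"`. The `TransactionID` column and the temporary file path are hypothetical names introduced here.

```r
library(arules)

# Rebuild the example data from the question
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: one ID per basket (order date plus customer)
data$TransactionID <- paste(data$OrderDate, data$Customer)

# Long format: one (transaction, item) pair per row, so no padding is needed
csv_path <- tempfile(fileext = ".csv")
write.csv(data[, c("TransactionID", "Product")], csv_path, row.names = FALSE)

# header = TRUE lets cols refer to the two columns by name
trans_single <- read.transactions(csv_path, format = "single", sep = ",",
                                  header = TRUE,
                                  cols = c("TransactionID", "Product"))
inspect(trans_single)
```

The long format keeps the file rectangular without filler values, which is why no " " item can appear.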
Hope this helps!