"Class variable needs to be a factor" csv 读取数据集错误

Question

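I want to discretise the continuous features in machine-learning datasets, in particular using supervised discretisation. It turns out that R has a package/method for this, which is great! But since I am not proficient in R, I am running into some problems, and I would greatly appreciate any help.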

I get the error

class variable needs to be a factor.

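I looked at an example online, and they do not seem to have this problem, but I do. Note that I do not really understand the syntax V2 ~ ., other than that V2 should be a column name.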

library(caret)
library(Rcpp)
library(arulesCBA)

filename <- "wine.data"
dataset <- read.csv(filename, header=FALSE)
dataset2 <- discretizeDF.supervised(V2 ~ ., dataset, method = "mdlp")

R reports the following error:

Error in .parseformula(formula, data) : class variable needs to be a factor!

You can find the dataset wine.data here: https://pastebin.com/hvDbEtMN. The first argument of discretizeDF.supervised is a formula, and that seems to be where the problem lies.

Please help! Thank you in advance.

Answer 1

As written in the vignette, this function is meant to implement:

several supervised methods to convert continuous variables into a categorical variables (factor) suitable for association rule mining and building associative classifiers.

If you look at column V2, it is continuous:

test = read.csv("wine_dataset.txt",header=FALSE)
str(test)
'data.frame':   178 obs. of  14 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V3 : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
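
On the `V2 ~ .` syntax: in an R formula, the left-hand side of `~` names the class (response) column, and `.` stands for every other column in the data frame. The following is a minimal base-R sketch, using a made-up toy data frame (the values are illustrative, not taken from wine.data), that expands the formula and checks which columns are factors:

```r
# Toy data frame standing in for the wine data (values are made up).
df <- data.frame(V1 = c(1L, 1L, 2L, 3L),
                 V2 = c(14.2, 13.2, 12.4, 13.7),
                 V3 = c(1.71, 1.78, 0.94, 3.26))

# Left of ~ is the class variable; "." expands to every other column.
tt <- terms(V2 ~ ., data = df)
class_var  <- all.vars(tt)[1]           # name of the class variable
predictors <- attr(tt, "term.labels")   # names of the remaining columns

print(class_var)
print(predictors)

# Columns read as int/num are not factors unless converted explicitly.
print(sapply(df, is.factor))
```

Because `V2` is numeric rather than a factor, a call with `V2 ~ .` asks for a numeric class variable, which is what the error message complains about.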

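What you need is a categorical target, so that the algorithm can find suitable cut points for discretising the other columns with respect to it. For example: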

# this cuts V2 into 4 equal-width categories according to where the values fall in its range
test$V2 = factor(cut(test$V2,4,labels=1:4))
# pass the modified data frame (test), not the original dataset, so V2 is a factor
dataset2 <- discretizeDF.supervised(V2 ~ ., test, method = "mdlp")
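
For a feel of what `cut` does here, and one possible alternative way of choosing the bins, here is a small base-R sketch on made-up numbers. `cut(x, 4)` produces equal-width bins; the quantile-based variant is only a suggestion for cases where equal-width bins end up very unbalanced, not something the package requires:

```r
# Made-up values standing in for a continuous column such as V2.
x <- c(11.0, 11.8, 12.4, 12.9, 13.3, 13.7, 14.1, 14.8)

# Equal-width bins: the range of x is split into 4 intervals of equal length.
by_width <- cut(x, 4, labels = 1:4)

# Equal-frequency bins: break points are the quartiles of x.
# include.lowest = TRUE keeps the minimum value inside the first bin.
q <- quantile(x, probs = seq(0, 1, by = 0.25))
by_quantile <- cut(x, breaks = q, labels = 1:4, include.lowest = TRUE)

# Both results are factors, which is what the class variable has to be.
print(is.factor(by_width))
print(table(by_width))
print(table(by_quantile))
```

Whichever binning is used, the result should be assigned back to the class column before calling `discretizeDF.supervised`.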

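The above is one way around the error, but you need to figure out a sensible way to cut V2. If you need to keep the target continuous, you can use discretizeDF from arules instead. I also see that your first column contains only the values 1, 2 and 3: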

test = read.csv("wine_dataset.txt",header=FALSE)
test2 = data.frame(test[,1:2],discretizeDF(test[,-c(1:2)]))
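
If that first column is in fact the class label (an assumption, based only on it taking the values 1, 2 and 3), the more natural fix may be to convert it to a factor and put it on the left-hand side of the formula. The sketch below mimics the situation with the built-in `iris` data, storing the class as integer codes the way the wine file does, so that it runs without the file:

```r
library(arulesCBA)

# Mimic the wine file: class label stored as integer codes 1, 2, 3.
d <- iris
d$Species <- as.integer(d$Species)

# With an integer class column the call is expected to fail with the
# "class variable needs to be a factor" error; try() keeps the script going.
res <- try(discretizeDF.supervised(Species ~ ., d, method = "mdlp"),
           silent = TRUE)
print(inherits(res, "try-error"))

# Converting the class column to a factor first lets the call go through.
d$Species <- factor(d$Species)
d2 <- discretizeDF.supervised(Species ~ ., d, method = "mdlp")

# The class column is kept and the predictors are now interval factors.
print(sapply(d2, is.factor))
```

For the wine file, the analogous lines would be `dataset$V1 <- factor(dataset$V1)` followed by `discretizeDF.supervised(V1 ~ ., dataset, method = "mdlp")`.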

"Class variable needs to be a factor" csv 读取数据集错误

"Class variable needs to be a factor" error for csv-read datasets

syntax

r

discretization