Prediction using rpart on new factor (categorical) variables
I am practicing machine learning in R, using the rpart method for training. The data is the Adult data set from UCI. The link is:
http://archive.ics.uci.edu/ml/datasets/Adult
#Get the data
adultData <- read.table("adult.data", header = FALSE, sep = ",")
adultName <- read.csv("adult.name", header = TRUE, sep = ",", stringsAsFactors = FALSE)
names(adultData) <- names(adultName)
To keep the exercise simple, I select only a few attributes and reduce the data set to 20% of its rows
selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
adultData <- subset(adultData, select = selected)
trainIndex = createDataPartition(adultData$salary, p=0.20, list=FALSE)
training = adultData[ trainIndex, ]
Fitting the model with "rpart" takes about a minute ("gbm" or "rf" take even longer)
set.seed(33833)
modFit <- train(salary ~ ., method = "rpart", data=training)
The problem appears when I predict on new data values. I create a new data frame
a <- data.frame(age = 40, education = "Bachelors", marital.status = "Divorced", relationship = "Wife", sex = "Female", hours.per.week = 40)
predict(modFit, newdata = a)
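It returns an error: "education has a new level".
I know the problem lies in the categorical (factor) variables. Somehow the model does not recognize "Bachelors" as a level it already has, and treats it as a new string (a new level) instead.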
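The problem stems from improper data cleaning.
After downloading the data, I found a common problem with factors in R: the labels carry an extra space. When you refer to a label (for example "Bachelors" in your example), it is not recognized, because the actual level has a leading space:
" Bachelors"
You can see this by inspecting the levels of the factor: levels(adultData$education)
You can remove the white space in the read call by setting the strip.white argument to TRUE.
If you load the data set the standard way, you can see that the factor labels have the extra space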
# Not Run
# adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE)
# levels(adultData$education)
# [1] " 10th" " 11th" " 12th" " 1st-4th"
# [5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
# [9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
# [13] " Masters" " Preschool" " Prof-school" " Some-college"
If you load the data set with strip.white = TRUE, you can see that the factor labels no longer have the extra space
# Not Run
# adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE, strip.white = TRUE)
# levels(adultData$education)
# [1] "10th" "11th" "12th" "1st-4th" "5th-6th"
# [6] "7th-8th" "9th" "Assoc-acdm" "Assoc-voc" "Bachelors"
# [11] "Doctorate" "HS-grad" "Masters" "Preschool" "Prof-school"
# [16] "Some-college"
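The effect of strip.white can be reproduced without downloading anything. What follows is only a minimal sketch: the three inline rows and the column names are made up for illustration, laid out with a comma followed by a space as in adult.data. stringsAsFactors = TRUE is set explicitly because R 4.0 and later no longer convert strings to factors by default.

```r
# Three made-up rows in the "comma followed by a space" layout of adult.data
raw <- "39, Bachelors, <=50K\n50, HS-grad, <=50K\n52, Masters, >50K"

# Default read: the space after each comma becomes part of the label
dirty <- read.csv(text = raw, header = FALSE, stringsAsFactors = TRUE,
                  col.names = c("age", "education", "salary"))
levels(dirty$education)
"Bachelors" %in% levels(dirty$education)   # FALSE: the real level is " Bachelors"

# Same read with strip.white = TRUE: leading and trailing blanks are dropped
clean <- read.csv(text = raw, header = FALSE, stringsAsFactors = TRUE,
                  strip.white = TRUE,
                  col.names = c("age", "education", "salary"))
levels(clean$education)
"Bachelors" %in% levels(clean$education)   # TRUE
```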
I reproduced the example by loading a clean data set whose columns I had renamed
# Not Run
# adultData <- read.csv2("AdultDataRenamed.csv", header = TRUE, strip.white = TRUE)
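The data set is too wide to post here; it can easily be reproduced from the link above by following its instructions. My clean data set can be downloaded from here: http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r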
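If the data is already in memory with the padded labels, re-reading the file is not the only option. A possible alternative, sketched here on a small made-up data frame standing in for adultData: trimws() from base R strips the blanks, and each factor is rebuilt so that its levels are clean.

```r
# Made-up stand-in for a data frame that was read without strip.white = TRUE
df <- data.frame(
  age       = c(39, 50, 52),
  education = factor(c(" Bachelors", " HS-grad", " Masters")),
  sex       = factor(c(" Male", " Female", " Male"))
)

# Rebuild every factor column from its trimmed labels; leave other columns alone
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], function(x) factor(trimws(as.character(x))))

levels(df$education)
levels(df$sex)
```

This has to be done before the model is trained, so that the levels stored in the model match the values passed to predict later.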
Always look at the data
dim(adultData)
head(adultData)
str(adultData)
Load the libraries you need
library(rpart)
library(caret)
I selected the same attributes as you did, and I reduced the data set to just 40% (which is acceptable for training)
selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
adultData <- subset(adultData, select = selected)
trainIndex = createDataPartition(adultData$salary, p=0.40, list=FALSE)
training = adultData[ trainIndex, ]
I also added a test set
test = adultData[ -trainIndex, ]
Model fitting
set.seed(33833)
modFit <- train(salary ~ ., method = "rpart", data=training)
Overall accuracy
prediction <- predict(modFit, newdata=test)
tab <- table(prediction, test$salary)
sum(diag(tab))/sum(tab)
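The accuracy line works because the diagonal of the table counts the cases where prediction and truth agree. A small worked sketch with made-up labels (these are not results from the Adult data):

```r
# Made-up predictions and true labels, both built with the same two levels
lv        <- c("<=50K", ">50K")
truth     <- factor(c("<=50K", "<=50K", ">50K", ">50K",
                      "<=50K", ">50K", "<=50K", "<=50K"), levels = lv)
predicted <- factor(c("<=50K", "<=50K", ">50K", "<=50K",
                      "<=50K", ">50K", ">50K", "<=50K"), levels = lv)

tab <- table(predicted, truth)
tab

# Correct predictions sit on the diagonal: 6 of the 8 cases here
accuracy <- sum(diag(tab)) / sum(tab)
accuracy

# The same number, computed directly from the two vectors
mean(predicted == truth)
```

Giving both factors the same levels keeps the table square, which is what diag() relies on.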
A better evaluation using the caret package
rpartPred<-predict(modFit,test)
confusionMatrix(rpartPred,test$salary)
Plot the model (not very clear)
library(rattle)
fancyRpartPlot(modFit$finalModel)
An alternative
library(partykit)
finalModel <-as.party(modFit$finalModel)
plot(finalModel)
Prediction on the new data values you specified
a <- data.frame(age = 40, education = "Bachelors", marital.status = "Divorced", relationship = "Wife", sex = "Female", hours.per.week = 40)
predict(modFit, newdata = a)
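Before calling predict, it can help to check that every categorical value in the new data frame is a level the training data actually contains. A minimal sketch: check_new_levels is a hypothetical helper written for this example (it is not part of rpart or caret), and the training frame is a made-up stand-in that keeps the padded labels from the unstripped read.

```r
# Hypothetical helper: list values in `newdata` that are not levels in `training`
check_new_levels <- function(training, newdata) {
  fac_cols <- names(training)[vapply(training, is.factor, logical(1))]
  fac_cols <- intersect(fac_cols, names(newdata))
  unseen <- lapply(fac_cols, function(col) {
    setdiff(unique(as.character(newdata[[col]])), levels(training[[col]]))
  })
  names(unseen) <- fac_cols
  unseen[lengths(unseen) > 0]   # keep only the columns with unseen values
}

# Made-up training data with the padded labels from the unstripped read
training <- data.frame(
  age       = c(39, 50),
  education = factor(c(" Bachelors", " HS-grad")),
  sex       = factor(c(" Male", " Female"))
)
a <- data.frame(age = 40, education = "Bachelors", sex = "Female")

problems <- check_new_levels(training, a)
problems   # lists "Bachelors" and "Female" as values the training data lacks
```

An empty list means every factor value in the new data matches a training level.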