Does randomForest [R] reject a logical variable as the response, but accept it as a predictor?
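Hi, I'm using randomForest in R, and it doesn't accept a logical variable as the response (Y), but it does seem to accept one as a predictor (X). I'm a bit surprised, because I thought a logical is essentially a 2-class factor...
My question is: is it true that randomForest accepts a logical as a predictor, but not as the response? Why is that?
Do other common models (glmnet, svm, ...) accept logical variables?
Any explanation/discussion is appreciated. Thanks.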
library(randomForest)

N = 100
# sex is built as a factor explicitly, since data.frame() no longer converts
# strings to factors by default in R >= 4.0
data1 = data.frame(age = sample(1:80, N, replace = TRUE),
                   sex = factor(sample(c('M', 'F'), N, replace = TRUE)),
                   veteran = sample(c(TRUE, FALSE), N, replace = TRUE),
                   exercise = sample(c(TRUE, FALSE), N, replace = TRUE))
sapply(data1, class)
#       age       sex   veteran  exercise
# "integer"  "factor" "logical" "logical"

# this does not give a classifier: exercise is logical, so regression is run
rf = randomForest(exercise ~ ., data = data1, importance = TRUE)
# Warning message:
# In randomForest.default(m, y, ...) :
#   The response has five or fewer unique values.  Are you sure you want to do regression?

# this works, and veteran and exercise (logical) work as predictors
rf = randomForest(sex ~ ., data = data1, importance = TRUE)
importance(rf)
#                   F         M MeanDecreaseAccuracy MeanDecreaseGini
# age      -2.0214486 -7.584637            -6.242150         6.956147
# veteran   4.6509542  3.168551             4.605862         1.846428
# exercise -0.1205806 -6.226174            -3.924871         1.013030

# convert it to factor and it works (the column itself is converted, so that
# `.` does not also pick up exercise as a predictor of as.factor(exercise))
data1$exercise = as.factor(data1$exercise)
rf = randomForest(exercise ~ ., data = data1, importance = TRUE)
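The reason for this behavior is that randomForest can also do regression (in addition to classification). You can see this in the warning message you got: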
The response has five or fewer unique values. Are you sure you want to do regression?
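The function decides between regression and classification based on the type of the response vector it is given. If the response is a factor, classification is done; otherwise regression is done (which makes sense, because a regression response vector would never be a factor/categorical variable). A logical vector is not a factor, so it falls into the regression branch.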
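One way to check which branch a given response ended up in is to look at the `type` component of the fitted object. The following is a minimal sketch, assuming the randomForest package is installed; the data frame `d` and the object names `rf_reg` and `rf_cls` are made up for illustration.

```r
# Sketch only: the data are simulated and the object names are hypothetical.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 100
  d <- data.frame(
    age      = sample(1:80, n, replace = TRUE),
    veteran  = sample(c(TRUE, FALSE), n, replace = TRUE),
    exercise = sample(c(TRUE, FALSE), n, replace = TRUE)
  )

  # Logical response: not a factor, so a regression forest is grown
  # (the "five or fewer unique values" warning is silenced here)
  rf_reg <- suppressWarnings(randomForest::randomForest(exercise ~ ., data = d))

  # Factor response: a classification forest is grown
  d$exercise <- factor(d$exercise)
  rf_cls <- randomForest::randomForest(exercise ~ ., data = d)

  print(c(logical = rf_reg$type, factor = rf_cls$type))
}
```

The response column is converted in place rather than wrapped in `as.factor()` inside the formula, so that `.` does not also include the original column among the predictors.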
Regarding your question: using logical variables in your input dataset (the predictors) is no problem; randomForest handles them just as you would expect.
library(randomForest)
training_data <- data.frame(x = rep(c(TRUE, FALSE), times = 1000)) # training data with a logical predictor
response <- as.factor(rep(c(FALSE, TRUE), times = 1000)) # exact inverse of the predictor
randomForest(response ~ ., data = training_data) # returns 100% accurate classifier
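A plausible reason logical predictors cause no trouble is that a data frame of predictors can be flattened to a numeric matrix, where TRUE/FALSE simply become 1/0. The sketch below uses base R's `data.matrix` to illustrate that kind of coercion; that randomForest does exactly this internally is an assumption here, not something stated in the answer, and the small data frame `x` is made up.

```r
# Hypothetical predictors: one integer column, one logical column
x <- data.frame(
  age     = c(23L, 51L, 37L),
  veteran = c(TRUE, FALSE, TRUE)
)

# data.matrix() converts logical (and factor) columns to integers
m <- data.matrix(x)
print(m)
print(storage.mode(m))
```

After this step a logical column is indistinguishable from a 0/1 numeric column, so a tree can split on it like on any other number.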
Edit:
Why don't they include this coercion (logical to factor) in the source code?
This is speculation, but probably to keep things consistent and simple. They would have to change the documentation from
If a factor, classification is assumed, otherwise regression is
assumed
to
If a factor or a logical vector, classification is assumed, otherwise regression is
assumed
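and then people might show up asking about characters...
You would also run into problems if your logical response vector contains only TRUE or only FALSE values. If you coerce such a vector to a factor, it will have only one level. (Although training a model on a dataset where the outcome is always FALSE does not really make sense anyway.)
However, if the authors included such an "intelligent" coercion, they would have to deal with these issues, define the behavior in those edge cases, and document it.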
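The single-level edge case mentioned above can be seen in base R alone. This is a small sketch with a made-up response vector `y`; it also shows that naming both levels explicitly keeps the second class, which is one way a caller could do the coercion by hand. Whether a model trained on such data is useful is a separate question.

```r
# A logical response in which every observation happens to be FALSE
y <- c(FALSE, FALSE, FALSE, FALSE)

# Plain coercion keeps only the values that actually occur
f_naive <- as.factor(y)
print(levels(f_naive))

# Naming both levels explicitly keeps TRUE as an (empty) class
f_fixed <- factor(y, levels = c(FALSE, TRUE))
print(levels(f_fixed))
print(table(f_fixed))
```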