R: Kaggle Titanic Dataset Random Forest NAs introduced by coercion
I am currently practicing R on Kaggle using the Titanic dataset, and I am using the random forest algorithm.
Here is the code:
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age_Bucket + Embarked
+ Age_Bucket + Fare_Bucket + F_Name + Title + FamilySize + FamilyID,
data=train, importance=TRUE, ntree=5000)
I get the following error:
Error in randomForest.default(m, y, ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In data.matrix(x) : NAs introduced by coercion
4: In data.matrix(x) : NAs introduced by coercion
My data looks like this:
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1...
$ Age_Bucket : chr "20-25" "30-40" "25-30" "30-40" ...
$ Fare_Bucket: chr "<10" "30+" "<10" "30+" ...
$ Title : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ F_Name : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ FamilyID : chr "Small" "Small" "Alone" "Small" ...
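If I run only the following, I get no coercion problem, and as far as I can tell it is the only place where a coercion that could create NA values takes place: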
as.factor(Survived)
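Can anyone see what the problem is?
Thanks for your time.
You need to convert the character (chr) columns to factors. Factors are stored internally as integer codes, whereas character columns are not. See the small demo below.
Data: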
library(randomForest)
df <- data.frame(y = sample(0:1, 26, replace = TRUE), x1 = runif(26), x2 = letters, stringsAsFactors = FALSE)
df$y <- as.factor(df$y)
> str(df)
'data.frame': 26 obs. of 3 variables:
$ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 1 ...
$ x1: num 0.457 0.296 0.517 0.478 0.764 ...
$ x2: chr "a" "b" "c" "d" ...
Now if I run randomForest:
> randomForest(y ~ x1 + x2, data=df)
Error in randomForest.default(m, y, ...) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion
I get the same error as you.
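The "NAs introduced by coercion" text is the warning R gives when a string that does not look like a number is converted to numeric. Here is a minimal base-R sketch of the difference between a character vector and a factor in that respect. The vector is made up for illustration, and the sketch is not a trace of what randomForest does internally:

```r
# Hypothetical values, for illustration only.
chr <- c("20-25", "30-40", "25-30")

# Non-numeric strings become NA when coerced to numbers; the
# "NAs introduced by coercion" warning is suppressed here.
chr_as_num <- suppressWarnings(as.numeric(chr))
print(chr_as_num)

# A factor is stored as integer codes plus a set of levels,
# so converting it to integers loses nothing.
fac <- factor(chr)
fac_codes <- as.integer(fac)
print(typeof(fac))
print(levels(fac))
print(fac_codes)
```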
Whereas if I convert the chr column to a factor:
df$x2 <- as.factor(df$x2)
> randomForest(y ~ x1 + x2, data=df)
Call:
randomForest(formula = y ~ x1 + x2, data = df)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 61.54%
Confusion matrix:
0 1 class.error
0 0 16 1
1 0 10 0
It works!
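For a data frame with several chr columns, like the one in the question, the same conversion can be applied to all of them at once. The following is a rough sketch on a toy data frame; `train_demo` and its values are hypothetical stand-ins for the real `train` data:

```r
# Toy stand-in for `train`; the values are invented for illustration.
train_demo <- data.frame(
  Survived   = c(0L, 1L, 1L, 0L),
  Age_Bucket = c("20-25", "30-40", "25-30", "30-40"),
  FamilyID   = c("Small", "Small", "Alone", "Small"),
  FamilySize = c(2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each of them to a factor.
is_chr <- vapply(train_demo, is.character, logical(1))
train_demo[is_chr] <- lapply(train_demo[is_chr], factor)

# Age_Bucket and FamilyID should now show up as Factor columns.
str(train_demo)
```

One caveat, as far as I know: randomForest refuses factor predictors with too many levels (53 in recent versions of the package), so a column such as F_Name, with one level per surname, may still need to be dropped or grouped after conversion.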