R ctree strange error

I am running into a strange problem when using ctree inside a for loop over my data. If I write this code in a loop, R freezes.

data = read.csv("train.csv") #data description https://www.kaggle.com/c/titanic-gettingStarted/data

treet = ctree(Survived ~ ., data = data)

Sometimes I get the error message "More than 52 levels in a predicting factor, truncated for printout" and my tree is displayed in a very strange way. Sometimes it works fine. Really, really strange!


functionPlot <- function(traine, i) {
  print(i) # print only once, then RStudio freezes
  tempd <- ctree(Survived ~ ., data = traine)
  plot(tempd)
}

for(i in 1:2) {
  smp_size <- floor(0.70 * nrow(data))
  train_ind <- sample(seq_len(nrow(data)), size = smp_size)
  set.seed(100 + i)
  train <- data[train_ind, ]
  test <- data[-train_ind, ]
  functionPlot(train, i)
}

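The ctree() function expects that (a) appropriate classes (numeric, factor, etc.) are used for each of the variables, and that (b) only useful predictors are used in the model formula.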

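As for (b), you supply variables that are really just character strings (like Name) and not factors. These either need to be pre-processed appropriately or omitted from the analysis.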
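One way to spot such variables before fitting is to tabulate the class and the number of distinct values of each column. The following is only a minimal sketch in base R: the data frame `toy` contains made-up values standing in for the Titanic data, and `describe_columns()`, `suspicious_columns()`, and the `max_levels` threshold are hypothetical helpers introduced for illustration, not part of party or partykit. Note that in R versions before 4.0.0, read.csv() converted strings to factors by default, so a column like Name could arrive as a factor with hundreds of levels rather than as character.

```r
# Toy data frame standing in for the Titanic data (values are made up).
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Pclass   = c(3, 1, 3, 1, 3, 2),
  Name     = c("A", "B", "C", "D", "E", "F"),
  Sex      = factor(c("male", "female", "female", "female", "male", "male")),
  Age      = c(22, 38, 26, 35, NA, 54),
  stringsAsFactors = FALSE
)

# Hypothetical helper: class and number of distinct non-missing values per column.
describe_columns <- function(df) {
  data.frame(
    variable = names(df),
    class    = vapply(df, function(x) class(x)[1], character(1)),
    n_unique = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: flag character columns and factors with many distinct values.
# The threshold max_levels is an arbitrary choice for this illustration.
suspicious_columns <- function(df, max_levels = 10) {
  info <- describe_columns(df)
  is_char       <- info$class == "character"
  is_big_factor <- info$class == "factor" & info$n_unique > max_levels
  info$variable[is_char | is_big_factor]
}

info <- describe_columns(toy)
print(info)

flagged <- suspicious_columns(toy)
print(flagged)
```

On the toy data only Name is flagged. Columns flagged this way are candidates for either further pre-processing or for being left out of the model formula.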

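Even if you pre-process or omit such variables, you will not get the best results, because some variables (like Survived and Pclass) are coded numerically but are really categorical variables that should be factors. If you look at the scripts from https://www.kaggle.com/c/titanic/forums/t/13390/introducing-kaggle-scripts you will also see how the data preparation can be done. Here, I use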

titanic <- read.csv("train.csv")
titanic$Survived <- factor(titanic$Survived,
  levels = 0:1, labels = c("no", "yes"))
titanic$Pclass <- factor(titanic$Pclass)
titanic$Name <- as.character(titanic$Name)
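To illustrate what this recoding does, here is a small sketch on made-up vectors (the values and the names `survived_raw`, `survived`, `pclass_raw`, and `pclass` are assumptions for illustration, not the real data). It shows that numerically coded 0/1 values become a two-level factor with readable labels, and that a numeric class code is no longer treated as a number afterwards.

```r
# Made-up values mimicking the numeric coding in train.csv.
survived_raw <- c(0, 1, 1, 0)
pclass_raw   <- c(3, 1, 3, 2)

# Same recoding as applied to the titanic data frame above.
survived <- factor(survived_raw, levels = 0:1, labels = c("no", "yes"))
pclass   <- factor(pclass_raw)

print(levels(survived))   # labels replace the 0/1 codes
print(table(survived))    # counts per category
print(is.numeric(pclass)) # FALSE: no longer treated as a number
print(nlevels(pclass))    # one level per distinct class
```

With factors, a tree can split on sets of categories (for example "Pclass in 1, 2") instead of on a numeric threshold.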

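As for (b), I then go on to call ctree() with just the variables that have been sufficiently pre-processed for a meaningful analysis. (And I use the newer recommended implementation from the package partykit.)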

library("partykit")
ct <- ctree(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
  data = titanic)
print(ct)



Model formula:
Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

Fitted party:
[1] root
|   [2] Sex in female
|   |   [3] Pclass in 1, 2: yes (n = 170, err = 5.3%)
|   |   [4] Pclass in 3
|   |   |   [5] Fare <= 23.25: yes (n = 117, err = 41.0%)
|   |   |   [6] Fare > 23.25: no (n = 27, err = 11.1%)
|   [7] Sex in male
|   |   [8] Pclass in 1
|   |   |   [9] Age <= 52: no (n = 88, err = 43.2%)
|   |   |   [10] Age > 52: no (n = 34, err = 20.6%)
|   |   [11] Pclass in 2, 3
|   |   |   [12] Age <= 9
|   |   |   |   [13] Pclass in 3: no (n = 71, err = 18.3%)
|   |   |   |   [14] Pclass in 2: yes (n = 13, err = 30.8%)
|   |   |   [15] Age > 9: no (n = 371, err = 11.3%)

Number of inner nodes:    7
Number of terminal nodes: 8