Getting back original names from rpart.object
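I have saved models that were created using the rpart package in R. I am trying to retrieve some information from these saved models, specifically from rpart.object. While the documentation (rpart doc) is helpful, a few things are still unclear:
- How do I find out which variables are categorical and which are numeric? Currently, I refer to the 'index' column of the splits matrix. I have noticed that the entries are non-integer only for numeric variables. Is there a cleaner way?
- The csplit matrix refers to the values a categorical variable can take using integers, i.e. R maps the original names to integers. Is there a way to access this mapping? For example, if my original variable, say Country, can take any of the values France, Germany, Japan, etc., the csplit matrix tells me that a certain split is based on Country == 1, 2. Here, rpart has replaced the references to France, Germany with 1, 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between names and integers is?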
Usually, the terms component will contain this kind of information. See ?rpart::rpart.object.
library(rpart)  # attach rpart so that the kyphosis data set can be found
fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms  # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
 Kyphosis       Age    Number     Start
 "factor" "numeric" "numeric" "numeric"
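The lookup above can be extended into a short sketch that sorts the predictors into categorical and numeric groups. The names `classes`, `predictors`, `factor_vars`, and `numeric_vars` are illustrative, and treating both "factor" and "ordered" as categorical is an assumption about what should count as categorical here.

```r
library(rpart)  # also makes the kyphosis data set available

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# dataClasses is a named character vector with one entry per model variable
classes <- attr(fit$terms, "dataClasses")

# Drop the response so that only the predictors remain
response_idx <- attr(fit$terms, "response")
predictors <- classes[-response_idx]

# Assumption: "factor" and "ordered" are the classes treated as categorical
is_cat <- predictors %in% c("factor", "ordered")
factor_vars  <- names(predictors)[is_cat]
numeric_vars <- names(predictors)[!is_cat]

print(factor_vars)
print(numeric_vars)
```

Because this reads only `fit$terms`, it should work equally well on a model object that has been reloaded from disk.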
That example has no csplit node in its structure because none of the variables are factors. You could make one fairly easily:
> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
     [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    3
[3,]    3    1    3
[4,]    1    3    3
[5,]    3    1    3
[6,]    3    3    1
[7,]    3    1    3
[8,]    1    1    3
> attr(fit$terms, "dataClasses")
                                     Kyphosis
                                     "factor"
                                          Age
                                    "numeric"
factor(findInterval(Number, c(0, 4, 6, Inf)))
                                     "factor"
                                        Start
                                    "numeric"
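The integers are just the values of the factor variables, so the "mapping" is the same as the one from as.numeric() to levels() of a factor. If I were trying to construct a character-matrix version of the fit$csplit matrix with the level names of the factor variable substituted in, this would be one path to success: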
> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame': 81 obs. of 5 variables:
$ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
$ Age : int 71 158 128 2 1 1 61 37 113 59 ...
$ Number : int 3 3 4 5 4 2 2 3 2 6 ...
$ Start : int 5 14 5 1 15 16 17 16 16 12 ...
$ Numlev : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age + Numlev + Start, data = kyphosis)
> Levels <- fit$csplit
> Levels[] <- levels(kyphosis$Numlev)[Levels]
> Levels
     [,1]   [,2]   [,3]
[1,] "low"  "low"  "high"
[2,] "low"  "low"  "high"
[3,] "high" "low"  "high"
[4,] "low"  "high" "high"
[5,] "high" "low"  "high"
[6,] "high" "high" "low"
[7,] "high" "low"  "high"
[8,] "low"  "low"  "high"
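Before relying on that substitution, it may be worth checking it against ?rpart::rpart.object. As I read that help page, each csplit column corresponds to one factor level, and the entries are direction codes: 1 if that level goes left, 3 if it goes right, and 2 if the level is not included in the split. Under that reading, a minimal sketch would label the columns with the level names and decode the entries as directions. The names `kyph`, `direction`, and `decoded` are illustrative, and the sketch assumes Numlev is the only factor predictor.

```r
library(rpart)

kyph <- kyphosis  # local copy (hypothetical name), leaves the package data untouched
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)

cs <- fit$csplit
# Numlev is the only factor predictor, so each csplit column is one of its levels
colnames(cs) <- attr(fit, "xlevels")$Numlev

# Direction codes as described in ?rpart.object; the label strings are illustrative
direction <- c("left", "not in split", "right")
decoded <- cs
decoded[] <- direction[cs]
print(decoded)

# Rows of fit$splits with ncat > 1 are categorical splits; their "index"
# entry gives the matching row of csplit
sp <- fit$splits
cat_rows <- sp[, "ncat"] > 1
print(data.frame(variable   = rownames(sp)[cat_rows],
                 csplit_row = unname(sp[cat_rows, "index"])))
```

With several factor predictors, the row names of fit$splits would be needed to decide which variable's levels label a given csplit row, since the matrix is as wide as the factor with the most levels.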
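Response to comment: If you only have the model, then use str() to look at it. In the example I created, I see an "ordered" leaf whose factor labels are stored in an attribute named "xlevels":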
$ ordered : Named logi [1:3] FALSE FALSE FALSE
..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
- attr(*, "xlevels")=List of 1
..$ Numlev: chr [1:3] "low" "med" "high"
- attr(*, "ylevels")= chr [1:2] "absent" "present"
- attr(*, "class")= chr "rpart"
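The attributes shown in that str() output can be read directly from a reloaded model. The following is a sketch under assumptions: it supposes the model was saved with saveRDS() (the question does not say how), and the temporary path and the names `kyph`, `saved`, `xlev`, and `mapping` are illustrative.

```r
library(rpart)

kyph <- kyphosis  # local copy (hypothetical name)
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)

# Assumption: the model was stored with saveRDS(); the path is a throwaway temp file
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
saved <- readRDS(path)

# Named list: one character vector of level names per factor predictor
xlev <- attr(saved, "xlevels")

# Integer code i corresponds to the i-th level name
mapping <- data.frame(code  = seq_along(xlev$Numlev),
                      level = xlev$Numlev,
                      stringsAsFactors = FALSE)
print(mapping)

# Levels of the response are stored separately
print(attr(saved, "ylevels"))
```

Nothing here touches the original training data, so the same lookup should apply when only the model file is at hand.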