使用决策规则拆分其他数据

Question

我正在寻找一种优雅的解决方案，以根据这些规则使用在一个数据集（例如您的训练集）中创建的决策规则来拆分另一个数据集（例如测试数据）的数据。

看这个例子：

# Load PimaIndiansDiabetes dataset from mlbench package
library("mlbench")
data("PimaIndiansDiabetes")
## Split in training and test (2/3 - 1/3)
idtrain <- c(sample(1:768,512))
PimaTrain <-PimaIndiansDiabetes[idtrain,]
Pimatest <-PimaIndiansDiabetes[-idtrain,]

m1 <- RWeka::J48(as.factor(as.character(PimaTrain$diabetes)) ~ .,
                 data = PimaTrain[,-c(9)],
                 control = RWeka::Weka_control(M = 10, C= 0.25))

给出以下输出：

> m1
J48 pruned tree
------------------

glucose <= 154
|   age <= 28
|   |   glucose <= 118: neg (157.0/11.0)
|   |   glucose > 118
|   |   |   pressure <= 52: pos (10.0/3.0)
|   |   |   pressure > 52: neg (54.0/12.0)
|   age > 28
|   |   glucose <= 103: neg (54.0/10.0)
|   |   glucose > 103
|   |   |   mass <= 41.3: neg (129.0/55.0)
|   |   |   mass > 41.3: pos (12.0/1.0)
glucose > 154: pos (96.0/19.0)

Number of Leaves  :     7

Size of the tree :  13

根据这些规则，您将有 7 个小组（或叶子）。我正在寻找的是在测试数据 Pimatest 上应用这些规则（而不是重新训练决策树），这样实际上每个数据点都可以指定到指示的 7 个组之一使用新变量 group.

输出将如下所示：

head(Pimatest)
   pregnant glucose pressure triceps insulin mass pedigree age diabetes group
3         8     183       64       0       0 23.3    0.672  32      pos     7
4         1      89       66      23      94 28.1    0.167  21      neg     1
6         5     116       74       0       0 25.6    0.201  30      neg     5
7         3      78       50      32      88 31.0    0.248  26      pos     1
8        10     115        0       0       0 35.3    0.134  29      neg     5
11        4     110       92       0       0 37.6    0.191  30      neg     5

我目前有一个编码非常糟糕的工作解决方案，因此我正在为这个问题寻找一个优雅的解决方案。

Answer 1

据我了解，您希望能够将每个点与对该点进行分类的一组规则联系起来。您可以通过将 J48 树转换为 party 树并使用 partykit 包中的工具来实现。

因为您没有为随机数生成器设置种子，我们无法获得与您获得的完全相同的 test/training 拆分。我将设置种子以使我的示例可重现，但即使尽管我使用您的代码，但我的树会与您的略有不同。

可重现的例子（主要是你的代码）

library(RWeka)
library("mlbench")
data("PimaIndiansDiabetes")

## Split in training and test (2/3 - 1/3)
set.seed(1234)
idtrain <- c(sample(1:768,512))
PimaTrain <-PimaIndiansDiabetes[idtrain,]
Pimatest <-PimaIndiansDiabetes[-idtrain,]

m1 <- RWeka::J48(as.factor(as.character(PimaTrain$diabetes)) ~ .,
                 data = PimaTrain[,-c(9)],
                 control = RWeka::Weka_control(M = 10, C= 0.25))
m1
J48 pruned tree
------------------
glucose <= 122
|   mass <= 26.8: neg (85.0/1.0)
|   mass > 26.8
|   |   pregnant <= 4: neg (137.0/19.0)
|   |   pregnant > 4
|   |   |   glucose <= 106: neg (44.0/10.0)
|   |   |   glucose > 106: pos (24.0/6.0)
glucose > 122
|   glucose <= 157
|   |   age <= 31
|   |   |   age <= 24: neg (30.0/5.0)
|   |   |   age > 24
|   |   |   |   pressure <= 72: pos (16.0/5.0)
|   |   |   |   pressure > 72: neg (22.0/5.0)
|   |   age > 31: pos (78.0/27.0)
|   glucose > 157: pos (76.0/13.0)

Number of Leaves  :     9
Size of the tree :      17

我的树有 9 片叶子，而不是你的 7 片。这是由于不同的为训练集选择的实例。现在我们准备好获取规则了。

library(partykit)
Pm1 = as.party(m1)
Pm1
Fitted party:
[1] root
|   [2] glucose <= 122
|   |   [3] mass <= 26.8: neg (n = 85, err = 1.2%)
|   |   [4] mass > 26.8
|   |   |   [5] pregnant <= 4: neg (n = 137, err = 13.9%)
|   |   |   [6] pregnant > 4
|   |   |   |   [7] glucose <= 106: neg (n = 44, err = 22.7%)
|   |   |   |   [8] glucose > 106: pos (n = 24, err = 25.0%)
|   [9] glucose > 122
|   |   [10] glucose <= 157
|   |   |   [11] age <= 31
|   |   |   |   [12] age <= 24: neg (n = 30, err = 16.7%)
|   |   |   |   [13] age > 24
|   |   |   |   |   [14] pressure <= 72: pos (n = 16, err = 31.2%)
|   |   |   |   |   [15] pressure > 72: neg (n = 22, err = 22.7%)
|   |   |   [16] age > 31: pos (n = 78, err = 34.6%)
|   |   [17] glucose > 157: pos (n = 76, err = 17.1%)

Number of inner nodes:    8
Number of terminal nodes: 9

这与之前的树是同一棵树，但优点是节点已标记。我们还可以得到为每片叶子写出的规则。

Pm1_rules = partykit:::.list.rules.party(Pm1)
Pm1_rules
                                                                       3 
                                         "glucose <= 122 & mass <= 26.8" 
                                                                       5 
                          "glucose <= 122 & mass > 26.8 & pregnant <= 4" 
                                                                       7 
          "glucose <= 122 & mass > 26.8 & pregnant > 4 & glucose <= 106" 
                                                                       8 
           "glucose <= 122 & mass > 26.8 & pregnant > 4 & glucose > 106" 
                                                                      12 
                "glucose > 122 & glucose <= 157 & age <= 31 & age <= 24" 
                                                                      14 
"glucose > 122 & glucose <= 157 & age <= 31 & age > 24 & pressure <= 72" 
                                                                      15 
 "glucose > 122 & glucose <= 157 & age <= 31 & age > 24 & pressure > 72" 
                                                                      16 
                             "glucose > 122 & glucose <= 157 & age > 31" 
                                                                      17 
                                         "glucose > 122 & glucose > 157"

决定被写成规则。规则集的名称是叶节点的数量。要获得用于测试点的规则，您只需要知道它最终位于哪个叶节点即可。但是 party 对象的 predict 方法会给你那个。

TestPred = predict(Pm1, newdata=Pimatest, type="node")
TestPred
  3   4   5   6   9  12  17  20  22  27  28  29  31  32  33  35  36  38  41  43 
 17   5  16   3  17  17   5   5   7  16   3  16   8  17   3   8   3   7  17   3 
 46  48  50  56  57  60  62  64  65  66  68  70  72  75  76  79  84  95  96  97 
 17   5   3   3  17   5  16  12   8   7   5  15  14   5   3  14   3  12  16   5 
...

我截断了输出，因为它太长了。现在，例如，
我们看到第一个测试点去了节点 17。我们只需要用它来索引规则集。但需要一点小心。 predict 返回的 17 是一个数字。规则集的名称是一个字符串，所以我们需要使用as.character来转换它。

Pm1_rules[as.character(TestPred[1])]
                             17 
"glucose > 122 & glucose > 157"

我们确认：

Pimatest[1,]
  pregnant glucose pressure triceps insulin mass pedigree age diabetes
3        8     183       64       0       0 23.3    0.672  32      pos

所以是的，glucose > 122 和 glucose > 157

您可以用同样的方法获取其他测试点的规则。

使用决策规则拆分其他数据

Use decision rules to split other data

r

decision-tree

weka

rweka