Isolation Tree algorithm question about classification

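In the part where we build the trees (iTrees), I don't understand why we use the following classification code, which closely resembles decision-tree classification: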

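import numpy as np

def classify_data(data):
    # Treat the last column of the DataFrame as the label column
    label_column = data.values[:, -1]
    unique_classes, counts_unique_classes = np.unique(label_column, return_counts=True)

    # Return the most frequent value in the label column
    index = counts_unique_classes.argmax()
    classification = unique_classes[index]

    return classification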

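So we take the last column and return its most frequent unique value? That may make sense for a decision tree, but I don't understand why it is used in an isolation forest.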

The whole iTree code looks like this:

def isolation_tree(data, counter=0,
                   max_depth=50, random_subspace=False):
    # End loop if max depth or if isolated
    if (counter == max_depth) or data.shape[0] <= 1:
        classification = classify_data(data)
        return classification

    else:
        # Counter
        counter += 1

        # Select random feature
        split_column = select_feature(data)

        # Select random value
        split_value = select_value(data, split_column)

        # Split data
        data_below, data_above = split_data(data, split_column, split_value)

        # instantiate sub-tree
        question = "{} <= {}".format(split_column, split_value)
        sub_tree = {question: []}

        # Recursive part
        below_answer = isolation_tree(data_below, counter, max_depth=max_depth)
        above_answer = isolation_tree(data_above, counter, max_depth=max_depth)

        if below_answer == above_answer:
            sub_tree = below_answer
        else:
            sub_tree[question].append(below_answer)
            sub_tree[question].append(above_answer)

        return sub_tree

Edit: here is a sample of the data, followed by the output of running classify_data on it:

      feat1     feat2
0  3.300000  3.300000
1 -0.519349  0.353008
2 -0.269108 -0.909188
3 -1.887810 -0.555841
4 -0.711432  0.927116
label columns: [ 3.3         0.3530081  -0.90918776 -0.55584138  0.92711613]
unique_classes, counts unique classes: [-0.90918776 -0.55584138  0.3530081   0.92711613  3.3       ] [1 1 1 1 1]
-0.9091877609469025
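
The result above can be explained with a small sketch. When the last column holds continuous feature values rather than labels, every value is unique, so all counts are 1. `np.unique` returns its values sorted and `argmax` picks the first index on a tie, so the function simply returns the smallest value in the last column. This sketch assumes a plain NumPy array in place of the DataFrame (so `data[:, -1]` stands in for `data.values[:, -1]`) and uses the rounded values printed above.

```python
import numpy as np

def classify_data(data):
    # Same logic as the function in the question, for a plain NumPy array
    label_column = data[:, -1]
    unique_classes, counts = np.unique(label_column, return_counts=True)
    return unique_classes[counts.argmax()]

# feat1 and feat2 from the sample above (rounded as printed)
data = np.array([
    [3.3, 3.3],
    [-0.519349, 0.3530081],
    [-0.269108, -0.90918776],
    [-1.887810, -0.55584138],
    [-0.711432, 0.92711613],
])

result = classify_data(data)
# All counts are 1, so the "most frequent class" is just the column minimum
print(result, result == data[:, -1].min())
```

In other words, the returned value is not a class at all here; it is an arbitrary feature value.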

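So it later turned out that the classification part was only there for testing and has no value. If you use this code (it is popular on Medium), remove the classification function, because it serves no purpose.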
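
For anyone wondering what the tree could look like without the classification step, here is a rough sketch, not the Medium author's code. It assumes a NumPy array as input, and the dictionary keys (`size`, `column`, `value`, `below`, `above`) and the `path_length` helper are names I made up for illustration. Leaves only record how many points ended up there, which is the kind of information an isolation tree needs, and the depth at which a point reaches a leaf is what gets used for scoring. The usual adjustment for leaves holding more than one point is left out to keep it short.

```python
import numpy as np

def isolation_tree(data, rng, depth=0, max_depth=50):
    n = data.shape[0]
    # Stop when the depth limit is reached or the point is isolated
    if depth >= max_depth or n <= 1:
        return {"size": n}

    # Pick a random feature and a random split value within its range
    col = int(rng.integers(data.shape[1]))
    lo, hi = data[:, col].min(), data[:, col].max()
    if lo == hi:
        # Simplification: stop if the chosen feature cannot split the data
        return {"size": n}
    value = rng.uniform(lo, hi)

    mask = data[:, col] <= value
    return {
        "column": col,
        "value": value,
        "below": isolation_tree(data[mask], rng, depth + 1, max_depth),
        "above": isolation_tree(data[~mask], rng, depth + 1, max_depth),
    }

def path_length(point, tree, depth=0):
    # Number of splits a point passes through before reaching a leaf
    if "size" in tree:
        return depth
    branch = "below" if point[tree["column"]] <= tree["value"] else "above"
    return path_length(point, tree[branch], depth + 1)

rng = np.random.default_rng(0)
data = np.array([
    [3.3, 3.3],
    [-0.519349, 0.353008],
    [-0.269108, -0.909188],
    [-1.887810, -0.555841],
    [-0.711432, 0.927116],
])
tree = isolation_tree(data, rng)
print([path_length(p, tree) for p in data])
```

Averaged over many trees, points that are easy to isolate (such as the first row here) would be expected to have shorter paths, which is the signal an isolation forest uses.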