Random forest classifier result from predict_proba() does not match predict()?

Question

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# ItemSelector is a custom transformer (not part of scikit-learn), presumably
# selecting one DataFrame column; X and df are defined elsewhere.
pipeline = Pipeline([
    # Build TF-IDF features separately for each text column, then concatenate.
    ('features', FeatureUnion([
        ('Comments', Pipeline([
            ('selector', ItemSelector(column="Comments")),
            ('tfidf', TfidfVectorizer(use_idf=False, ngram_range=(1, 2), max_df=0.95, min_df=0, sublinear_tf=True)),
        ])),
        ('Vendor', Pipeline([
            ('selector', ItemSelector(column="Vendor Name")),
            ('tfidf', TfidfVectorizer(use_idf=False)),
        ])),
    ])),
    ('clf', RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=45)),
    # ('clf', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X,
                                df['code Description'],
                                test_size=0.3,
                                train_size=0.7,
                                random_state=100)
model = pipeline.fit(X_train, y_train)
s = pipeline.score(X_test, y_test)       # mean accuracy on the test set
pred = model.predict(X_test)             # predicted class labels
predicted = model.predict_proba(X_test)  # per-class probabilities

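For some samples, the class returned by predict() matches the highest predicted probability. But in some cases, I get

proba_predict = [0.3,0.18,0.155]

and instead of being classified as Class A, the sample is classified as Class B.

Predicted class: B

Actual class: A

The right column is my labels and the left column is my input text data: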

Answer 1

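I think you are describing the following situation: for a test vector from X_test, predict_proba() returns a predicted probability distribution y = [p1, p2, p3] with p1 > p2 and p1 > p3, yet predict() does not output class 0 for this vector.

If you check the source code of the predict function of sklearn's RandomForestClassifier, you will see that it calls the forest's own predict_proba() method:

proba = self.predict_proba(X)

The class is then chosen by taking the argmax over these probabilities and looking up the corresponding entry of classes_.

So the prediction step uses predict_proba() to produce its output. As far as I can tell, nothing can go wrong there.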
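The relationship described above can be checked directly. The following is a minimal sketch, not the asker's pipeline: it assumes synthetic data from make_classification in place of the TF-IDF features, and the labels "A", "B", "C" are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data (a hypothetical stand-in for the TF-IDF features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
labels = np.array(["A", "B", "C"])[y]  # string labels, as in the question

clf = RandomForestClassifier(n_estimators=50, random_state=45).fit(X, labels)

proba = clf.predict_proba(X)
# Column j of proba corresponds to clf.classes_[j], so the label with the
# highest probability is classes_ indexed by the row-wise argmax.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

predictions_agree = bool(np.array_equal(from_proba, clf.predict(X)))
print(predictions_agree)
```

If the same comparison on the real model shows agreement, the mismatch most likely comes from how the probabilities were matched to class names afterwards.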

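I assume you mixed up some of the class names somewhere in your routine and got confused there. With the information you have provided, I cannot give a more detailed answer.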
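One way such a mix-up could arise, sketched with made-up numbers: the columns of predict_proba() follow the order of model.classes_, so sorting a row of probabilities without carrying the labels along loses the association. The class names and probabilities below are hypothetical and only chosen so that the top three values match those in the question; this is not a diagnosis of the asker's data.

```python
import numpy as np

# Hypothetical values: in practice use model.classes_ and one row of
# model.predict_proba(X_test).
classes = np.array(["A", "B", "C", "D", "E", "F"])
row = np.array([0.18, 0.30, 0.155, 0.15, 0.12, 0.095])

# Sort indices by descending probability and keep each label with its value.
order = np.argsort(row)[::-1]
top3 = [(str(classes[i]), float(row[i])) for i in order[:3]]
print(top3)

# The label predict() would return for this row.
predicted = str(classes[np.argmax(row)])
print(predicted)
```

Here the sorted probabilities start with 0.30, 0.18, 0.155, but the first of them belongs to "B", not "A". Printing the (label, probability) pairs rather than the bare sorted values makes this visible.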

python

classification

machine-learning

random-forest

text-classification