Adding optimizations decreases the accuracy, precision, and F1 of classifier algorithms

Question

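I want to build an algorithm that classifies text as ham or spam. I have train/test data for each class: the training data has 8000 sentences per class, and the test data has 2000 sentences per class.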

X_train looks like this: ['please, call me asap!', 'watch out the new sales!', 'hello jim can we talk?', 'only today you can buy this', "don't miss our offer!"]

y_train looks like this: [1 0 1 0 0], where 1 = ham and 0 = spam.

X_test and y_test have the same form.

Here is my code snippet:

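from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# classifier can be LogisticRegression, MultinomialNB, RandomForest, DecisionTree
text_clf = Pipeline([('vect', CountVectorizer()),      # raw text -> token counts
                     ('tfidf', TfidfTransformer()),    # token counts -> TF-IDF weights
                     ('clf', classifier),
                    ])
model = text_clf.fit(X_train, y_train)
y_predict = model.predict(X_test)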

These are the metrics I measure:

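from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(accuracy_score(y_test, y_predict))
print(f1_score(y_test, y_predict, average="weighted"))
# recall is reported for class 1 (ham) only; f1 and precision are averaged over both classes
print(recall_score(y_test, y_predict, pos_label=1, average="binary"))
print(precision_score(y_test, y_predict, average="weighted"))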

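If I don't use any optimizations (remove stop words, remove punctuation, stem words, lemmatize words), I get about 95% on every metric. If I use these optimizations, accuracy, F1 score, and precision drop drastically to 50-60%. Recall stays unchanged at 95%.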
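To make the arithmetic behind such a pattern concrete, here is a small worked sketch. The confusion-matrix counts are made up for illustration (they are not the results from this question); they only assume a balanced 2000/2000 test set where almost every sample is predicted as class 1 (ham). In that situation binary recall for class 1 stays high while accuracy and the weighted precision and F1 fall.

```python
# Hypothetical confusion-matrix counts on a balanced 2000/2000 test set.
# These numbers are invented for illustration, not taken from the question.
tp = 1900  # ham correctly predicted as ham
fn = 100   # ham wrongly predicted as spam
fp = 1700  # spam wrongly predicted as ham
tn = 300   # spam correctly predicted as spam

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

# Per-class precision and recall.
recall_ham = tp / (tp + fn)          # what recall_score(pos_label=1, average="binary") measures
precision_ham = tp / (tp + fp)
recall_spam = tn / (tn + fp)
precision_spam = tn / (tn + fn)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# "weighted" averaging weights each class by its share of the true labels.
w_ham = (tp + fn) / total
w_spam = (tn + fp) / total
precision_weighted = w_ham * precision_ham + w_spam * precision_spam
f1_weighted = w_ham * f1(precision_ham, recall_ham) + w_spam * f1(precision_spam, recall_spam)

print("accuracy          ", round(accuracy, 3))            # 0.55
print("recall (ham)      ", round(recall_ham, 3))          # 0.95
print("precision weighted", round(precision_weighted, 3))  # 0.639
print("f1 weighted       ", round(f1_weighted, 3))         # 0.464
```

With these invented counts, recall for ham is 95% even though only 55% of all predictions are correct, so a high binary recall alone says little about overall quality.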

Why is this happening? Where am I going wrong? Am I computing these metrics correctly? Or is this normal behavior?

Answer 1

I just figured out what was wrong: underfitting. I ran cross-validation:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')

Now everything is fine and I get the results I was expecting.
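For reference, a minimal self-contained sketch of that kind of cross-validation run is shown below. The sentences are a tiny invented stand-in for the real data, LogisticRegression is just one of the classifiers mentioned in the question, and cv=3 is used only because the toy set is too small for 10 folds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the real training data (invented sentences; 1 = ham, 0 = spam).
ham = [
    "please call me asap",
    "hello jim can we talk",
    "see you at lunch tomorrow",
    "can you send me the report",
    "call me when you get home",
    "are we still meeting tonight",
]
spam = [
    "watch out the new sales",
    "only today you can buy this",
    "do not miss our offer",
    "buy now and win a prize",
    "new offer only for you today",
    "exclusive sales offer buy today",
]
X_train = ham + spam
y_train = [1] * len(ham) + [0] * len(spam)

# Same pipeline shape as in the question, with one concrete classifier plugged in.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

# The whole pipeline is refit inside each fold, so the vectorizer never sees
# the held-out sentences of that fold during fitting.
scores = cross_val_score(text_clf, X_train, y_train, cv=3, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Passing the pipeline itself (rather than pre-transformed features) to cross_val_score keeps the text vectorization inside each fold, and comparing the per-fold scores gives a steadier estimate than a single train/test split.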

python

classification

machine-learning

text-classification