Sklearn text classification: Why is accuracy so low?
OK, I'm following https://medium.com/@phylypo/text-classification-with-scikit-learn-on-khmer-documents-1a395317d195 and https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html to try to classify text by category. My dataframe is laid out as below and is named result:
target  type  post
1       intj  "hello world shdjd"
2       entp  "hello world fddf"
16      estj  "hello world dsd"
4       esfp  "hello world sfs"
1       intj  "hello world ddfd"
The aim is to classify post by type; target just assigns the numbers 1-16 to the 16 types. To classify the text, I did this:
from sklearn import model_selection, preprocessing, metrics, naive_bayes, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

result = result[:1000]  # shorten df - was :600

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    result['post'], result['type'], test_size=0.30, random_state=1)

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# simple whitespace tokenizer passed to TfidfVectorizer
def tokenizersplit(s):
    return s.split()

tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8',
                             min_df=2, ngram_range=(1, 2), max_features=25000)
tfidf_vect.fit(result['post'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the classifier on the training dataset
    classifier.fit(trains, t_labels)
    # predict the labels on the validation dataset and score them
    predictions = classifier.predict(valids)
    return metrics.accuracy_score(v_labels, predictions)

# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("NB accuracy: ", accuracy)

# Logistic Regression
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("LR accuracy: ", accuracy)
Depending on how much I shorten result at the start, accuracy peaks at around 0.4 for every algorithm. It should be 0.8-0.9.
I read a related answer but don't know how to apply it to my dataframe. My data is simple: it has a category (type) and text (post).
What is going wrong here?
Edit - Naive Bayes, take 2:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(result.post, result.target)

docs_test = result.post
predicted = text_clf.predict(docs_test)
print("Naive Bayes: ")
print(np.mean(predicted == result.target))
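Note that this take 2 fits and predicts on the very same posts, so the printed number is training accuracy rather than validation accuracy. A minimal sketch of scoring the same Pipeline on a held-out split instead (reusing train_test_split from above; the variable names here are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# hold out 30% of the posts for validation, as in the first attempt
train_posts, valid_posts, train_targets, valid_targets = train_test_split(
    result.post, result.target, test_size=0.30, random_state=1)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(train_posts, train_targets)  # fit only on the training split

# Pipeline.score reports accuracy on the held-out posts
print("Naive Bayes (held-out):", text_clf.score(valid_posts, valid_targets))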
What you're doing
I think the error is in these lines:
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
By fitting twice, you reset what the LabelEncoder has learned.
In a simpler example:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y_train = le.fit_transform(["class1", "class2", "class3"])
y_valid = le.fit_transform(["class2", "class3"])
print(y_train)
print(y_valid)
this outputs the following label encodings:
[0 1 2]
[0 1]
This is wrong, because encoded label 0 means class1 in training but class2 in validation.
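One way to see the shifted mapping directly is the encoder's classes_ attribute: the index of a class in classes_ is its encoded label.

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform(["class1", "class2", "class3"])
print(le.classes_)  # ['class1' 'class2' 'class3'] -> codes 0, 1, 2
le.fit_transform(["class2", "class3"])
print(le.classes_)  # ['class2' 'class3'] -> codes 0, 1: label 0 now means class2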
The fix
I would change your first lines to:
result = result[:1000]  # shorten df - was :600

# Encode the labels before splitting
encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(result['type'])

# CARE that I changed the target from result['type'] to y_encoded
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    result['post'], y_encoded, test_size=0.30, random_state=1)

def tokenizersplit(s):
    return s.split()
.
.
.
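An equivalent pattern, sketched under the same setup, is to fit the encoder on the training labels only and reuse the fitted encoder for validation; calling transform without a second fit keeps the codes consistent:

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    result['post'], result['type'], test_size=0.30, random_state=1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)  # learn the label -> code mapping
valid_y = encoder.transform(valid_y)      # reuse it; no refit, codes stay consistent

Note that transform raises a ValueError if the validation labels contain a class not seen during fitting, which is why encoding before the split, as above, is the simpler option here.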