How can accuracy differ between one_hot_encode and count_vectorizer for the same dataset?
onehot_enc, BernoulliNB:
Here I used two separate files for reviews and labels, and I used train_test_split to split the data randomly into 80% training data and 20% test data.
reviews.txt:
Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear
labels.txt:
positive
negative
positive
negative
My code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix

with open("/Users/abc/reviews.txt") as f:
    reviews = f.read().split("\n")

with open("/Users/abc/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)

X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)

bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)

score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :", score)  # 90%

predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)

print("precision_score :", precision_score)  # 92%
print("recall_score :", recall_score)  # 97%
CountVectorizer, MultinomialNB:
Here I split the same data manually into train (80%) and test (20%) and fed these two CSV files to the algorithm.
However, this approach gives lower accuracy than the one above. Could anyone help me with this?
train_data.csv:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
test_data.csv:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
My code:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def load_data(filename):
    # Read a CSV with a header row; return parallel lists of reviews and labels.
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the header row
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0])
    return reviews, labels

X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')

vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_transformed, y_train)

score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)  # 46%

y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test, y_pred))

# With average='micro', precision and recall both reduce to overall accuracy,
# which is why all three numbers come out the same.
print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))  # 46%
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))  # 46%
The problem here is that you are fitting the MultiLabelBinarizer with:

onehot_enc.fit(reviews_tokens)

before splitting into train and test, so the test data is leaked to the model, and that is why the accuracy is higher. A fix is sketched below.
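Here is a minimal sketch of the fix (it reuses reviews_tokens and labels from your first snippet and is otherwise untested against your files): split first, then fit the binarizer on the training tokens only.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import BernoulliNB

# Split first, then build the vocabulary from the training tokens only,
# so nothing about the test set ever reaches the encoder.
X_train, X_test, y_train, y_test = train_test_split(
    reviews_tokens, labels, test_size=0.20, random_state=1)

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)  # vocabulary comes from the training data alone

bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)

# transform() warns about and drops tokens it never saw during fit,
# which mirrors what CountVectorizer does at test time.
score = bnbc.score(onehot_enc.transform(X_test), y_test)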
On the other hand, when you use CountVectorizer, it only sees the training data, and it then ignores words that do not occur in the training data, even though those words might be valuable to the classification model. So, depending on how much data you have, this can make a huge difference. In any case, your second technique (using CountVectorizer) is the correct one and is what should be used with text data. MultiLabelBinarizer and one-hot encoding in general should only be used with categorical data, not text data. A small demonstration of the out-of-vocabulary behaviour follows.
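To make that out-of-vocabulary behaviour concrete, here is a small self-contained illustration (the toy sentences are invented for the example; on scikit-learn versions before 1.0, use get_feature_names instead of get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the picture is clear"])             # vocabulary: clear, is, picture, the
X = vec.transform(["the picture is blurry"])  # "blurry" was never seen during fit

print(vec.get_feature_names_out())  # ['clear' 'is' 'picture' 'the']
print(X.toarray())                  # [[0 1 1 1]] -- "blurry" is silently dropped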
Could you share your complete data?