Great accuracy on IMDB Sentiment Analysis. Is there any train data leakage I'm missing?

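I'm getting unusually high accuracy on a sentiment analysis classifier that I'm testing with the Python sklearn library. That usually points to some kind of training data leakage, but I can't tell whether that's the case here.

My dataset has about 50k non-duplicated IMDB reviews.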

import numpy as np  # needed for np.nan in the label mapping below
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from pprint import pprint
from time import time
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt

imdb_data=pd.read_csv('../../data/home/data/tm/en-sentiment/imdb_reviews_train.csv')
imdb_data=imdb_data.drop_duplicates().reset_index(drop=True)
imdb_data['label'] = imdb_data.label.map(lambda x: int(1) if x =='pos' else int(0) if x =='neg' else np.nan)

x_train, x_test, y_train, y_test = train_test_split(imdb_data.text, imdb_data.label, test_size=0.30, random_state=2)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

parameters_final = {
    'vect__max_df': [0.3],
    'vect__min_df': [1],
    'vect__max_features': [None],
    'vect__ngram_range': [(1, 2)], 
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ['l2'],
    'tfidf__sublinear_tf': (True, False),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ['elasticnet'],
    'clf__max_iter': [50],
}

grid_search = GridSearchCV(pipeline, parameters_final, n_jobs=-1, verbose=1, cv=3)
grid_search.fit(x_train, y_train)
y_pred = grid_search.predict(x_test)
print("Accuracy: ", sklearn.metrics.accuracy_score(y_true=y_test, y_pred=y_pred))

Output:

Accuracy:  0.8967533466687183

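The reviews dataset can be found here

Any clues?

A good way to test for data leakage is to check performance on the validation set in the repository you linked, here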
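As a rough sketch of that check: fit on the training split only, then score both the hold-out split and the separate validation set. This assumes the validation reviews are loaded into a DataFrame with the same `text` and `label` columns as the training file. The helper name `compare_holdout_and_validation` and the tiny in-memory frames are made up for illustration and are not the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def compare_holdout_and_validation(train_df, val_df, test_size=0.30, random_state=2):
    """Fit on the training split only, then score hold-out and validation data."""
    x_train, x_test, y_train, y_test = train_test_split(
        train_df.text, train_df.label, test_size=test_size, random_state=random_state
    )
    model = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier(random_state=0)),
    ])
    # The vectorizer vocabulary and IDF weights are learned from x_train only.
    model.fit(x_train, y_train)
    return {
        'holdout': accuracy_score(y_test, model.predict(x_test)),
        'validation': accuracy_score(val_df.label, model.predict(val_df.text)),
    }


# Made-up stand-in data (rows are repeated only so the classifier has enough
# samples; scores on this toy data are meaningless). Replace both frames with
# the real training and validation files.
train_df = pd.DataFrame({
    'text': ['great movie', 'loved it', 'wonderful acting', 'really great fun',
             'terrible movie', 'hated it', 'awful acting', 'really boring mess'] * 5,
    'label': [1, 1, 1, 1, 0, 0, 0, 0] * 5,
})
val_df = pd.DataFrame({
    'text': ['great fun', 'awful mess'],
    'label': [1, 0],
})
scores = compare_holdout_and_validation(train_df, val_df)
print(scores)
```

A validation score far below the hold-out score would hint at leakage or a mismatch between the two sets. Similar scores suggest the hold-out accuracy is genuine.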

I downloaded the dataset and tried building a Naive Bayes classifier with a pipeline like this:

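from sklearn.naive_bayes import MultinomialNB

# Same vectorizer and TF-IDF steps as in the question, with Naive Bayes as the classifier.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])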

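Using the same train-test split as you, I got an accuracy score of 0.86 on the hold-out data from the training set and 0.83 on the validation set. If you get similar results, I think it may simply be that the dataset isn't very hard to learn. I checked whether any NA values might be causing odd performance, but imdb_data.isnull().any() does return False.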
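Two quick sanity checks along these lines can be sketched as follows: whether any test review also appears in the training split once case and whitespace are normalized, and whether the frame contains missing values. The helper names `normalize`, `leakage_report`, and `has_missing` are hypothetical, and the sample reviews are invented for illustration.

```python
import pandas as pd


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return ' '.join(str(text).lower().split())


def leakage_report(x_train, x_test):
    """Count test texts that also occur in the training split after normalization."""
    train_texts = set(pd.Series(x_train).map(normalize))
    test_norm = pd.Series(x_test).map(normalize)
    overlap = test_norm.isin(train_texts)
    return {
        'n_test': len(test_norm),
        'n_overlap': int(overlap.sum()),
        'overlap_fraction': float(overlap.mean()),
    }


def has_missing(df):
    """Return True if any cell in the frame is NA."""
    # isnull().any() gives one boolean per column; the second any() reduces to one value.
    return bool(df.isnull().any().any())


# Invented examples: one test review duplicates a training review up to casing.
x_train = pd.Series(['Great movie!', 'Awful  plot', 'Loved it'])
x_test = pd.Series(['great movie!', 'Never again'])
report = leakage_report(x_train, x_test)
print(report)

frame = pd.DataFrame({'text': ['ok', None], 'label': [1, 0]})
print(has_missing(frame))
```

Note that drop_duplicates() only removes rows that match exactly, so a normalized comparison like this can catch copies that differ only in casing or spacing.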