Cross validation and text classification for imbalanced data

I'm new to NLP and I'm trying to build a text classifier, but my data is currently imbalanced: the largest category has as many as 280 entries, while the smallest has only 30. I'm trying to apply cross-validation to the current data, but after several days of searching I still can't implement it. It looks simple, but I can't get it working. Here is my code:

import numpy as np
from sklearn.model_selection import train_test_split

y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    # n_iter was removed in newer scikit-learn; use max_iter instead
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=1000, random_state=42)),
])
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ', np.mean(predicted_svm == y_test))

I've also done some grid search and tried a stemmer, but now I'd like to add cross-validation to this code. I've cleaned the data thoroughly, yet I'm still only getting about 60% accuracy. Any help would be appreciated.
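One way to cross-validate the pipeline above in a few lines is `cross_val_score` with a `StratifiedKFold` splitter. The sketch below uses toy stand-in data for the question's `X` and `y` (the column names and labels are invented for illustration); macro-F1 is suggested instead of plain accuracy because it weights the 30-sample class the same as the 280-sample one:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Toy stand-in for the question's X / y (replace with your own columns).
X = ['the disk failed again'] * 12 + ['user cannot log in'] * 4
y = ['hardware'] * 12 + ['auth'] * 4

pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=1000, random_state=42)),
])

# StratifiedKFold keeps the class proportions the same in every fold,
# which matters when the smallest class has only ~30 samples.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# macro-F1 treats every class equally, so the minority class is not
# drowned out the way it is with plain accuracy.
scores = cross_val_score(pipe, X, y, cv=skf, scoring='f1_macro')
print('Macro-F1 per fold:', scores)
print('Mean macro-F1:', scores.mean())
```

With your real data, keep `n_splits` small enough that every fold still contains a few samples of the 30-entry class.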

Try oversampling or undersampling. Since the data is highly imbalanced, the model is biased toward the classes with more data points. After over/under-sampling, that bias is much smaller and accuracy improves.
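A minimal oversampling sketch using only `sklearn.utils.resample` (imbalanced-learn's `RandomOverSampler` does the same job); the toy frame below stands in for the training split of the question's dataframe. Two caveats: oversample only after the train/test split so duplicated rows cannot leak into the test set, and note that the question's dataframe is itself named `resample`, which would shadow this import, so one of them needs renaming:

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame; with the real data this would be the TRAIN split of
# the 'new description' / 'Subsystem' columns (never oversample the test set).
train = pd.DataFrame({
    'new description': ['disk error'] * 10 + ['login failed'] * 3,
    'Subsystem': ['hardware'] * 10 + ['auth'] * 3,
})

majority_size = train['Subsystem'].value_counts().max()

# Upsample every class (with replacement) to the size of the largest one.
balanced = pd.concat([
    resample(grp, replace=True, n_samples=majority_size, random_state=42)
    for _, grp in train.groupby('Subsystem')
])

print(balanced['Subsystem'].value_counts())
```

After this, both classes contribute the same number of rows to training, so the classifier no longer minimizes its loss mostly on the majority class.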

Alternatively, you can use an MLP instead of an SVM. It tends to give good results even when the data is imbalanced.

from sklearn.model_selection import StratifiedKFold
# StratifiedKFold keeps the class proportions the same in every fold,
# which is what you want with imbalanced classes
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
# X is the feature set and y is the target
from sklearn.model_selection import RepeatedStratifiedKFold
# repeated variant, if you want several randomized rounds of CV
# (plain RepeatedKFold does not stratify, so avoid it here)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=None)

for train_index, test_index in skf.split(X, y):
  #print("Train:", train_index, "Validation:", test_index)
  # .iloc is needed because X and y are pandas Series, not numpy arrays
  X_train, X_test = X.iloc[train_index], X.iloc[test_index]
  y_train, y_test = y.iloc[train_index], y.iloc[test_index]