Imbalanced Dataset for Classification Report?

Question

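I am trying to train a classification model that infers sentiment from text. My two labels are "1" for positive and "0" for negative. When I run the classification report, it produces the following output: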

            precision    recall  f1-score   support

           0       0.39      1.00      0.57      1081
           1       0.00      0.00      0.00      1660

    accuracy                           0.39      2741
   macro avg       0.20      0.50      0.28      2741
weighted avg       0.16      0.39      0.22      2741

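From the look of it, the model never predicts label 1. After reading other Stack Overflow posts I suspected an imbalanced dataset, but that does not seem to be the case: as far as I can tell, there is more data for label 1 than for label 0, so I am confused about what is going wrong.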
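The numbers in the report can be checked by hand. The sketch below assumes the classifier outputs 0 for every test row (an assumption at this point in the question) and recomputes precision, recall and F1 from the support counts alone. The helper `prf` is introduced here for illustration and is not part of scikit-learn.

```python
# Counts taken from the "support" column of the report above.
n_neg, n_pos = 1081, 1660      # rows whose true label is 0 / 1
total = n_neg + n_pos          # 2741

# Assumption: every prediction is 0.
tp0, fp0, fn0 = n_neg, n_pos, 0   # class 0: all true 0s found, all true 1s mislabelled as 0
tp1, fp1, fn1 = 0, 0, n_pos       # class 1: never predicted

def prf(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

p0, r0, f1_0 = prf(tp0, fp0, fn0)
p1, r1, f1_1 = prf(tp1, fp1, fn1)
accuracy = (tp0 + tp1) / total

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values match the report row for row, which is consistent with a model that predicts 0 for every input rather than with a class-imbalance effect.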

Here is the relevant code snippet:

import time
# Import the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Load from the filename
word2vec_df = pd.read_csv(word2vec_filename)
#Initialize the model
clf_decision_word2vec = DecisionTreeClassifier()

start_time = time.time()
# Fit the model
clf_decision_word2vec.fit(word2vec_df, Y_train['Sentiment'])
print("Time taken to fit the model with word2vec vectors: " + str(time.time() - start_time))

from sklearn.metrics import classification_report
test_features_word2vec = []
for index, row in X_test.iterrows():
    model_vector = np.mean([sg_w2v_model[token] for token in row['stemmed_tokens']], axis=0)
    if type(model_vector) is list:
        test_features_word2vec.append(model_vector)
    else:
        test_features_word2vec.append(np.array([0 for i in range(1000)]))
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print(classification_report(Y_test['Sentiment'],test_predictions_word2vec))
for num in test_predictions_word2vec:
  print(num)

At the end of that snippet I added a for loop to quickly inspect the values in test_predictions_word2vec, and they all appear to be zero.

I am not sure why every 1 is being missed. (I have included only a small subset here to show the 0s; the full output on my console contains no 1s.)
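Instead of printing every prediction, the distinct values can be counted in one call. This is a minimal sketch; the `predictions` array is a made-up stand-in for `test_predictions_word2vec`.

```python
import numpy as np

# Hypothetical predictions standing in for test_predictions_word2vec.
predictions = np.array([0, 0, 0, 0, 0])

# Count how often each predicted label occurs.
labels, counts = np.unique(predictions, return_counts=True)
summary = dict(zip(labels.tolist(), counts.tolist()))
print(summary)  # {0: 5}
```

A single key in the summary confirms that the model only ever emits one class.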

I assume it is because of this line:

test_features_word2vec.append(np.array([0 for i in range(1000)]))

It looks like it just appends zeros. Any help with this problem would be greatly appreciated!

P.S. Here is the train/test split snippet and its output:

from sklearn.model_selection import train_test_split
# Train Test Split Function
def split_train_test(split_data, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(split_data[['movie_title',  'critics_consensus',    'tomatometer_status',   'tokenized_text',   'stemmed_tokens']], 
                                                        split_data['Sentiment'], 
                                                        shuffle=shuffle_state,
                                                        test_size=test_size, 
                                                        random_state=42)
    print("Value counts for Train sentiments")
    print(Y_train.value_counts())
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    print(type(X_train))
    print(type(Y_train))
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    print(X_train.head())

    

    return X_train, X_test, Y_train, Y_test

# Call the train_test_split
X_train, X_test, Y_train, Y_test = split_train_test(split_data)

Value counts for Train sentiments
1    3805
0    2588
Name: Sentiment, dtype: int64
Value counts for Test sentiments
1    1660
0    1081

Edit: adding the output of 'word2vec_df'

Time taken to fit the model with word2vec vectors: 18.75066113471985
             0         1         2         3         4         5         6  \
0     0.009097 -0.014559 -0.021197  0.060744 -0.019707  0.102395  0.032876   
1     0.008102 -0.003382 -0.014465  0.066731 -0.024593  0.085185  0.023677   
2     0.013941 -0.005870 -0.001550  0.071456 -0.013130  0.094142  0.043876   
3     0.010195 -0.012312 -0.006310  0.069745 -0.012042  0.091056  0.034140   
4     0.006570 -0.010348 -0.016157  0.063258 -0.029932  0.098463  0.034469   
...        ...       ...       ...       ...       ...       ...       ...   
6388  0.000616 -0.000732 -0.006287  0.063298 -0.024651  0.055185 -0.000368   
6389  0.010891 -0.007447 -0.025401  0.063245 -0.028681  0.100588  0.029031   
6390  0.009561 -0.007456 -0.017953  0.076449 -0.029962  0.092921  0.040811   
6391  0.012995 -0.008843 -0.013079  0.058345 -0.027885  0.095623  0.024361   
6392  0.007881  0.003228 -0.013990  0.065434 -0.017051  0.090314  0.031072   

             7         8         9  ...       990       991       992  \
0     0.068392  0.120006  0.038360  ... -0.009643 -0.062597 -0.027641   
1     0.073042  0.101701  0.030647  ... -0.016221 -0.058624 -0.030524   
2     0.061665  0.117775  0.014894  ... -0.017982 -0.065756 -0.044015   
3     0.057861  0.117489  0.015533  ... -0.016098 -0.065427 -0.039047   
4     0.071677  0.100755  0.029278  ... -0.022267 -0.050894 -0.030283   
...        ...       ...       ...  ...       ...       ...       ...   
6388  0.058975  0.085394  0.028661  ... -0.016373 -0.050449 -0.008869   
6389  0.066502  0.106864  0.035051  ... -0.019567 -0.069977 -0.039586   
6390  0.061507  0.120290  0.030399  ...  0.000696 -0.054154 -0.041237   
6391  0.081338  0.111422  0.034755  ... -0.019699 -0.060718 -0.032540   
6392  0.054831  0.125640  0.032965  ... -0.002751 -0.084193 -0.040441   

           993       994       995       996       997       998       999  
0     0.078252  0.034909 -0.007387  0.057867 -0.052527 -0.072866 -0.010007  
1     0.075942  0.039987 -0.012127  0.042507 -0.054933 -0.072949 -0.010296  
2     0.065845  0.057452  0.002048  0.057100 -0.048846 -0.097791 -0.007207  
3     0.059275  0.051354  0.000843  0.050823 -0.046350 -0.090028 -0.005206  
4     0.066598  0.034786 -0.000143  0.056494 -0.046227 -0.070975 -0.007705  
...        ...       ...       ...       ...       ...       ...       ...  
6388  0.061066  0.017348 -0.018751  0.041088 -0.042949 -0.049911 -0.019149  
6389  0.071031  0.043249 -0.002368  0.040806 -0.046722 -0.085424  0.005255  
6390  0.076632  0.065442 -0.000805  0.050374 -0.047395 -0.085746  0.006119  
6391  0.083535  0.030460 -0.004143  0.047868 -0.058123 -0.069077 -0.012215  
6392  0.077906  0.075460 -0.013605  0.056237 -0.059329 -0.093779 -0.009383  

[6393 rows x 1000 columns]

Answer 1

You are correct:

np.array([0 for i in range(1000)])

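creates an array of all zeros. On top of that, np.mean returns a NumPy array, never a Python list, so the `type(model_vector) is list` check is always false and every test row gets this zero vector. Given identical inputs, the tree predicts the same class for all of them.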
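It is worth confirming what type `np.mean` returns, since the question's code keeps the averaged vector only when `type(model_vector) is list`. A minimal sketch, using a hypothetical two-word dictionary (`toy_vectors`) in place of `sg_w2v_model`:

```python
import numpy as np

# Hypothetical stand-in for sg_w2v_model: token -> 4-dimensional vector.
toy_vectors = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "film": np.array([0.3, 0.0, 0.1, 0.2]),
}

# Same averaging expression as in the question.
model_vector = np.mean([toy_vectors[t] for t in ["good", "film"]], axis=0)

is_list = type(model_vector) is list               # the question's check
is_ndarray = isinstance(model_vector, np.ndarray)  # a check that would pass

print(type(model_vector).__name__, is_list, is_ndarray)
```

Because the list check fails for every row, the `else` branch runs each time and only zero vectors reach `predict`.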

You should try:

import numpy as np
from sklearn.metrics import classification_report

# Average the word vectors of each test document's tokens
averaged_test_vector = X_test['stemmed_tokens'].apply(
        lambda x: np.mean([sg_w2v_model[tok] for tok in x], axis=0)
    ).tolist()

# Stack the per-document vectors into an (n_samples, n_features) matrix
test_features_word2vec = np.vstack(averaged_test_vector)

# Predict on the stacked features (not on an empty list)
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print(classification_report(Y_test['Sentiment'], test_predictions_word2vec))
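The averaging above fails if a token is missing from the model's vocabulary or if a document has no tokens at all. The following sketch shows one way to guard against both cases. The function `average_vector` and the dictionary `toy_vectors` are hypothetical, and the sketch assumes the real model supports `in` membership tests the way a dict does.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Fall back to zeros only for documents with no usable tokens.
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical stand-in for sg_w2v_model with 2-dimensional vectors.
toy_vectors = {
    "good": np.array([0.1, 0.2]),
    "film": np.array([0.3, 0.0]),
}

docs = [["good", "film"], ["unseen"], []]
features = np.vstack([average_vector(d, toy_vectors, 2) for d in docs])
print(features.shape)  # (3, 2)
```

Here the zero vector is a rare fallback rather than the value assigned to every row, so the classifier still sees real features for ordinary documents.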

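In general, I would use a lower-dimensional embedding if one is available; 1000 dimensions is a lot for a small dataset. I also would not use DecisionTreeClassifier, because it overfits quickly. I would start with LinearSVC or RandomForestClassifier.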
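A quick way to compare these classifiers is cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the averaged word2vec features, so the scores it prints say nothing about the asker's dataset. The sample size, feature count and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the document vectors (not the real data).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, random_state=42
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "LinearSVC": LinearSVC(max_iter=5000, random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated macro F1 for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```

Macro F1 is used here because it weights both classes equally, which makes a model that ignores one class easy to spot.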

python

sentiment-analysis

scikit-learn

multilabel-classification