Imbalanced Dataset for Classification Report?
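I am trying to build a classification model that infers sentiment from text. My two labels are "1" for positive and "0" for negative. When the classification report is run, it produces the following output: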
              precision    recall  f1-score   support

           0       0.39      1.00      0.57      1081
           1       0.00      0.00      0.00      1660

    accuracy                           0.39      2741
   macro avg       0.20      0.50      0.28      2741
weighted avg       0.16      0.39      0.22      2741
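So from the looks of it, the model never predicts label 1. After looking at other Stack Overflow posts, I thought this was an imbalanced dataset problem, but that does not seem to be the case. As far as I can tell, there is more data for label 1 than for label 0, so I am confused about what the problem is.
Below is the relevant code snippet: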
import time

import numpy as np
import pandas as pd
# Import the DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the training features from the filename
word2vec_df = pd.read_csv(word2vec_filename)
# Initialize the model
clf_decision_word2vec = DecisionTreeClassifier()

start_time = time.time()
# Fit the model
clf_decision_word2vec.fit(word2vec_df, Y_train['Sentiment'])
print("Time taken to fit the model with word2vec vectors: " + str(time.time() - start_time))

from sklearn.metrics import classification_report
test_features_word2vec = []
for index, row in X_test.iterrows():
    model_vector = np.mean([sg_w2v_model[token] for token in row['stemmed_tokens']], axis=0)
    if type(model_vector) is list:
        test_features_word2vec.append(model_vector)
    else:
        test_features_word2vec.append(np.array([0 for i in range(1000)]))
test_predictions_word2vec = clf_decision_word2vec.predict(test_features_word2vec)
print(classification_report(Y_test['Sentiment'],test_predictions_word2vec))
for num in test_predictions_word2vec:
    print(num)
At the end of that snippet I added a for loop to quickly inspect the data in test_predictions_word2vec, and it appears to be all zeros.
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
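I am not sure what is happening that causes all the 1s to be missed (I have only included a small subset here to show the 0s; in the full output on my console there are no 1s).
I assume it is because of this line: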
test_features_word2vec.append(np.array([0 for i in range(1000)]))
It looks like it only appends 0s. Any help with this issue would be greatly appreciated!
P.S. The train/test split snippet and its output:
from sklearn.model_selection import train_test_split

# Train Test Split Function
def split_train_test(split_data, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(
        split_data[['movie_title', 'critics_consensus', 'tomatometer_status', 'tokenized_text', 'stemmed_tokens']],
        split_data['Sentiment'],
        shuffle=shuffle_state,
        test_size=test_size,
        random_state=42)
    print("Value counts for Train sentiments")
    print(Y_train.value_counts())
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    print(type(X_train))
    print(type(Y_train))
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    print(X_train.head())
    return X_train, X_test, Y_train, Y_test

# Call the train_test_split
X_train, X_test, Y_train, Y_test = split_train_test(split_data)
Value counts for Train sentiments
1 3805
0 2588
Name: Sentiment, dtype: int64
Value counts for Test sentiments
1 1660
0 1081
Edit: adding the output of 'word2vec_df':
Time taken to fit the model with word2vec vectors: 18.75066113471985
0 1 2 3 4 5 6 \
0 0.009097 -0.014559 -0.021197 0.060744 -0.019707 0.102395 0.032876
1 0.008102 -0.003382 -0.014465 0.066731 -0.024593 0.085185 0.023677
2 0.013941 -0.005870 -0.001550 0.071456 -0.013130 0.094142 0.043876
3 0.010195 -0.012312 -0.006310 0.069745 -0.012042 0.091056 0.034140
4 0.006570 -0.010348 -0.016157 0.063258 -0.029932 0.098463 0.034469
... ... ... ... ... ... ... ...
6388 0.000616 -0.000732 -0.006287 0.063298 -0.024651 0.055185 -0.000368
6389 0.010891 -0.007447 -0.025401 0.063245 -0.028681 0.100588 0.029031
6390 0.009561 -0.007456 -0.017953 0.076449 -0.029962 0.092921 0.040811
6391 0.012995 -0.008843 -0.013079 0.058345 -0.027885 0.095623 0.024361
6392 0.007881 0.003228 -0.013990 0.065434 -0.017051 0.090314 0.031072
7 8 9 ... 990 991 992 \
0 0.068392 0.120006 0.038360 ... -0.009643 -0.062597 -0.027641
1 0.073042 0.101701 0.030647 ... -0.016221 -0.058624 -0.030524
2 0.061665 0.117775 0.014894 ... -0.017982 -0.065756 -0.044015
3 0.057861 0.117489 0.015533 ... -0.016098 -0.065427 -0.039047
4 0.071677 0.100755 0.029278 ... -0.022267 -0.050894 -0.030283
... ... ... ... ... ... ... ...
6388 0.058975 0.085394 0.028661 ... -0.016373 -0.050449 -0.008869
6389 0.066502 0.106864 0.035051 ... -0.019567 -0.069977 -0.039586
6390 0.061507 0.120290 0.030399 ... 0.000696 -0.054154 -0.041237
6391 0.081338 0.111422 0.034755 ... -0.019699 -0.060718 -0.032540
6392 0.054831 0.125640 0.032965 ... -0.002751 -0.084193 -0.040441
993 994 995 996 997 998 999
0 0.078252 0.034909 -0.007387 0.057867 -0.052527 -0.072866 -0.010007
1 0.075942 0.039987 -0.012127 0.042507 -0.054933 -0.072949 -0.010296
2 0.065845 0.057452 0.002048 0.057100 -0.048846 -0.097791 -0.007207
3 0.059275 0.051354 0.000843 0.050823 -0.046350 -0.090028 -0.005206
4 0.066598 0.034786 -0.000143 0.056494 -0.046227 -0.070975 -0.007705
... ... ... ... ... ... ... ...
6388 0.061066 0.017348 -0.018751 0.041088 -0.042949 -0.049911 -0.019149
6389 0.071031 0.043249 -0.002368 0.040806 -0.046722 -0.085424 0.005255
6390 0.076632 0.065442 -0.000805 0.050374 -0.047395 -0.085746 0.006119
6391 0.083535 0.030460 -0.004143 0.047868 -0.058123 -0.069077 -0.012215
6392 0.077906 0.075460 -0.013605 0.056237 -0.059329 -0.093779 -0.009383
[6393 rows x 1000 columns]
You are correct:
np.array([0 for i in range(1000)])
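creates an array of all zeros. More precisely, np.mean returns a NumPy ndarray, never a Python list, so the check type(model_vector) is list is always False. The else branch therefore runs for every row, every test sample becomes the same zero vector, and the classifier predicts the same class for all of them.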
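To see why the zero-vector branch is always taken, here is a small check. It uses a hypothetical two-word lookup table, `toy_vectors`, as a stand-in for `sg_w2v_model`; the tokens and values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for sg_w2v_model: maps a token to a small vector.
toy_vectors = {
    "good": np.array([1.0, 2.0, 3.0]),
    "film": np.array([3.0, 2.0, 1.0]),
}

# Same averaging expression as in the question, on the toy data.
model_vector = np.mean([toy_vectors[tok] for tok in ["good", "film"]], axis=0)

# np.mean returns an ndarray, so the `is list` check from the question is False.
is_list = type(model_vector) is list
is_ndarray = isinstance(model_vector, np.ndarray)
print(type(model_vector).__name__, is_list, is_ndarray)
```

Since the check can never succeed, the averaged vector is discarded for every row.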
You should try:
from sklearn.metrics import classification_report

# Average the word2vec vectors of each document's tokens
averaged_test_vector = X_test['stemmed_tokens'].apply(
    lambda x: np.mean([sg_w2v_model[tok] for tok in x], axis=0)
).tolist()
# Stack the per-document vectors into one 2-D array (n_samples, n_features)
averaged_test_vector = np.vstack(averaged_test_vector)
# Predict on the averaged vectors, not on an empty list
test_predictions_word2vec = clf_decision_word2vec.predict(averaged_test_vector)
print(classification_report(Y_test['Sentiment'], test_predictions_word2vec))
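If some test documents are empty or contain tokens the embedding model has never seen, plain averaging can fail or produce NaN. The following is a minimal sketch of a more defensive version, not part of the original answer. The helper `average_tokens` and the `toy_vectors` dict are hypothetical, and the sketch assumes the embedding lookup supports `in` and `[]` like a dict, which you should verify for your own model object.

```python
import numpy as np

def average_tokens(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # Fall back to zeros only when there is nothing to average.
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical toy data standing in for sg_w2v_model and X_test['stemmed_tokens'].
toy_vectors = {"good": np.array([1.0, 3.0]), "bad": np.array([3.0, 1.0])}
docs = [["good", "bad"], ["good", "unseen"], []]

features = np.vstack([average_tokens(d, toy_vectors, dim=2) for d in docs])
print(features.shape)
```

Here the zero vector is used only for the empty document, while the other rows keep their real averages.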
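In general, I would use lower-dimensional embeddings if they are available. 1000 dimensions is a lot for a small dataset.
I also would not use DecisionTreeClassifier, because it overfits quickly.
I would start with LinearSVC or RandomForestClassifier.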
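As a rough sketch of how the two suggested models could be compared, the snippet below cross-validates both on synthetic data from `make_classification`. The data, the hyperparameters (`max_iter=5000`, `n_estimators=50`), and `cv=3` are assumptions chosen for illustration; the scores say nothing about the sentiment dataset in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the averaged word2vec features and sentiment labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "LinearSVC": LinearSVC(max_iter=5000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each candidate.
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=3).mean()
    print(name, round(scores[name], 3))
```

On the real data, you would pass the stacked training vectors and `Y_train['Sentiment']` in place of `X` and `y`.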