ValueError encountered in SMOTE (imblearn.over_sampling)
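I have been trying to oversample my dataset because it is imbalanced. I am doing binary text classification and want a 1:1 ratio between my two classes. I am trying SMOTE to solve the problem.
I followed this tutorial:
https://beckernick.github.io/oversampling-modeling/
However, I ran into an error:
ValueError: could not convert string to float
Here is my code: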
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE

data = pd.read_csv("dataset.csv")

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 10))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

k_fold = KFold(n_splits=10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values
    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    sm = SMOTE(ratio=1.0)
    train_text_res, train_y_res = sm.fit_sample(train_text, train_y)

    nb_pipeline.fit(train_text, train_y)
    predictions = nb_pipeline.predict(test_text)

    nb_conf_mat += confusion_matrix(test_y, predictions)
    score1 = f1_score(test_y, predictions)
    nb_f1_scores.append(score1)

print("F1 Score: ", sum(nb_f1_scores)/len(nb_f1_scores))
print("Confusion Matrix: ")
print(nb_conf_mat)
Can anyone tell me what I am doing wrong? Without the two SMOTE lines, my program works fine.
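You should oversample after vectorizing the text data but before fitting the classifier. SMOTE works on numeric feature vectors, so it cannot be applied to raw strings. That means splitting the pipeline in your code. The relevant part of the code should look like this: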
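# The pipeline now only turns text into TF-IDF features; the classifier is fitted separately.
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 10))),
    ('tfidf_transformer', TfidfTransformer())
])

k_fold = KFold(n_splits=10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values
    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    # Vectorize first, so SMOTE receives numeric features instead of strings.
    vectorized_text = nb_pipeline.fit_transform(train_text)

    # Oversample the training fold only. In newer imbalanced-learn releases
    # these are spelled SMOTE(sampling_strategy=1.0) and sm.fit_resample(...).
    sm = SMOTE(ratio=1.0)
    train_text_res, train_y_res = sm.fit_sample(vectorized_text, train_y)

    # Fit the classifier on the resampled training data.
    clf = MultinomialNB()
    clf.fit(train_text_res, train_y_res)

    # Transform the test fold with the already fitted vectorizer; do not resample it.
    predictions = clf.predict(nb_pipeline.transform(test_text))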
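
To see why the order matters: SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours, which is only defined for numeric feature vectors. The sketch below is a simplified, pure-Python illustration of that single interpolation step, not imblearn's actual implementation. The helper name `interpolate` and the toy vectors are made up for the example.

```python
import random


def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the line segment between two samples.

    Simplified illustration of the interpolation step used by SMOTE;
    the function name and signature are hypothetical, not imblearn's API.
    """
    # Features must be numeric; float() raises ValueError on raw text.
    a = [float(v) for v in sample]
    b = [float(v) for v in neighbour]
    lam = rng.random()  # random position in [0, 1) along the segment
    return [x + lam * (y - x) for x, y in zip(a, b)]


rng = random.Random(0)

# Numeric (already vectorized) rows: interpolation works.
synthetic = interpolate([0.0, 1.0, 0.5], [1.0, 1.0, 0.0], rng)
print(synthetic)

# Raw sentences: there is nothing to interpolate, so the conversion fails.
try:
    interpolate(["this is relevant"], ["so is this"], rng)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```

The second call fails with the same kind of message as in the question, which is why the text has to go through the vectorizer before it reaches the sampler.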