如何对管道中的文本(不平衡组)进行重新采样?
How to resample text (imbalanced groups) in a pipeline?
我正在尝试使用 MultinomialNB 进行一些文本分类,但我 运行 遇到了问题,因为我的数据不平衡。 (为简单起见,下面是一些示例数据。实际上,我的数据要大得多。)我正在尝试使用过采样对我的数据进行重新采样,理想情况下我希望将它构建到这个管道中。
下面的管道在没有过度采样的情况下工作正常,但同样,在现实生活中我的数据需要它。这是非常不平衡的。
使用当前代码,我不断收到错误消息:"TypeError: All intermediate steps should be transformers and implement fit and transform."
如何将 RandomOverSampler 构建到此管道中?
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data,columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('RUS', RandomOverSampler()),
('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
您应该使用 imblearn
包中实现的流水线,而不是 sklearn
中的流水线。例如,这段代码运行良好:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data, columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(df['Description'],
df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('RUS', RandomOverSampler()),
('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
我正在尝试使用 MultinomialNB 进行一些文本分类,但我 运行 遇到了问题,因为我的数据不平衡。 (为简单起见,下面是一些示例数据。实际上,我的数据要大得多。)我正在尝试使用过采样对我的数据进行重新采样,理想情况下我希望将它构建到这个管道中。
下面的管道在没有过度采样的情况下工作正常,但同样,在现实生活中我的数据需要它。这是非常不平衡的。
使用当前代码,我不断收到错误消息:"TypeError: All intermediate steps should be transformers and implement fit and transform."
如何将 RandomOverSampler 构建到此管道中?
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data,columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('RUS', RandomOverSampler()),
('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
您应该使用 imblearn
包中实现的流水线,而不是 sklearn
中的流水线。例如,这段代码运行良好:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
['small fruits', 'grapes']]
df = pd.DataFrame(data, columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(df['Description'],
df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('RUS', RandomOverSampler()),
('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))