拆分数据集后的过采样 - 文本分类

Question

我对对数据集进行过采样的步骤有一些疑问。我所做的如下：

# Separate input features and target
y_up = df.Label

X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)

# setting up testing and training sets

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]


# upsample minority
class_1_upsampled = resample(class_1,
                          replace=True, 
                          n_samples=len(class_0), 
                          random_state=27) #

# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])

因为我的数据集看起来像：

Label     Text 
1        bla bla bla
0        once upon a time 
1        some other sentences
1        a few sentences more
1        this is my dataset!

我应用了矢量化器将字符串转换为数字：

X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]

X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)

然后我应用逻辑回归函数：

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

但是，我在这一步遇到了以下错误：

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_up)

ValueError: X has 3021 features per sample; expecting 5542

因为有人告诉我应该在将我的数据集拆分为 train e test 之后应用过采样，所以我没有对测试集进行向量化。我的疑惑如下：

稍后考虑对测试集进行矢量化是否正确：X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
将数据集拆分为训练和测试后考虑过采样是否正确？

或者，我尝试使用 Smote 功能。下面的代码有效，但如果可能的话，我更愿意考虑过采样，而不是 SMOTE。

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, y_train_up)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)

nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test_up))
print(accuracy_score(y_test_up,y_pred))

如有任何意见和建议，我们将不胜感激。谢谢

Answer 1

最好对整个数据集做countVectorizing和transform，拆分成test和train，保持稀疏矩阵，不要再转换回data.frame。

例如这是一个数据集：

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
                           'at least old','data scientist years','data science is data wrangling', 
                           'This rings particularly','true for data science leaders',
                           'who watch their data','scientists spend days',
                           'painstakingly picking apart','ossified corporate datasets',
                           'arcane Excel spreadsheets','Does data science really',
                           'they just delegate the job','Data Is More Than Just Numbers',
                           'The reason that',
                           'data wrangling is so difficult','data is more than text and numbers'],
                   'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})

我们执行矢量化和转换，然后拆分：

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values, 
                                                              test_size=0.2,random_state=42)

上采样可以通过重新采样少数的索引来完成类:

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])

并且预测有效：

upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])

如果您担心数据泄漏，那就是来自测试的一些信息实际上通过使用 TfidfTransformer() 进入了训练。老实说还没有看到具体的证明或演示，但下面是您单独应用 tfid 的替代方法：

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values, 
                                                              test_size=0.2,random_state=42)

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsamle_y = y_train_up[up_idx]

upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain,upsamle_y)

X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)

拆分数据集后的过采样 - 文本分类

Oversampling after splitting the dataset - Text classification

python

vectorization

scikit-learn

text-classification

logistic-regression