How to return text data as output after oversampling using SMOTE?
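I have multi-class text data and I want to apply SMOTE because some labels are in the minority. I have done this, but I get a sparse matrix as my output.
Is there a way to get the text data back after SMOTE?
Here is my code sample: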
X_train = df['transcript']
y_train = df['label']
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
SMOTE.fit_sample uses scikit-learn's label_binarize internally: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/12b2e0d/imblearn/base.py#L87
You should apply sklearn.preprocessing.LabelBinarizer to the y values manually before applying SMOTE.
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelBinarizer
sm = SMOTE(random_state = 2)
lb = LabelBinarizer()
y_train_bin = lb.fit_transform(y_train)
X_train_res, y_train_res_bin = sm.fit_sample(X_train, y_train_bin)
You can then recover the text labels with the fitted LabelBinarizer's inverse_transform method:
y_train_res = lb.inverse_transform(y_train_res_bin)
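Actually, SMOTE expects X to contain only numeric data. The labels are not the problem; they can be strings.
Read here to understand how SMOTE works internally. Basically, it creates a synthetic data point for the minority class as a convex combination of a sample and one of its selected neighbors.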
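To make "convex combination" concrete, here is a minimal sketch in plain Python of how one synthetic point could be interpolated between a minority sample and one of its neighbors. The function name smote_like_point and the sample vectors are hypothetical, and this is not imbalanced-learn's code, only the interpolation idea. It also shows why X has to be numeric: the arithmetic below has no meaning for raw strings.

```python
import random


def smote_like_point(x, neighbor, rng):
    """Return a point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    # x + lam * (neighbor - x) is a convex combination of the two points
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(2)
x = [1.0, 0.0, 2.0]         # hypothetical minority sample
neighbor = [3.0, 4.0, 2.0]  # hypothetical nearest neighbor from the same class
synthetic = smote_like_point(x, neighbor, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two inputs, so the new point is a blend of two samples rather than a copy of either.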
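So use TfidfVectorizer or CountVectorizer to convert your text data (the transcripts) to numbers. You can use the inverse_transform method of these vectorizers to get the text back, but the catch is that the original word order is lost.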
import pandas as pd
df = pd.DataFrame({'transcripts': ['I want to check this',
                                   'how about one more sentence',
                                   'hopefully this works well fr you',
                                   'I want to check this',
                                   'This is the last sentence or transcript'],
                   'labels': ['good', 'bad', 'bad', 'good', 'bad']})
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(df['transcripts'])
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors=1, random_state = 2)
# fit_sample was renamed to fit_resample in later imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X, df.labels)
vec.inverse_transform(X_train_res)
# [array(['this', 'check', 'to', 'want'], dtype='<U10'),
# array(['sentence', 'more', 'one', 'about', 'how'], dtype='<U10'),
# array(['you', 'fr', 'well', 'works', 'hopefully', 'this'], dtype='<U10'),
# array(['this', 'check', 'to', 'want'], dtype='<U10'),
# array(['transcript', 'or', 'last', 'the', 'is', 'sentence', 'this'],
# dtype='<U10'),
# array(['want', 'to', 'check', 'this'], dtype='<U10')]
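The lost word order follows from the bag-of-words representation itself: a vector records which words occur, not where they occur. The sketch below is a simplified stand-in for CountVectorizer written in plain Python. The helpers fit_vocabulary, to_counts, and to_tokens are hypothetical, and scikit-learn's real tokenizer behaves differently (for example, by default it drops one-letter tokens such as "I").

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_counts(doc, vocab):
    """Turn a document into a count vector with one column per vocabulary token."""
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]


def to_tokens(vector, vocab):
    """Recover the tokens with a non-zero count, in column order."""
    return [tok for tok, i in vocab.items() if vector[i] > 0]


docs = ['I want to check this', 'check this I want to']
vocab = fit_vocabulary(docs)
vec_a = to_counts(docs[0], vocab)
vec_b = to_counts(docs[1], vocab)

print(vec_a == vec_b)             # both sentences map to the same vector
print(to_tokens(vec_a, vocab))    # tokens come back in column order
```

Because two different sentences collapse to the same vector, no inverse mapping can tell them apart. A synthetic SMOTE row is further removed still, since it blends two vectors and never corresponded to a real sentence.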