Oversampling for text classification in Python?
I have a DataFrame of text that I want to classify, but I need to oversample it first. Here is some sample data:
import pandas as pd

# 10 'Positive' rows and 4 'Negative' rows, so the classes are imbalanced
df = pd.DataFrame({
    'Features': ['I am going to class today'] * 10 + ['I am not going to class today'] * 4,
    'Class': ['Positive'] * 10 + ['Negative'] * 4,
})
df
Features Class
0 I am going to class today Positive
1 I am going to class today Positive
2 I am going to class today Positive
3 I am going to class today Positive
4 I am going to class today Positive
5 I am going to class today Positive
6 I am going to class today Positive
7 I am going to class today Positive
8 I am going to class today Positive
9 I am going to class today Positive
10 I am not going to class today Negative
11 I am not going to class today Negative
12 I am not going to class today Negative
13 I am not going to class today Negative
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(df['Features'], df['Class'])
# summarize class distribution
print(Counter(y_over))
But this doesn't work; it raises ValueError: Expected 2D array, got 1D array instead. How can I oversample this data?
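I found the problem: I needed to reshape my data so that the features are passed as a 2D array with a single column.
X_over, y_over = oversample.fit_resample(df['Features'].values.reshape(-1, 1), df['Class'])
It works now: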
Counter({'Positive': 10, 'Negative': 10})
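For anyone wondering why the reshape matters, here is a minimal sketch that only inspects shapes. It rebuilds the sample DataFrame from the question; the variable names X_1d, X_reshaped and X_frame are mine, purely for illustration. Selecting the column with double brackets is shown as another way to get a 2D object. I would expect fit_resample to accept that as well, but that is an assumption I have not tested here.

```python
import pandas as pd

# Stand-in for the question's DataFrame (same column names and row counts).
df = pd.DataFrame({
    "Features": ["I am going to class today"] * 10
                + ["I am not going to class today"] * 4,
    "Class": ["Positive"] * 10 + ["Negative"] * 4,
})

# Single brackets give a Series, which has only one dimension.
X_1d = df["Features"]

# The fix from above: a NumPy array with 14 rows and 1 column.
X_reshaped = df["Features"].values.reshape(-1, 1)

# Double brackets give a one-column DataFrame, which is also 2D.
X_frame = df[["Features"]]

print(X_1d.shape)        # (14,)
print(X_reshaped.shape)  # (14, 1)
print(X_frame.shape)     # (14, 1)
```

The 1D shape (14,) is what triggered the ValueError; both of the other forms have the (n_samples, n_features) layout that the error message asks for.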