Imbalance in a multi-class classification problem - four target levels
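My data is imbalanced, as shown below. Whenever I try to use ADASYN it shows an error. Do we need to pass any particular parameters for this? Sometimes it runs for a long time, but there is still no response even after 40 minutes of running.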
                     counts  percentage
Enquiry Assigned      91284   75.902382
Test Drive Provided   25274   21.015258
Test Drive Arranged    3434    2.855361
Booked                  266    0.221178
Test Ride Provided        7    0.005820
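Please suggest how we can proceed with Python code to solve this. From other people's recommendations I have heard:
- Sampling can be done between two levels at a time, and then repeated on the same levels
- Reducing the sampling of the 75% class might help?
- Or is there any solution using skmultilearn?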
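To make the first two suggestions concrete, here is a minimal sketch of how per-class target sizes might be written down before any resampling is run. The counts are copied from the table above; the helper `build_targets` and the choice of the second-largest class as the common size are my own assumptions for illustration only. As far as I know, imbalanced-learn samplers accept a dict of this kind (class label mapped to the desired number of samples) as `sampling_strategy`.

```python
from collections import Counter

# Class counts copied from the table in the question.
counts = Counter({
    "Enquiry Assigned": 91284,
    "Test Drive Provided": 25274,
    "Test Drive Arranged": 3434,
    "Booked": 266,
    "Test Ride Provided": 7,
})


def build_targets(counts, cap):
    """Return (under, over) dicts of desired per-class sample counts.

    Classes larger than `cap` are listed in `under` (to be cut down to `cap`);
    classes smaller than `cap` are listed in `over` (to be raised to `cap`).
    Classes already at `cap` appear in neither dict.
    """
    under = {label: cap for label, n in counts.items() if n > cap}
    over = {label: cap for label, n in counts.items() if n < cap}
    return under, over


# Hypothetical choice: use the second-largest class as the common target size.
cap = sorted(counts.values(), reverse=True)[1]
under_targets, over_targets = build_targets(counts, cap)

print("under-sample to:", under_targets)
print("over-sample to: ", over_targets)
```

With a cap this large, the 7-row and 266-row classes would consist almost entirely of synthetic rows, so a smaller cap may be worth trying first.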
Code:
from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X -> independent variables as a DataFrame
    # y -> dependent variable as a pandas Series
    # `ratio` is not passed: newer imbalanced-learn releases use sampling_strategy instead
    sm = ADASYN(sampling_strategy='all', random_state=None, n_neighbors=5)
    # Return the resampled data so the caller can use it
    return sm.fit_resample(X, y)

X_adassin, y_adassin = makeOverSamplesADASYN(X, data_dummyvar['Sales Stage'])
print(X_adassin.shape)
print(y_adassin.shape)
o/p ===> This runs for a long time without producing a result; please advise.
I downsampled the largest class using the following code.
import pandas as pd
from sklearn.utils import resample

# "data_dummyvar" is my dataframe with shape (120265, 894)
df_majority = data_dummyvar[data_dummyvar['Sales Stage'] == 'Enquiry Assigned']
df_majority.shape

# Downsample the majority class
# replace=False: sample without replacement
# n_samples: to match the minority class
# random_state: reproducible results
df_majority_downsampled = resample(df_majority, replace=False, n_samples=25289, random_state=123)
df_majority_downsampled.shape

df_minority = data_dummyvar[data_dummyvar['Sales Stage'] != 'Enquiry Assigned']
df_minority['Sales Stage'].value_counts()

df_first_scaling = pd.concat([df_majority_downsampled, df_minority], ignore_index=True)
g = df_first_scaling['Sales Stage']
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)], axis=1, keys=('counts', 'percentage'))
print(df)
The code above gives the following result: o/p ===>>
                     counts  percentage
Enquiry Assigned      25289   46.598489
Test Drive Provided   25281   46.583748
Test Drive Arranged    3434    6.327621
Booked                  266    0.490142
The 'Enquiry Assigned' class has now been downsampled.
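For reuse, the downsampling and the count/percentage summary could be wrapped in two small helpers. This is only a sketch: `downsample_class` and `class_summary` are names I made up, the toy frame stands in for data_dummyvar, and pandas' own `DataFrame.sample` is used here in place of `sklearn.utils.resample`.

```python
import pandas as pd


def downsample_class(df, label_col, label, n_samples, seed=123):
    """Randomly keep n_samples rows of one class; other classes are untouched."""
    is_target = df[label_col] == label
    kept = df[is_target].sample(n=n_samples, replace=False, random_state=seed)
    return pd.concat([kept, df[~is_target]], ignore_index=True)


def class_summary(series):
    """Per-class counts and percentages, laid out like the tables above."""
    return pd.concat(
        [series.value_counts(), series.value_counts(normalize=True).mul(100)],
        axis=1,
        keys=("counts", "percentage"),
    )


# Toy stand-in for data_dummyvar: 8 majority rows and 2 minority rows.
toy = pd.DataFrame({
    "Sales Stage": ["Enquiry Assigned"] * 8 + ["Booked"] * 2,
    "feature": range(10),
})

reduced = downsample_class(toy, "Sales Stage", "Enquiry Assigned", n_samples=3)
print(class_summary(reduced["Sales Stage"]))
```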
Now we need to run the SMOTE/ADASYN algorithm twice on our data "df_first_scaling", since we still have three more classes, as shown below.
from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X -> independent variables as a DataFrame
    # y -> dependent variable as a pandas Series
    # 'minority' oversamples only the smallest class in each pass
    sm = ADASYN(sampling_strategy='minority', random_state=None, n_neighbors=5)
    # Return the result instead of writing to globals
    return sm.fit_resample(X, y)

# X must hold the feature columns of df_first_scaling (same rows as y)
X_adassin_1, y_adassin_1 = makeOverSamplesADASYN(X, df_first_scaling['Sales Stage'])  # function call
print(X_adassin_1.shape)
print(y_adassin_1.shape)
This gives o/p shapes ==>
(79334, 893)
(79334,)
Running the same method again on the updated dataset gives a sample df of shape (101229, 893) & (101229,).