Imbalance in a multi-class classification problem: four target levels

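My data is imbalanced, as shown below. Whenever I try to use ADASYN it shows an error. Do we need to pass any parameter for this? Sometimes it runs for a long time, but there is still no response even after the code has been running for 40 minutes.
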
                     counts  percentage
Enquiry Assigned      91284   75.902382
Test Drive Provided   25274   21.015258
Test Drive Arranged    3434    2.855361
Booked                  266    0.221178
Test Ride Provided        7    0.005820
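
As a quick sanity check on the table above, here is a minimal, library-free sketch that recomputes the percentages and shows how many majority rows exist per row of each class. The `summary` structure and the `majority_ratio` name are only illustrative and are not part of the original code.

```python
# Class counts copied from the table above.
counts = {
    "Enquiry Assigned": 91284,
    "Test Drive Provided": 25274,
    "Test Drive Arranged": 3434,
    "Booked": 266,
    "Test Ride Provided": 7,
}

total = sum(counts.values())
majority = max(counts.values())

# Percentage share of each class, plus the number of majority rows per row of that class.
summary = {
    label: {"percentage": 100 * n / total, "majority_ratio": majority / n}
    for label, n in counts.items()
}

for label, row in summary.items():
    print(f"{label:<20} {row['percentage']:10.6f} {row['majority_ratio']:10.1f}")
```

The total matches the 120265 rows of the dataframe, and the smallest class is outnumbered by the majority by roughly four orders of magnitude.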

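Please suggest how we can proceed with Python code to solve this. From other people's recommendations I have heard:

  1. Sampling can be done between two levels at a time, and then the same can be iterated over the remaining levels
  2. Downsampling the 75% class might help?
  3. Or is there any solution using skmultilearn?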
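
One way to read suggestions 1 and 2 together is to pick a target count per class: cap the large classes and raise the small ones. The sketch below only plans those targets. `plan_targets`, `cap`, and `floor` are hypothetical names, and the chosen values are assumptions made for illustration, not recommendations. The two dictionaries are in the form that imblearn samplers accept for `sampling_strategy` (class label mapped to the desired number of samples).

```python
def plan_targets(counts, cap, floor):
    """Plan per-class target sizes.

    Classes larger than `cap` are to be under-sampled down to `cap`;
    classes smaller than `floor` are to be over-sampled up to `floor`.
    Classes already between the two bounds are left out of both plans.
    """
    if floor > cap:
        raise ValueError("floor must not exceed cap")
    under = {label: cap for label, n in counts.items() if n > cap}
    over = {label: floor for label, n in counts.items() if n < floor}
    return under, over


# Class counts copied from the question.
counts = {
    "Enquiry Assigned": 91284,
    "Test Drive Provided": 25274,
    "Test Drive Arranged": 3434,
    "Booked": 266,
    "Test Ride Provided": 7,
}

# Assumed bounds: cap at the second-largest class, floor at the third-largest.
under, over = plan_targets(counts, cap=25274, floor=3434)
print(under)
print(over)

# Possible use with imblearn (not executed here; X and y come from the question):
#   from imblearn.under_sampling import RandomUnderSampler
#   from imblearn.over_sampling import SMOTE
#   X_u, y_u = RandomUnderSampler(sampling_strategy=under).fit_resample(X, y)
#   X_r, y_r = SMOTE(sampling_strategy=over).fit_resample(X_u, y_u)
```

Under-sampling first means the neighbour search in the over-sampling step runs on fewer rows, which may help with the long run time, although that is not guaranteed with 894 columns.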

Code:

from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X → independent variables as a DataFrame
    # y → dependent variable as a pandas Series
    # `ratio` is not passed: it was deprecated in favour of `sampling_strategy`
    # and later removed from imblearn
    sm = ADASYN(sampling_strategy='all', random_state=None, n_neighbors=5)
    # Return the resampled data; otherwise it stays local to the function
    return sm.fit_resample(X, y)

X_adassin, y_adassin = makeOverSamplesADASYN(X, data_dummyvar['Sales Stage'])

print(X_adassin.shape)
print(y_adassin.shape)

Output ===> This runs for a very long time without producing a result. Please advise.

I downsampled the top entry (the majority class) using the following code.

### " data_dummyvar " is my dataframe with the shape of (120265, 894)

import pandas as pd
from sklearn.utils import resample

# Rows of the majority class
df_majority = data_dummyvar[data_dummyvar['Sales Stage'] == 'Enquiry Assigned']
df_majority.shape

# Downsample majority class
#   replace=False: sample without replacement
#   n_samples: to match minority class
#   random_state: reproducible results
df_majority_downsampled = resample(df_majority, replace=False, n_samples=25289, random_state=123)
df_majority_downsampled.shape

# All remaining classes
df_minority = data_dummyvar[data_dummyvar['Sales Stage'] != 'Enquiry Assigned']
df_minority['Sales Stage'].value_counts()

# Recombine, then report counts and percentages per class
df_first_scaling = pd.concat([df_majority_downsampled, df_minority], ignore_index=True)
g = df_first_scaling['Sales Stage']
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))
print(df)

The code above gives the following result: Output ===>>

                     counts  percentage
Enquiry Assigned      25289   46.598489
Test Drive Provided   25281   46.583748
Test Drive Arranged    3434    6.327621
Booked                  266    0.490142

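The 'Enquiry Assigned' entry has now been downsampled.

Now we need to run the SMOTE/ADASYN algorithm twice on our data "df_first_scaling", since we still have three entries left to balance, as shown below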

from imblearn.over_sampling import ADASYN

def makeOverSamplesADASYN(X, y):
    # X → independent variables as a DataFrame
    # y → dependent variable as a pandas Series
    # sampling_strategy='minority' resamples only the smallest class
    sm = ADASYN(sampling_strategy='minority', random_state=None, n_neighbors=5)
    # Return the results instead of writing them to globals
    return sm.fit_resample(X, y)

# X is assumed to hold the feature columns of df_first_scaling, in the same row order as y
X_adassin_1, y_adassin_1 = makeOverSamplesADASYN(X, df_first_scaling['Sales Stage'])  # function call

print(X_adassin_1.shape)
print(y_adassin_1.shape)

This gives the output shapes ==>

(79334, 893)
(79334,) 

Running the same method again on the updated dataset gives a sample df with shapes (101229, 893) & (101229,)
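
The row counts reported above can be roughly cross-checked with simple arithmetic. The sketch below idealises each `sampling_strategy='minority'` round as "raise the smallest class to the size of the largest". `oversample_minority_once` is a hypothetical helper that operates on counts only and generates no synthetic rows. ADASYN chooses the number of synthetic points adaptively, so the real totals need not match this idealised balance exactly.

```python
def oversample_minority_once(counts):
    """Return new class counts after raising the smallest class to the largest class size."""
    updated = dict(counts)
    smallest = min(updated, key=updated.get)
    updated[smallest] = max(updated.values())
    return updated


# Class counts after downsampling, copied from the second table.
counts = {
    "Enquiry Assigned": 25289,
    "Test Drive Provided": 25281,
    "Test Drive Arranged": 3434,
    "Booked": 266,
}

round_1 = oversample_minority_once(counts)   # 'Booked' is raised
round_2 = oversample_minority_once(round_1)  # 'Test Drive Arranged' is raised

print(sum(counts.values()), sum(round_1.values()), sum(round_2.values()))
```

The idealised totals are 79293 and 101148, close to the observed 79334 and 101229, which would be consistent with ADASYN's approximate balancing. imblearn also accepts `sampling_strategy='not majority'`, which targets every class except the largest in a single call and might replace the repeated rounds.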