Create binary categorical variables in pandas - combine 4 categories into 2

I have the following pandas DataFrame, in which the 'Status' column contains 4 categorical values: 'Open', 'Closed', 'Solved' and 'Pending'.

0   250635                      Comcast Cable Internet Speeds  22-04-15   
1   223441       Payment disappear - service got disconnected  04-08-15   
2   242732                                  Speed and Service  18-04-15   
3   277946  Comcast Imposed a New Usage Cap of 300GB that ...  05-07-15   
4   307175         Comcast not working and no service to boot  26-05-15   

  Date_month_year         Time        Received Via      City     State  \
0       22-Apr-15   3:53:50 PM  Customer Care Call  Abingdon  Maryland   
1       04-Aug-15  10:22:56 AM            Internet   Acworth   Georgia   
2       18-Apr-15   9:55:47 AM            Internet   Acworth   Georgia   
3       05-Jul-15  11:59:35 AM            Internet   Acworth   Georgia   
4       26-May-15   1:25:26 PM            Internet   Acworth   Georgia   

   Zip code  Status Filing on Behalf of Someone  
0     21009  Closed                          No  
1     30102  Closed                          No  
2     30101  Closed                         Yes  
3     30101    Open                         Yes  
4     30101  Solved                          No  

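I want to merge the 'Open' and 'Pending' categories into a single 'Open' column, and 'Closed' and 'Solved' into a single 'Closed' column, each holding binary 0/1 values. If I use pd.get_dummies(df, columns=['Status']), I get the output below, with 4 new columns for the 4 values, but I only want 2, as described above. I couldn't find a previous thread on this here, so please suggest any possible approach. Thanks.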

0          22-Apr-15   3:53:50 PM  Customer Care Call    Abingdon  Maryland   
1          04-Aug-15  10:22:56 AM            Internet     Acworth   Georgia   
2          18-Apr-15   9:55:47 AM            Internet     Acworth   Georgia   
3          05-Jul-15  11:59:35 AM            Internet     Acworth   Georgia   
4          26-May-15   1:25:26 PM            Internet     Acworth   Georgia   
             ...          ...                 ...         ...       ...   
2219       04-Feb-15   9:13:18 AM  Customer Care Call  Youngstown   Florida   
2220       06-Feb-15   1:24:39 PM  Customer Care Call   Ypsilanti  Michigan   
2221       06-Sep-15   5:28:41 PM            Internet   Ypsilanti  Michigan   
2222       23-Jun-15  11:13:30 PM  Customer Care Call   Ypsilanti  Michigan   
2223       24-Jun-15  10:28:33 PM  Customer Care Call   Ypsilanti  Michigan   

      Zip code Filing on Behalf of Someone  Status_Closed  Status_Open  \
0        21009                          No              1            0   
1        30102                          No              1            0   
2        30101                         Yes              1            0   
3        30101                         Yes              0            1   
4        30101                          No              0            0   
       ...                         ...            ...          ...   
2219     32466                          No              1            0   
2220     48197                          No              0            0   
2221     48197                          No              0            0   
2222     48197                          No              0            0   
2223     48198                         Yes              0            1   

      Status_Pending  Status_Solved  
0                  0              0  
1                  0              0  
2                  0              0  
3                  0              0  
4                  0              1  
             ...            ...  
2219               0              0  
2220               0              1  
2221               0              1  
2222               0              1  
2223               0              0  

The basic principle is as follows:

# Create the target columns first so every row has a default value
df['Open'] = False
df['Closed'] = False

for i, row in df.iterrows():
    # 'Open' and 'Pending' both count as open
    if row['Status'] in ('Open', 'Pending'):
        df.at[i, 'Open'] = True  # or any other value
    # 'Closed' and 'Solved' both count as closed
    elif row['Status'] in ('Closed', 'Solved'):
        df.at[i, 'Closed'] = True  # or any other value

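You iterate over the rows and check the 'Status' value of each one; when it matches, you set a boolean in the new 'Open' (or 'Closed') column. Of course, those columns have to exist before the loop assigns to them.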

Off the top of my head:

df['Status_open'] = 0
df['Status_closed'] = 0
df.loc[(df['Status'] == 'Open') | (df['Status'] == 'Pending'), 'Status_open'] = 1
df.loc[(df['Status'] == 'Closed') | (df['Status'] == 'Solved'), 'Status_closed'] = 1

(Not tested on a computer)
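
The same idea can probably be written a little more compactly with Series.isin, which tests membership in a list in a single call. This is only a sketch: the five-row DataFrame is made-up sample data standing in for the real one, and the column names Status_open and Status_closed are taken from the snippet above.

```python
import pandas as pd

# Made-up sample data (assumption); only the 'Status' column matters here
df = pd.DataFrame({'Status': ['Closed', 'Open', 'Solved', 'Pending', 'Closed']})

# isin returns a boolean Series; astype(int) converts True/False to 1/0
df['Status_open'] = df['Status'].isin(['Open', 'Pending']).astype(int)
df['Status_closed'] = df['Status'].isin(['Closed', 'Solved']).astype(int)

print(df)
```

A row whose Status is missing or is some fifth value ends up with 0 in both columns, which matches the behaviour of the df.loc version.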

I think you could do it like this:

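open_ls = ['Open', 'Pending']
# Anything not in open_ls (including missing values) is labelled 'Closed'
df['New_Status'] = df['Status'].apply(lambda x: 'Open' if x in open_ls else 'Closed')
# get_dummies returns a new DataFrame, so assign the result
df = pd.get_dummies(df, columns=['New_Status'])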
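
If you would rather not treat every non-open value as 'Closed', one variation is to spell out the full mapping in a dictionary and use Series.map before calling get_dummies. The sketch below assumes made-up sample data, and the name status_map is purely illustrative. Passing dtype=int asks get_dummies for 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Made-up sample data (assumption); only the 'Status' column matters here
df = pd.DataFrame({'Status': ['Closed', 'Open', 'Solved', 'Pending', 'Closed']})

# Explicit 4 -> 2 mapping; status_map is an illustrative name
status_map = {
    'Open': 'Open',
    'Pending': 'Open',
    'Closed': 'Closed',
    'Solved': 'Closed',
}

# Values missing from the dictionary become NaN instead of silently 'Closed'
df['New_Status'] = df['Status'].map(status_map)

# One 0/1 column per merged category
dummies = pd.get_dummies(df, columns=['New_Status'], dtype=int)
print(dummies)
```

With this version, an unexpected status value maps to NaN and gets 0 in both dummy columns, so it is easier to spot than with the else branch of the lambda.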