Create binary categorical variables in pandas - combine 4 categories into 2
I have the following pandas DataFrame, in which the 'Status' column contains 4 categorical values: 'Open', 'Closed', 'Solved' and 'Pending'.
0 250635 Comcast Cable Internet Speeds 22-04-15
1 223441 Payment disappear - service got disconnected 04-08-15
2 242732 Speed and Service 18-04-15
3 277946 Comcast Imposed a New Usage Cap of 300GB that ... 05-07-15
4 307175 Comcast not working and no service to boot 26-05-15
Date_month_year Time Received Via City State \
0 22-Apr-15 3:53:50 PM Customer Care Call Abingdon Maryland
1 04-Aug-15 10:22:56 AM Internet Acworth Georgia
2 18-Apr-15 9:55:47 AM Internet Acworth Georgia
3 05-Jul-15 11:59:35 AM Internet Acworth Georgia
4 26-May-15 1:25:26 PM Internet Acworth Georgia
Zip code Status Filing on Behalf of Someone
0 21009 Closed No
1 30102 Closed No
2 30101 Closed Yes
3 30101 Open Yes
4 30101 Solved No
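I want to combine the 'Open' and 'Pending' categories into a single 'Open' column, and 'Closed' and 'Solved' into a single 'Closed' column, each holding binary 0 and 1 values. If I use pd.get_dummies(df, columns=['Status']), I get the following output, with 4 new columns for the 4 values, but I only want 2, as described above. I couldn't find any previous thread on this here, so please suggest any possible approach. Thanks.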
0 22-Apr-15 3:53:50 PM Customer Care Call Abingdon Maryland
1 04-Aug-15 10:22:56 AM Internet Acworth Georgia
2 18-Apr-15 9:55:47 AM Internet Acworth Georgia
3 05-Jul-15 11:59:35 AM Internet Acworth Georgia
4 26-May-15 1:25:26 PM Internet Acworth Georgia
... ... ... ... ...
2219 04-Feb-15 9:13:18 AM Customer Care Call Youngstown Florida
2220 06-Feb-15 1:24:39 PM Customer Care Call Ypsilanti Michigan
2221 06-Sep-15 5:28:41 PM Internet Ypsilanti Michigan
2222 23-Jun-15 11:13:30 PM Customer Care Call Ypsilanti Michigan
2223 24-Jun-15 10:28:33 PM Customer Care Call Ypsilanti Michigan
Zip code Filing on Behalf of Someone Status_Closed Status_Open \
0 21009 No 1 0
1 30102 No 1 0
2 30101 Yes 1 0
3 30101 Yes 0 1
4 30101 No 0 0
... ... ... ...
2219 32466 No 1 0
2220 48197 No 0 0
2221 48197 No 0 0
2222 48197 No 0 0
2223 48198 Yes 0 1
Status_Pending Status_Solved
0 0 0
1 0 0
2 0 0
3 0 0
4 0 1
... ...
2219 0 0
2220 0 1
2221 0 1
2222 0 1
2223 0 0
The basic principle is as follows:
# Create the target columns first, defaulting to False
df['Open'] = False
df['Closed'] = False

for i, row in df.iterrows():
    # 'Open' and 'Pending' both count as open
    if row['Status'] in ('Open', 'Pending'):
        df.at[i, 'Open'] = True  # or any other value
    # 'Closed' and 'Solved' both count as closed
    elif row['Status'] in ('Closed', 'Solved'):
        df.at[i, 'Closed'] = True  # or any other value
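You iterate over the rows and check each 'Status' value; when it matches, you set a boolean in the new 'Open' (or 'Closed') column. Of course, the 'Open' and 'Closed' columns have to exist before the loop runs.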
Offhand:
df['Status_open'] = 0
df['Status_closed'] = 0
df.loc[(df['Status'] == 'Open') | (df['Status'] == 'Pending'), 'Status_open'] = 1
df.loc[(df['Status'] == 'Closed') | (df['Status'] == 'Solved'), 'Status_closed'] = 1
(Untested; written without access to a computer.)
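Since the snippet above is posted as untested, here is a minimal self-contained sketch of the same idea that can be run directly. It swaps the chained `==` comparisons for `Series.isin`, which is a substitution on my part rather than what the answer wrote, and the four-row frame is made-up toy data standing in for the real complaints table.

```python
import pandas as pd

# Toy data standing in for the real complaints frame (hypothetical values)
df = pd.DataFrame({"Status": ["Closed", "Open", "Solved", "Pending"]})

# isin() returns a boolean Series; astype(int) turns True/False into 1/0
df["Status_open"] = df["Status"].isin(["Open", "Pending"]).astype(int)
df["Status_closed"] = df["Status"].isin(["Closed", "Solved"]).astype(int)

print(df)
```

A row whose status is missing or is none of the four values ends up with 0 in both columns, which makes unexpected data easy to spot.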
I think it can be done like this:
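open_ls = ['Open', 'Pending']
# Anything not in open_ls (i.e. 'Closed' and 'Solved') is labelled 'Closed'
df['New_Status'] = df['Status'].apply(lambda x: 'Open' if x in open_ls else 'Closed')
# get_dummies returns a new DataFrame, so assign the result
df = pd.get_dummies(df, columns=['New_Status'])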
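One thing to be aware of with the `else 'Closed'` branch is that any unexpected or missing status silently becomes 'Closed'. A possible variant, sketched below on made-up toy data, uses an explicit dictionary with `Series.map` so that unknown values turn into NaN instead. The name `status_map` is only illustrative, and passing `dtype=int` is an assumption about wanting 0/1 output, since recent pandas versions return booleans from `get_dummies` by default.

```python
import pandas as pd

# Toy data standing in for the real complaints frame (hypothetical values)
df = pd.DataFrame({"Status": ["Closed", "Open", "Solved", "Pending"]})

# Explicit mapping: a status missing from the dict becomes NaN, not 'Closed'
status_map = {
    "Open": "Open",
    "Pending": "Open",
    "Closed": "Closed",
    "Solved": "Closed",
}
df["New_Status"] = df["Status"].map(status_map)

# dtype=int requests 0/1 columns instead of booleans
dummies = pd.get_dummies(df, columns=["New_Status"], dtype=int)
print(dummies)
```

The result keeps the original 'Status' column and adds 'New_Status_Closed' and 'New_Status_Open'.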