Split labels of dataframe with multiple categorical values in python for encoding labels
I have a column like this in my dataset.
print (pharma_data['Treated_with_drugs'].astype('category').cat.categories)
Index(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6'],
dtype='object')
I want to split this column into 6 columns (DX1, DX2, DX3, DX4, DX5, DX6), each holding a value of 0 or 1.
For example, if the row value is 'DX1 DX2 DX5 ', then:
column names: DX1, DX2, DX3, DX4, DX5, DX6
column values: 1 1 0 0 1 0
How can I do this?
Let
rows = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ', 'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ', 'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ', 'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ', 'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ', 'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ', 'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ', 'DX6']
Then you can get the result you want like this:
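import pandas as pd

# Split each row into tokens and collect the unique, non-empty labels.
columns = [row.split(' ') for row in rows]
# A set has no guaranteed order; wrap it in sorted(...) for a fixed column order.
unique_columns = list(set(item for sublist in columns for item in sublist if item != ''))
df = pd.DataFrame(columns=unique_columns)
for i, row in enumerate(rows):
    # Start every label at 0, then mark the ones present in this row.
    row_dict = {col: 0 for col in unique_columns}
    for element in row.split(' '):
        if element != '':
            row_dict[element] = 1
    df.loc[i] = row_dict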
>>>
DX3 DX1 DX4 DX2 DX5 DX6
0 0 1 0 0 0 0
1 0 1 0 1 0 0
2 1 1 0 1 0 0
3 1 1 1 1 0 0
4 1 1 1 1 1 0
5 1 1 0 1 1 0
6 0 1 1 1 0 0
7 0 1 1 1 1 0
8 0 1 0 1 1 0
9 1 1 0 0 0 0
10 1 1 1 0 0 0
11 1 1 1 0 1 0
12 1 1 0 0 1 0
13 0 1 1 0 0 0
14 0 1 1 0 1 0
15 0 1 0 0 1 0
16 0 0 0 1 0 0
17 1 0 0 1 0 0
18 1 0 1 1 0 0
19 1 0 1 1 1 0
20 1 0 0 1 1 0
21 0 0 1 1 0 0
22 0 0 1 1 1 0
23 0 0 0 1 1 0
24 1 0 0 0 0 0
25 1 0 1 0 0 0
26 1 0 1 0 1 0
27 1 0 0 0 1 0
28 0 0 1 0 0 0
29 0 0 1 0 1 0
30 0 0 0 0 1 0
31 0 0 0 0 0 1
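As a variation on the loop above, the rows can be collected as a list of dicts and turned into a DataFrame in one step, which avoids growing the frame row by row. This is only a minimal sketch: the shortened `rows` list and the names `tokens`, `labels`, `records`, and `encoded` are made up for illustration, and the labels are sorted so the column order is predictable.

```python
import pandas as pd

# A few sample values in the same format as the question (trailing spaces included).
rows = ['DX1 ', 'DX1 DX2 DX5 ', 'DX3 DX4 ', 'DX6']

# str.split() with no argument splits on any whitespace and drops empty strings.
tokens = [row.split() for row in rows]

# Sort the unique labels so the column order is deterministic.
labels = sorted({label for row_tokens in tokens for label in row_tokens})

# One dict per row (1 if the label is present, else 0), then a single DataFrame build.
records = [{label: int(label in row_tokens) for label in labels} for row_tokens in tokens]
encoded = pd.DataFrame(records, columns=labels)
print(encoded)
```

The row for 'DX1 DX2 DX5 ' comes out as 1 1 0 0 1 0, matching the example in the question.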
Use Series.str.strip with Series.str.get_dummies:
import pandas as pd

a = ['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
'DX6']
pharma_data = pd.DataFrame({'Treated_with_drugs': a})
# Strip the trailing space, then split on ' ' and one-hot encode each label.
df = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')
print(df)
DX1 DX2 DX3 DX4 DX5 DX6
0 1 0 0 0 0 0
1 1 1 0 0 0 0
2 1 1 1 0 0 0
3 1 1 1 1 0 0
4 1 1 1 1 1 0
5 1 1 1 0 1 0
6 1 1 0 1 0 0
7 1 1 0 1 1 0
8 1 1 0 0 1 0
9 1 0 1 0 0 0
10 1 0 1 1 0 0
11 1 0 1 1 1 0
12 1 0 1 0 1 0
13 1 0 0 1 0 0
14 1 0 0 1 1 0
15 1 0 0 0 1 0
16 0 1 0 0 0 0
17 0 1 1 0 0 0
18 0 1 1 1 0 0
19 0 1 1 1 1 0
20 0 1 1 0 1 0
21 0 1 0 1 0 0
22 0 1 0 1 1 0
23 0 1 0 0 1 0
24 0 0 1 0 0 0
25 0 0 1 1 0 0
26 0 0 1 1 1 0
27 0 0 1 0 1 0
28 0 0 0 1 0 0
29 0 0 0 1 1 0
30 0 0 0 0 1 0
31 0 0 0 0 0 1
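If the encoded columns need to sit next to the original data, the result of get_dummies can be joined back onto the frame. The sketch below also reindexes to a fixed list of labels, on the assumption that all six columns are wanted even when a given sample happens not to contain some of them. The shortened sample data and the names `dummies`, `expected`, and `result` are illustrative, not from the original answer.

```python
import pandas as pd

# Small sample in the same format as the question's column.
pharma_data = pd.DataFrame({'Treated_with_drugs': ['DX1 ', 'DX1 DX2 DX5 ', 'DX6']})

# Strip surrounding whitespace, then one-hot encode on the space separator.
dummies = pharma_data['Treated_with_drugs'].str.strip().str.get_dummies(' ')

# Assumed full label set; labels missing from the data become all-zero columns.
expected = ['DX1', 'DX2', 'DX3', 'DX4', 'DX5', 'DX6']
dummies = dummies.reindex(columns=expected, fill_value=0)

# Attach the indicator columns to the original frame (aligned on the index).
result = pharma_data.join(dummies)
print(result)
```

Reindexing matters mainly when the same encoding has to be applied to several subsets (for example train and test splits) that must end up with identical columns.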