Creating Dummy Variables from a String Column
I have a pandas data frame (N = 1485) that looks like this:
ID Intervention
1 Blood Draw, Flushed, Locked
1 Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed
1 Blood Draw, Flushed
2 Blood return Verified, Flushed
2 Cap Changed
3 Port De-Accessed
I would like to dummy code each comma-separated string, so that the result looks something like this:
ID Blood Draw Flushed Locked ....
1 Yes Yes Yes
1 Yes No No
...
Thanks!
You could try the following:
for event in ['Blood Draw', 'Flushed', 'Locked']:
    # Substring match: note that 'Locked' also matches 'Heparin-Locked'
    df[event] = df['Intervention'].str.contains(event)
This gives you True/False rather than 'Yes'/'No', which is probably more useful for post-processing.
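One caveat with the substring approach: str.contains('Locked') is also True for rows that only contain 'Heparin-Locked'. A minimal sketch of a token-exact alternative, assuming the items are always separated by ', ' (the three-row df below is made-up sample data in the shape of the question, not the real data set):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Intervention': ['Blood Draw, Flushed, Locked',
                     'Blood Draw, Heparin-Locked',
                     'Cap Changed'],
})

# Split each cell into a list of exact items (assumes ', ' is the separator).
tokens = df['Intervention'].str.split(', ')

for event in ['Blood Draw', 'Flushed', 'Locked']:
    # Membership test on the item list, so 'Locked' does not match 'Heparin-Locked'.
    df[event] = tokens.apply(lambda items: event in items)

# For comparison: the plain substring test flags the 'Heparin-Locked' row too.
substring_hit = df['Intervention'].str.contains('Locked', regex=False)
```

Here df['Locked'] is True only for the first row, while substring_hit is True for the first two.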
import pandas as pd
df1 = df['Intervention'].str.split(', ', expand=True)  # one column per list position
df2 = df1.fillna('')  # replace missing cells (None) with empty strings
# One block of dummies per list position, keyed by the position number
pd.concat([pd.get_dummies(df2[col]) for col in df2], axis=1, keys=df2.columns)
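To carry out the steps above, select the Intervention column, run this procedure on it, and join the result back to the original data frame so that the get_dummies statement works (it creates dummies for all the columns).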
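If one column per intervention is wanted rather than one block of dummies per list position, a variant of the same split idea can be sketched as follows. It swaps expand=True for Series.explode and is an illustration under assumptions, not the answerer's exact method; the three-row df is made-up sample data and the separator is assumed to be ', '.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Intervention': ['Blood Draw, Flushed, Locked',
                     'Blood Draw, Heparin-Locked',
                     'Cap Changed'],
})

# One row per (original row, item); the original index label is repeated.
items = df['Intervention'].str.split(', ').explode()

# Dummy-code the items, then collapse back to one row per original row.
dummies = pd.get_dummies(items).groupby(level=0).max()

# Join the indicator columns back onto the ID column.
out = df[['ID']].join(dummies)
```

Taking the max per original index label marks an intervention as present if it appears anywhere in that row's list.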
You can use pd.Series.str.get_dummies together with a dictionary mapping:
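d = {1: 'yes', 0: 'no'}
# pop() removes 'Intervention' from df; on pandas >= 2.1, DataFrame.map replaces the deprecated applymap
res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))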
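In my opinion, it is best to convert to strings only for display purposes. Boolean values are stored and manipulated more efficiently in Boolean series.
Result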
print(res)
ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked \
0 1 yes no no yes no
1 1 yes no no no yes
2 1 yes no no yes no
3 2 no yes no yes no
4 2 no no yes no no
5 3 no no no no no
Locked Port De-Accessed Tubing Changed
0 yes no no
1 no yes yes
2 no no no
3 no no no
4 no no no
5 no yes no
Setup
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})
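To illustrate the remark about keeping Booleans and converting to strings only for display, here is a minimal sketch on the same six-row setup data. The names flags, flushed_per_id and shown are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})

# Keep the indicators as Booleans for any further computation.
flags = df['Intervention'].str.get_dummies(', ').astype(bool)
res = df[['ID']].join(flags)

# Booleans aggregate directly, e.g. the number of 'Flushed' rows per ID.
flushed_per_id = res.groupby('ID')['Flushed'].sum()

# Build a 'yes'/'no' copy only when the table has to be shown to someone.
labels = pd.DataFrame(np.where(flags, 'yes', 'no'),
                      index=flags.index, columns=flags.columns)
shown = df[['ID']].join(labels)
```

res stays suitable for filtering and aggregation, while shown is the presentation copy with 'yes'/'no' labels.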