Creating Dummy Variables from String Column

Question

I have a pandas dataframe (N = 1485) that looks like this:

ID          Intervention
1           Blood Draw, Flushed, Locked
1           Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed
1           Blood Draw, Flushed
2           Blood return Verified, Flushed
2           Cap Changed
3           Port De-Accessed

I would like to dummy-code each comma-separated string, so that the result looks something like:

ID          Blood Draw          Flushed          Locked      ....
1              Yes                Yes             Yes
1              Yes                No              No
...

Thanks!

Answer 1

You can try the following:

# Note: str.contains matches substrings, so 'Locked' is also True for 'Heparin-Locked' rows.
for event in ['Blood Draw', 'Flushed', 'Locked']:
    df[event] = df['Intervention'].str.contains(event)

This gives you True/False rather than 'Yes'/'No', which is likely to be more useful for post-processing.
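
Because str.contains tests for substrings, an event name that is contained in another one can produce extra hits. The following is a minimal sketch of an exact-match variant, assuming the items are always separated by a comma and a single space; the sample frame is the six rows from the question, and the names `items` and `substring_hit` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Intervention': [
        'Blood Draw, Flushed, Locked',
        'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
        'Blood Draw, Flushed',
        'Blood return Verified, Flushed',
        'Cap Changed',
        'Port De-Accessed',
    ],
})

# Substring test, kept only for comparison: row 1 matches via 'Heparin-Locked'
substring_hit = df['Intervention'].str.contains('Locked', regex=False)

# Split each cell into a list of items (assumes ', ' as the separator)
items = df['Intervention'].str.split(', ')

# Exact membership test per event; e=event binds the current loop value
for event in ['Blood Draw', 'Flushed', 'Locked']:
    df[event] = items.apply(lambda parts, e=event: e in parts)

print(df[['ID', 'Blood Draw', 'Flushed', 'Locked']])
```

With this version, the second row is False for 'Locked' even though it contains 'Heparin-Locked'.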

Answer 2

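import pandas as pd

# Split on ', ' so items carry no leading space; one column per position in the list
df1 = df['Intervention'].str.split(', ', expand=True)
# Shorter rows are padded with None; replace those with empty strings
df2 = df1.fillna('')
# One-hot encode each positional column; keys= labels each group by its position
dummies = pd.concat([pd.get_dummies(df2[col]) for col in df2], axis=1, keys=df2.columns)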

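To apply the steps above, select the Intervention column, run this process on it, and join the result back to the original dataframe. The get_dummies call creates dummies for every split column, so you get one group of dummy columns per position in the list, and the same intervention can appear in more than one group.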
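
If a single column per intervention is wanted instead of one group per position, one possible variation (not the approach above, and only a sketch) is to explode the split lists into rows, encode them, and collapse back to one row per original index. It assumes ', ' as the separator; `items`, `dummies` and `res` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Intervention': [
        'Blood Draw, Flushed, Locked',
        'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
        'Blood Draw, Flushed',
        'Blood return Verified, Flushed',
        'Cap Changed',
        'Port De-Accessed',
    ],
})

# One row per (original row, item); the original index is repeated
items = df['Intervention'].str.split(', ').explode()

# Encode the items, then take the max per original row to merge them back
dummies = pd.get_dummies(items).groupby(level=0).max().astype(bool)

# Join the flags back onto the ID column (indices line up with df)
res = df[['ID']].join(dummies)
print(res)
```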

Answer 3

You can use pd.Series.str.get_dummies together with a dictionary mapping:

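d = {1: 'yes', 0: 'no'}
# str.get_dummies(', ') yields one 0/1 column per distinct item; applymap(d.get) maps them to 'yes'/'no'.
# Note: df.pop removes 'Intervention' from df in place; pandas 2.1+ deprecates applymap in favour of DataFrame.map.
res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))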

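In my opinion, it is better to convert to strings only for display purposes. Boolean values are stored and manipulated more efficiently in Boolean series.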

Result

print(res)

   ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked  \
0   1        yes                    no          no     yes             no   
1   1        yes                    no          no      no            yes   
2   1        yes                    no          no     yes             no   
3   2         no                   yes          no     yes             no   
4   2         no                    no         yes      no             no   
5   3         no                    no          no      no             no   

  Locked Port De-Accessed Tubing Changed  
0    yes               no             no  
1     no              yes            yes  
2     no               no             no  
3     no               no             no  
4     no               no             no  
5     no              yes             no

Setup

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})
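
To illustrate the remark about keeping Boolean values for processing and converting to strings only for display, here is a rough sketch on the same sample data. The names `flags`, `per_id` and `display` are illustrative, and using np.where for the 'yes'/'no' view is just one option.

```python
import numpy as np
import pandas as pd

# Same sample rows as in the setup
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3],
    'Intervention': [
        'Blood Draw, Flushed, Locked',
        'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
        'Blood Draw, Flushed',
        'Blood return Verified, Flushed',
        'Cap Changed',
        'Port De-Accessed',
    ],
})

# Keep the indicators as booleans for any further processing
flags = df['Intervention'].str.get_dummies(', ').astype(bool)

# Example of working with the booleans: did each ID ever get each intervention?
per_id = flags.groupby(df['ID']).any()

# Build a 'yes'/'no' copy only when something has to be shown to a reader
display = pd.DataFrame(np.where(flags, 'yes', 'no'),
                       index=flags.index, columns=flags.columns)

print(per_id)
print(df[['ID']].join(display))
```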
