Create categorical column in Python from string values

Question

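I have a pandas DataFrame with a "Name" column. The strings in the Name column may contain "Joe", "Bob", or both "Joe" and "Bob". I want to add a column for the type of person: Joe only, Bob only, or both.

I was able to do this by creating boolean columns, converting them to strings, combining the strings, and then replacing the values. It just... didn't feel very elegant! I'm new to Python... is there a better way?

My original dataframe: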

import pandas as pd

df = pd.DataFrame(data=[['Joe Biden'], ['Bobby Kennedy'], ['Joe Bob Briggs']], columns=['Name'])

             Name
0       Joe Biden
1   Bobby Kennedy
2  Joe Bob Briggs

I added two boolean columns to look for the names, then converted them to integers:

df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')

df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')

Now my dataframe looks like this:

df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])

             Name  Joe  Bob
0       Joe Biden    1    0
1   Bobby Kennedy    0    1
2  Joe Bob Briggs    1    1

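But what I really want is a single "Type" column with categorical values: Joe, Bob, or Both.

To do that, I added a column that combines the two flags as strings, and then I replaced the values: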

df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)

             Name  Joe  Bob  Type
0       Joe Biden    1    0    10
1   Bobby Kennedy    0    1    01
2  Joe Bob Briggs    1    1    11

df['Type'] = df.Type.astype('str')
# '0' + '1' concatenates to '01', so the key for Bob must keep the leading zero
df['Type'] = df['Type'].replace({'11': 'Both', '10': 'Joe', '01': 'Bob'})

             Name  Joe  Bob  Type
0       Joe Biden    1    0   Joe
1   Bobby Kennedy    0    1   Bob
2  Joe Bob Briggs    1    1  Both

This feels clumsy. Does anyone have a better way?

Thanks!

Answer 1

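You can use np.select to create the Type column.

The conditions in condlist must be ordered from most specific to most general, because np.select uses the first condition that matches for each row.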

df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
                        df['Name'].str.contains('Joe'),
                        df['Name'].str.contains('Bob')],
                       choicelist=['Both', 'Joe', 'Bob'],
                       default='Neither')  # a string default keeps the result dtype consistent with the string choices

Output:

>>> df
             Name  Type
0       Joe Biden   Joe
1   Bobby Kennedy   Bob
2  Joe Bob Briggs  Both
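
For reference, here is a self-contained sketch of the same idea that also shows why the order of conditions matters. The mask names `has_joe` and `has_bob`, the `"Neither"` fallback label, and the final conversion to a pandas categorical are illustrative assumptions, not part of the answer above. A string `default` is passed because some NumPy versions may reject mixing the string choices with the integer default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Joe Biden", "Bobby Kennedy", "Joe Bob Briggs"]})

# Compute each substring test once and reuse the boolean masks.
has_joe = df["Name"].str.contains("Joe")
has_bob = df["Name"].str.contains("Bob")

# np.select takes the first condition that is True for each row,
# so the most specific condition (both names) has to come first.
labels = np.select(
    [has_joe & has_bob, has_joe, has_bob],
    ["Both", "Joe", "Bob"],
    default="Neither",  # hypothetical label for rows that match no name
)

# Optionally store the labels as a true pandas categorical column.
df["Type"] = pd.Categorical(labels, categories=["Joe", "Bob", "Both", "Neither"])

# For comparison: with the general conditions first, "Joe" shadows "Both".
wrong = np.select(
    [has_joe, has_bob, has_joe & has_bob],
    ["Joe", "Bob", "Both"],
    default="Neither",
)
```

With the conditions in the wrong order, the "Joe Bob Briggs" row is labeled Joe instead of Both, because the plain 'Joe' test matches it first.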

python

boolean

dataframe

pandas

categorical-data