How to create new columns based on phrase existence?

I want to create new columns based on whether a phrase exists in a text column.

Here is my data:

No   Body
1    Office software is already paid
2    Excel software is not paid yet
3    Power point software is already paid

I want to flag each row by whether it contains a given phrase. This is my code:

countries1 = df.Body.str.extract('(software|is already paid)', expand = False)
dummies1 = pd.get_dummies(countries1)
df_1 = pd.concat([df,dummies1],axis = 1)

The result is:

No   Body                                   software   is already paid    
1    Office software is already paid        0          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   0          1

What I expected is:

No   Body                                   software   is already paid    
1    Office software is already paid        1          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   1          1

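What is wrong with my code? Or am I not using the right function?

Let's try extractall, which returns every match in each row rather than only the first: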

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.assign(**df.Body.str.extractall('(software|is already paid)')[0]
              .str.get_dummies().groupby(level=0).sum())

Output:

   No                                  Body  is already paid  software
0   1       Office software is already paid                1         1
1   2        Excel software is not paid yet                0         1
2   3  Power point software is already paid                1         1
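
For context on why the two functions behave differently: str.extract keeps only the first match in each string, while str.extractall returns every match, one row per match, indexed by (original row, match number). The following is a minimal sketch of that difference, assuming the sample data from the question (the frame is rebuilt here so the snippet runs on its own; the variable names are illustrative only).

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})
pattern = "(software|is already paid)"

# extract: at most one value per row (the first match found)
first_only = df.Body.str.extract(pattern, expand=False)

# extractall: every match, indexed by (original row, match number)
all_matches = df.Body.str.extractall(pattern)[0]

# One indicator column per phrase, collapsed back to one row per original row.
# max() keeps the flag at 1 even if a phrase occurs more than once in a row.
flags = pd.get_dummies(all_matches).groupby(level=0).max().astype(int)
result = df.join(flags)
print(result)
```

Note that a row with no match at all does not appear in the extractall result, so its flag columns would come out as NaN after the join and may need a fillna(0).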

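You can use NumPy's np.char.find (the public name for np.core.defchararray.find) to look for the phrases:

import numpy as np

phrases = np.array(['software', 'is already paid'])

# Broadcast each phrase (as a column vector) against every Body string;
# find returns -1 where the phrase is absent
dummies = (np.char.find(
    df.Body.values.astype(str),
    phrases[:, None]) > -1
).astype(np.uint)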

dummies

array([[1, 1, 1],
       [1, 0, 1]], dtype=uint64)

Then you can put these values into the existing dataframe:

df['software'], df['is already paid'] = dummies

Or use assign to create a new copy with the desired columns:

df.assign(**dict(zip(phrases, dummies)))

   No                                  Body  software  is already paid
0   1       Office software is already paid         1                1
1   2        Excel software is not paid yet         1                0
2   3  Power point software is already paid         1                1
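
The same idea can also be sketched with pandas alone, using str.contains once per phrase. This is not taken from the answers above; it is an illustrative variant, assuming the phrases should be matched as literal substrings (hence regex=False) and reusing the sample data from the question.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})
phrases = ["software", "is already paid"]

# One 0/1 column per phrase; regex=False treats each phrase as plain text
flags = {p: df["Body"].str.contains(p, regex=False).astype(int) for p in phrases}

# assign returns a new frame and leaves df untouched
result = df.assign(**flags)
print(result)
```

Unlike the extractall approach, rows without any match simply get 0 in every flag column.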