How to create new columns based on phrase existence?
I want to create new columns based on whether certain phrases are present in a text column.
Here is my data:
No  Body
1   Office software is already paid
2   Excel software is not paid yet
3   Power point software is already paid
I want to classify each row by whether a given phrase is present. This is my code:
countries1 = df.Body.str.extract('(software|is already paid)', expand = False)
dummies1 = pd.get_dummies(countries1)
df_1 = pd.concat([df,dummies1],axis = 1)
The result is:
No  Body                                  software  is already paid
1   Office software is already paid       0         1
2   Excel software is not paid yet        1         0
3   Power point software is already paid  0         1
What I expected is:
No  Body                                  software  is already paid
1   Office software is already paid       1         1
2   Excel software is not paid yet        1         0
3   Power point software is already paid  1         1
What is wrong with my code? Or am I not using the right function?
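Let's try using extractall, which returns every match in each row rather than only the first:
# groupby(level=0).sum() replaces the older .sum(level=0), which recent pandas versions no longer accept
df.assign(**df.Body.str.extractall('(software|is already paid)')[0]
            .str.get_dummies().groupby(level=0).sum())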
Output:
   No  Body                                  is already paid  software
0  1   Office software is already paid       1                1
1  2   Excel software is not paid yet        0                1
2  3   Power point software is already paid  1                1
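To see why the original attempt fell short, here is a minimal, self-contained sketch that rebuilds the sample data and compares str.extract with str.extractall. The variable names (first_only, all_matches, flags, result) are illustrative only, and the reindex step is an added assumption to cover rows in which no phrase matches at all.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

pattern = "(software|is already paid)"

# str.extract keeps only the FIRST match per row, so later phrases are lost
first_only = df["Body"].str.extract(pattern, expand=False)

# str.extractall returns every match, indexed by (original row, match number)
all_matches = df["Body"].str.extractall(pattern)[0]

# One indicator column per phrase, then collapse back to one row per original row
flags = all_matches.str.get_dummies().groupby(level=0).sum()

# Assumption: rows with no match at all should get 0 instead of going missing
flags = flags.reindex(df.index, fill_value=0)

result = df.join(flags)
print(result)
```

With this data every row contains "software" before any other phrase, so str.extract can never report "is already paid"; extractall is what makes the second column possible.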
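You can use NumPy's np.char.find (the public alias of np.core.defchararray.find) to look for the phrases:
import numpy as np
phrases = np.array(['software', 'is already paid'])
# Broadcast each phrase against every Body string; find returns -1 when the phrase is absent
dummies = (np.char.find(
    df.Body.values.astype(str),
    phrases[:, None]) > -1
).astype(np.uint)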
dummies
array([[1, 1, 1],
[1, 0, 1]], dtype=uint64)
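The shape of that array is worth a closer look: each row corresponds to one phrase and each column to one row of the dataframe. The sketch below reproduces the broadcasting step on a plain NumPy array; the names bodies, positions and found are illustrative and do not appear in the answer.

```python
import numpy as np

# The Body column from the question as a plain NumPy string array, shape (3,)
bodies = np.array([
    "Office software is already paid",
    "Excel software is not paid yet",
    "Power point software is already paid",
])
phrases = np.array(["software", "is already paid"])

# phrases[:, None] has shape (2, 1); broadcasting against (3,) gives shape (2, 3)
positions = np.char.find(bodies, phrases[:, None])

# find returns the start index of the phrase, or -1 when it is not present
found = positions > -1

print(positions.shape)
print(found.astype(int))
```

Because the first axis runs over phrases, unpacking the array yields one vector per phrase, which is what makes per-column assignment straightforward.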
You can then put these values into the existing dataframe:
df['software'], df['is already paid'] = dummies
Or use assign to create a new copy that includes the desired columns:
df.assign(**dict(zip(phrases, dummies)))
   No  Body                                  software  is already paid
0  1   Office software is already paid       1         1
1  2   Excel software is not paid yet        1         0
2  3   Power point software is already paid  1         1
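For comparison, a pandas-only alternative (not taken from either answer above) is to test each phrase separately with str.contains. This is a sketch under the assumption that the phrases are literal strings rather than regular expressions, hence regex=False; the name out is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

phrases = ["software", "is already paid"]

# Work on a copy so the original dataframe is left untouched
out = df.copy()
for phrase in phrases:
    # regex=False treats the phrase as a literal substring
    out[phrase] = df["Body"].str.contains(phrase, regex=False).astype(int)

print(out)
```

This keeps the columns in the order of the phrases list and gives 0 for rows where a phrase is missing, at the cost of one pass over the column per phrase.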