Pandas 在行中查找文本并根据此分配虚拟变量值
Pandas finding a text in row and assign a dummy variable value based on this
我有一个包含文本列的数据框,即 df["input"]
、
我想创建一个新变量来检查 df["input"]
列是否包含给定列表中的任何单词,如果前一个虚拟变量等于 0(逻辑为 1),则赋值为 1创建一个等于零的虚拟变量 2) 如果它包含给定列表中的任何单词并且它不包含在之前的列表中,则将其替换为 1。)
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
'medical',
'film biotechnology dentist']})
看起来像:
input
amazon listing subtitle
medical
film biotechnology dentist
最终数据集应如下所示:
input listings scripting medical
amazon listing subtitle 1 0 0
medical 0 0 1
film biotechnology dentist 0 1 0
一种可能的实现方式是在循环中使用str.contains
创建3列,然后使用idxmax
获取第一个匹配项的列名(或列表名),然后创建来自这些匹配项的虚拟变量:
import numpy as np
d = {'listings':listings, 'scripting':scripting, 'medical':medical}
for k,v in d.items():
df[k] = df['input'].str.contains('|'.join(v))
arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
但在这种情况下,使用嵌套 for-loop:
可能更有效
import re
def get_dummy_vars(col, lsts):
out = []
len_lsts = len(lsts)
for row in col:
tmp = []
# in the nested loop, we use the any function to check for the first match
# if there's a match, break the loop and pad 0s since we don't care if there's another match
for lst in lsts:
tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
if tmp[-1]:
break
tmp += [0] * (len_lsts - len(tmp))
out.append(tmp)
return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
输出:
input listings medical scripting
0 amazon listing subtitle 1 0 0
1 medical 0 1 0
2 film biotechnology dentist 0 0 1
这是一个更简单的-更多pandas向量样式解决方案:
patterns = {} #<-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
'medical',
'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1, for each column create a reg-expression
for col, items in patterns.items():
# create a regex pattern (word1|word2|word3)
pattern = f"({'|'.join(items)})"
# find the pattern in the input column
df[col] = df['input'].str.contains(pattern, regex=True).astype(int)
# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the rows to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values
# substract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:,1:] = np.clip( df[mask].iloc[:,1:] - mask[:,1:], 0, 1 )
print(df)
结果
input listings scripting medical
0 amazon listing subtitle 1 0 0
1 medical 0 0 1
2 film biotechnology dentist 0 1 0
很好的问题和很好的答案(我昨天不知何故错过了)!这是 .str.extractall()
的另一种变体:
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
我有一个包含文本列的数据框,即 df["input"]
、
我想创建一个新变量来检查 df["input"]
列是否包含给定列表中的任何单词,如果前一个虚拟变量等于 0(逻辑为 1),则赋值为 1创建一个等于零的虚拟变量 2) 如果它包含给定列表中的任何单词并且它不包含在之前的列表中,则将其替换为 1。)
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
'medical',
'film biotechnology dentist']})
看起来像:
input
amazon listing subtitle
medical
film biotechnology dentist
最终数据集应如下所示:
input listings scripting medical
amazon listing subtitle 1 0 0
medical 0 0 1
film biotechnology dentist 0 1 0
一种可能的实现方式是在循环中使用str.contains
创建3列,然后使用idxmax
获取第一个匹配项的列名(或列表名),然后创建来自这些匹配项的虚拟变量:
import numpy as np
d = {'listings':listings, 'scripting':scripting, 'medical':medical}
for k,v in d.items():
df[k] = df['input'].str.contains('|'.join(v))
arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
但在这种情况下,使用嵌套 for-loop:
可能更有效import re
def get_dummy_vars(col, lsts):
out = []
len_lsts = len(lsts)
for row in col:
tmp = []
# in the nested loop, we use the any function to check for the first match
# if there's a match, break the loop and pad 0s since we don't care if there's another match
for lst in lsts:
tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
if tmp[-1]:
break
tmp += [0] * (len_lsts - len(tmp))
out.append(tmp)
return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
输出:
input listings medical scripting
0 amazon listing subtitle 1 0 0
1 medical 0 1 0
2 film biotechnology dentist 0 0 1
这是一个更简单的-更多pandas向量样式解决方案:
patterns = {} #<-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
'medical',
'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1, for each column create a reg-expression
for col, items in patterns.items():
# create a regex pattern (word1|word2|word3)
pattern = f"({'|'.join(items)})"
# find the pattern in the input column
df[col] = df['input'].str.contains(pattern, regex=True).astype(int)
# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the rows to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values
# substract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:,1:] = np.clip( df[mask].iloc[:,1:] - mask[:,1:], 0, 1 )
print(df)
结果
input listings scripting medical
0 amazon listing subtitle 1 0 0
1 medical 0 0 1
2 film biotechnology dentist 0 1 0
很好的问题和很好的答案(我昨天不知何故错过了)!这是 .str.extractall()
的另一种变体:
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)