Pandas: finding text in a row and assigning a dummy variable value based on it

I have a data frame that contains a text column, i.e. df["input"].

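I would like to create a new variable for each list that checks whether the df["input"] column contains any of the words in that list, and assigns 1 only if the previous dummy variables are equal to 0. (The logic is: 1) create a dummy variable equal to zero; 2) replace it with 1 if the input contains any word from the given list and did not already match one of the previous lists.)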

import pandas as pd

# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle",  "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle', 
                             'medical', 
                             'film biotechnology dentist']})

It looks like:

input
amazon listing subtitle
medical 
film biotechnology dentist

The final dataset should look like this:

input                           listings  scripting  medical
amazon listing subtitle            1         0         0
medical                            0         0         1          
film biotechnology dentist         0         1         0

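One possible approach is to create 3 columns with str.contains in a loop, then use idxmax (in the code below, its NumPy counterpart argmax) to find the column name (i.e. the list name) of the first match, and then build the dummy variables from those matches: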

import numpy as np
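d = {'listings': listings, 'scripting': scripting, 'medical': medical}
# one boolean column per list: True if the row contains any of its terms
for k, v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
# argmax gives the position of the first True in each row;
# arr.max(axis=1) writes a 1 there only if the row matched at all
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
# take the dummy columns from tmp and the remaining column ('input') from df
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)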
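
Note that '|'.join(v) treats every term as a raw regular expression and matches substrings, so "film" would also match "filming". Below is a minimal sketch of a stricter pattern, assuming whole-word matching is what is wanted. build_pattern is a hypothetical helper (not part of pandas), and the sample strings are made up for illustration:

```python
import re
import pandas as pd

def build_pattern(terms):
    # hypothetical helper: escape regex metacharacters in each term and
    # require word boundaries around the whole alternation
    return r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"

sample = pd.Series(["filming schedule", "film dubbing", "nodeXjs notes", "node.js notes"])
pattern = build_pattern(["film", "node.js"])

# non-capturing group (?:...) avoids the pandas warning about match groups
matches = sample.str.contains(pattern)
print(matches.tolist())
```

The same pattern string could be passed to str.contains in the loop above in place of '|'.join(v).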

However, in this case a nested for-loop may be more efficient:

import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match 
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))

Output:

                        input listings medical scripting
0     amazon listing subtitle        1       0         0
1                     medical        0       1         0
2  film biotechnology dentist        0       0         1

Here is a simpler solution in a more vectorized, pandas style:

import numpy as np
import pandas as pd

patterns = {}  # maps each output column name to its list of search terms
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle",  "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle', 
                             'medical', 
                             'film biotechnology dentist']})
#---------------------------------------------------------------#

# step 1, for each column create a reg-expression
for col, items in patterns.items():
    
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)
    
# step 2, if the value to the left is 1, change its value to 0

## 2.1 create a mask
## shift the columns one position to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values

# subtract the mask from the dummy columns
## and clip the result --> negative values will become 0
## (indexing with df[mask] here would replace every unmasked cell with NaN)
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:].astype(int), 0, 1)

print(df)

Result:

                        input  listings  scripting  medical
0     amazon listing subtitle         1          0        0
1                     medical         0          0        1
2  film biotechnology dentist         0          1        0
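
One caveat: a mask built from shift only compares each column with its immediate left neighbor, so a row that matches the first and third lists but not the second would keep both 1s. A sketch of an alternative that keeps only the first match per row, using a cumulative sum across the columns. The flags frame below is made-up illustration data standing in for the 0/1 columns produced in step 1:

```python
import pandas as pd

# made-up 0/1 match flags, one column per list, in priority order
flags = pd.DataFrame(
    {
        "listings": [1, 0, 0, 1],
        "scripting": [1, 0, 1, 0],
        "medical": [0, 1, 1, 1],
    }
)

# the running count of matches from left to right equals 1 from the first
# match until the second one; AND-ing with the flag itself keeps only the
# cell where the first match actually occurs
first_only = (flags.eq(1) & flags.cumsum(axis=1).eq(1)).astype(int)
print(first_only)
```

The last row (1, 0, 1) is the case the neighbor comparison misses; with the cumulative sum it reduces to (1, 0, 0).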

Great question and great answers (somehow I missed this yesterday)! Here is another variant using .str.extractall():

search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
    f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
    for column, strings in search.items()
)
result = (
    df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
               .idxmax(axis=1).str.get_dummies()
               # a "dummy" column only appears when some row matches none of the lists
               .drop(columns="dummy", errors="ignore")
)
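
Because get_dummies only creates columns for values that actually occur, and orders them alphabetically, the result may lack a column or list the columns in a different order than the lists were given. A small sketch of restoring the intended columns with reindex. The first Series is made-up data standing in for the output of idxmax(axis=1), and wanted is an assumed column order:

```python
import pandas as pd

# made-up "name of the first matching list" per row; no row hits "scripting"
first = pd.Series(["listings", "medical", "listings"])
wanted = ["listings", "scripting", "medical"]  # assumed priority order

dummies = (
    pd.get_dummies(first)
    .reindex(columns=wanted, fill_value=0)  # add missing columns, fix the order
    .astype(int)                            # normalize bool/uint8 to plain ints
)
print(dummies)
```

The reindexed frame can then be attached to the original data with df.join(dummies), since both share the same row index.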