Assign value to new column if substring from a list appears in string

Question

I have a DataFrame df:

tags
"a,b,c,d"
"c,q,k,t"

and a list of strings that I need to search for:

searchList =  ["a", "b"]

I need to add a new column named "topic" to the DataFrame. If any string from searchList appears in the "tags" column, the value in that row should be the bool True, otherwise the bool False.

Desired result:

tags      | topic
"a,b,c,d" | True
"c,q,k,t" | False

My code so far:

searchList =  ["a", "b"]
pattern = '|'.join(searchfor)
df["topic"] = df.loc[(df["tags"].str.contains('|'.join(pattern), na=False)), True] = True

But I get the error:

KeyError: 'cannot use a single bool to index into setitem'


Answer 1

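You can assign the boolean mask to the new column directly. Also join searchList itself rather than pattern: pattern is already a joined string, so joining it again puts "|" between its individual characters.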

searchList =  ["a", "b"]
df["topic"] = df["tags"].str.contains('|'.join(searchList), na=False)
print (df)
      tags  topic
0  a,b,c,d   True
1  c,q,k,t  False
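
One caveat with the pattern approach: str.contains treats the joined string as a regular expression and matches substrings, so a search term such as "a" would also match a tag like "ab", and a term containing regex metacharacters (for example "c++") would need escaping. A minimal sketch of one way to handle both, assuming the tags are comma-separated with no surrounding spaces. The sample data and the names escaped and pattern are illustrative and not part of the original question:

```python
import re

import pandas as pd

# Illustrative data: the last two rows exercise the edge cases.
df = pd.DataFrame({"tags": ["a,b,c,d", "c,q,k,t", "ab,q", "x,c++"]})
searchList = ["a", "c++"]

# Escape each term so characters such as "+" are matched literally.
escaped = [re.escape(s) for s in searchList]

# Match a term only when it is a whole tag: preceded by the start of the
# string or a comma, and followed by a comma or the end of the string.
pattern = r"(?:^|,)(?:" + "|".join(escaped) + r")(?:,|$)"

df["topic"] = df["tags"].str.contains(pattern, na=False)
print(df)
```

With this pattern the row "ab,q" is not flagged, because "a" is only part of the tag "ab", while "x,c++" is flagged through the escaped term.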

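EDIT: if every value in searchList must be present among the tags (shown here with an extra row "a,c,d"):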

searchList =  ["a", "b"]
df["topic"] = df["tags"].str.split(',', expand=True).isin(searchList).sum(axis=1).eq(2)
print (df)
      tags  topic
0  a,b,c,d   True
1  c,q,k,t  False
2    a,c,d  False

Details:

First use Series.str.split with expand=True to get a new DataFrame:

print (df["tags"].str.split(',', expand=True))
   0  1  2     3
0  a  b  c     d
1  c  q  k     t
2  a  c  d  None

Then test membership with DataFrame.isin:

print (df["tags"].str.split(',', expand=True).isin(searchList))
       0      1      2      3
0   True   True  False  False
1  False  False  False  False
2   True  False  False  False

And count the True values in each row with sum:

print (df["tags"].str.split(',', expand=True).isin(searchList).sum(axis=1))
0    2
1    0
2    1
dtype: int64

Finally, compare with Series.eq, which is equivalent to ==, to get the mask.
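
The literal 2 in the chain above is the number of items in searchList, so it has to change whenever the list does, and a tag repeated within one row would inflate the count. A hedged alternative sketch that checks set membership per row instead; the name required is illustrative, and the data is the three-row frame from the EDIT:

```python
import pandas as pd

df = pd.DataFrame({"tags": ["a,b,c,d", "c,q,k,t", "a,c,d"]})
searchList = ["a", "b"]

required = set(searchList)

# Split each row into a list of tags and check that every required value
# is present. Missing values (NaN) are not lists, so they yield False.
df["topic"] = df["tags"].str.split(",").apply(
    lambda tags: isinstance(tags, list) and required.issubset(tags)
)
print(df)
```

If the sum-based chain is preferred, replacing eq(2) with eq(len(searchList)) removes the hard-coded count, provided tags are not repeated within a row.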
