How to test if a string contains one of the substrings stored in a list column in pandas?
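My question is very similar to "How to test if a string contains one of the substrings in a list, in pandas?", except that the list of substrings to check varies by observation and is stored in a list column. Is there a way to access that list in a vectorized way by referencing the Series?
Sample dataset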
import pandas as pd
df = pd.DataFrame([{'a': 'Bob Smith is great.', 'b': ['Smith', 'foo']},
{'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']}])
print(df)
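I want to generate a third column 'c' that equals 1 if any of the 'b' strings is a substring of 'a' in its respective row, and 0 otherwise. That is, in this case I would expect: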
a b c
0 Bob Smith is great. [Smith, foo] 1
1 The Sun is a mass of incandescent gas. [Jones, bar] 0
My attempt:
df['c'] = df.a.str.contains('|'.join(df.b)) # Does not work.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_4092606/761645043.py in <module>
----> 1 df['c'] = df.a.str.contains('|'.join(df.b)) # Does not work.
TypeError: sequence item 0: expected str instance, list found
You can just use zip and a list comprehension:
df['c'] = [int(any(w in a for w in b)) for a, b in zip(df.a, df.b)]
df
# a b c
#0 Bob Smith is great. [Smith, foo] 1
#1 The Sun is a mass of incandescent gas. [Jones, bar] 0
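For reference, here is a self-contained sketch of the same approach, using the sample frame from the question with the stray parenthesis removed. The helper name `contains_any` is hypothetical and only there for readability. Note that this still loops in Python, once per row; because each row carries its own list, it is a row-wise approach rather than a truly vectorized one.

```python
import pandas as pd

# Sample data from the question (syntax corrected).
df = pd.DataFrame([
    {'a': 'Bob Smith is great.', 'b': ['Smith', 'foo']},
    {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']},
])

def contains_any(text, substrings):
    """Hypothetical helper: 1 if any substring occurs in text, else 0."""
    return int(any(s in text for s in substrings))

# Pair each string in 'a' with its own list from 'b', row by row.
df['c'] = [contains_any(a, b) for a, b in zip(df['a'], df['b'])]
print(df['c'].tolist())  # [1, 0]
```

The original attempt fails because `df.b` is a Series of lists, so `'|'.join` receives lists where it expects strings.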
If you don't care about case:
df['c'] = [any(w.lower() in a for w in b) for a, b in zip(df.a.str.lower(), df.b)]
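As written, the case-insensitive one-liner produces booleans rather than 1/0. A minimal sketch that wraps the result in `int()` to match the output requested in the question; the mixed-case sample data here is made up for illustration:

```python
import pandas as pd

# Illustrative data (not from the question): case differs between 'a' and 'b'.
df = pd.DataFrame([
    {'a': 'Bob SMITH is great.', 'b': ['smith', 'foo']},
    {'a': 'The Sun is a mass of incandescent gas.', 'b': ['Jones', 'bar']},
])

# Lowercase column 'a' once, then lowercase each candidate substring per row.
lowered = df['a'].str.lower()
df['c'] = [int(any(w.lower() in a for w in b)) for a, b in zip(lowered, df['b'])]
print(df['c'].tolist())  # [1, 0]
```

Lowercasing 'a' once up front avoids repeating that work for every substring in the row's list.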