Most efficient way of checking whether a string is present in another column's values in Pandas

Question

I have a pandas dataframe like the one below:

id	all_items	items_check1	items_check2
1239	'foobar,foo,foofoo,bar'	'foo,bar'	'foobar'
3298	'foobar,foo'	'foobar'	'bar'
9384	'foo,bar'	'bar,foo'	'bar'

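I want to check whether the items in items_check1 are present in all_items and save the result to a separate column, check1_output. I then want to repeat the same process with items_check2 and all_items and save that result to check2_output.

So the [expected output] should look like this: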

id	all_items	items_check1	items_check2	check1_output	check2_output
1239	'foobar,foo,foofoo,bar'	'foo,bar'	'foobar'	True	True
3298	'foobar,foo'	'foobar'	'bar'	True	False
9384	'foo,bar'	'bar,foo'	'bar'	True	True

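I have billions of rows, and a single all_items cell can sometimes hold up to 100 items. I am looking for the most efficient way to do this comparison.

My attempts so far
Below is my attempt. It is more efficient than iterating over the rows, but I soon found that the output is not always as expected. What could be the reason for this behavior?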

df['check1_output'] = np.where([x[0] in x[1] for x in zip(df['items_check1'], df['all_items'])], True, False)
df['check2_output'] = np.where([x[0] in x[1] for x in zip(df['items_check2'], df['all_items'])], True, False)

[Actual output]

id	all_items	items_check1	items_check2	check1_output	check2_output
1239	'foobar,foo,foofoo,bar'	'foo,bar'	'foobar'	True	True
3298	'foobar,foo'	'foobar'	'bar'	True	True
9384	'foo,bar'	'bar,foo'	'bar'	False	True
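
One likely cause of the mismatch, sketched here on plain Python strings: `x[0] in x[1]` applied to two strings is a substring test, so it neither respects the comma boundaries nor ignores the order of the items. The names `row_3298`, `row_9384` and `all_present` are illustrative only and do not come from the original post.

```python
# `in` between two strings looks for a contiguous substring,
# not for membership among comma-separated items.
row_3298 = "bar" in "foobar,foo"    # 'bar' is found inside 'foobar'
row_9384 = "bar,foo" in "foo,bar"   # same items, but in a different order

def all_present(items: str, all_items: str) -> bool:
    """Return True if every comma-separated item is also in all_items."""
    return set(items.split(",")) <= set(all_items.split(","))

print(row_3298, row_9384)
print(all_present("bar", "foobar,foo"))
print(all_present("bar,foo", "foo,bar"))
```

The two substring tests reproduce the unexpected values for rows 3298 and 9384, while the set-based comparison gives the values from the expected output.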

Here is a snippet to regenerate the dataframe above:

import pandas as pd

df = pd.DataFrame({'id': [1239,3298,9384],
                   'all_items': ['foobar,foo,foofoo,bar','foobar,foo','foo,bar'],
                   'items_check1': ['foo,bar','foobar','barfoo'],
                   'items_check2': ['foobar','bar','bar']
                  })

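Edit: added computation time

The approach above takes 610 µs on the 3-row dataframe. When I run it across the actual data with billions of records, however, it takes a very long time, so I am looking for a more efficient approach.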

Answer 1

Try using issubset() together with str.split():

df["check1_output"] = df.apply(lambda x: set(x["items_check1"].split(",")).issubset(x["all_items"].split(",")), axis=1)
df["check2_output"] = df.apply(lambda x: set(x["items_check2"].split(",")).issubset(x["all_items"].split(",")), axis=1)
>>> df
     id              all_items  ... check1_output check2_output
0  1239  foobar,foo,foofoo,bar  ...          True          True
1  3298             foobar,foo  ...          True         False
2  9384                foo,bar  ...         False          True
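
Because `df.apply(..., axis=1)` calls the lambda once per row and builds a Series for each row, a plain loop over the two columns may be cheaper at this scale. The following is only a sketch and has not been benchmarked; `subset_flags` is a hypothetical helper name, and it accepts any two equal-length iterables of comma-separated strings.

```python
from typing import Iterable, List

def subset_flags(checks: Iterable[str], pools: Iterable[str]) -> List[bool]:
    """For each pair, report whether every comma-separated item in `check`
    also appears among the comma-separated items in `pool`."""
    return [
        set(check.split(",")).issubset(pool.split(","))
        for check, pool in zip(checks, pools)
    ]

# Same values as the dataframe snippet in the question.
all_items = ["foobar,foo,foofoo,bar", "foobar,foo", "foo,bar"]
items_check1 = ["foo,bar", "foobar", "barfoo"]
items_check2 = ["foobar", "bar", "bar"]

print(subset_flags(items_check1, all_items))  # [True, True, False]
print(subset_flags(items_check2, all_items))  # [True, False, True]
```

With the dataframe from the question, the result can be assigned directly, for example `df["check1_output"] = subset_flags(df["items_check1"], df["all_items"])`.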

Answer 2

np.where will be fast:

import numpy as np 

…

df['check1_output'] = np.where(df['items_check1'].isin(df['all_items']), True, False)

# do same for the other check
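
As a reference point for reading the line above, here is a minimal sketch of what `Series.isin` compares, using the sample data from the question. It tests each whole string against the set of all values in the other column, rather than comparing comma-separated items row by row. The variable name `mask` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "all_items": ["foobar,foo,foofoo,bar", "foobar,foo", "foo,bar"],
    "items_check1": ["foo,bar", "foobar", "barfoo"],
})

# Series.isin asks: does this whole string equal ANY value in all_items?
mask = df["items_check1"].isin(df["all_items"])
print(mask.tolist())  # [True, False, False]
```

The first row is True only because the string 'foo,bar' happens to equal the all_items value of the third row. The mask is already boolean, so wrapping it in `np.where(..., True, False)` does not change the values.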

Answer 3

You can use split() to convert the strings to lists, then apply np.in1d to check whether the specific items are in all_items.

df['check1_output'] = df.apply(lambda row: np.in1d(row['items_check1'].split(','), row['all_items'].split(',')).all(), axis=1)
df['check2_output'] = df.apply(lambda row: np.in1d(row['items_check2'].split(','), row['all_items'].split(',')).all(), axis=1)
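
To show what the per-row test computes, here is a small sketch on single rows taken from the question's data. It uses `np.isin`, which newer NumPy releases recommend in place of `np.in1d` and which gives the same result for one-dimensional input. The helper name `row_check` is hypothetical.

```python
import numpy as np

def row_check(items: str, all_items: str) -> bool:
    """True if every comma-separated item in `items` occurs in `all_items`."""
    # np.isin returns one boolean per element of the first argument.
    found = np.isin(items.split(","), all_items.split(","))
    return bool(found.all())

print(np.isin(["foo", "bar"], ["foobar", "foo", "foofoo", "bar"]))  # [ True  True]
print(row_check("foo,bar", "foobar,foo,foofoo,bar"))  # True
print(row_check("bar", "foobar,foo"))                 # False
```

Since this still runs once per row inside `apply`, it addresses correctness rather than speed; whether it is fast enough for billions of rows would need measuring.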
