将字符串列表与 numpy 匹配数组相交的最有效方法是什么？

Question

我正在使用 aho corasick 对文档执行一些字符串搜索。使用 numpy 数组以有效的方式存储字符串列表中每个字符串的匹配项：

import ahocorasick
import numpy as np

def ahocorasickFind(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    index_list = []
    for item in A.iter(input):
        print(item)
        index_list.append(item[1][0])

    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_list = ['joão','maria','thiago'] # thousands of words in the real code
result = ahocorasickFind(search_list,'asdasdasd joão 1231231 thiago') # huge text in the real code
for index, element in enumerate(result):
    if(element == 1):
        print(search_list[index])

使用上述方法需要大量时间和内存来迭代和测试（if == 1）。那么，如何以一种完美的方式在输入文本中找到“原始”字符串呢？

Answer 1

如果您只对匹配单词感兴趣（即用白色分隔 space），而不是使用完整的搜索文本，使用一组单词可能会更快。但是请注意，这会使用一些额外的内存。复制您的行为的一种直接解决方案是：

words = set(text.split())
for w in search_list:
    if w in words:
        print(w)

或更短（但更改结果的顺序，并从搜索列表中删除重复项）：

for w in set(search_list).intersection(text.split()):
    print(w)

我在相对较大的 text 对象（1.43 亿个字符，2300 万个单词）和一个相当短的 search_list 对象（606 个单词，其中 295 个唯一单词）上快速测试了它，并且我得到的次数是：

corasick: 14.5s
以上第一个版本：4.6s
上面的第二个版本：2.6s（这个加速只是因为只通过跳过重复完成了一半的工作）

然而，第一个版本使用了（相对）可忽略的额外内存，而其他版本使用了相当多的额外内存（对于我使用的数据，可能是近 2GB 的额外内存）

将字符串列表与 numpy 匹配数组相交的最有效方法是什么？

How is the most efficient way to intersect a list of strings with a numpy array of matches?

python

numpy

string-search