
How to find a multi-word string within a string and label it in Python?

For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label "corporate balance sheets", which is a substring found in the given sentence.


So the pattern I need to find is:

"corporate balance sheets"

The sentence:

"The corporate balance sheets data are available on an annual basis".

The label list I need, with one entry per word:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

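I have a large set of sentences (more than 2GB) and a large set of patterns to find in them. I don't know how to do this efficiently in Python. Can anyone suggest a good algorithm?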


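import re

search_word = 'corporate balance sheets'
# Escape the pattern and anchor it on word boundaries so only whole words match.
p = re.compile(r'\b' + re.escape(search_word) + r'\b')
sentence = "The corporate balance sheets data are available on an annual basis"

# One label per word of the pattern.
lst = [1] * len(search_word.split())
# Replace each match with a placeholder token, then label every token:
# the placeholder expands to the pattern's 1s, any other word becomes [0].
vect = [lst if item == '__match_word' else [0] for item in p.sub('__match_word', sentence).split()]
# Flatten into one label per word of the original sentence.
flattened = [val for sublist in vect for val in sublist]
print(flattened)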


[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sentence="The corporate balance sheets data are available on an annual basis sheets"


[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
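
The same regex idea can also be written without rewriting the sentence, by mapping match positions back to word positions. This is only a sketch under a few assumptions: `label_with_regex` is a hypothetical helper name, words are whatever is separated by whitespace, and matches are non-overlapping (as `re.finditer` returns them). It accepts several patterns at once by joining them into one alternation.

```python
import re


def label_with_regex(patterns, sentence):
    """Return one 0/1 label per whitespace-separated word of `sentence`."""
    # Build one alternative per pattern; words may be separated by any whitespace.
    # Longer patterns come first so the alternation prefers the longest match.
    alternatives = [
        r"\s+".join(re.escape(word) for word in pattern.split())
        for pattern in sorted(patterns, key=len, reverse=True)
    ]
    # (?<!\S) and (?!\S) force a match to start and end on word edges.
    regex = re.compile(r"(?<!\S)(?:" + "|".join(alternatives) + r")(?!\S)")
    spans = [m.span() for m in regex.finditer(sentence)]

    labels = []
    for word in re.finditer(r"\S+", sentence):
        # A word is labeled 1 if it lies entirely inside some match span.
        covered = any(start <= word.start() and word.end() <= end
                      for start, end in spans)
        labels.append(1 if covered else 0)
    return labels


sentence = "The corporate balance sheets data are available on an annual basis"
print(label_with_regex(["corporate balance sheets"], sentence))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

Because the labels are derived from character spans, the output always has exactly one entry per word, even when a pattern appears more than once.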

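Since all the words in the substring must match, you can use all to check this and update the appropriate indices as you iterate over the sentence: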

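def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    res = [0] * len(sentwords)
    # Slide a window of len(subwords) words over the sentence.
    # (Slicing with [:-len(subwords) + 1] would be empty for a one-word pattern.)
    for i in range(len(sentwords) - len(subwords) + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + len(subwords)])):
            # Mark every word of the matched window.
            for j in range(len(subwords)):
                res[i + j] = 1
    return res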

sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
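
For the scale mentioned in the question (many patterns, more than 2GB of sentences), one possible extension of the function above is to index the patterns by their first word, so each sentence word is only compared against patterns that could start there, and to read the sentences lazily instead of loading them all. This is a rough sketch, not a benchmarked solution: `build_index`, `encode_many`, and `encode_lines` are hypothetical names, and it assumes one sentence per line with whitespace-separated, exact-match words.

```python
from collections import defaultdict


def build_index(patterns):
    """Group tokenized patterns by their first word for quick candidate lookup."""
    index = defaultdict(list)
    for pattern in patterns:
        words = tuple(pattern.split())
        if words:  # skip empty patterns
            index[words[0]].append(words)
    return index


def encode_many(index, sent):
    """Return one 0/1 label per word, with 1 for words covered by any pattern."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns whose first word equals this word can match at position i.
        for words in index.get(word, ()):
            n = len(words)
            if tuple(sentwords[i:i + n]) == words:
                res[i:i + n] = [1] * n
    return res


def encode_lines(index, lines):
    """Lazily label an iterable of sentences, e.g. an open file object."""
    for line in lines:
        yield encode_many(index, line)


index = build_index(["corporate balance sheets", "annual basis"])
sent = "The corporate balance sheets data are available on an annual basis"
print(encode_many(index, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

Passing an open file to `encode_lines` processes one line at a time, so memory use stays independent of the file size. If many patterns share long prefixes, a word-level trie or an Aho-Corasick automaton would likely scale better than this first-word index.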