How to find a multi-word string in a string and label it in Python?

Question

For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label "corporate balance sheets", which is a substring found in that sentence.

So, the pattern I need to find is:

"corporate balance sheets"

The given string is:

"The corporate balance sheets data are available on an annual basis".

The label sequence I want as output is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

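I have a large set of sentences (more than 2 GB) and a set of patterns I need to find. I don't know how to do this efficiently in Python. Can anyone suggest a good algorithm?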

Answer 1

Using list comprehensions and split:

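import re

search_word = 'corporate balance sheets'
# Escape the phrase and require whitespace (or the string edge) on both sides,
# so that a match always lines up with the words produced by split()
p = re.compile(r'(?<!\S)' + re.escape(search_word) + r'(?!\S)')
sentence = "The corporate balance sheets data are available on an annual basis"

# One label per word of the search phrase
lst = [1] * len(search_word.split())
# Replace each match with a placeholder token, then label token by token
vect = [lst if item == '__match_word' else [0] for item in p.sub('__match_word', sentence).split()]
# Flatten the nested lists into a single label sequence
flattened = [val for sublist in vect for val in sublist]
print(flattened)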

Output:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sentence = "The corporate balance sheets data are available on an annual basis sheets"

Output:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
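
A possible variation on the same regex idea, offered only as a sketch: instead of substituting a placeholder token, the character offset of each match can be converted into a word index by counting the words that precede it. The function name label_phrase is hypothetical, and the sketch assumes words are separated by whitespace exactly as split() sees them.

```python
import re


def label_phrase(phrase, sentence):
    """Return a 0/1 label per word of sentence, 1 where phrase occurs."""
    labels = [0] * len(sentence.split())
    n = len(phrase.split())
    # Require whitespace or the string edge around the phrase so that
    # every match is aligned with whole words
    pattern = re.compile(r"(?<!\S)" + re.escape(phrase) + r"(?!\S)")
    for m in pattern.finditer(sentence):
        # The number of whole words before the match is the index of its first word
        start = len(sentence[:m.start()].split())
        labels[start:start + n] = [1] * n
    return labels


sentence = "The corporate balance sheets data are available on an annual basis"
print(label_phrase("corporate balance sheets", sentence))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

This avoids any clash between the placeholder and real text in the sentence. Note that finditer only reports non-overlapping matches.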

Answer 2

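Since every word of the substring has to match, you can use all to check this and update the corresponding indices while iterating over the sentence: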

def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    n = len(subwords)
    res = [0 for _ in sentwords]
    # Iterate over every start position where the whole pattern still fits.
    # (Slicing with sentwords[:-n + 1] would be empty for a one-word pattern.)
    for i in range(len(sentwords) - n + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + n])):
            for j in range(n):
                res[i + j] = 1
    return res


sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
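
The question mentions many patterns and a large corpus, which the function above handles only one pattern at a time. One way this might be extended, as a rough sketch rather than a tuned solution: index the patterns by their first word so that each sentence position is compared only against patterns that could start there. The names build_index and encode_many, and the second pattern "annual basis", are made up for illustration.

```python
from collections import defaultdict


def build_index(patterns):
    """Group the tokenized patterns by their first word."""
    index = defaultdict(list)
    for pattern in patterns:
        words = tuple(pattern.split())
        if words:
            index[words[0]].append(words)
    return index


def encode_many(index, sent):
    """Label every word of sent that is covered by any indexed pattern."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns starting with this word can match at position i
        for words in index.get(word, ()):
            n = len(words)
            if tuple(sentwords[i:i + n]) == words:
                res[i:i + n] = [1] * n
    return res


patterns = ["corporate balance sheets", "annual basis"]
index = build_index(patterns)
sent = "The corporate balance sheets data are available on an annual basis"
print(encode_many(index, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

The index is built once and reused for every sentence, so a large file could be read line by line and passed through encode_many without loading it all into memory. For very large pattern sets, a multi-pattern automaton such as Aho-Corasick may scale better; this sketch does not implement that.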

Tags: python, preprocessor, nlp, string-matching, labeling