How to find a multi-word string within a string and label it in Python?
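For example, the sentence is "The corporate balance sheets data are available on an annual basis", and I need to label "corporate balance sheets", which is a substring found in the given sentence.
So the pattern I need to find is:
"corporate balance sheets"
Given the string:
"The corporate balance sheets data are available on an annual basis".
The output label sequence I want is:
[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
There are a lot of sentences (more than 2GB) and a lot of patterns I need to find. I have no idea how to do this efficiently in Python. Can someone suggest a good algorithm?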
Using list comprehensions and split:
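import re

search_word = 'corporate balance sheets'
# Escape the phrase so any regex metacharacters in it are matched literally.
p = re.compile(re.escape(search_word))
sentence = "The corporate balance sheets data are available on an annual basis"
# One label of 1 per word in the search phrase.
lst = [1 for i in range(len(search_word.split()))]
# Collapse each match into a single placeholder token, then label token by token.
vect = [lst if items == '__match_word' else 0 for items in re.sub(p, '__match_word', sentence).split()]
# Wrap the plain 0 labels in lists so everything can be flattened uniformly.
vectlstoflst = [[vec] if isinstance(vec, int) else vec for vec in vect]
flattened = [val for sublist in vectlstoflst for val in sublist]
print(flattened)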
Output:
[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
sentence = "The corporate balance sheets data are available on an annual basis sheets"
Output:
[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
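One caveat with the substitution approach is that a plain regex also matches inside longer tokens (for example "sheetsX"), which throws the labels out of step with the result of split(). Below is a minimal sketch of a reusable variant that only matches whole whitespace-delimited tokens. The function name label_with_regex is hypothetical, and the sketch assumes the words of the phrase are separated by single spaces in the sentence and that the placeholder token never occurs in the input.

```python
import re

def label_with_regex(phrase, sentence, placeholder="__match_word"):
    """Label each whitespace-separated token: 1 if part of `phrase`, else 0.

    Hypothetical helper; assumes `placeholder` does not appear in `sentence`.
    """
    # (?<!\S) and (?!\S) require whitespace (or string start/end) around the
    # match, so the phrase can only match complete tokens.
    pattern = re.compile(r"(?<!\S)" + re.escape(phrase) + r"(?!\S)")
    n_words = len(phrase.split())
    labels = []
    for token in pattern.sub(placeholder, sentence).split():
        if token == placeholder:
            labels.extend([1] * n_words)  # one label per word of the phrase
        else:
            labels.append(0)
    return labels

sentence = "The corporate balance sheets data are available on an annual basis"
print(label_with_regex("corporate balance sheets", sentence))
```

Because the lookarounds agree with how split() tokenizes, the label list always has the same length as sentence.split().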
Since all the words in the substring must match, you can use all to check, and update the appropriate indices as you iterate over the sentence:
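def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    res = [0 for _ in sentwords]
    # Try every start index where the whole pattern still fits.
    # range() is used instead of sentwords[:-len(subwords) + 1], which is
    # empty for a one-word pattern (the slice becomes sentwords[:0]).
    for i in range(len(sentwords) - len(subwords) + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + len(subwords)])):
            for j in range(len(subwords)):
                res[i + j] = 1
    return res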
sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
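The question also mentions many patterns and more than 2GB of sentences. One possible way to extend the same word-by-word idea is to index the patterns by their first word, so each sentence position is only compared against patterns that could start there, and to read the input line by line rather than loading it whole. This is only a sketch under assumptions: the names build_index, encode_many and label_file are hypothetical, the input is assumed to hold one sentence per line in UTF-8, and tokens are compared exactly (case and punctuation included).

```python
from collections import defaultdict

def build_index(patterns):
    """Group tokenized patterns by their first word (hypothetical helper)."""
    index = defaultdict(list)
    for pattern in patterns:
        words = pattern.split()
        if words:
            index[words[0]].append(words)
    return index

def encode_many(index, sent):
    """Return a 0/1 label per token; 1 where any indexed pattern matches."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns starting with this word can match at position i.
        for words in index.get(word, ()):
            if sentwords[i:i + len(words)] == words:
                for j in range(len(words)):
                    res[i + j] = 1
    return res

def label_file(path, patterns):
    """Yield labels one line at a time so the file is never held in memory."""
    index = build_index(patterns)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield encode_many(index, line)

index = build_index(["corporate balance sheets", "annual basis"])
sent = "The corporate balance sheets data are available on an annual basis"
print(encode_many(index, sent))
```

For a very large pattern set, a dedicated multi-pattern matcher (a trie or an Aho-Corasick automaton over tokens) may scale better than this dictionary lookup, but that is beyond what the answers here show.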