How to efficiently locate specific sequences of words in a sentence
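The problem is to find a time-efficient function that takes a sentence of words and a list of word sequences of varying lengths (also known as n-grams) and returns, for each sequence, the indices at which it occurs in the sentence, while handling a large number of sequences as efficiently as possible.
What I ultimately want is to join the words of every n-gram that occurs in the sentence with "_".
For example, if my sequences are ["hello", "world"] and ["my", "problem"], and the sentence is "hello world this is my problem can you solve it please?", the function should return "hello_world this is my_problem can you solve it please?"
What I did was group the sequences by the number of words in each and store them in a dictionary, where the key is the word count and the value is a list of the sequences of that length.
The variable ngrams is this dictionary: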
def replaceNgrams(line, ngrams):
    words = line.split()
    #Iterates backwards in the length of the sequences
    for n in list(ngrams.keys())[::-1]: #O(L*T)
        newWords = []
        if len(words) >= n:
            terms = ngrams[n]
            i = 0
            while i < len(words)+1-n: #O(L*Tn)
                #Gets a sequence of words from the sentence of the same length as the ngrams currently being checked
                nwords = words[i:i+n].copy()
                #Checks if that sequence is in my list of sequences
                if nwords in terms: #O(Tn)
                    newWords.append("_".join(nwords))
                    i+=n
                else:
                    newWords.append(words[i])
                    i+=1
            newWords += words[i:].copy()
            words = newWords.copy()
    return " ".join(words)
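This works as expected, but I have too many sequences and too many lines to apply it to, and it is far too slow for me (it would take a month to finish).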
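One likely contributor to the slowness is the `nwords in terms` check, which scans a list of lists on every step. A minimal sketch of an alternative, assuming the n-grams can be stored as tuples in a set per length, is shown below. The names `build_ngram_index` and `replace_ngrams_fast` are hypothetical and not part of the original code. Note that this version makes a single left-to-right pass and prefers the longest n-gram at each position, so it can differ from `replaceNgrams` when n-grams overlap.

```python
def build_ngram_index(sequences):
    """Group n-grams by length into sets of tuples (average O(1) membership tests)."""
    index = {}
    for seq in sequences:
        index.setdefault(len(seq), set()).add(tuple(seq))
    return index


def replace_ngrams_fast(line, index):
    """Join every known n-gram in `line` with "_", trying longer n-grams first."""
    words = line.split()
    lengths = sorted(index, reverse=True)  # longest n-gram lengths first
    out = []
    i = 0
    while i < len(words):
        for n in lengths:
            candidate = tuple(words[i:i + n])
            # The slice is shorter than n near the end of the sentence
            if len(candidate) == n and candidate in index[n]:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No n-gram starts at position i, so keep the word unchanged
            out.append(words[i])
            i += 1
    return " ".join(out)


index = build_ngram_index([["hello", "world"], ["my", "problem"]])
print(replace_ngrams_fast("hello world this is my problem can you solve it please?", index))
```

With this layout, the work per word depends on the number of distinct n-gram lengths rather than on the number of n-grams, which is the part that tends to grow large.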
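I think this can be done with basic string operations. I will first join each of the sequences into a single string and then look for it in full_text.
If it is found, I keep track of it in output_dict together with its start and end indices. You can use these indices however you need.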
full_text = "hello world this is my problem can you solve it please?"
sequences = [["hello", "world"], ["my", "problem"]]
joined_sequences = [" ".join(sequence) for sequence in sequences]

def find_location(message, seq):
    # str.find returns the index of the first occurrence, or -1 if seq is absent
    # (returning None here would make the "> -1" comparison below raise a TypeError)
    return message.find(seq)

output_dict = {}
for sequence in joined_sequences:
    start_index = find_location(full_text, sequence)
    if start_index > -1:
        output_dict[sequence] = [start_index, start_index + len(sequence)]
print(output_dict)
This will output:
{'hello world': [0, 11], 'my problem': [20, 30]}
Then you can do whatever you want with the start and end indices.
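Keep in mind that `str.find` reports only the first occurrence and matches raw substrings, so "my problem" would also be found inside "tommy problems". If every whole-word occurrence is needed, one possible sketch uses `re.finditer` with whitespace lookarounds. The helper name `find_all_locations` is hypothetical, and it assumes that words are delimited by whitespace only.

```python
import re


def find_all_locations(message, seq):
    """Return [start, end] pairs for every whitespace-delimited occurrence of seq."""
    # (?<!\S) and (?!\S) require whitespace or a string edge on both sides of the match
    pattern = re.compile(r"(?<!\S)" + re.escape(seq) + r"(?!\S)")
    return [[m.start(), m.end()] for m in pattern.finditer(message)]


text = "my problem is not tommy problems, it is my problem"
print(find_all_locations(text, "my problem"))
```

Here the occurrence inside "tommy problems," is skipped because it is not surrounded by whitespace.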
If you only need to replace the spaces inside each sequence with underscores, you may not even need the indices.
for sequence in joined_sequences:
    if sequence in full_text:
        full_text = full_text.replace(sequence, "_".join(sequence.split()))
print(full_text)
This should give you:
hello_world this is my_problem can you solve it please?
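The loop above scans the whole text once per sequence, which may become expensive when there are many sequences. A hedged alternative is to compile all sequences into one regular expression and substitute in a single call. The sketch below assumes a non-empty list of sequences and whitespace-delimited words; `compile_ngram_pattern` and `replace_ngrams_regex` are hypothetical names. Python's `re` module tries the alternatives in order, so with a very large number of n-grams this may still be slow and a set-based lookup could be a better fit.

```python
import re


def compile_ngram_pattern(sequences):
    """Build one regex matching any of the n-grams as whitespace-delimited words."""
    # Longer strings first, so a longer n-gram wins over one that is its prefix
    joined = sorted((" ".join(seq) for seq in sequences), key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in joined)
    return re.compile(r"(?<!\S)(?:" + alternation + r")(?!\S)")


def replace_ngrams_regex(text, pattern):
    """Replace the spaces inside every matched n-gram with underscores."""
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)


pattern = compile_ngram_pattern([["hello", "world"], ["my", "problem"]])
print(replace_ngrams_regex("hello world this is my problem can you solve it please?", pattern))
```

The pattern can be compiled once and reused for every line, so the per-line cost is a single `sub` call.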