How to locate specific sequences of words in a sentence efficiently

Question

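The problem is to find a time-efficient function that takes a sentence of words and a list of word sequences of varying lengths (also known as n-grams) and returns, for each sequence, the indices at which it occurs in the sentence, while handling a large number of sequences as efficiently as possible.

What I ultimately want is to replace each n-gram that occurs in the sentence with its words joined by "_".

For example, if my sequences are ["hello", "world"] and ["my", "problem"], and the sentence is "hello world this is my problem can you solve it please?", the function should return "hello_world this is my_problem can you solve it please?"

What I did was group the sequences by the number of words in each and store them in a dictionary, where the key is the word count and the value is the list of sequences of that length.

The variable ngrams is this dictionary: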

def replaceNgrams(line, ngrams):
    words = line.split()
    # Iterate over the sequence lengths from longest to shortest
    for n in sorted(ngrams, reverse=True):  # O(L*T)
        newWords = []
        if len(words) >= n:
            terms = ngrams[n]
            i = 0
            while i < len(words) + 1 - n:  # O(L*Tn)
                # Take the next n words of the sentence as a candidate sequence
                nwords = words[i:i+n]
                # Check whether the candidate is in the list of sequences of length n
                if nwords in terms:  # O(Tn)
                    newWords.append("_".join(nwords))
                    i += n
                else:
                    newWords.append(words[i])
                    i += 1
            # Keep the trailing words that are too few to form a sequence of length n
            newWords += words[i:]
            words = newWords
    return " ".join(words)

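This works as expected, but I have too many sequences and too many lines to apply it to, and it is far too slow for me (it would take a month to finish).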
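
For reference, the grouping step described above (a dictionary keyed by sequence length) could be built along these lines. This is only a sketch: group_by_length is a hypothetical helper name that does not appear in the original code, and the third sequence in the example is made up for illustration.

```python
from collections import defaultdict

def group_by_length(sequences):
    """Group word sequences into {word_count: [sequences of that length]}."""
    grouped = defaultdict(list)
    for seq in sequences:
        grouped[len(seq)].append(list(seq))
    # Sort by length so the keys come out in ascending order
    return dict(sorted(grouped.items()))

ngrams = group_by_length([["hello", "world"], ["my", "problem"], ["can", "you", "solve"]])
print(ngrams)
# {2: [['hello', 'world'], ['my', 'problem']], 3: [['can', 'you', 'solve']]}
```

Keeping the keys in ascending order means that walking them in reverse visits the longest sequences first.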

Answer 1

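I think this can be done with basic string operations. I would first join each sequence into a single string and then look for it in full_text. If it is found, I record its start and end indices in output_dict. You can then use those indices however you need.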


full_text = "hello world this is my problem can you solve it please?"

sequences = [["hello", "world"], ["my", "problem"]]

# Join each sequence into one space-separated string
joined_sequences = [" ".join(sequence) for sequence in sequences]

def find_location(message, seq):
    # str.find returns the index of the first occurrence, or -1 if seq is absent
    return message.find(seq)

output_dict = {}

for sequence in joined_sequences:
    start_index = find_location(full_text, sequence)
    if start_index > -1:
        output_dict[sequence] = [start_index, start_index + len(sequence)]

print(output_dict)

This outputs:

{'hello world': [0, 11], 'my problem': [20, 30]}

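You can then do whatever you want with the start and end indices.

If you only need to replace the spaces between the words with underscores, you may not even need the indices.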
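
Note that str.find reports only the first occurrence and matches anywhere in the string, including inside longer words. If every whole-word occurrence is needed, one possible variation uses the re module instead. This is a sketch, not part of the approach above: find_all_locations is a hypothetical name, and it assumes that words are separated by whitespace.

```python
import re

def find_all_locations(message, seq):
    """Return [start, end] for every whitespace-delimited occurrence of seq."""
    # The lookarounds require whitespace (or the string boundary) on both sides,
    # so "my problem" does not match inside "tommy problems".
    pattern = r"(?<!\S)" + re.escape(seq) + r"(?!\S)"
    return [[m.start(), m.end()] for m in re.finditer(pattern, message)]

text = "my problem is not tommy problems but my problem"
print(find_all_locations(text, "my problem"))
# [[0, 10], [37, 47]]
```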

for sequence in joined_sequences:
    if sequence in full_text:
        full_text = full_text.replace(sequence, "_".join(sequence.split()))

print(full_text)

This should give you:

hello_world this is my_problem can you solve it please?
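
Since the question involves a very large number of sequences, a different technique may scale better than one str.replace call per sequence: store the sequences as tuples in a set and scan the sentence once, trying the longest length first at each position. The sketch below is an assumption about what would help, not a measured result. replace_ngrams is a hypothetical name, and the matching is greedy from left to right, which can differ from a length-by-length multi-pass approach when sequences overlap.

```python
def replace_ngrams(line, sequences):
    """Join every listed word sequence found in line with underscores."""
    # Tuples in a set give O(1) average membership tests; skip empty sequences
    lookup = {tuple(seq) for seq in sequences if seq}
    # Distinct sequence lengths, longest first, so longer matches take priority
    lengths = sorted({len(seq) for seq in lookup}, reverse=True)
    words = line.split()
    out = []
    i = 0
    while i < len(words):
        for n in lengths:
            candidate = tuple(words[i:i + n])
            if len(candidate) == n and candidate in lookup:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No sequence starts at this position: keep the word as it is
            out.append(words[i])
            i += 1
    return " ".join(out)

sentence = "hello world this is my problem can you solve it please?"
print(replace_ngrams(sentence, [["hello", "world"], ["my", "problem"]]))
# hello_world this is my_problem can you solve it please?
```

Because it compares whole words rather than substrings, this version leaves text such as "tommy problems" untouched. The work per sentence is roughly the number of words times the number of distinct sequence lengths, independent of how many sequences there are.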

python

nlp

text-mining

python-3.x