How to return more than one match on a list of text?

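I currently have a function that yields a term and the sentence it occurs in. At the moment, the function only retrieves the first match from the list of terms. I would like to retrieve all matches, not just the first.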

For example, given list_of_matches = ["heart attack", "cardiovascular", "hypoxia"] and a list of sentences text_list = ["A heart attack is a result of cardiovascular...", "Chronic intermittent hypoxia is the..."]

The ideal output is:

['heart attack', 'a heart attack is a result of cardiovascular...'],
['cardiovascular', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']

# this is the current function
def find_word(list_of_matches, line):
    for words in list_of_matches:
        if any([words in line]):
            return words, line

# returns list of 'term, matched string'
key_vals = [list(find_word(list_of_matches, line.lower())) for line in text_list if 
find_word(list_of_matches, line.lower()) != None]

# output is currently 
['heart attack', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']

You will want to use regular expressions here.

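import re

def find_all_matches(words_to_search, text):
    matches = []
    for word in words_to_search:
        # re.search returns None when the word is absent, so check before calling .group()
        # re.escape makes sure the term is matched literally
        match = re.search(re.escape(word), text)
        if match:
            matches.append(match.group())
    return [matches, text]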

Note that this returns a nested list: the list of all matched terms, followed by the text.
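
If you need the flat [term, sentence] pairs from the question rather than a nested list, the regex approach can be adapted. This is only a minimal sketch, assuming terms should be matched literally, case-insensitively, and on word boundaries. The name `find_term_pairs` is hypothetical and not part of the answer above.

```python
import re

def find_term_pairs(words_to_search, text):
    """Return a [term, lowercased text] pair for every term found in text."""
    pairs = []
    for word in words_to_search:
        # re.escape treats the term as literal text; \b keeps "heart" from
        # matching inside "hearts" (an assumption about what counts as a match).
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            pairs.append([word, text.lower()])
    return pairs

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

# Collect the pairs from every sentence into one flat list
pairs = [pair for line in text_list
         for pair in find_term_pairs(list_of_matches, line)]
```

Drop the `\b` anchors if partial-word matches are acceptable, which is how the plain `in` test in the question behaves.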

The solution takes 2 steps:

  1. Fix the function
  2. Process the output

Given that your desired output follows the pattern


    output = [
      [word1, sentence1],
      [word2, sentence1],
      [word3, sentence2],
    ]
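
  1. Fix the function: move the return out of the 'for' loop so that the loop iterates over every word in list_of_matches and collects all matching words, not just the first. It should end up like this: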


    def find_word(list_of_matches, line):
        answer = []  # collects every [term, line] pair instead of returning the first
        for words in list_of_matches:
            if words in line:
                answer.append([words, line])
        return answer

With the function above, the output will be:


    key_vals = [
      [
        ['heart attack', 'a heart attack is a result of cardiovascular...'],
        ['cardiovascular', 'a heart attack is a result of cardiovascular...']
      ],
      [
        ['hypoxia', 'chronic intermittent hypoxia is the...']
      ]
    ]
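
For reference, a nested `key_vals` like the one above comes from calling the fixed function once per line. A minimal sketch, reusing the sample data from the question. Filtering out empty results is an assumption made here, since the fixed function returns an empty list rather than None when nothing matches, so the original `!= None` check no longer filters anything.

```python
def find_word(list_of_matches, line):
    answer = []
    for words in list_of_matches:
        if words in line:
            answer.append([words, line])
    return answer

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

# One inner list of [term, sentence] pairs per sentence
results = (find_word(list_of_matches, line.lower()) for line in text_list)
# Keep only sentences that matched at least one term
key_vals = [pairs for pairs in results if pairs]
```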

  2. Process the output: now you need to take the variable "key_vals" and flatten the list of lists produced for each sentence, using the following code:

    output = []
    for word_sentence_list in key_vals:
        for word_sentence in word_sentence_list:
            output.append(word_sentence)

Finally, the output will be:


    output = [
     ['heart attack', 'a heart attack is a result of cardiovascular...'],
     ['cardiovascular', 'a heart attack is a result of cardiovascular...'],
     ['hypoxia', 'chronic intermittent hypoxia is the...']
    ]
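
If the intermediate nested list is not needed, the two steps can also be folded into a single pass. This is a sketch under the same assumptions as the question (plain substring matching on the lowercased line), not the only way to do it:

```python
list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]

# The outer loop walks the sentences, the inner loop walks the terms,
# so pairs come out grouped by sentence in term order.
output = [
    [term, line.lower()]
    for line in text_list
    for term in list_of_matches
    if term in line.lower()
]
```

The nested comprehension checks every term against every sentence, which is the same amount of work as the two-step version, just without building `key_vals` first.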